Achieving Exact Cluster Recovery Threshold via Semidefinite Programming

11/24/2014 ∙ by Bruce Hajek, et al. ∙ University of Illinois at Urbana-Champaign, University of Pennsylvania

The binary symmetric stochastic block model deals with a random graph of $n$ vertices partitioned into two equal-sized clusters, such that each pair of vertices is connected independently with probability $p$ within clusters and $q$ across clusters. In the asymptotic regime of $p = a \log n / n$ and $q = b \log n / n$ for fixed $a, b$ and $n \to \infty$, we show that the semidefinite programming relaxation of the maximum likelihood estimator achieves the optimal threshold for exactly recovering the partition from the graph with probability tending to one, resolving a conjecture of Abbe et al. [1]. Furthermore, we show that the semidefinite programming relaxation also achieves the optimal recovery threshold in the planted dense subgraph model containing a single cluster of size proportional to $n$.


1 Introduction

The community detection problem refers to finding the underlying communities within a network using only the knowledge of the network topology [16, 31]. This paper considers the following probabilistic model for generating a network with underlying community structures: Suppose that out of a total of $n$ vertices, $rK$ of them are partitioned into $r$ clusters of size $K$, and the remaining $n - rK$ vertices do not belong to any clusters (called outlier vertices); a random graph $G$ is generated based on the cluster structure, where each pair of vertices is connected independently with probability $p$ if they are in the same cluster or $q$ otherwise. In particular, an outlier vertex is connected to any other vertex with probability $q$. This random graph ensemble is known as the planted cluster model [10] with parameters $n, r, K \in \mathbb{N}$ and $p, q \in [0,1]$ such that $n \ge rK$. In particular, we call $p$ and $q$ the in-cluster and cross-cluster edge density, respectively. The planted cluster model encompasses several classical planted random graph models including planted clique [5], planted coloring [4], planted dense subgraph [6], planted partition [11], and the stochastic block model [23], which have been widely used for studying the community detection and graph partitioning problem (see, e.g., [26, 12, 30, 9] and the references therein).

In this paper, we focus on the following particular cases:

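  • Binary symmetric stochastic block model (assuming $n$ is even):

    $r = 2, \quad K = \frac{n}{2}, \quad p = \frac{a \log n}{n}, \quad q = \frac{b \log n}{n}, \quad n \to \infty$    (1)
  • Planted dense subgraph model:

    $r = 1, \quad K = \lfloor \rho n \rfloor, \quad p = \frac{a \log n}{n}, \quad q = \frac{b \log n}{n}, \quad n \to \infty$    (2)

where $a, b$ and $0 < \rho < 1$ are fixed constants, and study the problem of exactly recovering the clusters (up to a permutation of cluster indices) from the observation of the graph $G$.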
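The generative model above can be simulated directly from its definition. The following is a minimal pure-Python sketch; the function names `sample_planted_cluster` and `log_scaling`, the convention that the first $rK$ vertices form consecutive clusters, and the label `-1` for outliers are assumptions made here for illustration and do not come from the paper.

```python
import math
import random


def sample_planted_cluster(n, r, K, p, q, seed=None):
    """Sample an adjacency matrix from the planted cluster model.

    Assumed layout: vertices 0..r*K-1 are split into r consecutive
    clusters of size K; the remaining n - r*K vertices are outliers
    and receive the label -1.
    """
    if r * K > n:
        raise ValueError("need n >= r * K")
    rng = random.Random(seed)
    labels = [i // K if i < r * K else -1 for i in range(n)]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_cluster = labels[i] >= 0 and labels[i] == labels[j]
            # In-cluster pairs connect w.p. p, all other pairs w.p. q.
            if rng.random() < (p if same_cluster else q):
                A[i][j] = A[j][i] = 1
    return A, labels


def log_scaling(n, a, b):
    """Edge densities p = a log(n) / n and q = b log(n) / n."""
    return a * math.log(n) / n, b * math.log(n) / n


# Binary symmetric SBM: r = 2 clusters of size K = n / 2.
n = 200
p, q = log_scaling(n, a=9.0, b=1.0)
A, labels = sample_planted_cluster(n, r=2, K=n // 2, p=p, q=q, seed=0)
```

Setting `r=1` and `K = int(rho * n)` gives the planted dense subgraph case with the same edge densities.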

Exact cluster recovery under the binary symmetric stochastic block model is studied in [1, 29] and a sharp recovery threshold is found.

Theorem 1 ([1, 29]).

Under the binary symmetric stochastic block model (1), if $(\sqrt{a} - \sqrt{b})^2 > 2$, the clusters can be exactly recovered up to a permutation of cluster indices with probability converging to one; if $(\sqrt{a} - \sqrt{b})^2 < 2$, no algorithm can exactly recover the clusters with probability converging to one. [Footnote 1: If $a, b > 0$, the boundary case $(\sqrt{a} - \sqrt{b})^2 = 2$ is also sufficient for exact recovery, as shown by [29]. But for $a = 2$ and $b = 0$, the ER random graph on each cluster contains isolated vertices with probability bounded away from zero and exact recovery is impossible.]
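For a quick numerical reading of this threshold, the sketch below classifies a pair $(a, b)$ by the sign of $(\sqrt{a} - \sqrt{b})^2 - 2$, which is the well-known form of the exact recovery threshold of [1, 29]. The function name `sbm_exact_recovery_regime` is hypothetical, and the boundary case is only flagged, not resolved.

```python
import math


def sbm_exact_recovery_regime(a, b):
    """Classify (a, b) against the threshold (sqrt(a) - sqrt(b))^2 = 2."""
    gap = (math.sqrt(a) - math.sqrt(b)) ** 2
    if gap > 2:
        return "recoverable"
    if gap < 2:
        return "impossible"
    # Exact equality is rarely hit in floating point; treat it separately.
    return "boundary"


regime_strong = sbm_exact_recovery_regime(9.0, 1.0)  # (3 - 1)^2 = 4
regime_weak = sbm_exact_recovery_regime(4.0, 1.0)    # (2 - 1)^2 = 1
```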

The optimal reconstruction threshold in Theorem 1 is achieved by the maximum likelihood (ML) estimator, which entails finding the minimum bisection of the graph, a problem known to be NP-hard in the worst case [18, Theorem 1.3]. Nevertheless, it has been shown that the optimal recovery threshold can be attained in polynomial time using a two-step procedure [1, 29]: first, apply the partial recovery algorithms in [28, 24] to correctly cluster all but $o(n)$ vertices; second, flip the cluster memberships of those vertices that do not agree with the majority of their neighbors. It remains open to find a simple direct approach to achieve the exact recovery threshold in polynomial time. It was proved in [1] that a semidefinite programming (SDP) relaxation of the ML estimator succeeds if $(a - b)^2 > 8(a + b) + \frac{8}{3}(a - b)$. Backed by compelling simulation results, it was further conjectured in [1] that the SDP relaxation can achieve the optimal recovery threshold. In this paper, we resolve this conjecture in the positive.
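The second step of the two-step procedure is easy to illustrate. The sketch below performs one synchronous round of majority voting for the case $a > b$; it is a simplification under stated assumptions and does not implement the partial recovery algorithms of [28, 24], whose output is stood in for by a hand-made initial labeling. The name `majority_refine` is hypothetical.

```python
def majority_refine(A, sigma):
    """One synchronous round of local majority voting (case a > b).

    Each vertex adopts the sign of the net vote of its neighbors, computed
    from the initial labels sigma; ties keep the current label.
    """
    n = len(A)
    refined = []
    for i in range(n):
        vote = sum(A[i][j] * sigma[j] for j in range(n))
        if vote == 0:
            refined.append(sigma[i])
        else:
            refined.append(1 if vote > 0 else -1)
    return refined


# Toy graph: two disjoint 4-cliques, with vertex 0 initially mislabeled.
n = 8
A = [[1 if i != j and (i < 4) == (j < 4) else 0 for j in range(n)]
     for i in range(n)]
truth = [1, 1, 1, 1, -1, -1, -1, -1]
initial = [-1, 1, 1, 1, -1, -1, -1, -1]
refined = majority_refine(A, initial)
```

On this toy input a single round corrects the mislabeled vertex; on a random graph the guarantee depends on the quality of the initial labeling.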

In addition, we prove that the SDP relaxation achieves the optimal recovery threshold for the planted dense subgraph model (2) where the cluster size $K$ scales linearly in $n$. This conclusion is in sharp contrast to the following computational barrier established in [22]: If $K$ grows and $p, q$ decay sublinearly in $n$, attaining the statistically optimal recovery threshold is at least as hard as solving the planted clique problem (see Section 3 for detailed discussions).

Since the initial posting of this paper to arXiv, a number of interesting papers have been posted, some extending or improving our results. Another resolution of the conjecture in [1] was given in [8] independently. A sharp characterization of the threshold for exact recovery for a general class of stochastic block models is derived in [2], which includes the two cases considered in this paper as special cases. Extensions of this paper appear in [21], showing SDP provides exact recovery up to the information theoretic threshold for a fixed number of equal-sized clusters or two unequal-sized clusters. More recently, the preprint [3] shows similar optimality results of SDP for $o(\log n)$ equal-sized clusters and [32] establishes optimality results of SDP for a fixed number of clusters with unequal sizes.

Notation

Let $A$ denote the adjacency matrix of the graph $G$, $I$ denote the identity matrix, and $J$ denote the all-one matrix. We write $X \succeq 0$ if $X$ is positive semidefinite and $X \ge 0$ if all the entries of $X$ are non-negative. Let $\mathcal{S}^n$ denote the set of all $n \times n$ symmetric matrices. For $X \in \mathcal{S}^n$, let $\lambda_2(X)$ denote its second smallest eigenvalue. For any matrix $Y$, let $\|Y\|$ denote its spectral norm. For any positive integer $n$, let $[n] = \{1, \ldots, n\}$. For any set $T \subset [n]$, let $|T|$ denote its cardinality and $T^c$ denote its complement. We use standard big $O$ notations, e.g., for any sequences $\{a_n\}$ and $\{b_n\}$, $a_n = \Theta(b_n)$ or $a_n \asymp b_n$ if there is an absolute constant $c > 0$ such that $1/c \le a_n / b_n \le c$. Let $\mathrm{Bern}(p)$ denote the Bernoulli distribution with mean $p$ and $\mathrm{Binom}(n, p)$ denote the binomial distribution with $n$ trials and success probability $p$. All logarithms are natural and we use the convention $0 \log 0 = 0$.

2 Stochastic block model

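The cluster structure under the binary symmetric stochastic block model can be represented by a vector $\sigma \in \{\pm 1\}^n$ such that $\sigma_i = 1$ if vertex $i$ is in the first cluster and $\sigma_i = -1$ otherwise. Let $\sigma^*$ correspond to the true clusters. Then the ML estimator of $\sigma^*$ for the case $a > b$ can be simply stated as

$\max_{\sigma} \; \sum_{i,j} A_{ij} \sigma_i \sigma_j$
s.t. $\sigma_i \in \{\pm 1\}, \; i \in [n]; \quad \sigma^\top \mathbf{1} = 0,$    (3)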

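which maximizes the number of in-cluster edges minus the number of out-cluster edges. This is equivalent to solving the NP-hard minimum graph bisection problem. Instead, let us consider its convex relaxation similar to the SDP relaxation studied in [19, 1]. Let $Y = \sigma \sigma^\top$. Then $Y_{ii} = 1$ is equivalent to $\sigma_i = \pm 1$, and $\sigma^\top \mathbf{1} = 0$ if and only if $\langle Y, J \rangle = 0$. Therefore, (3) can be recast as

$\max_{Y, \sigma} \; \langle A, Y \rangle$
s.t. $Y = \sigma \sigma^\top; \quad Y_{ii} = 1, \; i \in [n]; \quad \langle J, Y \rangle = 0.$    (4)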
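To make the combinatorial problem concrete, the sketch below solves the ML formulation by exhaustive search over balanced labelings on a toy graph. It runs in exponential time and is only meant to show what is being maximized; the function name `ml_bisection` and the toy graph are illustrative assumptions.

```python
from itertools import combinations


def ml_bisection(A):
    """Exhaustive search for max sum_ij A_ij s_i s_j over balanced s in {+1,-1}^n.

    Vertex 0 is pinned to the +1 cluster to remove the global sign symmetry.
    The objective equals <A, s s^T>, the lifted form of the same problem.
    """
    n = len(A)
    if n % 2 != 0:
        raise ValueError("n must be even")
    best_value, best_sigma = None, None
    for rest in combinations(range(1, n), n // 2 - 1):
        plus = {0, *rest}
        sigma = [1 if i in plus else -1 for i in range(n)]
        value = sum(A[i][j] * sigma[i] * sigma[j]
                    for i in range(n) for j in range(n))
        if best_value is None or value > best_value:
            best_value, best_sigma = value, sigma
    return best_sigma, best_value


# Toy graph: triangles {0,1,2} and {3,4,5} joined by the single edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1
sigma_hat, value = ml_bisection(A)
```

The search recovers the two triangles; the objective counts each edge twice, so six in-cluster edges and one cross-cluster edge give the value 2 * (6 - 1).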

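Notice that the matrix $Y = \sigma \sigma^\top$ is a rank-one positive semidefinite matrix. If we relax this condition by dropping the rank-one restriction, we obtain the following convex relaxation of (4), which is a semidefinite program:

$\widehat{Y}_{\mathrm{SDP}} = \arg\max_{Y} \; \langle A, Y \rangle$
s.t. $Y \succeq 0; \quad Y_{ii} = 1, \; i \in [n]; \quad \langle J, Y \rangle = 0.$    (5)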

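We remark that (5) does not rely on any knowledge of the model parameters except that $a > b$; for the case $a < b$, we replace $\arg\max$ in (5) by $\arg\min$. The SDP formulation introduced in [1] is slightly different from ours: it did not impose the constraint $\langle J, Y \rangle = 0$ and the objective function is the inner product between a weighted adjacency matrix and $Y$.

Let $Y^* = \sigma^* (\sigma^*)^\top$ and $\mathcal{Y}_n \triangleq \{\sigma \sigma^\top : \sigma \in \{-1, 1\}^n, \; \sigma^\top \mathbf{1} = 0\}$. The following result establishes the optimality of the SDP procedure:

Theorem 2.

If $(\sqrt{a} - \sqrt{b})^2 > 2$, then $\min_{Y^* \in \mathcal{Y}_n} \mathbb{P}\{\widehat{Y}_{\mathrm{SDP}} = Y^*\} \ge 1 - n^{-\Omega(1)}$ as $n \to \infty$.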

Remark 1.

It is worth noting the relationship between our SDP relaxation and other related formulations that appeared previously in the literature. In fact, (5) coincides with the SDP studied in [14, p. 659] for MIN BISECTION, where a sufficient condition is obtained in [14, Lemma 19] that is not optimal. The SDP relaxation considered in [17, Equation (15)] for MAX BISECTION is equivalent to (5) with max replaced by min (used when $a < b$) and the equality constraint $\langle J, Y \rangle = 0$ replaced by a relaxed inequality constraint. The proof of Theorem 2 shows that this more relaxed version also works. Finally, we note that the SDP used in [1] can be viewed as a penalized version of (5) with a penalty on $\langle J, Y \rangle$ added to the objective function.

Remark 2.

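Theorem 2 naturally extends to the semirandom model considered in [14], where after a graph is instantiated from the SBM, a monotone adversary can add in-cluster edges and delete cross-cluster edges arbitrarily. Although this process may appear to make the cluster structure more visible, such an adversarial model is known to foil many procedures based on degrees, local search or graph spectrum [14]. By design, the SDP (5) enjoys robustness against such a monotone adversary, which has already been observed in [14] and proved in [9, Lemma 1]. More specifically, let $\widetilde{A}$ denote the adjacency matrix of the altered graph. Whenever $Y^*$ is the unique maximizer of (5), i.e., $\langle A, Y^* \rangle > \langle A, Y \rangle$ for any other feasible $Y$, then $Y^*$ is also the unique maximizer of (5) with $A$ replaced by $\widetilde{A}$. To see this, note that, by the Cauchy-Schwarz inequality, $Y \succeq 0$ and $Y_{ii} = 1$ imply that $|Y_{ij}| \le 1$ for all $i, j$. Since $\widetilde{A} - A$ equals $+1$ only on in-cluster pairs (where $Y^*_{ij} = 1$) and $-1$ only on cross-cluster pairs (where $Y^*_{ij} = -1$), it follows that $\langle \widetilde{A} - A, Y \rangle \le \langle \widetilde{A} - A, Y^* \rangle$. Then $\langle \widetilde{A}, Y \rangle = \langle A, Y \rangle + \langle \widetilde{A} - A, Y \rangle < \langle A, Y^* \rangle + \langle \widetilde{A} - A, Y^* \rangle = \langle \widetilde{A}, Y^* \rangle$, establishing the unique optimality of $Y^*$.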
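The key inequality in this robustness argument can be checked numerically. The sketch below draws a random monotone alteration and a random positive semidefinite matrix with unit diagonal, then compares the two inner products. All names (`inner`, `random_unit_diagonal_psd`, `monotone_change`) are hypothetical helpers; the test matrix is a Gram matrix of random unit vectors and is not required to satisfy $\langle J, Y \rangle = 0$, since the inequality only uses $|Y_{ij}| \le 1$.

```python
import math
import random


def inner(M, Y):
    """Matrix inner product <M, Y> = sum_ij M_ij Y_ij."""
    n = len(M)
    return sum(M[i][j] * Y[i][j] for i in range(n) for j in range(n))


def random_unit_diagonal_psd(n, rank, rng):
    """Gram matrix of n random unit vectors: PSD with Y_ii = 1."""
    vecs = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(rank)]
        norm = math.sqrt(sum(x * x for x in v))
        vecs.append([x / norm for x in v])
    return [[sum(vecs[i][k] * vecs[j][k] for k in range(rank))
             for j in range(n)] for i in range(n)]


def monotone_change(A, sigma, rng, prob=0.5):
    """Return A_tilde - A for a random monotone alteration of A."""
    n = len(A)
    delta = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() >= prob:
                continue
            if sigma[i] == sigma[j] and A[i][j] == 0:
                delta[i][j] = delta[j][i] = 1    # add an in-cluster edge
            elif sigma[i] != sigma[j] and A[i][j] == 1:
                delta[i][j] = delta[j][i] = -1   # delete a cross-cluster edge
    return delta


rng = random.Random(0)
n = 8
sigma = [1] * 4 + [-1] * 4
A = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        if rng.random() < 0.5:
            A[i][j] = A[j][i] = 1
delta = monotone_change(A, sigma, rng)
Y_star = [[sigma[i] * sigma[j] for j in range(n)] for i in range(n)]
Y = random_unit_diagonal_psd(n, rank=3, rng=rng)
gap = inner(delta, Y_star) - inner(delta, Y)  # non-negative by the argument above
```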

3 Planted dense subgraph model

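In this section we turn to the planted dense subgraph model in the asymptotic regime (2), where there exists a single cluster of size $\lfloor \rho n \rfloor$. To specify the optimal reconstruction threshold, define the following function: For $a, b \ge 0$, let

$f(a, b) = \begin{cases} a - \tau^* \log \frac{e a}{\tau^*} & \text{if } a, b > 0 \text{ and } a \ne b, \\ a & \text{if } b = 0, \\ b & \text{if } a = 0, \\ 0 & \text{if } a = b, \end{cases}$    (6)

where $\tau^* \triangleq \frac{a - b}{\log a - \log b}$ if $a, b > 0$ and $a \ne b$. We show that if $\rho f(a, b) > 1$, exact recovery is achievable in polynomial time via SDP with probability tending to one; if $\rho f(a, b) < 1$, any estimator fails to recover the cluster with probability tending to one regardless of the computational costs. The sharp threshold $\rho f(a, b) = 1$ is plotted in Fig. 1 for various values of $\rho$.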
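The threshold function can be evaluated numerically. The sketch below assumes the form $f(a, b) = a - \tau^* \log(e a / \tau^*)$ with $\tau^* = (a - b) / (\log a - \log b)$ for distinct positive $a, b$, together with the limiting values $f(a, 0) = a$, $f(0, b) = b$ and $f(a, a) = 0$; this form is an assumption to be checked against the full text of the paper, and the function names are hypothetical.

```python
import math


def f(a, b):
    """Threshold function, assuming f = a - tau* log(e a / tau*)."""
    if a == b:
        return 0.0
    if a == 0:
        return float(b)
    if b == 0:
        return float(a)
    tau = (a - b) / (math.log(a) - math.log(b))
    return a - tau * math.log(math.e * a / tau)


def pds_exact_recovery_regime(a, b, rho):
    """Compare rho * f(a, b) with the threshold value 1."""
    value = rho * f(a, b)
    if value > 1:
        return "recoverable"
    if value < 1:
        return "impossible"
    return "boundary"


regime_dense = pds_exact_recovery_regime(20.0, 1.0, rho=0.5)
regime_sparse = pds_exact_recovery_regime(4.0, 1.0, rho=0.5)
```

Under this form the function is symmetric in its arguments, because $a - \tau^* \log(e a / \tau^*)$ and $b - \tau^* \log(e b / \tau^*)$ differ by $(a - b) - \tau^* (\log a - \log b) = 0$.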

Figure 1: The solid curves show the recovery threshold for the planted dense subgraph model (2) for three values of $\rho$. For each $\rho$ there are two curves; recovery is not possible for $(a, b)$ in the open region between the curves and recovery is possible by the SDP for $(a, b)$ in the open region outside the two curves. Similarly, the two dashed curves correspond to the recovery threshold for the stochastic block model (1).

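We first introduce the maximum likelihood estimator and its convex relaxation. For ease of notation, in this section we use a vector $\xi \in \{0, 1\}^n$, as opposed to $\sigma \in \{\pm 1\}^n$ used in Section 2 for the SBM, as the indicator function of the cluster, such that $\xi_i = 1$ if vertex $i$ is in the cluster and $\xi_i = 0$ otherwise. Let $\xi^*$ be the indicator of the true cluster. Assuming $a > b$, i.e., the vertices in the cluster are more densely connected, the ML estimation of $\xi^*$ is simply

$\max_{\xi} \; \sum_{i,j} A_{ij} \xi_i \xi_j$
s.t. $\xi \in \{0, 1\}^n; \quad \xi^\top \mathbf{1} = K,$    (7)

which maximizes the number of in-cluster edges. Due to the integrality constraints, it is computationally difficult to solve (7), which prompts us to consider its convex relaxation. Note that (7) can be equivalently formulated as

$\max_{Z, \xi} \; \langle A, Z \rangle$
s.t. $Z = \xi \xi^\top; \quad Z_{ii} \le 1, \; \forall i \in [n]; \quad Z_{ij} \ge 0, \; \forall i, j \in [n]; \quad \langle I, Z \rangle = K; \quad \langle J, Z \rangle = K^2,$    (8)

where the matrix $Z = \xi \xi^\top$ is positive semidefinite and rank-one. [Footnote 2: Here (7) and (8) are equivalent in the following sense: for any feasible $\xi$ for (7), $Z = \xi \xi^\top$ is feasible for (8); for any feasible $(Z, \xi)$ for (8), either $\xi$ or $-\xi$ is feasible for (7).] Removing the rank-one restriction leads to the following convex relaxation of (8), which is a semidefinite program:

$\widehat{Z}_{\mathrm{SDP}} = \arg\max_{Z} \; \langle A, Z \rangle$
s.t. $Z \succeq 0; \quad Z_{ii} \le 1, \; \forall i \in [n]; \quad Z \ge 0; \quad \langle I, Z \rangle = K; \quad \langle J, Z \rangle = K^2.$    (9)

We note that, apart from the assumption that $a > b$, the only model parameter needed by the estimator (9) is the cluster size $K$; for the case $a < b$, we replace $\arg\max$ in (9) by $\arg\min$.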
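As a small illustration of the lifting step, the sketch below finds the densest $K$-vertex subgraph of a toy graph by exhaustive search and then checks that the lifted matrix $Z = \xi \xi^\top$ has trace $K$ and entry sum $K^2$. The search is exponential in general; the names `densest_k_subgraph` and `lift`, the toy graph, and the use of $\langle I, Z \rangle = K$ and $\langle J, Z \rangle = K^2$ as the lifted constraints are assumptions made for illustration.

```python
from itertools import combinations


def densest_k_subgraph(A, K):
    """Exhaustive search for the K-subset with the most internal edges."""
    n = len(A)
    best = max(combinations(range(n), K),
               key=lambda S: sum(A[i][j] for i in S for j in S if i < j))
    return [1 if i in best else 0 for i in range(n)]


def lift(xi):
    """Rank-one lifting Z = xi xi^T of an indicator vector."""
    return [[x * y for y in xi] for x in xi]


# Toy graph: a planted triangle on {1, 3, 4} plus one background edge (0, 2).
edges = [(1, 3), (1, 4), (3, 4), (0, 2)]
n, K = 6, 3
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

xi_hat = densest_k_subgraph(A, K)
Z = lift(xi_hat)
trace_Z = sum(Z[i][i] for i in range(n))   # <I, Z>
total_Z = sum(sum(row) for row in Z)       # <J, Z>
```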

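Let $Z^* = \xi^* (\xi^*)^\top$ correspond to the true cluster and define $\mathcal{Z}_n \triangleq \{\xi \xi^\top : \xi \in \{0, 1\}^n, \; \xi^\top \mathbf{1} = K\}$. The recovery threshold for the SDP (9) is given as follows.

Theorem 3.

Under the planted dense subgraph model (2), if

$\rho f(a, b) > 1,$    (10)

then $\min_{Z^* \in \mathcal{Z}_n} \mathbb{P}\{\widehat{Z}_{\mathrm{SDP}} = Z^*\} \ge 1 - n^{-\Omega(1)}$ as $n \to \infty$.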

Next we prove a converse for Theorem 3 which shows that the recovery threshold achieved by the SDP relaxation is in fact optimal.

Theorem 4.

If

$\rho f(a, b) < 1,$    (11)

then for any sequence of estimators $\widehat{Z}_n$, $\mathbb{P}\{\widehat{Z}_n = Z^*\} \to 0$.

Under the planted dense subgraph model, our investigation of the exact cluster recovery problem thus far in this paper has been focused on the regime where the cluster size $K$ grows linearly with $n$ and $p, q = \Theta(\log n / n)$, where the statistically optimal threshold can be attained by SDP in polynomial time. However, this need not be the case if $K$ grows sublinearly in $n$. In fact, the exact cluster recovery problem has been studied in [10, 22] in the following asymptotic regime:

$K = \Theta(n^{\beta}), \quad p = c q = \Theta(n^{-\alpha}), \quad n \to \infty,$    (12)

where $c > 1$, $\alpha$ and $\beta$ are fixed constants. The statistical and computational complexities of the cluster recovery problem depend crucially on the value of $\alpha$ and $\beta$ (see [22, Figure 2] for an illustration):

  • $\beta > \frac{1}{2} + \frac{\alpha}{2}$: the planted cluster can be perfectly recovered in polynomial time with high probability via the SDP relaxation (9). [Footnote 3: In fact, an even looser SDP relaxation than (9) has been shown to exactly recover the planted cluster with high probability for $\beta > \frac{1}{2} + \frac{\alpha}{2}$. See [10, Theorem 2.3].]

  • $\frac{1}{2} + \frac{\alpha}{4} < \beta < \frac{1}{2} + \frac{\alpha}{2}$: the planted cluster can be detected in linear time with high probability by thresholding the total number of edges, but it is conjectured to be computationally intractable to exactly recover the planted cluster.

  • $\alpha < \beta < \frac{1}{2} + \frac{\alpha}{4}$: the planted cluster can be exactly recovered with high probability via ML estimation; however, no randomized polynomial-time solver exists conditioned on the planted clique hardness hypothesis. [Footnote 4: Here the planted clique hardness hypothesis refers to the statement that, for a fixed edge probability and a clique size polynomially smaller than $\sqrt{n}$, there exist no randomized polynomial-time tests to distinguish an Erdős-Rényi random graph from a planted clique model, which is obtained by adding edges among vertices chosen uniformly at random to form a clique. For various hardness results of problems reducible from the planted clique problem, see [22] and the references within.]

  • $\beta < \alpha$: regardless of the computational costs, no algorithm can exactly recover the planted cluster with vanishing probability of error.

Consequently, assuming the planted clique hardness hypothesis, in the asymptotic regime of (12) when $\alpha < \beta < \frac{1}{2} + \frac{\alpha}{4}$ (and, quite possibly, the entire range $\alpha < \beta < \frac{1}{2} + \frac{\alpha}{2}$), there exists a significant gap between the information limit (recovery threshold of the optimal procedure) and the computational limit (recovery threshold for polynomial-time algorithms). In contrast, in the asymptotic regime of (2), the computational constraint imposes no penalty on the statistical performance, in that the optimal threshold can be attained by SDP relaxation in view of Theorem 3.
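The four regimes can be summarized as a lookup on the exponents. The sketch below assumes the boundaries $\beta = \alpha$, $\beta = \frac{1}{2} + \frac{\alpha}{4}$ and $\beta = \frac{1}{2} + \frac{\alpha}{2}$, and restricts to $0 \le \alpha < 2/3$ so that the three boundaries are strictly ordered; the function name `pds_sublinear_regime`, the returned labels and the restriction on $\alpha$ are assumptions made here for illustration.

```python
def pds_sublinear_regime(alpha, beta):
    """Classify (alpha, beta) for K = Theta(n^beta), p = c q = Theta(n^-alpha).

    Assumes 0 <= alpha < 2/3, so that alpha < 1/2 + alpha/4 < 1/2 + alpha/2.
    """
    if not 0 <= alpha < 2 / 3:
        raise ValueError("this sketch only covers 0 <= alpha < 2/3")
    easy = 0.5 + alpha / 2
    detect = 0.5 + alpha / 4
    if beta in (alpha, detect, easy):
        return "boundary"
    if beta > easy:
        return "SDP recovers in polynomial time"
    if beta > detect:
        return "detectable in linear time; recovery conjectured intractable"
    if beta > alpha:
        return "ML recovers; polynomial-time recovery hard under planted clique"
    return "recovery impossible"


regimes = [pds_sublinear_regime(0.5, beta) for beta in (0.9, 0.7, 0.55, 0.4)]
```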

4 Proofs

In this section, we give the proofs of our main theorems. Our analysis of the SDP relies on two key ingredients: the spectrum of Erdős-Rényi random graphs and the tail bounds for the binomial distributions, which we first present.

4.1 Spectrum of Erdős-Rényi random graph

Let $A$ denote the adjacency matrix of an Erdős-Rényi random graph $G$, where vertices $i$ and $j$ are connected independently with probability $p_{ij}$. Then $\mathbb{E}[A_{ij}] = p_{ij}$. Let $p = \max_{ij} p_{ij}$ and assume $p \ge c_0 \frac{\log n}{n}$ for any constant $c_0 > 0$. We aim to show that $\|A - \mathbb{E}[A]\| \le c' \sqrt{np}$ with high probability for some constant $c' > 0$. To this end, we establish the following more general result where the entries need not be binary-valued.

Theorem 5.

Let $A$ denote a symmetric and zero-diagonal random matrix, where the entries $\{A_{ij} : i < j\}$ are independent and $[0, 1]$-valued. Assume that $\mathbb{E}[A_{ij}] \le p$, where $c_0 \frac{\log n}{n} \le p \le 1 - c_1$ for arbitrary constants $c_0 > 0$ and $c_1 > 0$. Then for any $c > 0$, there exists $c' > 0$ such that for any $n \ge 1$, $\mathbb{P}\{\|A - \mathbb{E}[A]\| \le c' \sqrt{np}\} \ge 1 - n^{-c}$.

Let $G(n, p)$ denote the Erdős-Rényi random graph model with the edge probability $p_{ij} = p$ for all $i, j$. Results similar to Theorem 5 have been obtained in [15] for the special case of $G(n, p)$ with $p = c_0 \frac{\log n}{n}$ for some sufficiently large $c_0$. In fact, Theorem 5 can be proved by strengthening the combinatorial arguments in [15, Section 2.2]. Here we provide an alternative proof using results from random matrices and concentration of measures and a second-order stochastic comparison argument from [25].

Furthermore, we note that the condition $p \ge c_0 \frac{\log n}{n}$ in Theorem 5 is in fact necessary to ensure that $\|A - \mathbb{E}[A]\| = O(\sqrt{np})$ (see the appendix for a proof). The condition $p \le 1 - c_1$ can be dropped in the special case of $G(n, p)$.
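The scaling $\|A - \mathbb{E}[A]\| = O(\sqrt{np})$ can be explored empirically. The sketch below estimates the spectral norm of a centered Erdős-Rényi adjacency matrix by power iteration and reports its ratio to $\sqrt{np}$. The helper `spectral_norm`, the choice $c_0 = 3$ and the size $n = 60$ are illustrative assumptions; power iteration returns a lower estimate of the norm after finitely many steps, so the ratio is indicative only.

```python
import math
import random


def spectral_norm(M, iters=100, seed=0):
    """Estimate ||M|| for a symmetric matrix M by power iteration on M^2."""
    n = len(M)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]

    def matvec(x):
        return [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]

    for _ in range(iters):
        w = matvec(matvec(v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # v is a unit vector, so ||M v|| never exceeds the true spectral norm.
    Mv = matvec(v)
    return math.sqrt(sum(x * x for x in Mv))


n = 60
p = 3.0 * math.log(n) / n  # p = c0 log(n) / n with c0 = 3 (assumed value)
rng = random.Random(1)
centered = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        edge = 1.0 if rng.random() < p else 0.0
        centered[i][j] = centered[j][i] = edge - p  # entries of A - E[A]
norm_est = spectral_norm(centered)
ratio = norm_est / math.sqrt(n * p)
```

Repeating the experiment for growing $n$ at fixed $c_0$ is one way to see whether the ratio stays bounded.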

Proof.

We first use the second-order stochastic comparison arguments from [25, Lemma 2]. Since , we have for all and hence . Let denote the adjacency matrix of a graph generated from . Then, for any , is stochastically smaller than under the convex ordering, i.e., for any convex function on .555This follows from for any , by the convexity of .

Since the spectral norm is a convex function and the coordinate random variables are independent (up to symmetry), it follows that

and thus

(13)

We next bound . Let denote an matrix with independent entries drawn from , which is the distribution of a Rademacher random variable multiplied with an independent Bernoulli with bias . Define as and for all . Let be an independent copy of . Let be a zero-diagonal symmetric matrix whose entries are drawn from and be an independent copy of . Let denote an zero-diagonal symmetric matrix whose entries are Rademacher and independent from and . We apply the usual symmetrization arguments:

(14)

where follow from the Jensen’s inequality; follows because has the same distribution as , where denotes the element-wise product; follow from the triangle inequality; follows from the fact that has the same distribution as . Then, we apply the result of Seginer [33] which characterized the expected spectral norm of i.i.d. random matrices within universal constant factors. Let , which are independent . Since is symmetric, [33, Theorem 1.1] and Jensen’s inequality yield

(15)

for some universal constant . In view of the following Chernoff bound for the binomial distribution [27, Theorem 4.4]:

for all , setting and applying the union bound, we have

(16)

where the last inequality follows from . Assembling (13) through (16), we obtain

(17)

for some positive constant depending only on . Since the entries of are valued in , Talagrand’s concentration inequality for 1-Lipschitz convex functions (see, e.g., [34, Theorem 2.1.13]) yields

for some absolute constants , which implies that for any , there exists depending on , such that

4.2 Tail of the Binomial Distribution

Let and for and , where for some as . We need the following tail bounds.

Lemma 1 ([1]).

Assume that and such that . Then

Lemma 2.

Let be such that and for some and . Then

(18)
(19)
Proof.

We use the following non-asymptotic bound on the binomial tail probability [7, Lemma 4.7.2]: For ,

(20)

where and is the binary divergence function. Then (19) follows from (20) by noting that .

To prove (18), we use the following bound on binomial coefficients [7, Lemma 4.7.1]:

(21)

where and is the binary entropy function. Note that the mode of is at , which is at least for sufficiently large . Therefore, is non-decreasing in for and hence

(22)

where and . Applying (21) to (22) yields

which is the desired (18). ∎

4.3 Proofs for the stochastic block model

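The following lemma provides a deterministic sufficient condition for the success of SDP (5) in the case $a > b$.

Lemma 3.

Suppose there exist $D^* = \mathrm{diag}\{d_i^*\}$ and $\lambda^* \in \mathbb{R}$ such that $S^* \triangleq D^* - A + \lambda^* J$ satisfies $S^* \succeq 0$, $\lambda_2(S^*) > 0$ and

$S^* \sigma^* = 0.$    (23)

Then $\widehat{Y}_{\mathrm{SDP}} = Y^*$ is the unique solution to (5).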

Proof.

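The Lagrangian function is given by

$L(Y, S, D, \lambda) = \langle A, Y \rangle + \langle S, Y \rangle - \langle D, Y - I \rangle - \lambda \langle J, Y \rangle,$

where the Lagrangian multipliers are denoted by $S \succeq 0$, $D = \mathrm{diag}\{d_i\}$, and $\lambda \in \mathbb{R}$. Then for any $Y$ satisfying the constraints in (5),

$\langle A, Y \rangle \overset{(a)}{\le} L(Y, S^*, D^*, \lambda^*) = \langle D^*, I \rangle = \langle D^*, Y^* \rangle = \langle A + S^* - \lambda^* J, Y^* \rangle \overset{(b)}{=} \langle A, Y^* \rangle,$

where $(a)$ holds because $\langle S^*, Y \rangle \ge 0$; $(b)$ holds because $\langle Y^*, S^* \rangle = (\sigma^*)^\top S^* \sigma^* = 0$ by (23) and $\langle J, Y^* \rangle = 0$. Hence, $Y^*$ is an optimal solution. It remains to establish its uniqueness. To this end, suppose $\widetilde{Y}$ is an optimal solution. Then,

$\langle S^*, \widetilde{Y} \rangle = \langle D^* - A + \lambda^* J, \widetilde{Y} \rangle \overset{(a)}{=} \langle D^* - A, \widetilde{Y} \rangle \overset{(b)}{=} \langle D^* - A, Y^* \rangle = \langle S^*, Y^* \rangle = 0,$

where $(a)$ holds because $\langle J, \widetilde{Y} \rangle = 0$; $(b)$ holds because $\langle A, \widetilde{Y} \rangle = \langle A, Y^* \rangle$ and $\widetilde{Y}_{ii} = Y^*_{ii} = 1$ for all $i \in [n]$. In view of (23), since $\widetilde{Y} \succeq 0$, $S^* \succeq 0$ with $\lambda_2(S^*) > 0$, $\widetilde{Y}$ must be a multiple of $Y^* = \sigma^* (\sigma^*)^\top$. Because $\widetilde{Y}_{ii} = 1$ for all $i \in [n]$, $\widetilde{Y} = Y^*$. ∎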
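A dual certificate of this kind can be checked numerically on small examples. The sketch below assumes the certificate has the form $S = D - A + \lambda J$ with $d_i = \sum_j A_{ij} \sigma_i \sigma_j$, and tests the conditions $S \sigma = 0$, $S \succeq 0$, $\lambda_2(S) > 0$ through an equivalent check: when $S \sigma = 0$, they hold if and only if $S + \sigma \sigma^\top$ is positive definite, which a Cholesky factorization detects. The function names and the two toy graphs are hypothetical, and the choice $\lambda = 1$ is only for illustration.

```python
import math


def dual_certificate(A, sigma, lam):
    """Build S = D - A + lam * J with d_i = sum_j A_ij sigma_i sigma_j."""
    n = len(A)
    d = [sum(A[i][j] * sigma[i] * sigma[j] for j in range(n)) for i in range(n)]
    return [[(d[i] if i == j else 0.0) - A[i][j] + lam for j in range(n)]
            for i in range(n)]


def is_positive_definite(M, tol=1e-9):
    """Attempt a Cholesky factorization; all pivots must exceed tol."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= tol:
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


def certifies(A, sigma, lam):
    """Check S sigma = 0, S >= 0 and lambda_2(S) > 0 for the candidate S."""
    n = len(A)
    S = dual_certificate(A, sigma, lam)
    kernel_ok = all(abs(sum(S[i][j] * sigma[j] for j in range(n))) < 1e-9
                    for i in range(n))
    # With S sigma = 0, adding sigma sigma^T lifts only the zero eigenvalue.
    shifted = [[S[i][j] + sigma[i] * sigma[j] for j in range(n)]
               for i in range(n)]
    return kernel_ok and is_positive_definite(shifted)


n = 8
sigma = [1] * 4 + [-1] * 4
two_cliques = [[1 if i != j and (i < 4) == (j < 4) else 0 for j in range(n)]
               for i in range(n)]
complete = [[1 if i != j else 0 for j in range(n)] for i in range(n)]
ok_cliques = certifies(two_cliques, sigma, lam=1.0)
ok_complete = certifies(complete, sigma, lam=1.0)
```

For two disjoint cliques the certificate succeeds, while for the complete graph, which carries no cluster information, the candidate $S$ vanishes identically and the check fails.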

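Proof of Theorem 2.

The theorem is proved first for $a > b$. Let $D^* = \mathrm{diag}\{d_i^*\}$ with

$d_i^* = \sum_{j=1}^{n} A_{ij} \sigma_i^* \sigma_j^*$    (24)

and choose any $\lambda^* \ge \frac{p + q}{2}$. It suffices to show that $S^* = D^* - A + \lambda^* J$ satisfies the conditions in Lemma 3 with probability $1 - n^{-\Omega(1)}$.

By definition, $d_i^* \sigma_i^* = \sum_j A_{ij} \sigma_j^*$ for all $i$, i.e., $D^* \sigma^* = A \sigma^*$. Since $J \sigma^* = 0$, (23) holds, that is, $S^* \sigma^* = 0$. It remains to verify that $S^* \succeq 0$ and $\lambda_2(S^*) > 0$ with probability at least $1 - n^{-\Omega(1)}$, which amounts to showing that

$\mathbb{P}\left\{ \inf_{x \perp \sigma^*, \, \|x\| = 1} x^\top S^* x > 0 \right\} \ge 1 - n^{-\Omega(1)}.$    (25)

Note that $\mathbb{E}[A] = \frac{p - q}{2} Y^* + \frac{p + q}{2} J - p I$ and $Y^* = \sigma^* (\sigma^*)^\top$. Thus for any $x$ such that $x \perp \sigma^*$ and $\|x\| = 1$,

$x^\top S^* x = x^\top D^* x - x^\top \mathbb{E}[A] x + \lambda^* x^\top J x - x^\top (A - \mathbb{E}[A]) x$
$= x^\top D^* x - \frac{p - q}{2} x^\top Y^* x + \left( \lambda^* - \frac{p + q}{2} \right) x^\top J x + p - x^\top (A - \mathbb{E}[A]) x$
$\overset{(a)}{\ge} x^\top D^* x + p - \|A - \mathbb{E}[A]\| \ge \min_{i \in [n]} d_i^* + p - \|A - \mathbb{E}[A]\|,$    (26)

where $(a)$ holds since $\lambda^* \ge \frac{p + q}{2}$ and $\langle x, \sigma^* \rangle = 0$. It follows from Theorem 5 that $\|A - \mathbb{E}[A]\| \le c' \sqrt{\log n}$ with probability at least $1 - n^{-c}$ for some positive constants $c, c'$ depending only on $a$. Moreover, note that each $d_i^*$ is equal in distribution to $X - R$, where $X \sim \mathrm{Binom}(\frac{n}{2} - 1, \frac{a \log n}{n})$ and $R \sim \mathrm{Binom}(\frac{n}{2}, \frac{b \log n}{n})$ are independent. Hence, Lemma 1 implies that

$\mathbb{P}\left\{ X - R \ge \frac{\log n}{\log \log n} \right\} \ge 1 - n^{-(\sqrt{a} - \sqrt{b})^2 / 2 + o(1)}.$

Applying the union bound implies that $\min_{i \in [n]} d_i^* \ge \frac{\log n}{\log \log n}$ holds with probability at least $1 - n^{1 - (\sqrt{a} - \sqrt{b})^2 / 2 + o(1)}$. It follows from the assumption $(\sqrt{a} - \sqrt{b})^2 > 2$ and (26) that the desired (25) holds, completing the proof in the case $a > b$.

For the case $a < b$, we replace the $\arg\max$ by $\arg\min$ in the SDP (5), which is equivalent to substituting $-A$ for $A$ in the original maximization problem, as well as the sufficient condition in Lemma 3. Set the dual variable $d_i^*$ according to (24) with $-A$ replacing $A$ and choose any $\lambda^* \ge -\frac{p + q}{2}$. Then (23) still holds and (26) changes to $x^\top S^* x \ge \min_{i \in [n]} d_i^* - p - \|A - \mathbb{E}[A]\|$, where $\min_{i \in [n]} d_i^* \ge \frac{\log n}{\log \log n}$ holds with probability at least $1 - n^{1 - (\sqrt{a} - \sqrt{b})^2 / 2 + o(1)}$ by Lemma 1 and the union bound. Therefore, in view of Theorem 5 and the assumption $(\sqrt{a} - \sqrt{b})^2 > 2$, the desired (25) still holds, completing the proof for the case $a < b$. ∎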

Remark 3.

For simplicity so far we have focused on the case where $p = \frac{a \log n}{n}$ and $q = \frac{b \log n}{n}$ with $a, b$ being fixed constants. If we are strictly above the recovery threshold, namely, $(\sqrt{a} - \sqrt{b})^2 > 2$, Theorem 2 shows that the probability of error is polynomially small in $n$, which cannot be improved by MLE (though a better exponent is conceivable). Now, if we allow $a$ and $b$ to vary with $n$, the following result for the optimal estimator (MLE) has been obtained in [29] that gives the second-order refinement of Theorem 1: If , then clusters can be exactly recovered up to a permutation of cluster indices with probability converging to one if and only if

(27)

where for two positive sequences denotes . Inspecting the proof of Theorem 2 and replacing Lemma 1 by the non-asymptotic version in [21, Lemma 2], one can strengthen the sufficient condition for the success of SDP (5): as , provided that

(28)

for some universal constant , which is stronger than the optimal condition (27). It is unclear whether (28) is necessary, nor do we know if SDP requires to be positive to succeed.

4.4 Proofs for the planted densest subgraph model

Lemma 4.

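Suppose there exist $D^* = \mathrm{diag}\{d_i^*\} \ge 0$, $B^* \in \mathcal{S}^n$ with $B^* \ge 0$, $\lambda^* \in \mathbb{R}$, and $\eta^* \in \mathbb{R}$ such that $S^* \triangleq D^* - B^* - A + \eta^* I + \lambda^* J$ satisfies $S^* \succeq 0$, $\lambda_2(S^*) > 0$, and

$S^* \xi^* = 0, \quad d_i^* (Z^*_{ii} - 1) = 0 \;\; \forall i, \quad B^*_{ij} Z^*_{ij} = 0 \;\; \forall i, j.$    (29)

Then $\widehat{Z}_{\mathrm{SDP}} = Z^*$ is the unique solution to (9).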

Proof.

The Lagrangian function is given by

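where $S \succeq 0$, $D = \mathrm{diag}\{d_i\} \ge 0$, $B \in \mathcal{S}^n$ with $B \ge 0$, and $\lambda, \eta \in \mathbb{R}$ are the Lagrangian multipliers. Then, for any $Z$ satisfying the constraints in (9), It follows that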