1 Introduction
The community detection problem refers to finding the underlying communities within a network using only the knowledge of the network topology [16, 31]. This paper considers the following probabilistic model for generating a network with underlying community structures: Suppose that out of a total of $n$ vertices, $rK$ of them are partitioned into $r$ clusters of size $K$, and the remaining $n - rK$ vertices do not belong to any clusters (called outlier vertices); a random graph $G$ is generated based on the cluster structure, where each pair of vertices is connected independently with probability $p$ if they are in the same cluster or $q$ otherwise. In particular, an outlier vertex is connected to any other vertex with probability $q$. This random graph ensemble is known as the planted cluster model [10] with parameters $(n, r, K, p, q)$ such that $p \neq q$. We call $p$ and $q$ the in-cluster and cross-cluster edge density, respectively. The planted cluster model encompasses several classical planted random graph models, including planted clique [5], planted coloring [4], planted dense subgraph [6], planted partition [11], and the stochastic block model [23], which have been widely used for studying the community detection and graph partitioning problem (see, e.g., [26, 12, 30, 9] and the references therein). In this paper, we focus on the following particular cases:

Binary symmetric stochastic block model (assuming $n$ is even):
(1) $r = 2, \quad K = \frac{n}{2}, \quad p = \frac{a \log n}{n}, \quad q = \frac{b \log n}{n};$
Planted dense subgraph model:
(2) $r = 1, \quad K = \rho n, \quad p = \frac{a \log n}{n}, \quad q = \frac{b \log n}{n},$
where $a, b \ge 0$ and $\rho \in (0,1)$ are fixed constants, and study the problem of exactly recovering the clusters (up to a permutation of cluster indices) from the observation of the graph $G$.
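For concreteness, the sampling procedure just described can be sketched in a few lines. This is our own illustration (the function name and interface are not from the cited references), not an implementation used in the paper.

```python
import random

def planted_cluster_graph(n, r, K, p, q, seed=0):
    """Sample an adjacency matrix from the planted cluster model:
    vertices 0..r*K-1 form r clusters of size K; the remaining
    n - r*K vertices are outliers.  Pairs in the same cluster are
    joined with probability p, all other pairs with probability q."""
    rng = random.Random(seed)
    # cluster[i] = index of vertex i's cluster, or -1 for an outlier
    cluster = [i // K if i < r * K else -1 for i in range(n)]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same = cluster[i] == cluster[j] and cluster[i] != -1
            if rng.random() < (p if same else q):
                A[i][j] = A[j][i] = 1
    return A, cluster
```

Setting $p = 1$, $q = 0$ recovers a disjoint union of cliques, a useful degenerate sanity check.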
Exact cluster recovery under the binary symmetric stochastic block model is studied in [1, 29], where a sharp recovery threshold is found.
Theorem 1 ([1, 29]).
Under the binary symmetric stochastic block model eq:SBMscaling, if $\sqrt{a} - \sqrt{b} > \sqrt{2}$, the clusters can be exactly recovered up to a permutation of cluster indices with probability converging to one;^{1} if $\sqrt{a} - \sqrt{b} < \sqrt{2}$, no algorithm can exactly recover the clusters with probability converging to one. ^{1}If $\sqrt{a} - \sqrt{b} = \sqrt{2}$ and $b > 0$, exact recovery is also possible, as shown by [29]. But for $b = 0$, since $a = 2$, the Erdős–Rényi random graph $G(n/2, 2\log n / n)$ contains isolated vertices with probability bounded away from zero and exact recovery is impossible.
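Taking the sharp threshold of Theorem 1 to be $\sqrt{a} - \sqrt{b} > \sqrt{2}$ (our reading of the statement above), the dichotomy reduces to a one-line check; this helper is purely illustrative and the name is ours.

```python
import math

def sbm_exact_recovery_possible(a, b):
    """Check the sharp threshold of Theorem 1 for the binary symmetric
    SBM with p = a*log(n)/n and q = b*log(n)/n: exact recovery is
    possible (w.h.p., as n -> infinity) iff sqrt(a) - sqrt(b) > sqrt(2).
    The boundary case sqrt(a) - sqrt(b) = sqrt(2) is excluded here."""
    return math.sqrt(a) - math.sqrt(b) > math.sqrt(2)
```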
The optimal recovery threshold in thm:optimalSBM is achieved by the maximum likelihood (ML) estimator, which entails finding the minimum bisection of the graph, a problem known to be NP-hard in the worst case [18, Theorem 1.3]. Nevertheless, it has been shown that the optimal recovery threshold can be attained in polynomial time using a two-step procedure [1, 29]: first, apply the partial recovery algorithms in [28, 24] to correctly cluster all but $o(n)$ vertices; second, flip the cluster memberships of those vertices that do not agree with the majority of their neighbors. It remains open to find a simple direct approach that achieves the exact recovery threshold in polynomial time. It was proved in [1] that a semidefinite programming (SDP) relaxation of the ML estimator succeeds in a regime strictly above the optimal threshold. Backed by compelling simulation results, it was further conjectured in [1] that the SDP relaxation can achieve the optimal recovery threshold. In this paper, we resolve this conjecture in the positive.
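The clean-up step of the two-step procedure above can be sketched as follows. This is a minimal sketch under our own naming; a practical implementation would iterate over adjacency lists rather than a dense matrix.

```python
def majority_vote_refine(A, sigma):
    """One round of the second step described above: each vertex adopts
    the +1/-1 label held by the majority of its neighbors; ties keep the
    current label.  `A` is a 0/1 adjacency matrix and `sigma` a list of
    +1/-1 labels produced by a partial-recovery first step."""
    n = len(sigma)
    refined = list(sigma)
    for i in range(n):
        vote = sum(sigma[j] for j in range(n) if A[i][j])
        if vote != 0:
            refined[i] = 1 if vote > 0 else -1
    return refined
```

On two disjoint triangles with a single mislabeled vertex, one round already corrects the error.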
In addition, we prove that the SDP relaxation achieves the optimal recovery threshold for the planted dense subgraph model eq:PDSscaling, where the cluster size scales linearly in $n$. This conclusion is in sharp contrast to the following computational barrier established in [22]: if the cluster size $K$ grows sublinearly in $n$ and the edge densities $p, q$ decay polynomially in $n$, attaining the statistically optimal recovery threshold is at least as hard as solving the planted clique problem (see sec:pds for detailed discussions).
Since the initial posting of this paper to arXiv, a number of interesting papers have been posted, some extending or improving our results. An independent resolution of the conjecture in [1] was given in [8]. A sharp characterization of the threshold for exact recovery for a general class of stochastic block models is derived in [2], which includes the two cases considered in this paper as special cases. Extensions of this paper appear in [21], showing that SDP provides exact recovery up to the information-theoretic threshold for a fixed number of equal-sized clusters or two unequal-sized clusters. More recently, the preprint [3] shows similar optimality results of SDP for a growing number of equal-sized clusters, and [32] establishes optimality results of SDP for a fixed number of clusters with unequal sizes.
Notation
Let $A$ denote the adjacency matrix of the graph $G$, $\mathbf{I}$ denote the identity matrix, and $\mathbf{J}$ denote the all-one matrix. We write $X \succeq 0$ if $X$ is positive semidefinite and $X \ge 0$ if all the entries of $X$ are nonnegative. Let $\mathcal{S}^n$ denote the set of all $n \times n$ symmetric matrices. For $X \in \mathcal{S}^n$, let $\lambda_2(X)$ denote its second smallest eigenvalue. For any matrix $Y$, let $\|Y\|$ denote its spectral norm. For any positive integer $n$, let $[n] = \{1, \ldots, n\}$. For any set $S$, let $|S|$ denote its cardinality and $S^c$ denote its complement. We use standard big-$O$ notation, e.g., for any sequences $\{a_n\}$ and $\{b_n\}$, $a_n = O(b_n)$ or $b_n = \Omega(a_n)$ if there is an absolute constant $c > 0$ such that $a_n \le c\, b_n$. Let $\mathrm{Bern}(p)$ denote the Bernoulli distribution with mean $p$ and $\mathrm{Binom}(n, p)$ denote the binomial distribution with $n$ trials and success probability $p$. All logarithms are natural and we use the convention $0 \log 0 = 0$.
2 Stochastic block model
The cluster structure under the binary symmetric stochastic block model can be represented by a vector $\sigma \in \{\pm 1\}^n$ such that $\sigma_i = 1$ if vertex $i$ is in the first cluster and $\sigma_i = -1$ otherwise. Let $\sigma^*$ correspond to the true clusters. Then the ML estimator of $\sigma^*$ for the case $a > b$ can be simply stated as
(3) $\max_{\sigma} \; \sum_{i < j} A_{ij} \sigma_i \sigma_j \quad \text{s.t.} \; \sigma \in \{\pm 1\}^n, \; \sigma^\top \mathbf{1} = 0,$
which maximizes the number of in-cluster edges minus the number of cross-cluster edges. This is equivalent to solving the NP-hard minimum graph bisection problem. Instead, let us consider its convex relaxation, similar to the SDP relaxation studied in [19, 1]. Let $Y = \sigma \sigma^\top$. Then $Y_{ii} = 1$ is equivalent to $\sigma_i \in \{\pm 1\}$, and $\sigma^\top \mathbf{1} = 0$ if and only if $\langle \mathbf{J}, Y \rangle = 0$. Therefore, eq:SBMML1 can be recast as
(4) $\max_{Y, \sigma} \; \langle A, Y \rangle \quad \text{s.t.} \; Y = \sigma \sigma^\top, \; Y_{ii} = 1 \text{ for all } i, \; \langle \mathbf{J}, Y \rangle = 0.$
Notice that the matrix $Y = \sigma\sigma^\top$ is a rank-one positive semidefinite matrix. If we relax this condition by dropping the rank-one restriction, we obtain the following convex relaxation of eq:SBMML2, which is a semidefinite program:
(5) $\hat{Y}_{\mathrm{SDP}} = \arg\max_{Y} \; \langle A, Y \rangle \quad \text{s.t.} \; Y \succeq 0, \; Y_{ii} = 1 \text{ for all } i, \; \langle \mathbf{J}, Y \rangle = 0.$
We remark that eq:SBMconvex does not rely on any knowledge of the model parameters except that $a > b$; for the case $a < b$, we replace $\max$ in eq:SBMconvex by $\min$. The SDP formulation introduced in [1] is slightly different from ours: it does not impose the constraint $\langle \mathbf{J}, Y \rangle = 0$, and its objective function is the inner product between a weighted adjacency matrix and $Y$.
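To make the relaxation chain concrete, here is a brute-force version of the ML estimator eq:SBMML1 for toy instances. The function name is ours; the enumeration over balanced labelings is exponential in $n$, which is precisely why the tractable SDP surrogate eq:SBMconvex is of interest.

```python
from itertools import combinations

def ml_bisection(A):
    """Brute-force ML estimator for the binary symmetric SBM: over all
    balanced +1/-1 labelings, maximize sum_{i<j} A[i][j]*sigma_i*sigma_j
    (equivalently, find a minimum bisection).  Exponential in n --
    illustration only.  Returns one optimal labeling; the negated
    labeling is always equally optimal."""
    n = len(A)
    best, best_val = None, None
    for S in combinations(range(n), n // 2):   # vertices labeled +1
        members = set(S)
        sigma = [1 if i in members else -1 for i in range(n)]
        val = sum(A[i][j] * sigma[i] * sigma[j]
                  for i in range(n) for j in range(i + 1, n))
        if best_val is None or val > best_val:
            best, best_val = sigma, val
    return best
```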
Let $Y^* = \sigma^* (\sigma^*)^\top$ correspond to the true clusters. The following result establishes the optimality of the SDP procedure:
Theorem 2.
If $\sqrt{a} - \sqrt{b} > \sqrt{2}$, then $\mathbb{P}\{\hat{Y}_{\mathrm{SDP}} = Y^*\} \to 1$ as $n \to \infty$.
Remark 1.
It is worth noting the relationship between our SDP relaxation and other related formulations that appeared previously in the literature. In fact, eq:SBMconvex coincides with the SDP studied in [14, p. 659] for MIN BISECTION, where a sufficient condition for exact recovery is obtained in [14, Lemma 19] that is not optimal. The SDP relaxation considered in [17, Equation (15)] for MAX BISECTION is equivalent to eq:SBMconvex with $\max$ replaced by $\min$ (used when $a < b$) and with a relaxed form of the constraint $\langle \mathbf{J}, Y \rangle = 0$. The proof of thm:SBMSharp shows that this more relaxed version also works. Finally, we note that the SDP used in [1] can be viewed as a penalized version of eq:SBMconvex, with a penalty term proportional to $\langle \mathbf{J}, Y \rangle$ added to the objective function.
Remark 2.
thm:SBMSharp naturally extends to the semirandom model considered in [14], where after a graph is instantiated from the SBM, a monotone adversary can add in-cluster edges and delete cross-cluster edges arbitrarily. Although this process may appear to make the cluster structure more visible, such an adversarial model is known to foil many procedures based on degrees, local search, or the graph spectrum [14]. By design, the SDP eq:SBMconvex enjoys robustness against such a monotone adversary, which has already been observed in [14] and proved in [9, Lemma 1]. More specifically, let $\tilde{A}$ denote the adjacency matrix of the altered graph. Whenever $Y^*$ is the unique maximizer of eq:SBMconvex, i.e., $\langle A, Y^* \rangle > \langle A, Y \rangle$ for any other feasible $Y$, then $Y^*$ is also the unique maximizer of eq:SBMconvex with $A$ replaced by $\tilde{A}$. To see this, note that, by the Cauchy–Schwarz inequality, $Y \succeq 0$ and $Y_{ii} = 1$ imply that $|Y_{ij}| \le 1$ for all $i, j$. Consequently, $\langle \tilde{A} - A, Y^* - Y \rangle \ge 0$, since the adversary only adds edges on pairs where $Y^*_{ij} = 1 \ge Y_{ij}$ and only deletes edges on pairs where $Y^*_{ij} = -1 \le Y_{ij}$. Then $\langle \tilde{A}, Y^* \rangle - \langle \tilde{A}, Y \rangle = \langle A, Y^* - Y \rangle + \langle \tilde{A} - A, Y^* - Y \rangle > 0$, establishing the unique optimality of $Y^*$.
3 Planted dense subgraph model
In this section we turn to the planted dense subgraph model in the asymptotic regime eq:PDSscaling, where there exists a single cluster of size $K = \rho n$. To specify the optimal recovery threshold, define the following function: for $a, b > 0$, let
(6)
We show that above this threshold, exact recovery is achievable in polynomial time via SDP with probability tending to one; below it, any estimator fails to recover the cluster with probability tending to one, regardless of the computational cost. The sharp threshold is plotted in Fig. 1 for several values of the parameters.
We first introduce the maximum likelihood estimator and its convex relaxation. For ease of notation, in this section we use a binary vector $\xi \in \{0, 1\}^n$, as opposed to the $\pm 1$-valued vector $\sigma$ used in sec:SBM for the SBM, as the indicator function of the cluster, such that $\xi_i = 1$ if vertex $i$ is in the cluster and $\xi_i = 0$ otherwise. Let $\xi^*$ be the indicator of the true cluster. Assuming $a > b$, i.e., the vertices in the cluster are more densely connected, the ML estimation of $\xi^*$ is simply
(7) $\max_{\xi} \; \sum_{i < j} A_{ij} \xi_i \xi_j \quad \text{s.t.} \; \xi \in \{0, 1\}^n, \; \xi^\top \mathbf{1} = K,$
which maximizes the number of in-cluster edges. Due to the integrality constraints, it is computationally difficult to solve eq:PDSML1, which prompts us to consider its convex relaxation. Note that eq:PDSML1 can be equivalently^{2} formulated as ^{2}Here eq:PDSML1 and eq:PDSML2 are equivalent in the following sense: for any $\xi$ feasible for eq:PDSML1, $Z = \xi\xi^\top$ is feasible for eq:PDSML2; for any $Z = \xi\xi^\top$ feasible for eq:PDSML2, either $\xi$ or $-\xi$ is feasible for eq:PDSML1.
(8) $\max_{Z, \xi} \; \langle A, Z \rangle \quad \text{s.t.} \; Z = \xi \xi^\top, \; \xi \in \{0, 1\}^n, \; \xi^\top \mathbf{1} = K,$
where the matrix $Z = \xi \xi^\top$ is positive semidefinite and rank-one. Removing the rank-one restriction leads to the following convex relaxation of eq:PDSML2, which is a semidefinite program:
(9) $\hat{Z}_{\mathrm{SDP}} = \arg\max_{Z} \; \langle A, Z \rangle \quad \text{s.t.} \; Z \succeq 0, \; Z \ge 0, \; Z_{ii} \le 1 \text{ for all } i, \; \langle \mathbf{J}, Z \rangle = K^2.$
We note that, apart from the assumption that $a > b$, the only model parameter needed by the estimator eq:PDSCVX is the cluster size $K$; for the case $a < b$, we replace $\max$ in eq:PDSCVX by $\min$.
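As with the SBM, the combinatorial estimator eq:PDSML1 can be run by brute force on toy instances, which clarifies what the SDP eq:PDSCVX is relaxing. The function name is ours and the enumeration is exponential in $n$; this is an illustration only.

```python
from itertools import combinations

def ml_densest_subgraph(A, K):
    """Brute-force ML estimator for the planted dense subgraph model
    (assuming p > q): among all vertex subsets of size K, pick one with
    the most internal edges.  Returns the 0/1 indicator vector of an
    optimizer, mirroring the vector xi used in the text."""
    n = len(A)
    best, best_edges = None, -1
    for S in combinations(range(n), K):
        edges = sum(A[i][j] for i, j in combinations(S, 2))
        if edges > best_edges:
            best, best_edges = S, edges
    xi = [0] * n
    for i in best:
        xi[i] = 1
    return xi
```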
Let $Z^* = \xi^* (\xi^*)^\top$ correspond to the true cluster. The recovery threshold for the SDP eq:PDSCVX is given as follows.
Theorem 3.
Under the planted dense subgraph model eq:PDSscaling, if
(10) 
then $\mathbb{P}\{\hat{Z}_{\mathrm{SDP}} = Z^*\} \to 1$ as $n \to \infty$.
Next we prove a converse for thm:PlantedSharp which shows that the recovery threshold achieved by the SDP relaxation is in fact optimal.
Theorem 4.
If
(11) 
then for any sequence of estimators $\hat{Z}$, $\mathbb{P}\{\hat{Z} = Z^*\} \to 0$.
Under the planted dense subgraph model, our investigation of the exact cluster recovery problem thus far has focused on the regime where the cluster size $K$ grows linearly in $n$ and $p, q = \Theta(\log n / n)$, in which the statistically optimal threshold can be attained by SDP in polynomial time. However, this need not be the case if $K$ grows sublinearly in $n$. In fact, the exact cluster recovery problem has been studied in [10, 22] in the following asymptotic regime:
(12) $p, q = \Theta(n^{-\alpha}) \text{ with } p > q, \qquad K = \Theta(n^{\beta}),$
where $\alpha$ and $\beta$ are fixed constants. The statistical and computational complexities of the cluster recovery problem depend crucially on the values of $\alpha$ and $\beta$ (see [22, Figure 2] for an illustration):

$2\beta - \alpha > 1$: the planted cluster can be perfectly recovered in polynomial time with high probability via the SDP relaxation eq:PDSCVX.^{3} ^{3}In fact, an even looser SDP relaxation than eq:PDSCVX has been shown to exactly recover the planted cluster with high probability in this regime. See [10, Theorem 2.3].

$\frac{1}{2} + \frac{\alpha}{4} < \beta < \frac{1 + \alpha}{2}$: the planted cluster can be detected in linear time with high probability by thresholding the total number of edges, but it is conjectured to be computationally intractable to exactly recover the planted cluster.

$\alpha < \beta < \frac{1}{2} + \frac{\alpha}{4}$: the planted cluster can be exactly recovered with high probability via ML estimation; however, no randomized polynomial-time solver exists conditioned on the planted clique hardness hypothesis.^{4} ^{4}Here the planted clique hardness hypothesis refers to the statement that for any fixed constant $\delta > 0$ and clique size $k \le n^{1/2 - \delta}$, there exist no randomized polynomial-time tests to distinguish an Erdős–Rényi random graph $G(n, 1/2)$ from a planted clique model, which is obtained by adding edges to $k$ vertices chosen uniformly from $G(n, 1/2)$ to form a clique. For various hardness results of problems reducible from the planted clique problem, see [22] and the references within.

$\beta < \alpha$: regardless of the computational costs, no algorithm can exactly recover the planted cluster with vanishing probability of error.
Consequently, assuming the planted clique hardness hypothesis, in the asymptotic regime eq:scaling when $\alpha < \beta < \frac{1}{2} + \frac{\alpha}{4}$ (and, quite possibly, in the entire range $\alpha < \beta < \frac{1 + \alpha}{2}$), there exists a significant gap between the information limit (the recovery threshold of the optimal procedure) and the computational limit (the recovery threshold for polynomial-time algorithms). In contrast, in the asymptotic regime eq:PDSscaling, the computational constraint imposes no penalty on the statistical performance, in that the optimal threshold can be attained by the SDP relaxation in view of thm:PlantedSharp.
4 Proofs
In this section, we give the proofs of our main theorems. Our analysis of the SDP relies on two key ingredients, which we present first: the spectrum of Erdős–Rényi random graphs and tail bounds for the binomial distribution.
4.1 Spectrum of the Erdős–Rényi random graph
Let $A$ denote the adjacency matrix of an Erdős–Rényi random graph $G(n, p)$, where vertices $i$ and $j$ are connected independently with probability $p$. Then $\mathbb{E}[A] = p(\mathbf{J} - \mathbf{I})$. Assume $np \ge c_0 \log n$ for a constant $c_0 > 0$. We aim to show that $\|A - \mathbb{E}[A]\| \le c' \sqrt{np}$ with high probability for some constant $c'$. To this end, we establish the following more general result, where the entries need not be binary-valued.
Theorem 5.
Let $X \in \mathcal{S}^n$ denote a symmetric, zero-diagonal random matrix whose entries $\{X_{ij} : i < j\}$ are independent and $[0, 1]$-valued. Assume that $\mathbb{E}[X_{ij}] \le p$, where $p \ge c_0 \log n / n$ for an arbitrary constant $c_0 > 0$. Then for any $c > 0$, there exists $c' > 0$ such that for any $n$, $\mathbb{P}\{\|X - \mathbb{E}[X]\| \le c' \sqrt{np}\} \ge 1 - n^{-c}$.
Let $G(n, p)$ denote the Erdős–Rényi random graph model with edge probability $p$ for all pairs. Results similar to thm:adjconcentration have been obtained in [15] for the special case where $np \ge c \log n$ for some sufficiently large constant $c$. In fact, thm:adjconcentration can be proved by strengthening the combinatorial arguments in [15, Section 2.2]. Here we provide an alternative proof using results from random matrices and concentration of measure, together with a second-order stochastic comparison argument from [25].
Furthermore, we note that the condition $np = \Omega(\log n)$ in thm:adjconcentration is in fact necessary to ensure that $\|X - \mathbb{E}[X]\| = O(\sqrt{np})$ with high probability (see app:adj for a proof). The condition can be dropped in the special case of .
Proof.
We first use the second-order stochastic comparison argument from [25, Lemma 2]. Since , we have for all and hence . Let denote the adjacency matrix of a graph generated from . Then, for any , is stochastically smaller than under the convex ordering, i.e., for any convex function on .^{5}This follows from for any , by the convexity of .
Since the spectral norm is a convex function and the coordinate random variables are independent (up to symmetry), it follows that
and thus(13) 
We next bound . Let denote an matrix with independent entries drawn from , which is the distribution of a Rademacher random variable multiplied with an independent Bernoulli with bias . Define as and for all . Let be an independent copy of . Let be a zerodiagonal symmetric matrix whose entries are drawn from and be an independent copy of . Let denote an zerodiagonal symmetric matrix whose entries are Rademacher and independent from and . We apply the usual symmetrization arguments:
(14) 
where follow from Jensen's inequality; follows because has the same distribution as , where denotes the element-wise product; follow from the triangle inequality; follows from the fact that has the same distribution as . Then, we apply the result of Seginer [33], which characterizes the expected spectral norm of i.i.d. random matrices within universal constant factors. Let , which are independent . Since is symmetric, [33, Theorem 1.1] and Jensen's inequality yield
(15) 
for some universal constant . In view of the following Chernoff bound for the binomial distribution [27, Theorem 4.4]:
for all , setting and applying the union bound, we have
(16) 
where the last inequality follows from . Assembling eq:boundonA – eq:maxX, we obtain
(17) 
for some positive constant depending only on . Since the entries of are valued in , Talagrand’s concentration inequality for 1Lipschitz convex functions (see, e.g., [34, Theorem 2.1.13]) yields
for some absolute constants , which implies that for any , there exists depending on , such that ∎
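The concentration $\|X - \mathbb{E}[X]\| = O(\sqrt{np})$ can be checked numerically. Below is a stdlib-only sketch using plain power iteration (in practice one would use a linear-algebra library); the constants in the assertions are loose illustrative choices of ours, not the constants of thm:adjconcentration.

```python
import math, random

def centered_adjacency(n, p, seed=1):
    """Sample A - E[A] for an Erdos-Renyi graph G(n, p); the diagonal
    is zero, matching the zero-diagonal convention of Theorem 5."""
    rng = random.Random(seed)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            x = (1.0 if rng.random() < p else 0.0) - p
            M[i][j] = M[j][i] = x
    return M

def spectral_norm(M, iters=100, seed=2):
    """Estimate ||M|| for a symmetric matrix M by power iteration,
    which converges to the eigenvalue of largest magnitude."""
    rng = random.Random(seed)
    n = len(M)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = 1.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return norm
```

For moderate $n$, the estimate lands near $2\sqrt{np(1-p)}$, comfortably within a constant multiple of $\sqrt{np}$.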
4.2 Tail of the Binomial Distribution
Let and for and , where for some as . We need the following tail bounds.
Lemma 1 ([1]).
Assume that and such that . Then
Lemma 2.
Let be such that and for some and . Then
(18)  
(19) 
Proof.
We use the following non-asymptotic bound on the binomial tail probability [7, Lemma 4.7.2]: for $k > np$,
(20) $\mathbb{P}\{\mathrm{Binom}(n, p) \ge k\} \le \exp\left(-n\, d(k/n \,\|\, p)\right),$
where $d(x \,\|\, y) = x \log\frac{x}{y} + (1 - x)\log\frac{1 - x}{1 - y}$ is the binary divergence function. Then eq:binomupbound2 follows from eq:ash1 by noting that .
To prove eq:binomupbound1, we use the following bound on binomial coefficients [7, Lemma 4.7.1]:
(21) $\binom{n}{k} \ge \frac{1}{n + 1} \exp\left(n\, h(k/n)\right),$
where $h(x) = -x \log x - (1 - x)\log(1 - x)$ is the binary entropy function. Note that the mode of $\mathrm{Binom}(n, p)$ is at $\lfloor (n + 1)p \rfloor$, which is at least for sufficiently large . Therefore, is non-decreasing in for and hence
(22) 
where and . Applying eq:ash2 to eq:mode yields
which is the desired eq:binomupbound1. ∎
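The divergence bound (20) is easy to check numerically against the exact binomial tail; the function names below are ours. The second assertion reflects the standard type-counting lower bound $\mathbb{P}\{\mathrm{Binom}(n,p) \ge k\} \ge \frac{1}{n+1} e^{-n\, d(k/n \| p)}$ for $k \ge np$, so the bound is tight up to a polynomial factor.

```python
import math

def binom_tail(n, p, k):
    """Exact upper tail P{Binom(n, p) >= k}, computed by summation."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k, n + 1))

def chernoff_bound(n, p, k):
    """The bound (20): exp(-n * d(k/n || p)) for k > n*p, where
    d(x||y) = x log(x/y) + (1-x) log((1-x)/(1-y)) is the binary
    divergence."""
    x = k / n
    d = x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))
    return math.exp(-n * d)
```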
4.3 Proofs for the stochastic block model
The following lemma provides a deterministic sufficient condition for the success of the SDP eq:SBMconvex in the case $a > b$.
Lemma 3.
Suppose there exist and such that satisfies , and
(23) 
Then is the unique solution to eq:SBMconvex.
Proof.
The Lagrangian function is given by
where the Lagrangian multipliers are denoted by , , and . Then for any satisfying the constraints in eq:SBMconvex,
where holds because ; holds because by eq:SBMKKT. Hence, is an optimal solution. It remains to establish its uniqueness. To this end, suppose is an optimal solution. Then,
where holds because ; holds because and for all . In view of eq:SBMKKT, since , with , must be a multiple of . Because for all , . ∎
Proof of thm:SBMSharp.
The theorem is proved first for . Let with
(24) 
and choose any . It suffices to show that satisfies the conditions in lmm:SBMKKT with probability .
By definition, for all , i.e., . Since , eq:SBMKKT holds, that is, . It remains to verify that and with probability at least which amounts to showing that
(25) 
Note that and . Thus for any such that and ,
(26) 
where holds since and . It follows from thm:adjconcentration that with probability at least for some positive constants depending only on . Moreover, note that each is equal in distribution to , where and are independent. Hence, lmm:binomialconcentration implies that
Applying the union bound implies that holds with probability at least . It follows from the assumption and eq:SBMPSDCheck that the desired eq:lambda2 holds, completing the proof in the case .
For the case $a < b$, we replace the $\max$ by $\min$ in the SDP eq:SBMconvex, which is equivalent to substituting $-A$ for $A$ in the original maximization problem, as well as in the sufficient condition in lmm:SBMKKT. Set the dual variable according to eq:diSBM with $-A$ replacing $A$ and choose any . Then eq:SBMKKT still holds and eq:SBMPSDCheck changes to , where holds with probability at least by lmm:binomialconcentration and the union bound. Therefore, in view of thm:adjconcentration and the assumption , the desired eq:lambda2 still holds, completing the proof for the case $a < b$. ∎
Remark 3.
For simplicity, so far we have focused on the case where $p = a \log n / n$ and $q = b \log n / n$ with $a, b$ being fixed constants. If we are strictly above the recovery threshold, namely $\sqrt{a} - \sqrt{b} > \sqrt{2}$, thm:SBMSharp shows that the probability of error is polynomially small in $n$, which cannot be improved by the MLE (though a better exponent is conceivable). Now, if we allow $a$ and $b$ to vary with $n$, the following result for the optimal estimator (MLE) has been obtained in [29], which gives a second-order refinement of thm:optimalSBM: the clusters can be exactly recovered up to a permutation of cluster indices with probability converging to one if and only if
(27) 
where for two positive sequences and , denotes . Inspecting the proof of thm:SBMSharp and replacing lmm:binomialconcentration by the non-asymptotic version in [21, Lemma 2], one can strengthen the sufficient condition for the success of the SDP eq:SBMconvex: as , provided that
(28) 
for some universal constant , which is stronger than the optimal condition eq:mls. It is unclear whether eq:SDP2nd is necessary, nor do we know whether the SDP requires to be positive to succeed.
4.4 Proofs for the planted dense subgraph model
Lemma 4.
Suppose there exist , with , , and such that satisfies , , and
(29) 
Then is the unique solution to eq:PDSCVX.
Proof.
The Lagrangian function is given by
where , , with , and are the Lagrange multipliers. Then, for any satisfying the constraints in eq:PDSCVX, it follows that