Community detection is the task of finding large groups of similar items inside a large relationship graph, where it is expected that related items are (in the assortative case) more likely to be linked together. The Stochastic Block Model (abbreviated as SBM) was designed by Holland et al. [Hol83] to analyze the performance of algorithms for this task; it consists of a random graph whose edge probabilities depend only on the community membership of their endpoints. Since then, a large number of articles have been devoted to the study of this model; a survey of these results can be found in Abbe [Abb17], or in Fortunato [For10] for a more general view on community detection.
The sparse case, when edge probabilities are in $O(1/n)$, is known to be much harder to study than denser models; the existence of a positive proportion of isolated vertices makes complete reconstruction impossible, and studies usually focus on partial recovery of the community structure. Insights on this topic often come from statistical physics; in the two-community case, Decelle et al. conjectured in [Dec11] the existence of a threshold for reconstruction, which was then proved in Mossel et al. [Mos15] for the first part, and in Massoulié [Mas13] and Mossel et al. [Mos13] for the converse part. Similarly, in the general case, a method was first presented in Krzakala et al. [Krz13] and then proven to work in Bordenave et al. [Bor15] (bar a technical condition) and Abbe and Sandon [Abb16].
Notably, in the sparse setting, the usual method relying on the eigenvectors of the adjacency matrix of the graph fails due to the lack of separation of the eigenvalues. Consequently, a wide array of alternative spectral methods has been designed, relying on the spectrum of a matrix associated with the graph. More precisely, the eigenvectors associated with the largest eigenvalues often carry some information about the community structure of the graph, enough for partial reconstruction. Examples include the path expansion matrix used in [Mas13], or the non-backtracking matrix in [Krz13].
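To make the last example concrete, the following is a minimal sketch of how a non-backtracking matrix can be assembled for a small graph. It uses the standard definition (rows and columns indexed by directed edges, with an entry equal to 1 when the second edge continues the first without an immediate U-turn); the function names `non_backtracking_matrix` and `power_iteration` are illustrative choices made here, and the plain power iteration is only a rough stand-in for a proper eigen-solver, not the procedure of [Krz13] or [Bor15].

```python
def non_backtracking_matrix(edges):
    """Return (directed_edges, B) for an undirected simple graph.

    B[i][j] = 1 when directed edge i = (u, v) can be followed by
    directed edge j = (x, y), i.e. v == x and y != u (no U-turn).
    """
    # Each undirected edge {u, v} yields the two directed edges (u, v), (v, u).
    directed = []
    for u, v in edges:
        directed.append((u, v))
        directed.append((v, u))
    m = len(directed)
    B = [[0] * m for _ in range(m)]
    for i, (u, v) in enumerate(directed):
        for j, (x, y) in enumerate(directed):
            if v == x and y != u:
                B[i][j] = 1
    return directed, B


def power_iteration(B, steps=200):
    """Roughly estimate the leading eigenvalue of a nonnegative matrix B.

    Illustrative only: convergence is not guaranteed for every graph.
    """
    m = len(B)
    x = [1.0] * m
    lam = 0.0
    for _ in range(steps):
        y = [sum(B[i][j] * x[j] for j in range(m)) for i in range(m)]
        lam = max(abs(t) for t in y)
        if lam == 0:
            return 0.0, y
        x = [t / lam for t in y]  # renormalize in the max norm
    return lam, x


# Toy example (an assumption for illustration): the complete graph on 4 vertices.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
directed, B = non_backtracking_matrix(k4_edges)
lam, vec = power_iteration(B)
```

In a $d$-regular graph every row of this matrix sums to $d - 1$, so the all-ones vector is an eigenvector and the leading eigenvalue is $d - 1$; the toy example above is 3-regular. In the sparse SBM the graph is far from regular, and it is the eigenvectors attached to the next largest eigenvalues that are expected to carry community information.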
Additionally, other types of methods can be used in this setting: for example, the semi-definite programming (or SDP) algorithm relaxes the problem into a convex optimization one, which can then be approximately solved (see for example Montanari and Sen [Mon16]).
An important feature of real-life networks that is missing from the SBM is the existence of small-scale regions of higher density, which arise from phenomena unrelated to the community structure. For this reason, a common variant of the SBM is the addition of small cliques to the generated random graph. Commonly used spectral methods, for example those relying on the non-backtracking matrix in [Bor15], are known to fail in this setting, due to the appearance of localized eigenvectors that have no ties to the community structure and correspond to large eigenvalues; see Zhang [Zha16] for a comparison of those methods, as well as a proposed heuristic to deal with those localized vectors by lowering their associated eigenvalues. SDP methods are the most studied for this problem, due to their natural stability; in particular, Makarychev et al. [Mak16] show a reconstruction algorithm that is robust to the adversarial addition of edges, in the case of an arbitrary number of communities; this was also shown independently by Moitra et al. [Moi16]. However, all the SDP methods mentioned here fail to reach the Kesten-Stigum (KS) threshold by at least a large constant, with only [Mon16] approaching it as the average degree increases.
1.2 Setting and main results
Stochastic block model
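Let $r$ be a given integer, and $\pi = (\pi_1, \dots, \pi_r)$ a probability vector. We consider a random graph $G = (V, E)$ as follows. The vertex set $V$ is taken to be $[n] = \{1, \dots, n\}$, and each vertex $v$ is assigned a type $\sigma(v) \in [r]$ sampled independently from distribution $\pi$.
Given a symmetric matrix $W \in \mathbb{R}^{r \times r}$ with positive coefficients, two vertices $u, v$ in $V$ are joined with an edge randomly and independently with probability
$$\min\left(1, \frac{W_{\sigma(u)\sigma(v)}}{n}\right).$$
Following [Bor15], we introduce $\Pi = \mathrm{diag}(\pi_1, \dots, \pi_r)$ and define the mean progeny matrix $M = \Pi W$; the eigenvalues of $M$ are the same as those of $\Pi^{1/2} W \Pi^{1/2}$ and in particular are real. We denote them by
$$|\mu_1| \geq |\mu_2| \geq \dots \geq |\mu_r|.$$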
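As an illustration of this construction, the sketch below samples a small graph from the model and checks numerically that the mean progeny matrix and its symmetrized version share the same spectrum. Several things are assumptions made for the example only: the names `pi` (type distribution) and `W` (symmetric matrix), the edge probability taken as $\min(1, W_{\sigma(u)\sigma(v)}/n)$ in the convention of [Bor15], the vertex labels $0, \dots, n-1$ instead of $1, \dots, n$, the numerical values, and the restriction to two types so that eigenvalues can be computed in closed form.

```python
import math
import random


def sample_sbm(n, pi, W, seed=0):
    """Sample vertex types and edges of a sparse SBM on vertices 0..n-1."""
    rng = random.Random(seed)
    r = len(pi)
    # Each vertex gets a type drawn independently from the distribution pi.
    types = rng.choices(range(r), weights=pi, k=n)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            # Assumed edge probability: min(1, W[type(u)][type(v)] / n).
            p = min(1.0, W[types[u]][types[v]] / n)
            if rng.random() < p:
                edges.append((u, v))
    return types, edges


def mean_progeny(pi, W):
    """M = diag(pi) @ W, i.e. M[i][j] = pi[i] * W[i][j]."""
    r = len(pi)
    return [[pi[i] * W[i][j] for j in range(r)] for i in range(r)]


def symmetrized(pi, W):
    """S = diag(pi)^(1/2) @ W @ diag(pi)^(1/2), a symmetric matrix."""
    r = len(pi)
    return [[math.sqrt(pi[i]) * W[i][j] * math.sqrt(pi[j]) for j in range(r)]
            for i in range(r)]


def eigenvalues_2x2(A):
    """Real eigenvalues of a 2x2 matrix, sorted by decreasing absolute value."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    root = math.sqrt(max(tr * tr - 4 * det, 0.0))
    return sorted(((tr + root) / 2, (tr - root) / 2), key=abs, reverse=True)


# Hypothetical parameters chosen for illustration.
pi = [0.25, 0.75]
W = [[8.0, 2.0], [2.0, 4.0]]

mu_M = eigenvalues_2x2(mean_progeny(pi, W))
mu_S = eigenvalues_2x2(symmetrized(pi, W))
types, edges = sample_sbm(200, pi, W, seed=1)
```

With these values, $M$ is not symmetric, yet both calls to `eigenvalues_2x2` return the same pair of real numbers, as expected since $M$ and the symmetrized matrix are similar. The quadratic loop over vertex pairs is kept for readability and is only suitable for small $n$.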
We shall make the following regularity assumptions: first,