Identifying clusters from relational data is one of the fundamental problems in computer science. It has many applications such as analyzing social networks [NWS02], detecting protein-protein interactions [MPN99, CY06], finding clusters in Hi-C genomic data [CAT16], image segmentation [SM00], recommendation systems [LSY03, SC11] and many others. The goal is to find a community structure from relational measurements between data points.
Although many clustering problems are known to be NP-hard, typical data we encounter in applications are very different from the worst-case instances. This motivates us to study probabilistic models and average-case complexity for them. The stochastic block model (SBM) is one such model that has received much attention in the past few decades. In the SBM, we observe a random graph on the finite set of nodes where each pair of nodes is independently joined by an edge with probability only depending on the community membership of the endpoints.
It is natural to consider the community detection problem for higher-order relations. A number of authors have already considered problems of learning from complex relational data [ALZM05, Gov05, ABB06], which has several applications such as folksonomy [GZCN09, ZGC09], computer vision [Gov05], and network alignment problems for protein-protein interactions [MN12].
We consider a version of SBM for higher-order relations, which we call the stochastic block model for -uniform hypergraph (-HSBM): we observe a random -uniform hypergraph such that each set of nodes of size appears independently as an (hyper-)edge with probability only depending on the community labels of nodes in it. -HSBM was first introduced in [GD14] and investigated for its statistical limit in terms of detection [LML17], the minimax misclassification ratio [LCW17, CLW18], and as a testbed for algorithms including naive spectral method [GD15a, GD15b, GD17], spectral method along with local refinements [Abb18, CLW18, ALS18] and approximate-message passing algorithms [ACKZ15, LML17].
We focus on exact recovery, where our goal is to fully recover the community labels of the nodes from a random -uniform hypergraph drawn from the model. For exact recovery, the maximum a posteriori (MAP) estimator outperforms any other estimator in the sense that it has the highest probability of correctly recovering the solution. We prove that for the -HSBM with two equal-sized and symmetric communities, exact recovery shows a sharp phase transition behavior, and moreover, the threshold can be characterized by the success of a certain type of local refinement. This type of phenomenon was mentioned as “local-to-global amplification” in [Abb18], and was proved in [ABH16] for the usual SBM with two symmetric communities (which corresponds to -HSBM) and more generally in [AS15a] for SBMs with a fixed number of communities. Our result can be regarded as a direct generalization of [AS15a] to -uniform hypergraphs.
Furthermore, we analyze a certain convex relaxation technique for the -HSBM. We consider an algorithm which uses a semidefinite relaxation, based on the “truncate-and-relax” idea in our previous work [KBG17]. We prove that our algorithm guarantees exact recovery with high probability in a parameter regime which is orderwise optimal.
We remark that in [Abb18] it was suggested that local refinement methods together with an efficient partial recovery algorithm would imply efficient exact recovery up to the information-theoretic threshold. An explicit algorithm exploiting this idea appears in [CLW18, ALS18] with a provable threshold for their algorithm to be successful. We note that the threshold of the algorithm of [CLW18] matches the statistical threshold we derive, hence there is no gap between statistical and computational thresholds. On the other hand, we prove that our SDP-based algorithm does not achieve the statistical threshold when .
1.1 The Stochastic Block Model for graphs: An overview
Before we discuss the main topic of the paper, we start by discussing the usual stochastic block model to motivate our work.
The stochastic block model (SBM) has been one of the most fruitful research topics in community detection and clustering. One benefit is that, because it is a generative model, we can formally study the probability of inferring the ground truth. While real-world data can behave differently, the SBM is believed to provide good insights into community detection and has been studied for its sharp phase transition behavior [MNS13, AS15a, ABH16], computational vs. information-theoretic gaps [CX16, AS15c], and as a test bed for various algorithms such as spectral methods [Mas14, Vu14], semidefinite programs [ABH16, HWX16, JMRT16], belief-propagation methods [DKMZ11, AS15b, AS16], and approximate message-passing algorithms [VSMGA14, CLR16, DAM16, LKZ17]. We recommend [Abb18] for a survey of this topic.
For the sake of exposition, let us consider the symmetric SBM with two equal-sized clusters, also known as the planted bisection model. Let be a positive integer, and let and be real numbers in . The planted bisection model with parameters , and is a generative model which outputs a random graph on vertices such that (i) the bipartition of defining two equal-sized clusters is chosen uniformly at random, and (ii) each pair in is connected independently with probability if and are in the same cluster, or probability otherwise. Note that this model coincides with the Erdős-Rényi random graph model when and are equal.
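The two-step sampling procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_planted_bisection`, the argument names `n`, `p`, `q`, and the encoding of the two clusters as labels +1 and -1 are assumptions made for the example, not notation fixed by the text.

```python
import random
from itertools import combinations

def sample_planted_bisection(n, p, q, seed=None):
    """Sample (labels, edges) on vertices 0..n-1 (hypothetical helper).

    labels[v] is +1 or -1; p is the in-cluster edge probability and
    q the cross-cluster edge probability.
    """
    if n % 2 != 0:
        raise ValueError("n must be even so that the clusters have equal size")
    rng = random.Random(seed)
    # (i) choose a balanced labelling uniformly at random
    labels = [1] * (n // 2) + [-1] * (n // 2)
    rng.shuffle(labels)
    # (ii) join each pair independently with probability p or q
    edges = set()
    for i, j in combinations(range(n), 2):
        prob = p if labels[i] == labels[j] else q
        if rng.random() < prob:
            edges.add((i, j))
    return labels, edges
```

Setting `p == q` makes the labels irrelevant to the edge distribution, which is the Erdős-Rényi case mentioned above.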
The goal is to find the ground truth either approximately or exactly, given a sampled graph . We may ask the following questions regarding the quality of the solution.
(Exact recovery) When can we find exactly (up to symmetry) with high probability?
(Almost exact recovery) Can we find a bipartition such that only a vanishing portion of the vertices is mislabeled?
(Detection) Can we find a bipartition such that the portion of mislabeled vertices is less than for some positive constant ?
There are a number of works regarding these questions from the algorithmic point of view or in the sense of statistical achievability. The following is a short list of state-of-the-art works regarding the model:
We further note that those sharp phase transition behaviours and algorithms achieving the threshold are found for general stochastic block models [AS15c, Abb18], [LML17, LKZ17]. This paper focuses on exact recovery.
1.2 The Stochastic Block Model for hypergraphs
The stochastic block model for hypergraphs (HSBM) is a natural generalization of the SBM for graphs which was first introduced in [GD14]. Informally, the HSBM can be thought of as a generative model which returns a hypergraph with unknown clusters, and each hyperedge appears in the hypergraph independently with a probability depending on the community labels of the vertices involved in the hyperedge.
In [GD14], the authors consider the HSBM under the setting that the hypergraph generated by the model is -uniform and dense. They consider a spectral algorithm on a version of the hypergraph Laplacian, and prove that the algorithm exactly recovers the partition for with probability . Subsequently, the same authors extended their results to the sparse, non-uniform (but bounded order) setting, studying partial recovery [GD15b, GD15a, GD17].
We note that sparsity is an important factor to address in recovery problems of different types: exact recovery, almost exact recovery, and detection. In the case of the SBM for graphs, we recall that the average degree must be to assure exact recovery, and the average degree must be to assure detection. Conversely, the point of the sharp phase transition lies exactly in those regimes. We may expect similar behaviour for the -uniform HSBM. For exact recovery, it was confirmed that the phase transition occurs in the regime of logarithmic average degree, by analyzing the optimal minimax risk of -uniform HSBM [LCW17, CLW18]. For detection, the phase transition occurs in the regime of constant average degree [ACKZ15]. The authors of [ACKZ15] proposed a conjecture specifying the exact threshold point, based on the performance of belief-propagation algorithm. Also, such results for the weighted HSBM were independently proved in [ALS18]
and an exact threshold of the censored block model for uniform hypergraphs was classified in [ALS16].
In this paper, we consider a specific -uniform HSBM with two equal-sized clusters. Let us remark that in the SBM case, we had two parameters and where the probability that an edge appears in the graph is or depending on whether and are in the same cluster or not. For a hyperedge of size greater than 2, there are different ways to generalize this notion, but we will focus on a simple model in which the probability that a set of size appears as a hyperedge depends on whether is completely contained in a cluster or not.
Let be a positive even number and let be the set of vertices of the hypergraph . Let be an integer. Let and be numbers between 0 and 1, possibly depending on . We denote the collection of size subsets of by . The -HSBM with parameters , , and , denoted , is a model which samples a -uniform hypergraph on the vertex set according to the following rules.
is a vector in chosen uniformly at random among those with an equal number of ’s and ’s. We may think of and as community labels.
Each in appears independently as a hyperedge with probability
We say is in-cluster with respect to for the first case, and cross-cluster w.r.t. for the other case.
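As a rough illustration of the model just defined, the following Python sketch samples a uniform hypergraph with two planted clusters. The names `sample_hsbm`, `n`, `d`, `p`, `q` and the +1/-1 label encoding are assumptions of this example. It also assumes that in-cluster sets receive probability `p` and cross-cluster sets receive probability `q`, mirroring the graph case.

```python
import random
from itertools import combinations

def sample_hsbm(n, d, p, q, seed=None):
    """Sample (labels, hyperedges) for a d-uniform hypergraph on 0..n-1.

    Hypothetical helper: a d-subset lying entirely inside one cluster
    appears with probability p, every other d-subset with probability q.
    """
    if n % 2 != 0:
        raise ValueError("n must be even")
    rng = random.Random(seed)
    # balanced labelling chosen uniformly at random
    labels = [1] * (n // 2) + [-1] * (n // 2)
    rng.shuffle(labels)
    hyperedges = set()
    for e in combinations(range(n), d):
        # e is in-cluster when all of its vertices share one label
        in_cluster = len({labels[v] for v in e}) == 1
        if rng.random() < (p if in_cluster else q):
            hyperedges.add(e)
    return labels, hyperedges
```

The loop enumerates every size-`d` subset, so the sketch is only practical for small `n`; it is meant to make the definition concrete rather than to be efficient.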
Our goal is to find the clusters from a given hypergraph generated from the model. We specially focus on exact recovery, formally defined as follows.
We say exact recovery in is possible if there exists an estimator which only fails to recover up to a global sign flip with vanishing probability, i.e.,
On the other hand, we say exact recovery in is impossible if any estimator fails to recover up to a global sign flip with probability , i.e.,
We remark that must be connected for exact recovery to be successful. In the Erdős-Rényi (ER) model for random hypergraphs, it is known that a random hypergraph from the ER model is connected with high probability only if the expected average degree is at least for some . (Footnote: The proof of this result is a direct adaptation of the proof in [Bol98] for , i.e., the random graph model. See [COMS07, BCOK10, CKK15] for phase transitions regarding giant components, which justify the regime for partial recovery and detection.) Together with the works in [ALS18] and [CLW18], this motivates us to work in the parameter regime where
for some constant and .
1.3 Main results
We first establish a sharp phase transition behaviour for exact recovery in the stochastic block model for -uniform hypergraphs. We will assume that the parameter is a fixed positive integer not depending on , and edge probabilities decay as
where and are fixed positive constants. Asymptotics in this paper are based on growing to infinity, unless noted otherwise.
Exact recovery in is possible if , and impossible if where .
In the case of exact recovery, the maximum a posteriori (MAP) estimator achieves the minimum error probability. The MAP estimator corresponds to the maximum-likelihood (ML) estimator in this model since the partition is chosen from a uniform distribution. Hence, it is sufficient to analyze the performance of the ML estimator to prove Theorem 1.
On the other hand, we ask whether there exists an efficient algorithm which recovers the hidden partition and achieves the information-theoretic threshold. Note that the ML estimator (which achieves the minimum error probability) is given by
This is in general hard to compute. For example, when and , it reduces to finding a balanced bipartition with the minimum number of edges crossing given a graph , also known as the MIN-BISECTION problem, which is NP-hard. However, there is a simple and efficient algorithm which works up to the threshold of the ML estimator in the case of . This algorithm is based on a standard semidefinite relaxation of MIN-BISECTION [GW95].
For general -HSBM, we propose an efficient algorithm using a “truncate-and-relax” strategy. Given a -uniform hypergraph on the vertex set , let us define a weighted graph on the same vertex set where the weights are given by
for each . Let be an optimal solution of
which is equivalent to finding the min-bisection of the weighted graph . Now, consider the following semidefinite program:
This program is a relaxation of the min-bisection problem above, since any feasible in the original problem corresponds to a feasible solution in the relaxed problem.
The ML estimator attempts to maximize the function
over the vectors in the hypercube with an equal number of ’s and ’s. We can write as a multilinear polynomial in , since for all . Let be the quadratic part of . Then, maximizing is equivalent to finding the min-bisection of . This justifies our term truncate-and-relax, as in our previous work [KBG17].
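To make the "truncate" step concrete, here is a small Python sketch that builds a weighted graph from a hypergraph and finds its min-bisection by exhaustive search. It rests on a labeled assumption: the weight of a pair is taken to be the number of hyperedges containing both vertices. The helper names `pair_weights` and `min_bisection` are hypothetical, and the brute-force search stands in for the semidefinite relaxation only on toy instances.

```python
from collections import Counter
from itertools import combinations

def pair_weights(hyperedges):
    """Assumed weighting: w[(i, j)] = number of hyperedges containing i and j."""
    weights = Counter()
    for e in hyperedges:
        for i, j in combinations(sorted(e), 2):
            weights[(i, j)] += 1
    return weights

def min_bisection(n, weights):
    """Exhaustive min-bisection on vertices 0..n-1 (n even, small).

    Vertex 0 is fixed on the +1 side to remove the global sign symmetry.
    Returns (cut_weight, labels) for the first minimizer found.
    """
    best_cut, best_labels = None, None
    for side in combinations(range(1, n), n // 2 - 1):
        plus = {0, *side}
        labels = [1 if v in plus else -1 for v in range(n)]
        # total weight of pairs whose endpoints lie on different sides
        cut = sum(w for (i, j), w in weights.items() if labels[i] != labels[j])
        if best_cut is None or cut < best_cut:
            best_cut, best_labels = cut, labels
    return best_cut, best_labels
```

For instance, with the hyperedges `{(0, 1, 2), (3, 4, 5), (0, 1, 3)}` on six vertices, the pair `(0, 1)` receives weight 2 and the search returns the bisection separating `{0, 1, 2}` from `{3, 4, 5}`.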
Now, let be the solution of (1.1). We prove that this estimator correctly recovers the hidden partition with high probability up to a threshold which is order-wise optimal.
Suppose . Then is equal to with probability if where
It is natural to ask whether this analysis is tight. The proof proceeds by constructing a dual solution which certifies that is the unique optimum of (1.1) with high probability. Following [Ban16], the dual solution (if it exists) is completely determined by which has the form of a “Laplacian” matrix. Precisely, the major part of the proof is devoted to proving that the matrix of size with entries
is positive-semidefinite with high probability. We use the Matrix Bernstein inequality to prove that the fluctuation
is smaller than the minimum eigenvalue of w.h.p., under the assumption . However, we believe that it can be improved by a direct analysis of . Numerical simulations and discussions which support our belief can be found in Section 5.
Finally, we complement Theorem 2 by providing a lower bound of the truncate-and-relax algorithm. Recall that the algorithm tries to find a solution in the relaxed problem (1.1). It implies that if the min-bisection of is not the correct partition , then the truncate-and-relax algorithm will also return a solution which is not equal to . Hence, we have
We find a sharp threshold for the estimator recovering or successfully.
Suppose . Let be defined as follows:
If , then is equal to neither nor with probability . On the other hand, if , then is either of or with probability .
We note that
hence for any . Figure 1 shows the relations between , and for .
Theorem 3 and the discussion above imply that the truncate-and-relax algorithm fails with probability if . We conjecture that this is the correct threshold of the performance of the algorithm. In future work, we will attempt to prove this conjecture by improving the matrix concentration bound as discussed above.
If , then with probability .
2 Maximum-likelihood estimator
Recall that is a maximizer of the likelihood probability (ties are broken arbitrarily). Let for .
For brevity, let us first introduce a few notations. Let . Let be a vector in where
for each . Let be a -uniform hypergraph on the vertex set with the edge set . Let be the vector in such that
for each . Note that
Hence, is equal to the number of in-cluster edges in with respect to the partition .
The ML estimator tries to find the “best” partition with an equal number of ’s and ’s. Intuitively, if , i.e., in-cluster edges are more likely to appear than cross-cluster edges (we call this case assortative), then the best partition will correspond to which maximizes the number of in-cluster edges w.r.t. . On the other hand, if (we call this case disassortative) then the best partition will correspond to the minimizer. The following proposition confirms this intuition. We defer the proof to Section B.1 in the appendix.
The ML estimator is the maximizer (minimizer, respectively) of if (if , respectively) over all such that .
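On very small instances the proposition can be checked directly by enumerating all balanced labellings and counting in-cluster hyperedges. The following Python sketch does this; the helper names `in_cluster_count` and `ml_estimate` and the +1/-1 label encoding are assumptions of the example, and the exhaustive search is exponential in the number of vertices.

```python
from itertools import combinations

def in_cluster_count(labels, hyperedges):
    """Number of hyperedges whose vertices all carry the same label."""
    return sum(1 for e in hyperedges if len({labels[v] for v in e}) == 1)

def ml_estimate(n, hyperedges, assortative=True):
    """Brute-force estimator over balanced labellings of 0..n-1 (n even).

    Maximizes the in-cluster count in the assortative case and minimizes
    it in the disassortative case. Vertex 0 is fixed to +1, so the result
    is defined up to a global sign flip; ties go to the first candidate.
    """
    candidates = []
    for side in combinations(range(1, n), n // 2 - 1):
        plus = {0, *side}
        candidates.append([1 if v in plus else -1 for v in range(n)])
    pick = max if assortative else min
    return pick(candidates, key=lambda s: in_cluster_count(s, hyperedges))
```

With hyperedges `{(0, 1, 2), (3, 4, 5), (0, 1, 3)}` on six vertices, the assortative estimate is the labelling that puts `{0, 1, 2}` in one cluster, since it is the only balanced labelling with two in-cluster hyperedges.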
3 Sharp phase transition in
Informally, we are going to argue that the event that the ground truth is the best guess (i.e., is the global optimum of the likelihood function) can be approximately decomposed into the events that is unimprovable by flipping the label of for . This type of phenomenon was called local-to-global amplification in [Abb18], and it seems to hold for more general classes of graphical models.
Let be the probability that the ML estimator fails to recover the hidden partition, i.e.,
As we have seen in the previous section, the ML estimator is a maximizer of over the choices of such that . Thus, is equal to the probability that happens for some satisfying the balance condition .
3.1 Lower bound
We first prove the impossibility part of Theorem 1. For concreteness, we focus on the assortative case, i.e., but the proof can be easily adapted for the disassortative case.
Before we prove the lower bound, let us consider the usual stochastic block model for graphs which corresponds to in order to explain the intuition of the proof. Given a sample , partition and a vertex , let us define the in-degree of as
and the out-degree of as
We will omit the subscript if the context is clear.
Suppose that there are vertices and from different clusters such that the in-degree of each vertex is smaller than the out-degree of each vertex. In this case, swapping the label of and will yield a new balanced partition with greater number of in-cluster edges, hence the ML estimator will fail to recover . Now, suppose that
for all . If those events were independent, we would get
and vice versa for . It would imply that there is a “bad” pair with probability , hence the ML estimator fails with probability . We remark that this argument is not mathematically rigorous because the in-degrees of vertices and (as well as their out-degrees) are not independent, as they share a variable indicating whether is an edge or not. However, we can overcome this by conditioning on a highly probable event which makes those events independent, as in [ABH16].
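The swapping intuition for the graph case can be illustrated with a toy example in Python. The sketch assumes that the in-degree of a vertex counts its neighbours in its own cluster and the out-degree counts its neighbours in the other cluster; the helper names and the six-vertex example graph are hypothetical.

```python
def in_out_degree(v, labels, edges):
    """Return (in_degree, out_degree) of v under the given labelling."""
    in_deg = out_deg = 0
    for i, j in edges:
        if v in (i, j):
            u = j if i == v else i  # the other endpoint of the edge
            if labels[u] == labels[v]:
                in_deg += 1
            else:
                out_deg += 1
    return in_deg, out_deg

def in_cluster_edges(labels, edges):
    """Number of edges joining two vertices with the same label."""
    return sum(1 for i, j in edges if labels[i] == labels[j])

def swap_labels(labels, u, v):
    """Copy of labels with the labels of u and v exchanged."""
    swapped = list(labels)
    swapped[u], swapped[v] = swapped[v], swapped[u]
    return swapped

# Toy instance: vertices 0 and 3 lie in different clusters, are not
# adjacent, and each has more neighbours across the cut than inside.
labels = [1, 1, 1, -1, -1, -1]
edges = {(0, 1), (0, 4), (0, 5), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)}
before = in_cluster_edges(labels, edges)
after = in_cluster_edges(swap_labels(labels, 0, 3), edges)
```

Here both vertices have in-degree 1 and out-degree 2, and exchanging their labels raises the number of in-cluster edges from 4 to 6, so the planted labelling is no longer the maximizer.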
We extend the definitions of in-degree and out-degree for the -HSBM as
Observe that they coincide with the corresponding definition for the usual SBM (). We note that the sum of in-degree and out-degree is not equal to the degree of , the number of hyperedges in containing when . We extended those definitions in this way because any edge which is neither in-cluster nor cross-cluster but is in-cluster does not contribute on when we flip the sign of the label of .
Now, note that the in-degree and the out-degree of
are independent binomial random variables with different parameters. To estimate the probability
we provide a tight estimate for the tail probability of a weighted sum of independent binomial variables in Section A. Precisely we prove that
as long as vanishes as grows, where
As we discussed, if then the tail probability is of order and it implies that the ML estimator fails with probability .
Let . If , then .
Let and . For and , let us define to be the vector obtained by flipping the signs of and . By definition, is balanced. We are going to prove that with high probability there exist and such that . For simplicity, let and .
For , let be the event such that
holds. Then, implies that . Hence
We recall that if for were mutually independent, we could exactly express the right-hand side as
but unfortunately it is not the case. To see this, let us fix and . Then, we have
They share variables for satisfying . The expected contribution of those variables is , so we may expect
In a similar spirit, we are going to prove that for an appropriate choice of , the events and are approximately independent, so
Together with the tight estimate on , it would give us a good lower bound on .
Let be a set of size where . We will choose later to be a poly-logarithmically decaying function in . Let be the set of such that contains at least two vertices in . We would like to condition on the values of , which capture all dependencies occurring among ’s for .
Let be a positive number depending on which we will choose later, and let be the event that the inequality
holds. For each , let be the event that the inequality
is satisfied. We claim that . This follows from a direct calculation: if we assume , then
Note that only depends on the set of variables , which are mutually disjoint for . Also, is disjoint with any of those sets of variables. Hence, events and are mutually independent, and we get
We claim that
for an appropriate choice of and . This immediately implies that as desired.
Let us first prove . Let be the random variable defined as
for . We have
by a union bound. Note that
Using a standard Chernoff bound, we get the following lemma. For completeness, we include the proof in the appendix (Section A.1).
Let be a sum of independent Bernoulli variables such that where . Let be a positive number which decays to 0 as grows, with . Then,
Letting and , we get
and so .
Now, we would like to prove that
by showing that
for any . This implies that
and since we assumed that and , we get
and similarly as desired.
To estimate the probability that happens, let and be random variables defined as
Recall that is the event that holds.
Let be a binomial random variable from and be a binomial random variable from where , and . Let and let be a positive number vanishing as grows. Then,
3.2 Upper bound
We are going to use a union bound to prove the upper bound. Let and be vectors in . The Hamming distance between and (denoted ) is defined as the number of such that . Note that if and are balanced, then
hence is even.
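The parity claim is easy to confirm by enumeration for small sizes. The Python sketch below lists all balanced vectors, computes Hamming distances, and tallies how many balanced vectors sit at each distance from a fixed reference vector. The function names and the +1/-1 encoding are assumptions of the example.

```python
from collections import Counter
from itertools import combinations

def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def balanced_vectors(n):
    """Yield every +1/-1 vector of even length n with as many +1's as -1's."""
    for plus in combinations(range(n), n // 2):
        chosen = set(plus)
        yield tuple(1 if v in chosen else -1 for v in range(n))

def distance_histogram(n):
    """Tally Hamming distances from the first balanced vector to all of them."""
    vectors = list(balanced_vectors(n))
    reference = vectors[0]
    return Counter(hamming(reference, x) for x in vectors)
```

Only even distances appear in the tally, because moving from one balanced vector to another flips as many +1 coordinates as -1 coordinates.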
Let us fix and let be a -uniform random hypergraph generated by the model under the ground truth . We note that the distribution of the random variable is invariant under the permutation of preserving , hence it only depends on . Hence, there is a quantity which satisfies
for any with . Moreover, since our model is invariant under a global sign flip.
Recall that the ML estimator fails to recover if and only if
for some balanced which is neither nor . We remark that we count the equality as a failure, which will only make larger. By a union bound, we have
We note that there is a one-to-one correspondence between a balanced and a pair of sets where
and we must have since is balanced. Hence, the number of balanced ’s with is equal to . We have
Now, let us formally state the main result of this section.