Clustering is a central problem that arises in many fields of science and engineering. Among the many related problems, community recovery in graphs has received considerable attention, with applications in numerous domains such as social networks [3, 4, 5], computational biology, and machine learning [7, 8]. The goal is to cluster data points into communities based on pairwise information. Among the variety of models for community recovery, the stochastic block model (SBM) and the censored block model (CBM) have received significant attention in recent years. In the SBM, two data points in the same community are more likely to be connected by an edge than two data points in different communities. In the CBM, each measurement returns the modulo-2 sum of the values assigned to the two nodes, possibly corrupted by Bernoulli noise.
While these models capture interactions between pairs of nodes, there are numerous applications in which interactions involve more than two nodes. One such application is a folksonomy, a social network in which users can annotate items with different tags. In this application, the graph consists of nodes corresponding to users, items, and tags. When a user annotates an item with a tag, one can view this as a hyperedge connecting the corresponding user node, item node, and tag node. Therefore, to cluster the nodes of such a graph, one needs a model that can capture such three-way interactions. Another application is molecular biology, in which multi-way interactions between distinct systems capture complex molecular interactions. There is also a broad range of applications in other domains, including computer vision, VLSI circuits, and categorical databases.
These applications naturally motivate us to investigate a hypergraph setting in which measurements carry multi-way information. Specifically, we consider a simple yet practically relevant model, which we name the generalized censored block model (GCBM). In the GCBM, the data points are modeled as nodes in a hypergraph, and their interactions are encoded as hyperedges between the nodes. As an initial effort, we focus on a simple setting in which there are two communities, with each node taking the value 0 or 1 depending on its affiliation. More concretely, we consider a random uniform hypergraph in which each hyperedge, connecting a fixed number of nodes, exists with a given probability and takes a function of the values assigned to its nodes. In this work, inspired by applications in machine learning and channel coding, we study the following two types of measurements:
the homogeneity measurement that reveals whether or not the nodes are in the same community; and
the parity measurement that reveals the modulo-2 sum of the affiliation of the nodes.
Further, we study both the noiseless case and the noisy case.
I-A Main contributions
Specialized to the pairwise (graph) case, the above two measurement models reduce to the CBM, for which the information-theoretic limit on the expected number of edges required for exact recovery has been characterized [16, 17]. On the other hand, the information-theoretic limits for hyperedges of arbitrary size have not been settled. This precisely sets the goal of our paper: we seek to characterize the information-theoretic limits on the sample complexity for exact recovery under the two models. A summary of our findings is as follows. For a fixed constant , the information-theoretic limits are:
(the homogeneity measurement case) if is a fixed constant; and
(the parity measurement case) if is a fixed constant.
For the parity measurement case, we also characterize the information-theoretic limits for a more general setting where can arbitrarily scale with .
(the parity measurement case) if ; and
(the parity measurement case) if .
These results have interesting implications for relevant applications such as subspace clustering and channel coding. In particular, they offer concrete guidelines on how to choose the hyperedge size that minimizes sample complexity while ensuring successful clustering. See Sec. II-A and Sec. III for details.
I-B Related work
I-B1 The standard graph case
The exact recovery problem in standard graphs (hyperedges of size two) has been studied in great generality. In the SBM, both the fundamental limits and computationally efficient algorithms were first investigated for the case of two communities [18, 17, 19], and more recently for the case of an arbitrary number of communities. In the CBM, the sample complexity limit has been characterized, and a computationally efficient algorithm that achieves the limit has been developed.
Another important recovery requirement is detection, which asks whether one can recover the clusters better than a random guess. The modern study of the detection problem in the SBM was initiated by Decelle et al., who conjecture phase transition phenomena for the detection problem. (Footnote 1: That paper also conjectures that an information-computation gap might exist when there are sufficiently many communities. This conjecture has been extensively studied in [22, 23, 24, 25] and was recently settled.) The conjecture was first tackled for the case of two communities. The impossibility of detection below the conjectured threshold has been established, and it is proved in [28, 29, 30] that the conjectured threshold can be achieved efficiently. The conjecture for an arbitrary number of communities was recently settled by Abbe and Sandon. In another line of research, minimax-optimal rates have been derived, and algorithms that achieve these rates have been developed. We refer to a recent survey by Abbe for a more exhaustive account.
I-B2 The homogeneity measurement case
Recently, [34, 35] consider a general model that includes our model as a special case (to be detailed in Sec. II), and provide an upper bound on sample complexity for almost exact recovery, which allows a vanishing fraction of misclassified nodes. Applying their results to our model, their upper bound reduces to . Whether or not the sufficient condition is also necessary has been unknown. In this work, we show that it is not the case, demonstrating that the minimal sample complexity even for exact recovery is .
I-B3 The parity measurement case
The parity measurement case has been explored by  in the context of random constraint satisfaction problems. The case of has been well-studied: it is shown that the maximum likelihood decoder succeeds if . Unlike the prior result which only considers the case of , we cover an arbitrary constant , and characterize the sharp threshold on the sample complexity.
Abbe-Montanari  relate the parity measurement model to a channel coding problem in which random LDGM codes with a constant right-degree are employed. By proving the concentration phenomenon of the mutual information between channel input and output, they demonstrate the existence of phase transition for an even . Our results span any fixed , and hence fully settle the phase transition (see Sec. III).
I-B4 The stochastic block model for hypergraphs
Several works study community recovery under the SBM for hypergraphs. One line of work explores the case of two equal-sized communities. (Footnote 2: The main model in that paper is the bipartite stochastic block model, which is not a hypergraph model. However, the result for the hypergraph case follows as a corollary; see Theorem 5 therein.) Specializing it to our model, one can readily show that detection is possible if . Moreover, phase transition thresholds for detection have recently been conjectured. Lastly, minimax-optimal error rates have been derived for the hypergraph case, generalizing the corresponding results for graphs.
I-B5 Other relevant problems
Community recovery in hypergraphs bears similarities to other inference problems in which the goal is to reconstruct data from multiple queries. These problems include crowdsourced clustering [42, 43], group testing, and data extraction from histogram-type information [45, 46]. Here, one can make a connection to our problem by viewing each query as a hyperedge measurement. However, a distinction lies in the way that queries are collected. For instance, an adaptive measurement model is considered in the crowdsourced setting [42, 43], unlike our non-adaptive setting in which hyperedges are sampled uniformly at random. Histogram-type information acts as a query in [44, 45, 46].
I-C Paper organization
Sec. II introduces the considered model; in Sec. III, our main results are presented along with some implications; in Sec. IV, V and VI, we provide the proofs of the main theorems; Sec. VII presents experimental results that corroborate our theoretical findings and discusses interesting aspects in view of applications; and in Sec. VIII, we conclude the paper with some future research directions.
For any two sequences and : if there exists a positive constant such that ; if there exists a positive constant such that ; if ; if ; and or if there exist positive constants and such that .
For a set and an integer , we denote Let denote . Let be the standard unit vector. Let be the all-zero vector and be the all-one vector. We use to denote an indicator function. Let be the Kullback-Leibler (KL) divergence between and , i.e., . We shall use to indicate the natural logarithm. We use to denote the binary entropy function.
II Generalized censored block models
Consider a collection of nodes, each represented by a binary variable, . Let be the ground-truth vector. Let denote the size of a hyperedge. Samples are obtained as per a measurement hypergraph where . We assume that each element in belongs to independently with probability . Sample complexity is defined as the number of hyperedges in a random measurement hypergraph, which is concentrated around in the limit of . Each sampled edge is associated with a noisy binary measurement :
where is some binary-valued function, denotes modulo-2 sum, and is a random variable with noise rate. For the choice of , we focus on the two cases:
the homogeneity measurement:
the parity measurement:
Let We remark that when , this reduces to CBM .
The goal of this problem is to recover from . In this work, we will focus on the case of even since the case of odd readily follows from the even case. When is even, the conditional distribution of is equal to that of . Hence, given a recovery scheme , the probability of error is defined as
We intend to characterize the minimum sample complexity, above which there exists a recovery algorithm such that as tends to infinity, and under which for all algorithms.
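The measurement model above can be illustrated with a minimal simulation sketch. All names here (`sample_gcbm`, `homogeneity`, `parity`, and the parameters `x`, `d`, `p`, `theta`) are hypothetical and chosen for illustration. The encoding of the homogeneity output (0 when all nodes agree, 1 otherwise) is an assumption, picked so that both measurements coincide with the modulo-2 sum when a hyperedge has two nodes.

```python
import itertools
import random


def homogeneity(bits):
    """Assumed encoding: 0 if all nodes share a community, 1 otherwise."""
    return 0 if len(set(bits)) == 1 else 1


def parity(bits):
    """Modulo-2 sum of the node values."""
    return sum(bits) % 2


def sample_gcbm(x, d, p, theta, f, rng):
    """Sample noisy measurements over a random d-uniform hypergraph.

    x     : list of 0/1 community labels (ground truth)
    d     : hyperedge size
    p     : probability that each size-d subset is sampled, independently
    theta : probability that a measurement is flipped
    f     : measurement function (homogeneity or parity)
    rng   : random.Random instance
    """
    measurements = {}
    for edge in itertools.combinations(range(len(x)), d):
        if rng.random() < p:
            noise = 1 if rng.random() < theta else 0
            measurements[edge] = f([x[i] for i in edge]) ^ noise
    return measurements
```

With an even hyperedge size, flipping every label leaves both kinds of noiseless measurements unchanged, which is why recovery can only be assessed up to a global flip of the labels.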
II-A Relevant applications
II-A1 Subspace clustering and the homogeneity measurement
Subspace clustering is a popular problem whose task is to cluster data points that approximately lie in a union of lower-dimensional affine spaces. The problem arises in a variety of applications such as motion segmentation and face clustering, where data points corresponding to the same class (tracked points on a moving object or faces of a person) lie on a single lower-dimensional subspace; for details, see the references therein. A common procedure in existing algorithms for subspace clustering [37, 50, 51, 52] begins with the construction of a -th order affinity tensor () whose entries represent similarities between every data points. Since this construction incurs a complexity that scales like , sampling-based approaches are proposed in [36, 37, 13].
A similarity between data points in prior works [36, 37, 13] is defined such that it tends to if all of the points are on the same subspace and otherwise. Hence, restricted to the two-subspace case, one can view a similarity over a -tuple as a homogeneity measurement. (Footnote 3: In subspace clustering, similarities can sometimes be noisy in that even though the data points are from the same (different) subspace, similarity can be (). Note that in (1) precisely captures this noise.) By setting the probability of each entry being sampled as , one can relate this to our homogeneity measurement model; see Fig. 1 for visual illustration.
II-A2 Channel coding and the parity measurement
The community recovery problem has an inherent connection with channel coding problems [16, 18]. To see this, consider a communication setting which employs random LDGM codes with a constant right-degree . To make a connection, we begin by constructing a random -uniform hypergraph with nodes, where each edge of size appears with probability . Given the input sequence of information bits, we then concatenate the parity bits with respect to the sampled hyperedges to form a codeword of average length . Note that the expected code rate is . The noisy measurement can be mapped to the output of a binary symmetric channel (BSC) with crossover probability , when fed by the codeword. A recovery algorithm corresponds to the decoder which wishes to infer the information bits from the received signals. One can then see that recovering communities in hypergraphs is equivalent to the above channel coding problem; see Fig. 2 for visual illustration.
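The coding interpretation can be sketched in a few lines. The function names and parameters below (`ldgm_encode`, `bsc`, `expected_code_rate`, `bsc_capacity`, `n`, `d`, `p`, `theta`) are hypothetical. The rate expression, the number of information bits divided by the average number of sampled hyperedges, is our reading of the elided formula and should be treated as an assumption; the BSC capacity is the standard one minus the binary entropy of the crossover probability.

```python
import math


def ldgm_encode(info_bits, hyperedges):
    """One parity bit per sampled hyperedge: the modulo-2 sum of its information bits."""
    return [sum(info_bits[i] for i in edge) % 2 for edge in hyperedges]


def bsc(codeword, theta, rng):
    """Flip each bit independently with probability theta (rng is a random.Random)."""
    return [bit ^ int(rng.random() < theta) for bit in codeword]


def expected_code_rate(n, d, p):
    """Assumed rate: n information bits over an average of p * C(n, d) parity bits."""
    return n / (p * math.comb(n, d))


def bsc_capacity(theta):
    """Capacity of a binary symmetric channel, 1 - H(theta), in bits per channel use."""
    if theta in (0.0, 1.0):
        return 1.0
    entropy = -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)
    return 1.0 - entropy
```

Comparing `expected_code_rate` at a given sampling probability against `bsc_capacity` gives a quick numerical feel for how far a given operating point sits from capacity.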
III Main results
III-A The homogeneity measurement case
Fix and . Under the homogeneity measurement case (),
See Sec. IV.∎
We first make a comparison to the result in . While  models a fairly general similarity measurement, it considers a more relaxed performance metric, so called almost exact recovery, which allows a vanishing fraction of misclassified nodes; and provides a sufficient condition on sample complexity under the setting . On the other hand, we identify the sufficient and necessary condition for exact recovery, thereby characterizing the fundamental limit. Specializing their result to the model of our interest, the sufficient condition in  reads , which comes with an extra factor gap to the optimality.
One interesting observation in Theorem 1 is that the sample complexity limit is proportional to . This suggests that the amount of information that one hyperedge reveals on average is approximately bits. To understand why this is the case, consider a setting in which and an hyperedge is observed. The case of implies , in which there are only two uncertain cases (all zeros and all ones), i.e., the bits of information are revealed. On the other hand, the case of provides much less information as it rules out only two possible cases ( and ) out of possible candidates. This amounts to roughly bits. Since occurs with probability , the amount of information that one hyperedge can carry on average should read about .
Relying on the connection to subspace clustering elaborated in Sec. II-A, one can draw an interesting implication from Theorem 1. The result offers a detailed guideline as to how to choose for sample-efficient subspace clustering. In the case where the measurement quality reflected in is independent of the number of data points involved in a measurement, the limit increases in . In practical applications, however, may depend on . In fact, the quality of a similarity measure can improve as more data points get involved, making decrease as increases. In this case, choosing as small as possible minimizes but may make too large. Hence, there might be a sweet spot on that minimizes the sample complexity. It turns out that this is indeed the case in practice: we identify such an optimal for the motion segmentation application; see Sec. VII-A for details.
III-B The parity measurement case
Fix and . Under the parity measurement case (),
Notice that for a fixed and , the minimum sample complexity is proportional to , hence decreases in unlike the homogeneity measurement case.
In view of the connection made in Sec. II-A, a natural question that arises in the context of channel coding is how far the rate of the random LDGM code is from the capacity of the BSC. The connection immediately helps answer this question. We see from Theorem 2 that the rate of the LDGM code is
This suggests that the code rate increases in . Note that as long as is constant, the rate vanishes, being far from the capacity of the BSC. On the other hand, it is not clear whether the random LDGM code can achieve a non-vanishing code rate, possibly by increasing the value of . To check this, we explore the case where can scale with . By symmetry, it suffices to consider the case . Moreover, to avoid pathological cases where fluctuates as increases, we assume that is a monotone function.
Fix , a monotone function of such that , and . Under the parity measurement case (),
(upper bound) if
(2) (3) (4) (5)
See Sec. VI.∎
To see what these results mean, consider the two cases: and . In the case , the theorem says that for a fixed ,
where and . This suggests that as long as grows asymptotically larger than , we can achieve an order-wise tight sample complexity that is linear in . On the other hand, in the case , the theorem asserts that
This implies that one cannot achieve the linear-order sample complexity if grows slower than . The implication of the above two can be formally stated as follows.
For , reliable recovery is impossible with linear-order sample complexity, while it is possible for .
From this, we see that the random LDGM code can achieve a constant rate as soon as .
IV Proof of Theorem 1
The achievability and converse proofs are streamlined with the help of Lemmas 1 and 2, whose proofs are deferred to Appendix A. For illustrative purposes, we focus on the noisy case and assume that is even. For a vector , we define
Let be the maximum likelihood (ML) decoder. One can easily verify that
where ties are randomly broken.
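For intuition, the ML decoder can be sketched by brute force on a tiny instance. The sketch assumes the standard fact that, under independent flips with noise rate below one half, maximizing the likelihood is equivalent to minimizing the number of measurements that disagree with a candidate labeling; whether the elided expression above has exactly this form is an assumption. The names `ml_decode` and `disagreements` are hypothetical, the homogeneity encoding (0 when all nodes agree) is assumed, and ties are broken by enumeration order here rather than at random.

```python
import itertools


def homogeneity(bits):
    """Assumed encoding: 0 if all nodes share a community, 1 otherwise."""
    return 0 if len(set(bits)) == 1 else 1


def disagreements(x, measurements, f):
    """Number of observed measurements that a candidate labeling x fails to explain."""
    return sum(1 for edge, y in measurements.items()
               if f([x[i] for i in edge]) != y)


def ml_decode(n, measurements, f):
    """Exhaustive search over all 2^n labelings; feasible only for very small n."""
    best, best_cost = None, None
    for candidate in itertools.product((0, 1), repeat=n):
        cost = disagreements(candidate, measurements, f)
        if best_cost is None or cost < best_cost:
            best, best_cost = candidate, cost
    return best
```

Since a labeling and its global flip explain the homogeneity measurements equally well, the output should be compared with the ground truth up to a flip of all labels.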
IV-A Achievability proof
We intend to prove that
under the claimed condition. Let be the ground-truth vector. Without loss of generality, assume that the first coordinates are ’s and the next coordinates are ’s, where .
Let denote the collection of all vectors whose coordinates are different from that of in many positions among the first coordinates and in many positions among the next coordinates. Note that and . Thus, a decoding algorithm is successful if and only if the output . Let . We also define
which is a representative vector of .
Using these notations and the union bound, we get:
where the step () follows from the fact that the ML decoder outputs if .
To compare with , we define the set of distinctive hyperedges, i.e., the set of hyperedges such that :
and . By definition, for , if ; otherwise. Hence, if and only if . This leads to:
To give a tight upper bound on (11), one needs a tight lower bound on the size of the set of distinctive hyperedges, i.e., . It turns out that bounding when requires non-trivial combinatorial counting. Note that this was not the case when since can be exactly computed via simple counting. Indeed, one of our main technical contributions lies in the derivation of tight bounds on , which we detail below.
The number of distinctive hyperedges can be calculated as follows:
Consider a hyperedge such that . That is, the hyperedge is connected only to a subset of the first nodes or only to a subset of the last nodes. That is, or . Consider the first case, i.e., . In order for this hyperedge to be distinctive, i.e., , at least one element of must be in , and at least one element of must be in . Thus, the total number of such distinctive hyperedges is . Similarly, one can count the number of distinctive hyperedges for the case : . By considering the opposite case where and , one can also obtain the remaining two terms, proving the statement. ∎
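The counting argument can be checked numerically on small instances. The sketch below uses hypothetical names (`n` nodes, the first `k` labeled 0, a candidate that flips `i` of the first `k` labels and `j` of the remaining ones, hyperedge size `d`) and treats a hyperedge as distinctive when the homogeneity of its nodes differs between the two labelings. The four-term expression is our own reconstruction from the case analysis in the proof, not necessarily the authors' displayed formula; it is compared against direct enumeration.

```python
import itertools
import math


def is_homogeneous(bits):
    return len(set(bits)) == 1


def count_distinctive_brute(x, y, d):
    """Enumerate all size-d hyperedges whose homogeneity differs under x and y."""
    return sum(
        1 for edge in itertools.combinations(range(len(x)), d)
        if is_homogeneous([x[t] for t in edge]) != is_homogeneous([y[t] for t in edge])
    )


def count_distinctive_formula(n, k, i, j, d):
    """Reconstructed count: one term per class of x and per class of y."""
    c = math.comb  # math.comb(a, d) is 0 whenever d > a
    return (
        c(k, d) - c(i, d) - c(k - i, d)                      # inside x's zero class
        + c(n - k, d) - c(j, d) - c(n - k - j, d)            # inside x's one class
        + c(k - i + j, d) - c(k - i, d) - c(j, d)            # inside y's zero class
        + c(n - k - j + i, d) - c(i, d) - c(n - k - j, d)    # inside y's one class
    )


def make_pair(n, k, i, j):
    """Ground truth with k zeros then n - k ones, and a candidate with i + j flips."""
    x = [0] * k + [1] * (n - k)
    y = list(x)
    for t in range(i):
        y[t] ^= 1
    for t in range(k, k + j):
        y[t] ^= 1
    return x, y
```

A hyperedge that is homogeneous under one labeling becomes distinctive exactly when it contains both flipped and unflipped nodes, which is what each subtracted pair of binomial coefficients removes.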
By symmetry, we see that . Hence,
In order to bound , for a fixed constant , we define the following index sets: and . Then,
where () follows from the hypothesis that and . Then it is easy to show that (18):
where () follows from the fact that .
Now we consider (19). The following lemma gives a tight lower bound on for this case:
For and ,
See Sec. A-A. ∎
where () follows due to , and Lemma 1. A straightforward computation yields , so the claimed condition
Under the claimed condition, we get:
IV-B Converse proof
Let be the collection of -dimensional vectors, each consisting of number of ’s and number of ’s. Moreover, let be the random vector sampled uniformly at random over . For any scheme , by definition of , we see that
Relying on this inequality, our proof strategy is to show that the left-hand side is strictly bounded away from . Note that the infimum in the left-hand side is achieved by :
By letting , we obtain