1.1. Hypergraph clustering
The stochastic block model (SBM) is a generative model for random graphs with community structures which serves as a useful benchmark for the task of recovering community structure from graph data. It is natural to have an analogous model for random hypergraphs as a testing ground for hypergraph clustering algorithms.
1.2. Hypergraph stochastic block models
The hypergraph stochastic block model (HSBM), first introduced in [GD14], is a generalization of the SBM to hypergraphs. We define it as follows for $d$-uniform hypergraphs.
Definition 1.1 (Hypergraph).
A $d$-uniform hypergraph $H$ is a pair $H = (V, E)$, where $V$ is a set of vertices and $E$ is a set of subsets of $V$ of size $d$, called hyperedges.
Definition 1.2 (Hypergraph stochastic block model (HSBM)).
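Let $\mathcal{C} = \{C_1, \dots, C_k\}$ be a partition of the set $[n] = \{1, \dots, n\}$ into $k$ sets of size $s = n/k$ (assume $n$ is divisible by $k$); each $C_i$ is called a cluster. For constants $0 \le q < p \le 1$, we define the $d$-uniform hypergraph SBM as follows:
For any set of $d$ distinct vertices $i_1, \dots, i_d$, generate a hyperedge $\{i_1, \dots, i_d\}$ with probability $p$ if the vertices $i_1, \dots, i_d$ are in the same cluster in $\mathcal{C}$. Otherwise, generate the hyperedge with probability $q$. We denote this distribution of random hypergraphs as $\mathcal{H}(n, d, \mathcal{C}, p, q)$.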
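The sampling procedure in Definition 1.2 can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `sample_hsbm`, the 0-based vertex labels, and the choice of contiguous blocks as clusters are all assumptions made for the example. The parameters stand for the number of vertices (`n`), the hyperedge size (`d`), the number of clusters (`k`), and the in-cluster and cross-cluster hyperedge probabilities (`p`, `q`).

```python
import itertools
import random


def sample_hsbm(n, d, k, p, q, seed=None):
    """Sample a d-uniform hypergraph on vertices 0..n-1 with k equal-size clusters.

    Assumption for illustration: the clusters are the contiguous blocks
    {0..s-1}, {s..2s-1}, ... where s = n // k. The model itself allows any
    partition into equal-size clusters.
    """
    if n % k != 0:
        raise ValueError("n must be divisible by k")
    rng = random.Random(seed)
    s = n // k
    clusters = [set(range(c * s, (c + 1) * s)) for c in range(k)]
    label = [v // s for v in range(n)]

    hyperedges = []
    # Every d-subset of vertices is an independent Bernoulli trial.
    for e in itertools.combinations(range(n), d):
        same_cluster = all(label[v] == label[e[0]] for v in e)
        if rng.random() < (p if same_cluster else q):
            hyperedges.append(e)
    return clusters, hyperedges


clusters, hyperedges = sample_hsbm(n=12, d=3, k=3, p=0.9, q=0.1, seed=0)
```

Enumerating all $d$-subsets takes time proportional to the binomial coefficient "n choose d", so this sketch is only practical for small instances.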
Hypergraphs are closely related to symmetric tensors. We give a definition of symmetric tensors below; see [KB09] for more details on tensors.
Definition 1.3 (Symmetric tensor).
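Let $T \in \mathbb{R}^{n \times \cdots \times n}$ be an order-$d$ tensor. We call $T$ symmetric if $T_{i_1, \dots, i_d} = T_{i_{\sigma(1)}, \dots, i_{\sigma(d)}}$ for any $i_1, \dots, i_d \in [n]$ and any permutation $\sigma$ in the symmetric group of order $d$.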
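Formally, we can use a random symmetric tensor to represent a random hypergraph $H$ drawn from this model. We construct an adjacency tensor $T$ of $H$ as follows. For any distinct vertices $i_1 < i_2 < \dots < i_d$ that are in the same cluster,
$$T_{i_1, \dots, i_d} = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1 - p. \end{cases}$$
For any distinct vertices $i_1 < i_2 < \dots < i_d$, if any two of them are not in the same cluster, we have
$$T_{i_1, \dots, i_d} = \begin{cases} 1 & \text{with probability } q, \\ 0 & \text{with probability } 1 - q. \end{cases}$$
We set $T_{i_1, \dots, i_d} = 0$ if any two of the indices in $\{i_1, \dots, i_d\}$ coincide, and we set $T_{i_{\sigma(1)}, \dots, i_{\sigma(d)}} = T_{i_1, \dots, i_d}$ for any permutation $\sigma$. Hence, $T$ is symmetric, and $T_{i_1, \dots, i_d}$ is determined entirely by the index set $\{i_1, \dots, i_d\}$, regardless of order. If we have two distinct index sets $\{i_1, \dots, i_d\} \ne \{j_1, \dots, j_d\}$, then the random variables $T_{i_1, \dots, i_d}$ and $T_{j_1, \dots, j_d}$ are independent. Furthermore, we may abuse notation and write $T_e$ in place of $T_{i_1, \dots, i_d}$, where $e = \{i_1, \dots, i_d\}$.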
The HSBM recovery problem is to find the ground truth clusters, either approximately or exactly, given a hypergraph sampled from the model. We may ask the following questions about the quality of the solutions; see [Abb18] for further details:
Exact recovery (strong consistency): Find exactly (up to a permutation) with probability .
Almost exact recovery (weak consistency): Find a partition such that portion of the vertices are mislabeled.
Detection: Find a partition such that the portion of mislabeled vertices is for some positive .
For exact recovery with two blocks, it was shown that the phase transition occurs in the regime of logarithmic average degree in [LCW17, CLW18b, CLW18a] by analyzing the minimax risk, and the exact threshold was given in [KBG18] by a generalization of the techniques in [ABH16]. An exact recovery threshold of the censored block model for hypergraphs was characterized in [ALS16]. For detection, [ACKZ15] proposed a belief propagation algorithm and conjectured a threshold point.
Several methods have been considered for exact recovery of HSBMs. In [GD14], the authors used spectral clustering based on the hypergraph’s Laplacian to recover HSBMs that are dense and uniform. Subsequently, they extended their results to sparse, non-uniform hypergraphs, for exact, almost exact and partial recovery [GD15a, GD15b, GD17]. Spectral methods along with local refinements were considered in [CLW18a, ALS16]. A semidefinite programming approach was analyzed in [KBG18].
1.3. This paper
In this paper, we focus on exact recovery. Rather than dealing with sparsity as in [KBG18], we approach the problem from a different direction: we attempt to construct algorithms that succeed on dense hypergraphs (with constant) when the number of blocks increases with . Our algorithm works when , which is believed to be the barrier for exact recovery in the dense graph case with clusters of equal size [CSX14, Ame14, OH11]. To the best of our knowledge, our algorithm is the first to guarantee exact recovery when the number of clusters . In addition, in contrast to [KBG18, CLW18a, ALS18, GD17], our algorithm is purely spectral. While we focus on the dense case, our algorithm can be adapted to the sparse case as well; however, it does not perform as well as previously known algorithms [KBG18, LCW17, CLW18b, CLW18a, GD17] in the sparse regime.
Our main result is the following:
Let be constant. For sufficiently large , there exists a deterministic, polynomial time algorithm which exactly recovers -uniform HSBMs with probability if .
See Theorem 3.1 below for the precise statement.
Our algorithm compares favorably with other known algorithms for HSBM recovery in the dense case with clusters; see Section 11. It is based on the iterated projection technique developed in [CFR17, Col18] for the graph case. We apply this approach to the adjacency matrix $A$ of the random hypergraph $H$, that is, the symmetric matrix whose $(i,j)$-th entry is the number of hyperedges containing both $i$ and $j$ (or 0 if $i = j$). In the process, we prove a spectral norm concentration result (Theorem 5.4), which may be of independent interest.
2. Simple counting algorithm
Before we introduce our spectral algorithm for recovering HSBMs, let us observe that one can recover HSBMs by simply counting the number of hyperedges containing pairs of vertices: with high probability, pairs of vertices in the same cluster will be contained in more hyperedges than pairs in different clusters. However, we will see that our spectral algorithm provides better performance guarantees than this simple counting algorithm.
Let be sampled from , where and for . Then Algorithm 1 recovers with probability if
For each , is the sum of independent Bernoulli random variables of expectation either or . Thus, it follows from a straightforward application of Hoeffding’s inequality that
with probability if and are in the same cluster and
with probability if and are in different clusters. Taking a union bound over all pairs, these bounds hold for all pairs with probability . Thus, as long as the lower bound in (2.1) is greater than the upper bound in (2.2), for each the vertices with highest will be the other vertices in ’s cluster. ∎
In particular, if we bound the binomial coefficient by , we see that
are both sufficient conditions for recovery, where and are absolute constants.
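The counting idea of this section can be illustrated with a short sketch. Algorithm 1 itself is not reproduced here, so the code below is an assumed variant rather than the paper's procedure: for a vertex `u`, it groups `u` with the `s - 1` remaining vertices that share the most hyperedges with it, then removes that group and repeats. The names `pair_counts` and `recover_by_counting` are hypothetical, and `s` denotes the common cluster size.

```python
import itertools
from collections import Counter


def pair_counts(hyperedges):
    """Count, for every unordered pair {u, v}, the hyperedges containing both."""
    counts = Counter()
    for e in hyperedges:
        for u, v in itertools.combinations(sorted(e), 2):
            counts[(u, v)] += 1
    return counts


def recover_by_counting(n, s, hyperedges):
    """Greedy sketch: group each vertex with its s - 1 highest-count partners.

    Assumption: ties are broken by vertex index, and already-recovered
    vertices are excluded from later rounds.
    """
    counts = pair_counts(hyperedges)

    def shared(u, v):
        return counts[(min(u, v), max(u, v))]  # a Counter returns 0 if absent

    unassigned = set(range(n))
    clusters = []
    while unassigned:
        u = min(unassigned)
        others = sorted(unassigned - {u}, key=lambda v: (-shared(u, v), v))
        cluster = {u, *others[: s - 1]}
        clusters.append(cluster)
        unassigned -= cluster
    return clusters


# Noise-free example: every triple inside a cluster is a hyperedge, none across.
inside = [e for block in ([0, 1, 2, 3], [4, 5, 6, 7])
          for e in itertools.combinations(block, 3)]
recovered = recover_by_counting(n=8, s=4, hyperedges=inside)
```

In the noise-free example every same-cluster pair has a strictly larger count than every cross-cluster pair, which is exactly the separation the Hoeffding argument above establishes with high probability.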
3. Spectral algorithm and main results
Our main result is that Algorithm 2 below recovers HSBMs with high probability, given certain conditions on and . It is an adaptation of the iterated projection algorithm for the graph case [CFR17, Col18]. The algorithm can be broken down into three main parts:
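Construct an “approximate cluster”, i.e. a set with small symmetric difference with one of the clusters, using the dominant eigenspace of the adjacency matrix (Step 4).
Recover that cluster exactly by counting hyperedges (Step 6).
Delete the recovered cluster and recurse on the remaining vertices (Step 7).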
The remainder of this paper is devoted to proving correctness of this algorithm.
Let be sampled from , where and are constant, and for . If and
then for sufficiently large , Algorithm 2 exactly recovers with probability .
Sections 4-6 introduce the linear algebra tools necessary for the proof; Section 7 shows that Step 4 with high probability produces a set with small symmetric difference with one of the clusters; Section 8 proves that Step 6 with high probability recovers one of the clusters exactly; and Section 9 proves inductively that the algorithm with high probability recovers all clusters.
3.1. Performance guarantees
Observe that if the numerator is less than the denominator in (3.1), then we can upper bound the left hand side by
Thus, Theorem 3.1 guarantees that we can recover w.h.p. if
(this is a slightly stronger condition than (3.1)). Recall that for nonnegative integers we can bound the binomial coefficient by . Using these bounds and solving for , we get the following as a corollary to Theorem 3.1:
Theorem 3.2 (Dense case).
Let be sampled from , where and , and are constant, and for . If
then Algorithm 2 recovers w.h.p., where is an absolute constant.
Note that Theorem 3.1 requires and to be constant, but is allowed to vary with . However, we want the failure probability to be , so we require . In particular, this is satisfied if is constant.
3.2. Running time
In contrast to the graph case, in which the most expensive step is constructing the projection operator (which can be done in time via truncated SVD [GVL96, Gu15]), for the running time of Algorithm 2 is dominated by constructing the adjacency matrix , which takes time (the same amount of time it takes to simply read the input hypergraph). Thus, the overall running time of Algorithm 2 is .
4. Reduction to random matrices
Working with random tensors is hard [HL13], since we do not have as many linear algebra and probability tools as for random matrices. It would be convenient if we could work with matrices instead of tensors. We propose to analyze the following adjacency matrix of a hypergraph, originally defined in [FL96].
Definition 4.1 (Adjacency matrix).
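Let $H$ be a random hypergraph generated from $\mathcal{H}(n, d, \mathcal{C}, p, q)$ and let $T$ be the adjacency tensor of $H$. For any hyperedge $e = \{i_1, \dots, i_d\}$, let $T_e$ be the entry in $T$ corresponding to $T_{i_1, \dots, i_d}$. We define the adjacency matrix $A$ of $H$ by
$$A_{ij} = \begin{cases} \sum_{e \,:\, \{i,j\} \subseteq e} T_e & \text{if } i \ne j, \\ 0 & \text{if } i = j, \end{cases} \tag{4.1}$$
where the sum ranges over all $d$-element subsets $e$ of $[n]$ that contain both $i$ and $j$.
Thus, $A_{ij}$ is the number of hyperedges in $H$ that contain both vertices $i$ and $j$. Note that in the summation (4.1), each hyperedge is counted once.
From our definition, $A$ is symmetric, and $A_{ii} = 0$ for $1 \le i \le n$. However, the entries in $A$ are not independent. This presents some difficulty, but we can still get information about the clusters from this adjacency matrix $A$.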
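As a concrete illustration of Definition 4.1, the following sketch builds the adjacency matrix from a list of hyperedges. It is a minimal example under stated assumptions: vertices are labeled 0 to n-1, each hyperedge appears once in the input as a tuple of distinct vertices, and the function name `adjacency_matrix` is hypothetical.

```python
import itertools


def adjacency_matrix(n, hyperedges):
    """Return the n x n matrix whose (i, j) entry counts hyperedges containing i and j.

    Assumes each hyperedge is listed exactly once and has distinct vertices,
    so the diagonal stays zero.
    """
    A = [[0] * n for _ in range(n)]
    for e in hyperedges:
        # Each hyperedge contributes 1 to every pair of its vertices.
        for i, j in itertools.combinations(e, 2):
            A[i][j] += 1
            A[j][i] += 1
    return A


A = adjacency_matrix(4, [(0, 1, 2), (0, 1, 3), (1, 2, 3)])
```

Here vertices 0 and 1 share two hyperedges, so `A[0][1]` is 2. Entries in the same row are dependent because a single hyperedge contributes to several of them at once, which is the lack of independence noted above.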
5. Eigenvalues and concentration of spectral norms
It is easy to see that for ,
then is a symmetric matrix of rank . We have the following eigenvalues for . Note that we are using the convention for a self-adjoint matrix .
The eigenvalues of are
Hence, by shifting, we have the following eigenvalues for .
The eigenvalues of are
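We can use an $\varepsilon$-net chaining argument to prove the following concentration inequality for the spectral norm of .
Definition 5.3 ($\varepsilon$-net).
An $\varepsilon$-net for a compact metric space $(\mathcal{X}, \rho)$ is a finite subset $\mathcal{N}$ of $\mathcal{X}$ such that for each point $x \in \mathcal{X}$, there is a point $y \in \mathcal{N}$ with $\rho(x, y) \le \varepsilon$.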
Let be the spectral norm of a matrix, we have
with probability at least .
Consider the centered matrix , then each entry is a centered random variable. Let . Let be the unit sphere in (using the -norm). By the definition of the spectral norm,
Let be an -net on . Then for any , there exists some such that . Then we have
For any , if we take the supremum over , we have
Now we fix an first, and prove a concentration inequality for . Let be the hyperedge set in a complete -uniform hypergraph on . We have
Let , then
where are independent. Note that , so we have
By Hoeffding’s inequality,
From Cauchy’s inequality, we have
Taking , we have
Since (see Corollary 4.2.11 in [Ver18] for example), we can take and by a union bound, we have
6. Dominant eigenspaces and projectors
Our recovery algorithm is based on the dominant rank- projector of the adjacency matrix .
Definition 6.1 (Dominant eigenspace).
Note that by this definition, if , then actually has dimension , but that will never be the case in this analysis.
Definition 6.2 (Dominant rank- projector).
If is a Hermitian or real symmetric matrix, the dominant rank- projector of , denoted , is the orthogonal projection operator onto .
is a rank-, self-adjoint operator which acts as the identity on . It has eigenvalues equal to 1 and equal to 0. If is an orthonormal basis for , then
where denotes either the transpose or conjugate transpose of , depending on whether we are working over or .
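The construction of the dominant projector from an orthonormal basis can be sketched numerically. This is an illustrative sketch rather than the paper's implementation: the function name `dominant_projector` and the argument names `X` and `r` are assumptions, and it uses a full eigendecomposition via NumPy's `numpy.linalg.eigh` instead of the truncated SVD mentioned in Section 3.2.

```python
import numpy as np


def dominant_projector(X, r):
    """Orthogonal projector onto the span of eigenvectors for the r largest eigenvalues.

    X is assumed to be Hermitian or real symmetric; eigh returns eigenvalues
    in ascending order with orthonormal eigenvectors as columns.
    """
    X = np.asarray(X)
    _, eigenvectors = np.linalg.eigh(X)
    V = eigenvectors[:, X.shape[0] - r:]  # columns for the r largest eigenvalues
    return V @ V.conj().T  # conjugate transpose reduces to transpose over the reals


X = np.diag([3.0, 2.0, 1.0])
P = dominant_projector(X, 2)
```

For this diagonal example the projector keeps the first two coordinates and annihilates the third. The result is only well defined when the r-th and (r+1)-th largest eigenvalues differ, in line with the remark after Definition 6.1.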
Let us define to be the incidence matrix of ; i.e.,
Thus, it is our goal to reconstruct given .
Let denote the indicator vector for cluster and the all-ones matrix. Then we can write
for some constants . Thus, is an orthonormal basis for the column space of both and , and hence, in accordance with (6.1),
Now, observe that the eigenvalues of are those of shifted down by , and is an eigenvector of if and only if it is an eigenvector of ; hence, the dominant -dimensional eigenspace of is the same as the column space of , and therefore . ∎
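The argument above can be checked numerically on a small instance. The sketch below is an illustration under assumptions, not the paper's computation: it takes a matrix with zero diagonal whose off-diagonal entries equal a constant `alpha` within clusters and a smaller constant `beta` across clusters (the block-constant shape that the expected adjacency matrix has), and confirms that its dominant projector of rank equal to the number of clusters is the cluster incidence matrix divided by the cluster size. The values of `alpha` and `beta` are arbitrary, and all function names are hypothetical.

```python
import numpy as np


def cluster_incidence(n, k):
    """0/1 matrix with entry 1 exactly when two vertices share a cluster.

    Assumption: clusters are contiguous blocks of size n // k.
    """
    labels = np.arange(n) // (n // k)
    return (labels[:, None] == labels[None, :]).astype(float)


def dominant_projector(X, r):
    """Projector onto the eigenvectors of symmetric X with the r largest eigenvalues."""
    _, eigenvectors = np.linalg.eigh(X)
    V = eigenvectors[:, X.shape[0] - r:]
    return V @ V.T


n, k = 12, 3
s = n // k
Z = cluster_incidence(n, k)

# Arbitrary illustrative constants with alpha > beta (not the paper's values).
alpha, beta = 0.7, 0.2
J = np.ones((n, n))
M = beta * J + (alpha - beta) * Z - alpha * np.eye(n)  # zero diagonal

P = dominant_projector(M, k)
matches_incidence = np.allclose(P, Z / s)
```

The identity shift changes every eigenvalue by the same amount and leaves the eigenvectors alone, so the top eigenspace is still spanned by the cluster indicator vectors.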
Thus, gives us all the information we need to reconstruct . Unfortunately, an SBM recovery algorithm doesn’t have access to or (if it did the problem would be trivial), but the following theorem shows that the random matrix is a good approximation to and thus reveals the underlying rank- structure of :
Assume (5.1) holds. Then
for any .
Let be symmetric. Suppose that the largest eigenvalues of both are at least , and the remaining eigenvalues of both are at most , where . Then
7. Constructing an approximate cluster
In this section we show how to use to construct an “approximate cluster”, i.e. a set with small symmetric difference with one of the clusters. We will show that
If and is large, then must have large intersection with some cluster (Lemma 7.1)
Such a set exists among the sets , where is the indices of the largest entries in column of , along with itself (Lemma 7.2).
The intuition is that if , then
and this quantity is maximized when comes mostly from a single cluster .
Lemmas 7.1 and 7.2 below are essentially the same as Lemmas 18 and 17 in [CFR17]. As in the case of Theorem 6.3, we can import their proofs directly from the graph case. However, we present a simpler proof for Lemma 7.1.
Assume (5.1) holds. Let and . Then for some .
By Theorem 6.4,
And by the triangle inequality,
We will show that in order for this to hold, must have large intersection with some cluster.
Fix such that . Assume by way of contradiction that for all . Observe that by Theorem 6.3
Let and consider the optimization problem
Solving for , this implies that
Thus, if we choose we have a contradiction. Let us choose to be as large as possible, . Then it must be the case that for some . Note that for the proof to go through we require , which is satisfied if . ∎
This lemma gives us a way to identify an “approximate cluster” using only ; however, it would take time to try all sets of size . Instead, if we define to be along with the indices of the largest entries of column of (as in Step 3 of Algorithm 2), then Lemma 7.2 below shows that one of these sets satisfies the conditions of Lemma 7.1; thus, we can produce an approximate cluster in polynomial time by taking the that maximizes .
The proof is exactly the same as that of [CFR17, Lemma 17].
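The selection procedure described in this section can be sketched as follows. This is a hedged illustration, not Algorithm 2 itself: it assumes that each candidate set consists of a vertex together with the indices of the `s - 1` largest entries in that vertex's column of the projector (the count `s - 1` is an assumption, chosen so that each candidate has the cluster size `s`), and it scores a candidate by the Euclidean norm of the projector applied to its indicator vector. The function name `approximate_cluster` is hypothetical.

```python
import numpy as np


def approximate_cluster(P, s):
    """Return the candidate set W_v that maximizes the norm of P times its indicator."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    best_set, best_norm = None, -1.0
    for v in range(n):
        column = P[:, v].copy()
        column[v] = -np.inf  # v is added separately, so exclude it from the ranking
        top = np.argsort(column)[n - (s - 1):]  # indices of the s - 1 largest entries
        W = np.append(top, v)
        indicator = np.zeros(n)
        indicator[W] = 1.0
        norm = np.linalg.norm(P @ indicator)
        if norm > best_norm:
            best_set, best_norm = W, norm
    return {int(i) for i in best_set}


# Idealized input: the projector of two clusters of size 3 with no noise.
labels = np.arange(6) // 3
P_ideal = (labels[:, None] == labels[None, :]).astype(float) / 3
W = approximate_cluster(P_ideal, s=3)
```

With the idealized projector every candidate set is an exact cluster. With the projector of a sampled adjacency matrix, the lemmas above only guarantee a set with small symmetric difference from some cluster, which is why a separate exact-recovery step follows.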
8. Exact recovery by counting hyperedges
Suppose we have a set such that for some ( denotes symmetric difference). In the graph case () we can use to recover exactly w.h.p. as follows:
Show that w.h.p. for any will have at least neighbors in , while any will have at most neighbors in . This follows from a simple Hoeffding argument.
Show that, if these bounds hold for any , then (deterministically) any will have at least neighbors in , while any will have at most neighbors in . Thus, we can use the number of neighbors in to distinguish between vertices in and vertices in other clusters.
See [CFR17, Lemmas 19-20] for details. The reason we cannot directly apply a Hoeffding argument to is that depends on the randomness of the instance ; thus, the number of neighbors a vertex has in is not the sum of fixed random variables.
To generalize to hypergraphs with , an obvious analogue of the notion of number of neighbors a vertex has in a set is to define the random variable
i.e. the number of hyperedges containing and vertices from . When this is simply the number of neighbors has in . We get the following analogue to [CFR17, Lemma 19]:
Consider cluster and vertex , and let . If , then for sufficiently large and ,