Exact Recovery in the Hypergraph Stochastic Block Model: a Spectral Algorithm

11/16/2018 · Sam Cole, et al. · University of Manitoba, University of Washington

We consider the exact recovery problem in the hypergraph stochastic block model (HSBM) with k blocks of equal size. More precisely, we consider a random d-uniform hypergraph H with n vertices partitioned into k clusters of size s = n / k. Hyperedges e are added independently with probability p if e is contained within a single cluster and q otherwise, where 0 ≤ q < p ≤ 1. We present a spectral algorithm which recovers the clusters exactly with high probability, given mild conditions on n, k, p, q, and d. Our algorithm is based on the adjacency matrix of H, which we define to be the symmetric n × n matrix whose (u, v)-th entry is the number of hyperedges containing both u and v. To the best of our knowledge, our algorithm is the first to guarantee exact recovery when the number of clusters k=Θ(√(n)).


1. Introduction

1.1. Hypergraph clustering

Clustering is an important topic in data mining, network analysis, machine learning and computer vision [JMF99]. Many clustering methods are based on graphs, which represent pairwise relationships among objects. However, in many real-world problems pairwise relations are not sufficient, and higher-order relations between objects cannot be represented as edges of a graph. Hypergraphs can be used to represent more complex relationships among data, and they have been shown empirically to have advantages over graphs [ZHS07, PM07]. Thus, it is of practical interest to develop algorithms based on hypergraphs that can handle higher-order relationships among data, and much work has already been done to that end; see, for example, [ZHS07, LS12, Vaz09, GD15b, BP09, HSJR13, AIV15]. Hypergraph clustering has found a wide range of applications [HKKM98, DBRL12, BG05, GG13, KNKY11].

The stochastic block model (SBM) is a generative model for random graphs with community structures which serves as a useful benchmark for the task of recovering community structure from graph data. It is natural to have an analogous model for random hypergraphs as a testing ground for hypergraph clustering algorithms.

1.2. Hypergraph stochastic block models

The hypergraph stochastic block model, first introduced in [GD14], is a generalization of the SBM for hypergraphs. We define the hypergraph stochastic block model (HSBM) as follows for $d$-uniform hypergraphs.

Definition 1.1 (Hypergraph).

A $d$-uniform hypergraph is a pair $H = (V, E)$ where $V$ is a set of vertices and $E$ is a set of subsets of $V$ of size $d$, called hyperedges.

Definition 1.2 (Hypergraph stochastic block model (HSBM)).

Let $V_1, \ldots, V_k$ be a partition of the vertex set $V$ into $k$ sets of size $s = n/k$ (assume $n$ is divisible by $k$); each $V_i$ is called a cluster. For constants $0 \le q < p \le 1$, we define the $d$-uniform hypergraph SBM as follows:

For any set of $d$ distinct vertices $\{i_1, \ldots, i_d\}$, generate a hyperedge with probability $p$ if the vertices are all in the same cluster of the partition. Otherwise, generate the hyperedge with probability $q$. We denote this distribution of random hypergraphs as $\mathcal{H}(n, d, p, q, k)$.
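To make the model concrete, here is a minimal sampler for this distribution in Python with NumPy. It is only an illustrative sketch: the function name `sample_hsbm`, the argument order, and the convention that consecutive blocks of $s$ vertices form the clusters are our own choices, not notation from the paper.

```python
import itertools
import numpy as np

def sample_hsbm(n, d, p, q, k, seed=None):
    """Sample a d-uniform HSBM with k equal clusters of size s = n/k.

    Returns the list of clusters and the list of hyperedges. Vertices
    0..s-1 form cluster 0, the next s form cluster 1, and so on.
    """
    assert n % k == 0, "n must be divisible by k"
    rng = np.random.default_rng(seed)
    s = n // k
    label = np.repeat(np.arange(k), s)      # label[v] = cluster of vertex v
    clusters = [list(range(i * s, (i + 1) * s)) for i in range(k)]

    hyperedges = []
    for e in itertools.combinations(range(n), d):
        # probability p if all d vertices share a cluster, q otherwise
        prob = p if len({label[v] for v in e}) == 1 else q
        if rng.random() < prob:
            hyperedges.append(e)
    return clusters, hyperedges
```

For instance, `sample_hsbm(30, 3, 0.8, 0.2, 3)` returns three clusters of ten vertices and the sampled hyperedge list; the loop over all $\binom{n}{d}$ potential hyperedges is meant for clarity, not efficiency.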

Hypergraphs are closely related to symmetric tensors. We give a definition of symmetric tensors below; see [KB09] for more details on tensors.

Definition 1.3 (Symmetric tensor).

Let $T$ be an order-$d$ tensor. We call $T$ symmetric if $T_{i_1 \cdots i_d} = T_{i_{\sigma(1)} \cdots i_{\sigma(d)}}$ for any indices $i_1, \ldots, i_d$ and any permutation $\sigma$ in the symmetric group $S_d$.

Formally, we can use a random symmetric tensor to represent a random hypergraph drawn from this model. We construct an adjacency tensor $T$ of $H$ as follows. For any $d$ distinct vertices $i_1, \ldots, i_d$ that are in the same cluster,
\[
T_{i_1 \cdots i_d} \sim \mathrm{Bernoulli}(p).
\]
For any $d$ distinct vertices $i_1, \ldots, i_d$, if any two of them are not in the same cluster, we have
\[
T_{i_1 \cdots i_d} \sim \mathrm{Bernoulli}(q).
\]
We set $T_{i_1 \cdots i_d} = 0$ if any two of the indices $i_1, \ldots, i_d$ coincide, and we set $T_{i_{\sigma(1)} \cdots i_{\sigma(d)}} = T_{i_1 \cdots i_d}$ for any permutation $\sigma \in S_d$. Hence, $T$ is symmetric, and $T_{i_1 \cdots i_d}$ is determined entirely by the index set $\{i_1, \ldots, i_d\}$, regardless of order. If we have two distinct index sets $\{i_1, \ldots, i_d\} \ne \{j_1, \ldots, j_d\}$, then the random variables $T_{i_1 \cdots i_d}$ and $T_{j_1 \cdots j_d}$ are independent. Furthermore, we may abuse notation and write $T_e$ in place of $T_{i_1 \cdots i_d}$, where $e = \{i_1, \ldots, i_d\}$.

The HSBM recovery problem is to find the ground truth clusters either approximately or exactly, given a sample hypergraph from $\mathcal{H}(n, d, p, q, k)$. We may ask the following questions about the quality of the solutions; see [Abb18] for further details:

  1. Exact recovery (strong consistency): Find the partition $V_1, \ldots, V_k$ exactly (up to a permutation of the cluster labels) with probability $1 - o(1)$.

  2. Almost exact recovery (weak consistency): Find a partition such that only an $o(1)$ fraction of the vertices are mislabeled.

  3. Detection: Find a partition such that the fraction of mislabelled vertices is at most $1 - 1/k - \varepsilon$ for some positive $\varepsilon$.

For exact recovery with two blocks, it was shown that the phase transition occurs in the regime of logarithmic average degree in [LCW17, CLW18b, CLW18a] by analyzing the minimax risk, and the exact threshold was given in [KBG18] by a generalization of the techniques in [ABH16]. An exact recovery threshold of the censored block model for hypergraphs was characterized in [ALS16]. For detection, [ACKZ15] proposed a belief propagation algorithm and conjectured a threshold point.

Several methods have been considered for exact recovery of HSBMs. In [GD14], the authors used spectral clustering based on the hypergraph's Laplacian to recover HSBMs that are dense and uniform. Subsequently, they extended their results to sparse, non-uniform hypergraphs, for exact, almost exact and partial recovery [GD15a, GD15b, GD17]. Spectral methods along with local refinements were considered in [CLW18a, ALS16]. A semidefinite programming approach was analyzed in [KBG18].

1.3. This paper

In this paper, we focus on exact recovery. Rather than dealing with sparsity as in [KBG18], we approach the problem from a different direction: we attempt to construct algorithms that succeed on dense hypergraphs (with $p$ and $q$ constant) when the number of blocks increases with $n$. Our algorithm works when $k = O(\sqrt{n})$, which is believed to be the barrier for exact recovery in the dense graph case with clusters of equal size [CSX14, Ame14, OH11]. To the best of our knowledge, our algorithm is the first to guarantee exact recovery when the number of clusters is $k = \Theta(\sqrt{n})$. In addition, in contrast to [KBG18, CLW18a, ALS18, GD17], our algorithm is purely spectral. While we focus on the dense case, our algorithm can be adapted to the sparse case as well; however, it does not perform as well as previously known algorithms [KBG18, LCW17, CLW18b, CLW18a, GD17] in the sparse regime.

Our main result is the following:

Theorem 1.4.

Let $d$, $p$, and $q$ be constant. For sufficiently large $n$, there exists a deterministic, polynomial time algorithm which exactly recovers $d$-uniform HSBMs with probability $1 - o(1)$ if $k = O(\sqrt{n})$.

See Theorem 3.1 below for a precise statement.

Our algorithm compares favorably with other known algorithms for HSBM recovery in the dense case with a growing number of clusters; see Section 11. It is based on the iterated projection technique developed in [CFR17, Col18] for the graph case. We apply this approach to the adjacency matrix $A$ of the random hypergraph $H$, i.e., the symmetric $n \times n$ matrix whose $(u, v)$-th entry is the number of hyperedges containing both $u$ and $v$ (or 0 if $u = v$). In the process, we prove a concentration result for the spectral norm of $A$, which may be of independent interest (Theorem 5.4).

2. Simple counting algorithm

Before we introduce our spectral algorithm for recovering HSBMs, let us observe that one can recover HSBMs by simply counting the number of hyperedges containing pairs of vertices: with high probability, pairs of vertices in the same cluster will be contained in more hyperedges than pairs in different clusters. However, we will see that our spectral algorithm provides better performance guarantees than this simple counting algorithm.

Given $H$, $d$, the number of clusters $k$, and cluster size $s = n/k$:

  1. For each pair of vertices $u \ne v$, compute $d(u, v)$, the number of hyperedges containing both $u$ and $v$.

  2. For each vertex $u$, let $C(u)$ be the set consisting of $u$ and the $s - 1$ vertices $v$ with highest $d(u, v)$ (breaking ties arbitrarily). It will be shown that w.h.p. $C(u)$ will be the cluster containing $u$.

Algorithm 1
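A short Python sketch of Algorithm 1 follows. It assumes the hypergraph is given as a list of hyperedges (tuples of vertices); the helper name `codegree` and the return format are illustrative choices, not the paper's.

```python
import itertools
import numpy as np

def counting_recovery(n, hyperedges, s):
    """Algorithm 1: for each vertex u, return u together with the s-1 vertices v
    that maximize d(u, v), the number of hyperedges containing both u and v."""
    codegree = np.zeros((n, n), dtype=int)
    for e in hyperedges:
        for u, v in itertools.combinations(e, 2):
            codegree[u, v] += 1
            codegree[v, u] += 1

    candidate_clusters = []
    for u in range(n):
        counts = codegree[u].copy()
        counts[u] = -1                              # exclude u itself from the ranking
        top = np.argsort(counts)[::-1][: s - 1]     # s-1 vertices with highest d(u, .)
        candidate_clusters.append(sorted({u, *top}))
    return candidate_clusters
```

Under the hypotheses of Theorem 2.1 below, each `candidate_clusters[u]` coincides with the cluster containing $u$ with high probability.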
Theorem 2.1.

Let $H$ be sampled from $\mathcal{H}(n, d, p, q, k)$, where $0 \le q < p \le 1$ and $s = n/k$. Then Algorithm 1 recovers the clusters exactly with probability $1 - o(1)$ if

Proof.

For each pair $u \ne v$, $d(u, v)$ is the sum of $\binom{n-2}{d-2}$ independent Bernoulli random variables, each with expectation either $p$ or $q$. Thus, it follows from a straightforward application of Hoeffding's inequality that

(2.1)

with high probability if $u$ and $v$ are in the same cluster, and

(2.2)

with high probability if $u$ and $v$ are in different clusters. Taking a union bound over all pairs, these bounds hold simultaneously with probability $1 - o(1)$. Thus, as long as the lower bound in (2.1) is greater than the upper bound in (2.2), for each vertex $u$ the $s - 1$ vertices with highest $d(u, v)$ will be exactly the other vertices in $u$'s cluster. ∎
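For intuition, the two expectations being separated are a routine computation from the model (not reproduced from the paper): each pair $u \ne v$ lies in $\binom{n-2}{d-2}$ potential hyperedges, of which $\binom{s-2}{d-2}$ are within-cluster when $u$ and $v$ share a cluster, so
\[
\mathbb{E}[d(u,v) \mid u, v \text{ in the same cluster}] \;-\; \mathbb{E}[d(u,v) \mid u, v \text{ in different clusters}] \;=\; \binom{s-2}{d-2}(p - q),
\]
and (2.1) and (2.2) separate precisely when this gap dominates the Hoeffding-scale fluctuations of $d(u, v)$.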

In particular, if we bound the binomial coefficient using $(m/j)^j \le \binom{m}{j} \le (em/j)^j$, we obtain two simpler sufficient conditions for recovery, in which the constants involved are absolute.

3. Spectral algorithm and main results

Our main result is that Algorithm 2 below recovers HSBMs with high probability, given certain conditions on $n$, $k$, $p$, $q$, and $d$. It is an adaptation of the iterated projection algorithm for the graph case [CFR17, Col18]. The algorithm can be broken down into three main parts:

  1. Construct an “approximate cluster” using spectral methods (Steps 1-4)

  2. Recover the cluster exactly from the approximate cluster by counting hyperedges (Steps 5-6)

  3. Delete the recovered cluster and recurse on the remaining vertices (Step 7).

Given $H$, $d$, the number of clusters $k$, and cluster size $s = n/k$:

  1. Let $A$ be the adjacency matrix of $H$ (as defined in Section 4).

  2. Let $\hat{P} = P_k(A)$ be the dominant rank-$k$ projector of $A$ (as defined in Section 6).

  3. For each column $u$ of $\hat{P}$, let $\hat{P}_{i_1 u} \ge \hat{P}_{i_2 u} \ge \cdots$ be the entries other than $\hat{P}_{uu}$ in non-increasing order. Let $Q_u = \{u, i_1, \ldots, i_{s-1}\}$, i.e., the indices of the $s - 1$ greatest entries of column $u$ of $\hat{P}$, along with $u$ itself.

  4. Let $Q = Q_{u^*}$, where $u^* = \arg\max_u \|\hat{P}\mathbf{1}_{Q_u}\|$, i.e. the column whose candidate set maximizes $\|\hat{P}\mathbf{1}_{Q_u}\|$. It will be shown that $Q$ has small symmetric difference with some cluster with high probability (Section 7).

  5. For all vertices $u$, let $d_Q(u)$ be the number of hyperedges $e$ such that $u \in e$ and $e \setminus \{u\} \subseteq Q$, i.e., the number of hyperedges containing $u$ and $d - 1$ vertices from $Q$.

  6. Let $C$ be the $s$ vertices $u$ with highest $d_Q(u)$. It will be shown that $C$ is exactly one of the clusters with high probability (Section 8).

  7. Delete $C$ from $H$ and repeat on the remaining sub-hypergraph. Stop when there are $s$ vertices left.

Algorithm 2
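The following Python sketch mirrors Steps 1-7 above. It is a naive reference implementation under the notational assumptions made in this section; in particular the use of $\|\hat{P}\mathbf{1}_{Q_u}\|_2$ as the score in Step 4 and all helper names are our reading of the algorithm, not code from the paper.

```python
import itertools
import numpy as np

def dominant_projector(M, k):
    """Orthogonal projector onto the span of the k leading eigenvectors of M."""
    _, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
    top = vecs[:, -k:]                         # k dominant eigenvectors
    return top @ top.T

def recover_one_cluster(vertices, hyperedges, k, s):
    """Steps 1-6 on the sub-hypergraph induced by `vertices`."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)

    # Step 1: adjacency matrix A (number of hyperedges containing both u and v).
    A = np.zeros((n, n))
    for e in hyperedges:
        for u, v in itertools.combinations(e, 2):
            A[index[u], index[v]] += 1
            A[index[v], index[u]] += 1

    # Step 2: dominant rank-k projector of A.
    P = dominant_projector(A, k)

    # Steps 3-4: candidate set Q_u from the s-1 largest entries of each column;
    # keep the candidate with the largest ||P 1_{Q_u}||_2.
    best_Q, best_score = None, -1.0
    for u in range(n):
        col = P[:, u].copy()
        col[u] = -np.inf
        Q = set(np.argsort(col)[::-1][: s - 1]) | {u}
        indicator = np.zeros(n)
        indicator[list(Q)] = 1.0
        score = np.linalg.norm(P @ indicator)
        if score > best_score:
            best_Q, best_score = Q, score

    # Steps 5-6: for each vertex u, count hyperedges whose other vertices all lie
    # in Q, and take the s vertices with the highest count as the recovered cluster.
    counts = np.zeros(n)
    Q_vertices = {vertices[i] for i in best_Q}
    for e in hyperedges:
        for u in e:
            if set(e) - {u} <= Q_vertices:
                counts[index[u]] += 1
    return {vertices[i] for i in np.argsort(counts)[::-1][:s]}

def iterated_projection(n, hyperedges, k):
    """Step 7: peel off one cluster at a time."""
    s = n // k
    remaining, clusters = list(range(n)), []
    while len(remaining) > s:
        rem = set(remaining)
        live = [e for e in hyperedges if set(e) <= rem]
        C = recover_one_cluster(remaining, live, len(remaining) // s, s)
        clusters.append(sorted(C))
        remaining = [v for v in remaining if v not in C]
    clusters.append(sorted(remaining))         # the last s vertices form the final cluster
    return clusters
```

On small dense instances (e.g. from the sampler in Section 1.2) this sketch recovers the planted clusters; it makes no attempt at the running-time considerations discussed in Section 3.2.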

The remainder of this paper is devoted to proving correctness of this algorithm.

Theorem 3.1.

Let $H$ be sampled from $\mathcal{H}(n, d, p, q, k)$, where $p$ and $q$ are constant and $s = n/k$. If $k = O(\sqrt{n})$ and

(3.1)

then for sufficiently large $n$, Algorithm 2 exactly recovers the clusters with probability $1 - o(1)$.

Sections 4-6 introduce the linear algebra tools necessary for the proof; Section 7 shows that Step 4 with high probability produces a set with small symmetric difference with one of the clusters; Section 8 proves that Step 6 with high probability recovers one of the clusters exactly; and Section 9 proves inductively that the algorithm with high probability recovers all clusters.

3.1. Performance guarantees

Observe that if the numerator is less than the denominator in (3.1), then we can upper bound the left hand side by

Thus, Theorem 3.1 guarantees that we can recover w.h.p. if

(this is a slightly stronger condition than (3.1)). Recall that for nonnegative integers $j \le m$ we can bound the binomial coefficient by $(m/j)^j \le \binom{m}{j} \le (em/j)^j$. Using these bounds and rearranging, we get the following as a corollary to Theorem 3.1:

Theorem 3.2 (Dense case).

Let $H$ be sampled from $\mathcal{H}(n, d, p, q, k)$, where $0 \le q < p \le 1$, the parameters $d$, $p$, and $q$ are constant, and $s = n/k$. If

then Algorithm 2 recovers the clusters w.h.p., where $c$ is an absolute constant.

Note that Theorem 3.1 requires $p$ and $q$ to be constant, but $d$ is allowed to vary with $n$. However, we want the failure probability to be $o(1)$, so we require a mild upper bound on how quickly $d$ may grow; in particular, this is satisfied if $d$ is constant.

Thus, we see that Algorithm 2 beats Algorithm 1 by a factor of . However, observe that the guarantee for Algorithm 1 approaches that of Algorithm 2 as increases. See Section 11 for comparison with other known algorithms.

3.2. Running time

In contrast to the graph case, in which the most expensive step is constructing the projection operator (which can be done via truncated SVD [GVL96, Gu15]), for $d \ge 3$ the running time of Algorithm 2 is dominated by constructing the adjacency matrix $A$, which takes $\Theta\big(\binom{n}{d}\big)$ time (the same amount of time it takes to simply read the input hypergraph). Thus, the overall running time of Algorithm 2 is $\Theta\big(\binom{n}{d}\big)$.

4. Reduction to random matrices

Working with random tensors is hard [HL13], since we do not have as many linear algebra and probability tools as for random matrices. It would be convenient if we could work with matrices instead of tensors. We propose to analyze the following adjacency matrix of a hypergraph, originally defined in [FL96].

Definition 4.1 (Adjacency matrix).

Let $H$ be a random hypergraph generated from $\mathcal{H}(n, d, p, q, k)$ and let $T$ be the adjacency tensor of $H$. For any hyperedge $e$, let $T_e$ be the entry in $T$ corresponding to $e$. We define the adjacency matrix $A$ of $H$ by

(4.1)    $A_{uv} = \sum_{e \ni u, v} T_e$ for $u \ne v$, and $A_{uu} = 0$, where the sum runs over all $d$-element subsets $e$ containing both $u$ and $v$.

Thus, $A_{uv}$ is the number of hyperedges in $H$ that contain both vertices $u$ and $v$. Note that in the summation (4.1), each hyperedge is counted once.

From our definition, $A$ is symmetric, and $A_{uu} = 0$ for all $u$. However, the entries of $A$ are not independent. This presents some difficulty, but we can still get information about the clusters from this adjacency matrix $A$.

5. Eigenvalues and concentration of spectral norms

It is easy to see that for $u \ne v$,
\[
\mathbb{E}A_{uv} =
\begin{cases}
\binom{s-2}{d-2}\,p + \left[\binom{n-2}{d-2} - \binom{s-2}{d-2}\right] q =: \alpha, & u, v \text{ in the same cluster},\\[2pt]
\binom{n-2}{d-2}\,q =: \beta, & \text{otherwise}.
\end{cases}
\]
Let $\bar{A} = \mathbb{E}A + \alpha I$; then $\bar{A}$ is a symmetric matrix of rank $k$. We have the following eigenvalues for $\bar{A}$ and $\mathbb{E}A$. Note that we are using the convention $\lambda_1(M) \ge \lambda_2(M) \ge \cdots \ge \lambda_n(M)$ for a self-adjoint matrix $M$.

Lemma 5.1.

The eigenvalues of $\bar{A}$ are $(\alpha - \beta)s + \beta n$ with multiplicity $1$, $(\alpha - \beta)s$ with multiplicity $k - 1$, and $0$ with multiplicity $n - k$.

Hence by a shifting, we have the following eigenvalues for $\mathbb{E}A$.

Lemma 5.2.

The eigenvalues of $\mathbb{E}A$ are $(\alpha - \beta)s + \beta n - \alpha$ with multiplicity $1$, $(\alpha - \beta)s - \alpha$ with multiplicity $k - 1$, and $-\alpha$ with multiplicity $n - k$.

We can use an $\varepsilon$-net chaining argument to prove the following concentration inequality for the spectral norm of $A - \mathbb{E}A$.

Definition 5.3 ($\varepsilon$-net).

An $\varepsilon$-net for a compact metric space $(X, \rho)$ is a finite subset $\mathcal{N} \subseteq X$ such that for each point $x \in X$, there is a point $y \in \mathcal{N}$ with $\rho(x, y) \le \varepsilon$.

Theorem 5.4.

Let $\|\cdot\|$ denote the spectral norm of a matrix. We have

(5.1)

with probability at least $1 - o(1)$.

Proof.

Consider the centered matrix $M := A - \mathbb{E}A$; then each entry of $M$ is a centered random variable. Let $S^{n-1}$ be the unit sphere in $\mathbb{R}^n$ (using the $\ell_2$-norm). By the definition of the spectral norm,

Let $\mathcal{N}$ be an $\varepsilon$-net on $S^{n-1}$. Then for any $x \in S^{n-1}$, there exists some $y \in \mathcal{N}$ such that $\|x - y\|_2 \le \varepsilon$. Then we have

Taking the supremum over $x \in S^{n-1}$, we have

Therefore

(5.2)

Now we first fix an $x \in \mathcal{N}$ and prove a concentration inequality for $x^{\mathsf{T}} M x$. Let $\mathcal{E}$ be the hyperedge set of the complete $d$-uniform hypergraph on $[n]$. We have
\[
x^{\mathsf{T}} M x = \sum_{u \ne v} x_u x_v \left(A_{uv} - \mathbb{E}A_{uv}\right) = \sum_{e \in \mathcal{E}} Y_e,
\qquad \text{where} \quad Y_e := \left(T_e - \mathbb{E}T_e\right) \sum_{\{u, v\} \subseteq e,\, u \ne v} 2\, x_u x_v,
\]
and the $Y_e$ are independent. Note that $|Y_e| \le \sum_{\{u, v\} \subseteq e,\, u \ne v} 2\,|x_u x_v|$, so we have

By Hoeffding’s inequality,

(5.3)

From Cauchy’s inequality, we have

(5.4)

Therefore from (5.3) and (5.4),

Taking , we have

(5.5)

Since (see Corollary 4.2.11 in [Ver18] for example), we can take and by a union bound, we have

(5.6)

Combining (5.2) and (5.6) yields (5.1), completing the proof. ∎
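As a quick numerical illustration of the separation that Theorem 5.4 provides (an informal experiment, not part of the paper; the parameter values are arbitrary), one can compare the spectral norm of the centered matrix $A - \mathbb{E}A$ with that of $\mathbb{E}A$ on a dense sample:

```python
import itertools
import math
import numpy as np

def centered_norm_experiment(n=60, d=3, p=0.7, q=0.3, k=3, seed=0):
    """Compare ||A - E[A]|| with ||E[A]|| on one HSBM sample (illustration only)."""
    rng = np.random.default_rng(seed)
    s = n // k
    label = np.repeat(np.arange(k), s)

    # Sample the hypergraph and build the adjacency matrix A.
    A = np.zeros((n, n))
    for e in itertools.combinations(range(n), d):
        prob = p if len({label[v] for v in e}) == 1 else q
        if rng.random() < prob:
            for u, v in itertools.combinations(e, 2):
                A[u, v] += 1
                A[v, u] += 1

    # E[A]: off-diagonal entries depend only on whether u and v share a cluster.
    same = math.comb(s - 2, d - 2) * p + (math.comb(n - 2, d - 2) - math.comb(s - 2, d - 2)) * q
    diff = math.comb(n - 2, d - 2) * q
    EA = np.where(label[:, None] == label[None, :], same, diff)
    np.fill_diagonal(EA, 0.0)

    noise = np.linalg.norm(A - EA, 2)          # spectral norm of the centered matrix
    signal = np.linalg.norm(EA, 2)
    print(f"||A - E[A]|| = {noise:.1f}   vs   ||E[A]|| = {signal:.1f}")

centered_norm_experiment()
```

On dense samples like this, the printed centered norm is much smaller than $\|\mathbb{E}A\|$, which is what allows the dominant eigenspace of $A$ to track that of $\mathbb{E}A$ in Section 6.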

6. Dominant eigenspaces and projectors

Our recovery algorithm is based on the dominant rank-$k$ projector of the adjacency matrix $A$.

Definition 6.1 (Dominant eigenspace).

If $M$ is a Hermitian or real symmetric matrix, the dominant $k$-dimensional eigenspace of $M$, denoted $\mathcal{E}_k(M)$, is the subspace of $\mathbb{C}^n$ or $\mathbb{R}^n$ spanned by eigenvectors of $M$ corresponding to its $k$ largest eigenvalues.

Note that by this definition, if $\lambda_k(M) = \lambda_{k+1}(M)$, then $\mathcal{E}_k(M)$ actually has dimension greater than $k$, but that will never be the case in this analysis.

Definition 6.2 (Dominant rank-$k$ projector).

If $M$ is a Hermitian or real symmetric matrix, the dominant rank-$k$ projector of $M$, denoted $P_k(M)$, is the orthogonal projection operator onto $\mathcal{E}_k(M)$.

$P_k(M)$ is a rank-$k$, self-adjoint operator which acts as the identity on $\mathcal{E}_k(M)$. It has $k$ eigenvalues equal to 1 and $n - k$ equal to 0. If $x_1, \ldots, x_k$ is an orthonormal basis for $\mathcal{E}_k(M)$, then

(6.1)    $P_k(M) = \sum_{i=1}^{k} x_i x_i^{*},$

where $x_i^{*}$ denotes either the transpose or conjugate transpose of $x_i$, depending on whether we are working over $\mathbb{R}$ or $\mathbb{C}$.
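As a small sanity check of (6.1) and the properties just listed (an illustrative sketch only, not code from the paper), one can build the dominant rank-$k$ projector from the top eigenvectors and verify that it is a self-adjoint, idempotent operator of rank $k$:

```python
import numpy as np

def dominant_rank_k_projector(M, k):
    """P_k(M): orthogonal projection onto the span of the k leading eigenvectors of M."""
    _, vecs = np.linalg.eigh(M)        # eigenvalues returned in ascending order
    U = vecs[:, -k:]                   # orthonormal basis x_1, ..., x_k of the dominant eigenspace
    return U @ U.T                     # (6.1): P = sum_i x_i x_i^T

rng = np.random.default_rng(1)
M = rng.standard_normal((8, 8))
M = (M + M.T) / 2                      # a symmetric test matrix
P = dominant_rank_k_projector(M, 3)
assert np.allclose(P, P.T)             # self-adjoint
assert np.allclose(P @ P, P)           # idempotent, i.e. a projection
assert np.isclose(np.trace(P), 3)      # rank k: eigenvalues are k ones and n - k zeros
```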

Let us define $W$ to be the incidence matrix of the partition $V_1, \ldots, V_k$; i.e., $W$ is the $n \times k$ matrix with $W_{ui} = 1$ if $u \in V_i$ and $W_{ui} = 0$ otherwise.

Thus, it is our goal to reconstruct $W$ given $A$.

Theorem 6.3.

Let $A$, $\mathbb{E}A$, and $\bar{A}$ be defined as in Sections 4 and 5. Then
\[
P_k(\mathbb{E}A) = \frac{1}{s}\, W W^{\mathsf{T}}.
\]

Proof.

Let $\mathbf{1}_{V_i}$ denote the indicator vector for cluster $V_i$ and $J$ the all-ones matrix. Then we can write
\[
\bar{A} = c_1 \sum_{i=1}^{k} \mathbf{1}_{V_i} \mathbf{1}_{V_i}^{\mathsf{T}} + c_2 J
\]
for some constants $c_1 > 0$ and $c_2 \ge 0$. Thus, $\{\mathbf{1}_{V_1}/\sqrt{s}, \ldots, \mathbf{1}_{V_k}/\sqrt{s}\}$ is an orthonormal basis for the column space of both $\bar{A}$ and $W$, and hence, in accordance with (6.1),
\[
P_k(\bar{A}) = \sum_{i=1}^{k} \frac{\mathbf{1}_{V_i}}{\sqrt{s}} \frac{\mathbf{1}_{V_i}^{\mathsf{T}}}{\sqrt{s}} = \frac{1}{s}\, W W^{\mathsf{T}}.
\]

Now, observe that the eigenvalues of $\mathbb{E}A$ are those of $\bar{A}$ shifted down by $\alpha$, and $x$ is an eigenvector of $\mathbb{E}A$ iff it is an eigenvector of $\bar{A}$; hence, the dominant $k$-dimensional eigenspace of $\mathbb{E}A$ is the same as the column space of $W$, and therefore $P_k(\mathbb{E}A) = \frac{1}{s} W W^{\mathsf{T}}$. ∎

Thus, $P_k(\mathbb{E}A)$ gives us all the information we need to reconstruct $W$. Unfortunately, an SBM recovery algorithm doesn't have access to $\mathbb{E}A$ or $P_k(\mathbb{E}A)$ (if it did, the problem would be trivial), but the following theorem shows that the random matrix $P_k(A)$ is a good approximation to $P_k(\mathbb{E}A)$ and thus reveals the underlying rank-$k$ structure of $\mathbb{E}A$:

Theorem 6.4.

Assume (5.1) holds. Then

and

for any .

To prove Theorem 6.4, we use the following lemma from [Col18, Lemma 4].

Lemma 6.5.

Let $M$ and $\hat{M}$ be symmetric matrices. Suppose that the $k$ largest eigenvalues of both are at least $\mu$, and the remaining eigenvalues of both are at most $\nu$, where $\nu < \mu$. Then

(6.2)
(6.3)
Proof of Theorem 6.4.

Apply Lemma 6.5 with , and

Note that in order for this to work we need , i.e.

7. Constructing an approximate cluster

In this section we show how to use $\hat{P} = P_k(A)$ to construct an "approximate cluster", i.e. a set with small symmetric difference with one of the clusters. We will show that

  • If $|Q| = s$ and $\|\hat{P}\mathbf{1}_Q\|$ is large, then $Q$ must have large intersection with some cluster (Lemma 7.1)

  • Such a set exists among the sets $Q_u$, where $Q_u$ consists of the indices of the $s - 1$ largest entries in column $u$ of $\hat{P}$, along with $u$ itself (Lemma 7.2).

The intuition is that if $Q \subseteq V$ with $|Q| = s$, then for the idealized projector $P_k(\mathbb{E}A) = \frac{1}{s} W W^{\mathsf{T}}$,
\[
\left\| P_k(\mathbb{E}A)\, \mathbf{1}_Q \right\|^2 = \sum_{i=1}^{k} \frac{|Q \cap V_i|^2}{s},
\]
and this quantity is maximized when $Q$ comes mostly from a single cluster $V_i$.

Lemmas 7.1 and 7.2 below are essentially the same as Lemmas 18 and 17 in [CFR17]. Since $P_k(\mathbb{E}A)$ has the same form as in the graph case (Theorem 6.3), we can import their proofs directly. However, we present a simpler proof for Lemma 7.1.

Lemma 7.1.

Assume (5.1) holds. Let and . Then for some .

Proof.

By Theorem 6.4,

And by the triangle inequality,

(7.1)

We will show that in order for this to hold, $Q$ must have large intersection with some cluster.

Fix such that . Assume by way of contradiction that for all . Observe that by Theorem 6.3

(7.2)

Let and consider the optimization problem

s.t.

It is easy to see that the maximum occurs when for some , for all , and the maximum is . Thus, by (7.1) and (7.2) we have

Solving for , this implies that

Thus, if we choose we have a contradiction. Let us choose to be as large as possible, . Then it must be the case that for some . Note that for the proof to go through we require , which is satisfied if . ∎

This lemma gives us a way to identify an "approximate cluster" using only $\hat{P}$; however, it would take $\binom{n}{s}$ time to try all sets of size $s$. If instead we define $Q_u$ to be $u$ along with the indices of the $s - 1$ largest entries of column $u$ of $\hat{P}$ (as in Step 3 of Algorithm 2), then Lemma 7.2 below will show that one of these sets satisfies the conditions of Lemma 7.1; thus, we can produce an approximate cluster in polynomial time by taking the $Q_u$ that maximizes $\|\hat{P}\mathbf{1}_{Q_u}\|$.

Lemma 7.2.

Assume (5.1) holds. For each vertex $u$, let $Q_u$ be defined as in Step 3 of Algorithm 2. Then there exists a column $u$ such that

The proof is exactly the same as that of [CFR17, Lemma 17].

Lemmas 7.1 and 7.2 together prove that, as long as (5.1) holds, Steps 2-4 successfully construct a set $Q$ such that $|Q| = s$ and $Q$ has small symmetric difference with some cluster $V_i$. In the following section we will see how to recover $V_i$ exactly from $Q$.

8. Exact recovery by counting hyperedges

Suppose we have a set $Q$ such that $|Q \triangle V_i|$ is small for some $i$ (here $\triangle$ denotes symmetric difference). In the graph case ($d = 2$) we can use $Q$ to recover $V_i$ exactly w.h.p. as follows:

  1. Show that w.h.p. any $u \in V_i$ will have at least a certain threshold number of neighbors in $V_i$, while any $u \notin V_i$ will have at most a strictly smaller number of neighbors in $V_i$. This follows from a simple Hoeffding argument.

  2. Show that, if these bounds hold, then (deterministically) any $u \in V_i$ will have more neighbors in $Q$ than any $u \notin V_i$. Thus, we can use the number of neighbors in $Q$ to distinguish between vertices in $V_i$ and vertices in other clusters.

See [CFR17, Lemmas 19-20] for details. The reason we cannot directly apply a Hoeffding argument to $Q$ is that $Q$ depends on the randomness of the instance $H$; thus the number of neighbors a vertex has in $Q$ is not the sum of a fixed set of independent random variables.

To generalize to hypergraphs with $d > 2$, an obvious analogue of the number of neighbors a vertex $u$ has in a set $S$ is to define the random variable
\[
d_S(u) := \#\left\{ e \in E : u \in e,\ e \setminus \{u\} \subseteq S \right\},
\]
i.e. the number of hyperedges containing $u$ and $d - 1$ vertices from $S$. When $d = 2$ this is simply the number of neighbors $u$ has in $S$. We get the following analogue to [CFR17, Lemma 19]:

Lemma 8.1.

Consider cluster and vertex , and let . If , then for sufficiently large and ,

(8.1)

with probability