The representation of data as graphs, with the vertices as entities and the edges as relationships between the entities, is now ubiquitous in many application domains: for example, social networks, in which vertices represent individual actors or organizations; neuroscience, in which vertices are neurons or brain regions; and document analysis, in which vertices represent authors or documents. This representation has proven invaluable in describing and modeling the intrinsic and complex structure that underlies these data.
In understanding the structure of large, complex graphs, a central task is that of identifying and classifying local, lower-dimensional structure, and more specifically, consistently and scalably estimating subgraphs and subcommunities. In disciplines as diverse as social network analysis and neuroscience, many large graphs are believed to be composed of loosely connected smaller graph primitives, whose structure is more amenable to analysis. For example, the widely-studied social network Friendster (available from http://snap.stanford.edu/data), which has approximately 60 million users and 2 billion edges, is believed to consist of over 1 million communities at local-scale. Insomuch as the communication structure of these social communities both influences and is influenced by the function of the social community, we expect there to be repeated structure across many of these communities (see Section 5). As a second motivating example, the neuroscientific cortical column conjecture [4, 5] posits that the neocortex of the human brain employs algorithms composed of repeated instances of a limited set of computing primitives. By modeling certain portions of the cortex as a hierarchical random graph, the cortical column conjecture can be interpreted as a problem of community detection and classification within a graph. While the full data needed to test the cortical column conjecture is not yet available, it nonetheless motivates our present approach of theoretically-sound robust hierarchical community detection and community classification.
Community detection for graphs is a well-established field of study, and there are many techniques and methodologies available, such as those based on maximizing modularity and likelihood [7, 8, 9], random walks [10, 11], and spectral clustering and partitioning [12, 13, 14, 15, 16, 17]. While many of these results focus on the consistency of the algorithms (namely, that the proportion of misclassified vertices goes to zero), the key results in this paper give guarantees on the probability of perfect clustering, in which no vertices at all are misclassified. As such, they are similar in spirit to the results of  and represent a considerable improvement of our earlier clustering results from . As might be expected, though, the strength of our results depends on the average degree of the graph, which we require to grow at least at order . We note that weak or partial recovery results are available for much sparser regimes, e.g., when the average degree stays bounded as the number of vertices increases (see, for example, the work of ). A partial summary of various consistency results and sparsity regimes in which they hold is given in Table I. Existing theoretical results on clustering have also been centered primarily on isolating fine-grained community structure in a network. A major contribution of this work, then, is a formal delineation of hierarchical structure in a network and a provably consistent algorithm to uncover communities and subgraphs at multiple scales.
| Average degree | Method | Notion of recovery | References |
|---|---|---|---|
| | semidefinite programming, backtracking random walks | weak recovery | [19, 20, 21] |
| | spectral clustering | weak consistency | [13, 22, 14] |
| | modularity maximization | strong consistency | |
| | spectral clustering | strong consistency | |
Moreover, existing community detection algorithms have focused mostly on uncovering the subgraphs. Recently, however, the characterization and classification of these subgraphs into stochastically similar motifs has emerged as an important area of ongoing research. Network comparison is a nascent field, and comparatively few techniques have thus far been proposed; see [23, 24, 25, 26, 27, 28, 29]. In particular, in , the authors exhibit a consistent nonparametric test for the equality of two generating distributions for a pair of random graphs. The method is based on first embedding the networks into Euclidean space followed by computing distances between the density estimates of the resulting embeddings. This hypothesis test will play a central role in our present methodology; see Section 2.
In the present paper, we introduce a robust, scalable methodology for community detection and community comparison in graphs, with particular application to social networks and connectomics. Our techniques build upon previous work in graph embedding, parameter estimation, and multi-sample hypothesis testing (see [14, 18, 29, 28]). Our method proceeds as follows. First, we generate a low-dimensional representation of the graph, cluster to detect subgraphs of interest, and then employ the nonparametric inference techniques of  to identify heterogeneous subgraph structures. The representation of a network as a collection of points in Euclidean space allows for a single framework which combines the steps of community detection via an adapted spectral clustering procedure (Algorithm 2) with network comparison via density estimation. Indeed, the streamlined clustering algorithm proposed in this paper, Algorithm 2, is well-suited to our hierarchical framework, whereas classical $k$-means may be ill-suited to the pathologies of this model. As a consequence, we are able to present in this paper a unified inference procedure in which community detection, motif identification, and larger network comparison are all seamlessly integrated.
We focus here on a hierarchical version of the classical stochastic block model [30, 31], in which the larger graph is composed of smaller subgraphs, each of which is itself approximately a stochastic blockmodel. We emphasize that our model and subsequent theory rely heavily on an affinity assumption at each level of the hierarchy, and we expect our model to be a reasonable surrogate for a wide range of real networks, as corroborated by our empirical results. In our approach, we aim to infer finer-grained structure at each level of our hierarchy, in effect performing a “top-down” decomposition. (For a different generative hierarchical model, in which successive-level blocks and memberships are the inference tasks, see .) We recall that the stochastic blockmodel (SBM) is an independent-edge random graph model that posits that the probability of connection between any two vertices is a function of the block memberships (i.e., community memberships) of the vertices. As such, the stochastic blockmodel is commonly used to model community structure in graphs. While we establish performance guarantees for this methodology in the setting of hierarchical stochastic blockmodels (HSBM), we demonstrate the wider effectiveness of our algorithm for simultaneous community detection and classification in the Drosophila connectome and the very-large scale social network Friendster, which has approximately 60 million users and 2 billion edges.
We organize the paper as follows. In Section 2, we provide the key definitions in our model, specifically for random dot product graphs, SBM graphs, and HSBM graphs. We summarize recent results on network comparison from , which are critical to our main algorithm, Algorithm 1. We also present our novel clustering procedure, Algorithm 2. In Section 3, we demonstrate how, under mild model assumptions, Algorithm 1 can be applied to asymptotically almost surely perfectly recover the motif structure in a two-level HSBM (see Theorem 9). In Section 4, we consider an HSBM with multiple levels and discuss the recursive nature of Algorithm 1. We also extend Theorem 9 to the multi-level HSBM and show that, under mild model assumptions, Algorithm 1 again asymptotically almost surely perfectly recovers the hierarchical motif structure in a multi-level HSBM. In Section 5, we demonstrate that Algorithm 1 can be effective in uncovering statistically similar subgraph structure in real data: first, in the Drosophila connectome, in which we uncover two repeated motifs; and second, in the Friendster social network, in which we decompose the massive network into 15 large subgraphs, each with hundreds of thousands to millions of vertices. We identify motifs among these Friendster subgraphs, and we compare two subgraphs belonging to different motifs. We further analyze a particular subgraph from a single motif and demonstrate that we can identify structure at the second (lower) level. In Section 6, we conclude by remarking on refinements and extensions of this approach to community detection.
We situate our approach in the context of hierarchical stochastic blockmodel graphs. We first define the stochastic blockmodel as a special case of the more general random dot product graph model , which is itself a special case of the more general latent position random graph . We next describe our canonical hierarchical stochastic blockmodel, which is a stochastic blockmodel that is endowed with a natural hierarchical structure.
Notation: In what follows, for a matrix we shall use the notation to denote the -th row of , and to denote the -th column of . For a symmetric matrix we shall denote the (ordered) spectrum of via
We begin by defining the random dot product graph.
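Definition 1 ($d$-dimensional Random Dot Product Graph (RDPG)).
Let $F$ be a distribution on a set $\mathcal{X} \subset \mathbb{R}^d$ such that $\langle x, x' \rangle \in [0,1]$ for all $x, x' \in \mathcal{X}$. We say that $(X, A) \sim \mathrm{RDPG}(F)$ is an instance of a random dot product graph (RDPG) if $X = [X_1, \dots, X_n]^\top$ with $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} F$, and $A \in \{0,1\}^{n \times n}$ is a symmetric hollow matrix satisfying
$$P[A \mid X] = \prod_{i > j} (X_i^\top X_j)^{A_{ij}} \, (1 - X_i^\top X_j)^{1 - A_{ij}}.$$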
We note that non-identifiability is an intrinsic property of random dot product graphs. Indeed, for any matrix $X$ and any orthogonal matrix $W$, the inner product between any rows $X_i, X_j$ of $X$ is identical to that between the rows $W^\top X_i, W^\top X_j$ of $XW$. Hence, for any probability distribution $F$ on $\mathcal{X}$ and unitary operator $U$, the adjacency matrices $A \sim \mathrm{RDPG}(F)$ and $B \sim \mathrm{RDPG}(F \circ U)$ are identically distributed.
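As an illustration of the definition and of the non-identifiability remark, the following minimal sketch samples a symmetric, hollow adjacency matrix whose edges are independent Bernoulli variables with success probabilities given by inner products of latent positions, and then checks numerically that an orthogonal transformation of the latent positions leaves those probabilities unchanged. The Dirichlet latent positions, the function name `sample_rdpg`, and all sizes are hypothetical choices made only for this example.

```python
import numpy as np

def sample_rdpg(X, rng):
    """Sample a symmetric, hollow adjacency matrix with P[A_ij = 1] = <X_i, X_j>."""
    P = X @ X.T
    if P.min() < -1e-12 or P.max() > 1 + 1e-12:
        raise ValueError("inner products of latent positions must lie in [0, 1]")
    n = X.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)  # independent edges for i < j
    return (upper | upper.T).astype(int)          # symmetrize; diagonal stays zero

rng = np.random.default_rng(0)
# Hypothetical latent positions: Dirichlet rows, so all inner products lie in [0, 1].
X = rng.dirichlet(np.ones(3), size=200)
A = sample_rdpg(X, rng)

# Non-identifiability: an orthogonal W preserves every pairwise inner product.
W, _ = np.linalg.qr(rng.normal(size=(3, 3)))
same_P = np.allclose(X @ X.T, (X @ W) @ (X @ W).T)
```

Because the edge probability matrix depends on the latent positions only through their Gram matrix, any estimate of the latent positions can be meaningful only up to such an orthogonal transformation.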
The stochastic blockmodel can be framed in the context of random dot product graphs as follows.
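We say that an $n$ vertex graph $(X, A) \sim \mathrm{RDPG}(F)$ is a (positive semidefinite) stochastic blockmodel (SBM) with $K$ blocks if the distribution $F$ is a mixture of $K$ point masses,
$$F = \sum_{i=1}^{K} \pi(i)\, \delta_{\xi_i},$$
where $\pi \in (0,1)^K$ satisfies $\sum_{i} \pi(i) = 1$, and the distinct latent positions are given by $\xi = [\xi_1, \xi_2, \dots, \xi_K]^\top \in \mathbb{R}^{K \times d}$. In this case, we write $G \sim \mathrm{SBM}(n, \pi, \xi\xi^\top)$ and we refer to $\xi\xi^\top \in \mathbb{R}^{K \times K}$ as the block probability matrix of $G$. Moreover, any stochastic blockmodel graph whose block probability matrix is positive semidefinite can be formulated as a random dot product graph whose point masses are the rows of $\xi$.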
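To make the correspondence between a positive semidefinite block probability matrix and latent positions concrete, here is a minimal sketch that factors a block probability matrix through its eigendecomposition and then assigns each vertex the latent position of its block. The matrix `B`, the membership probabilities `pi`, and the helper name `latent_positions_from_B` are hypothetical and chosen only for illustration; the factorization is one of many valid ones, since any orthogonal transformation of the rows works equally well.

```python
import numpy as np

def latent_positions_from_B(B, tol=1e-10):
    """Return xi with xi @ xi.T == B for a symmetric PSD block probability matrix B."""
    evals, evecs = np.linalg.eigh(B)
    if evals.min() < -tol:
        raise ValueError("B is not positive semidefinite")
    keep = evals > tol                       # drop numerically zero directions
    return evecs[:, keep] * np.sqrt(evals[keep])

B = np.array([[0.5, 0.2],
              [0.2, 0.4]])                   # hypothetical 2-block probability matrix
xi = latent_positions_from_B(B)              # one latent position (row) per block

rng = np.random.default_rng(1)
pi = np.array([0.6, 0.4])                    # hypothetical block membership probabilities
tau = rng.choice(len(pi), size=100, p=pi)    # block label of each vertex
X = xi[tau]                                  # latent position of each vertex
```

With this construction, the edge probability between vertices `i` and `j` is `B[tau[i], tau[j]]`, which is exactly the inner product of the corresponding rows of `X`.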
Many real data networks exhibit hierarchical community structure (for social network examples, see [35, 36, 37, 38, 39, 40, 32]; for biological examples, see [4, 6, 5]). To incorporate hierarchical structure into the above RDPG and SBM framework, we first consider SBM graphs endowed with the following specific hierarchical structure.
Definition 3 (2-level Hierarchical stochastic blockmodel (HSBM)).
We say that is an instantiation of a -dimensional 2-level hierarchical stochastic blockmodel with parameters if can be written as the mixture
where satisfies , and for each , is itself a mixture of point mass distributions
where satisfies . The distinct latent positions are then given by . We then write
Simply stated, an HSBM graph is an RDPG for which the vertex set can be partitioned into subgraphs—where denotes the matrix whose rows are the latent positions characterizing the block probability matrix for subgraph —each of which is itself an SBM.
Throughout this manuscript, we will make a number of simplifying assumptions on the underlying HSBM in order to facilitate theoretical developments and ease exposition.
Assumption 1. (Affinity HSBM) We further assume that for the distinct latent positions, if we define
Simply stated, we require that within each subgraph, the connections are comparatively dense, and between two subgraphs, comparatively sparse.
Assumption 2. (Subspace structure) To simplify exposition, and to assure the condition that for and we impose additional structure on the matrix of latent positions in the HSBM. To wit, we write explicitly as
where ( being the Hadamard product) and the entries of are chosen to make the off block-diagonal elements of the corresponding edge probability matrix bounded above by the absolute constant . Moreover, to ease exposition in this 2-level setting, we will assume that for each , so that . In practice, the subspaces pertaining to the individual subgraphs need not be the same rank, and the subgraphs need not have the same number of blocks (see Section 5 for examples of and varying across subgraphs).
Note that can be viewed as a SBM graph with blocks; . However, in this paper we will consider blockmodels with statistically similar subgraphs across blocks, and in general, such models can be parameterized by far fewer than blocks. In contrast, when the graph is viewed as an RDPG, the full dimensions may be needed, because our affinity assumption necessitates a growing number of dimensions to accommodate the potentially growing number of subgraphs. Because latent positions associated to vertices in different subgraphs must exhibit near orthogonality, teasing out the maximum possible number of subgraphs for a given embedding dimension is, in essence, a cone-packing problem; while undoubtedly of interest, we do not pursue this problem further in this manuscript.
Given a graph from this model satisfying Assumptions 1 and 2, we use Algorithm 1 to uncover the hidden hierarchical structure. Furthermore, we note that Algorithm 1 can be applied to uncover hierarchical structure in any hierarchical network, regardless of HSBM model assumptions. However, our theoretical contributions are proven under HSBM model assumptions.
A key component of this algorithm is the computation of the adjacency spectral embedding , defined as follows.
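Given an adjacency matrix $A \in \{0,1\}^{n \times n}$ of a $d$-dimensional RDPG($F$), the adjacency spectral embedding (ASE) of $A$ into $\mathbb{R}^d$ is given by $\widehat{X} = U_A S_A^{1/2}$, where
$$|A| = [U_A \,|\, \tilde{U}_A]\,[S_A \oplus \tilde{S}_A]\,[U_A \,|\, \tilde{U}_A]^\top$$
is the spectral decomposition of $|A| = (A^\top A)^{1/2}$, $S_A$ is the diagonal matrix with the (ordered) $d$ largest eigenvalues of $|A|$ on its diagonal, and $U_A \in \mathbb{R}^{n \times d}$ is the matrix whose columns are the corresponding orthonormal eigenvectors of $|A|$.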
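The embedding step can be sketched in a few lines. The sketch below assumes that the embedding is built from the $d$ eigenvalues of $A$ of largest magnitude (equivalently, the $d$ largest eigenvalues of $(A^\top A)^{1/2}$) together with their eigenvectors; this is an assumption of the example, as are the function name, the two-block latent positions, and the graph size.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Embed a symmetric matrix into R^d by scaling its top-d eigenvectors
    (ranked by eigenvalue magnitude) by the square roots of those magnitudes."""
    evals, evecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(evals))[::-1][:d]
    return evecs[:, top] * np.sqrt(np.abs(evals[top]))

rng = np.random.default_rng(2)
xi = np.array([[0.7, 0.2],
               [0.2, 0.6]])                  # hypothetical latent positions, one row per block
X = xi[rng.integers(0, 2, size=300)]         # latent position of each vertex
P = X @ X.T                                  # edge probability matrix
upper = np.triu(rng.random(P.shape) < P, k=1)
A = (upper | upper.T).astype(float)          # sampled symmetric, hollow adjacency matrix

X_hat = adjacency_spectral_embedding(A, d=2)       # noisy estimate of X, up to rotation
X_exact = adjacency_spectral_embedding(P, d=2)     # noise-free sanity check
```

Applied to the noise-free matrix `P`, the embedding reproduces the Gram matrix exactly (`X_exact @ X_exact.T` equals `P` up to floating point error), which is the sense in which the latent positions are recovered only up to an orthogonal transformation.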
It is proved in [14, 41] that the adjacency spectral embedding provides a consistent estimate of the true latent positions in random dot product graphs. The key to this result is a tight concentration, in Frobenius norm, of the adjacency spectral embedding, , about the true latent positions . This bound is strengthened in , wherein the authors show tight concentration, in norm, of about . The concentration provides a significant improvement over results that employ bounds on the Frobenius norm of the residuals between the estimated and true latent positions, namely . The Frobenius norm bounds are potentially sub-optimal for subsequent inference, because one cannot rule out that a diminishing but positive proportion of the embedded points contribute disproportionately to the global error.
However, the norm concentration result in  relies on the assumption that the eigenvalues of are distinct , which is often violated in the setting of repeated motifs for an HSBM. One of the main contributions of this paper is a further strengthening of the results of : in Theorem 5, we prove that concentrates about in norm with far less restrictive assumptions on the eigenstructure of .
In this paper, if is a sequence of events, we say that occurs asymptotically almost surely if as ; more precisely, we say that occurs asymptotically almost surely if for any fixed , there exists such that if and satisfies , then is at least . The theorem below asserts that the norm of the differences between true and estimated latent positions is of a certain order asymptotically almost surely. In the appendix, we state and prove a generalization of this result in the non-dense regime.
Let where the second moment matrix is of rank . Let be the event that there exists a rotation matrix such that
where is some fixed constant. Then occurs asymptotically almost surely.
We stress that because of this bound on the norm, we have far greater control of the errors in individual rows of the residuals than possible with existing Frobenius norm bounds. One consequence of this control is that an asymptotically perfect clustering procedure for will yield an equivalent asymptotically almost surely perfect clustering of . This insight is the key to proving Lemma 6; see the appendix for full detail. A further consequence of Theorem 5, in the setting of random dot product graphs without a canonical block structure, is that one can choose a loss function with respect to which ASE followed by a suitable clustering yields optimal clusters [41, 18]. This implies that meaningful clustering can be pursued even when no canonical hierarchical structure exists.
Having successfully embedded the graph into through the adjacency spectral embedding, we next cluster the vertices of , i.e., rows of . For each , we define
to be the matrix whose rows are the rows in corresponding to the latent positions in the rows of . Our clustering algorithm proceeds as follows. With Assumptions 1 and 2, and further assuming that is known, we first build a “seed” set as follows. Initialize to be a random sampling of rows of For each , let be such that
If then add to , and remove from ; i.e.,
If then set Iterate this procedure until all rows of have been considered. We show in Proposition 19 in the appendix that is composed of exactly one row from each . Given the seed set , we then initialize clusters via for each Lastly, for , assign to if
As encapsulated in the next lemma, this procedure, summarized in Algorithm 2, yields an asymptotically perfect clustering of the rows of for HSBM’s under mild model assumptions.
Let satisfying Assumptions 1 and 2, and suppose further that . Then asymptotically almost surely,
where is the true assignment of vertices to subgraphs, and is the assignment given by our clustering procedure above.
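The seed-based procedure described above can be sketched as follows. Several details are assumptions of this example rather than statements of Algorithm 2: the seed set is updated by replacing one member of the currently most-similar seed pair whenever a row has smaller inner product with every seed than that pair has with each other, and each row is finally assigned to the seed with which it has the largest inner product. The function names, the synthetic embedded points, and the noise level are all hypothetical.

```python
import numpy as np

def select_seeds(X_hat, R, rng):
    """Greedily build R seeds that are as close to mutually orthogonal as possible."""
    n = X_hat.shape[0]
    S = X_hat[rng.choice(n, size=R, replace=False)].copy()
    for x in X_hat:
        G = S @ S.T
        np.fill_diagonal(G, -np.inf)
        u, v = np.unravel_index(np.argmax(G), G.shape)  # most similar pair of seeds
        if np.max(S @ x) < G[u, v]:                     # x is less aligned with every seed
            S[u] = x                                    # than u is with v: swap u for x
    return S

def assign_to_seeds(X_hat, S):
    """Assign each row to the seed with which it has the largest inner product."""
    return np.argmax(X_hat @ S.T, axis=1)

# Hypothetical embedded points: three subgraphs in orthogonal 2-d subspaces of R^6,
# each a two-block SBM, with unbalanced sizes and small additive noise.
rng = np.random.default_rng(3)
block = np.array([[0.8, 0.3], [0.3, 0.8]])
truth = np.repeat([0, 1, 2], [40, 60, 100])
X_hat = np.zeros((truth.size, 6))
for i, r in enumerate(truth):
    X_hat[i, 2 * r: 2 * r + 2] = block[rng.integers(2)]
X_hat += rng.normal(scale=0.01, size=X_hat.shape)

S = select_seeds(X_hat, R=3, rng=rng)
labels = assign_to_seeds(X_hat, S)
```

In this well-separated example, inner products between rows of different subgraphs are close to zero while inner products within a subgraph are bounded away from zero, so a single pass leaves one seed per subgraph and the assignment agrees with `truth` up to a relabeling of the clusters.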
Under only our “affinity assumption” (namely that ), $k$-means cannot provide a provably perfect clustering of vertices. This is a consequence of the fact that the number of clusters we seek is far less than the total number of distinct latent positions. As a notional example, consider a graph with two subgraphs, each of which is an SBM with two blocks. The representation of such a graph in terms of its latent positions is illustrated in Figure 1. We are interested in clustering the vertices into subgraphs, i.e., we want to assign the points to their corresponding cones (depicted via the shaded light blue and pink areas). If we denote by , , and the fraction of red, green, and blue colored points, respectively, then a $k$-means clustering of the colored points into two clusters might, depending on the distance between the points and , yield two clusters with cluster centroids inside the same cone, thereby assigning vertices from different subgraphs to the same cluster. That is to say, if the subgraphs’ sizes in Figure 1 are sufficiently unbalanced, then $k$-means clustering could yield a clustering in which the yellow, green, and blue colored points are assigned to one cluster, and the red colored points are assigned to another cluster. In short, $k$-means is not a subspace clustering algorithm, and the subspace and affinity assumptions made in our HSBM formulation (Assumptions 1, 2, 3 and 4) render $k$-means suboptimal for uncovering the subgraph structure in our model. Understanding the structure uncovered by $k$-means in our HSBM setting, while of interest, is beyond the scope of this manuscript.
Note that being small ensures that the subgraphs of interest, namely the ’s, lie in nearly orthogonal subspaces of . Our clustering procedure is thus similar in spirit to the subspace clustering procedure of .
In what follows, we will assume that , the number of induced SBM subgraphs in , and are known a priori. In practice, however, we often need to estimate both (prior to embedding) and (prior to clustering). To estimate the embedding dimension, we can apply singular value thresholding to a partial SCREE plot. While we can estimate the number of subgraphs via traditional techniques, i.e., by measuring the validity of the clustering provided by Algorithm 2 over a range of via silhouette width (see [44, Chapter 3]), we propose an alternate estimation procedure tuned to our algorithm. For each , we run Algorithm 2 with , and repeat this procedure times. For each , and each compute
If the true is greater than or equal to , then we expect to be small by construction. If is bigger than the true , then at least two of the vectors in would lie in the same subspace; i.e., their dot product would be large. Hence, we would expect the associated to be large. We employ standard “elbow-finding” methodologies  to find the value of for which goes from small to large, and this will be our estimate of . As Algorithm 2 has running time linear in , with a bounded number of Monte Carlo iterates, this estimation procedure also has running time linear in .
Post-clustering, a further question of interest is to determine which of those induced subgraphs are structurally similar. We define a motif as a collection of distributionally “equivalent”—in a sense that we will make precise in Definition 7—RDPG graphs. An example of a HSBM graph with blocks in motifs is presented in Figure 2.
More precisely, we define a motif—namely, an equivalence class of random graphs—as follows.
Let and . We say that and are of the same motif if there exists a unitary transformation such that .
To detect the presence of motifs among the induced subgraphs , we adopt the nonparametric test procedure of  to determine whether two RDPG graphs have the same underlying distribution. The principal result of that work is the following:
Let and be -dimensional random dot product graphs. Consider the hypothesis test
Denote by and the adjacency spectral embedding of and , respectively. Define the test statistic as follows:
where is a radial basis kernel, e.g., . Suppose that and . Then under the null hypothesis of ,
and as , where is any orthogonal matrix such that . In addition, under the alternative hypothesis of , there exists an orthogonal matrix , depending on and but independent of and , such that
and as .
Theorem 8 allows us to formulate the problem of detecting when two graphs and belong to the same motif as a hypothesis test. Furthermore, under appropriate conditions on (conditions satisfied when is a Gaussian kernel with bandwidth for fixed ), the hypothesis test is consistent for any two arbitrary but fixed distributions and , i.e., as if and only if .
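For intuition about a kernel-based two-sample statistic of this kind, the following minimal sketch computes the standard U-statistic estimate of the squared maximum mean discrepancy between two point clouds with a Gaussian kernel. It is an assumed stand-in and not necessarily the exact statistic of Theorem 8: the kernel normalization, the bandwidth `sigma`, and the function names are choices made for this example, and the sketch assumes that the two embeddings have already been brought into a common orthogonal frame, which the non-identifiability of the RDPG does not guarantee in practice.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Pairwise Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) between rows of X and Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def mmd_u_statistic(X, Y, sigma=0.5):
    """U-statistic estimate of the squared maximum mean discrepancy."""
    n, m = X.shape[0], Y.shape[0]
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))   # average over i != j
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))   # average over k != l
    return term_x - 2 * Kxy.mean() + term_y

rng = np.random.default_rng(4)
block = np.array([[0.8, 0.3], [0.3, 0.8]])   # hypothetical two-block latent positions

def draw(points, n):
    """Sample n noisy points, each centered at a uniformly chosen row of `points`."""
    idx = rng.integers(len(points), size=n)
    return points[idx] + rng.normal(scale=0.02, size=(n, points.shape[1]))

X = draw(block, 200)
Y = draw(block, 200)                          # same distribution as X
Z = draw(np.array([[0.55, 0.55]]), 200)       # a different distribution
t_same = mmd_u_statistic(X, Y)
t_diff = mmd_u_statistic(X, Z)
```

The statistic is close to zero when the two point clouds share a distribution and is bounded away from zero otherwise, which is the behavior that the hypothesis test exploits.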
We are presently working to extend results on the consistency of adjacency spectral embedding and two-sample hypothesis testing (i.e., Theorem 8 and ) from the current setting of random dot product graphs to more general random graph models, with particular attention to scale-free and small-world graphs. However, the extension of these techniques to more general random graphs is beset by intrinsic difficulties. For example, even extending motif detection to general latent position random graphs is confounded by the non-identifiability inherent to graphon estimation. Complicating matters further, there are few random graph models that are known to admit parsimonious sufficient statistics suitable for subsequent classical estimation procedures.
3 Detecting hierarchical structure in the HSBM
Combining the above inference procedures, our algorithm, as depicted in Algorithm 1, proceeds as follows. We first cluster the adjacency spectral embedding of the graph to obtain the first-order, large-scale block memberships. We then employ the nonparametric test procedure outlined in  to determine similar induced subgraphs (motifs) associated with these blocks. We iterate this process to obtain increasingly refined estimates of the overall graph structure. In Step 6 of Algorithm 1, we recurse on a representative subgraph (e.g., the largest subgraph) within each motif; embedding the subgraph into (not ) as Step 1 of Algorithm 1. Ideally, we would leverage the full collection of subgraphs from each motif in this recursion step. However, the subgraphs within a motif may be of differing orders and meaningfully averaging or aligning them (see ) requires novel regularization which, though interesting, is beyond the scope of the present manuscript.
Before presenting our main theorem in the 2-level setting, Theorem 9, we illustrate the steps of our method in the analysis of the 2-level synthetic HSBM graph depicted in Figure 2. The graph has 4100 vertices belonging to 8 different blocks of size with three distinct motifs. The block probability matrices corresponding to these motifs are given by
and the inter-block edge probability is bounded by .
The algorithm does indeed detect three motifs, as depicted in Figure 3. The figure presents a heat map depiction of , and the similarity of the communities is represented on the spectrum between white and red, with white representing highly similar communities and red representing highly dissimilar communities. From the figure, we correctly see there are three distinct motif communities, , , and , corresponding to stochastic blockmodels with the following block probability matrices
We note that even though the vertices in the HSBM are perfectly clustered into the subgraphs (i.e., for , for all ), the actual ’s differ slightly from their estimates, but this difference is quite small.
The performance of Algorithm 1 in this simulation setting can be seen as a consequence of Theorem 9 below, in which we prove that under modest assumptions on an underlying 2-level hierarchical stochastic blockmodel, Algorithm 1 yields a consistent estimate of the dissimilarity matrix.
Suppose is a hierarchical stochastic blockmodel satisfying Assumptions 1 and 2. Suppose that is fixed and the correspond to different motifs, i.e., the set has distinct elements. Given the assumptions of Theorem 5 and Lemma 6, the procedure in Algorithm 1 yields perfect estimates of and of asymptotically almost surely.
With assumptions as in Theorem 9, any level test using corresponds to an at most level test using . In this case, asymptotically almost surely, the $p$-values of entries of corresponding to different motifs will all converge to $0$ as , and the $p$-values of entries of corresponding to the same motifs will all be bounded away from $0$ as . This immediately leads to the following corollary.
With assumptions as in Theorem 9, clustering the matrix of $p$-values associated with yields a consistent clustering of into motifs.
Theorem 9 provides a proof of concept inference result for our algorithm for graphs with simple hierarchical structure, and we will next extend our setting and theory to a more complex hierarchical setting.
4 Multilevel HSBM
In many real data applications (see for example, Section 5), the hierarchical structure of the graph extends beyond two levels. We now extend the HSBM model of Definition 3—which, for ease of exposition, was initially presented in the 2-level hierarchical setting—to incorporate more general hierarchical structure. With the HSBM of Definition 3 being a 2-level HSBM (or 2-HSBM), we inductively define an -level HSBM (or -HSBM) for as follows.
Definition 11 (-level Hierarchical stochastic blockmodel, -HSBM).
We say that is an instantiation of a -dimensional -level HSBM if the distribution can be written as
has support on the rows of where for each , has support on the rows of .
For each , an RDPG graph drawn according to RDPG() is an -level HSBM with with at least one such .
Simply stated, an -level HSBM graph is an RDPG (in fact, it is an SBM with potentially many more than blocks) for which the vertex set can be partitioned into subgraphs—where denotes the matrix whose rows are the latent positions characterizing the block probability matrix for subgraph —each of which is itself an -level HSBM with with at least one such .
As in the 2-level case, to ease notation and facilitate theoretical developments, in this paper we will make the following assumptions on the more general -level HSBM. Letting be an instantiation of a -dimensional -level HSBM, we further assume:
Assumption 3: (Multilevel affinity) For each the constants
Assumption 4: (Subspace structure) For each , has support on the rows of , which collectively satisfy
where ( again being the Hadamard product). For each , an RDPG graph drawn according to RDPG() where has support on the rows of and is an at most -level HSBM (with at least one subgraph being an -level HSBM). In addition, we assume similar subspace structure recursively at every level of the hierarchy.
As was the case with Assumption 2, Assumption 4 plays a crucial role in our algorithmic development. Indeed, under this assumption we can view successive levels of the hierarchical HSBM as RDPG’s in successively smaller dimensions. It is these RDPG() which we embed in Step 3 of Algorithm 1, and we embed them into the smaller rather than . For example, suppose is an -level HSBM, and has subgraphs each of which is an -level HSBM. Furthermore suppose that each of these subgraphs itself has subgraphs each of which is an -level HSBM, and so on. If the SBM’s at the lowest level are all -dimensional, then can be viewed as an -dimensional RDPG. In practice, to avoid this curse of dimensionality, we could embed each subgraph at level into dimensions and still practically uncover the subgraph (but not the motif!) structure. This assumption also reinforces the affinity structure of the subgraphs, which is a key component of our theoretical developments.
In the -level HSBM setting, we can provide theoretical results on the consistency of our motif detection procedure, Algorithm 1. As it happens, in this simpler setting, the algorithm terminates after Step 6; that is, after clustering the induced subgraphs into motifs. There is no further recursion on these motifs. We next extend Theorem 9 to the multi-level HSBM setting as follows. In the following theorem, for an RDPG , let be the ASE of and let be the true latent positions of ; i.e.,
With notation as above, let be an instantiation of a -dimensional, -level HSBM with fixed. Given Assumptions 3 and 4, further suppose that for each every -level HSBM subgraph, , of satisfies
It follows then that for all such , the procedure in Algorithm 1 simultaneously yields perfect estimates , of asymptotically almost surely. It follows then that for for each such , yield consistent estimates of , which allows for the asymptotically almost surely perfect detection of the motifs.
We note here that in Theorem 12, and the total number of subgraphs at each level of the hierarchy are fixed with respect to . As increases, the size of each subgraph at each level is also increasing (linearly in ), and therefore any separation between and at level will be sufficient to perfectly separate the subgraphs asymptotically almost surely. The proof of the above theorem then follows immediately from Theorem 9 and induction on , and so is omitted.
Theorem 12 states that, under modest assumptions, Algorithm 1 yields perfect motif detection and classification at every level in the hierarchy. From a technical viewpoint, this theorem relies on a norm bound on the residuals of about (see 15), which is crucial to the perfect recovery of precisely the large-scale subgraphs. This bound, in turn, only guarantees this perfect recovery of when the average degree is at least of order . We surmise that for subsequent inference tasks that are more robust to the identification of the large-scale subgraphs, results can be established in sparser regimes.
Moreover, when applying this procedure to graphs which violate our HSBM model assumptions (for example, when applying the procedure to real data), we encounter error propagation inherent to recursive procedures. In Algorithm 1, there are three main sources of error propagation: errorful clusterings; the effect of these errorfully-inferred subgraphs on ; and subsequent clustering and analysis within these errorful subgraphs. We briefly address these three error sources below.
First, finite-sample clustering is inherently errorful, and misclustered vertices contribute to degradation of power in the motif detection test statistic. While we prove the asymptotic consistency of our clustering procedure in Lemma 6, there is a plethora of other graph clustering procedures we might employ in the small-sample setting, including modularity-based methods such as Louvain and fastgreedy, and random walk-based methods such as walktrap. Understanding the impact that the particular clustering procedure has on subsequent motif detection is crucial, as is characterizing the common properties of misclustered vertices; e.g., in a stochastic block model, are misclustered vertices overwhelmingly likely to be low-degree?
Second, although testing based on is asymptotically robust to a modest number of misclustered vertices, namely vertices, the finite-sample robustness of this test statistic remains open. Lastly, we need to understand the robustness properties of further clustering these errorfully observed motifs. In , the authors propose a model for errorfully observed random graphs, and study the subsequent impact of the graph error on vertex classification. Adapting their model and methodology to the framework of spectral clustering will be essential for understanding the robustness properties of our algorithm, and is the subject of present research.
We next apply our algorithm to two real data networks: the Drosophila connectome from  and the Friendster social network.
5.1 Motif detection in the Drosophila Connectome
The cortical column conjecture suggests that neurons are connected in a graph which exhibits motifs representing repeated processing modules. (Note that we understand that there is controversy surrounding the definition and even the existence of “cortical columns”; our consideration includes “generic” recurring circuit motifs, and is not limited to the canonical Mountcastle-style column .) While the full cortical connectome necessary to rigorously test this conjecture is not yet available even on the scale of fly brains, in  the authors were able to construct a portion of the Drosophila fly medulla connectome which exhibits columnar structure.
This graph is constructed by first constructing the full connectome between 379 named neurons (believed to be a single column) and then sparsely reconstructing the connectome between and within surrounding columns via a semi-automated procedure. The resulting connectome (available from the open connectome project, http://openconnecto.me/graph-services/download/; see fly) has 1748 vertices in its largest connected component, the adjacency matrix of which is visualized in the upper left of Figure 5. We visualize our Algorithm 1 run on this graph in Figure 5. First we embed the graph into ( chosen according to the singular value thresholding method applied to a partial SCREE plot; see Remark 3) and, to alleviate sparsity concerns, project the embedding onto the sphere. The resulting points are then clustered into clusters ( chosen as in Remark 3) of sizes , , and vertices. These clusters are displayed in the upper right of Figure 5. We then compute the corresponding matrix after re-embedding each of these clusters (bottom of Figure 5). In the heat map representation of , the similarity of the communities is represented on the spectrum between white and red, with white representing highly similar communities and red representing highly dissimilar communities. For example, the bootstrapped $p$-value (from bootstrap samples) associated with is