# On clustering network-valued data

Community detection, which focuses on clustering nodes or detecting communities in (mostly) a single network, is a problem of considerable practical interest and has received a great deal of attention in the research community. While being able to cluster within a network is important, there is an emerging need to cluster multiple networks. This is largely motivated by the routine collection of network data generated from potentially different populations. These networks may or may not have node correspondence. When node correspondence is present, we cluster networks by summarizing each network by its graphon estimate, whereas when node correspondence is absent, we propose a novel solution that associates a computationally feasible feature vector to each network based on traces of powers of the adjacency matrix. We illustrate our methods on both simulated and real data sets, and provide theoretical justification in terms of consistency.


## 1 Introduction

A network, which models interactions or communications among a set of agents or nodes, is arguably one of the most common and important representations for modern complex data. Networks are ubiquitous in many scientific fields, ranging from computer networks, brain networks and biological networks to social networks, co-authorship networks and many more. Over the past few decades, great advances have been made in developing models and methodologies for inference on networks. There is a range of rigorous models for networks, starting from the relatively simple Erdős–Rényi model [13], through stochastic blockmodels and their extensions [16, 18, 5], to infinite-dimensional graphons [29, 14]. These models are often used for community detection, i.e. clustering the nodes in a network. Various community detection algorithms have been proposed, including modularity-based methods [22], spectral methods [26], likelihood-based methods [7, 12, 6, 3], and optimization-based approaches such as those based on semidefinite programming [4].

The majority of the work in the community detection literature, including the works mentioned above, focuses on finding communities among the nodes of a single network. While this remains a very important problem with many open questions, there is an emerging need to detect clusters among multiple network-valued objects, where a network itself is the fundamental unit of data. This is largely motivated by the routine collection of populations or subpopulations of network-valued data objects. Technological advances and the explosion of complex data in many domains have made this a fairly common practice.

There has been some notable work on graph kernels in the computer science literature [28, 27], where the goal is to efficiently compute different types of kernel similarity matrices, or their approximations, between networks. In contrast, we ask the following statistical question: can we cluster networks consistently from a mixture of graphons, 1) when there is node correspondence and 2) when there is not? The first situation arises when one has a dynamic network, or multiple instantiations of a network, over time. If one thinks of these as random samples from a mixture of graphons, can we cluster them? Note that this is not answered by methods which featurize graphs using different statistics. For the first question, our work proposes a simple and general framework: view the data as coming from a mixture model on graphons, first obtain a graphon estimate of each network, construct a distance matrix based on these graphon estimates, and then perform spectral clustering on the resulting distance matrix. We call this algorithm Network Clustering based on Graphon Estimates (NCGE).

The second situation arises when one is interested in global properties of a network, a setting closer to that of graph kernels. Say we have co-authorship networks from computer science and from high energy physics: are these different types of networks? There has been a lot of empirical and algorithmic work on featurizing networks or computing kernels between networks, but most of these features require expensive computation. We propose a simple feature based on traces of powers of the adjacency matrix, which is very cheap to compute since it involves only matrix multiplication. We cluster these features and call this method Network Clustering based on Log Moments (NCLM).

We provide theoretical guarantees for our algorithms in terms of consistency, in addition to extensive simulations and real data examples. The simulation results show that our algorithms clearly outperform the naive yet popular method of clustering (vectorized) adjacency matrices in various settings. We also show that, in the absence of node correspondence, our algorithm is consistently better and faster than methods which featurize networks with various global statistics or graphlet kernels. Finally, we demonstrate performance on a variety of real-world networks, such as separating co-authorship networks from different domains and ego networks.

The rest of the paper is organized as follows. In Section 2 we briefly describe graphon-estimation methods and other related work. In Section 3 we formally describe our setup and introduce our algorithms. Section 4 contains theory for these algorithms. In Section 5 we provide simulations and real data examples. We conclude with a discussion in Section 6. Proofs of the main results and some further details are relegated to the appendices.

## 2 Related work

The focus of this paper is on 1) clustering networks that have node correspondence, based on estimating the underlying graphon, and 2) clustering networks without node correspondence, based on global properties of the networks. We present the related work in two parts: first we cite two methods of obtaining graphon estimates, which we will use in our first algorithm; second, we discuss existing methods that summarize a network using different statistics and compare these summaries to obtain a measure of similarity.

A prominent graphon estimator is the Universal Singular Value Thresholding (USVT) estimator proposed by [10]. The main idea behind USVT is to approximate the rank of the population matrix by thresholding the singular values of the observed matrix at a universal threshold, and then compute an approximation of the population matrix using the top singular values and vectors. A recent work [31] proposes a novel, statistically consistent and computationally efficient approach for estimating the link probability matrix by neighborhood smoothing. For large networks, USVT is typically much more scalable than the neighborhood-smoothing approach. There are several other methods for graphon estimation, e.g., by fitting a stochastic blockmodel [25]. These methods can also be used in our algorithm.

In [11], a graph-based method for change-point detection is proposed, where an independent sequence of observations is considered. These are generated i.i.d. under the null hypothesis, whereas under the alternative the underlying distribution changes after a change point; the goal is to find this change point. The observations can be high-dimensional vectors or even networks, with the latter bearing some resemblance to our first framework. Essentially, the authors of [11] present a statistically consistent clustering of the observations into "past" and "future". We remark that our graphon-based clustering algorithm suggests an alternative method for change-point detection, namely looking at the second eigenvector of the distance matrix between estimated graphons. Another related work is [15], which aims to extend classical large-sample theory to network-valued objects.

For comparing global properties of networks, there have been many interesting works that featurize networks based on global features [2]. In the computer science literature, graph kernels have gained much attention [28, 27]; in these works the goal is to efficiently compute different types of kernel similarity matrices, or their approximations, between networks.

## 3 A framework for clustering networks

Let $G$ be a binary random network or graph with $n$ nodes. Denote by $A$ its adjacency matrix, which is an $n$ by $n$ symmetric matrix with binary entries. That is, $A_{ij} \in \{0, 1\}$, where $A_{ij} = 1$ if there is an observed edge between nodes $i$ and $j$, and $A_{ij} = 0$ otherwise. All the diagonal elements of $A$ are structured to be zero (i.e. $A_{ii} = 0$). We assume the following random Bernoulli model:

$$A_{ij} \mid P_{ij} \sim \mathrm{Bernoulli}(P_{ij}), \qquad i < j, \qquad (1)$$

where $P_{ij}$ is the link probability between nodes $i$ and $j$. We denote the link probability matrix by $P = (P_{ij})$. The edge probabilities are often modeled using so-called graphons. A graphon is a nonnegative, bounded, measurable symmetric function $f : [0, 1]^2 \to [0, 1]$. Given such an $f$, one can use the model

$$P_{ij} = f(\xi_i, \xi_j), \qquad (2)$$

where $\xi_1, \ldots, \xi_n$ are i.i.d. uniform random variables on $[0, 1]$. In fact, any (infinite) exchangeable network arises in this way, by the Aldous–Hoover representation [1, 17].

Intuitively speaking, one wishes to model a discrete network using some continuous object $f$. Our current work focuses on the problem of clustering networks. Unlike in the traditional setup, where one observes a single network (with a potentially growing number of nodes) and the goal is often to cluster the nodes, here we observe multiple networks and are interested in clustering these networks viewed as fundamental data units.
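The generative model (1)–(2) is straightforward to simulate. Below is a minimal sketch; the particular graphon $f$ used here is an illustrative smooth choice, not one from the paper.

```python
import numpy as np

def sample_from_graphon(f, n, rng):
    """Draw xi_i ~ Uniform[0,1] i.i.d., set P_ij = f(xi_i, xi_j), and sample
    A_ij ~ Bernoulli(P_ij) independently for i < j; A is symmetric with a
    zero diagonal."""
    xi = rng.random(n)
    P = f(xi[:, None], xi[None, :])
    A = (rng.random((n, n)) < P).astype(int)
    A = np.triu(A, 1)          # keep the upper triangle (i < j) only
    return A + A.T             # symmetrize; the diagonal stays zero

# an illustrative smooth graphon
f = lambda x, y: 1.0 / (1.0 + np.exp(-(x + y)))
A = sample_from_graphon(f, 200, np.random.default_rng(1))
```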

### 3.1 Node correspondence present

A simple and natural model for this is what we call, for obvious reasons, the graphon mixture model: there are only $K$ (fixed) underlying graphons, giving rise to link probability matrices $\Pi_1, \ldots, \Pi_K$, and we observe networks sampled i.i.d. from the mixture model

$$\mathbb{P}_{\mathrm{mix}}(A) = \sum_{i=1}^{K} q_i\, \mathbb{P}_{\Pi_i}(A), \qquad (3)$$

where the $q_i$'s are the mixing proportions and $\mathbb{P}_{\Pi_i}(A)$ is the probability of observing the adjacency matrix $A$ when the link probability matrix is $\Pi_i$. Consider $n$ nodes, and $T$ independent networks $A_1, \ldots, A_T$ defining edges between these nodes. We propose the following simple and general algorithm (Algorithm 1) for clustering them:

From now on we denote the above algorithm with its different graphon-estimation ('blackbox') choices as follows: the algorithm with USVT as the blackbox will be denoted CL-USVT, and the one with the neighborhood-smoothing method will be denoted CL-NBS. We will compare these two algorithms with the CL-NAIVE method, which does not estimate the underlying graphon but instead uses the vectorized binary string representation of each adjacency matrix and clusters those (in the spirit of [11]).

### 3.2 Node correspondence absent

We will use certain graph statistics to construct a feature vector. The basic statistics we choose are traces of powers of the adjacency matrix, suitably normalized; we call them graph moments:

$$m_k(A) = \mathrm{trace}\big((A/n)^k\big). \qquad (4)$$

These statistics are related to various path/subgraph counts. For example, $m_2(A)$ is the normalized count of the total number of edges, and $m_3(A)$ is the normalized triangle count of $A$. Higher-order moments are counts of closed walks (or directed circuits). Figure 1 shows the circuits corresponding to the first few moments.

The reason we use graph moments instead of subgraph counts is that the latter are quite difficult to compute, and present-day algorithms work only for subgraphs up to a small size. On the contrary, graph moments are easy to compute as they only involve matrix multiplication.
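Computing the moments in (4) takes only repeated matrix products, as the following sketch shows (the function name is ours):

```python
import numpy as np

def graph_moments(A, J):
    """m_k(A) = trace((A/n)^k) for k = 1..J, via repeated multiplication,
    so the cost is J matrix products rather than any subgraph enumeration."""
    n = A.shape[0]
    B = A / n
    M = np.eye(n)
    out = []
    for _ in range(J):
        M = M @ B
        out.append(np.trace(M))
    return np.array(out)
```

For the triangle $K_3$, for instance, $m_2 = 6/9$ (twice the edge count over $n^2$) and $m_3 = 6/27$ (the six closed 3-walks over $n^3$).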

While it may seem that this is essentially the same as comparing the eigenspectrum, it is not clear how many eigenvalues one should use. Even if one could estimate the number of large eigenvalues using a USVT-type estimator, that number would differ across networks. The trace accounts for the relative magnitudes of the eigenvalues naturally. In fact, we tried (see Section 5) using the top few eigenvalues as the sole features, but the results were not as satisfactory as using the graph moments.

We now present our second algorithm (Algorithm 2), Network Clustering with Log Moments (NCLM). In step 1, for some positive integer $J$, we compute the first $J$ graph moments of $A$; our feature map $g_J(A)$ is the vector of their logarithms. For step 2, we use the Euclidean norm, i.e. $d(A, A') = \|g_J(A) - g_J(A')\|_2$.

Note: The rationale behind taking logarithms of the graph moments is that, if we have two graphs with the same degree density but different sizes, the degree density should not play any role in the distance (this is necessary because otherwise the degree density would subdue any other difference). The parameter $J$ counts, in some sense, the effective number of eigenvalues we are using.
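A minimal sketch of the NCLM feature map and distance matrix follows. Since $\mathrm{trace}(A) = 0$ for a simple graph, this sketch takes moments of order $k = 2, \ldots, J+1$; that indexing, like the function names, is an implementation choice of ours rather than the paper's exact specification.

```python
import numpy as np

def g_J(A, J):
    """Log-moment feature vector, using moments of order k = 2..J+1
    (trace(A) = 0 for simple graphs, so the first moment is skipped here)."""
    n = A.shape[0]
    B = A / n
    M = B @ B
    feats = []
    for _ in range(J):
        feats.append(np.trace(M))
        M = M @ B
    return np.log(np.array(feats))

def nclm_distance_matrix(networks, J=5):
    """Pairwise Euclidean distances between log-moment features."""
    G = np.array([g_J(A, J) for A in networks])
    return np.linalg.norm(G[:, None] - G[None], axis=2)
```

Spectral clustering on a kernel built from this distance matrix then yields the final labels.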

## 4 Theory

We will only mention our main results and discuss some of the consequences here. All the proofs and further details can be found in the appendix, in Section A.

### 4.1 Results on NCGE

We can think of $\hat{D}$, the matrix of pairwise distances between the estimated graphons, as estimating the population distance matrix $D$ between the underlying link probability matrices.

###### Theorem 4.1.

Suppose $D$ has rank $K$. Let $\hat{V}$ (resp. $V$) be the $T \times K$ matrix whose columns are the leading eigenvectors (corresponding to the largest-in-magnitude eigenvalues) of $\hat{D}$ (resp. $D$). Let $\gamma$ be the $K$-th largest eigenvalue of $D$ in magnitude, i.e. its smallest nonzero eigenvalue in magnitude. Then there exists an orthogonal matrix $\hat{O}$ such that

$$\|\hat{V}\hat{O} - V\|_F^2 \le \frac{64\, T}{\gamma^2} \sum_i \|\hat{P}_i - P_i\|_F^2.$$
###### Corollary 4.2.

Assume that for some constants $C_i > 0$ and absolute constants $\alpha > 0$, $\beta \ge 0$, the following holds for each $i$:

$$\frac{\|\hat{P}_i - P_i\|_F^2}{n^2} \le C_i\, n^{-\alpha} (\log n)^{\beta}, \qquad (5)$$

either in expectation or with high probability. Then, in expectation or with high probability respectively, we have

$$\|\hat{V}\hat{O} - V\|_F^2 \le \frac{64\, C_T\, T^2\, n^{2-\alpha} (\log n)^{\beta}}{\gamma^2}, \qquad (6)$$

where $C_T = \max_i C_i$. (If there are $K$ underlying graphons, with $K$ fixed and not growing with $T$, the constant $C_T$ does not depend on $T$.) Table 1 reports values of $\alpha$ and $\beta$ for various graphon estimation procedures, under assumptions on the underlying graphons that are described in the appendix, Section A.

While it is hard to obtain an explicit bound on $\gamma$ in general, let us consider a simple equal-weight mixture of two graphons to illustrate the relationship between $\gamma$ and the separation between the graphons. Let the (normalized) distance between the population graphons be $d := \|\Pi_1 - \Pi_2\|_F / n$. Then the population distance matrix is $D = nd\,(J_T - SS^\top)$, where $J_T$ is the $T \times T$ matrix of all ones, and the $a$-th row of the binary membership matrix $S$ has a single one at position $k$ if network $a$ is sampled from $\Pi_k$. The nonzero eigenvalues of this matrix are $\pm\, ndT/2$, so in this case $\gamma = ndT/2$. As a result, (6) becomes

$$\|\hat{V}\hat{O} - V\|_F^2 \le \frac{256\, C_T\, n^{-\alpha} (\log n)^{\beta}}{d^2}. \qquad (7)$$

Let us look at a more specific case of blockmodels with the same number $m$ of equal-sized clusters, to gain some insight into $d$. Let $Z$ be a binary membership matrix with $Z_{ik} = 1$ if node $i$ within a blockmodel comes from cluster $k$. Consider two blockmodels $\Pi_1 = Z B_1 Z^\top$ and $\Pi_2 = Z B_2 Z^\top$ with $B_1 = (p - q) I_m + q J_m$ and $B_2 = (p' - q') I_m + q' J_m$, where $I_m$ is the identity matrix of order $m$ and $J_m$ is the $m \times m$ matrix of all ones (here the only difference between the models comes from the link formation probabilities within/between blocks, the blocks remaining the same). In this case

$$d^2 = \frac{\|\Pi_1 - \Pi_2\|_F^2}{n^2} = \frac{1}{m}(p - p')^2 + \Big(1 - \frac{1}{m}\Big)(q - q')^2.$$

The bound (6) can be turned into a bound on the proportion of "misclustered" networks, defined appropriately. There are several ways to define misclustered nodes in the context of community detection in stochastic blockmodels that are easy to analyze with spectral clustering (see, e.g., [26, 19]); these definitions work in our context too. For example, if we use Definition 4 of [26] and denote by $\mathcal{M}$ the set of misclustered networks, then from the proof of their Theorem 1 we have

$$|\mathcal{M}| \le 8\, m_T\, \|\hat{V}\hat{O} - V\|_F^2,$$

where $m_T$ is the maximum number of networks coming from any one of the graphons.

### 4.2 Results on NCLM

We first establish concentration of the graph moments. The proof uses Talagrand's concentration inequality, which requires additional results on Lipschitz continuity and convexity; these are obtained by decomposing the moment into a linear combination of convex Lipschitz functions.

###### Theorem 4.3 (Concentration of moments).

Let $A$ be the adjacency matrix of an inhomogeneous random graph with link-probability matrix $P$, and let $\psi_k(A)$ be a suitably normalized version of the $k$-th graph moment (the precise scaling is given in the appendix, Section A). Then for any $t > 4\sqrt{2}$,

$$\mathbb{P}\big(|\psi_k(A) - \mathbb{E}\,\psi_k(A)| > t\big) \le 4 \exp\big(-(t - 4\sqrt{2})^2/16\big).$$

As a consequence, we can show that $g_J(A)$ concentrates around its population counterpart $\bar{g}_J(A)$.

###### Theorem 4.4 (Concentration of $g_J(A)$).

Let $\bar{g}_J(A)$ be the population counterpart of $g_J(A)$, with constants $C_1, C_2 > 0$ and conditions on $\delta$ and $\rho$ as specified in the appendix, Section A. Then for any such $\delta$ and $\rho$ we have

$$\mathbb{P}\Big(\|g_J(A) - \bar{g}_J(A)\| \ge \delta\, J^{3/2} \log(1/\rho)\Big) \le J\, C_1\, e^{-C_2\, n^2 \rho^{2J}}.$$

We expect that $\bar{g}_J$ will be a good population-level summary for many models. In general it is hard to show an explicit separation result for $\bar{g}_J$; however, in simple models we can do explicit computations to show separation. For example, in a two-parameter blockmodel with parameters $(p, q)$ and equal block sizes, the first few moments have closed-form expressions in $p$ and $q$, and so on. Thus, if two such blockmodels differ in $(p, q)$, then $\bar{g}_J$ should be able to distinguish between them. Note: After this paper was submitted to NIPS 2017, we learned of a concurrent work [21] that provides a topological/combinatorial perspective on the expected graph moments. Theorem 1 in [21] shows that, under some mild assumptions on the model (satisfied, for example, by generalized random graphs with bounded kernels as long as the average degree grows to infinity), the expected number of closed $k$-walks is asymptotic to the expected number of closed $k$-walks that trace out a $k$-cycle plus the expected number that trace out a tree. For even $k$, if the degree grows fast enough, $k$-cycles tend to dominate, whereas for sparser graphs trees tend to dominate. From this and our concentration results, we can expect NCLM to be able to tell apart graphs which differ in the counts of these simpler closed $k$-walks. Incidentally, the authors of [21] also show that the expected count of closed non-backtracking walks of length $k$ is dominated by walks tracing out $k$-cycles. Thus if one uses counts of closed non-backtracking $k$-walks (i.e. moments of the non-backtracking matrix) instead of just closed $k$-walks as features, one would expect similar performance on denser networks, but in sparser settings this may lead to improvements because of the absence of the non-informative trees in the lower-order even moments.
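The concentration phenomenon above is easy to check numerically. The following sketch, under illustrative parameters of our choosing (an Erdős–Rényi graph; moments starting at $k = 2$ since $\mathrm{trace}(A) = 0$), shows that the log-moment coordinates barely fluctuate across independent replicates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, J = 300, 0.3, 4

def log_moments(A, J):
    """log m_k(A) for k = 2..J+1 (trace(A) = 0 for simple graphs)."""
    n = A.shape[0]
    B = A / n
    M = B @ B
    out = []
    for _ in range(J):
        out.append(np.trace(M))
        M = M @ B
    return np.log(np.array(out))

feats = []
for _ in range(30):
    U = rng.random((n, n))
    A = (np.triu(U, 1) < p).astype(float)
    feats.append(log_moments(A + A.T, J))
feats = np.array(feats)
spread = feats.std(axis=0)  # across-replicate fluctuation of each coordinate
```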

## 5 Simulation study and data analysis

In this section we describe the results of our experiments with simulated and real data to evaluate the performance of NCGE and NCLM. We measure performance in terms of clustering error, which is the minimum normalized Hamming distance between the estimated label vector and all permutations of the true label assignment. Clustering accuracy is one minus clustering error.
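The error metric just described can be sketched as follows (the brute-force minimization over label permutations is fine for the small numbers of clusters used here):

```python
import numpy as np
from itertools import permutations

def clustering_error(est, truth):
    """Minimum normalized Hamming distance between the estimated labels and
    all relabelings (permutations) of the true assignment."""
    est, truth = np.asarray(est), np.asarray(truth)
    labels = np.unique(np.concatenate([est, truth]))
    best = 1.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        relabeled = np.array([mapping[e] for e in est])
        best = min(best, float(np.mean(relabeled != truth)))
    return best
```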

### 5.1 Node correspondence present

We provide two simulated data experiments for clustering networks with node correspondence (code used in this paper is publicly available at https://github.com/soumendu041/clustering-network-valued-data). In each experiment, twenty 150-node networks were generated from a mixture of two graphons, 13 networks from the first and the other 7 from the second. We also used a scalar multiplier with the graphons to ensure that the networks are not too dense; the average degree in all these experiments was around 20–25. We report averages with error bars over a few random runs.

First we generate a mixture of graphons from two blockmodels whose link probability matrices differ by a multiplicative perturbation $\epsilon$, and measure clustering accuracy as $\epsilon$ is increased. We compare CL-USVT, CL-NBS and CL-NAIVE; the results are summarized in Figure 2(A). We observe two things. First, CL-USVT and CL-NBS distinguish the graphons better as $\epsilon$ increases (as the theory suggests). Second, the naive approach does not do a good job even when $\epsilon$ increases.

In the second simulation, we generate the networks from two smooth graphons, one of which corresponds to graphon 3 in Table 1 of [31]. As seen from Figure 2(B), here also CL-USVT and CL-NBS outperform the naive algorithm by a huge margin. Moreover, CL-NBS is consistently better than CL-USVT, perhaps because our experiments use somewhat sparse networks, where USVT is known to struggle.

### 5.2 Node correspondence absent

We show the efficacy of our approach via two sets of experiments, comparing our log-moment based method NCLM with three other methods. The first is graphlet kernels [27] with graphlets of size 3, 4 and 5, denoted by GK3, GK4 and GK5 respectively. In the second method, denoted GraphStats, we use six network-based statistics to summarize each graph: the algebraic connectivity, the local and global clustering coefficients [24], the distance distribution [20] over a fixed number of hops, the Pearson correlation coefficient [23] and the rich-club metric [32]. We also compare against graphs summarized by their top eigenvalues (TopEig). These are detailed in the appendix, Section B.

For each distance matrix $D$ that we compute with NCLM, GraphStats and TopEig, we calculate a kernel similarity matrix whose bandwidth $\sigma$ is learned by picking the value within a range which maximizes the relative eigengap between the $K$-th and $(K+1)$-th eigenvalues. It would be interesting to have a data-dependent range for $\sigma$; we are currently working on cross-validating the range using link prediction accuracy on held-out edges.
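The bandwidth selection step can be sketched as follows. The Gaussian kernel form used here is an illustrative choice on our part, as is the small numerical guard in the denominator; the selection criterion is the relative eigengap described above.

```python
import numpy as np

def pick_similarity(D, K, sigmas):
    """For each candidate bandwidth, form a Gaussian-kernel similarity matrix
    (an illustrative kernel choice) and keep the one maximizing the relative
    eigengap (lambda_K - lambda_{K+1}) / lambda_{K+1}."""
    best = (-np.inf, None, None)
    for sigma in sigmas:
        S = np.exp(-(D / sigma) ** 2)
        ev = np.sort(np.abs(np.linalg.eigvalsh(S)))[::-1]
        gap = (ev[K - 1] - ev[K]) / (ev[K] + 1e-12)
        if gap > best[0]:
            best = (gap, sigma, S)
    return best  # (relative eigengap, bandwidth, similarity matrix)
```

Spectral clustering on the selected similarity matrix then proceeds as in the previous section.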

For each similarity matrix we calculate the top $K$ eigenvectors and run $k$-means on them to obtain the final clustering. We use $K$ equal to the number of ground-truth clusters; however, as we will see later in this subsection, for GK3, GK4 and GK5 we had to use fewer eigenvectors, which boosted their clustering accuracy.

First we construct four parameter settings for the two-parameter blockmodel (also known as the planted partition model). The first two settings differ only in the density, the second two differ in the within- and between-cluster probabilities, and the first pair differs from the second pair in a further parameter. For each parameter setting we generate two sets of 20 graphs, with two different network sizes.

For choosing $J$, we calculate the moments up to a large order, compute a kernel similarity matrix for each choice of $J$, and pick the one with the largest relative eigengap between the $K$-th and $(K+1)$-th eigenvalues. In Figure 3 we plot this measure of separation against the value of $J$. We see that the eigengap increases and then levels off after a point. However, as $J$ increases, the computation time also increases. We report the accuracy for the selected $J$; a smaller choice of $J$ already returns the same clustering, in 48 seconds.

We see that NCLM performs best. For GK3, GK4 and GK5, if one instead uses only the top two eigenvectors and clusters those into 4 clusters (since there are four parameter settings), the errors decrease considerably. This means that for clustering one needs to estimate the effective rank of the graphlet kernels as well. TopEig performs better than GraphStats, which has trouble separating some of the parameter settings. Note: Intuitively one would expect that, if there is node correspondence between the graphs, clustering based on graphon estimates would work better, because it aims to estimate the underlying probabilistic model for comparison. However, in our experiments we found that a properly tuned NCLM matched the performance of NCGE, probably because a properly tuned NCLM captures the global features that distinguish two graphons. We leave a theoretical comparison of the two for future work.

### 5.3 Real Networks

We cluster about fifty real-world networks: 11 co-authorship networks among about 15,000 researchers from the High Energy Physics corpus of the arXiv, 11 co-authorship networks with about 21,000 nodes from Citeseer (papers which had "machine learning" in their abstracts), 17 co-authorship networks (each with about 3,000 nodes) from the NIPS conference, and finally 10 Facebook ego networks. Each co-authorship network is dynamic, i.e. a node corresponds to an author in that corpus and this node index is preserved across the different networks over time. The ego networks are different in that sense, since each is obtained by looking at the subgraph of Facebook induced by the neighbors of a given central or "ego" node. The sizes of these networks vary between 350 and 4,000 nodes.

Table 3 summarizes the performance of the different algorithms and their running times for computing distances between the graphs. We use the different sources of networks as labels, i.e. HEP-Th is one cluster, etc. We explore different choices of $J$ and see that the best performance comes from NCLM, followed closely by GraphStats. TopEig (with the number of eigenvalues chosen where the eigenspectra of the larger networks have a knee) and the graph kernels do not perform very well. GraphStats takes 765 seconds to complete, whereas NCLM finishes in 2.7 seconds. This is because the networks are large but extremely sparse, so calculating matrix powers is comparatively cheap.

In Figure 4 we plot the kernel similarity matrix obtained using NCLM on the real networks (the higher the value, the more similar the points). The first 11 networks are from HEP-Th, the next 11 are from Citeseer, the next 16 are from NIPS, and the remaining ones are the ego networks from Facebook. First note that {HEP-Th, Citeseer}, NIPS and Facebook are well separated, whereas HEP-Th and Citeseer are hard to separate; this is consistent with the poor performance of TopEig in separating the first two groups. However, in Figure 4 we can see that the Citeseer networks differ from the HEP-Th ones in that they are not as strongly connected internally.

## 6 Discussion

We consider the problem of clustering network-valued data in two settings, both prevalent in practice. In the first setting, different network objects have node correspondence. This includes clustering brain networks obtained from fMRI data, where each node corresponds to a specific region of the brain, or co-authorship networks between a fixed set of authors, where the connections vary from one year to another. In the second setting, node correspondence is absent, e.g., when one wishes to compare different types of networks: co-authorship networks, Facebook ego networks, etc. One may be interested in seeing whether co-authorship networks are more "similar" to each other than to ego or friendship networks.

We present two algorithms for these settings based on a simple general theme: summarize a network into a (possibly high-dimensional) feature vector and then cluster these feature vectors. In the first setting, we propose NCGE, where each network is represented by its graphon estimate; a variety of graphon estimation algorithms can be used for this purpose. We show that if the graphon estimation is consistent, then NCGE clusters networks generated from a finite mixture of graphons consistently, provided those graphons are sufficiently distinct. In the second setting, we propose to represent a network using an easy-to-compute summary statistic, namely the vector of the log-traces of the first few powers of a suitably normalized version of the adjacency matrix. We call this method NCLM, show that the summary statistic concentrates around its expectation, and argue that this expectation should be able to separate networks generated from different models. Using simulated and real data experiments, we show that NCGE is vastly superior to the naive but often-used method of comparing adjacency matrices directly, and that NCLM outperforms most computationally expensive alternatives for differentiating networks without node correspondence. In conclusion, we believe these methods will provide practitioners with powerful and computationally tractable tools for comparing network-structured data in a range of disciplines.

#### Acknowledgments

We thank Professor Peter J. Bickel for helpful discussions. SSM was partially supported by NSF-FRG grant DMS-1160319 and a Loève Fellowship. PS was partially supported by NSF grant DMS-1713082. LL was partially supported by NSF grants IIS-1663870 and DMS-1654579, and DARPA grant N-66001-17-1-4041.

## References

• [1] David J. Aldous. Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):581–598, 1981.
• [2] S. Aliakbary et al. Learning an integrated distance metric for comparing structure of complex networks. Chaos, 25(2), 2015.
• [3] Arash A Amini, Aiyou Chen, Peter J Bickel, Elizaveta Levina, et al. Pseudo-likelihood methods for community detection in large sparse networks. The Annals of Statistics, 41(4):2097–2122, 2013.
• [4] Arash A Amini and Elizaveta Levina. On semidefinite relaxations for the block model. arXiv preprint arXiv:1406.5647, 2014.
• [5] Brian Ball, Brian Karrer, and MEJ Newman. Efficient and principled method for detecting communities in networks. Physical Review E, 84(3):036103, 2011.
• [6] Peter Bickel, David Choi, Xiangyu Chang, Hai Zhang, et al. Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels. The Annals of Statistics, 41(4):1922–1943, 2013.
• [7] Peter J. Bickel and Aiyou Chen. A nonparametric view of network models and newman girvan and other modularities. Proceedings of the National Academy of Sciences of the Unites States of America, 106(50):21068–21073, 2009.
• [8] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.
• [9] Eric Carlen. Trace inequalities and quantum entropy: an introductory course. Entropy and the quantum, 529:73–140, 2010.
• [10] Sourav Chatterjee. Matrix estimation by universal singular value thresholding. Ann. Statist., 43(1):177–214, 02 2015.
• [11] Hao Chen and Nancy Zhang. Graph-based change-point detection. Ann. Statist., 43(1):139–176, 02 2015.
• [12] David S Choi, Patrick J Wolfe, and Edoardo M Airoldi. Stochastic blockmodels with a growing number of classes. Biometrika, page asr053, 2012.
• [13] Paul Erdős and Alfréd Rényi. On random graphs I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
• [14] Chao Gao, Yu Lu, and Harrison H. Zhou. Rate-optimal graphon estimation. Ann. Statist., 43(6):2624–2652, 12 2015.
• [15] C. E. Ginestet, P. Balanchandran, S. Rosenberg, and E. D. Kolaczyk. Hypothesis Testing For Network Data in Functional Neuroimaging. ArXiv e-prints, July 2014.
• [16] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social networks, 5(2):109–137, 1983.
• [17] D.N. Hoover. Relations on probability spaces and arrays of random variables. Technical report, Institute of Advanced Study, Princeton., 1979.
• [18] Brian Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83:016107, Jan 2011.
• [19] Jing Lei, Alessandro Rinaldo, et al. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215–237, 2015.
• [20] Priya Mahadevan, Dmitri Krioukov, Kevin Fall, and Amin Vahdat. Systematic topology analysis and generation using degree correlations. SIGCOMM Comput. Commun. Rev., 36(4):135–146, August 2006.
• [21] Pierre-André G Maugis, Sofia C Olhede, and Patrick J Wolfe. Topology reveals universal features for network comparison. arXiv preprint arXiv:1705.05677, 2017.
• [22] M. E. J. Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.
• [23] Mark E. Newman. Assortative mixing in networks. Phys. Rev. Lett., 89(20):208701, 2002.
• [24] M.E.J. Newman. The structure and function of complex networks. SIAM review, 45(2):167–256, 2003.
• [25] Sofia C. Olhede and Patrick J. Wolfe. Network histograms and universality of blockmodel approximation. Proceedings of the National Academy of Sciences of the United States of America, 111(41):14722–14727, 2014.
• [26] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic block model. Annals of Statistics, 39:1878–1915, 2011.
• [27] N. Shervashidze, SVN. Vishwanathan, TH. Petri, K. Mehlhorn, and KM. Borgwardt. Efficient graphlet kernels for large graph comparison. In JMLR Workshop and Conference Proceedings Volume 5: AISTATS 2009, pages 488–495, Cambridge, MA, USA, April 2009. Max-Planck-Gesellschaft, MIT Press.
• [28] S. V. N. Vishwanathan, Nicol N. Schraudolph, Risi Kondor, and Karsten M. Borgwardt. Graph kernels. J. Mach. Learn. Res., 11:1201–1242, August 2010.
• [29] P. J. Wolfe and S. C. Olhede. Nonparametric graphon estimation. ArXiv e-prints, September 2013.
• [30] Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015.
• [31] Y. Zhang, E. Levina, and J. Zhu. Estimating network edge probabilities by neighborhood smoothing. ArXiv e-prints, September 2015.
• [32] Shi Zhou and Raul J. Mondragón. The rich-club phenomenon in the Internet topology. IEEE Communications Letters, 8(3):180–182, 2004.

## Appendix A Proofs and related discussions

### A.1 NCGE

###### Proposition A.1.

We have

$$\|\hat{D}-D\|_F^2 \le 4T\sum_i \|\hat{P}_i-P_i\|_F^2.$$
###### Proof.

The proof is straightforward. By the triangle inequality we have

$$|\hat{D}_{ij}-D_{ij}| = \Big|\,\|\hat{P}_i-\hat{P}_j\|_F - \|P_i-P_j\|_F\,\Big| \le \|\hat{P}_i-P_i\|_F + \|\hat{P}_j-P_j\|_F.$$

Therefore

$$\|\hat{D}-D\|_F^2 = \sum_{i,j}|\hat{D}_{ij}-D_{ij}|^2 \le \sum_{i,j}\big(\|\hat{P}_i-P_i\|_F+\|\hat{P}_j-P_j\|_F\big)^2 \le 2\sum_{i,j}\big(\|\hat{P}_i-P_i\|_F^2+\|\hat{P}_j-P_j\|_F^2\big) = 4T\sum_i\|\hat{P}_i-P_i\|_F^2. \qquad ∎$$
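The inequality is deterministic, so it can be sanity-checked numerically. A minimal numpy sketch (the matrices below are arbitrary stand-ins for the estimates $\hat{P}_i$, not actual graphon estimates):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 6, 30  # number of networks, number of nodes (illustrative)

# "True" matrices P_i and perturbed "estimates" P_hat_i; the bound in
# Proposition A.1 holds for any matrices, not just graphon estimates.
P = [rng.uniform(0, 1, (n, n)) for _ in range(T)]
P_hat = [Pi + 0.05 * rng.standard_normal((n, n)) for Pi in P]

# D_ij = ||P_i - P_j||_F and its plug-in estimate D_hat.
D = np.array([[np.linalg.norm(P[i] - P[j]) for j in range(T)] for i in range(T)])
D_hat = np.array([[np.linalg.norm(P_hat[i] - P_hat[j]) for j in range(T)]
                  for i in range(T)])

lhs = np.linalg.norm(D_hat - D) ** 2
rhs = 4 * T * sum(np.linalg.norm(P_hat[i] - P[i]) ** 2 for i in range(T))
assert lhs <= rhs
```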

###### Proposition A.2 (Davis-Kahan).

Suppose $D$ has rank $r$. Let $\hat{V}$ (resp. $V$) be the matrix whose columns correspond to the leading $r$ eigenvectors (corresponding to the $r$ largest-in-magnitude eigenvalues) of $\hat{D}$ (resp. $D$). Let $\gamma$ be the smallest nonzero eigenvalue of $D$ in magnitude. Then there exists an orthogonal matrix $\hat{O}$ such that

$$\|\hat{V}\hat{O}-V\|_F \le \frac{4\|\hat{D}-D\|_F}{\gamma}.$$
###### Proof.

This follows from a slight variant of the Davis-Kahan theorem that appears in [30]. Since $D$ is a Euclidean distance matrix of rank $r$, its eigenvalues must be of the form

$$\lambda_1 \ge \cdots \ge \lambda_u > 0 = \cdots = 0 > \lambda_v \ge \cdots \ge \lambda_T,$$

with $u + (T - v + 1) = r$. Applying Theorem 2 of [30] to the positive eigenvalues $\lambda_1, \dots, \lambda_u$, we get that if $V_+$ denotes the matrix whose columns are the eigenvectors of $D$ corresponding to $\lambda_1, \dots, \lambda_u$ and $\hat{V}_+$ denotes the corresponding matrix for $\hat{D}$, then there exists an orthogonal matrix $\hat{O}_+$ such that

$$\|\hat{V}_+\hat{O}_+ - V_+\|_F \le \frac{2\sqrt{2}\,\|\hat{D}-D\|_F}{\lambda_u}.$$

Similarly, considering the eigenvalues $\lambda_v \ge \cdots \ge \lambda_T$, and applying Theorem 2 of [30] to $-D$ and $-\hat{D}$, we get that

$$\|\hat{V}_-\hat{O}_- - V_-\|_F \le \frac{2\sqrt{2}\,\|\hat{D}-D\|_F}{-\lambda_v},$$

where $V_-$, $\hat{V}_-$ and $\hat{O}_-$ are the relevant matrices. Set $\hat{V} = [\hat{V}_+ \; \hat{V}_-]$, $V = [V_+ \; V_-]$ and $\hat{O} = \mathrm{diag}(\hat{O}_+, \hat{O}_-)$. Then note that the columns of $V$ are eigenvectors of $D$ corresponding to its $r$ largest-in-magnitude eigenvalues, that $\hat{O}$ is orthogonal, and also that $\hat{V}\hat{O} = [\hat{V}_+\hat{O}_+ \; \hat{V}_-\hat{O}_-]$. Thus

$$\|\hat{V}\hat{O}-V\|_F^2 = \|\hat{V}_+\hat{O}_+-V_+\|_F^2 + \|\hat{V}_-\hat{O}_--V_-\|_F^2 \le \frac{8\|\hat{D}-D\|_F^2}{\lambda_u^2} + \frac{8\|\hat{D}-D\|_F^2}{\lambda_v^2} \le \frac{16\|\hat{D}-D\|_F^2}{\gamma^2}. \qquad ∎$$
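The bound can be checked numerically on a small synthetic example. A numpy sketch, using the orthogonal Procrustes solution for $\hat{O}$ (which is at least as good as the alignment whose existence the proposition asserts); the rank, eigenvalues, and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
T, r = 10, 2  # number of networks, rank (illustrative)

# A rank-r symmetric "distance-like" matrix with eigenvalues 3 and -2.
Q, _ = np.linalg.qr(rng.standard_normal((T, r)))
D = Q @ np.diag([3.0, -2.0]) @ Q.T
gamma = 2.0  # smallest nonzero eigenvalue of D in magnitude

E = rng.standard_normal((T, T))
E = 0.01 * (E + E.T)  # small symmetric perturbation
D_hat = D + E

def top_eigvecs(M, r):
    """Eigenvectors for the r largest-in-magnitude eigenvalues of symmetric M."""
    w, V = np.linalg.eigh(M)
    idx = np.argsort(-np.abs(w))[:r]
    return V[:, idx]

V, V_hat = top_eigvecs(D, r), top_eigvecs(D_hat, r)

# Best orthogonal alignment O = argmin ||V_hat O - V||_F (Procrustes via SVD).
U, _, Wt = np.linalg.svd(V_hat.T @ V)
O = U @ Wt
lhs = np.linalg.norm(V_hat @ O - V)
rhs = 4 * np.linalg.norm(D_hat - D) / gamma
assert lhs <= rhs
```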

###### Proof of Theorem 4.1.

Follows immediately from Propositions A.1-A.2. ∎

###### Proposition A.3.

Suppose there are $K$ underlying graphons, as in the graphon mixture model (3), and assume that in our sample there is at least one representative from each of these. Then $D$ has the form $Z\tilde{D}Z^T$, where the $i$-th row of the binary matrix $Z \in \{0,1\}^{T\times K}$ has a single one, at position $k$ if network $i$ is sampled from the $k$-th graphon, and $\tilde{D}$ is the $K\times K$ matrix of distances between the population versions of the $K$ graphons. As a consequence $D$ is of rank $K$. Then there exists a matrix $V$ whose columns are eigenvectors of $D$ corresponding to the nonzero eigenvalues, such that

$$V_{i\star} = V_{j\star} \iff Z_{i\star} = Z_{j\star},$$

so that knowing $V$, one can recover the clusters perfectly.

###### Proof.

The proof is standard. Note that $Z^TZ$ is positive definite. Consider the matrix $(Z^TZ)^{1/2}\tilde{D}(Z^TZ)^{1/2}$ and let $U\Lambda U^T$ be its spectral decomposition. Then the matrix $V = Z(Z^TZ)^{-1/2}U$ has the required properties. ∎
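A small numpy sketch of the perfect-recovery claim (the cluster sizes and the matrix $\tilde{D}$ below are arbitrary choices): rows of the eigenvector matrix $V$ coincide exactly when the corresponding rows of $Z$ do.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 3
z = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # cluster label of each network
T = len(z)
Z = np.eye(K)[z]  # T x K binary membership matrix

# Symmetric K x K matrix of distances between the K cluster "centers".
M = rng.uniform(1.0, 2.0, (K, K))
D_tilde = (M + M.T) / 2
np.fill_diagonal(D_tilde, 0.0)

D = Z @ D_tilde @ Z.T  # block-constant distance matrix of rank K

w, V = np.linalg.eigh(D)
idx = np.argsort(-np.abs(w))[:K]
V = V[:, idx]  # eigenvectors for the K nonzero eigenvalues

# Rows of V coincide exactly for networks in the same cluster.
same = np.array([[np.allclose(V[i], V[j]) for j in range(T)] for i in range(T)])
truth = z[:, None] == z[None, :]
assert (same == truth).all()
```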

**USVT:** Theorem 2.7 of [10] tells us that if the underlying graphons are Lipschitz, then $\frac{1}{n^2}\mathbb{E}\|\hat{P}-P\|_F^2 \le Cn^{-1/3}$, where the constant $C$ depends only on the Lipschitz constant of the underlying graphon. So the condition (5) of Corollary 4.2 is satisfied with the rate $n^{-1/3}$.

**Neighborhood smoothing:** The authors of [31] work with a class of piecewise Lipschitz graphons; see Definition 2.1 of their paper. The proof of Theorem 2.2 of [31] reveals that if $\hat{P}$ is the neighborhood smoothing estimate, then there exist a global constant $C > 0$ and positive constants $c_1, c_2$ such that for all $n$ large enough, with probability at least $1 - c_1n^{-c_2}$, we have $\frac{1}{n^2}\|\hat{P}-P\|_F^2 \le C\sqrt{\frac{\log n}{n}}$. Thus the condition (5) of Corollary 4.2 is satisfied with the rate $\sqrt{\log n/n}$.
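For concreteness, a minimal numpy sketch of the USVT estimator of [10] for a fully observed symmetric matrix with entries in $[0,1]$; the threshold constant $(2+\eta)\sqrt{n}$ with $\eta = 0.01$ and the two-block model used in the check are our own illustrative choices:

```python
import numpy as np

def usvt(A, eta=0.01):
    """Universal Singular Value Thresholding: keep singular values above
    (2 + eta) * sqrt(n), then clip the reconstruction to [0, 1]."""
    n = A.shape[0]
    U, s, Vt = np.linalg.svd(A)
    s = np.where(s >= (2 + eta) * np.sqrt(n), s, 0.0)
    return np.clip(U @ np.diag(s) @ Vt, 0.0, 1.0)

# Check on a two-block stochastic block model.
rng = np.random.default_rng(3)
n = 200
z = rng.integers(0, 2, n)
P = np.where(z[:, None] == z[None, :], 0.6, 0.2)
upper = np.triu(rng.uniform(size=(n, n)) < P, 1).astype(float)
A = upper + upper.T  # symmetric adjacency matrix, empty diagonal
P_hat = usvt(A)

# The estimate should beat the raw adjacency matrix in mean squared error.
mse_usvt = np.mean((P_hat - P) ** 2)
mse_naive = np.mean((A - P) ** 2)
assert mse_usvt < mse_naive
```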

###### Remark A.1.

In the case of the graphon mixture model (3), the constants involved will be free of $i$, as there are only $K$ (fixed) underlying graphons. Also, if each network has the same number of nodes $n$, then the rates will not depend on $i$, and if the Lipschitz constants are the same for each graphon, then the constant $C$ will not depend on $i$.

###### Remark A.2.

There are combinatorial algorithms that can achieve the minimax rate of graphon estimation [14]. Albeit impractical, these algorithms can be used to achieve the optimal bound in Proposition A.1.

###### Remark A.3.

We do not expect CL-NAIVE to perform well simply because $A$ is not a good estimate of $P$ in Frobenius norm in general. Indeed,

$$\frac{1}{n^2}\mathbb{E}\|A-P\|_F^2 = \frac{1}{n^2}\sum_{i,j}\mathbb{E}(A_{ij}-P_{ij})^2 = \frac{1}{n^2}\sum_{i\ne j}P_{ij}(1-P_{ij}) + \frac{1}{n^2}\sum_i P_{ii}^2 \le \frac{1}{4}(1+o(1)),$$

with equality, for example, when each $P_{ij} = 1/2$.
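A quick Monte Carlo check of this computation (the symmetric $P$ with entries in $[0.2, 0.8]$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 100, 200
P = rng.uniform(0.2, 0.8, (n, n))
P = (P + P.T) / 2
np.fill_diagonal(P, 0.0)  # A has an empty diagonal, so match it in P

# Theory: (1/n^2) E||A - P||_F^2 = (1/n^2) sum_{i != j} P_ij (1 - P_ij).
theory = np.sum(P * (1 - P)) / n**2

vals = []
for _ in range(reps):
    upper = np.triu(rng.uniform(size=(n, n)) < P, 1).astype(float)
    A = upper + upper.T
    vals.append(np.mean((A - P) ** 2))
mc = float(np.mean(vals))

assert abs(mc - theory) < 0.005
```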

### A.2 NCLM

###### Proposition A.4 (Lipschitz continuity).

Suppose $A$ and $B$ are $n\times n$ symmetric matrices with entries in $[0,1]$. Then

1. $|\mathrm{trace}(A^k) - \mathrm{trace}(B^k)| \le kn^{k-1}\|A-B\|_F$, i.e. the map $\tilde{\phi}_k$ defined on the space of symmetric matrices with entries in $[0,1]$ by $\tilde{\phi}_k(A) = \mathrm{trace}(A^k)$ is Lipschitz with constant $kn^{k-1}$.

2. The map $\tilde{\phi}_{+,k}$ defined on the upper-triangular entries by

$$\tilde{\phi}_{+,k}\big((a_{ij})_{1\le i<j\le n}\big) = \mathrm{trace}(A_+^k),$$

where $A_{ij} = A_{ji} = a_{ij}$ for $i < j$, $A_{ii} = 0$, and $A_+$ is the projection of $A$ onto the PSD cone, is Lipschitz with constant $\sqrt{2}kn^{k-1}$. The same statement holds for the analogously defined $\tilde{\phi}_{-,k}$.

###### Proof.

Let $\lambda_1 \ge \cdots \ge \lambda_n$ be the ordered eigenvalues of $A$, and $\mu_1 \ge \cdots \ge \mu_n$ be the ordered eigenvalues of $B$. First note that since these matrices have entries in $[0,1]$, their Frobenius norm is at most $n$. Thus all their eigenvalues lie in $[-n, n]$.

We now compute the derivative of $\mathrm{trace}(A^k)$ with respect to a particular off-diagonal entry $A_{ij}$. We claim that

$$\frac{d}{dA_{ij}}\mathrm{trace}(A^k) = 2kA^{k-1}_{ij}.$$

To show this we shall work with the linear-map interpretation of the derivative (which in this case is just multiplication by a number). First consider the map $h : \mathbb{R} \to S_n$ defined as $h(u) = A + (u - A_{ij})(e_ie_j^T + e_je_i^T)$. Then consider the map $f : S_n \to S_n$ defined by $f(A) = A^k$. Finally consider the map $g : S_n \to \mathbb{R}$ defined as $g(A) = \mathrm{trace}(A)$. Now we view $\mathrm{trace}(A^k)$, for symmetric $A$, as a function of the entry $A_{ij}$ via the composition $g \circ f \circ h$. Therefore, by the chain rule,

$$\frac{d}{dA_{ij}}\mathrm{trace}(A^k)(u) = D_{f(h(A_{ij}))}g \circ D_{h(A_{ij})}f \circ D_{A_{ij}}h\,(u).$$

Now $g$ is a linear function, therefore $D_Xg(H) = g(H)$. On the other hand, it is easy to see that $D_{A_{ij}}h(u) = u(e_ie_j^T + e_je_i^T)$. Finally, $f$ can be viewed as the composition of the two maps $\beta : S_n \to S_n^k$, $\beta(A) = (A,\dots,A)$, and $\alpha : S_n^k \to S_n$, $\alpha(A_1,\dots,A_k) = A_1A_2\cdots A_k$. Notice that $\alpha$ is multilinear, so that $D_{(A,\dots,A)}\alpha(H_1,\dots,H_k) = \sum_{m=1}^k A^{m-1}H_mA^{k-m}$, and $D_A\beta(H) = (H,\dots,H)$. Thus

$$D_Af(H) = D_{\beta(A)}\alpha \circ D_A\beta(H) = D_{\beta(A)}\alpha(H,\dots,H) = HA^{k-1} + AHA^{k-2} + \cdots + A^{k-1}H,$$

and each of these $k$ terms has the same trace, namely $\mathrm{trace}(HA^{k-1})$.

Therefore

$$\frac{d}{dA_{ij}}\mathrm{trace}(A^k)(u) = g \circ D_{h(A_{ij})}f\,(ue_ie_j^T + ue_je_i^T).$$

Noting that $\mathrm{trace}(e_ie_j^TA^{k-1}) = A^{k-1}_{ji} = A^{k-1}_{ij}$, we get

$$\frac{d}{dA_{ij}}\mathrm{trace}(A^k)(u) = g\big(k(ue_ie_j^T + ue_je_i^T)A^{k-1}\big) = 2kA^{k-1}_{ij}u.$$

Therefore the map $\tilde{\phi}_k$ defined on the upper-triangular entries by $\tilde{\phi}_k\big((a_{ij})_{1\le i<j\le n}\big) = \mathrm{trace}(A^k)$ has gradient $\nabla\tilde{\phi}_k = \big(2kA^{k-1}_{ij}\big)_{1\le i<j\le n}$. Therefore, by the mean value theorem,

$$|\tilde{\phi}_k(A) - \tilde{\phi}_k(B)| \le \|\nabla\tilde{\phi}_k\|_2\,\|(a_{ij}) - (b_{ij})\|_2.$$

But $\|\nabla\tilde{\phi}_k\|_2^2 = \sum_{i<j}4k^2\big(A^{k-1}_{ij}\big)^2 \le 2k^2\|A^{k-1}\|_F^2 \le 2k^2n^{2(k-1)}$ (by repeated application of the inequality $\|XY\|_F \le \|X\|_F\|Y\|_F$ and noting that $\|A\|_F \le n$), and $\|(a_{ij}) - (b_{ij})\|_2 = \frac{1}{\sqrt{2}}\|A-B\|_F$. Therefore

$$|\tilde{\phi}_k(A) - \tilde{\phi}_k(B)| \le kn^{k-1}\|A-B\|_F.$$

Also, as before,

$$\big|\mathrm{trace}\big((A+U)_+^k\big) - \mathrm{trace}\big(A_+^k\big)\big| = \big|\tilde{\phi}_k\big((A+U)_+\big) - \tilde{\phi}_k\big(A_+\big)\big| \le kn^{k-1}\|(A+U)_+ - A_+\|_F \le kn^{k-1}\|U\|_F,$$

where in the last step we have used the fact that $X \mapsto X_+$ is the projection onto the PSD cone and hence non-expansive (i.e. $1$-Lipschitz). Part 2 may now be obtained easily by noting that for a symmetric matrix $U$ with zero diagonal, $\|U\|_F = \sqrt{2}\,\|(u_{ij})_{1\le i<j\le n}\|_2$. ∎
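Part 1 of the proposition is easy to verify numerically; a numpy sketch with arbitrary symmetric matrices with entries in $[0,1]$ (the bound is quite loose in practice):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 20, 4  # illustrative size and power

def rand_sym01(rng, n):
    M = rng.uniform(0.0, 1.0, (n, n))
    return (M + M.T) / 2  # symmetric, entries in [0, 1]

A, B = rand_sym01(rng, n), rand_sym01(rng, n)
lhs = abs(np.trace(np.linalg.matrix_power(A, k))
          - np.trace(np.linalg.matrix_power(B, k)))
rhs = k * n ** (k - 1) * np.linalg.norm(A - B)
assert lhs <= rhs
```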

###### Proposition A.5 (Convexity).

The functions $\tilde{\phi}_{+,k}$ and $\tilde{\phi}_{-,k}$ are convex on their respective domains.

###### Proof.

We recall the standard result that if a continuous map $f : \mathbb{R} \to \mathbb{R}$ is convex, then so is $A \mapsto \mathrm{trace}(f(A))$ on the space of Hermitian matrices, and it is strictly convex if $f$ is strictly convex (see, for example, Theorem 2.10 of [9]). To use this we note that $x \mapsto x_+ = \max\{x, 0\}$ is continuous and convex, and so is $x \mapsto x_+^k$. This establishes convexity of $\tilde{\phi}_{+,k}$. Convexity of $\tilde{\phi}_{-,k}$ is an immediate consequence. ∎
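A numerical midpoint-convexity check of $A \mapsto \mathrm{trace}(A_+^k)$, computing $\mathrm{trace}(A_+^k)$ from the eigenvalues as $\sum_i \max(\lambda_i, 0)^k$; the sizes and number of random trials are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 15, 4  # illustrative sizes

def trace_pos_power(A, k):
    """trace(A_+^k), computed from the eigenvalues of symmetric A."""
    w = np.linalg.eigvalsh(A)
    return float(np.sum(np.maximum(w, 0.0) ** k))

def rand_sym(rng, n):
    M = rng.uniform(0, 1, (n, n))
    return (M + M.T) / 2

# Midpoint-convexity check along random pairs of matrices.
gaps = []
for _ in range(20):
    A, B = rand_sym(rng, n), rand_sym(rng, n)
    mid = trace_pos_power((A + B) / 2, k)
    avg = 0.5 * trace_pos_power(A, k) + 0.5 * trace_pos_power(B, k)
    gaps.append(avg - mid)  # convexity predicts avg >= mid

assert min(gaps) >= -1e-6
```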

###### Proof of Theorem 4.3.

The idea is to use Talagrand's concentration inequality for convex Lipschitz functions (cf. [8], Theorem 7.12). First note that for $k$ even, we have

$$\psi_k(A) = \psi_{+,k}(A) + \psi_{-,k}(A),$$

and for $k$ odd,

$$\psi_k(A) = \psi_{+,k}(A) - \psi_{-,k}(A),$$

where

$$\psi_{\pm,k}(A) = \frac{1}{\sqrt{2}\,kn^{k-1}}\,\tilde{\phi}_{\pm,k}(A).$$

Viewed as maps from $[0,1]^{\binom{n}{2}}$ to $\mathbb{R}$, both $\psi_{+,k}$ and $\psi_{-,k}$ are convex and $1$-Lipschitz. Therefore, by Talagrand's inequality,

$$P\big(|\psi_{\pm,k}(A) - M\psi_{\pm,k}(A)| > t\big) \le 2\exp(-t^2/4),$$

where $M\psi_{\pm,k}(A)$ is a median of $\psi_{\pm,k}(A)$. By Exercise 2.2 of [8],

$$|M\psi_{\pm,k}(A) - \mathbb{E}\psi_{\pm,k}(A)| \le 2\sqrt{2},$$

which implies that, for $t > 2\sqrt{2}$,

$$P\big(|\psi_{\pm,k}(A) - \mathbb{E}\psi_{\pm,k}(A)| > t\big) \le 2\exp\big(-(t - 2\sqrt{2})^2/4\big).$$

Therefore

$$P\big(|\psi_k(A) - \mathbb{E}\psi_k(A)| > t\big) \le P\big(|\psi_{+,k}(A) - \mathbb{E}\psi_{+,k}(A)| > t/2\big) + P\big(|\psi_{-,k}(A) - \mathbb{E}\psi_{-,k}(A)| > t/2\big) \le 4\exp\big(-(t - 4\sqrt{2})^2/16\big). \qquad ∎$$

###### Proposition A.6 (Order of expectation).

Let $A \sim \mathrm{IER}(P)$, i.e. an inhomogeneous Erdős-Rényi graph with edge probabilities $P_{ij}$, where $P_\star = \min_{i\ne j}P_{ij}$ and $P_\# = \max_{i\ne j}P_{ij}$ satisfy $P_\star \asymp P_\# \asymp \rho$, and let $m_k(A) = \mathrm{trace}(A^k)/n^k$. Then

$$\rho^k \preceq \mathbb{E}\,m_k(A) \preceq \rho^{k-1}.$$
###### Proof.

Note that

$$\mathrm{trace}(A^k) = \sum_{i_1,i_2,\dots,i_k}A_{i_1i_2}A_{i_2i_3}\cdots A_{i_ki_1}.$$

Since the $A_{ij}$'s are Bernoulli random variables, letting $P_\star = \min_{i\ne j}P_{ij}$ and $P_\# = \max_{i\ne j}P_{ij}$, we see that

$$P_\star^\ell \le \mathbb{E}\,A_{i_1i_2}A_{i_2i_3}\cdots A_{i_ki_1} \le P_\#^\ell,$$

where $\ell$ is the number of distinct sets $\{i_j, i_{j+1}\}$ among $\{i_1,i_2\}, \{i_2,i_3\}, \dots, \{i_k,i_1\}$. We call $\ell$ the weight of the sequence $(i_1,\dots,i_k)$. We can easily see that the total number of sequences is bounded by $n^k$, and the number of sequences with weight $\ell$, call it $N(\ell;k,n)$, is bounded above by $k^kn^{\ell+1}$, since a closed walk with $\ell$ distinct edges visits at most $\ell+1$ distinct vertices. In fact,

$$N(k;k,n) \ge n(n-1)(n-2)^{k-3}(n-3) \asymp n^k.$$

We thus have

$$\sum_{\ell=1}^k N(\ell;k,n)P_\star^\ell \le \mathbb{E}\,\mathrm{trace}(A^k) \le \sum_{\ell=1}^k N(\ell;k,n)P_\#^\ell.$$