1 Introduction
Statistical network analysis spans a wide range of disciplines, including network science, statistics, physics, computer science and sociology, and an equally wide range of applications and analysis tasks such as community detection and link prediction. In this paper, we study the problem of inferring the generative mechanism of an undirected network based on a single realization of the network. The data consist of the network adjacency matrix $A \in \{0,1\}^{n \times n}$, where $n$ is the number of nodes, and $A_{ij} = 1$ if there is an edge between nodes $i$ and $j$. We assume the observed adjacency matrix is generated from an underlying probability matrix $P$, so that for $i < j$, the $A_{ij}$'s are independent Bernoulli trials, and the $P_{ij}$'s are edge probabilities.
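As a point of reference, the generative model just described can be sketched in a few lines of code; the function name and the block-model example below are ours for illustration, not part of the paper.

```python
import numpy as np

def sample_network(P, rng=None):
    """Draw one network realization: A_ij = A_ji ~ Bernoulli(P_ij), independently for i < j."""
    rng = np.random.default_rng(rng)
    n = P.shape[0]
    U = rng.random((n, n))
    A = np.triu((U < P).astype(int), k=1)   # independent draws on the upper triangle
    return A + A.T                          # symmetrize; diagonal stays zero (no self-loops)

# Example: a two-block stochastic block model probability matrix
P = np.full((6, 6), 0.1)
P[:3, :3] = P[3:, 3:] = 0.7
A = sample_network(P, rng=0)
```

The single matrix `A` is all the data available; everything that follows is about recovering `P` from it.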
It is impossible to estimate $P$ from a single realization of $A$ unless one assumes some form of structure in $P$. When the network is expected to have communities, arguably the most popular assumption is that of the stochastic block model, where each node belongs to one of $K$ blocks and the probability of an edge between two nodes is determined by the blocks to which the nodes belong. In this case, the matrix $P$ is parametrized by the $K \times K$ matrix of within- and between-block edge probabilities, and thus it is possible to estimate $P$ from a single realization. The main challenge in fitting the stochastic block model is in estimating the blocks themselves, and that has been the focus of the literature; see for example Bickel & Chen (2009), Rohe et al. (2011), Amini et al. (2013), Saade et al. (2014) and Guédon & Vershynin (2016). Once the blocks are estimated, $P$ can be estimated efficiently by a plug-in moment estimator. Many extensions and alternatives to the stochastic block model have been proposed to model networks with communities, including those of
Hoff (2008), Airoldi et al. (2008), Karrer & Newman (2011), Cai & Li (2015) and Zhang et al. (arXiv:1412.3432), but their properties are generally only known under the correctly specified model with communities. Here we are interested in estimating $P$ for more general networks.

A general representation for the matrix $P$ for unlabeled exchangeable networks goes back to Aldous (1981) and the 1979 preprint by D. N. Hoover entitled "Relations on probability spaces and arrays of random variables". Formally, a network is exchangeable if for any permutation $\sigma$ of the node set $\{1, \dots, n\}$, the distribution of edges remains the same. That is, if the adjacency matrix $A$ is drawn from the probability matrix $P$, which we write as $A \sim P$, then for any permutation $\sigma$,

$\{A_{ij}\} \overset{d}{=} \{A_{\sigma(i)\sigma(j)}\}. \quad (1)$
Aldous (1981) and Hoover showed that an exchangeable network always admits the following Aldous–Hoover representation:
Definition 1
For any network satisfying (1), there exists a function $f: [0,1]^2 \to [0,1]$ with $f(u,v) = f(v,u)$, and a set of independent and identically distributed random variables $\xi_i \sim \mathrm{Uniform}[0,1]$, such that

$P_{ij} = f(\xi_i, \xi_j). \quad (2)$
Following the literature, we call $f$ the graphon function. Unfortunately, as pointed out in Diaconis & Janson (arXiv:0712.2749), $f$ in this representation is neither unique nor identifiable, since for any measure-preserving one-to-one transformation $\varphi: [0,1] \to [0,1]$, both $f(u,v)$ and $f(\varphi(u), \varphi(v))$ yield the same distribution of $A$. An identifiable and unique canonical representation can be defined if one requires the degree function $g(u) = \int_0^1 f(u,v)\,dv$ to be nondecreasing (Bickel & Chen, 2009). Chan & Airoldi (2014) show that $f$ and the $\xi_i$'s are jointly identifiable when $g$, which can be interpreted as the expected node degree, is strictly monotone. This assumption is strong and excludes the stochastic block model.
In practice, the main purpose of estimating $f$ is to estimate $P$, and thus identifiability of $f$, or lack thereof, may not matter if $P$ itself can be estimated. The preprint by Hoover and Diaconis & Janson (arXiv:0712.2749) showed that the measure-preserving map is the only source of non-identifiability. Wolfe & Olhede (arXiv:1309.5936) and Choi & Wolfe (2014) proposed estimating $f$ up to a measure-preserving transformation via step-function approximations based on fitting the stochastic block model with a growing number of blocks. This approximation does not assume that the network itself follows the block model, and some theoretical guarantees have been obtained under more general models. In related work, Olhede & Wolfe (2014) proposed to approximate the graphon with so-called network histograms, that is, stochastic block models with many blocks of equal size, akin to histogram bins. Another method to compute a network histogram was proposed by Amini & Levina (2017), as an application of their semidefinite programming approach to fitting block models with equal-size blocks. Recently, Gao et al. (2015) established the minimax error rate for estimating $P$ and proposed a least-squares-type estimator achieving this rate, which obtains the estimated probability by averaging the adjacency matrix elements within a given block partition. A similar estimator was proposed in Choi (2017), applicable also to non-smooth graphons. However, these methods are in principle computationally infeasible, since they require an exhaustive enumeration of all possible block partitions. Cai et al. (arXiv:1412.2129) proposed an iterative algorithm to fit a stochastic block model and approximate the graphon, but its error rate is unknown for general graphons. A Bayesian approach using block priors proposed by Gao et al. (arXiv:1506.02174) achieves the minimax error rate adaptively, but it still requires the evaluation of the posterior likelihood over all possible block partitions to obtain the posterior mode or the expectation of the probability matrix.
Other recent efforts on graphon estimation focus on the case of monotone node degrees, which makes the graphon identifiable. The sort-and-smooth methods of Yang et al. (2014) and Chan & Airoldi (2014) estimate the graphon under this assumption by first sorting nodes by their degrees and then smoothing the matrix locally to estimate edge probabilities. The monotone degree assumption is crucial for the success of these methods, and as we show later, the sort-and-smooth methods perform poorly when it does not hold. Finally, general matrix denoising methods can be applied to this problem if one treats $A$ as a noisy version of its expectation $P$; a good general representative of this class of methods is the universal singular value thresholding approach of Chatterjee (2015). Since this is a general method, we cannot expect its error rate to be especially competitive for this specific problem, and indeed its mean squared error rate is slower than the cubic root of the minimax rate.

In this paper, we propose a novel computationally efficient method for edge probability matrix estimation based on neighborhood smoothing, for piecewise Lipschitz graphon functions. The key to this method is adaptive neighborhood selection, which allows us to avoid making strong assumptions about the graphon. A node's neighborhood consists of nodes with similar rows in the adjacency matrix, which intuitively correspond to nodes with similar values of the latent positions $\xi_i$. To the best of our knowledge, our estimator achieves the best error rate among existing computationally feasible methods; it also allows easy parallelization. The size of the neighborhood is controlled by a tuning parameter, similar to a bandwidth in nonparametric regression; the rate of this bandwidth parameter is determined by theory, and we show empirically that the method is robust to the choice of the constant factor. Experiments on synthetic networks demonstrate that our method performs very well under a wide range of graphon models, including those of low rank and full rank, with and without monotone degrees. We also test its performance on the link prediction problem, using both synthetic and real networks.
2 The neighborhood smoothing estimator and its error rate
2.1 Neighborhood smoothing for edge probability estimation
Our goal is to estimate the probabilities $P_{ij}$ from the observed network adjacency matrix $A$, where each $A_{ij}$ is independently drawn from $\mathrm{Bernoulli}(P_{ij})$. While $P_{ij} = f(\xi_i, \xi_j)$, where the $\xi_i$'s are latent, our goal is to estimate $P$ for the single realization of the $\xi_i$'s that gave rise to the data, rather than the function $f$. We think of $f$ as a fixed unknown smooth function on $[0,1]^2$, with formal smoothness assumptions to be stated later. Let $\epsilon_{ij} = A_{ij} - P_{ij}$ denote the Bernoulli error, omitting its dependence on $n$. We can then write

$A_{ij} = f(\xi_i, \xi_j) + \epsilon_{ij}. \quad (3)$
Formulation (3) resembles a nonparametric regression problem, except that the $\xi_i$'s are not observed. This has important consequences: for example, assuming further smoothness in $f$ beyond order one does not improve the minimax error rate when estimating $P$ (Gao et al., 2015). Our approach is to apply neighborhood smoothing, which would be natural had the latent variables $\xi_i$ been observed. Intuitively, if we had a set $N_i$ of neighbors of a node $i$, in the sense that $P_{i'\cdot} \approx P_{i\cdot}$ for all $i' \in N_i$, where $P_{i\cdot}$ represents the $i$th row of $P$, then we could estimate $P_{ij}$ by averaging $A_{i'j}$ over $i' \in N_i$. Postponing the question of how to select $N_i$ until Section 2.2, we define a general neighborhood smoothing estimator by
$\hat P_{ij} = \frac{1}{|N_i|} \sum_{i' \in N_i} A_{i'j}. \quad (4)$
When the network is symmetric, we instead use a symmetric estimator
$\hat P_{ij} = \frac{1}{2} \left( \frac{1}{|N_i|} \sum_{i' \in N_i} A_{i'j} + \frac{1}{|N_j|} \sum_{j' \in N_j} A_{ij'} \right). \quad (5)$
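Given neighborhoods, the smoother itself is a short computation. The following is a minimal sketch of (4) and its symmetrization (5); the function and variable names are ours, and choosing the neighborhoods is deliberately left as an input, mirroring the structure of the text.

```python
import numpy as np

def smooth(A, neighborhoods):
    """Neighborhood smoothing estimator: row i of P_hat averages the rows
    of A indexed by N_i, and the result is then symmetrized as in (5).

    `neighborhoods[i]` plays the role of N_i; how to choose it is the
    subject of the neighborhood-selection step."""
    P_row = np.vstack([A[np.asarray(N)].mean(axis=0) for N in neighborhoods])
    return (P_row + P_row.T) / 2  # symmetrized estimator
```

Because each row of the estimate is an independent average, the computation vectorizes and parallelizes across nodes.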
For simplicity, we focus on undirected networks. A natural alternative is to average over all pairs $i' \in N_i$, $j' \in N_j$ simultaneously, but (4) and (5) allow vectorization and are thus more computationally efficient. Our estimator can also be viewed as a relaxation of step-function approximations such as Olhede & Wolfe (2014). In step-function approximations, the neighborhood for each node is the set of nodes from its block, so the neighborhoods for two nodes from the same block are very similar, and the blocks have to be estimated first; in contrast, neighborhood smoothing provides more flexible neighborhoods that differ from node to node, and an efficient way to select the neighborhood, which we discuss next.

2.2 Neighborhood selection
Selecting the neighborhood $N_i$ in (5) is the core of our method. Since we estimate $P_{ij}$ by averaging $A_{i'j}$ over $i' \in N_i$, good neighborhood candidates $i'$ should have $P_{i'\cdot}$ close to $P_{i\cdot}$, which means $f(\xi_{i'}, \cdot)$ close to $f(\xi_i, \cdot)$. We use the $\ell_2$ distance between graphon slices to quantify this, defining
$d(i, i')^2 = \int_0^1 \{ f(\xi_i, v) - f(\xi_{i'}, v) \}^2 \, dv. \quad (6)$
While one may consider more general $\ell_p$ or other distances, the $\ell_2$ distance is particularly easy to work with theoretically. For the purpose of neighborhood selection, it is not necessary to estimate $d(i,i')$ itself; it suffices to provide a tractable upper bound. For integrable functions $g_1$ and $g_2$ defined on $[0,1]$, define $\langle g_1, g_2 \rangle = \int_0^1 g_1(v)\, g_2(v)\,dv$. Then we can write
$d(i,i')^2 = \langle f(\xi_i,\cdot), f(\xi_i,\cdot) \rangle + \langle f(\xi_{i'},\cdot), f(\xi_{i'},\cdot) \rangle - 2 \langle f(\xi_i,\cdot), f(\xi_{i'},\cdot) \rangle. \quad (7)$
The third term in (7) can be estimated by $\langle A_{i\cdot}, A_{i'\cdot} \rangle / n$, where $A_{i\cdot}$ and $A_{i'\cdot}$ are nearly independent, up to a single duplicated entry due to symmetry. The first two terms in (7) are more difficult, since $\langle A_{i\cdot}, A_{i\cdot} \rangle / n$ is not a good estimator of $\langle f(\xi_i,\cdot), f(\xi_i,\cdot) \rangle$. Here we present the intuition and provide a full theoretical justification in Theorem 1. For simplicity, assume for now that $f$ is Lipschitz with Lipschitz constant $L$. The idea is to use nodes with graphon slices similar to those of $i$ and $i'$, so as to make the two arguments of each inner product distinct graphon slices. With high probability, for each $i$ we can find $\tilde k_i \ne i$ such that $d(i, \tilde k_i)$ is small, at a rate governed by a vanishing sequence depending on $n$, to be specified later. Then $\langle f(\xi_i,\cdot), f(\xi_i,\cdot) \rangle \approx \langle f(\xi_i,\cdot), f(\xi_{\tilde k_i},\cdot) \rangle$, and the latter can be estimated by $\langle A_{i\cdot}, A_{\tilde k_i \cdot} \rangle / n$. The same technique can be used to approximate the second term in (7), but all these approximations depend on the unknown $\tilde k_i$'s. To deal with this, we rearrange the terms in (7) as follows:
$d(i,i')^2 = \langle f(\xi_i,\cdot) - f(\xi_{i'},\cdot),\ f(\xi_i,\cdot) \rangle - \langle f(\xi_i,\cdot) - f(\xi_{i'},\cdot),\ f(\xi_{i'},\cdot) \rangle. \quad (8)$
The inner product on the right side of (8) can be estimated by

$\tilde d(i,i')^2 = \max_{k \ne i, i'} \big| \langle A_{i\cdot} - A_{i'\cdot},\ A_{k\cdot} \rangle \big| / n. \quad (9)$
Intuitively, the neighborhood $N_i$ should consist of nodes $i'$ with small $\tilde d(i,i')$. To formalize this, let $q_i(h)$ denote the $h$th sample quantile of the set $\{ \tilde d(i,i') : i' \ne i \}$, where $h \in (0,1)$ is a tuning parameter, and set

$N_i = \{ i' \ne i : \tilde d(i,i') \le q_i(h) \}, \quad (10)$

where for notational simplicity we suppress the dependence of $N_i$ on $h$. Thresholding at a quantile rather than at an absolute value is convenient since real networks vary in their average node degrees and other parameters, which leads to very different values and distributions of $\tilde d$. Empirically, thresholding at a quantile shows a significant advantage in stability and performance over an absolute threshold. The choice of $h$ will be guided both by the theory in Section 2.3, which suggests its order, and by empirical performance, which suggests the constant factor. More details are included in the Supplementary Material.
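The selection rule of (9) and (10) can be sketched as follows. This is an illustrative implementation under our reading of the dissimilarity and quantile rule: the function name, the bandwidth form $h = C (\log n / n)^{1/2}$, and the default constant $C = 1$ are our assumptions, not the paper's tuned choices.

```python
import numpy as np

def select_neighborhoods(A, C=1.0):
    """Quantile-thresholded neighborhood selection.

    The dissimilarity is max_{k != i, i'} |<A_i - A_i', A_k>| / n, as in (9),
    and N_i collects the i' whose dissimilarity falls below the h-th sample
    quantile of row i, as in (10).  The bandwidth h = C (log n / n)^{1/2}
    with C = 1 is an illustrative default."""
    n = A.shape[0]
    h = C * np.sqrt(np.log(n) / n)
    S = A @ A / n                        # S[i, k] = <A_i, A_k> / n
    neighborhoods = []
    for i in range(n):
        d = np.abs(S[i][None, :] - S)    # d[ip, k] = |<A_i - A_ip, A_k>| / n
        d[:, i] = 0.0                    # drop k = i
        np.fill_diagonal(d, 0.0)         # drop k = i'
        dist = d.max(axis=1)             # the bound (9) for each candidate i'
        dist[i] = np.inf                 # i is not its own candidate here
        thresh = np.quantile(dist[np.arange(n) != i], h)
        neighborhoods.append(np.flatnonzero(dist <= thresh))
    return neighborhoods
```

Note that the quantile threshold adapts per node: each row of dissimilarities sets its own cutoff, which is the stability property discussed above.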
An important feature of this definition is that the neighborhood $N_i$ admits nodes with similar graphon slices, but not necessarily similar latent positions. For example, in the stochastic block model, all nodes from the same block would be equally likely to be included in each other's neighborhoods, regardless of their $\xi_i$'s. Even though we use $f$ and the $\xi_i$'s to motivate (8), we always work with the function values $f(\xi_i, \xi_j)$ and never attempt to estimate $f$ or the $\xi_i$'s by themselves. This contrasts with the approaches of Chan & Airoldi (2014) and Yang et al. (2014), and gives us a substantial computational advantage as well as much more flexibility in assumptions.
2.3 Consistency of the neighborhood smoothing estimator
We study the theoretical properties of our estimator for a family of piecewise Lipschitz graphon functions, defined as follows.
Definition 2 (Piecewise Lipschitz graphon family)
For any $\delta, L > 0$, let $\mathcal F_{\delta; L}$ denote a family of piecewise Lipschitz graphon functions $f: [0,1]^2 \to [0,1]$ such that (i) there exists an integer $K \ge 1$ and a sequence $0 = x_0 < \cdots < x_K = 1$ satisfying $\min_{1 \le k \le K} (x_k - x_{k-1}) \ge \delta$, and (ii) both $|f(u_1, v) - f(u_2, v)| \le L |u_1 - u_2|$ and $|f(u, v_1) - f(u, v_2)| \le L |v_1 - v_2|$ hold for all $u_1, u_2 \in [x_{k-1}, x_k]$, $v_1, v_2 \in [x_{l-1}, x_l]$ and $1 \le k, l \le K$.
For any $n \times n$ matrix $M$, define $\|M\|_{2,\infty}$, the normalized $\ell_{2,\infty}$ matrix norm, by $\|M\|_{2,\infty} = \max_{1 \le i \le n} \big( n^{-1} \sum_{j=1}^n M_{ij}^2 \big)^{1/2}$.
Then we have the following error rate bound.
Theorem 1
Since $\|M\|_F \le \|M\|_{2,\infty}$ for any matrix $M$, Theorem 1 yields
Corollary 1
The bound (12) continues to hold for the symmetrized estimator, but (11) may not. Next, we show that under the Frobenius norm, our estimator is nearly rate-optimal, up to a logarithmic factor.
Theorem 2
To the best of our knowledge, result (11) is the only error rate available for polynomial-time graphon estimation methods. Most previous work focused on the mean squared error and only considered a special case. There, the minimax error rate established by Gao et al. (2015) has so far only been achieved by methods that require combinatorial optimization or evaluation, including Gao et al. (2015) and Klopp et al. (2017). A slower rate was previously achieved by combinatorial methods, including Wolfe & Olhede (arXiv:1309.5936) and Olhede & Wolfe (2014). Among computationally efficient methods, universal singular value thresholding (Chatterjee (2015), Theorem 2.7) achieves a slower rate still. Additionally, the sort-and-smooth method proposed by Chan & Airoldi (2014) achieves the minimax error rate under the strong assumption that $f$ has strictly monotone expected node degrees. An anonymous referee of this manuscript sent us a proof that thresholding the leading singular values of the matrix $A$ achieves a mean squared error that decomposes into a variance term, due to Candes & Plan (2011), and a bias term. Balancing the two gives the best known mean squared error rate for a computationally efficient algorithm. For the graphon family that we study, the singular value thresholding method and our method achieve the same mean squared error rate.

For the general case, we can show that the minimax rate established by Gao et al. (2015) still holds for the family $\mathcal F_{\delta; L}$, in Proposition 1; see the Supplementary Material.
Proposition 1
Whether this minimax error rate can be achieved by a computationally efficient method remains an open question.
3 Probability matrix estimation on synthetic networks
In this section we evaluate the performance of our symmetric estimator (5) for estimating the probability matrix $P$ on synthetic networks. We generate the networks from the four graphons listed in Table 3, selected to exhibit different features in different combinations (monotone degrees, low rank, etc.). The corresponding probability matrices are pictured in the first column of Figure 1 (lower triangular half). All networks have the same number of nodes.

Additional empirical results in the Supplementary Material show that our method is robust to the choice of the constant factor in the bandwidth; for simplicity, we use the same constant throughout the rest of this paper. Here we focus on comparing to benchmarks. From the general matrix denoising methods, we include the widely used universal singular value thresholding (Chatterjee, 2015) and the leading singular value thresholding method suggested by a referee. We also compare to the sort-and-smooth methods of Chan & Airoldi (2014) and Yang et al. (2014). These methods differ only in that the latter employs singular value thresholding to denoise the network as a preprocessing step. Due to space constraints, we present both methods in Table 3 but only Chan & Airoldi (2014) in figures, since they are visually very similar.
We also include two approximations based on fitting a stochastic block model, called network histograms by Olhede & Wolfe (2014). One is the oracle stochastic block model, where the blocks are based on the true values of the latent $\xi_i$'s. This cannot be done in practice, but we use it as the gold standard for a step-function approximation. The feasible version is an approximation based on a stochastic block model with estimated blocks; we fit it by regularized spectral clustering (Chaudhuri et al., 2012). Any other algorithm for fitting the stochastic block model could be used to estimate the blocks; for example, Olhede & Wolfe (2014) used a local updating algorithm initialized with spectral clustering to compute their network histograms. Here we chose regularized spectral clustering because of its speed and good empirical performance. For both approximations, we set the number of blocks as in Olhede & Wolfe (2014).

A recent, as yet unpublished method kindly shared with us by E. Airoldi proposes a stochastic block model approximation, adapting the method of Airoldi et al. (2013) to work with a single adjacency matrix. It uses a different dissimilarity measure, which we considered before choosing (9), because (9) leads to a better guaranteed error rate. Airoldi's method then builds blocks by starting with one not-yet-clustered node and including as neighbors all nodes whose dissimilarity from it is below a threshold. We found that our strategy of thresholding by quantile instead of a fixed threshold is more efficient and stable, and the theoretical error rate is better for our method.
We present heatmaps of the results for a single realization in Figure 1, and the root mean squared errors and mean absolute errors of $\hat P$ in Table 3. While these two errors mostly agree on method ranking, the few cases where they disagree indicate whether the errors come primarily from a small number of poorly estimated entries or are distributed more uniformly throughout the matrix.
Graphon 1 has blocks with different within-block edge probabilities, which all dominate the low between-block probability. The best results are obtained by our method, singular value thresholding, spectral clustering, and the oracle stochastic block model approximation, which is expected given that the data are generated from a stochastic block model. The oracle uses a fixed number of blocks rather than the true number, and thus makes substantial errors on the block boundaries, but nowhere else. The method of Chan & Airoldi (2014) correctly estimates the main blocks because they have different expected degrees, but suffers from boundary effects due to smoothing over the entire region. In contrast, our method, which determines smoothing neighborhoods based on similarities of graphon slices, does not suffer from boundary effects. Chatterjee (2015) does a good job on denser blocks but thresholds away sparser blocks. Airoldi's method captures tightly connected communities, but does not do as well on weaker communities.
Graphon 2 lacks node degree monotonicity, and thus the method of Chan & Airoldi (2014) does not work here. Spectral clustering also performs poorly, likely because it uses too many eigenvectors, which add noise. Airoldi's method and the stochastic block model oracle give grainy but reasonable approximations to $P$, and the best results are obtained by our method, Chatterjee (2015), and leading singular value thresholding. The latter two are expected to work well since this is a low-rank matrix.

Graphon 3 has a diagonally dominant probability matrix, and our method is the best among the computationally efficient methods. The method of Chatterjee (2015) does not perform well because this is not a low-rank matrix; spectral clustering, on the other hand, does fine, because there are many nonzero eigenvalues and the eigenvectors contain enough information. Leading singular value thresholding does better than Chatterjee (2015) and provides a lower-resolution denoising. Airoldi's method only roughly captures the structure, likely due to the dissimilarity measure it uses. The method of Chan & Airoldi (2014) fails since all expected node degrees are almost the same.
Graphon 4 is difficult to estimate for all methods. It is full rank, with structure at different scales. This makes it a difficult setting for low-rank approximations, among which singular value thresholding alone uses enough eigenvalues to produce a reasonable result, albeit with boundary effects. This graphon is not a block matrix, and thus spectral clustering does not perform well. The expected node degrees are not all equal, but their ordering does not match the ordering of the latent node positions, so this graphon is also difficult for the sort-and-smooth method of Chan & Airoldi (2014). Our method successfully picks up the global structure and the curvature. While visually it is fairly similar to the result of singular value thresholding, our method has significantly better errors in Table 3. Overall, this example illustrates a limitation of all global methods when there are subtle local differences.
Table 3 shows the root mean squared errors and the mean absolute errors of all methods on the four graphons, averaged over 2000 replications. The results generally agree with those shown in the figures. The few relative discrepancies between the two errors occur when there is a small number of large errors, such as the boundary effects of the oracle for graphon 1, which affect the root mean squared error more than the mean absolute error.

For graphon 1, our method and spectral clustering perform best. For graphon 2, our method is only outperformed by universal singular value thresholding, whereas leading eigenvalue thresholding selects fewer eigenvalues than needed. For graphon 3, our method is comparable to leading eigenvalue thresholding, and both are better than the other methods, not counting the oracle. For graphon 4, our method shows a significant advantage over all other methods except the oracle. Thus in all cases, our method shows very competitive performance compared to the benchmarks.
Overall, the results in this section show that various previously proposed methods can perform very well under their assumptions, which may be monotone degrees or lowrank or an underlying block model, but they fail when these assumptions are not satisfied. Our method is the only one among those compared that performs well in a large range of scenarios, because it learns the structure from data via neighborhood selection instead of imposing a priori structural assumptions. The singular value thresholding method also shows consistent performance across all graphons, although in all cases somewhat worse than ours. It performs very well in the lowrank case, but if the leading singular values decay slowly, our method performs better.
4 Application to link prediction
Evaluating probability matrix estimation methods directly on real networks is difficult, since the true probability matrix is unknown. We assess the practical utility of our method by applying it to link prediction, a task that relies on estimating the probability matrix. Here we think of the true adjacency matrix $A$ as unobserved, with binary edges drawn independently with probabilities given by $P$, also unobserved. Instead, we observe a partially masked version of $A$, in which each true edge is independently hidden with some unknown probability. Therefore an observed edge is always a true edge, but an observed non-edge could be either a true non-edge or a false negative. This setting is different from, and perhaps more realistic than, the link prediction setting in Gao et al. (2016), who assumed that the locations of the masked entries are observed. Under their setting, the missing rate can be estimated by the empirical missing rate, and all estimators can be corrected for missingness simply by dividing by the observed proportion.
A link prediction method usually outputs a nonnegative score matrix, with scores giving the estimated propensity of a node pair to form an edge. For methods that estimate the probability matrix, the score can be taken to be $\hat P$ itself; other link prediction methods construct scores by working directly on the observed adjacency matrix. Both types of methods essentially output a ranked list of the most likely missing links, useful in practice for follow-up confirmatory analysis.
We compare the link prediction methods via their receiver operating characteristic curves. For each threshold value, we define the false positive and true positive rates by counting, among the candidate node pairs, the predicted links that are absent and present, respectively, in the true network. Then, by varying the threshold, we obtain the receiver operating characteristic curve. In practice, the threshold is often selected to output a fixed number of most likely links.
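A minimal sketch of computing these curves for a score matrix follows. It assumes, as a simplification of the masking setup above, that the candidate pairs are exactly those with no observed edge; the function name and interface are ours.

```python
import numpy as np

def roc_points(score, A_true, A_obs):
    """ROC for link prediction: rank the pairs that are zero in the
    observed network and compare against the full true network.

    Uses upper-triangular pairs only; returns (fpr, tpr) arrays traced
    out by sweeping the score threshold from high to low."""
    iu = np.triu_indices_from(A_true, k=1)
    cand = A_obs[iu] == 0                    # pairs with no observed edge
    s, y = score[iu][cand], A_true[iu][cand]
    order = np.argsort(-s, kind="stable")    # highest scores first
    y = y[order]
    tp, fp = np.cumsum(y), np.cumsum(1 - y)
    pos, neg = max(y.sum(), 1), max((1 - y).sum(), 1)
    return fp / neg, tp / pos
```

Ranking all candidate pairs at once, rather than fixing a single threshold, is what makes the full curve (and its area) a threshold-free comparison between methods.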
In this section we include three additional benchmark methods that produce score matrices rather than estimated probability matrices. One standard score is the Jaccard index; see for example Lichtenwalter et al. (2010). The method of Zhao et al. (2017) computes scores such that similar node pairs have similar predicted scores. The PropFlow algorithm of Lichtenwalter et al. (2010) uses an expected random walk distance between nodes as the score.

We first compare all methods on simulated networks generated from the graphons in Table 3. Due to the computational cost of some of the benchmarks, we use smaller networks here. All experiments are repeated 1000 times. Figure 2 in the Supplementary Material shows the receiver operating characteristic curves for the four graphons. Most differences between the methods can be inferred from Figure 1. Overall, the methods based on graphon estimation outperform the score-based methods. Our method outperforms all other methods on this task, producing a receiver operating characteristic curve very close to that based on the true probability matrix $P$.
We also applied our method and the others to the political blogs network (Adamic & Glance, 2005). This network consists of 1222 manually labelled blogs, 586 liberal and 636 conservative. The network clearly shows two communities, with heterogeneous node degrees (there are hubs). We removed 10% of the edges at random and calculated the receiver operating characteristic curve for predicting the missing links, shown in Figure 2. Again, methods based on estimating the probability matrix performed much better than the scoring methods, and our method performs best overall. The sort-and-smooth methods slightly outperformed spectral clustering and Chatterjee (2015), perhaps due to the presence of hubs.
5 Discussion
The strength of our method is the adaptive neighborhood choice, which works well under many different conditions; it is also computationally efficient, easy to implement, and essentially tuning-free. Its main limitation is the piecewise Lipschitz condition, which occasionally leads to oversmoothing. Our method does not achieve the minimax error rate, and its rate cannot be improved; whether the minimax rate can be achieved by any polynomial-time method is, to the best of our knowledge, an open problem. Another major future challenge is relaxing the assumption of independent edges to better fit real-world networks.
Acknowledgments
The authors thank an associate editor and two anonymous referees for very helpful suggestions, and E. M. Airoldi of Harvard University for sharing a copy of his manuscript and code, and for helpful comments. E.L. is supported by grants NSF DMS 1521551 and ONR N000141612910. J.Z. is supported by grants NSF DMS 1407698, KLAS 130026507, and KLAS 130028612.
Supplementary material
6 Choosing the constant factor for the bandwidth
First, we need to choose the quantile cutoff parameter $h$, which controls neighborhood selection. Theorem 1 gives the order of $h$, and the following numerical experiments empirically justify our choice of the constant factor. Figure 3 shows the mean squared error curves for networks generated from the four graphons in Table 3, with the constant factor varying over a range of values.

Figure 3 demonstrates that any constant in a fairly wide range works equally well for all these very different graphons. This suggests empirically that the method is robust to the choice of the constant, and we therefore use a single fixed value in all numerical results in the paper.
7 Receiver operating characteristic curves for link prediction simulations in Section 4
The results are presented in Figure 4. The link prediction results generally agree with how the methods performed on the task of estimating the probability matrix for these synthetic graphons, shown in Section 3 of the main manuscript.
8 Root mean squared errors for simulations in Section 3
In Table 1, we report the root mean squared errors of $\hat P$ for all methods in Section 3. The performance of our method is slightly better than, but comparable to, singular value thresholding and Chatterjee (2015), and these three methods are generally significantly better than the other practical methods. These results suggest there may be an error bound applicable to the low-rank methods, even though they were not designed to control this type of error; investigating this is outside the scope of this manuscript.
Method                        | Graphon 1  | Graphon 2  | Graphon 3  | Graphon 4
Our method                    | 484(012)   | 597(008)   | 560(011)   | 713(011)
Chan & Airoldi (2014)         | 3887(035)  | 5044(080)  | 1374(009)  | 943(014)
Yang et al. (2014)            | 3887(041)  | 4977(074)  | 1311(004)  | 775(007)
Singular value thresholding   | 655(018)   | 834(015)   | 522(011)   | 1367(024)
Block model spectral          | 1665(138)  | 4534(039)  | 1300(057)  | 2950(053)
Block model oracle            | 3536(004)  | 947(002)   | 253(002)   | 173(003)
Chatterjee (2015)             | 877(001)   | 393(010)   | 842(007)   | 773(006)
Airoldi's method              | 3474(046)  | 6697(012)  | 1679(007)  | 3047(074)
9 Proofs
Proof (of Theorem 1)
For convenience, we start by summarizing notation and assumptions from the main paper. Let $I_k = [x_{k-1}, x_k)$ for $k = 1, \dots, K$. Assume the graphon $f$ is a bi-Lipschitz function on each $I_k \times I_l$ for $1 \le k, l \le K$. Let $L$ denote the maximum piecewise bi-Lipschitz constant. The number of pieces $K$ may grow with $n$, as long as the minimum piece length is at least $\delta$.
For any , let denote the that contains . Let denote the neighborhood of in which is Lipschitz in for any fixed . Finally, recall our estimator is defined by
To prove the main theorem, it suffices to show that with high probability
holds for all . We begin the proof with the following decomposition of the error term:
We can bound the summand by
(14) 
Our goal is to bound . First, we prove a lemma which estimates the proportion of nodes in a diminishing neighborhood of ’s.
Lemma 1
For arbitrary global constants , , define . For large enough so that and , we have
(15) 
Proof (of Lemma 1)
For any and large enough to satisfy the assumptions, by Bernstein’s inequality we have, for any ,
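For reference, the generic form of Bernstein's inequality being invoked, for independent, centered variables $X_1, \dots, X_m$ bounded by $M$, is:

```latex
% Bernstein's inequality for independent, centered X_1, ..., X_m with |X_i| <= M:
\operatorname{pr}\Bigl( \Bigl| \sum_{i=1}^{m} X_i \Bigr| \ge t \Bigr)
\le 2 \exp\Biggl( - \frac{t^2/2}{\sum_{i=1}^{m} E(X_i^2) + Mt/3} \Biggr)
```

Applied here with Bernoulli-type summands, the variance term dominates, which is what produces the exponential tail used in the union bound below.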
Taking a union bound over all ’s gives
Letting , we have
(16) 
We now continue with the proof of Theorem 1. Recall that we defined a measure of closeness of adjacency matrix slices in Section 2 as
The neighborhood $N_i$ of node $i$ consists of nodes $i'$ with $\tilde d(i,i')$ below the $h$th sample quantile of $\{\tilde d(i,i') : i' \ne i\}$. The next lemma shows two key properties of $N_i$.
Lemma 2
Suppose we select the neighborhood by thresholding at the lower th quantile of , where we set with an arbitrary global constant satisfying for the from Lemma 1. Let be arbitrary global constants and assume is large enough so that (i) All conditions on in Lemma 1 are satisfied; (ii) ; (iii) ; and (iv) . Then the neighborhood has the following properties:

.

With probability , for all and , we have
Proof (of Lemma 2)
The first claim follows immediately from the choice of and the definition of . To show the second claim, we start with concentration results. For any such that , we have
(17) 
By Bernstein’s inequality, for any and we have
Taking a union bound over all , we have