Network Estimation via Graphon with Node Features

Estimating the probabilities of linkages in a network has gained increasing interest in recent years. One popular model for network analysis is the exchangeable graph model (ExGM), characterized by a two-dimensional function known as a graphon. Estimating the underlying graphon is thus the key to such analyses. Several nonparametric estimation methods have been proposed, and some are provably consistent. However, if certain useful features of the nodes (e.g., age and school in a social network context) are available, none of these methods is designed to incorporate this source of information to help with the estimation. This paper develops a consistent graphon estimation method that integrates information from both the adjacency matrix itself and the node features. We show that properly leveraging the features can improve the estimation. A cross-validation method is proposed to automatically select the tuning parameter of the method.





1 Introduction

A network (undirected simple graph) can be modeled as a partial observation of an infinite random graph. The exchangeable random graph model (ExGM) is a popular nonparametric model for infinite graphs whose node indices are exchangeable (e.g., Hoff, 2008; Kallenberg, 2006; Lovász, 2012; Orbanz and Roy, 2015), i.e., the joint distribution of edges is invariant under permutations of the node indices. For instance, in a social network where a node represents a person, the assignment of nodes to people does not carry any information, and swapping the node indices of any two people (i.e., relabeling) defines the same network. An ExGM is characterized by a symmetric measurable function W : [0,1]² → [0,1] known as a graphon (Aldous, 1981; Hoover, 1979), which therefore plays a central role in model-based inference and prediction of network data under an ExGM. Based on the Aldous-Hoover theorem, we assume the following generative model of the network via the graphon W: a set of latent labels ξ1, …, ξn, each associated with a node, are first drawn independently from Uniform[0, 1]. These labels govern the probability of observing an edge between the corresponding two nodes through the graphon. More specifically, given ξi and ξj, the probability that there is a connection between the i-th node and the j-th node is W(ξi, ξj). Given these probabilities, edges are then generated independently of each other conditional on ξ1, …, ξn.
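The two-step generative scheme can be sketched in a few lines of Python; the graphon used here is a hypothetical smooth example for illustration, not one from the paper:

```python
import numpy as np

def sample_network(W, n, rng):
    """Sample an n-node undirected simple graph from a graphon W.

    Step 1: draw latent labels xi_1, ..., xi_n i.i.d. from Uniform[0, 1].
    Step 2: conditional on the labels, connect nodes i and j independently
            with probability W(xi_i, xi_j).
    """
    xi = rng.uniform(size=n)                      # latent labels
    P = W(xi[:, None], xi[None, :])               # P_ij = W(xi_i, xi_j)
    draws = rng.uniform(size=(n, n)) < P          # Bernoulli draws
    A = np.triu(draws, k=1)                       # keep i < j only (simple graph)
    return (A | A.T).astype(int), xi, P           # symmetric, zero diagonal

# Example with an illustrative smooth monotone graphon.
rng = np.random.default_rng(0)
A, xi, P = sample_network(lambda u, v: 0.25 + 0.25 * (u + v), 200, rng)
```

Note that P is the conditional probability matrix that later sections aim to estimate; A alone is what the estimator observes.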

In general, the graphon provides a unified and solid framework for modeling networks. For instance, the community structures widely used in the modeling of social networks correspond to a parametric, piecewise-constant graphon. More importantly, the graphon opens up an opportunity for more flexible but challenging nonparametric modeling, which has sparked a recent surge of interest among researchers (e.g., Airoldi et al., 2013; Wolfe and Olhede, 2013; Chan and Airoldi, 2014; Gao et al., 2015; Zhang et al., 2017; Klopp et al., 2017) and is also the focus of the present work. Since knowledge of the graphon facilitates our understanding of the underlying network, nonparametric graphon estimation helps discover unknown patterns in the corresponding network generation mechanism. Besides, statistical inference on networks can also be conducted via the graphon (e.g., Lloyd et al., 2012; Yang et al., 2014).

A typical assumption adopted in nonparametric graphon estimation is the smoothness of the underlying graphon W. If we were given the latent labels ξ1, …, ξn, graphon estimation would simply be a nonparametric regression problem. However, the ξi are not observed, which poses a unique challenge. Due to the smoothness assumption, various researchers have made use of the idea that “similar labels” produce “similar graphon slices”, where a graphon slice at a label u is the one-dimensional function W(u, ·). In other words, if the labels ξi and ξj of two nodes are close, the slices W(ξi, ·) and W(ξj, ·) should be similar. As the graphon is unknown, several methods (e.g., Airoldi et al., 2013; Chan and Airoldi, 2014; Gao et al., 2015) instead use the rows and/or columns of the adjacency matrix as a proxy for graphon slices to describe the distance between nodes, upon which smoothing procedures can be built. Despite the success of many existing methods relying solely on the adjacency matrix, features (or attributes) of the nodes are very often available aside from the network (adjacency matrix) itself, and could potentially provide important information for network estimation. Take the Facebook friendship network as an example. It is conceivable that users who share similar values of certain features (e.g., age and schools) will have similar connection behaviors. These additional features can be a valuable resource for better estimating the underlying probabilities of linkages and the graphon.

The main contribution of the paper is the proposal of a nonparametric graphon estimation method that is capable of utilizing the information hidden in the node features for better network estimation. This can be realized by relating node features to graphon slices through the latent labels. That is, close labels should correspond to both similar graphon slices and similar node features.

The rest of this paper is organized as follows. In Section 2, we will review some basic definitions and related work on graphon estimation, and summarize our contribution. We introduce the proposed framework and estimation method in Section 3. Theoretical results are provided in Section 4, while numerical experiments on synthetic graphons are shown in Section 5. Finally, we apply our method to a real-world friendship network in Section 6, and supplementary material is deferred to the appendix.

2 Background

This section presents the necessary background material. In the sequel, for any matrix M, we use M_ij, M_i· and M_·j to denote its (i, j)-th element, i-th row and j-th column, respectively.

2.1 Graphon, exchangeability, and identifiability

Let A ∈ {0, 1}^{n×n} be the adjacency matrix of an undirected simple graph with n nodes (n can be infinity); i.e., A_ij = 1 if the i-th node and the j-th node are connected, and A_ij = 0 otherwise. For an infinitely sized graph, we say it is exchangeable if the distribution of A is invariant under any permutation of the nodes. The Aldous-Hoover theorem (Aldous, 1981; Hoover, 1979) guarantees that every ExGM can be represented by a graphon.

  • (Graphon) A graphon is a symmetric measurable function W : [0,1]² → [0,1] such that

    A_ij | ξi, ξj ~ Bernoulli(W(ξi, ξj)) independently,

    where ξi ~ Uniform[0, 1] independently for i = 1, 2, ….

A network of size n can be modeled as a partial observation of an ExGM, and thereby generated by the following two-step sampling scheme:

ξ1, …, ξn ~ i.i.d. Uniform[0, 1];   A_ij | ξi, ξj ~ Bernoulli(W(ξi, ξj)) independently for i < j.    (1)
Identifiability is a well-known issue for graphons: different graphons can give rise to the same ExGM. Specifically, for any measure-preserving transformation σ : [0,1] → [0,1], W(x, y) and W(σ(x), σ(y)) define the same ExGM; that is, the distributions of the two generated random arrays are the same. To guarantee a unique representation, one can impose the strict monotonicity of degree condition (Bickel and Chen, 2009; Yang et al., 2014), which assumes that there exists a measure-preserving transformation σ such that W^can(x, y) = W(σ(x), σ(y)) and the degree function

g(u) = ∫ W^can(u, v) dv

is strictly increasing in u. Here W^can is called the canonical form of the graphon W, and it is a unique representation of the underlying ExGM. However, this assumption is restrictive because it excludes commonly used models such as the stochastic block model. In our framework, we do not enforce strictly monotone node degrees.

In principle, one cannot determine which graphon from its equivalence class generated the underlying network based on the adjacency matrix alone. There are two layers of estimation: the first is the estimation of the graphon W, and the second is the estimation of the latent labels ξ1, …, ξn. Since it is unrealistic to estimate the labels without strong assumptions, the main purpose of estimating a graphon is sometimes to obtain the probabilities of linkages at the observed nodes, P_ij = W(ξi, ξj). This is also the goal of this paper.

2.2 Related work

In the graphon estimation literature, a commonly adopted strategy is to employ the graphon slices (which can be approximated by the rows and columns of the adjacency matrix) to define a distance between nodes. With this distance, one can group similar nodes into blocks and estimate the graphon value within each block by the average number of edges in it.

Airoldi et al. (2013) proposed the Stochastic Blockmodel Approximation (SBA) algorithm, which approximates the graphon by a piecewise constant function. Their estimator is consistent in mean squared error, but a key assumption is that there are at least two independent realizations generated from the same graphon, which is unlikely to hold in reality. They group the nodes into blocks, and the estimated graphon is a piecewise constant function over these blocks.

Some other methods are based on the strong assumption of strict monotonicity of degree (Chan and Airoldi, 2014; Yang et al., 2014), under which a canonical graphon is well-defined and hence treated as the estimand of interest. One representative of this category is the Sorting-and-Smoothing (SAS) algorithm proposed by Chan and Airoldi (2014). It first sorts the nodes according to their empirical degrees, then computes a local histogram estimator with some bandwidth h, and finally applies a smoothing technique to obtain the final estimate. The SAS estimator is consistent, and its rate of convergence matches the optimal rate in general graphon estimation without the assumption of strict monotonicity of degree (Gao et al., 2015).

Another popular method is Universal Singular Value Thresholding (USVT), proposed by Chatterjee et al. (2015), which targets general matrix denoising problems with missing values. Since this method is not designed specifically for graphons, its rate of convergence is not competitive.

More recently, Zhang et al. (2017) proposed a novel Neighborhood Smoothing (NBS) method for estimating the underlying probability matrix P, which is equivalent to estimating the graphon at the observed labels. Different from the SBA and SAS algorithms, these authors proposed an adaptive neighborhood selection method that allows each node to have its own neighborhood. The NBS method performs very well for a wide range of graphons in both low-rank and high-rank situations, and the only assumption on the graphon is piecewise Lipschitz continuity. These authors also showed that the error rate of NBS is the smallest among all existing non-combinatorial methods.

2.3 Our contribution

To the best of our knowledge, none of the existing methods is designed to utilize information other than the adjacency matrix itself for nonparametric network/graphon estimation. With additional node features, the estimation could be much improved. In this paper, we propose a generative model of node features which allows borrowing information from the features in an adaptive manner to improve network/graphon estimation. If similar node features correspond to similar graphon slices, these features are valuable, especially when the graphon itself has weak or local signals. In contrast, it can happen that two nodes with identical attributes behave very differently. In such scenarios, it is unwise to contaminate the estimation with these unhelpful features. We avoid this contamination by adaptively selecting a tuning parameter that controls the weight given to the features’ information.

3 Methodology

We begin with some notation. Recall that A ∈ {0, 1}^{n×n} is an observed adjacency matrix generated by the graphon W. That is,

A_ij | ξi, ξj ~ Bernoulli(W(ξi, ξj)) independently,

where ξi ~ Uniform[0, 1] independently for i = 1, …, n. For each node i, we also observe a feature vector X_i ∈ R^d, i = 1, …, n. We assume that these X_i's are also generated from the (unknown) latent labels:

X_i = f(ξi) + ε_i,    (3)

where f : [0, 1] → R^d is an unknown function, and ε_i is a random vector with independent entries of mean zero and common variance σ². Also, ε_1, …, ε_n are mutually independent. We note that, for each i, the elements of X_i are in general dependent due to the shared label ξi entering through f; the assumption of constant variance can be relaxed easily. In addition, we assume that W and f are piecewise Lipschitz functions, as defined in Section 4.

Although we aim to utilize the features for better graphon estimation, our feature model (3) is fairly general and does not assume a priori that the features are useful for graphon estimation. To see this, note that the essential information in X_i for graphon estimation is captured by the hidden label ξi. When f is monotonic, close feature vectors correspond to close latent labels, which generate similar slices in the adjacency matrix (under the smoothness assumption on W). In this case, feature similarity is a helpful source of information to borrow from. On the other hand, when f is non-monotonic, close features do not necessarily imply similar graphon slices. For example, if f(u) = f(v) for very different u and v, then whether feature similarity is useful depends on whether the graphon slice W(u, ·) is close to W(v, ·). Given that we do not know whether the latter holds, the use of features could worsen the graphon estimation. In the subsequent sections, we develop a method that allows adaptive incorporation of feature information via a tuning parameter, as well as a data-adaptive method for choosing this parameter.
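To make the feature model concrete, here is a minimal simulation sketch; the particular feature map f below is a hypothetical monotone choice, not one used in the paper:

```python
import numpy as np

def sample_features(xi, f, noise_sd, rng):
    """Generate X_i = f(xi_i) + eps_i, where eps_i has independent
    mean-zero entries with common standard deviation noise_sd."""
    F = np.stack([f(x) for x in xi])              # n x d matrix of f(xi_i)
    return F + rng.normal(scale=noise_sd, size=F.shape)

rng = np.random.default_rng(1)
xi = rng.uniform(size=100)
f = lambda x: np.array([x, x ** 2])               # assumed monotone map, d = 2
X = sample_features(xi, f, noise_sd=0.1, rng=rng)
```

Under a non-monotone choice of f (e.g., a periodic map), the same code generates features whose similarity no longer tracks label similarity, which is exactly the unhelpful case discussed above.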

In what follows, we use P to denote the underlying (conditional) probability matrix with entries P_ij = W(ξi, ξj), i.e., P(A_ij = 1 | ξi, ξj) = P_ij. Our goal is to estimate P.

3.1 Feature Assisted Neighborhood Smoothing (FANS)

This subsection provides a general description of the proposed method for graphon estimation. The method is called FANS, short for Feature Assisted Neighborhood Smoothing.

Since the latent labels are unavailable, the key to estimating a graphon is to define a measure of node dissimilarity. Here we define a (squared) dissimilarity between the i-th node and the j-th node (i ≠ j) as the weighted sum of two terms:

d²(i, j) = d_A²(i, j) + λ d_X²(i, j),    (4)

where the relative weight is determined by a tuning parameter λ ≥ 0. In (4), d_A is a distance measure for graphon slices while d_X is a distance measure for features; the exact forms of these two measures are given in Section 3.2. The parameter λ controls how much information we borrow from the node features. When the features are not helpful, we can avoid using them by setting λ = 0. We discuss a data-driven choice of λ later.

The first step of the proposed estimation method is to estimate d²(i, j). Once such an estimate d̂²(i, j) is obtained, the next step is to obtain a neighborhood for each node. Similar to the NBS method (Zhang et al., 2017), we define the neighborhood of the i-th node as

N_i = { j ≠ i : d̂²(i, j) ≤ q_i(h) },    (5)

where q_i(h) is the h-th sample quantile of the set { d̂²(i, j) : j ≠ i }, and h = C √(log n / n) with a global constant C > 0. In our experience, the performance of the proposed method is not sensitive to the choice of C in a mild range between 0.5 and 1.5; in practice, we set C to a fixed value in this range. Unlike the SBA and SAS algorithms, this neighborhood differs from node to node. Finally, the estimated graphon evaluated at (ξi, ξj) is given by the symmetrized neighborhood average

P̂_ij = ( (1/|N_i|) Σ_{k∈N_i} A_kj + (1/|N_j|) Σ_{k∈N_j} A_ik ) / 2.    (6)
To sum up, the proposed FANS method consists of the following three major steps:

  1. Obtain an estimate d̂²(i, j) of d²(i, j) in (4), where d_A² and d_X² are estimated as in Section 3.2, and λ is chosen by Algorithm 2.

  2. Calculate the neighborhoods N_i for all i using (5).

  3. Compute the estimated probability matrix P̂ with (6).

Details of these three steps are given below. See also Algorithm 1.
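The three steps can be sketched as follows. This is a deliberately simplified rendering: it uses plain Euclidean distances between adjacency rows and between features rather than the upper-bound estimates of Section 3.2, and a fixed lambda and quantile level instead of the data-driven choices:

```python
import numpy as np

def fans_estimate(A, X, lam, h):
    """Simplified FANS sketch: combined dissimilarity, per-node
    neighborhoods at the h-quantile, then neighborhood averaging."""
    n = A.shape[0]
    # Step 1: squared dissimilarity d2 = d_A^2 + lam * d_X^2 (plain distances here).
    dA2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1) / n
    dX2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    d2 = dA2 + lam * dX2
    # Step 2: neighborhood of node i = other nodes within the h-quantile of d2[i, :].
    neighbors = []
    for i in range(n):
        others = np.delete(np.arange(n), i)
        q = np.quantile(d2[i, others], h)
        neighbors.append(others[d2[i, others] <= q])
    # Step 3: average adjacency rows over each neighborhood, then symmetrize.
    P_hat = np.zeros((n, n))
    for i in range(n):
        P_hat[i, :] = A[neighbors[i], :].mean(axis=0)
    return (P_hat + P_hat.T) / 2

rng = np.random.default_rng(2)
A = rng.integers(0, 2, size=(50, 50)); A = np.triu(A, 1); A = A + A.T
X = rng.normal(size=(50, 3))
P_hat = fans_estimate(A, X, lam=0.5, h=0.3)
```

Because each entry of P_hat is an average of 0/1 adjacency entries, the output is automatically a symmetric matrix of values in [0, 1].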

3.2 Defining d_A and d_X

Following the ideas of Airoldi et al. (2013) and Zhang et al. (2017), we define d_A using the distance between graphon slices. To be more specific, for any i ≠ j, define

d_A²(i, j) = ⟨W(ξi, ·) − W(ξj, ·), W(ξi, ·) − W(ξj, ·)⟩
           = ⟨W(ξi, ·), W(ξi, ·)⟩ + ⟨W(ξj, ·), W(ξj, ·)⟩ − 2⟨W(ξi, ·), W(ξj, ·)⟩.

With slight abuse of notation, we use ⟨·, ·⟩ to denote both the inner product of two functions and the Euclidean inner product of two vectors.

We immediately notice that the last term can be estimated by ⟨A_i·, A_j·⟩/n, because the entries of A_i· and A_j· are “almost” independent (except for A_ij and A_ji). However, ⟨W(ξi, ·), W(ξi, ·)⟩ cannot be well estimated by ⟨A_i·, A_i·⟩/n (as an analogy, consider estimating p² from a Bernoulli(p) observation); similarly for ⟨W(ξj, ·), W(ξj, ·)⟩. The SBA algorithm solves this issue by requiring that at least two independent copies of the network are observed.

Following Zhang et al. (2017), we instead use an approximate upper bound of d_A²(i, j), which is motivated by the following heuristic argument. First,

d_A²(i, j) = ⟨W(ξi, ·) − W(ξj, ·), W(ξi, ·)⟩ − ⟨W(ξi, ·) − W(ξj, ·), W(ξj, ·)⟩.    (7)

With a large sample, it is likely that there exist i′ ≠ i and j′ ≠ j such that |ξi′ − ξi| ≤ δ and |ξj′ − ξj| ≤ δ for some small δ > 0. Suppose W(u, v), as a function of u, has Lipschitz constant L for every v. For the first term of (7), we have

⟨W(ξi, ·) − W(ξj, ·), W(ξi, ·)⟩ ≤ max_{k≠i,j} ⟨W(ξi, ·) − W(ξj, ·), W(ξk, ·)⟩ + Lδ,

since ∥W(ξi, ·) − W(ξi′, ·)∥ ≤ Lδ and the slices are bounded by 1. Similarly for the second term of (7). Therefore, we have

d_A²(i, j) ≤ 2 max_{k≠i,j} |⟨W(ξi, ·) − W(ξj, ·), W(ξk, ·)⟩| + 2Lδ.

Disregarding the multiplicative constant and the small term 2Lδ, this bound can be estimated by

d̂_A²(i, j) = max_{k≠i,j} |⟨A_i· − A_j·, A_k·⟩| / n.    (8)

In the same vein, we define d_X²(i, j) = ⟨f(ξi) − f(ξj), f(ξi) − f(ξj)⟩, and the estimator of its upper bound (up to a multiplicative constant) is

d̂_X²(i, j) = max_{k≠i,j} |⟨X_i − X_j, X_k⟩|,    (9)

where X_k is the feature vector of the k-th node. The use of these upper bounds and their estimates will be justified both theoretically and empirically in subsequent sections.
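Assuming normalization by n, the slice-distance estimate can be computed for all pairs at once. This is a sketch of the Zhang et al. (2017)-style bound, not the authors' exact implementation:

```python
import numpy as np

def slice_distance(A):
    """Estimate d_A^2(i, j) by max_{k != i, j} |<A_i - A_j, A_k>| / n,
    the upper-bound proxy for the graphon-slice distance."""
    n = A.shape[0]
    G = A @ A.T / n                                # G[i, k] = <A_i, A_k> / n
    diff = np.abs(G[:, None, :] - G[None, :, :])   # |<A_i - A_j, A_k>| / n
    # Exclude k = i and k = j from the maximum.
    idx = np.arange(n)
    diff[idx, :, idx] = -np.inf
    diff[:, idx, idx] = -np.inf
    d2 = diff.max(axis=2)
    np.fill_diagonal(d2, 0.0)
    return d2

rng = np.random.default_rng(3)
A = rng.integers(0, 2, size=(30, 30)); A = np.triu(A, 1); A = A + A.T
d2 = slice_distance(A)
```

The O(n³) memory of the all-pairs tensor is fine for moderate n; for large networks one would loop over k instead.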

3.3 Tie-corrected dissimilarity

Because of the upper-bound construction and the fact that the A_ij's are binary, d̂_A² suffers from ties. For now suppose λ = 0, i.e., no node feature is used. When W is a piecewise constant function (of which the stochastic block model is an example), d̂_A² may contain a large number of ties. In our simulations we found that the number of unique values in { d̂_A²(i, j) : i < j } can be as small as 65. These ties cause a problem when defining N_i, because the set of boundary points { j : d̂²(i, j) = q_i(h) } can be large. Hence N_i does not change continuously as h changes, and it may include too many nodes on the boundary. Clearly, not all the nodes on the boundary are equally useful, but there is no mechanism to distinguish which of them are worth including; on the other hand, we want neither to exclude nor to include all of them. To solve this problem, we propose an adjusted d̂_A² obtained by applying a random perturbation to the original definition:

d̃_A²(i, j) = d̂_A²(i, j) + u_ij,   u_ij ~ Uniform(0, 1/n) independently, with u_ij = u_ji.

Since n · d̂_A²(i, j) is always an integer, this randomization does not change the order of any pairs that are not tied. As for d̂_X², randomization is unnecessary if the X_i's are continuous random variables, since ties are then unlikely.
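A sketch of the perturbation; the Uniform(0, 1/n) scale is our reading of the integer-spacing argument in the text (distinct values of the distance are at least 1/n apart, so a jitter below 1/n cannot reorder them):

```python
import numpy as np

def break_ties(d2, rng):
    """Add a symmetric random jitter below 1/n to the squared distances.

    Since n * d2 is integer-valued, distinct values are at least 1/n apart,
    so a Uniform(0, 1/n) jitter breaks ties without reordering non-ties.
    """
    n = d2.shape[0]
    jitter = np.triu(rng.uniform(0.0, 1.0 / n, size=(n, n)), k=1)
    out = d2 + jitter + jitter.T                  # keep the matrix symmetric
    np.fill_diagonal(out, 0.0)
    return out

rng = np.random.default_rng(5)
d2 = np.zeros((4, 4))                             # all off-diagonal entries tied
d2_tilde = break_ties(d2, rng)
```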

Input: A, X, λ.
Step 1: Calculate d̂_A² and d̂_X², and form d̂² = d̂_A² + λ d̂_X².
Step 2: Calculate the tie-corrected dissimilarity d̃².
Step 3: Compute the quantiles q_i(h).
Step 4: Define the neighborhoods N_i = { j ≠ i : d̃²(i, j) ≤ q_i(h) }, where h = C √(log n / n).
Step 5: Output P̂ given by (6).
Algorithm 1 The FANS method

3.4 Cross-validation for selecting

Selecting a tuning parameter in network estimation is generally a challenging problem. Popular data-splitting strategies such as cross-validation have no trivial extension to the setting of network data. Recently, Chen and Lei (2018) proposed a node-pair splitting technique for cross-validation to determine the number of communities in a stochastic block model. An unpublished work of Li et al. (2016) proposed a two-stage network cross-validation by edge splitting for the stochastic block model. A key assumption of both methods is that P is low-rank, which does not fit the general graphon framework.

One advantage of having node features is that, by comparing the features alone, it is possible to predict edge connections. If a new node i comes into an existing network, we can find its nearest neighbor j based on its features. Then we use the estimated graphon slice of j as a prediction of i's connections. Assuming that the features are useful (which means the optimal λ is nonzero), a good model should predict with a small error.

Our cross-validation method is outlined in Algorithm 2. Natural candidates for the loss function are norm-based losses and the negative binomial log-likelihood. However, we found that less robust norms fail easily, and the reason we do not use the log-likelihood is that the estimated probabilities may occasionally be exactly 0 or 1. Simulation results suggest that our method works well and sets λ = 0 in cases where the node features are not helpful (such as Graphon 2 in Section 5).

Input: A, X, a candidate set Λ of λ values.
Output: Optimal λ*.
for each random split do
     Randomly sample a subset V of nodes as the validation set.
     Let T be the training set of remaining nodes.
     Split A and X accordingly into A_T and X_T.
     for each λ in Λ do
          Fit P̂_T using A_T, X_T, and λ.
          for each v in V do
               Find v's nearest training node j in feature space, and predict v's connections by the slice P̂_T[j, ·].
          end for
          Compute the validation loss for this λ.
     end for
end for
Let λ* minimize the average validation loss, and return λ*.
Algorithm 2 Cross-validation for choosing λ
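Algorithm 2 can be condensed as below. This is a simplified single-split sketch with a squared-error loss and a nearest-neighbor match in feature space; the function names and the `fit` interface are illustrative, not the authors' code:

```python
import numpy as np

def choose_lambda(A, X, lambdas, fit, n_val, rng):
    """Pick lambda by holding out n_val nodes and predicting their adjacency
    rows via the feature-nearest training node's estimated graphon slice."""
    n = A.shape[0]
    val = rng.choice(n, size=n_val, replace=False)
    train = np.setdiff1d(np.arange(n), val)
    A_tr, X_tr = A[np.ix_(train, train)], X[train]
    losses = []
    for lam in lambdas:
        P_tr = fit(A_tr, X_tr, lam)       # estimate on the training network
        loss = 0.0
        for v in val:
            # nearest training node in feature space
            j = np.argmin(((X_tr - X[v]) ** 2).sum(axis=1))
            pred = P_tr[j, :]             # predicted connection probabilities
            loss += ((A[v, train] - pred) ** 2).mean()
        losses.append(loss / n_val)
    return lambdas[int(np.argmin(losses))]

# Illustrative use with a trivial constant-fit stand-in for the estimator.
rng = np.random.default_rng(4)
A = rng.integers(0, 2, size=(40, 40)); A = np.triu(A, 1); A = A + A.T
X = rng.normal(size=(40, 2))
lam_star = choose_lambda(A, X, [0.0, 0.5, 1.0],
                         fit=lambda A_, X_, lam: np.full(A_.shape, A_.mean()),
                         n_val=5, rng=rng)
```

With the constant stand-in fit, all candidate values yield the same loss, so the first candidate is returned; a real run would plug in the FANS estimator.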

3.5 Feature screening

When the number of features is large, there are two potentially undesirable consequences: the computational cost of the proposed estimation method grows, and some of the features may be useless and worsen the estimation quality. Therefore, we propose a feature screening procedure for removing useless features before applying the proposed graphon estimation method.

Let D_A be the distance matrix based on the adjacency matrix A, and D_x be the dissimilarity matrix based on a single feature. We use the correlation between D_A and D_x to describe the coherence of that feature with the graphon slices. If a feature is irrelevant, the correlation between D_A and D_x will be small. Since D_A and D_x do not form a linear relation, we use Kendall's rank correlation (or the Spearman correlation). We discuss the practical choice of the threshold in Section 5.
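A numpy-only sketch of the screening rule using the Spearman variant mentioned above (the 0.03 threshold is the value examined in Section 5; the helper names are ours):

```python
import numpy as np

def midrank(v):
    """Ranks with ties replaced by their average rank."""
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for val in np.unique(v):                      # average ranks within tie groups
        mask = v == val
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman(x, y):
    """Spearman rank correlation between two vectors."""
    rx = midrank(x) - midrank(x).mean()
    ry = midrank(y) - midrank(y).mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def keep_feature(D_A, D_x, threshold=0.03):
    """Keep a feature iff the rank correlation between the upper triangles
    of the two dissimilarity matrices exceeds the threshold."""
    iu = np.triu_indices_from(D_A, k=1)
    return abs(spearman(D_A[iu], D_x[iu])) > threshold
```

Only the upper triangles enter the correlation, since both matrices are symmetric with zero diagonals.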

4 Theoretical Results

In this section, we establish the asymptotic convergence of the proposed estimator. For simplicity, we study the estimator without the feature screening step of Section 3.5. To accommodate important models such as the stochastic block model, we do not assume the graphon to be globally smooth. Instead, we focus on the following family of functions.

  • (Bivariate piecewise Lipschitz graphon family) For any , let denote a family of piecewise Lipschitz graphon functions such that (i) there exists and such that ; (ii) for any and , .

Similarly, our theory also allows piecewise Lipschitz form for the feature function in the following sense.

  • (Piecewise Lipschitz feature family) For any , let denote a family of piecewise Lipschitz functions such that (i) there exists and s.t. ; (ii) ,

We further define the sub-Gaussian distribution.

  • (Sub-Gaussian distribution) We say a random variable Z is sub-Gaussian(σ²) if E[Z] = 0 and E[exp(tZ)] ≤ exp(σ²t²/2) for all t ∈ R.

We have the following theorem regarding the rate of convergence of the estimated probability matrix P̂.

  • (Consistency of ) Assume that (i) and with global constants and ; (ii) the length of the smallest common Lipschitz piece satisfies ; (iii) , and sub-Gaussian() for all and ; (iv) for any ; (v) for any global constant . Then for all and , we have

    where is given as follows.

    • If ,

    • If ,

In our theorem, the best convergence rate of P̂ is sub-optimal when compared with the minimax rate. However, as shown by recent work (Zhang et al., 2017), it is the best rate obtained among existing non-combinatorial methods for general graphon estimation without strong assumptions such as the strict monotonicity of degree condition. In our theorem, this rate can be achieved both with λ = 0 and with nonzero λ (of appropriate rate). When λ = 0, the node features play no role in the proposed estimation procedure. This indicates that, in our theorem, we do not obtain any gain in the rate of convergence from the additional use of features. We hypothesize that this is largely due to the flexible modeling between features and hidden labels. A similar conclusion was obtained in another recent work on community detection with node features (Zhang et al., 2016), which suggests that the real benefit of features is revealed in finite-sample performance. As evidenced by our empirical studies in Sections 5 and 6, the use of features (i.e., λ > 0) improves the quality of estimation. It is then of theoretical interest to understand how large a λ can be accommodated by the proposed method without compromising the overall rate of convergence, and our theorem sheds light on this question. The interplay of the relevant quantities and their effects on the rate of convergence can be seen by balancing the error terms: in the high-dimensional setting, λ must shrink at a suitable rate, while in the low-dimensional setting a larger λ can be tolerated.

5 Numerical Experiments

5.1 Effects of node features

Setup: We generate a network from a graphon by the two-step procedure described in (1). For each node, we generate its features by (3). If close X_i's imply close graphon slices, our method can leverage the feature information and effectively improve the estimation; one assumption that guarantees this is that f is smooth and monotonic. When, instead, f and the graphon have totally different structures, including the features in the dissimilarity measure (i.e., choosing a large λ) will worsen the estimation.

The level of noise also influences the effect of the features. In order to compare the effects of different noise levels, we first standardize each feature by its standard deviation before adding noise.

Finally, we define the MSE and MAE to measure the performance of the estimation:

MSE = (1/n²) Σ_{i,j} (P̂_ij − P_ij)²,   MAE = (1/n²) Σ_{i,j} |P̂_ij − P_ij|.
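In code, assuming the standard definitions averaging over all n² entries:

```python
import numpy as np

def mse(P_hat, P):
    """Mean squared error over all entries of the probability matrix."""
    return float(np.mean((P_hat - P) ** 2))

def mae(P_hat, P):
    """Mean absolute error over all entries."""
    return float(np.mean(np.abs(P_hat - P)))
```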
An illustrative example: We first use a simple example to show the effects of different node features. Consider the case of a single feature and the simple monotonic graphon shown in Figure 1. Any smooth and monotonic f will be useful, since it preserves the ordering of the true labels. However, non-monotonic and periodic choices of f can map distant labels to close feature values, so close features may not reflect the similarity of the corresponding graphon slices, and such features may not be as helpful. Nevertheless, it turns out that smooth non-monotonic features can still help locally and improve the estimation when λ is small. More details are shown in Figure 2.

Figure 1: The simple monotonic graphon used in the illustrative example.
Figure 2: The effect of different features. Legends correspond to the different choices of f, from top to bottom.

General graphons: Now we study the effect of node features in more general cases. We consider networks generated from the following four graphons used in the literature; see Figure 3 and Table 1. Graphon 1 is a stochastic block model (SBM). Graphon 2 is periodic and low-rank. Graphons 3 and 4 are more general and both full-rank. Here the rank of a graphon is evaluated numerically on a finite discretization.

Graphon  Rank  Local structure
1 (SBM)  –  No
2  3  No
3  full  No
4  full  Yes
Table 1: Four general graphons used in numerical experiments.
Figure 3: Graphon visualizations when . From top-left to bottom-right (by row): to .

As for the node features, we select two non-monotonic functions and two monotonic ones, one of which involves Φ, the CDF of the standard normal distribution.

Figure 4 compares the curves of MSE against λ under different feature noise levels for Graphons 1 to 4; each curve is an average over 20 repeated experiments.

Figure 4: Effect of λ, with 20 independent experiments averaged for each curve. From top-left to bottom-right (by row): Graphons 1 to 4. The starting point of each curve is λ = 0, which is equivalent to the case without using features. Legends (from top to bottom) correspond to the different noise levels.

Since Graphon 1 is block-structured, using an appropriate amount of information from the features should improve the results, as smooth features make the neighborhoods more concentrated except near the block boundaries. However, when λ is large, nodes on the boundaries get pulled into wrong neighborhoods, so the MSE can deteriorate rapidly. For Graphon 2, although the graphon is smooth, close X_i's do not correspond well to similar graphon slices due to the periodic structure; in this case, the adjacency matrix itself carries such strong information that adding features does not help much, though the MSE is not affected much when λ is close to 0. For Graphons 3 and 4, the effect of node features is significant, since close features correspond to close graphon slices: there is a 20-30% improvement in MSE with a small p-value. In particular, smooth node features can help better capture the local structure; see Appendix A for further details.

As expected, the performance generally worsens as the noise level of the features increases. In practice, neither the pattern of the features nor the noise level is known; therefore, the cross-validation method is proposed for selecting an appropriate λ.

5.2 Threshold for feature screening

Recall that the feature screening procedure developed in Section 3.5 requires the specification of a threshold. Our numerical experiments suggest that 0.03 is a reasonable choice. We tested it on Graphons 1 to 4 with the feature being pure Gaussian noise, and performed 1000 independent trials to calculate the proportion of successful screen-outs. The results are summarized in Table 2. The probability of a false positive (i.e., keeping a useless feature) is well controlled (approximately under 0.05) when the threshold is set to 0.03. In practice, the proposed feature screening mechanism should be combined with field knowledge to perform feature selection.

Graphon 1: 97.0%   Graphon 2: 99.9%   Graphon 3: 96.6%   Graphon 4: 94.7%
Table 2: Feature screening with a pure Gaussian noise feature: proportion of successful screen-outs over 1000 trials.

5.3 Comparison with existing methods

In this subsection we compare the performance of the proposed method FANS with other methods found in the literature. These methods include

  • SBA: Stochastic Blockmodel Approximation of Airoldi et al. (2013),

  • SAS: Sorting-and-Smoothing of Chan and Airoldi (2014),

  • USVT: Universal Singular Value Thresholding of Chatterjee et al. (2015), and

  • NBS: Neighborhood Smoothing of Zhang et al. (2017).

Given a graphon and a size n, we randomly generated 100 independent realizations. For each realization we applied the above five methods to obtain the corresponding estimated graphons. Since the SBA algorithm requires at least two independent graphs, we followed the way the authors conducted simulations in their paper and generated two graphs to make the comparison fair. For the SAS algorithm, we set the bandwidth h as suggested in their paper. For NBS and FANS, we set the neighborhood constant as described in Section 3.1. For FANS specifically, feature screening (Section 3.5) was performed and λ was automatically chosen by cross-validation (Section 3.4) before every fit.

We calculated the MSE and MAE for each estimated graphon. The results are summarized in Table 3, where the averages and the standard errors of the calculated MSEs and MAEs are reported. The number at the end of each row is the p-value obtained when FANS is compared with NBS using a paired one-sided t-test. From this table, one can see that when node features were available, FANS gave the best results (though we note that NBS sometimes gave similarly good results). This is not surprising, as FANS is the only method designed to incorporate node feature information for graphon estimation. When no node features are present, FANS behaves similarly to NBS, which gave extremely favorable results compared with the remaining methods.

| Metric | SBA | SAS | USVT | NBS | FANS | p-value |
|---|---|---|---|---|---|---|
| MSE (SE) | 0.0112 (0.0015) | 0.0269 (0.0049) | 0.1679 (0.0013) | 0.0020 (2.1e-4) | 0.0017 (1.9e-4) | 0.058 |
| MAE (SE) | 0.0682 (0.0058) | 0.1039 (0.0138) | 0.3865 (0.0034) | 0.0309 (0.0018) | 0.0296 (0.0012) | 0.058 |
| MSE (SE) | 0.0295 (0.0023) | 0.0983 (0.0160) | 0.0641 (0.0017) | 0.0040 (2.9e-4) | 0.0042 (0.0018) | 0.1236 |
| MAE (SE) | 0.1220 (0.0053) | 0.2709 (0.0255) | 0.1768 (0.0025) | 0.0479 (0.0017) | 0.0489 (0.0079) | 0.1004 |
| MSE (SE) | 0.0158 (0.0015) | 0.0144 (3.3e-4) | 0.0122 (0.0021) | 0.0067 (3.7e-4) | 0.0039 (1.9e-4) | 0 |
| MAE (SE) | 0.0684 (0.0057) | 0.0855 (0.0016) | 0.0738 (0.0106) | 0.0484 (0.0015) | 0.0327 (0.0011) | 0 |
| MSE (SE) | 0.0172 (0.0023) | 0.0044 (3.0e-4) | 0.1015 (0.0055) | 0.0044 (2.5e-4) | 0.0034 (2.9e-4) | 0 |
| MAE (SE) | 0.0978 (0.0073) | 0.0545 (0.0018) | 0.2920 (0.0111) | 0.0526 (0.0015) | 0.0455 (0.0019) | 0 |
| MSE (SE) | 0.0297 (5.5e-4) | 0.0210 (0.0046) | 0.1791 (7.0e-4) | 8.1e-4 (4.9e-5) | 7.8e-4 (4.4e-4) | 5e-31 |
| MAE (SE) | 0.0245 (0.0034) | 0.0782 (0.0120) | 0.4024 (0.0018) | 0.0201 (4.0e-4) | 0.0198 (3.5e-4) | 4e-28 |
| MSE (SE) | 0.0154 (0.0014) | 0.0907 (0.0142) | 0.0617 (7.4e-4) | 0.0019 (7.7e-5) | 0.0019 (7.6e-5) | 0.1728 |
| MAE (SE) | 0.0899 (0.0044) | 0.2562 (0.0257) | 0.1679 (0.0016) | 0.0321 (6.3e-4) | 0.0321 (6.3e-4) | 0.1238 |
| MSE (SE) | 0.0081 (7.8e-4) | 0.0136 (2.8e-4) | 0.0049 (4.2e-4) | 0.0031 (1.3e-4) | 0.0023 (7.5e-5) | 0 |
| MAE (SE) | 0.0453 (0.0038) | 0.0839 (0.0015) | 0.0408 (0.0020) | 0.0293 (7.5e-4) | 0.0240 (4.7e-4) | 0 |
| MSE (SE) | 0.0099 (0.0015) | 0.0029 (2.1e-4) | 0.1008 (0.0032) | 0.0024 (8.8e-5) | 0.0017 (8.4e-5) | 0 |
| MAE (SE) | 0.0768 (0.0061) | 0.0450 (0.0018) | 0.2895 (0.0067) | 0.0384 (7.9e-4) | 0.0326 (7.8e-4) | 0 |
Table 3: Comparison with existing methods. Reported are the means and the standard errors (in parentheses) of MSEs and MAEs, based on 100 independent trials. The p-values at the end of each row come from paired one-sided t-tests comparing the mean values between NBS and FANS. The lowest values are listed in bold.

6 Application to Real Data

In this section, we apply the proposed FANS method to a real-life dataset and illustrate its usefulness via both visualization and a leave-one-out link prediction problem. The dataset describes a friendship network and was collected by the National Longitudinal Study of Adolescent Health (the AddHealth study), from which it can be downloaded. In this study, students were asked to list the friends they had recently chatted with. The original data form directed graphs, but we consider two students as friends if either named the other. The whole dataset consists of 81 sub-datasets, each containing either one or two schools. We analyzed one of them ("comm10"), which contains 587 students from a single school. The three covariates recorded are gender, race and grade. We treated grade as an ordinal variable. As for gender and race, we converted each to a 0/1 indicator vector whose coordinates equal 1 if and only if the student belongs to the corresponding category.
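The covariate encoding just described can be sketched as follows; the function and category names below are our own illustrations, not identifiers from the paper.

```python
import numpy as np

def encode_covariates(grade, gender, race, gender_levels, race_levels):
    """Encode node covariates as in the text: grade is kept as an
    ordinal (numeric) variable, while gender and race are expanded
    into 0/1 indicator vectors, one coordinate per category."""
    feats = [np.asarray(grade, dtype=float)[:, None]]
    for values, levels in ((gender, gender_levels), (race, race_levels)):
        onehot = np.array([[1.0 if v == lev else 0.0 for lev in levels]
                           for v in values])
        feats.append(onehot)
    return np.hstack(feats)

# Hypothetical three-student example
X = encode_covariates(grade=[7, 9, 12],
                      gender=['F', 'M', 'F'], race=['A', 'B', 'A'],
                      gender_levels=['F', 'M'], race_levels=['A', 'B'])
# X has 1 (grade) + 2 (gender) + 2 (race) = 5 columns
```

Keeping grade ordinal preserves its natural ordering, which an indicator expansion would discard.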

6.1 Visualization

The observed adjacency matrix is displayed in the bottom-right plot of Figure 5. The block structure of the six communities corresponding to the six grades, from 7 (upper-right) to 12 (bottom-left), is apparent. Here we also include a recently proposed community detection algorithm that likewise makes use of node features, JCDC, proposed by Zhang et al. (2016); the number of communities for JCDC was chosen accordingly. We applied the five methods from Section 5.3 to fit the underlying graphon, and JCDC to estimate the communities. The comparison is visualized in Figure 5, where the nodes are sorted by grade. For the SAS algorithm, in addition to nodes sorted by grade, we also present the estimated graphon with nodes sorted by their empirical degrees (the direct output of that method). Neither of the two SAS plots reveals any pattern of interest. The USVT method fails because it discarded all the singular values.
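Sorting the nodes by grade, as in Figure 5, amounts to applying one permutation to both the rows and columns of the adjacency matrix so that same-grade students become contiguous. A small sketch, using a hypothetical two-grade toy network:

```python
import numpy as np

def sort_by_covariate(A, z):
    """Reorder the rows/columns of adjacency matrix A by covariate z
    (e.g. grade), so that community block structure becomes visible
    when the matrix is displayed as an image."""
    order = np.argsort(z, kind='stable')
    return A[np.ix_(order, order)], order

# Toy example: two grade groups with denser within-group connections
rng = np.random.default_rng(0)
z = rng.permutation([7] * 10 + [8] * 10)
P = np.where(z[:, None] == z[None, :], 0.6, 0.1)
A = (rng.uniform(size=(20, 20)) < P).astype(int)
A = np.triu(A, 1); A = A + A.T
A_sorted, order = sort_by_covariate(A, z)
```

After sorting, `A_sorted` shows two dense diagonal blocks, the analogue of the grade blocks in the real data.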

The block structures recovered by NBS, JCDC and FANS are all visible, but the latter two provide a much clearer view. JCDC, however, seems to miss the two small communities near the lower corner. The proposed FANS, in contrast, clearly distinguished all six communities corresponding to the different grades. With the assistance of the node features, FANS was able to capture the two subtle communities near the lower corner that all other methods missed. Even without the node features, FANS still outperformed NBS, since FANS resolves the tie issue discussed in Section 3.

Figure 5: Estimated structures for the AddHealth "comm10" friendship network (first seven plots) and the observed adjacency matrix (bottom-right). From top to bottom (by row): SAS, SAS with nodes sorted by node degree, USVT, NBS, JCDC, and the two FANS variants.

We further sorted the nodes by gender. The estimated graphon from FANS is shown in Figure 6. The bottom-left block is the sub-network among females, and the upper-right block is the community among males; the other two blocks are the cross-gender connections. Within each block there are six sub-blocks corresponding to the six grades, from 7 (upper-right) to 12 (bottom-left). One can see that the behaviors of male and female students are similar, and that students are more likely to know each other in higher grades. This figure also illustrates that graphons which differ only by a measure-preserving transformation correspond to the same network.

Figure 6: Visualization of sub-networks in “comm10”.

6.2 Leave-one-out link prediction

Since the true graphon is unknown for this problem, it is hard to quantitatively evaluate the quality of an estimated graphon. We therefore used leave-one-out link prediction to compare FANS with NBS. In general, graphon estimation methods are not applicable to link prediction because they require a fully observed network. However, FANS can be slightly modified so that the estimate of an edge probability does not depend on the corresponding observed entry of the adjacency matrix: in Step 1 of Algorithm 1, that entry is simply excluded from the computation. The edge can then be predicted completely independently of its observed value.

This is an element-wise procedure that predicts one value at a time, so we masked one symmetric pair of entries at a time and conducted leave-one-out link prediction. We calculated the area under the receiver operating characteristic (ROC) curve to evaluate the predictions. From Figure 7, we can see that FANS with node features performed the best.
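The AUC can be computed without tracing the full ROC curve, via its Mann-Whitney interpretation. A minimal stand-alone sketch (equivalent to `sklearn.metrics.roc_auc_score`), where `scores` would hold the leave-one-out edge-probability estimates and `labels` the corresponding held-out adjacency entries:

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a randomly chosen true link gets a higher
    predicted score than a randomly chosen non-link (ties count 1/2)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

# Perfectly separated scores give AUC 1.0
print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # -> 1.0
```

An AUC of 0.5 corresponds to random guessing, so curves further above the diagonal in Figure 7 indicate better link prediction.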

Figure 7: ROC curves for NBS and the two FANS variants.

7 Discussion

This paper developed a graphon estimation method capable of utilizing information from both the observed adjacency matrix and node features. Under mild regularity conditions, the consistency of the proposed method is established. The rate of convergence is the same as that obtained without node features, but in practice the proposed method improves the estimation in most cases. Lastly, on a real-world dataset the proposed method showed clear benefits: it reflects more meaningful network structure and yields higher link-prediction accuracy.

Appendix A Local Structure in Graphon 4

Figure 8: Local structure in Graphon 4. Left: graphon estimated without node features by the NBS method; right: graphon estimated with smooth node features by the FANS method.

This appendix provides an example demonstrating the potential usefulness of incorporating node features into graphon estimation. The bottom-left corner of Graphon 4 contains a rich local structure that is difficult to capture from the adjacency matrix alone. As shown in Figure 8, the estimate on the left fails to detect any local structure, since the signal carried by the adjacency matrix is weak relative to the Bernoulli noise. The fitted graphon on the right of Figure 8, which uses smooth node features to assist the estimation, is able to capture this local information.

Appendix B Proof of Theorem 3.1

This appendix presents technical arguments leading to the theoretical results in the paper. In order to prove Theorem 3.1, if we write

then it suffices to prove the consistency of this intermediate quantity. In other words, we need to prove that


By the triangle inequality for the Frobenius norm, (10) implies Theorem 3.1.
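In generic notation, with \(P\) the true probability matrix, \(\widehat{P}\) the final estimator, and \(\widetilde{P}\) the intermediate quantity defined above (these symbols are our stand-ins for the paper's notation), the triangle-inequality step reads:

```latex
\|\widehat{P} - P\|_F
  \;\le\; \|\widehat{P} - \widetilde{P}\|_F
  \;+\;   \|\widetilde{P} - P\|_F ,
```

so once both terms on the right-hand side are shown to vanish at the claimed rate, Theorem 3.1 follows.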

Observe that

so we only need to bound the right-hand side. Consider the following decomposition.

Let us first consider the first term of the decomposition. In this case, we have