Graphons are limits of dense graph sequences; see Lovász (2012). A graphon is a symmetric measurable function $f$ from $[0,1]^2$ to $[0,1]$. Probabilistic models of networks based on graphons are a particular type of latent space model. The first such model is due to Bollobás et al. (2007). Bickel and Chen (2009) also considered this model. A graphon is used here as a non-parametric link function: there are i.i.d. latent characteristics $U_i \sim \mathrm{Uniform}(0,1)$ of each node $i$ in a network, and given these characteristics, an edge is formed between a pair of nodes $i$ and $j$, independently of all other edges, with probability $f(U_i, U_j)$. Such a model has inherent identifiability issues. For any measure-preserving bijection $\sigma$ on $[0,1]$, the graphon $f_\sigma(x, y) = f(\sigma(x), \sigma(y))$ gives rise to the same probability model. Thus graphons can only be estimated up to equivalence classes. Alternatively, to remove the identifiability issue, one needs to make further assumptions on the graphon, e.g. monotone degrees as in Chan and Airoldi (2014). The other estimation problem is that of estimating the probability matrix coming from a graphon, which is a well-defined problem. In this paper, by graphon estimation we will mean this latter problem.
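As an illustration of the generative mechanism just described, the following minimal Python sketch draws the latent characteristics and then the edges. The function name `sample_graphon_network` and the example graphon `f_example` are our own illustrative choices and are not taken from the paper.

```python
import numpy as np

def sample_graphon_network(f, n, rng):
    """Draw U_i ~ Uniform(0, 1) i.i.d. and a simple undirected graph with
    P(A_ij = 1 | U) = f(U_i, U_j), independently over pairs i < j."""
    u = rng.uniform(size=n)
    p = f(u[:, None], u[None, :])        # P_ij = f(U_i, U_j)
    np.fill_diagonal(p, 0.0)             # no self-loops in a simple graph
    upper = np.triu(rng.uniform(size=(n, n)) < p, k=1)
    a = (upper | upper.T).astype(int)    # symmetrise the upper triangle
    return u, p, a

# Hypothetical graphon used only for illustration.
f_example = lambda x, y: 0.25 + 0.5 * x * y

u, p, a = sample_graphon_network(f_example, 50, np.random.default_rng(1))
```

Here `p` plays the role of the probability matrix that graphon estimation, in the sense used in this paper, tries to recover from `a`.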
The problem of estimating the underlying graphon from an observed network has attracted much attention in the recent past. Airoldi et al. (2013) considered a stochastic blockmodel approximation to a graphon. Zhang et al. (2017) devised an elegant neighbourhood smoothing estimator for graphons. The universal singular value thresholding (USVT) method of Chatterjee (2015) is capable of estimating graphons. The matrix completion method of Keshavan et al. (2010) can also be used for this purpose. Gao et al. (2015) obtained the minimax rate for graphon estimation and proposed a combinatorial algorithm that achieves it.
Although graphon estimation has been studied in some detail for fully observed networks, its study under missing data set-ups is perhaps more important. This is because when collecting network data one is hardly certain about all the edges. There is enormous scope of application if one is able to predict links when one suspects that a zero in the adjacency matrix is possibly indicating missing data and not the absence of an edge. More generally, such link prediction problems have been considered by many authors (Liben-Nowell and Kleinberg (2007); Al Hasan et al. (2006); Lü and Zhou (2011)).
Recently, graphon estimation under a missing data set-up where one observes full ego-networks of some (but not all) individuals in a network has been carried out in Wu et al. (2018). In general, the problem of link prediction with partially observed data has been tackled before in Zhao et al. (2017); Gaucher and Klopp (2019).
In this paper too, we study graphon estimation under a missing data set-up. The missing data model we study is very different from that of Wu et al. (2018). Instead of ego-networks as in their paper, one observes, in our model, certain overlapping subgraphs. We extend the neighbourhood smoothing estimator of Zhang et al. (2017) to this missing data set-up by devising a method based on the triangle inequality to extend a distance matrix to all of the network, when one actually has some estimate of the distances within the overlapping subgraphs. The case where there are only two overlapping subgraphs is easier to tackle and we use this as a building block for a more general algorithm for the case when there are more than two overlapping subgraphs.
Through an extensive numerical study on simulated and real-world graphs, we show that the extended algorithm, for a wide range of graphons, vastly outperforms standard graphon estimation methods such as vanilla neighbourhood smoothing, universal singular value thresholding (USVT), blockmodel approximation, matrix completion, etc.
The rest of the paper is organized as follows. In Section 2 we describe in detail the missing data model we consider. Then, in Section 3, we briefly recap the algorithm of Zhang et al. (2017) and then extend it to our set-up. Section 4 contains our empirical results. We finally end with some concluding remarks and future directions in Section 5.
2 Problem set-up
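Suppose we observe $K$ subgraphs $G_1, \dots, G_K$ of some simple undirected graph $G$ on $n$ vertices, generated from some graphon $f$. For ease of notation we will take $V(G) = [n] := \{1, \dots, n\}$. Furthermore, assume that the vertex sets of these subgraphs have some intersection. To make it precise, define a super-graph $\mathcal{S}$ on $K$ nodes, where the $k$-th node represents $G_k$. Put an edge between nodes $k$ and $l$ if $V(G_k) \cap V(G_l) \neq \emptyset$. Assume that $\mathcal{S}$ is connected. We also assume that $\bigcup_{k=1}^{K} V(G_k) = V(G)$, i.e. these subgraphs cover the whole graph.
Thus, if $A$ denotes the adjacency matrix of $G$, then there are (unobserved) variables $U_1, \dots, U_n \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(0,1)$, such that
$$A_{ij} \mid U_1, \dots, U_n \sim \mathrm{Bernoulli}\big(f(U_i, U_j)\big), \quad \text{independently for } 1 \le i < j \le n.$$
By $A^{(k)}$ we denote the adjacency matrix of network $G_k$. Then, using a subsetting notation, $A^{(k)} = A[V(G_k), V(G_k)]$. Let
$$\Omega = \bigcup_{k=1}^{K} V(G_k) \times V(G_k)$$
be the set of observed pairs. Now we observe the matrix $\tilde{A}$, where
$$\tilde{A}_{ij} = A_{ij}\, \mathbf{1}\{(i,j) \in \Omega\}.$$
The goal is to estimate the probability matrix $P$, where $P_{ij} = f(U_i, U_j)$, given $\tilde{A}$. See Figure 1 for an example of this set-up.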
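The observation scheme can be made concrete with a short Python sketch. It builds the set of observed pairs from the vertex sets of the subgraphs and checks that the super-graph is connected. The helper names (`observed_mask`, `supergraph_is_connected`), the 0-based vertex labels and the toy vertex sets are illustrative assumptions, not part of the paper.

```python
import numpy as np

def observed_mask(n, vertex_sets):
    """Boolean n x n matrix: True at (i, j) iff i and j lie in a common subgraph."""
    omega = np.zeros((n, n), dtype=bool)
    for v in vertex_sets:
        idx = np.asarray(sorted(v))
        omega[np.ix_(idx, idx)] = True
    return omega

def supergraph_is_connected(vertex_sets):
    """Depth-first search on the super-graph whose nodes are the subgraphs and
    whose edges join subgraphs with a non-empty vertex intersection."""
    sets = [set(v) for v in vertex_sets]
    seen, stack = {0}, [0]
    while stack:
        k = stack.pop()
        for l in range(len(sets)):
            if l not in seen and sets[k] & sets[l]:
                seen.add(l)
                stack.append(l)
    return len(seen) == len(sets)

# Toy example: two overlapping subgraphs covering six vertices.
vertex_sets = [[0, 1, 2, 3], [2, 3, 4, 5]]
omega = observed_mask(6, vertex_sets)
a_full = np.ones((6, 6), dtype=int) - np.eye(6, dtype=int)
a_obs = a_full * omega          # unobserved entries are recorded as 0
```

In this toy example the pair (0, 5) is unobserved, so the corresponding entry of `a_obs` is 0 even though the edge is present in `a_full`.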
Although we are assuming that there is some big network of which some overlapping subgraphs are observed, it is quite straightforward to adapt our approach to the case where one observes graphs coming from the same graphon where the vertex sets of these graphs have some intersection in the sense of the corresponding super-graph being connected.
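We will generalize the neighbourhood smoothing approach of Zhang et al. (2017). Their approach is to construct a certain neighbourhood $\mathcal{N}_i$ for each node $i$. This is done by first calculating a distance measure $\tilde{d}(i, i')$ between each pair of vertices $i, i'$, and then saying that $i' \in \mathcal{N}_i$ if $\tilde{d}(i, i')$ is less than a certain threshold. Once these neighbourhoods have been constructed, $P_{ij}$ can be estimated by
$$\tilde{P}_{ij} = \frac{1}{|\mathcal{N}_i|} \sum_{i' \in \mathcal{N}_i} A_{i'j}.$$
However, $\tilde{P}$ is not symmetric, so Zhang et al. (2017) take $\hat{P} = (\tilde{P} + \tilde{P}^\top)/2$ as the final estimate.
The distance measure that Zhang et al. (2017) use is
$$\tilde{d}^2(i, i') = \max_{k \neq i, i'} \frac{\big|\langle A_{i\cdot} - A_{i'\cdot},\, A_{k\cdot} \rangle\big|}{n}.$$
We refer the reader to Section 2.2 of Zhang et al. (2017) for details on how one obtains this distance measure. One thing to note here is that $\tilde{d}$ is not a metric, although it tries to approximate one.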
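We consider an algorithm that takes as input the adjacency matrix $A$, and outputs a distance matrix $D$ where $D_{ii'} = \tilde{d}(i, i')$. Once we have such a distance matrix $D$, the next algorithm does neighbourhood smoothing.
For each $i$, let $q_i(h)$ denote the $h$-th sample quantile of the set $\{D_{ii'} : i' \neq i\}$, where $h$ is a tuning parameter. Set, for $i \in [n]$, $\mathcal{N}_i = \{i' \neq i : D_{ii'} \le q_i(h)\}$, and estimate $P$ by averaging over these neighbourhoods and symmetrizing.
From their theoretical considerations, Zhang et al. (2017) recommend the choice $h = C\sqrt{\log n / n}$ for a constant $C$. We will also use this recommendation in our extended algorithm.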
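The two steps above (distance computation, then quantile-based neighbourhoods and averaging) can be sketched in Python as follows. This is a minimal, unoptimised sketch of neighbourhood smoothing for a fully observed network, written from the description in this section. The function names and the Erdős–Rényi test graph are our own assumptions, and we take $C = 1$ in the bandwidth.

```python
import numpy as np

def nbs_distance(a):
    """d[i, j]^2 = max over k != i, j of |<A_i. - A_j., A_k.>| / n."""
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    gram = a @ a.T / n                    # gram[i, k] = <A_i., A_k.> / n
    d2 = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            gap = np.abs(gram[i] - gram[j])
            gap[[i, j]] = 0.0             # drop k = i and k = j from the max
            d2[i, j] = d2[j, i] = gap.max()
    return np.sqrt(d2)

def neighbourhood_smooth(a, d, h):
    """Average the rows of `a` over the h-quantile neighbourhood of each node,
    then symmetrise."""
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    p_tilde = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        q = np.quantile(d[i, others], h)          # h-th sample quantile
        nbrs = others[d[i, others] <= q]          # never empty: min <= q
        p_tilde[i] = a[nbrs].mean(axis=0)
    return (p_tilde + p_tilde.T) / 2

# Illustrative run on a hypothetical Erdos-Renyi graph with edge probability 0.3.
rng = np.random.default_rng(0)
n = 40
upper = np.triu(rng.uniform(size=(n, n)) < 0.3, k=1)
a = (upper | upper.T).astype(int)
d = nbs_distance(a)
p_hat = neighbourhood_smooth(a, d, h=np.sqrt(np.log(n) / n))
```

The double loop in `nbs_distance` costs $O(n^3)$ operations, which is acceptable for a sketch but would need vectorising for large networks.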
3.1 Distance Extension
We first discuss the case of $K = 2$ subgraphs, which will then be used to tackle the general case.
3.1.1 The case $K = 2$
Suppose, like Zhang et al. (2017), we have a measure of distance $d$ between the nodes of a network. We can use the triangle inequality and the common intersection between the two networks to estimate distances between nodes that are part of different graphs. To elaborate, if $i \in V(G_1) \setminus V(G_2)$, $j \in V(G_2) \setminus V(G_1)$, then we define
$$d_U(i, j) = \min_{l \in V(G_1) \cap V(G_2)} \big\{ d(i, l) + d(l, j) \big\}.$$
The triangle inequality can be used again to obtain a lower bound. Since $d(i, j) \ge |d(i, l) - d(l, j)|$ for every $l$, we define
$$d_L(i, j) = \max_{l \in V(G_1) \cap V(G_2)} \big| d(i, l) - d(l, j) \big|.$$
If $d$ were a true distance, then we would have
$$d_L(i, j) \le d(i, j) \le d_U(i, j).$$
So we may take our estimate $\hat{d}(i, j)$ to be some average of $d_L(i, j)$ and $d_U(i, j)$.
Experimentally we did not find much difference between different types of averages. Overall, the harmonic mean seemed to perform well.
Also, on $V(G_1) \cap V(G_2)$ we may have two potentially different values of $d(i, j)$ coming from the two different graphs. We choose the arithmetic mean of these two values and assign that to $\hat{d}(i, j)$.
Thus we have a measure of distance between any two vertices in $G$. So we can define a neighbourhood smoothing estimator of $P$ just like Zhang et al. (2017). To that end, we first describe the distance extension algorithm.
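A minimal Python sketch of the two-subgraph distance extension, written from the description above, is given below. The function name `extend_two`, the 0-based vertex labels and the toy example (points on a line with the absolute difference as distance) are illustrative assumptions. The sketch uses the harmonic mean of the triangle-inequality bounds for cross pairs and the arithmetic mean on the overlap.

```python
import numpy as np

def extend_two(d1, v1, d2, v2, n):
    """Combine distance matrices d1 (on vertices v1) and d2 (on vertices v2)
    into one n x n matrix; v1 and v2 must overlap and together cover 0..n-1."""
    v1, v2 = np.asarray(v1), np.asarray(v2)
    common = np.intersect1d(v1, v2)
    if common.size == 0:
        raise ValueError("the two vertex sets must overlap")
    only1, only2 = np.setdiff1d(v1, v2), np.setdiff1d(v2, v1)

    full1 = np.full((n, n), np.nan)
    full1[np.ix_(v1, v1)] = d1
    full2 = np.full((n, n), np.nan)
    full2[np.ix_(v2, v2)] = d2

    d = np.where(np.isnan(full1), full2, full1)   # within-subgraph distances
    cc = np.ix_(common, common)
    d[cc] = (full1[cc] + full2[cc]) / 2           # arithmetic mean on the overlap

    x = full1[np.ix_(only1, common)][:, :, None]  # d(i, l), i only in graph 1
    y = full2[np.ix_(common, only2)][None, :, :]  # d(l, j), j only in graph 2
    upper = (x + y).min(axis=1)                   # triangle-inequality upper bound
    lower = np.abs(x - y).max(axis=1)             # reverse-triangle lower bound
    total = upper + lower
    hm = np.zeros_like(total)
    pos = total > 0
    hm[pos] = 2 * upper[pos] * lower[pos] / total[pos]   # harmonic mean
    d[np.ix_(only1, only2)] = hm
    d[np.ix_(only2, only1)] = hm.T
    return d

# Toy example: six points on a line, distance = absolute difference.
pos_on_line = np.arange(6.0)
true_d = np.abs(pos_on_line[:, None] - pos_on_line[None, :])
v1, v2 = [0, 1, 2, 3], [2, 3, 4, 5]
d_ext = extend_two(true_d[np.ix_(v1, v1)], v1, true_d[np.ix_(v2, v2)], v2, 6)
```

For the pair (0, 5) in this toy example the bounds are 1 and 5, so the harmonic mean is 5/3, whereas the true distance is 5. The extension is an approximation whose quality depends on the size and position of the overlap.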
3.1.2 The general case
In this case, we have $K > 2$ overlapping subgraphs. As described in Section 2, it is more illuminating to consider the super-graph $\mathcal{S}$ on $K$ nodes, where the $k$-th node represents $G_k$ and there is an edge between nodes $k$ and $l$ if $V(G_k) \cap V(G_l) \neq \emptyset$. We assume that $\mathcal{S}$ is connected. Given $i \in V(G_k)$, $j \in V(G_l)$, we will try to estimate $d(i, j)$ using the overlaps. As $\mathcal{S}$ is connected, there is a path of overlapping subgraphs $G_k = G_{k_0}, G_{k_1}, \dots, G_{k_m} = G_l$.
The issue is that, e.g., optimising sums of distances over all possible chains of intermediate vertices from these overlapping graphs is expensive (for $K = 2$, this was fine). So, as a compromise, we take a spanning tree $T$ of $\mathcal{S}$. On this tree, we visit each node along a particular traversal $\tau$, i.e. a finite sequence of adjacent nodes of $T$ which covers all the vertices. Say that the traversal is $\tau = (k_1, \dots, k_m)$. At point $t$ of the traversal, we apply the two-subgraph distance extension of Section 3.1.1 to the graph assembled so far and $G_{k_t}$ to get a distance matrix on all of $V(G_{k_1}) \cup \dots \cup V(G_{k_t})$. At the end we get a distance matrix $D_{T, \tau}$ that depends on both the tree $T$ and the particular traversal $\tau$. Finally, we do this several times over a number of spanning trees and traversals thereof, and take the average of all the resulting $D_{T, \tau}$ as the final estimate of $D$.
We now describe this algorithm in detail.
Which spanning trees should one take? We can take several uniformly random spanning trees. An alternative is to take a maximal spanning tree of $\mathcal{S}$, where the edge weights are the sizes of the overlaps, and consider its traversals only. In Section 4 we perform experiments to show the impact of traversals.
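The maximal-spanning-tree option can be sketched in plain Python as follows. The sketch computes overlap sizes as edge weights, builds a maximal spanning tree with a simple Prim-type loop, and produces one traversal as a depth-first walk with backtracking, so that consecutive entries are adjacent in the tree. All function names and the toy vertex sets are illustrative assumptions. The paper does not say which traversal to prefer, so the depth-first walk is only one possible choice.

```python
def overlap_weights(vertex_sets):
    """Edge weights of the super-graph: sizes of the pairwise vertex overlaps."""
    sets = [set(v) for v in vertex_sets]
    weights = {}
    for k in range(len(sets)):
        for l in range(k + 1, len(sets)):
            size = len(sets[k] & sets[l])
            if size > 0:
                weights[(k, l)] = size
    return weights

def max_spanning_tree(num_nodes, weights):
    """Prim-type construction: repeatedly add the heaviest edge leaving the tree."""
    in_tree, edges = {0}, []
    while len(in_tree) < num_nodes:
        best = None
        for (k, l), w in weights.items():
            if (k in in_tree) != (l in in_tree) and (best is None or w > best[2]):
                best = (k, l, w)
        if best is None:
            raise ValueError("the super-graph is not connected")
        edges.append(best)
        in_tree.update(best[:2])
    return edges

def dfs_walk(num_nodes, edges, root=0):
    """Depth-first walk with backtracking; consecutive entries are tree-adjacent."""
    adj = {k: [] for k in range(num_nodes)}
    for k, l, _ in edges:
        adj[k].append(l)
        adj[l].append(k)
    walk, seen = [], set()

    def visit(k):
        seen.add(k)
        walk.append(k)
        for l in adj[k]:
            if l not in seen:
                visit(l)
                walk.append(k)   # step back to keep the walk connected
    visit(root)
    return walk

# Toy example with four overlapping vertex sets.
vertex_sets = [{0, 1, 2, 3, 4}, {2, 3, 4, 5, 6}, {5, 6, 7, 8}, {4, 8, 9}]
weights = overlap_weights(vertex_sets)
tree = max_spanning_tree(len(vertex_sets), weights)
walk = dfs_walk(len(vertex_sets), tree)
```

Along such a walk, the two-subgraph distance extension would be applied each time a subgraph is visited for the first time.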
3.2 Neighbourhood Smoothing, Extended
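Once we have computed a distance matrix $\hat{D}$ on the full graph, we can do the usual neighbourhood smoothing on the matrix $\tilde{A}$. Now we describe this extended algorithm.
Note that so far our goal has been to estimate the neighbourhoods better than what a vanilla algorithm would do. However, when we estimate $P_{ij}$ by a neighbourhood average of the entries of $\tilde{A}$, we are underestimating the numerator, because we are replacing unobserved edges by $0$. This can be corrected for to some extent by the following prescription: Let $\hat{P}^{(0)}$ be the estimate we get from the extended algorithm, and let $\mathcal{N}_i$ denote the neighbourhood of $i$ constructed in it. We then correct $\hat{P}^{(0)}$ by replacing unobserved edges by their corresponding estimated edge probabilities, obtained from $\hat{P}^{(0)}$:
$$\hat{P}^{(1)}_{ij} = \frac{1}{|\mathcal{N}_i|} \sum_{i' \in \mathcal{N}_i} \Big( \tilde{A}_{i'j}\, \mathbf{1}\{(i', j) \in \Omega\} + \hat{P}^{(0)}_{i'j}\, \mathbf{1}\{(i', j) \notin \Omega\} \Big).$$
Let us denote the above procedure as $\Phi$, i.e. $\hat{P}^{(1)} = \Phi(\hat{P}^{(0)})$. This procedure can be repeated a few times until the estimates get stable. That is, after we get $\hat{P}^{(1)}$, we can correct it further by the same procedure: $\hat{P}^{(2)} = \Phi(\hat{P}^{(1)})$, and so on. The following simple lemma shows that this iterative scheme always converges.
Lemma. The iterates $\hat{P}^{(t)} = \Phi(\hat{P}^{(t-1)})$, $t \ge 1$, increase to a probability matrix $\hat{P}^{(\infty)}$.
Proof. Note first that $\hat{P}^{(1)}_{ij} \ge \hat{P}^{(0)}_{ij}$ for all $i, j$, because, for an unobserved pair $(i', j)$, we have $\hat{P}^{(0)}_{i'j} \ge 0 = \tilde{A}_{i'j}$. Also, $\Phi$ is monotone: if $Q \le Q'$ entrywise, then $\Phi(Q) \le \Phi(Q')$ entrywise.
Hence, $\hat{P}^{(2)} = \Phi(\hat{P}^{(1)}) \ge \Phi(\hat{P}^{(0)}) = \hat{P}^{(1)}$, and so on. That is, $(\hat{P}^{(t)})_{t \ge 0}$ is an increasing sequence. But we have the trivial upper bound $\hat{P}^{(t)}_{ij} \le 1$ for all $t$, since each entry is an average of numbers in $[0, 1]$.
Therefore, being an increasing sequence bounded from above, $\hat{P}^{(t)}$ converges to some $\hat{P}^{(\infty)}$. ∎
In practice, we continue these iterations until the change between successive iterates becomes smaller than a pre-specified threshold. Now we are in a position to describe the full algorithm.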
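The iterative correction can be sketched in Python as follows. The sketch takes the observed matrix, the mask of observed pairs and precomputed neighbourhoods, and returns the whole sequence of iterates so that monotonicity can be inspected. The function name `iterate_correction`, the use of the maximum absolute entrywise change as stopping criterion, the omission of a final symmetrisation step, and the toy inputs are all our own assumptions.

```python
import numpy as np

def iterate_correction(a_obs, omega, nbrs, tol=1e-8, max_iter=200):
    """Repeatedly re-smooth after filling unobserved entries with the current
    estimate; returns the list of iterates [P^(0), P^(1), ...]."""
    a_obs = np.asarray(a_obs, dtype=float)
    n = a_obs.shape[0]

    def smooth(m):
        # Row i of the estimate is the average of the rows of m over nbrs[i].
        return np.vstack([m[nbrs[i]].mean(axis=0) for i in range(n)])

    p = smooth(np.where(omega, a_obs, 0.0))     # unobserved entries set to 0
    history = [p]
    for _ in range(max_iter):
        p_new = smooth(np.where(omega, a_obs, p))
        history.append(p_new)
        if np.abs(p_new - p).max() < tol:       # assumed stopping rule
            break
        p = p_new
    return history

# Toy inputs: complete graph on six vertices seen through two overlapping blocks,
# with hypothetical neighbourhoods given by the two cyclic neighbours of each node.
n = 6
omega = np.zeros((n, n), dtype=bool)
for block in ([0, 1, 2, 3], [2, 3, 4, 5]):
    omega[np.ix_(block, block)] = True
a_obs = (np.ones((n, n)) - np.eye(n)) * omega
nbrs = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
history = iterate_correction(a_obs, omega, nbrs)
```

In this toy example every iterate dominates the previous one entrywise and stays within $[0, 1]$, as the lemma predicts.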
In the $K = 2$ case, we generated networks from six graphons (see Figure 2) and observed two random subgraphs whose size is governed by a parameter that controls the size of the overlap. See Figure 3 for a comparison between various algorithms. In all of these, we see a huge improvement achieved by the extended algorithm, especially when this parameter is small.
In another experiment, we consider a missing data scenario as depicted in Figure 4-(b). Five subgraphs were observed. In Figure 5, we plot the estimation error of the extended algorithm for different traversals (traversals are paths in this example) and also for the case when we take an averaged distance matrix over paths as described in the algorithm. In general, we do not find any significant impact of traversals on the estimation error. We have done this experiment in other missing data scenarios as well and have arrived at the same conclusion. In Table 1, we compare the extended algorithm (with maximal spanning path) against other algorithms.
Because of our missing data model, a significant number of elements close to the diagonal of the adjacency matrix are observed. Therefore, probability matrix estimation suffers the least from our missing data model for graphons that are strongly concentrated near the diagonal line $x = y$. This effect is clearly seen in graphon (c) and also to some extent in graphon (f) (see Figure 3 and Table 1).
4.2 Real Data
When applying the method on real networks, we do not actually have $P$. So, given a real network $G$, we first estimate $P$ based on the full graph, call it $\hat{P}_{\mathrm{full}}$. Then we sample subgraphs $G_1, \dots, G_K$, with some degree of overlap, and based on these subgraphs only (i.e. on the incompletely observed full network) apply an algorithm to get an estimate $\hat{P}_{\mathrm{partial}}$. Then the discrepancy between $\hat{P}_{\mathrm{partial}}$ and $\hat{P}_{\mathrm{full}}$ measures how much the incompleteness (or lack of overlap) influences the algorithm.
frb59-26-4: This dataset contains benchmark graphs for testing several NP-hard graph algorithms including but not limited to the maximum clique, the maximum independent set, the minimum vertex cover and the vertex coloring problems. It has nodes with about million edges.
bn-mouse-retina_1: In this dataset of a brain network edges represent fiber tracts that connect one vertex to another. It has nodes and about thousand edges.
econ-beaflw: This is an economic network that has nodes and about thousand edges.
In Figure 6, we plot the adjacency matrices of these graphs and also the probability matrix estimates obtained via the extended algorithm under various missing data scenarios. From Figure 7, we see that the extended algorithm suffers the least from lack of overlap.
In conclusion, we have considered the estimation of the probability matrix of a network coming from a graphon model under a missing data set-up, where one only observes certain overlapping subgraphs of the network in question. We have extended the neighbourhood smoothing algorithm of Zhang et al. (2017) to this missing data set-up. We have shown experimentally that the proposed extension vastly outperforms standard graphon estimation techniques. We leave the study of theoretical properties, such as the rate of convergence and how it depends on the degree of overlap, the number of subgraphs, etc., to future work.
Thanks to Ananya Mukherjee for spotting an error in an earlier version of the paper.
- Airoldi et al. (2013) Airoldi, E. M., Costa, T. B., and Chan, S. H. (2013). Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems, pages 692–700.
- Al Hasan et al. (2006) Al Hasan, M., Chaoji, V., Salem, S., and Zaki, M. (2006). Link prediction using supervised learning. In SDM06: workshop on link analysis, counter-terrorism and security.
- Bickel and Chen (2009) Bickel, P. J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073.
- Bollobás et al. (2007) Bollobás, B., Janson, S., and Riordan, O. (2007). The phase transition in inhomogeneous random graphs. Random Structures & Algorithms, 31(1):3–122.
- Chan and Airoldi (2014) Chan, S. and Airoldi, E. (2014). A consistent histogram estimator for exchangeable graph models. In International Conference on Machine Learning, pages 208–216.
- Chatterjee (2015) Chatterjee, S. (2015). Matrix estimation by universal singular value thresholding. The Annals of Statistics, 43(1):177–214.
- Gao et al. (2015) Gao, C., Lu, Y., and Zhou, H. H. (2015). Rate-optimal graphon estimation. The Annals of Statistics, 43(6):2624–2652.
- Gaucher and Klopp (2019) Gaucher, S. and Klopp, O. (2019). Maximum likelihood estimation of sparse networks with missing observations. arXiv preprint arXiv:1902.10605.
- Keshavan et al. (2010) Keshavan, R. H., Montanari, A., and Oh, S. (2010). Matrix completion from a few entries. IEEE transactions on information theory, 56(6):2980–2998.
- Liben-Nowell and Kleinberg (2007) Liben-Nowell, D. and Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American society for information science and technology, 58(7):1019–1031.
- Lovász (2012) Lovász, L. (2012). Large networks and graph limits, volume 60. American Mathematical Soc.
- Lü and Zhou (2011) Lü, L. and Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: statistical mechanics and its applications, 390(6):1150–1170.
- Rossi and Ahmed (2015) Rossi, R. A. and Ahmed, N. K. (2015). The network data repository with interactive graph analytics and visualization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.
- Wu et al. (2018) Wu, Y.-J., Levina, E., and Zhu, J. (2018). Link prediction for egocentrically sampled networks. arXiv preprint arXiv:1803.04084.
- Zhang et al. (2017) Zhang, Y., Levina, E., and Zhu, J. (2017). Estimating network edge probabilities by neighbourhood smoothing. Biometrika, 104(4):771–783.
- Zhao et al. (2017) Zhao, Y., Wu, Y.-J., Levina, E., and Zhu, J. (2017). Link prediction for partially observed networks. Journal of Computational and Graphical Statistics, 26(3):725–733.