What is the fraction of male-female connections against that of female-female connections in a given Online Social Network (OSN)? Is the OSN assortative or disassortative? Edge, triangle, and node statistics of OSNs find applications in computational social science, epidemiology, and computer science [4, 11, 27]. Computing these statistics is a key capability in large-scale social network analysis and machine learning applications. But because data collection in the wild is often limited to partial OSN crawls through Application Programming Interface (API) requests, observational studies of OSNs, whether for research purposes or market analysis, depend in great part on our ability to compute network statistics with incomplete data. Case in point, most datasets available to researchers in widely popular public repositories are partial OSN crawls (footnote 1: the majority of the datasets in the public repositories SNAP and KONECT are partial website crawls, not complete datasets or uniform samples). Unfortunately, these incomplete datasets have unknown biases and no statistical guarantees regarding the accuracy of their statistics. To date, the best methods for crawling networks [3, 10, 28] show good real-world performance but only provide statistical guarantees asymptotically (i.e., when the entire OSN is collected).
This work addresses the fundamental problem of obtaining unbiased and reliable node, edge, and triangle statistics of OSNs via partial crawling. To the best of our knowledge, our method is the first to provide a practical solution to the problem of computing OSN statistics with strong theoretical guarantees from a partial network crawl. More specifically, we (a) provide a provable finite-sample unbiased estimate of network statistics (and their spectral-gap derived variance) and (b) provide the asymptotic posterior of our estimates, which performs remarkably well in all tested real-world scenarios.
More precisely, let be an undirected labeled network – not necessarily connected – where is the set of vertices and is the set of edges. Both edges and nodes can have labels. Network is unknown to us except for arbitrary initial seed nodes in . Nodes in must span all the different connected components of . From the seed nodes we crawl the network starting from and obtain a set of crawled edges , where is a parameter that regulates the number of website API requests. With the crawled edges we seek an unbiased estimate of
Note that functions of the form eq. (1) are general enough to compute node statistics
where is the degree of node , and statistics of triangles such as the local clustering coefficient of first provided by 
where the expression inside the sum is zero when and are the neighbors of in . Our task is to find estimates of general functions of the form in eq. (1).
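To make the generality of edge sums concrete, the following minimal Python sketch evaluates an edge statistic, a node statistic, and a triangle count as sums over edges of a fully known toy graph. The graph, the labels, and all function names are hypothetical illustrations, and the way a node function is spread over incident edges (dividing by the degree) is one convention we assume here; isolated nodes are not covered by it.

```python
from collections import defaultdict

# Toy undirected graph as a list of unordered edges (hypothetical data).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3)]
LABEL = {0: "f", 1: "m", 2: "f", 3: "m"}  # hypothetical node labels


def adjacency(edges):
    """Build neighbor sets from an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def edge_sum(edges, f):
    """Sum f(u, v) over every undirected edge, each counted once."""
    return sum(f(u, v) for u, v in edges)


ADJ = adjacency(EDGES)

# Edge statistic: number of edges joining differently labeled nodes.
mixed = edge_sum(EDGES, lambda u, v: LABEL[u] != LABEL[v])


def node_as_edge(h):
    """Spread h(v) evenly over the d_v edges incident to v, so that the
    edge sum recovers the sum of h over all non-isolated nodes."""
    return lambda u, v: h(u) / len(ADJ[u]) + h(v) / len(ADJ[v])


# Node statistic: number of nodes labeled "f", written as an edge sum.
num_f = edge_sum(EDGES, node_as_edge(lambda v: LABEL[v] == "f"))

# Triangle statistic: common neighbors of u and v close a triangle on
# the edge (u, v); every triangle is seen once from each of its 3 edges.
triangles = edge_sum(EDGES, lambda u, v: len(ADJ[u] & ADJ[v])) / 3
```

The point of the sketch is only that one routine, `edge_sum`, serves all three statistics; an estimator for edge sums therefore covers node and triangle statistics as well.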
In our work we provide a partial crawling strategy using random walk tours whose posterior
is shown to have an unbiased maximum a posteriori (MAP) estimate regardless of the number of nodes in the seed set and regardless of the value of , i.e., . Note that we guarantee that our MAP estimate is unbiased in the finite-sample regime, unlike previous asymptotic methods [3, 10, 20, 28, 29]. Moreover, we provide the posterior for the large regime and prove its convergence in distribution, showing its convergence rate. In our experiments we note that the posterior is remarkably accurate on a variety of networks, large and small. We also provide upper and lower bounds for .
The works of  and  are the ones closest to ours.  estimates the size of a network based on the return times of random walk tours.  estimates the number of triangles, network size, and subgraph counts from weighted random walk tours using results of . The previous works on non-asymptotic inference of network statistics from incomplete network crawls [12, 17, 18, 13, 14, 22, 30] need to fit the partially observed data to a probabilistic graph model such as exponential-family random graph models (ERGMs). Our work advances the state of the art in estimating network statistics from partial crawls because: (a) we estimate statistics of arbitrary edge functions without assumptions about the graph model or the underlying graph; (b) we do not need to bias the random walk with weights, which is particularly useful when estimating multiple statistics from the same observations; (c) we derive upper and lower bounds on the variance of the estimator, both of which show the connection with the spectral gap; and, finally, (d) we compute a posterior over our estimates to give practitioners a way to assess the confidence in the estimates without relying on unobtainable quantities like the spectral gap and without assuming a probabilistic graph model.
The remainder of the paper is organized as follows. In Section 2 we introduce our main theorems and supporting lemmas and proofs. In Section 3 we give artificial illustrative examples to aid in understanding our method. In Section 4 we present our results using simulations over real-world networks. Finally, in Section 5 we present our conclusions.
2 Network Estimation from Partial Crawls
In this section we present our main results. The outline of this section is as follows. Section 2.1 introduces key concepts and defines the notation used throughout this manuscript. Section 2.2 introduces our main results in the form of two theorems: Theorem 1 presents an unbiased estimator of any function over edges of an undirected graph using random walk tours. Our random walk tours are shorter than the “regular random walk tours” because the “node” that they start from is an amalgamation of a multitude of nodes in the graph. Here, we also briefly explain the approximate posterior of the estimator in Theorem 1. Section 2.3 proves Theorem 1, introducing important upper and lower bounds of the estimator variance in Section 2.4.1 and showing the effect of the spectral gap. Finally, Section 2.4 derives the approximate Bayesian posterior (3), also using the bounds obtained in Section 2.4.1.
Let be an unknown undirected graph. Our goal is to find an unbiased estimate of in eq. (1) and its posterior by crawling a small fraction of . We are given a set of initial arbitrary nodes denoted . If has disconnected components must span all the different connected components of .
Our network crawler is a classical random walk over the following augmented multigraph . A multigraph is a graph that can have multiple edges between two nodes. In we aggregate all nodes of into a single node, denoted hereafter , the super-node. Thus, . The edges of are , i.e., contains all the edges in including the edges from the nodes in to other nodes, and is merged into the super-node . Note that is necessarily connected as spans all the connected components of .
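The construction of the augmented multigraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the label `"S"` for the super-node, and the toy edge list are hypothetical, and we assume that edges with both endpoints inside the seed set are dropped from the multigraph because they become internal to the super-node.

```python
from collections import Counter, defaultdict


def build_augmented(edges, seeds, super_node="S"):
    """Merge all seed nodes into a single super-node.

    Returns a dict mapping each node of the augmented multigraph to a
    Counter of neighbors, where each count is an edge multiplicity.
    Edges with both endpoints in the seed set are dropped (assumption:
    they are internal to the super-node and accounted for separately).
    """
    seeds = set(seeds)
    adj = defaultdict(Counter)
    for u, v in edges:
        a = super_node if u in seeds else u
        b = super_node if v in seeds else v
        if a == super_node and b == super_node:
            continue  # internal edge of the super-node
        adj[a][b] += 1
        adj[b][a] += 1
    return adj


def degree(adj, u):
    """Degree of u in the multigraph, counting parallel edges."""
    return sum(adj[u].values())


# Hypothetical toy graph with two components: a triangle and a path.
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (4, 5)]
SEEDS = {0, 1, 3}  # at least one seed in every connected component
AUG = build_augmented(EDGES, SEEDS)
```

In this toy example node 2 is adjacent to two seed nodes, so it is joined to the super-node by two parallel edges, which is why a multigraph is needed. Because the seeds span both components, the augmented graph is connected.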
A random walk on
has transition probability from node to an adjacent node , with , where is the degree of and is the number of edges between and . We note that the theory presented in the paper can be extended to more sophisticated random walks as well. Let be the stationary distribution at node in the random walk on .
A random walk tour is defined as the sequence of nodes visited by the random walk during successive -th and -st visits to the super-node . Here denote the successive return times to
. Tours have a key property: by the renewal theorem, tours are independent since the return times act as renewal epochs. Moreover, let be a random walk on in steady state.
Note that the random walk on is equivalent to a random walk on where all the nodes in are treated as one single node.
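A random walk tour on the augmented multigraph can be sampled as in the minimal sketch below. The adjacency structure (neighbor mapped to edge multiplicity), the super-node label `"S"`, and the function names are hypothetical; the only modelling assumption is the one stated above, namely that the next node is chosen with probability proportional to the number of parallel edges leading to it.

```python
import random
from collections import Counter

# Augmented multigraph (hypothetical toy data): neighbor -> multiplicity.
ADJ = {
    "S": Counter({2: 2, 4: 1}),
    2: Counter({"S": 2}),
    4: Counter({"S": 1, 5: 1}),
    5: Counter({4: 1}),
}


def step(adj, u, rng):
    """One random walk step: pick a neighbor of u with probability
    proportional to the number of edges between u and that neighbor."""
    neighbors = list(adj[u])
    weights = [adj[u][v] for v in neighbors]
    return rng.choices(neighbors, weights=weights, k=1)[0]


def sample_tour(adj, rng, super_node="S"):
    """Walk from the super-node until the first return to it.

    The returned list starts and ends with the super-node; everything
    in between are the nodes visited during the tour."""
    tour = [super_node]
    while True:
        tour.append(step(adj, tour[-1], rng))
        if tour[-1] == super_node:
            return tour


rng = random.Random(0)  # fixed seed only to make the example repeatable
tours = [sample_tour(ADJ, rng) for _ in range(5)]
```

Each call to `sample_tour` restarts from the super-node, which mirrors the renewal structure: successive tours do not share any state other than the graph itself.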
The function is redefined on as follows: for , remains the same when and . But when or , is redefined as zero.
The introduction of the super-node is primarily motivated by the following reasons:
Tackling disconnected or low-conductance graphs:
When the graph is not strongly connected or has many connected components, forming a super-node with representatives from each of the components makes the modified graph connected and suitable for applying random walk theory. Even when the graph is connected, it might not be well-knit, i.e., it has low conductance. Since the conductance is closely related to the mixing time of Markov chains, such a graph will prolong the mixing of random walks. But with a proper choice of super-node, we can reduce the mixing time and, as we show, improve the estimation accuracy. This idea is illustrated with a dumbbell graph example in Section 3.
Faster Estimate with Shorter Tours: The expected value of the -th tour length is inversely proportional to the degree of the super-node . Hence, by forming a massive-degree super-node we can significantly shorten the average tour length.
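The inverse relation between expected tour length and super-node degree can be checked numerically. The sketch below relies on a standard fact for simple random walks (Kac's formula): the expected return time to a node equals twice the number of edges divided by the degree of that node. The complete graph on 8 nodes, the seed sets, and all function names are hypothetical choices for illustration, and edges internal to the seed set are assumed to be dropped.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_augmented(edges, seeds, s="S"):
    """Adjacency lists of the augmented multigraph; a neighbor appears
    once per parallel edge. Edges inside the seed set are dropped."""
    adj = defaultdict(list)
    for u, v in edges:
        a = s if u in seeds else u
        b = s if v in seeds else v
        if a == s and b == s:
            continue
        adj[a].append(b)
        adj[b].append(a)
    return adj


def expected_tour_length(adj, s="S"):
    """Kac's formula: expected return time = 2 |E'| / d_S."""
    twice_edges = sum(len(nbrs) for nbrs in adj.values())
    return twice_edges / len(adj[s])


def mean_tour_length(adj, num_tours, rng, s="S"):
    """Average number of steps per tour over simulated tours."""
    steps = 0
    for _ in range(num_tours):
        u = rng.choice(adj[s])
        steps += 1
        while u != s:
            u = rng.choice(adj[u])
            steps += 1
    return steps / num_tours


K8 = list(combinations(range(8), 2))          # complete graph on 8 nodes
small = build_augmented(K8, {0})              # super-node of degree 7
large = build_augmented(K8, {0, 1, 2, 3})     # super-node of degree 16

rng = random.Random(1)
print(expected_tour_length(small), mean_tour_length(small, 20000, rng))
print(expected_tour_length(large), mean_tour_length(large, 20000, rng))
```

On this toy graph the formula gives an expected tour length of 8 steps for the single-node super-node and 2.75 steps for the four-node super-node; the simulated averages should land close to these values.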
2.2 Main Results
In what follows we present our main results. Theorem 1 proposes an unbiased estimator of edge characteristics via random walk tours. Then we present the approximate posterior distribution of the unbiased estimator presented in Theorem 1.
Let be an unknown undirected graph for which an arbitrary initial set of nodes is known that spans all the different connected components of . Consider a random walk on the augmented multigraph described in Section 2.1 starting at super-node . Let be the -th random walk tour until the walk first returns to and let denote the collection of all nodes in such tours, . Then,
is an unbiased estimate of , i.e., . Moreover the estimator is strongly consistent, i.e., a.s. for .
Theorem 1 provides an unbiased estimate of network statistics from random walk tours. The length of tour is short if it starts at a massive super-node as the expected tour length is inversely proportional to the degree of the super-node, . This provides a practical way to compute unbiased estimates of node, edge, and triangle statistics using (eq. (2)) while observing only a small fraction of the original graph. Because random walk tours can have arbitrary lengths, we show in Lemma 2, Section 2.4, that there are upper and lower bounds on the variance of . For a bounded function , the upper bounds are shown to be always finite.
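The following Python sketch shows one way a tour-based estimator of an edge sum can be implemented. It is an illustration under assumptions rather than a transcription of eq. (2): we assume the function is symmetric, that it is set to zero on edges touching the super-node, that the contribution of edges touching seed nodes is computed exactly and added separately, and that the tour sums are scaled by d_S / (2m). The scaling follows from the renewal-reward argument, since a tour crosses any given undirected edge 2 / d_S times in expectation. All names and the toy graph are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_augmented(edges, seeds, s="S"):
    """Adjacency lists of the augmented multigraph; a neighbor appears
    once per parallel edge. Edges inside the seed set are dropped."""
    adj = defaultdict(list)
    for u, v in edges:
        a = s if u in seeds else u
        b = s if v in seeds else v
        if a == s and b == s:
            continue
        adj[a].append(b)
        adj[b].append(a)
    return adj


def tour_estimate(adj, f, num_tours, rng, s="S"):
    """Estimate the sum of f over edges not touching the super-node.

    Each tour accumulates f along its steps; steps leaving or entering
    the super-node contribute zero. The total is scaled by
    d_S / (2 * num_tours) (assumed form, see the lead-in)."""
    d_s = len(adj[s])
    total = 0.0
    for _ in range(num_tours):
        cur = rng.choice(adj[s])          # first step: f is zero here
        while cur != s:
            nxt = rng.choice(adj[cur])
            if nxt != s:
                total += f(cur, nxt)      # step between two ordinary nodes
            cur = nxt
    return d_s * total / (2 * num_tours)


def count_edge(u, v):
    """Edge function for counting edges (symmetric by construction)."""
    return 1.0


EDGES = list(combinations(range(5), 2))   # complete graph on 5 nodes
SEEDS = {0}
AUG = build_augmented(EDGES, SEEDS)

# Part known exactly: edges with at least one endpoint in the seed set.
KNOWN = sum(count_edge(u, v) for u, v in EDGES if u in SEEDS or v in SEEDS)

rng = random.Random(7)
estimate = tour_estimate(AUG, count_edge, 20000, rng) + KNOWN
print(estimate)  # should be close to the true number of edges, 10
```

Because no weights bias the walk, the same simulated tours could be reused with a different edge function by accumulating several totals in the inner loop.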
In what follows we show the approximate posterior of the estimator in Theorem 1. In Section 4 we shall see that the approximate posterior matches very well the empirical posterior using simulations over real-world networks while crawling of the nodes in the network.
Let be the true value outside the subgraph formed by the nodes that were merged into the super-node.
Bayesian approximation of the posterior of
In the scenario of Theorem 1 for tours and assuming priors , ( is the variance of ), then the marginal posterior density of as converges in distribution to a non-standardized -distribution
with degrees of freedom parameter
and scale parameter
Note that the approximation in (3) is a Bayesian approach and Theorem 1 is its frequentist counterpart. In fact, the motivation for the Bayesian approach comes from the frequentist estimator ( samples). From the approximate posterior, the Bayesian MAP estimator for sufficiently large values of is
Thus when , the Bayesian estimator is essentially the first term of the frequentist estimator (the second term is calculated a priori), and hence the two estimators coincide. In this paper we use the posterior distribution from the Bayesian approach to obtain degrees of belief, along with the estimator common to both approaches.
The above remark shows that the approximate posterior in (3) provides a way to assess the confidence in the estimate
. The Normal prior for the average gives the largest variance for a given mean. The inverse-gamma is a non-informative conjugate prior if the variance of the estimator is not too small, which is generally the case in our application. Other choices of prior, such as uniform, are also possible, yielding different posteriors without closed-form solutions. The posterior is conservative as is calculated from tours while the posterior considers only tours. Being conservative, however, is advised, as the posterior is for large values of and the conservative estimate better protects us from finite-sample anomalies and performs very well in practice, as we see in Section 4.
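For readers who want to experiment with such a posterior, the sketch below implements the textbook conjugate update for a Normal likelihood with a Normal prior on the mean and an inverse-gamma prior on the variance, whose marginal posterior for the mean is a non-standardized t-distribution. It is not claimed to reproduce the exact parametrization of (3): the hyperparameter names `mu0`, `kappa0`, `alpha0`, `beta0` and the helper `batch_means` are our own hypothetical choices, and averaging per-tour estimates in batches is one assumed way to obtain approximately normal inputs.

```python
import math


def batch_means(per_tour, batch):
    """Average consecutive groups of `batch` per-tour estimates.
    A trailing incomplete group is discarded."""
    usable = len(per_tour) - len(per_tour) % batch
    return [
        sum(per_tour[i:i + batch]) / batch
        for i in range(0, usable, batch)
    ]


def t_posterior(samples, mu0, kappa0, alpha0, beta0):
    """Normal-inverse-gamma update (standard conjugate result).

    Prior: mean | var ~ Normal(mu0, var / kappa0),
           var ~ InverseGamma(alpha0, beta0).
    Returns (degrees of freedom, location, scale) of the Student-t
    marginal posterior of the mean."""
    n = len(samples)
    mean = sum(samples) / n
    ss = sum((x - mean) ** 2 for x in samples)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = (
        beta0
        + 0.5 * ss
        + kappa0 * n * (mean - mu0) ** 2 / (2 * kappa_n)
    )
    dof = 2 * alpha_n
    scale = math.sqrt(beta_n / (alpha_n * kappa_n))
    return dof, mu_n, scale


# Hypothetical per-tour estimates, grouped in batches of two.
per_tour = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
dof, loc, scale = t_posterior(batch_means(per_tour, 2), 0.0, 1.0, 1.0, 1.0)
```

With a weak prior (small `kappa0`), the location of the resulting t-distribution approaches the plain average of the batch means, which is the sense in which the Bayesian and frequentist point estimates agree.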
In what follows we provide the proofs of our main results.
2.3 Proof of Theorem 1
The outline of this proof is as follows. In Lemma 1 we show that the estimate of from each tour is unbiased.
Let be the nodes traversed by the -th random walk tour on , starting at super-node . Then the following holds, ,
The random walk starts from the super-node , thus
Consider a renewal reward process with inter-renewal time distributed as and reward as the number of times the Markov chain crosses . From the renewal reward theorem,
Here the left-hand side is essentially . Now (5) becomes
which concludes our proof. ∎
2.4 Derivation of the approximate posterior
The derivation of (3) relies first on showing that
has finite first and second moments. We go further and in Lemma 2 we introduce upper and lower bounds on the variance of . By Theorem 1 the first moment of is finite as . To show that the second moment is finite we prove that the estimate , , whose variance is , has finite second moment. The results in Lemma 2 are of interest on their own because they establish a connection between the estimator variance and the spectral gap.
2.4.1 Impact of spectral gap on variance
Let , where is the random walk transition probability matrix as defined in Section 2.1 and is a diagonal matrix with the node degrees of
. The eigenvalues of and are the same and . Let
th eigenvector of be . Let be the spectral gap, . Let the left and right eigenvectors of be and respectively. . Define , with , and matrix with th element as . Also let
be the vector with .
The following holds
Assuming the function is bounded, , and for tour ,
(i). The variance of the estimator at tour starting from node is
It is known from [1, Chapter 2 and 3] that
The latter can be upper-bounded by ).
For the second part, we have
for a constant using inequality. From , it is known that there exists an , such that , and this implies that for all . This proves the theorem.
(ii). We denote for and
indicates a Gaussian distribution with mean and variance
. With the trivial extension of the central limit theorem of Markov chains of node functions to edge functions, we have for the ergodic estimator ,
We derive in Lemma 3. Note that is also the asymptotic variance of the ergodic estimator of edge functions.
Consider a renewal reward process at its -th renewal, , with inter-renewal time and reward . Let be the average cumulative reward gained up to -th renewal, i.e., . From the central limit theorem for the renewal reward process [31, Theorem 2.2.5] after total number of steps, with , yields
In fact it can be shown that (see [24, Proof of Theorem 17.2.2])
We are now ready to derive the approximation (3).
Let . Given
and and because the tours are i.i.d. the marginal posterior density of is
For now assume that
then [15, Proposition C.4]
are the posteriors of parameters and , respectively. The non-standardized
-distribution can be seen as a mixture of normal distributions with equal mean and random, inverse-gamma distributed variance [15, Proposition C.6]. Thus, if are i.i.d. normally distributed then the posterior of follows a non-standardized -distribution with parameters
It remains to show that converge in distribution to i.i.d. normal random variables as . As the spectral gap of is greater than zero, , Lemma 2 shows that for then
From the renewal theorem we know that are i.i.d. random variables and thus any subset of these variables is also i.i.d. By construction are also i.i.d. with mean and finite variance . Applying the Lindeberg-Lévy central limit theorem [7, Section 17.4] yields
where . Thus, in the limit as and (recall that ), the variables are i.i.d. normally distributed with
is constant and known, which concludes our proof.
3 Illustrative example
Here we consider the classical example of a low-conductance graph: the dumbbell graph. We illustrate how the super-node solves the variance problem for random walk tours on graphs. A dumbbell graph consists of two complete graphs on vertices connected by a single edge. The spectral gap , where is the second largest absolute eigenvalue of , is roughly .
It is known that the variance of the return time of tour is related to as 
Drawing nodes from both components to create the super-node, the new graph with the super-node will be more connected and hence improves, and so does the variance.
Another way to view the impact of forming the super-node is through the cover time of a random walk on the dumbbell graph. Without the super-node the cover time is . If random walks run in parallel, with some conditions on how they are distributed, the cover time can be reduced to . In this view the super-node tours act as multiple parallel random walks that quickly cover more of the graph with less effort.
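The effect described in this section can be reproduced with a small simulation. The sketch below builds a dumbbell graph from two cliques of 10 nodes, forms a super-node either from two nodes of the same clique or from one node in each clique, and compares the empirical variance of the tour lengths. The clique size, seed choices, number of tours, and function names are all hypothetical; edges internal to the seed set are assumed to be dropped.

```python
import random
from collections import defaultdict
from itertools import combinations


def dumbbell(k):
    """Two complete graphs on k nodes joined by the edge (k-1, k)."""
    left = list(combinations(range(k), 2))
    right = list(combinations(range(k, 2 * k), 2))
    return left + right + [(k - 1, k)]


def build_augmented(edges, seeds, s="S"):
    """Adjacency lists of the augmented multigraph; a neighbor appears
    once per parallel edge. Edges inside the seed set are dropped."""
    adj = defaultdict(list)
    for u, v in edges:
        a = s if u in seeds else u
        b = s if v in seeds else v
        if a == s and b == s:
            continue
        adj[a].append(b)
        adj[b].append(a)
    return adj


def tour_lengths(adj, num_tours, rng, s="S"):
    """Number of steps of each simulated tour from the super-node."""
    lengths = []
    for _ in range(num_tours):
        u, steps = rng.choice(adj[s]), 1
        while u != s:
            u, steps = rng.choice(adj[u]), steps + 1
        lengths.append(steps)
    return lengths


def variance(xs):
    """Unbiased sample variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


EDGES = dumbbell(10)
one_side = build_augmented(EDGES, {0, 1})      # seeds in one clique only
both_sides = build_augmented(EDGES, {0, 11})   # one seed per clique

rng = random.Random(3)
var_one = variance(tour_lengths(one_side, 4000, rng))
var_both = variance(tour_lengths(both_sides, 4000, rng))
print(var_one, var_both)
```

When all seeds sit in one clique, a tour that crosses the bridge tends to stay in the other clique for a long time before returning, which inflates the variance of the tour length; placing a seed in each clique removes these long excursions.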
4 Experiments on Real-world Networks
In this section we demonstrate the effectiveness of the theory developed above with experiments on real data sets of various social networks (footnote 2: the developed software is available at http://www-sop.inria.fr/members/Jithin.Sreedharan/HypRW.zip). We assume the contribution from super-node to the true value is known a priori and hence we look for in the experiments. In the case that the edges of the super-node are unknown, the estimation problem is easier and can be taken care of separately. One option is to start multiple random walks in the graph and form connected subgraphs. Later, in order to estimate the bias created by these subgraphs, do some random walk tours from the largest degree node in each of these subgraphs and use the idea in Theorem 1.
In the figures we display both approximate posterior and empirical posterior generated from . For the approximate posterior, we have used the following parameters . The green line in the plots shows the actual value .
In the numerical experiments, the super-node is formed in one of the following ways: a) uniformly sample nodes from the network without replacement; b) run a random walk crawl starting from any node and cover around of the graph, and form the super-node with the largest degree nodes. In both cases, if the network is disconnected, the super-node should be initially created with at least one node from each of the connected components.
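The two ways of forming the super-node can be sketched as follows. This is a minimal illustration on a connected toy graph: the function names, the wheel graph, the coverage fraction, and the tie-breaking rule among equal degrees are hypothetical, and the handling of disconnected networks (one seed per component) is left out.

```python
import math
import random


def uniform_seeds(nodes, n, rng):
    """Option a): sample n nodes uniformly without replacement."""
    return rng.sample(sorted(nodes), n)


def top_degree(nodes, adj, n):
    """The n largest-degree nodes among `nodes` (ties broken by id)."""
    return sorted(nodes, key=lambda v: (-len(adj[v]), v))[:n]


def crawl_seeds(adj, start, frac, n, rng):
    """Option b): random walk from `start` until a fraction `frac` of
    the nodes has been visited, then keep the n largest-degree
    visited nodes. Assumes a connected graph."""
    target = math.ceil(frac * len(adj))
    visited = {start}
    u = start
    while len(visited) < target:
        u = rng.choice(adj[u])
        visited.add(u)
    return top_degree(visited, adj, n)


# Hypothetical toy graph: a wheel with hub 0 and rim nodes 1..9.
WHEEL = {0: list(range(1, 10))}
for i in range(1, 10):
    WHEEL[i] = [0, i % 9 + 1, (i - 2) % 9 + 1]

rng = random.Random(5)
seeds_a = uniform_seeds(WHEEL, 3, rng)
seeds_b = crawl_seeds(WHEEL, 1, 0.5, 2, rng)
```

Option b) tends to favor high-degree nodes, which yields a super-node of larger degree for the same number of seeds and therefore shorter tours.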
First we study a network of moderate size, a connected subgraph of the Friendster network with nodes and edges (data publicly available at the SNAP repository ). Friendster is an online social networking website where nodes are individuals and edges indicate friendship. Here, we consider two types of functions:
These functions reflect the assortative nature of the network. The super-node is formed from uniformly sampled nodes just as a way to test our method. Figures 1 and 2 display the results for functions and , respectively. A good match between the approximate and empirical posteriors can be observed from the figures. Moreover, the true value also fits well within the plotted posteriors. The percentage of graph crawled is in terms of edges and this drops to if we use random walk based super-node formation.
4.2 Dogster network
The aim of this example is to check whether there is any affinity for making connections between the owners of same-breed dogs. The network data is based on the social networking website Dogster. Each user (node) indicates the dog breed; the friendships between dogs form the edges. The number of nodes is and the number of edges is .
In Figure 3, two cases are plotted. Function