1 Introduction
What is the fraction of male-female connections against that of female-female connections in a given Online Social Network (OSN)? Is the OSN assortative or disassortative? Edge, triangle, and node statistics of OSNs find applications in computational social science (see e.g. [25]), epidemiology [26], and computer science [4, 11, 27]. Computing these statistics is a key capability in large-scale social network analysis and machine learning applications. But because data collection in the wild is often limited to partial OSN crawls through Application Programming Interface (API) requests, observational studies of OSNs – for research purposes or market analysis – depend in great part on our ability to compute network statistics with incomplete data. Case in point, most datasets available to researchers in widely popular public repositories are partial OSN crawls (footnote 1: the majority of the datasets in the public repositories SNAP [21] and KONECT [19] are partial website crawls, not complete datasets or uniform samples). Unfortunately, these incomplete datasets have unknown biases and no statistical guarantees regarding the accuracy of their statistics. To date, the best methods for crawling networks ([3, 10, 28]) show good real-world performance but only provide statistical guarantees asymptotically (i.e., when the entire OSN network is collected).
This work addresses the fundamental problem of obtaining unbiased and reliable node, edge, and triangle statistics of OSNs via partial crawling. To the best of our knowledge our method is the first to provide a practical solution to the problem of computing OSN statistics with strong theoretical guarantees from a partial network crawl. More specifically, we (a) provide a provable finite-sample unbiased estimate of network statistics (and their spectral-gap-derived variance) and (b) provide the asymptotic posterior of our estimates, which performs remarkably well in all tested real-world scenarios.
More precisely, let $G = (V, E)$ be an undirected labeled network – not necessarily connected – where $V$ is the set of vertices and $E$ is the set of edges. Both edges and nodes can have labels. Network $G$ is unknown to us except for $n > 0$ arbitrary initial seed nodes in $I_n \subseteq V$. Nodes in $I_n$ must span all the different connected components of $G$. From the seed nodes we crawl the network starting from $I_n$ and obtain a set of crawled edges $D_m(I_n)$, where $m > 0$ is a parameter that regulates the number of website API requests. With the crawled edges $D_m(I_n)$ we seek an unbiased estimate of
$$\mu(G) = \sum_{(u,v) \in E} g(u,v), \qquad (1)$$
where $g(u,v)$ is any function of the edge $(u,v)$.
Note that functions of the form eq. (1) are general enough to compute node statistics
where is the degree of node , and statistics of triangles such as the local clustering coefficient of first provided by [28]
where the expression inside the sum is zero when and are the neighbors of in . Our task is to find estimates of general functions of the form in eq. (1).
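As a hedged illustration of how edge and node statistics both fit the edge-sum form of eq. (1), the sketch below evaluates two such sums on a toy graph. The graph, the node labels, and the helper names (`edge_sum`, `same_label`, `node_share`) are assumptions made for this example only. Each undirected edge is counted once, and a node statistic is rewritten as an edge statistic by splitting the value of each node evenly over its incident edges.

```python
from collections import defaultdict

# Toy undirected graph; every edge is listed once (assumed convention).
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
LABEL = {"a": "F", "b": "F", "c": "M", "d": "M"}  # hypothetical node labels

DEGREE = defaultdict(int)
for u, v in EDGES:
    DEGREE[u] += 1
    DEGREE[v] += 1


def edge_sum(edges, g):
    """Sum an edge function g(u, v) over all undirected edges."""
    return sum(g(u, v) for u, v in edges)


def same_label(u, v):
    """Edge statistic: 1 when both endpoints carry the same label."""
    return 1.0 if LABEL[u] == LABEL[v] else 0.0


def node_share(h):
    """Turn a node function h into an edge function by splitting h(v)
    evenly over the d_v edges incident to v."""
    return lambda u, v: h(u) / DEGREE[u] + h(v) / DEGREE[v]


def is_female(v):
    return 1.0 if LABEL[v] == "F" else 0.0


same = edge_sum(EDGES, same_label)                # edges a-b and c-d
females = edge_sum(EDGES, node_share(is_female))  # recovers the node count
print(same, females)
```

Summing `node_share(h)` over the edges gives back the sum of `h` over the nodes, because every node appears in exactly as many edges as its degree.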
Contributions
In our work we provide a partial crawling strategy using random walk tours whose posterior
is shown to have an unbiased maximum a posteriori estimate (MAP)
regardless of the number of nodes in the seed set and regardless of the value of $m$, i.e., its expectation equals the true value in eq. (1). Note that we guarantee that our MAP estimate is unbiased in the finite-sample regime, unlike previous asymptotic methods [3, 10, 20, 28, 29]. Moreover, we provide the posterior for the large-$m$ regime and prove its convergence in distribution, showing its convergence rate. In our experiments we note that the posterior is remarkably accurate using a variety of networks large and small. We also provide upper and lower bounds for the variance of the estimator.
Related Work
The works of [23] and [6] are the ones closest to ours. [23] estimates the size of a network based on the return times of random walk tours. [6] estimates the number of triangles, network size, and subgraph counts from weighted random walk tours using results of [1]. The previous works on non-asymptotic inference of network statistics from incomplete network crawls [12, 17, 18, 13, 14, 22, 30] need to fit the partially observed data to a probabilistic graph model such as ERGMs (exponential family of random graph models). Our work advances the state of the art in estimating network statistics from partial crawls because: (a) we estimate statistics of arbitrary edge functions without assumptions about the graph model or the underlying graph; (b) we do not need to bias the random walk with weights; this is particularly useful when estimating multiple statistics reusing the same observations; (c) we derive upper and lower bounds on the variance of the estimator, which both show the connection with the spectral gap; and, finally, (d) we compute a posterior over our estimates to give practitioners a way to assess the confidence in the estimates without relying on unobtainable quantities like the spectral gap and without assuming a probabilistic graph model.
The remainder of the paper is organized as follows. In Section 2 we introduce our main theorems and supporting lemmas and proofs. In Section 3 we introduce artificial illustrative examples to aid the understanding of our method. In Section 4 we present our results using simulations over real-world networks. Finally, in Section 5 we present our conclusions.
2 Network Estimation from Partial Crawls
In this section we present our main results. The outline of this section is as follows. Section 2.1 introduces key concepts and defines the notation used throughout this manuscript. Section 2.2 introduces our main results in the form of two theorems: Theorem 1 presents an unbiased estimator of any function over edges of an undirected graph using random walk tours. Our random walk tours are shorter than the “regular random walk tours” because the “node” that they start from is an amalgamation of a multitude of nodes in the graph. We then briefly explain the approximate posterior of the estimator in Theorem 1. Section 2.3 proves Theorem 1. Section 2.4.1 introduces important upper and lower bounds on the estimator variance and shows the effect of the spectral gap. Finally, Section 2.4 derives the approximate Bayesian posterior (3), also using the bounds obtained in Section 2.4.1.
2.1 Preliminaries
Let $G = (V, E)$ be an unknown undirected graph. Our goal is to find an unbiased estimate of $\mu(G)$ in eq. (1) and its posterior by crawling a small fraction of $G$. We are given a set of $n > 0$ initial arbitrary nodes denoted $I_n \subseteq V$. If $G$ has disconnected components, $I_n$ must span all the different connected components of $G$.
Our network crawler is a classical random walk over the following augmented multigraph $G' = (V', E')$. A multigraph is a graph that can have multiple edges between two nodes. In $G'$ we aggregate all nodes of $I_n$ into a single node, denoted hereafter $S_n$, the supernode. Thus, $V' = (V \setminus I_n) \cup \{S_n\}$. The edges of $G'$ are $E' = \{(u,v) \in E : u, v \notin I_n\} \cup \{(S_n, v) : (u,v) \in E,\ u \in I_n,\ v \in V \setminus I_n\}$, i.e., $E'$ contains all the edges in $E$ including the edges from the nodes in $I_n$ to other nodes, and $I_n$ is merged into the supernode $S_n$. Note that $G'$ is necessarily connected as $I_n$ spans all the connected components of $G$.
A random walk on $G'$ has transition probability from node $u$ to an adjacent node $v$, $p_{uv} = \alpha_{u,v} / d_u$, where $d_u$ is the degree of $u$ and $\alpha_{u,v}$ is the number of edges between $u$ and $v$. We note that the theory presented in the paper can be extended to more sophisticated random walks as well. Let $\pi_i$ be the stationary distribution at node $i$ in the random walk on $G'$.
A random walk tour is defined as the sequence of nodes $X_1^{(k)}, \ldots, X_{\xi_k}^{(k)}$ visited by the random walk during successive $k$-th and $(k+1)$-st visits to the supernode $S_n$. Here $\{\xi_k\}_{k \ge 1}$ denote the successive return times to $S_n$. Tours have a key property: from the renewal theorem, tours are independent since the returning times act as renewal epochs. Moreover, let $Y_1, Y_2, \ldots$ be a random walk on $G'$ in steady state.
Note that the random walk on $G'$ is equivalent to a random walk on $G$ where all the nodes in $I_n$ are treated as one single node.
The function $g$ is redefined on $G'$ as follows: for $(u,v) \in E'$, $g(u,v)$ remains the same when $u \neq S_n$ and $v \neq S_n$. But when $u = S_n$ or $v = S_n$, $g(u,v)$ is redefined as zero.
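The construction above can be sketched in a few lines. This is a minimal illustration, not the authors' software: the function names (`build_augmented`, `transition_probability`, `tour`), the supernode label `"S"`, and the toy edge list are assumptions, and edges with both endpoints in the seed set are dropped rather than kept as self-loops. The multigraph is stored as an adjacency list with repetition, so a uniformly random choice among the stored neighbours reproduces the transition probability (number of parallel edges divided by degree).

```python
import random
from collections import defaultdict


def build_augmented(edges, seeds, supernode="S"):
    """Merge the seed nodes into one supernode.

    Returns an adjacency list with repetition, so parallel edges of the
    multigraph appear as repeated neighbours. Edges with both endpoints
    in the seed set are dropped (an assumption of this sketch).
    """
    seeds = set(seeds)
    neighbours = defaultdict(list)
    for u, v in edges:
        a = supernode if u in seeds else u
        b = supernode if v in seeds else v
        if a == supernode and b == supernode:
            continue
        neighbours[a].append(b)
        neighbours[b].append(a)
    return dict(neighbours)


def transition_probability(neighbours, u, v):
    """p_uv = (number of edges between u and v) / (degree of u)."""
    return neighbours[u].count(v) / len(neighbours[u])


def tour(neighbours, supernode, rng):
    """Nodes visited from the supernode until the walk first returns."""
    path = [supernode]
    while True:
        path.append(rng.choice(neighbours[path[-1]]))
        if path[-1] == supernode:
            return path


# Two connected components, with one seed taken from each.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (5, 6), (6, 7), (7, 5)]
graph = build_augmented(edges, seeds=[1, 5])
walk = tour(graph, "S", random.Random(0))
print(len(graph["S"]), walk)
```

Because one seed comes from each component, every node of the toy graph becomes reachable from the supernode, which is what makes the tours well defined.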
Supernode Motivation
The introduction of the supernode is primarily motivated by the following reasons:

Tackling disconnected or low-conductance graphs: When the graph is not strongly connected or has many connected components, forming a supernode with representatives from each of the components makes the modified graph connected and suitable for applying random walk theory. Even when the graph is connected, it might not be well-knit, i.e., it has low conductance. Since the conductance is closely related to the mixing time of Markov chains, such a graph will prolong the mixing of random walks. But with a proper choice of supernode, we can reduce the mixing time and, as we show, improve the estimation accuracy. This idea is illustrated with a dumbbell graph example in Section 3.
Faster Estimate with Shorter Tours: The expected value of the $k$-th tour length is inversely proportional to the degree of the supernode. Hence, by forming a massive-degree supernode we can significantly shorten the average tour length.
2.2 Main Results
In what follows we present our main results. Theorem 1 proposes an unbiased estimator of edge characteristics via random walk tours. Then we present the approximate posterior distribution of the unbiased estimator presented in Theorem 1.
Theorem 1.
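Let $G = (V, E)$ be an unknown undirected graph where an initial arbitrary set of $n > 0$ nodes $I_n \subseteq V$ is known, which spans all the different connected components of $G$. Consider a random walk on the augmented multigraph $G'$ described in Section 2.1 starting at supernode $S_n$. Let $(X_t^{(k)})_{t=1}^{\xi_k}$ be the $k$-th random walk tour until the walk first returns to $S_n$ and let $D_m(S_n)$ denote the collection of all nodes in $m \ge 1$ such tours. Then,
$$\hat{\mu}(D_m(S_n)) = \frac{d_{S_n}}{2m} \sum_{k=1}^{m} \sum_{t=2}^{\xi_k} g\bigl(X_{t-1}^{(k)}, X_t^{(k)}\bigr) + \sum_{(u,v) \in E :\, u \in I_n \text{ or } v \in I_n} g(u,v) \qquad (2)$$
is an unbiased estimate of $\mu(G)$, i.e., $E[\hat{\mu}(D_m(S_n))] = \mu(G)$, where $g$ in the first sum is the function redefined on $G'$ and $g$ in the second sum is evaluated on the original graph $G$. Moreover the estimator is strongly consistent, i.e., $\hat{\mu}(D_m(S_n)) \to \mu(G)$ a.s. for $m \to \infty$.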
Theorem 1 provides an unbiased estimate of network statistics from random walk tours. A tour is short if it starts at a massive supernode, as the expected tour length is inversely proportional to the degree of the supernode. This provides a practical way to compute unbiased estimates of node, edge, and triangle statistics using the estimator in eq. (2) while observing only a small fraction of the original graph. Because random walk tours can have arbitrary lengths, we show in Lemma 2, Section 2.4, that there are upper and lower bounds on the variance of the estimator. For a bounded function, the upper bounds are shown to be always finite.
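A tour-based estimator of this kind can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `tour_reward` and `estimate`, the toy multigraph `G`, and the `known` argument (the part contributed by the supernode's own edges, assumed to be computed beforehand) are all hypothetical. The scaling by the supernode degree divided by two relies on the fact that, for a simple random walk, each undirected edge is crossed on average 2 / d_S times per tour, where d_S is the supernode degree.

```python
import random


def tour_reward(neighbours, supernode, f, rng):
    """Sum f over the edges crossed in one tour; edges touching the
    supernode contribute zero, as in Section 2.1."""
    total, node = 0.0, supernode
    while True:
        nxt = rng.choice(neighbours[node])
        if node != supernode and nxt != supernode:
            total += f(node, nxt)
        node = nxt
        if node == supernode:
            return total


def estimate(neighbours, supernode, f, tours, rng, known=0.0):
    """Average of the per-tour rewards scaled by d_S / 2, plus the part
    that is known a priori from the supernode's own edges."""
    d_s = len(neighbours[supernode])
    rewards = [tour_reward(neighbours, supernode, f, rng) for _ in range(tours)]
    return d_s / (2.0 * tours) * sum(rewards) + known


# Augmented multigraph as an adjacency list with repetition (toy example).
G = {
    "S": [2, 3, 6, 7],
    2: ["S", 3], 3: [2, "S", 4], 4: [3],
    6: ["S", 7], 7: [6, "S"],
}


def count_edge(u, v):
    return 1.0


# Edges not touching S are 2-3, 3-4 and 6-7, so the target value is 3.
print(estimate(G, "S", count_edge, tours=20000, rng=random.Random(1)))
```

The same list of tours can be reused with a different `f` to estimate several statistics from one crawl, since the walk itself does not depend on the function being estimated.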
In what follows we show the approximate posterior of the estimator in Theorem 1. In Section 4 we shall see that the approximate posterior matches the empirical posterior very well in simulations over real-world networks while crawling only a fraction of the nodes in the network.
Let be the true value outside the subgraph formed by the nodes that were merged into the supernode.
Bayesian approximation of the posterior of
Let
In the scenario of Theorem 1 for $m$ tours, and assuming a Normal prior on the mean and an inverse-gamma prior on the variance of the estimator, the marginal posterior density of the estimate converges in distribution, as $m \to \infty$, to a non-standardized Student's t-distribution
(3) 
with degrees of freedom parameter
location parameter
and scale parameter
Remark 1.
Note that approximation in (3) is a Bayesian approach and Theorem 1 is the frequentist counterpart. In fact, the motivation to form the Bayesian approach comes from the frequentist estimator ( samples). From the approximate posterior, the Bayesian MAP estimator for sufficiently large values of is
Thus when , the Bayesian estimator is essentially the first term in the frequentist estimator (second term is calculated a priori), and hence both the estimators are same. In this paper we make use of the posterior distribution from the Bayesian approach to get the degrees of belief along with the common estimator from both the approaches.
The above remark shows that the approximate posterior in (3) provides a way to assess the confidence in the estimate. The Normal prior for the average gives the largest variance for a given mean. The inverse-gamma is a non-informative conjugate prior if the variance of the estimator is not too small [9], which is generally the case in our application. Other choices of prior, such as uniform, are also possible, yielding different posteriors without closed-form solutions [9]. The posterior is conservative, as the estimate is calculated from all the tours while the posterior considers only a smaller number of tours. Being conservative, however, is advisable, as the posterior is derived for large numbers of tours and the conservative estimate better protects us from finite-sample anomalies and performs very well in practice, as we see in Section 4.
In what follows we provide the proofs of our main results.
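To make the degrees of freedom, location, and scale of such a posterior concrete, the sketch below applies the textbook Normal / inverse-gamma conjugate update, whose marginal posterior for the mean is a non-standardized Student's t-distribution. The hyperparameter names (`m0`, `mu0`, `nu0`, `sigma0_sq`), the function names, and the grouping of the per-tour estimates into about sqrt(m) batches are assumptions of this sketch and may differ from the parametrization in (3).

```python
import math


def student_t_posterior(samples, m0, mu0, nu0, sigma0_sq):
    """Normal / inverse-gamma conjugate update.

    Prior: mean | variance ~ Normal(mu0, variance / m0),
           variance ~ Inverse-gamma(nu0 / 2, nu0 * sigma0_sq / 2).
    Returns (degrees of freedom, location, scale) of the Student's t
    marginal posterior of the mean, given i.i.d. normal samples.
    """
    n = len(samples)
    mean = sum(samples) / n
    spread = sum((x - mean) ** 2 for x in samples)
    dof = nu0 + n
    location = (m0 * mu0 + n * mean) / (m0 + n)
    scale_sq = (nu0 * sigma0_sq + spread
                + m0 * n * (mean - mu0) ** 2 / (m0 + n)) / (dof * (m0 + n))
    return dof, location, math.sqrt(scale_sq)


def batch_means(tour_estimates):
    """Average the per-tour estimates in floor(sqrt(m)) batches of equal
    size; leftover tours are ignored (batching scheme assumed here)."""
    b = math.isqrt(len(tour_estimates))
    return [sum(tour_estimates[i * b:(i + 1) * b]) / b for i in range(b)]


dof, loc, scale = student_t_posterior([1.0, 2.0, 3.0], m0=1, mu0=0.0,
                                      nu0=1, sigma0_sq=1.0)
print(dof, loc, scale)
```

Batch averages are closer to normally distributed than single-tour estimates, which is why a sketch like this feeds `batch_means(...)` rather than the raw per-tour values into the update.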
2.3 Proof of Theorem 1
The outline of this proof is as follows. In Lemma 1 we show that the estimate of from each tour is unbiased.
Lemma 1.
Let be the nodes traversed by the th random walk tour on , starting at supernode . Then the following holds, ,
(4) 
Proof.
The random walk starts from the supernode , thus
(5) 
Consider a renewal reward process with inter-renewal time distributed as the tour length and reward equal to the number of times the Markov chain crosses a given edge. From the renewal reward theorem,
Here the lefthand side is essentially . Now (5) becomes
which concludes our proof. ∎
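The renewal-reward step can be sketched as the following computation. It assumes the simple random walk of Section 2.1 and uses two standard facts: the stationary probability of a node is proportional to its degree, and the mean return time to a node is the reciprocal of its stationary probability. The symbols $\alpha_{u,v}$ (number of edges between $u$ and $v$), $\xi_k$ (length of the $k$-th tour), $S_n$ (supernode), and $f$ (a symmetric edge function) are names chosen here for illustration, and $|E'|$ counts edges with multiplicity.

```latex
\begin{align*}
\pi_u &= \frac{d_u}{2|E'|}, \qquad
\mathbb{E}[\xi_k] = \frac{1}{\pi_{S_n}} = \frac{2|E'|}{d_{S_n}}, \\
\mathbb{E}\bigl[\#\{\text{crossings } u \to v \text{ in one tour}\}\bigr]
  &= \mathbb{E}[\xi_k]\,\pi_u\,p_{uv}
   = \frac{2|E'|}{d_{S_n}} \cdot \frac{d_u}{2|E'|} \cdot \frac{\alpha_{u,v}}{d_u}
   = \frac{\alpha_{u,v}}{d_{S_n}}, \\
\mathbb{E}\Bigl[\sum_{t=2}^{\xi_k} f\bigl(X_{t-1}^{(k)}, X_t^{(k)}\bigr)\Bigr]
  &= \frac{2}{d_{S_n}} \sum_{(u,v) \in E'} f(u,v).
\end{align*}
```

The factor 2 in the last line appears because each undirected edge can be crossed in both directions, each direction contributing $1/d_{S_n}$ crossings per tour on average.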
Theorem 1.
By Lemma 1 the estimator is an unbiased estimate of . By the linearity of expectation the average estimator is also unbiased. Finally for the estimator
has average
Furthermore, by the strong law of large numbers, since each tour estimate has finite mean, the estimator converges to the true value a.s. for $m \to \infty$. This completes our proof. ∎
2.4 Derivation of the approximate posterior
The derivation of (3) relies first on showing that the estimator has finite first and second moments. We go further and in Lemma 2 we introduce upper and lower bounds on the variance of the estimator. By Theorem 1 the first moment is finite, as the estimator is unbiased. To show that the second moment is finite we prove that each per-tour estimate has finite second moment. The results in Lemma 2 are of interest on their own because they establish a connection between the estimator variance and the spectral gap.
2.4.1 Impact of spectral gap on variance
Let , where is the random walk transition probability matrix as defined in Section 2.1 and is a diagonal matrix with the node degrees of
. The eigenvalues
of and are same and . Letth eigenvector of
be . Let be the spectral gap, . Let the left and right eigenvectors of be and respectively. . Define , with , and matrix with th element as . Also letbe the vector with
.Lemma 2.
The following holds

Assuming the function is bounded, , and for tour ,
Moreover,

(6)
Proof.
(i). The variance of the estimator at tour starting from node is
(7) 
It is known from [1, Chapter 2 and 3] that
Using Theorem 1 eq. (7) can be written as
The latter can be upperbounded by ).
For the second part, we have
for a constant using inequality. From [24], it is known that there exists an , such that , and this implies that for all . This proves the theorem.
(ii). We denote for and
indicates Gaussian distribution with mean
and variance. With the trivial extension of the central limit theorem of Markov chains
[16] of node functions to edge functions, we have for the ergodic estimator ,(8) 
where
We derive in Lemma 3. Note that is also the asymptotic variance of the ergodic estimator of edge functions.
Consider a renewal reward process at its th renewal, , with interrenewal time and reward . Let be the average cumulative reward gained up to th renewal, i.e., . From the central limit theorem for the renewal reward process [31, Theorem 2.2.5] after total number of steps, with , yields
(9) 
with and
In fact it can be shown that (see [24, Proof of Theorem 17.2.2])
Therefore . Combining this result with Lemma 3 shown in the appendix we get (6). ∎
We are now ready to derive the approximation (3).
Proof.
Let . Given
and and because the tours are i.i.d. the marginal posterior density of is
For now assume that
are i.i.d. normally distributed random variables, and let
then [15, Proposition C.4]
are the posteriors of parameters and , respectively. The nonstandardized
distribution can be seen as a mixture of normal distributions with equal mean and random variance inversegamma distributed
[15, Proposition C.6]. Thus, if are i.i.d. normally distributed then the posterior of is a nonstandardized distributed with parameters(10) 
It remains to show that these estimates converge in distribution to i.i.d. normal random variables as $m \to \infty$. As the spectral gap of $G'$ is greater than zero, Lemma 2 shows that
From the renewal theorem we know that the per-tour estimates are i.i.d. random variables and thus any subset of these variables is also i.i.d. By construction the estimates above are also i.i.d. with finite mean and finite variance. Applying the Lindeberg-Lévy central limit theorem [7, Section 17.4] yields
where . Thus, in the limit as and (recall that ), the variables are i.i.d. normally distributed with
and
is constant and known, which concludes our proof.
∎
3 Illustrative example
Here we consider the classical example of a low-conductance graph: the dumbbell graph. We illustrate how the supernode addresses the variance problem for random walk tours on graphs. A dumbbell graph consists of two complete graphs on $n$ vertices connected by a single edge. The spectral gap $\delta = 1 - \lambda_2$, where $\lambda_2$ is the second largest absolute eigenvalue of the random walk transition matrix, is roughly of order $1/n^2$.
It is known that the variance of the return time of tour is related to as [1]
Drawing nodes from both components to create the supernode, the new graph with the supernode will be more connected; hence the spectral gap improves, and so does the variance.
Another way to view the impact of forming the supernode is through the cover time of a random walk on the dumbbell graph. If several random walks run in parallel, with some conditions on how they are distributed, the cover time can be reduced relative to that of a single walk without the supernode [2]. In this view the supernode tours act as multiple parallel random walks that quickly cover more of the graph with less effort.
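The effect of the supernode on the spectral gap can be checked numerically on a small dumbbell graph. The sketch below is an illustration under stated assumptions: the helper names (`dumbbell`, `merge_into_supernode`, `spectral_gap`), the clique size of 10, and the choice of one non-bridge node per clique for the supernode are all choices made for this example. The gap is computed as one minus the second largest eigenvalue of the random walk matrix, using plain power iteration on the lazy, symmetrized form of the walk so that no external library is needed.

```python
import math


def dumbbell(n):
    """Adjacency counts for two K_n cliques joined by one bridge edge."""
    size = 2 * n
    adj = [[0] * size for _ in range(size)]
    for offset in (0, n):
        for i in range(n):
            for j in range(i + 1, n):
                adj[offset + i][offset + j] = adj[offset + j][offset + i] = 1
    adj[n - 1][n] = adj[n][n - 1] = 1  # the bridge
    return adj


def merge_into_supernode(adj, members):
    """Merge `members` into one node (index 0); edges among the members
    are dropped, parallel edges are kept as counts."""
    members = set(members)
    rest = [v for v in range(len(adj)) if v not in members]
    size = len(rest) + 1
    out = [[0] * size for _ in range(size)]
    for i, u in enumerate(rest, start=1):
        for j, v in enumerate(rest, start=1):
            out[i][j] = adj[u][v]
        out[0][i] = out[i][0] = sum(adj[u][s] for s in members)
    return out


def spectral_gap(adj, iters=300):
    """1 - lambda_2 of P = D^{-1} A, by power iteration on the lazy
    symmetric matrix (I + D^{-1/2} A D^{-1/2}) / 2."""
    size = len(adj)
    root = [math.sqrt(sum(row)) for row in adj]
    norm = math.sqrt(sum(r * r for r in root))
    top = [r / norm for r in root]  # unit eigenvector for eigenvalue 1

    def apply(x):
        return [0.5 * (x[i] + sum(adj[i][j] * x[j] / (root[i] * root[j])
                                  for j in range(size)))
                for i in range(size)]

    def deflate(x):
        dot = sum(a * b for a, b in zip(x, top))
        x = [a - dot * b for a, b in zip(x, top)]
        length = math.sqrt(sum(a * a for a in x))
        return [a / length for a in x]

    x = deflate([math.sin(i + 1.0) for i in range(size)])  # fixed start
    for _ in range(iters):
        x = deflate(apply(x))
    lazy = sum(a * b for a, b in zip(x, apply(x)))  # equals (1 + lambda_2) / 2
    return 2.0 * (1.0 - lazy)


plain = spectral_gap(dumbbell(10))
merged = spectral_gap(merge_into_supernode(dumbbell(10), [0, 19]))
print(plain, merged)
```

With one node taken from each clique, the supernode gives the walk a second, much wider passage between the two halves, and the computed gap grows accordingly.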
4 Experiments on Realworld Networks
In this section we demonstrate the effectiveness of the theory developed above with experiments on real data sets of various social networks (footnote 2: the developed software is available here: http://wwwsop.inria.fr/members/Jithin.Sreedharan/HypRW.zip). We assume the contribution from the supernode to the true value is known a priori, and hence we look for the remaining part of the true value in the experiments. In the case that the edges of the supernode are unknown, the estimation problem is easier and can be taken care of separately. One option is to start multiple random walks in the graph and form connected subgraphs. Later, in order to estimate the bias created by these subgraphs, perform some random walk tours from the largest-degree node in each of these subgraphs and use the idea in Theorem 1.
In the figures we display both approximate posterior and empirical posterior generated from . For the approximate posterior, we have used the following parameters . The green line in the plots shows the actual value .
In the numerical experiments, the supernode is formed in one of the following ways: a) uniformly sample nodes from the network without replacement; b) run a random walk crawl starting from any node, cover part of the graph, and form the supernode from the largest-degree nodes visited. In both cases, if the network is disconnected, the supernode should initially be created with at least one node from each connected component.
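The two ways of forming the supernode can be sketched as below for a simulation on a fully known graph. The function names (`components`, `uniform_supernode`, `crawl_supernode`), the `size` and `steps` parameters, and the toy graph are assumptions made for illustration; in particular, finding the connected components requires access to the whole graph, which is available in simulations but not in a real crawl.

```python
import random
from collections import deque


def components(adjacency):
    """Connected components of an undirected graph given as
    {node: list_of_neighbours}, found by breadth-first search."""
    seen, parts = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, part = deque([start]), [start]
        while queue:
            for nxt in adjacency[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                    part.append(nxt)
        parts.append(part)
    return parts


def uniform_supernode(adjacency, size, rng):
    """Option (a): one uniformly chosen node per component, then uniform
    sampling without replacement until `size` nodes are selected."""
    chosen = {rng.choice(part) for part in components(adjacency)}
    others = [v for v in adjacency if v not in chosen]
    extra = max(0, size - len(chosen))
    chosen.update(rng.sample(others, min(extra, len(others))))
    return chosen


def crawl_supernode(adjacency, start, steps, size, rng):
    """Option (b): random walk crawl from `start`, then keep the `size`
    visited nodes of largest degree."""
    node, visited = start, {start}
    for _ in range(steps):
        node = rng.choice(adjacency[node])
        visited.add(node)
    ranked = sorted(visited, key=lambda v: len(adjacency[v]), reverse=True)
    return set(ranked[:size])


toy = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(uniform_supernode(toy, 3, random.Random(0)))
print(crawl_supernode(toy, start=1, steps=50, size=2, rng=random.Random(0)))
```

A crawl started in one component never leaves it, so for a disconnected network option (b) would have to be repeated from a node in every component before the supernode is assembled.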
4.1 Friendster
First we study a network of moderate size, a connected subgraph of Friendster network with nodes and edges (data publicly available at the SNAP repository [21]). Friendster is an online social networking website where nodes are individuals and edges indicate friendship. Here, we consider two types of functions:
These functions reflect assortative nature of the network. The supernode is formed from uniformly sampled nodes just as a way to test our method. Figures 1 and 2 display the results for functions and , respectively. A good match between the approximate and empirical posteriors can be observed from the figures. Moreover the true value is also fitting well with the plots. The percentage of graph crawled is in terms of edges and this drops to if we use random walk based supernode formation.
4.2 Dogster network
The aim of this example is to check whether there is any affinity for making connections between the owners of same breed dogs [8]. The network data is based on the social networking website Dogster. Each user (node) indicates the dog breed; the friendships between dogs form the edges. Number of nodes is and number of edges is .
In Figure 3, two cases are plotted. Function