In 1970, (davis1970clustering) discovered that transitivity—the tendency of friends of friends to be friends themselves—is a prevalent feature in social networks. Since that early discovery, real-world networks have been observed to have many other common macroscopic features, and these discoveries have led to probabilistic models for networks that display these phenomena. The observation that transitivity and other common subgraphs are prevalent in networks motivated the exponential random graph model (ERGM) (frank1986markov). (barabasi1999emergence) demonstrated that many large networks display a scale-free power law degree distribution, and provided a model for constructing such graphs. Similarly, the small world phenomenon—that networks display surprisingly few degrees of separation—motivated the network model in (watts1998collective). While network science is often driven by the observation and modelling of common properties in networks, it is incumbent on the practicing data scientist to explore network data using statistical methods.
One approach to understanding network data is to fit free parameters in these network models to the data through likelihood-based or Bayesian methods. In (wasserman1996logit)
, a pseudolikelihood method was used with graphlet counts to fit an ERGM designed to display transitivity, and Monte Carlo Markov Chain (MCMC) methods were developed in(snijders2002markov) for fitting general ERGMs. Fitting such models from data can be cumbersome, and to do so implicitly assumes that the network follows such a model exactly. Network statistics, such as the clustering coefficient, algebraic connectivity, and degree sequence, are more flexible tools. A good statistic can be used to fit and test models, for example, (watts1998collective) used the local clustering coefficient, a measure of the number of triangles relative to wedges, to test if a network is a small-world graph. The clustering coefficient is also used to understand social network graphs, (chakrabarti2006graph)
. More generally, it was discovered that re-occurring subgraph patterns can be used to differentiate real-world networks, and that genetic networks, neural networks, and internet networks all presented different common interconnectivity patterns,(milo2002network). In this work, we will propose a new method for counting the occurrences of any subgraph pattern, otherwise known as graphlets—a term coined in (prvzulj2004modeling)—or motifs.
A graphlet is a small connected graph topology, such as a triangle, wedge, or -clique, which we will use to describe the local behavior of a larger network (example graphlets of size 3, 4, and 5, can be seen in Figure 1). Let the graph in question be where is a set of vertices and is a set of unordered pairs of vertices ( is assumed to be undirected and unweighted). Imagine specifying a -graphlet and testing for every induced subgraph of the graph (denoted where ), if it is isomorphic to the subgraph (it has the same topology). We would like to compute the number of Connected Induced Subgraphs of size (denoted by -CIS throughout) for which this match holds. We call this number the graphlet counts and the proportion of the number of such matches to the total number of -CISs is called the graphlet coefficient.
Graphlets are the graph analogue of wavelets (small oscillatory functions that are convolved with a signal to produce wavelet coefficients) because they are small topologies that are matched to induced subgraphs of the original graph to produce the graphlet coefficients. Graphlet coefficients, also referred to as graph moments, are used in the method of moments to fit certain graph models by selecting parameters that match the empirical graph moments to their expectations(bickel2011method). Graphlet coefficients are used to understand biological networks, such as the protein-protein interaction network, and reoccuring patterns are thought to indicate evolutionary conserved modules (prvzulj2006efficient). If the graphlet size, , is small then testing for isomorphism, a problem called graph matching (cordella2004sub), is feasible, but testing every induced subgraph can require on the order of iterations in its most naive implementation. We propose a class of methods called lifting that allow us to quickly estimate graphlet coefficients.
While there exist several algorithms that can estimate the proportion of triangles, wedges, and graphlets with four or five vertices (for example, (Ahmed2017count; rahman2014graft)), there are few algorithms that can efficiently estimate the proportion of larger graphlets. Many methods that can handle arbitrary graphlets are Monte Carlo sampling procedures that traverse through the space of all graphlets of a certain size within the large network. Two such methods are GUISE algorithm of (bhuiyan2012guise) and the pairwise subgraph random walk of (Wang2014psrw), which differ in the way that they perform a random walk between the CIS samples. Another option is to generate a sequence of vertices that induces a CIS sample, which has been done in (Han2016waddling) using an algorithm called Waddling random walk. Alternatives to random Monte Carlo schemes include the color coding scheme of (Bressan2017colourcoding), but it processes the whole graph, while the Monte Carlo schemes can traverse the network locally. In this work, we propose a new Monte Carlo algorithm, called lifting, for estimating the graphlet counts within a large network. The lifting step takes a smaller CIS of size and produces a subgraph of size by adding an adjacent vertex to it (according to a specific scheme). In this paper we consider procedures that start from vertices or edges and lift to sample CIS of size . Lifting is a simple, flexible procedure that can be easily distributed to accommodate massive network datasets.
1.1. Our contributions
Graphlet coefficients are multipurpose statistical tools that can be used for model fitting and testing, network regression, and exploratory analysis for network data. Any CIS Monte Carlo sampling scheme has three goals: that it provides unbiased estimates of the graphlet coefficients, that the variance of the estimated coefficients is small, and that it does so in as few iterations as possible. Because massive graphs are often stored in distributed databases, we would like the sampling scheme to require only neighborhood queries (requests to the database returns the neighbors of a node) and we will avoid storing or processing the full graph. Because communication is often the bottleneck in distributed computing, neighborhood queries are the basic unit for measuring computational time complexity.
After discussing the precise requirements for any CIS sampling procedure, we will introduce the lifting scheme for subgraphs. The key difficulty in any Monte Carlo method for graphlet counting is calculating the sampling distribution. We provide two methods, the ordered lift estimator and the unordered lift estimator, which differ in the way that subgraphs are represented and counted in the graphlet count. The ordered estimator allows for a modification, called shotgun sampling that samples multiple subgraphs in one shot. For our theoretical component, we prove that the estimated graphlet coefficients are unbiased when the underlying MCMC has reached the stationary distribution (called perfect mixing). We also prove that under perfect mixing, the variance of the estimator scales like where is the maximum degree, and show that the lifting scheme can have significantly lower sample correlations than the subgraph random walk. All proofs can be found in the supplement. We conclude with real-world network experiments that reinforce the contention that subgraph lifting has a lower variance than Waddling and lower sample correlation than subgraph random walks.
2. Sampling graphlets
2.1. Definitions and notation
Throughout we will assume that our graph is simple, connected and undirected. For a subset , a subgraph of induced by , is a graph with vertices in and edges in . We call a connected motif on vertices a -graphlet. Given a -graphlet , we’ll be interested in the number of -subgraphs of isomorphic to . The set of all connected induced -subgraphs (or -CISs) of is denoted by (or simply ).
An unordered set of vertices is denoted while an ordered list is denoted . Let be all non-isomorphic motifs for which we would like the graphlet counts. For , we say that “ is subgraph of type ” if is isomorphic to , and denote this with . The number of -subgraphs in of type is equal to , where is the indicator function. For a subgraph , denote to be the set of its vertices, to be the set of its edges. Denote (vertex neighborhood of ) to be the set of all vertices adjacent to some vertex in not including itself. Denote (edge neighborhood of ) to be the set of all edges that connect a vertex from and a vertex outside of . Also, denote (degree of ) to be the number of edges in , and denote (-degree of ) to be the number of vertices from that are connected to . Note that .
2.2. Prior graphlet sampling methods
The ideal Monte Carlo procedure would sequentially sample CISs uniformly at random from the set
, classify their type, and update the corresponding counts. Unfortunately, uniformly sampling CISs is not a simple task because they are required to be connected—a random set of k vertices is unlikely to be connected. CIS sampling methods require Monte Carlo Markov Chains (MCMCs) for which one can calculate the stationary distribution,, over the elements of . First, let us consider how we update the graphlet counts, , given a sample of CISs, .
The desire to sample subgraphs uniformly is natural, because the empirical counts will be unbiased estimates of their population quantities. Instead, suppose that our CIS sample, with , is drawn with known sampling distribution
. Then we use Horvitz-Thompson inverse probability weighting to estimate the graphlet counts,
It is simple to see that this is an unbiased estimate of the graphlet counts as long as is supported over all elements of . Alternative updating methods include rejection sampling, which can be combined with inverse probability weighting, but we will use (1) for ease of presentation.
We find ourselves in a game, where we can choose any CIS Monte Carlo algorithm that induces a stationary distribution , but we must be able to quickly and accurately compute in order to use (1). We will analyze the theoretical implications of the sampling algorithm based on mixing times of the Markov chain and the variance of the graphlet count estimates. Before we explore the lifting procedure, this paper’s algorithmic contribution, we would like to discuss some existing MCMC CIS sampling methods.
The subgraph random walk is described in (Wang2014psrw), where they make a modification called the pairwise subgraph random walk (PSRW). In order to perform a random walk where the states are , we form the CIS-relationship graph. Two -CISs, are connected with an edge if and only if vertex sets of and differ by one element, i.e. when . Given the graph structure, we sample -CISs by a random walk on the set , which is called Subgraph Random Walk (SRW). Because the transition from state is made uniformly at random to each adjacent CIS, then we know that the stationary distribution will sample each edge in the CIS-relationship graph with equal probability. This fact enables (Wang2014psrw) to provide a local estimator of the stationary probability . PSRW is a modification of the SRW algorithm, where each transition is performed from to in and then the -CIS is returned.
The mixing time is a critical issue for any MCMC, and subgraph sampling is no exception. Dependence between consecutive samples results in a higher variability of , and sufficient mixing is required for the stationary distribution to approximate the sampling distribution. It was pointed out in (Bressan2017colourcoding) that the mixing time of the SRW can be of order , even if the mixing time of the random walk on the original graph is of constant order . PSRW also requires global constants based on the CIS-graph, which can be computationally intractable (super-linear time).
Another approach is to sample on ordered sequences of vertices , denoted by , which would induce a -CIS, . Given a sampling scheme of such sequences with probability , the estimator for graphlet counts is given by
for some fixed weights . The main difference between these types of sampling is that we maintain the ordering of the vertices, while a CIS is an unordered set of vertices.
A naive method for sampling such sequences would be to perform a random walk on the graph and then sample the vertices most recently visited. This scheme is appealing because it has an easy to compute stationary distribution, and can ‘inherit’ the mixing rate from the random walk on (which is relatively small). Despite these advantages, certain graphlet topologies, such as stars, will never be sampled, and modifications are needed to remedy this defect. (chen2016general) combined this basic idea with the SRW by maintaining a length history of the SRW on CISs of size , and unioning the history, but this suffers from the same issues as SRW, such as slow mixing and the need to calculate global constants based on the CIS-graph.
(Han2016waddling) introduced a Waddling protocol which retains a memory of the last vertices in the random walk on and then extends this subgraph by vertices from either the first or last vertex visited in the -subgraph (this extension is known as the ‘waddle’). The authors provide a method for calculating the stationary distribution for this MCMC, and prove a bound on the error for the graphlet coefficients. The downside to this method is that the precise Waddling protocol used should depend on the desired graphlet, and the algorithm involves a rejection step which may lead to a loss of efficiency. In contrast, the lifting scheme has the advantage of inheriting the mixing time of the random walk on , and it can simultaneously sample many CISs of differing sizes without any rejections.
3. Subgraph lifting
The lifting algorithm is based on a randomized protocol of attaching a vertex to a given CIS. For any -CIS, we lift it to a -subgraph by adding a vertex from its neighborhood,
at random according to some probability distribution. Note that this basic lifting operation can explore any possible subgraph in. In this work, we show that one can lift from a random walk on the vertices, or another vertex or edge sampling distribution, and achieve favorable properties.
You can see an example of the lifting sampling scheme in Figure 2, where the algorithm iteratively builds a -CIS from a chosen node. First assume we have a node sampled from the distribution , a base distribution that can be computed from local information (step (a)). We assume that , where is some function (usually a polynomial) and is some global normalizing constant which is assumed to be precomputed. Denote . To start our procedure, sample an edge uniformly from (step (b)). The vertex is then attached to , forming a subgraph (step (c)). After that, we sample another edge (with ) uniformly from , and the vertex is then attached to (steps (d-f)). At each step we sample an edge (with ) from uniformly at random, and attach the vertex to the subgraph (steps (g-h)). After operations, we obtain a -CIS . We’ll refer to the procedure above as the lifting procedure starting at vertex .
By induction, we can see that every -CIS has a non-zero probability of being visited, assuming that is supported on every vertex. The Waddling method was motivated by the fact that a simple random walk would not visit all subgraphs in , yet Waddling only partially solved this issue, because not every wadding protocol can sample every -graphlet. Typically, is assumed to be the stationary distribution of a simple random walk, but it could be another distribution such as uniform sampling over the vertices or edges. One motivation for lifting is that if we sample the vertices from a simple random walk, then lifting ‘inherits’ its mixing time, much like Waddling. Hence, lifting is a simple algorithm that can sample any CIS with good mixing rates and no rejections. It remains to be demonstrated that we can calculate the probability of sampling the -CIS using only local information.
3.1. Ordered lift estimator
We can think of a lifting procedure as a way of sampling a sequence , ordered from the first vertex sampled to the last, that is then used to generate a CIS. Denote the set of such sequences as . Let be the -CIS obtained by the lifting procedure on step . The probability of sampling vertex on the step is equal to
Thus, the probability of sampling a sequence is equal to
Critically, this equation can be computed with only neighborhood information about the involved vertices, so it takes neighborhood queries. Because there are many orderings that could have led to the same CIS , then we need to apply proper weights in the graphlet count estimate (2) by enumerating the number of possible orderings.
Consider the sampled -CIS . Denote the set of possible sequences that would form in the lifting process as . Notice that must be a connected subgraph for all . Thus,
Since the elements of are just certain orderings of vertices in , we call an element from a compatible ordering of . Note that only depends on the type of the graphlet isomorphic to , and it can be precomputed using dynamic programming. Thus, when , the number of compatible orderings are equal: . Note that can vary from (for -path) to (for -clique). We set up the estimator from (2) as
We call it the ordered lift estimator for the graphlet count.
A drawback of the algorithm is that it takes queries to lift the CIS plus the number of steps required to sample the first vertex (when sampled from Markov chain). To increase the number of samples per query, notice that if we sample via lifting, we can get subgraphs induced by for all without any additional queries.
Thus, for each sampled sequence , we can compute the sum to incorporate the information about all -CISs in the neighborhood of . We call this procedure shotgun sampling. The corresponding estimator based on (2) is
Shotgun sampling produces more CIS samples with no additional query cost, but the CIS samples generated in a single iteration will be highly dependent. The following proposition states that the resulting estimators are unbiased.
Proposition 3.1 ().
The ordered lifted estimator, , and the shotgun estimator, , are unbiased for the graphlet counts .
3.2. Unordered lift estimator
The ordered lift estimators computed the probability of sampling the CIS and the order of included vertices. Alternatively, we can compute the marginal probability of sampling the unordered graphlet, for the lifted CIS . One advantage of this approach is that this probability is a function of only the degrees of vertices . This can be done either recursively or directly. Throughout, let the set of vertices of be .
We begin the algorithm by querying the probability of obtaining any vertex in , . We will build the probability of obtaining any connected subgraph of inductively. This is possible because the probability of getting via lifting is given by the sum , where the sum is taken over all connected -subgraphs , and denotes the probability of getting from to in the lifting procedure. Then
where the sum is taken over all connected -subgraphs .
where, given , is the th vertex in and .
Although calculation of this probability on-the-fly is cost-prohibitive, we can greatly reduce the number of operations by noticing that the probability is a function of degrees of the vertices: for a CIS of type , let be an arbitrary labelling of the vertices of with , then the probability of is
for a cached function given by (8).
Example. Consider a triangle, which is a 3-graphlet with edges , and . Given the degrees of the corresponding vertices, the probability function is
Example. Consider a wedge, which is a 3-graphlet with edges and . Given the degrees of the corresponding vertices, the probability function is
We need to only compute functions once before starting the algorithm. When a -CIS is sampled via lifting procedure, we find the natural labelling of vertices in via the isomorphism , and use the function together with the degrees of vertices of to compute the value of .
3.3. Sampling a starting vertex
One advantage of the lifting protocol is that it can be decoupled from the selection of a starting vertex, and our calculations remained agnostic to the distribution (although, we did require that it was a function of the degrees). There are two method that we would like to consider, one is the uniform selection over the set of vertices, and the other is from a random walk on the vertices, that presumably has reached its stationary distribution.
Consider sampling the starting vertex independently and from an arbitrary distribution when we have access to all the vertices. The advantage of sampling vertices independently, is that the lifting process will result in independent CIS samples. A byproduct of this is that the variance of the graphlet count estimator (1) can be decomposed into the variance of the individual CIS samples. Given iid draws, the variance of the estimator is then
which is small when the distribution of
is close to uniform distribution on. Equation (11) demonstrates fundamental property that when is small then it contributes more to the variance of the estimator. The variation in (11) can be reduced by an appropriate choice of , i.e. the starting distribution.
Calculating takes operations (preparation), sampling starting vertex takes operations, and lifting takes , where is the maximum vertex degree in .
When we don’t have access to the whole graph structure, a natural choice is to run a simple random walk (with transitional probabilities whenever in connected to with an edge). Then the stationary distribution is and we can calculate all probabilities accordingly. One feature of the simple random walk is that the resulting edge distribution is uniform: for all (edges are -graphlets). Therefore, the probabilities are the same as if sampling an edge uniformly at random and start lifting procedure from that edge.
4. Theoretical Analysis of Lifting
As long as the base vertex distribution, , is accurate then we have that the graphlet counts are unbiased for each of the aforementioned methods. The variance of the graphlet counts will differ between these methods and other competing algorithms such as Waddling and PSRW. The variance of sampling algorithms can be decomposed into two parts, an independent sample variance component and a between sample covariance component. As we have seen the independent variance component is based on the properties of resulting from the procedure (see (11)). The covariance component will be low if the samples are not highly dependent on the past, namely that the Markov chain is well mixed. We have three different estimators: Ordered Lift estimator , Shotgun Lift estimator and Unordered Lift estimator . For each estimator, we sample different objects: sequences for Ordered, sequences for Shotgun, and CISs for Unordered estimator. Throughout this section, we will denote
for the Ordered Lift estimator,
for the Shotgun Lift estimator,
for the Unordered Lift estimator,
Note that , and for the corresponding estimators.
The variance can be decomposed into the independent sample variance and a covariance term,
For Markov chains, the summand in the second term will typically decrease exponentially as the lag increases, due to mixing. Let us focus on the first term, with the goal of controlling this for either choice of base vertex distribution, , and the lifting scheme.
Theorem 4.1 ().
(1) If is the stationary distribution of the vertex random walk then
(2) If is the uniform distribution over the vertices then
This result is comparable to analogous theorems for Waddling, (Han2016waddling), and PSRW, (Wang2014psrw).
When the vertices are sampled independently, the covariance term disappears, so we will focus on the sampling vertices via random walk in this subsection. One advantage of the lifting procedure over the SRW is that it inherits the mixing properties from the vertex random walk. This can be thought of as a consequence of the data processing inequality in that the lifted CISs are no more dependent then the starting vertices from which they were lifted. To that end, let us review some basics about mixing of Markov chains,
Definition 4.2 ().
Define the mixing coefficient of a stationary Markov chain with discrete state space as
where is the stationary distribution of the Markov chain. Also, define the mixing time of a stationary Markov chain as
Theorem 4.3 ().
Given stationary Markov chain with being the second largest eigenvalue of the transitional matrix,
being the second largest eigenvalue of the transitional matrix,
There are two consequences of mixing for CIS sampling. First, an initial burn-in period is needed for the distribution to converge to the stationary distribution (and for the graphlet counts to be unbiased). Second, by spacing out the samples with intermediate burn-in periods and only obtaining CISs every steps we can reduce the covariance component of the variance of . Critically, if we wish to wait for steps, we do not need to perform the lifting scheme in the intervening iterations, since those graphlets will not be counted. So, unlike in other MCMC method, spacing in lifted CIS sampling is computationally very inexpensive. Because burn-in is a one-time cost and requires only a random walk on the graph, we will suppose that we begin sampling from the stationary distribution, and the remaining source of variation is due to insufficient spacing between samples. The following theorem illustrates the point that the lifted MCMC inherits mixing properties from the vertex random walk.
Theorem 4.4 ().
Consider sampling a starting vertex from a random walk, such that a sufficient burn in period has elapsed and stationarity has been reached. Let be the spacing between the CIS samples, be defined as in Theorem 4.1, and be the second largest eigenvalue of the transition matrix for the vertex random walk. Let be as defined in (12), (13) or (14), then
Corollary 4.5 ().
In the notation of the Theorem 4.4,
Hence, if we allow to grow large enough then we can reduce the effect of the covariance term, and our CISs will seem as if they are independent samples.
For our experiments, we picked five networks of different size, density and domain. All networks are available online in the network repository (nr-aaai15). The corresponding graphs are undirected, and were preprocessed to be simple and unweighted.
We demonstrate performance of the shotgun and unordered lift method compared to two competitive methods: PSRW and Waddling. All algorithms were implemented in Python, and the code is available on GitHub111github.com/KirillP23/LiftSRW.
We will compare the effectiveness of estimators in two ways:
Relative error of a graphlet count estimator with (wedges and triangles ) given limited number of queries (or, equivalently, limited number of samples).
Variance and correlation of samples for a graphlet count estimator for 3-stars and 4-paths .
In all methods, the first vertex of the lifting procedure is taken from a standard random walk on vertices. Whenever we specify the burn-in, that would correspond to the random walk steps taken to sample the starting vertex in case of Lifting and Waddling, and the number of steps taken between sampling CISs in case of PSRW. We compare the Shotgun Lift estimator against Waddling estimator for the first problem, and compare the Unordered Lift estimator against PSRW and Waddling for the second problem.
5.1. Relative Error given limited queries
The brute force method gives us the exact graphlet count for :
|Graphlet counts for|
We will compare the relative error
between Shotgun Lift estimator and Waddling estimator with burn-in . The number of queries required for each sample is for the Shotgun Lift algorithm and the number of queries for each -CIS sample is for the Waddling algorithm. We take the average error across runs.
As we see from the plots in Figure 3, Shotgun Lift method outperforms Waddling by a factor of 2 in most cases. This agrees with our theory, since the number of actual CISs sampled by the Shotgun method is much bigger than Waddle given the same number of queries.
5.2. Variation and correlation of samples
We will use the Unordered Lift estimator, to count all motifs of size , but the performance would be measured in terms of the variance of the estimator, both the ”variance under independence” and ”covariance” parts. To get an idea about the distribution of -graphlets, we count the -graphlets with the Unordered Lift estimator.
|Graphlet estimates for|
Denote for all estimators. Note that . We compare the variance of the estimator using equation (15). The table below shows estimates for :
|Variation under independence|
We can see that PSRW generally has smaller variation under independence, mostly because of the probability being independent of the degrees of vertices of . On the other hand, variation under independence of Lift and Waddle methods is comparable, meaning that varies similarly for those two methods. Next, we compare the dependence of and using correlation for different values of the burn-in (see Fig.4). For Lift and Waddling, the burn-in between and is the number of steps taken after sampling to get a new starting vertex for . For PSRW, burn-in is the number of steps between CIS samples in the random walk on subgraphs. From the graphs in Figure 4, we see that PSRW produces highly correlated samples compared to Lift and Waddling methods. This agrees with our analysis of PSRW, since it takes many more steps for the subgraph random walk to achieve desired mixing compared to the random walk on vertices.
6. Supplement to ”Estimating Graphlets via Lifting”
6.1. Proof of Prop. 3.1.
Let be as defined in (12), (13). For both estimators, because of the form of (5) and (6), if a single term is unbiased then is as well. Let us begin with , by considering a draw from the lifting process, which induces the -subgraph, . By the definition of ,
Hence, the is unbiased. Consider the shotgun estimator, ,
Hence, the shotgun estimator is unbiased as well. ∎
6.2. Proof of Theorem 4.1.
We can bound the variance in (11) by the second moment, which is bounded by,
Seeking to control the the maximum of , we see that,