 # Moment based estimation of stochastic Kronecker graph parameters

Stochastic Kronecker graphs supply a parsimonious model for large sparse real world graphs. They can specify the distribution of a large random graph using only three or four parameters. Those parameters have however proved difficult to choose in specific applications. This article looks at method of moments estimators that are computationally much simpler than maximum likelihood. The estimators are fast and in our examples, they typically yield Kronecker parameters with expected feature counts closer to a given graph than we get from KronFit. The improvement was especially prominent for the number of triangles in the graph.


## 1 Introduction

Stochastic Kronecker graphs were introduced by Leskovec et al. as a method for simulating very large random graphs. Random synthetic graphs are used to test graph algorithms and to understand observed properties of graphs. By using simulated graphs, instead of real measured ones, it is possible to test algorithms on graphs larger or denser than presently observed ones. Simulated graphs also allow one to judge which features of a real graph are likely to hold in other similar graphs and which are idiosyncratic to the given data.

Stochastic Kronecker graphs are able to serve these purposes through a model that has only three or four parameters. Parameter estimation poses unique challenges for those graphs. The main problem is that for a graph with $N$ nodes, the likelihood has contributions from all $N!$ permutations of the nodes. In practice, many thousands or millions of randomly sampled permutations are used to estimate the likelihood. Even then it takes substantial work to evaluate the likelihood contribution from even one of the permutations.

In this paper we present a method of moments strategy for parameter estimation. While moment methods can be inefficient compared to maximum likelihood, statistical efficiency is of reduced importance for enormous samples and in settings where the dominant error is lack of fit. The method equates expected to observed counts for edges, triangles, hairpins (2-stars, or wedges) and tripins (3-stars). The Kronecker model gives quite tractable formulas for these moments.

The outline of this paper is as follows. Section 2 defines Kronecker graphs and introduces some notation. Section 3 derives the expected feature counts. Section 4 describes how to solve method of moment equations for the parameters of the Kronecker graph model. Section 5 presents some examples of fitting Kronecker models to real world graphs. We compare several moment based ways to estimate Kronecker graph parameters and find the most reliable results come from a criterion that sums squared relative errors between observed and expected features. We find that the fitted Kronecker models usually underestimate the number of triangles compared to the real graphs. While our parameter estimates underestimate triangle counts and some other features, we find that they provide much closer matches than some previously published parameters fit by KronFit. Section 6 fits parameters to graphs that were randomly generated from the Kronecker model. We find that the estimated parameters closely track their generating values, with some small bias when a parameter is at the extreme range of valid values. Section 7 has our conclusions.

The data for our examples is available online, along with the code used to estimate Kronecker parameters.

## 2 The Kronecker model

Given a node set $V = \{1, \dots, N\}$ of cardinality $N$, and a matrix $P \in [0,1]^{N \times N}$, a random graph $G^*$ is one where the edge $(i, j)$ exists with probability $P_{ij}$, and all edges exist or don’t independently. The graph $G^*$ includes loops $(i, i)$ and may possibly include both $(i, j)$ and $(j, i)$. We snip these out by defining the random graph $G$ with edges $\{i, j\}$ only when $i > j$ and $(i, j)$ is an edge of $G^*$, using any non-random ordering of $V$. Both $G^*$ and $G$ are in fact probability weighted ensembles of graphs, but for simplicity we describe them as single random graphs. We assume that $P$ is a symmetric matrix and so the ordering of nodes does not affect the distribution.

The description of $P$ allows up to $N(N+1)/2$ parameters that affect the outcome. Much more parsimonious descriptions can be made by taking $P$ to be the Kronecker product of two or more smaller matrices. Recall that the Kronecker product of matrices $X \in \mathbb{R}^{m \times n}$ and $Y \in \mathbb{R}^{r \times s}$ is

$$X \otimes Y \equiv \begin{pmatrix} X_{11}Y & X_{12}Y & \cdots & X_{1n}Y \\ X_{21}Y & X_{22}Y & \cdots & X_{2n}Y \\ \vdots & \vdots & \ddots & \vdots \\ X_{m1}Y & X_{m2}Y & \cdots & X_{mn}Y \end{pmatrix} \in \mathbb{R}^{mr \times ns}.$$

An extremely parsimonious stochastic Kronecker graph takes $P$ to be the $r$-fold Kronecker product of $\Theta = \begin{pmatrix} a & b \\ b & c \end{pmatrix}$ for $a, b, c \in [0,1]$. That is

$$P = P^{(r)} = \Theta \otimes \Theta \otimes \cdots \otimes \Theta \equiv \Theta^{[r]}.$$

If the power $r$ is known, then only three numbers need to be specified, and with them we can then simulate other graphs that are like the original. Perhaps surprisingly, stochastic Kronecker graphs imitate many, but of course not all, of the important features seen in large real world graphs.

We would like to pick parameters to match the properties seen in a real and large graph. Parameter matrices $\begin{pmatrix} a & b \\ b & c \end{pmatrix}$ and $\begin{pmatrix} c & b \\ b & a \end{pmatrix}$ give rise to the same graph distribution. To force identifiability, we may assume that $a \ge c$.

## 3 Moment formulas

The Kronecker structure in $P$ makes certain aspects of $G$ very tractable. For example, the number $E$ of edges in $G$ can be shown to have expectation

$$E(E) = \tfrac{1}{2}\bigl((a+2b+c)^r - (a+c)^r\bigr). \tag{1}$$

Simply counting the edges in $G$ gives us valuable information on the parameter vector $(a, b, c)$. Because $E$ is a sum of independent Bernoulli random variables we find that $\mathrm{Var}(E) = \sum_{i>j} P_{ij}(1 - P_{ij}) \le E(E)$, and so the relative uncertainty will be small in a graph with a large number of expected edges.

This section derives equation (1) and similar formulas for the expected number of features of various types. The expected feature counts require sums over various sets of nodes. Section 3.1 records some summation formulas that simplify that task. Then Section 3.2 turns expected feature counts into sums and Section 3.3 shows how those sums simplify for stochastic Kronecker matrices.

### 3.1 Summation formulas

Let $f_{ij\cdots k} \in \mathbb{R}$ for indices ranging over a finite index set. A plain summation sign represents sums over all combinations of levels of all the subscripting indices used. The symbol $\sum^*$ includes all levels of all indices, except for any combinations where two or more of those indices take the same value. In several places we find that sums are easier to do over all levels of all indices, while the desired sums are over unique levels. Here we record some formulas to translate the second type into the first.

It is elementary that

$$\sum_{ij}^{*} f_{ij} = \sum_{ij} f_{ij} - \sum_{i} f_{ii}, \tag{2}$$

and similarly

$$\sum_{ijk}^{*} f_{ijk} = \sum_{ijk} f_{ijk} - \sum_{ij}\bigl(f_{ijj} + f_{iji} + f_{iij}\bigr) + 2\sum_{i} f_{iii}. \tag{3}$$

When there are four indices, we get

$$\begin{aligned}\sum_{ijkl}^{*} f_{ijkl} = \sum_{ijkl} f_{ijkl} &- \sum_{ijk}\bigl(f_{iijk} + f_{ijik} + f_{ijki} + f_{ijjk} + f_{ijkj} + f_{ijkk}\bigr) \\ &+ \sum_{ij}\bigl(f_{iijj} + f_{ijij} + f_{ijji}\bigr) + 2\sum_{ij}\bigl(f_{iiij} + f_{iiji} + f_{ijii} + f_{ijjj}\bigr) - 6\sum_{i} f_{iiii}. \end{aligned}\tag{4}$$

Equation (4) is more complicated than the others. It can be proved by holding one index fixed, applying (3) to the remaining three indices, and then accounting for coincidences with the fixed index.

In some of our formulas below, the first index is singled out but the others are exchangeable. By this we mean that $f_{ijk} = f_{ikj}$ when there are three indices, while invariance of $f_{ijkl}$ under all six permutations of $j$, $k$, and $l$ is the version for four indices.

When indices after the first are exchangeable, then equation (3) simplifies to

$$\sum_{ijk}^{*} f_{ijk} = \sum_{ijk} f_{ijk} - \sum_{ij}\bigl(f_{ijj} + 2f_{iij}\bigr) + 2\sum_{i} f_{iii}, \tag{5}$$

and equation (4) simplifies to

$$\sum_{ijkl}^{*} f_{ijkl} = \sum_{ijkl} f_{ijkl} - 3\sum_{ijk}\bigl(f_{iijk} + f_{ijjk}\bigr) + \sum_{ij}\bigl(2f_{ijjj} + 3f_{iijj} + 6f_{iiij}\bigr) - 6\sum_{i} f_{iiii}. \tag{6}$$

When all indices are exchangeable, so that $f_{ijk}$ is unchanged under all permutations of $i$, $j$, and $k$, then equation (5) simplifies to

$$\sum_{ijk}^{*} f_{ijk} = \sum_{ijk} f_{ijk} - 3\sum_{ij} f_{iij} + 2\sum_{i} f_{iii}. \tag{7}$$
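These identities are easy to sanity check numerically. The sketch below (ours, not from the paper's code) verifies the three-index identity (3) for an arbitrary array, comparing a brute-force sum over distinct index triples with the full-sum expression; NumPy's `einsum` handles the coincidence sums:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n = 5
f = rng.random((n, n, n))  # an arbitrary three-index array

# Left side of (3): sum over index triples whose values are all distinct.
starred = sum(f[i, j, k]
              for i, j, k in itertools.product(range(n), repeat=3)
              if len({i, j, k}) == 3)

# Right side of (3): the full sum minus the coincidence corrections.
full = f.sum()
pairs = (np.einsum('ijj->', f) + np.einsum('iji->', f)
         + np.einsum('iij->', f))
triples = np.einsum('iii->', f)
rhs = full - pairs + 2 * triples

assert np.isclose(starred, rhs)
```

The same pattern, with more correction terms, checks (4) through (7).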

### 3.2 Expected feature counts for independent edges

The graph features we describe are shown in Figure 1. In addition to edges, there are hairpins (2-stars, or wedges) where two edges share a common node, tripins (3-stars) where three edges share a node, and triangles. The Kronecker model has independent edges. Here we find the expected feature counts for any random graph where edge $(i, j)$ appears with probability $P_{ij}$ and edges are independent.

Figure 1: This figure illustrates some of the graph features that we can count, for use in moment based estimates of the parameters in the stochastic Kronecker graph.

Recall that $G^*$ is a random graph in which the edge $(i, j)$ appears with probability $P_{ij}$ (independently). Let it have adjacency matrix $A^* \in \{0,1\}^{N \times N}$. There may be loops $A^*_{ii} = 1$, and for $i \ne j$, $A^*_{ij}$ and $A^*_{ji}$ are independently generated. The graph $G$ is formed by deleting loops from $G^*$ and symmetrizing the adjacency matrix via

$$A_{ij} = \begin{cases} A^*_{ij} & i > j \\ 0 & i = j \\ A^*_{ji} & i < j. \end{cases}$$

The number of edges in $G$ is $E = \sum_{i>j} A_{ij}$. The expected number of edges satisfies

$$2E(E) = E\Bigl(\sum_{ij}^{*} A_{ij}\Bigr) = \sum_{ij}^{*} P_{ij} = \sum_{ij} P_{ij} - \sum_{i} P_{ii}, \tag{8}$$

using $E(A_{ij}) = P_{ij}$ for $i \ne j$ and equation (2).

The number of hairpins in $G$ is $H = \frac12\sum_{ijk}^{*} A_{ij}A_{ik}$. Dividing by two adjusts the sum for counting each hairpin twice. The expected value of $H$ satisfies

$$2E(H) = \sum_{ijk}^{*} P_{ij}P_{ik} = \sum_{ijk} P_{ij}P_{ik} - \sum_{ij} P_{ij}^2 - 2\sum_{ij} P_{ii}P_{ij} + 2\sum_{i} P_{ii}^2,$$

by letting $f_{ijk} = P_{ij}P_{ik}$, for which $f_{ijk} = f_{ikj}$, and applying equation (5).

The number of triangles in $G$ is $\Delta = \frac16\sum_{ijk}^{*} A_{ij}A_{ik}A_{jk}$, because the sum counts each triangle $3! = 6$ times. The expected value of each term is $P_{ij}P_{ik}P_{jk}$, which is symmetric in its three arguments and so we may apply equation (7) to get

$$6E(\Delta) = \sum_{ijk} P_{ij}P_{ik}P_{jk} - 3\sum_{ij} P_{ii}P_{ij}^2 + 2\sum_{i} P_{ii}^3.$$

The number of tripins in $G$ is $T = \frac16\sum_{ijkl}^{*} A_{ij}A_{ik}A_{il}$. The final three indices in $f_{ijkl} = P_{ij}P_{ik}P_{il}$ are exchangeable, and so equation (6) applies. Thus

$$\begin{aligned}6E(T) &= \sum_{ijkl} P_{ij}P_{ik}P_{il} - 3\sum_{ijk} P_{ii}P_{ij}P_{ik} - 3\sum_{ijk} P_{ij}^2 P_{ik} \\ &\quad + 2\sum_{ij} P_{ij}^3 + 3\sum_{ij} P_{ii}P_{ij}^2 + 6\sum_{ij} P_{ii}^2 P_{ij} - 6\sum_{i} P_{ii}^3.\end{aligned}$$
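For a small symmetric $P$ these expectations can be checked directly against brute-force sums over distinct indices. The sketch below (our own code; function names are assumptions) evaluates the full-sum expressions above and compares them with the starred sums:

```python
import itertools

import numpy as np

def expected_counts(P):
    """2E(E), 2E(H), 6E(Delta), 6E(T) for independent edges with a
    symmetric probability matrix P, using the full-sum formulas above."""
    d = np.diag(P)
    rs = P.sum(axis=1)          # row sums of P
    sq = (P ** 2).sum(axis=1)   # row sums of squared entries
    twoE = P.sum() - d.sum()
    twoH = (rs ** 2).sum() - (P ** 2).sum() - 2 * (d * rs).sum() + 2 * (d ** 2).sum()
    sixD = (np.einsum('ij,ik,jk->', P, P, P)
            - 3 * (d * sq).sum() + 2 * (d ** 3).sum())
    sixT = ((rs ** 3).sum() - 3 * (d * rs ** 2).sum() - 3 * (sq * rs).sum()
            + 2 * (P ** 3).sum() + 3 * (d * sq).sum()
            + 6 * (d ** 2 * rs).sum() - 6 * (d ** 3).sum())
    return twoE, twoH, sixD, sixT

# Brute-force starred sums over distinct indices for a random symmetric P.
rng = np.random.default_rng(1)
M = rng.random((5, 5))
P = (M + M.T) / 2
N = P.shape[0]
twoE = sum(P[i, j] for i, j in itertools.permutations(range(N), 2))
twoH = sum(P[i, j] * P[i, k] for i, j, k in itertools.permutations(range(N), 3))
sixD = sum(P[i, j] * P[i, k] * P[j, k]
           for i, j, k in itertools.permutations(range(N), 3))
sixT = sum(P[i, j] * P[i, k] * P[i, l]
           for i, j, k, l in itertools.permutations(range(N), 4))
assert np.allclose(expected_counts(P), (twoE, twoH, sixD, sixT))
```

`itertools.permutations` enumerates exactly the index tuples with all values distinct, which is what the starred sums require.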

### 3.3 Simplifying the sums

The sums in the expected counts simplify, because of the properties of the Kronecker graph. Let the node set be $\{1,2\}^r$, so that $N = 2^r$. For $i \in \{1,2\}^r$ write $i = (i_1, \dots, i_r)$ with each $i_s \in \{1,2\}$. Similarly let $j$, $k$, and $l$ be described in terms of $j_s$, $k_s$, and $l_s$ for $s = 1, \dots, r$.

The matrix entry $P^{(r)}_{ij}$ may be written

$$P^{(r)}_{ij} = \prod_{s=1}^{r} \Theta_{i_s j_s}.$$

For $r > 1$, we simplify the expression by induction using a smaller version of the problem defined via $P^{(r-1)} = \Theta^{[r-1]}$. Specifically,

$$\begin{aligned}\sum_{ijk} P^{(r)}_{ij}P^{(r)}_{ik} &= \sum_{i_1\cdots i_r}\sum_{j_1\cdots j_r}\sum_{k_1\cdots k_r}\prod_{s=1}^{r}\Theta_{i_s j_s}\Theta_{i_s k_s} \\ &= \Bigl(\sum_{i_1\cdots i_{r-1}}\sum_{j_1\cdots j_{r-1}}\sum_{k_1\cdots k_{r-1}}\prod_{s=1}^{r-1}\Theta_{i_s j_s}\Theta_{i_s k_s}\Bigr)\sum_{i_r j_r k_r}\Theta_{i_r j_r}\Theta_{i_r k_r} \\ &= \Bigl(\sum_{ijk} P^{(r-1)}_{ij}P^{(r-1)}_{ik}\Bigr)\sum_{i_r j_r k_r}\Theta_{i_r j_r}\Theta_{i_r k_r} \\ &= \Bigl(\sum_{i_r j_r k_r}\Theta_{i_r j_r}\Theta_{i_r k_r}\Bigr)^{r},\end{aligned}$$

where the indices $i_r$, $j_r$, and $k_r$ are summed over $\{1,2\}$, and the indices $i$, $j$, and $k$ applied to $P^{(r-1)}$ are summed over the node set $\{1,2\}^{r-1}$.

All of the sums of products of elements of $P$ listed in the previous section, with summation over all levels of each index, also reduce this way to $r$'th powers of their value for the case $r = 1$.

For $r = 1$ we need to sum products of elements of $\Theta$ over $(i,j)$, over $(i,j,k)$, or over $(i,j,k,l)$. These cases correspond to rows of Table 1. For instance $\sum_{ijk} P_{ij}P_{ik}$ requires $\sum_{i_r j_r k_r}\Theta_{i_r j_r}\Theta_{i_r k_r}$, which we know to be the $r$'th power of

$$\sum_{i_r j_r k_r}\Theta_{i_r j_r}\Theta_{i_r k_r} = a^2 + ba + b^2 + cb + ab + b^2 + bc + c^2 = (a+b)^2 + (b+c)^2. \tag{9}$$

The first expression (9) follows by summing over the rows of Table 1. As a result

$$\sum_{ijk} P_{ij}P_{ik} = \bigl((a+b)^2 + (b+c)^2\bigr)^{r}.$$

In the rest of this section, we record the other sums we need. First, the sums over one index variable take the form

$$\sum_{i} P_{ii}^m = (a^m + c^m)^r, \tag{10}$$

where the cases $m = 1, 2, 3$ are used in our expected feature counts. The sums over two index variables are

$$\sum_{ij} P_{ii}^m P_{ij}^n = \bigl(a^m(a^n + b^n) + c^m(b^n + c^n)\bigr)^r.$$

The cases we need have $m + n \le 3$ with $n \ge 1$.

Four sums over three indices are used. They are:

$$\begin{aligned}\sum_{ijk} P_{ij}P_{ik} &= \bigl((a+b)^2 + (b+c)^2\bigr)^r \\ \sum_{ijk} P_{ij}^2 P_{ik} &= \bigl(a^3 + c^3 + b(a^2 + c^2) + b^2(a + c) + 2b^3\bigr)^r \\ \sum_{ijk} P_{ij}P_{ik}P_{jk} &= \bigl(a^3 + c^3 + 3b^2(a + c)\bigr)^r, \quad\text{and} \\ \sum_{ijk} P_{ii}P_{ij}P_{ik} &= \bigl(a(a+b)^2 + c(b+c)^2\bigr)^r.\end{aligned}$$

Finally, one sum over four indices is used:

$$\sum_{ijkl} P_{ij}P_{ik}P_{il} = \bigl((a+b)^3 + (b+c)^3\bigr)^r.$$

### 3.4 Expected feature counts

Now we can specialize the results of Section 3.2 to the Kronecker graph setting. Gathering together the previous developments, we find

$$\begin{aligned}2E(E) &= (a+2b+c)^r - (a+c)^r \\ 2E(H) &= \bigl((a+b)^2+(b+c)^2\bigr)^r - \bigl(a^2+2b^2+c^2\bigr)^r - 2\bigl(a(a+b)+c(b+c)\bigr)^r + 2\bigl(a^2+c^2\bigr)^r \\ 6E(\Delta) &= \bigl(a^3+3b^2(a+c)+c^3\bigr)^r - 3\bigl(a(a^2+b^2)+c(b^2+c^2)\bigr)^r + 2\bigl(a^3+c^3\bigr)^r \\ 6E(T) &= \bigl((a+b)^3+(b+c)^3\bigr)^r - 3\bigl(a(a+b)^2+c(b+c)^2\bigr)^r - 3\bigl(a^3+c^3+b(a^2+c^2)+b^2(a+c)+2b^3\bigr)^r \\ &\quad + 2\bigl(a^3+2b^3+c^3\bigr)^r + 3\bigl(a^3+c^3+b^2(a+c)\bigr)^r + 6\bigl(a^3+c^3+b(a^2+c^2)\bigr)^r - 6\bigl(a^3+c^3\bigr)^r.\end{aligned}$$
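Putting these together in code, a small helper (ours; the function name is an assumption) returns the four expected counts for given $(a, b, c, r)$. For $a = b = c = 1$ it reproduces the complete-graph counts of Section 3.5:

```python
def kron_expected(a, b, c, r):
    """Expected E, H, Delta, T for a stochastic Kronecker graph with
    2x2 initiator [[a, b], [b, c]] and r Kronecker powers (Section 3.4)."""
    E = ((a + 2*b + c)**r - (a + c)**r) / 2
    H = (((a + b)**2 + (b + c)**2)**r
         - (a*a + 2*b*b + c*c)**r
         - 2*(a*(a + b) + c*(b + c))**r
         + 2*(a*a + c*c)**r) / 2
    D = ((a**3 + 3*b*b*(a + c) + c**3)**r
         - 3*(a*(a*a + b*b) + c*(b*b + c*c))**r
         + 2*(a**3 + c**3)**r) / 6
    T = (((a + b)**3 + (b + c)**3)**r
         - 3*(a*(a + b)**2 + c*(b + c)**2)**r
         - 3*(a**3 + c**3 + b*(a*a + c*c) + b*b*(a + c) + 2*b**3)**r
         + 2*(a**3 + 2*b**3 + c**3)**r
         + 3*(a**3 + c**3 + b*b*(a + c))**r
         + 6*(a**3 + c**3 + b*(a*a + c*c))**r
         - 6*(a**3 + c**3)**r) / 6
    return E, H, D, T

# Complete graph on N = 2^r = 8 nodes when a = b = c = 1.
print(kron_expected(1, 1, 1, 3))   # -> (28.0, 168.0, 56.0, 280.0)
```

These are $N(N-1)/2$, $N(N-1)(N-2)/2$, $N(N-1)(N-2)/6$, and $N(N-1)(N-2)(N-3)/6$ for $N = 8$, as expected.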

In each formula, the terms from sums over fewer indices come after the ones from more indices. The later terms adjust for loops, double edges, and other degenerate quantities. For large $r$, we expect that the first term should be most important. In particular if $b > 0$ then in all cases the first quantity raised to the power $r$ is the largest one. For example, the first term in $2E(E)$ is $((a+2b+c)/(a+c))^r$ times as large as the second one, which subtracts out loops.

The first term will dominate for large $r$ unless $b = 0$. The relative magnitude of the second term is

$$\Bigl(\frac{a+c}{a+2b+c}\Bigr)^r = 2^{\,r\log_2((a+c)/(a+2b+c))} = N^{-\alpha},$$

where $\alpha = \log_2\bigl((a+2b+c)/(a+c)\bigr) \ge 0$. If $\alpha$ is large enough, specifically if $2\alpha > \log_2(a+2b+c)$, then dropping the second term in $2E(E)$ makes a smaller difference than the sampling uncertainty in $E$. This holds when the off diagonal element of $\Theta$ is not too small compared to the diagonal elements: equivalently, $(a+c)^2 < a + 2b + c$.

### 3.5 Illustrations

Some special cases of the formulas are of interest. For example if $b = 0$ then there are no edges in $G^*$ apart from loops. As a result $G$ has $N$ isolated nodes. We find from the above that $E(E) = E(H) = E(\Delta) = E(T) = 0$ when $b = 0$.

If instead $a = c = 0$ then each node $i$ with coordinates $(i_1, \dots, i_r)$ has a dual node $\bar i$ which has coordinates $\bar i_s = 3 - i_s$ for $s = 1, \dots, r$. The only possible edges in $G$ are between nodes and their duals. There are $2^r$ nodes each with probability $b^r$ of having an edge out to its dual. The formula above gives $E(E) = 2^{r-1} b^r$ when $a = c = 0$, as it should. We also get $E(H) = E(\Delta) = E(T) = 0$ when $a = c = 0$.

If $a = b = c = 1$, then $G^*$ has every possible edge and loop with probability $1$. As a result $G$ is the complete graph on $N = 2^r$ nodes. Then it has $N(N-1)/2$ edges, $N(N-1)(N-2)/2$ hairpins, $N(N-1)(N-2)/6$ triangles, and it has $N(N-1)(N-2)(N-3)/6$ tripins.

## 4 Solving for a, b, and c

There are four equations in Section 3.4. To estimate $a$, $b$, and $c$ will require at least three of them. Because they are high order polynomials it is possible that there are multiple solutions, or even none at all. The latter circumstance would provide some evidence of lack of fit of the stochastic Kronecker model to a given graph. Regardless, each of the equations involves the count of a feature in the graph.

### 4.1 Counting features in a graph

Three of the features we use are easily obtainable from the degrees of the nodes. Let $d_i$ be the degree of node $i$ in graph $G$. Then

$$E = \frac12\sum_i d_i, \qquad H = \frac12\sum_i d_i(d_i - 1), \quad\text{and}\quad T = \frac16\sum_i d_i(d_i-1)(d_i-2)$$

give the number of edges, hairpins (or wedges), and tripins in terms of the degrees $d_i$.
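In code, these degree-based counts are a single pass over the degree sequence (a sketch; the function name is ours):

```python
import numpy as np

def degree_features(degrees):
    """Edges, hairpins (wedges), and tripins from a degree sequence."""
    d = np.asarray(degrees, dtype=float)
    E = d.sum() / 2
    H = (d * (d - 1)).sum() / 2
    T = (d * (d - 1) * (d - 2)).sum() / 6
    return float(E), float(H), float(T)

# Complete graph K4: every node has degree 3.
print(degree_features([3, 3, 3, 3]))   # -> (6.0, 12.0, 4.0)
```

For the complete graph on four nodes this gives 6 edges, 12 wedges, and 4 tripins, matching the formulas of Section 3.5.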

The number of triangles is not a simple function of the degrees. Algorithms to count triangles are considered in the literature. The time complexity of exact counting can be as low as $O(E^{3/2})$, and sometimes even lower for approximate counting.
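One simple exact approach, practical when the adjacency matrix fits in memory, uses the fact that each triangle contributes six closed walks of length three, so $\Delta = \mathrm{trace}(A^3)/6$. A sketch (this dense-matrix method is our illustration, not the paper's implementation):

```python
import numpy as np

def count_triangles(A):
    """Exact triangle count from a 0/1 symmetric adjacency matrix
    with zero diagonal; each triangle yields 6 closed 3-walks."""
    A = np.asarray(A, dtype=np.int64)
    return int(np.trace(A @ A @ A)) // 6

# Complete graph K4 has C(4,3) = 4 triangles.
A = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)
print(count_triangles(A))   # -> 4
```

Specialized algorithms on sparse edge lists are far more efficient for large graphs; the dense form is only a readable reference.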

### 4.2 Objective functions

A pragmatic way to choose $a$, $b$, and $c$ is to solve

$$\min_{a,b,c} \sum_{F} \frac{\bigl(F - E_{a,b,c}(F)\bigr)^2}{E_{a,b,c}(F)}, \tag{11}$$

where the sum is over three or four of the features $F$ from Section 3.4 and the minimization is taken over $0 \le a, b, c \le 1$. The terms in (11) are scaled by an approximate variance. A sharper expression would account for correlations among the features used. That should increase statistical efficiency, but in large problems lack of fit to the Kronecker model is likely to be more important than inefficiency of estimates within it.

Many real world networks may not have good fits in terms of these three Kronecker parameters. This is the case for most of the forthcoming experiments. The following more general objective can be more robust to these deviations:

$$\min_{a,b,c} \sum_{F} \frac{D\bigl(F,\ E_{a,b,c}(F)\bigr)}{N\bigl(F,\ E_{a,b,c}(F)\bigr)}. \tag{12}$$

Here $D$ is either of the two distance functions:

$$D_{\mathrm{sq}}(x,y) = (x-y)^2 \quad\text{or}\quad D_{\mathrm{abs}}(x,y) = |x - y|,$$

and $N$ is one of the normalizations:

$$N_F(F,E) = F, \quad N_{F^2}(F,E) = F^2, \quad N_E(F,E) = E, \quad N_{E^2}(F,E) = E^2.$$

Using $D_{\mathrm{sq}}$ and $N_E$ makes it equal to the previous objective (11).

In principle, either of the two distance functions can be combined with any of the four normalizations. We do not think it is reasonable to expect a quadratic denominator to be a suitable match for the absolute error. Therefore our investigations exclude combinations of $D_{\mathrm{abs}}$ with either $N_{F^2}$ or $N_{E^2}$.

We will find in Section 5 below that robust results arise from the combination $D_{\mathrm{sq}}$ and $N_{F^2}$, for which (12) reduces to

$$\min_{a,b,c} \sum_{F} \Bigl(\frac{F - E_{a,b,c}(F)}{F}\Bigr)^2, \tag{13}$$

a sum of squared relative errors.
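A brute-force version of (13) is easy to sketch. The helper below is our own illustration (for brevity it matches only the edge, hairpin, and triangle moments); it scans a grid over $[0,1]^3$, restricted to $a \ge c$ for identifiability, and returns the minimizer of the summed squared relative errors:

```python
from itertools import product

import numpy as np

def expected_EHD(a, b, c, r):
    """Closed-form expected edges, hairpins, triangles (Section 3.4)."""
    E = ((a + 2*b + c)**r - (a + c)**r) / 2
    H = (((a + b)**2 + (b + c)**2)**r - (a*a + 2*b*b + c*c)**r
         - 2*(a*(a + b) + c*(b + c))**r + 2*(a*a + c*c)**r) / 2
    D = ((a**3 + 3*b*b*(a + c) + c**3)**r
         - 3*(a*(a*a + b*b) + c*(b*b + c*c))**r + 2*(a**3 + c**3)**r) / 6
    return np.array([E, H, D])

def fit_grid(observed, r, m=20):
    """Minimize the squared relative error objective (13) on an
    (m+1)^3 grid over [0,1]^3, restricted to a >= c."""
    obs = np.asarray(observed, dtype=float)
    grid = np.linspace(0.0, 1.0, m + 1)
    best, best_val = None, np.inf
    for a, b, c in product(grid, repeat=3):
        if a < c:
            continue
        val = (((obs - expected_EHD(a, b, c, r)) / obs) ** 2).sum()
        if val < best_val:
            best, best_val = (a, b, c), val
    return best, best_val
```

On feature counts generated from the model itself, a grid containing the true $(a, b, c)$ recovers them exactly; in practice one would refine the grid answer with a local optimizer, as in Section 5.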

Because there are only three parameters, the criterion (12) can simply be evaluated over a grid inside $[0,1]^3$. To be sure of taking a point within $\epsilon$ of the minimizer takes work $O(\epsilon^{-3})$. An alternative is to employ a general nonlinear minimization procedure. The remainder of this section looks at a method to reduce that effort.

In a synthetic graph $r$ is known. When fitting to a real world graph a pragmatic choice is $r = \lceil \log_2(N) \rceil$. The interpretation is that the random graph may have had isolated nodes that were then dropped when forming $G$, but we suppose that fewer than half of the nodes of the generating graph have been dropped.

If we consider just the lead terms, then we could get estimates $\hat a$, $\hat b$, and $\hat c$ by solving three of the equations:

$$\begin{aligned} e \equiv (2E)^{1/r} &= \hat a + 2\hat b + \hat c, \\ h \equiv (2H)^{1/r} &= (\hat a + \hat b)^2 + (\hat b + \hat c)^2, \\ \delta \equiv (6\Delta)^{1/r} &= (\hat a^3 + \hat c^3) + 3\hat b^2(\hat a + \hat c), \quad\text{and} \\ t \equiv (6T)^{1/r} &= (\hat a + \hat b)^3 + (\hat b + \hat c)^3. \end{aligned}$$

The equations for $e$ and $h$ together can be solved to get

$$\hat x \equiv \hat a + \hat b = \frac{e + \sqrt{2h - e^2}}{2}, \qquad \hat y \equiv \hat b + \hat c = \frac{e - \sqrt{2h - e^2}}{2}, \tag{14}$$

where we have assumed that $\hat x \ge \hat y$, consistent with $a \ge c$. The transformed tripin count matches $t = \hat x^3 + \hat y^3$ and so it is redundant given $e$ and $h$, if we are just using lead terms. We must either count triangles, or use higher order terms.

Equation (14) may fail to have a meaningful solution. At a minimum we require $2h - e^2 \ge 0$ and $e \ge \sqrt{2h - e^2}$. These translate into

$$h \le e^2 \le 2h,$$

which is equivalent to

$$2H \le 4E^2 \le 2^{r+1}H,$$

that in terms of node degrees is

$$\sum_i d_i(d_i - 1) \le \Bigl(\sum_i d_i\Bigr)^2 \le N\sum_i d_i(d_i-1). \tag{15}$$

The left hand inequality in (15) holds for any graph, but the right hand side need not. It holds when the variance of the node degrees is at least as large as their mean. If the variance of the node degrees is smaller than their mean, then equation (14) does not have real valued solutions. The degree distribution of a stochastic Kronecker graph has heavy tails. Therefore in applications where that model is suitable, equation (14) will give a reasonable solution.

When the degrees $d_i$ have a variance larger than their mean, then we can do a univariate grid search for $\hat b$, using equation (14) to get $\hat a = \hat x - \hat b$ and $\hat c = \hat y - \hat b$. The choice of $\hat b$ can then be made as the minimizer of the objective (12).
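The lead-term solution (14) is only a few lines of code. The sketch below (ours; the function name is an assumption) returns $(\hat x, \hat y) = (\hat a + \hat b,\ \hat b + \hat c)$, or `None` when condition (15) fails:

```python
import math

def lead_term_xy(E, H, r):
    """Solve e = x + y, h = x^2 + y^2 for x >= y >= 0, where
    e = (2E)^(1/r) and h = (2H)^(1/r); None if no real solution."""
    e = (2.0 * E) ** (1.0 / r)
    h = (2.0 * H) ** (1.0 / r)
    if not (h <= e * e <= 2.0 * h):   # condition (15) in transformed form
        return None
    s = math.sqrt(2.0 * h - e * e)
    return (e + s) / 2.0, (e - s) / 2.0
```

For example, with $a = 0.9$, $b = 0.5$, $c = 0.2$ and lead-term counts $E = \tfrac12(a+2b+c)^r$, $H = \tfrac12\bigl((a+b)^2+(b+c)^2\bigr)^r$, the solver returns $\hat x = 1.4$ and $\hat y = 0.7$, the true $a + b$ and $b + c$.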

## 5 Examples

In this section, we experiment with different techniques for fitting the parameters of the Kronecker model. These experiments involve 8 real world networks whose statistical properties are listed in the rows of the forthcoming tables labeled “Source.”

The networks ca-GrQc, ca-HepTh, ca-HepPh are co-authorship networks from arXiv. The nodes of the network represent authors, and there is an edge between two nodes when the authors jointly wrote a paper. Likewise, the hollywood-2009 network is a collaboration graph between actors and actresses in IMDB [1, 2]. Nodes are actresses or actors, and edges are collaborations on a movie, as evidenced by jointly appearing on the cast. These networks are naturally undirected and all edges are unweighted.

Both as20000102 and as-Skitter are technological infrastructure networks. Each node represents a router on the internet and edges represent a physical or virtual connection between the routers. Again, these networks are undirected and unweighted.

The wikipedia-20051105 graph is a symmetrized link graph of the articles on Wikipedia generated from a data download on November 5th, 2005. The underlying network is directed, but in these experiments, we have converted it into an undirected network by dropping the direction of the edges.

All of the previously described networks have distinctly skewed degree distributions. That is, there are a few authors, actors, routers, or articles with a large number of links, despite the overall network having a small average degree. The final network we study is usroads, a highway-level network from the National Highway Planning Network (http://www.fhwa.dot.gov/planning/nhpn/), which does not have a highly skewed distribution. We include it as an example of a nearly planar network. It is also naturally undirected.

In two of the experiments, we generate synthetic Kronecker networks. The algorithm to realize these networks is an explicit coin-flipping procedure instead of the more common ball-dropping method. For each cell in the upper triangular portion, we first determine the log of the probability of a non-zero value in that cell, then generate a random coin flip with that probability as heads and record an edge when the coin comes up heads. This procedure is scalable because the full matrix of probabilities is never formed. It is also easily parallelizable. Our implementation uses pthreads to exploit multi-core parallelism. It takes somewhat more work than the ball-dropping procedure, scaling as $O(N^2)$ instead of the number of balls dropped, which is often proportional to the number of edges, that is, a constant number of balls per vertex. Each ball generates about one edge. Coin-flipping preserves the exact Kronecker distribution whereas ball-dropping is an approximation.
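A non-scalable but exact sketch of the coin-flipping idea is below. Unlike the implementation described above, it materializes the full probability matrix via `np.kron`, so it is only suitable for small $r$; the function name is ours:

```python
import numpy as np

def sample_kronecker(theta, r, seed=None):
    """Flip one coin per potential edge of the lower triangle; returns
    the symmetric, loop-free adjacency matrix of G (Section 2)."""
    rng = np.random.default_rng(seed)
    P = np.array([[1.0]])
    for _ in range(r):                 # P = Theta^{[r]}
        P = np.kron(P, np.asarray(theta, dtype=float))
    N = P.shape[0]
    A = np.zeros((N, N), dtype=int)
    for i in range(N):
        for j in range(i):             # cells with i > j, as in Section 2
            if rng.random() < P[i, j]:
                A[i, j] = A[j, i] = 1
    return A
```

With `theta = [[1, 1], [1, 1]]` this produces the complete graph on $2^r$ nodes, matching the special case in Section 3.5.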

The experiments with these networks investigate (i) the difference in results from the various choices of $D$ and $N$ in the objective (12); (ii) the fitted parameters to the 8 real world networks; and (iii) the difference in fitted parameters when only using three of the four graph features.

### 5.1 Objective functions

The first study regards the choice of objective function. Of eight possible combinations of distance and normalization, we considered two to be unreasonable a priori. Here we investigate the other six pairs.

Table 2 shows the different parameters $a$, $b$, and $c$ chosen by each objective function, as well as the expected feature counts for those parameters for three graphs: a single realization of a Kronecker graph with known parameters, the collaboration network ca-GrQc, and the infrastructure network as20000102. The rows labeled “Source” contain the actual feature counts in each network. The optimization algorithm to pick $a$, $b$, and $c$ uses the best objective value from three procedures. First, it tries 50 random starting points for the fmincon function in Matlab R2010b, an active set algorithm. Then, it performs a grid search procedure with 100 equally spaced points in each dimension. Finally, it tries the leading term matching algorithm from Section 4.3, and considers those parameters.

The results in the table show that the choice of objective function does not make a difference when the graph fits the Kronecker model. However, it can make a large difference when the graph does not exactly fit, as in the ca-GrQc and as20000102 networks. Two of the objectives produced distinctly different fits for these two networks, compared to the other objectives. These two fits seem to be primarily matching the number of triangles, almost to the exclusion of the other features. The other odd fit for the ca-GrQc graph comes from a third objective. That fit appears to be matching the tripin count and ignoring other features, something that also seems to be true for the as20000102 graph. Among the remaining fits for ca-GrQc, there is little difference among the fitted parameters and estimated features. The results are a bit different for as20000102. Two of the fits are almost identical and show a good match to the tripin count, but a poor match to the remaining features. Two others are similar, and deciding which is better seems like a matter of preference. These observations held up under further experimentation, which we omit here in the interest of space.

Based on these results, an objective based on relative errors appears to be a robust choice when the model does not fit exactly. Due to the continuity of the $D_{\mathrm{sq}}$ function, the rest of our fits in this manuscript use the $D_{\mathrm{sq}}$ with $N_{F^2}$ variation (13).

### 5.2 Parameters for real-world networks

For the 8 networks previously described, we use the objective function (12) with $D_{\mathrm{sq}}$ and $N_{F^2}$ to fit the parameters $a$, $b$, and $c$. The results, along with the expected feature counts for the fitted parameters, are presented in Table 3. We show the minimizer for the three different strategies to optimize the objective described in the previous section: a direct minimization procedure, the grid search procedure, and the leading term matching approach (Section 4.3). For each approach, the table also shows the time required for that algorithm and the value of the objective function at the minimizer.

Leskovec et al. provide fitted parameters from their KronFit algorithm for the networks ca-GrQc, ca-HepTh, ca-HepPh, and as20000102. We include them in Table 3 for comparison. In all cases but one, the expected feature count using KronFit is farther from the observed feature count than the expectation under our moment based fits. Sometimes it is much farther. There was one exception: for the graph as20000102, KronFit gave a better estimate of the number of edges than our moment method gave.

KronFit typically underestimates the feature counts. The effect is severe for triangles. Kronecker random graphs commonly have many fewer triangles than the real world graphs to which they are fit. Our moment based estimators find parameters leading to many more triangles than the KronFit parameters do.

In fairness, we point out that our method is designed to match expected to observed feature counts, while KronFit fits by maximum likelihood. Therefore the evaluation criterion is closer to the fitting criterion for us. But maximum likelihood ordinarily beats or matches the method of moments in large samples from parametric models; its mismatched criterion is more than compensated for by superior statistical efficiency. The explanation here may involve maximum likelihood being less robust to lack of fit of the Kronecker model, or it may be that KronFit is not finding the MLE.

The results in Table 3 show small differences in the fits between the direct and grid algorithms, although the direct algorithm is much faster. The leading term matching algorithm, when it succeeds, generates similar Kronecker parameters, although with a distinctly worse objective value. The results from the KronFit algorithm differ and likely match the graph in another aspect.

Lead term matching is tens of times faster than direct search and faster still compared to grid search. But even the grid search takes under a minute in our examples, so the speed savings from the lead term approach is of little benefit here. For the large graphs, the time to compute the network features dominates the time to fit the parameters, showing that this approach scales to large networks.

Overall, the results indicate that the Kronecker models tend not to be a good fit to the data. The model appears to have a considerable difference in at least one of the graph features. Usually it is the number of triangles, which differs by up to two orders of magnitude for many of the collaboration networks.