# Bayesian inference on random simple graphs with power law degree distributions

We present a model for random simple graphs with a degree distribution that obeys a power law (i.e., is heavy-tailed). To attain this behavior, the edge probabilities in the graph are constructed from Bertoin-Fujita-Roynette-Yor (BFRY) random variables, which have been recently utilized in Bayesian statistics for the construction of power law models in several applications. Our construction readily extends to capture the structure of latent factors, similarly to stochastic blockmodels, while maintaining its power law degree distribution. The BFRY random variables are well approximated by gamma random variables in a variational Bayesian inference routine, which we apply to several network datasets for which power law degree distributions are a natural assumption. By learning the parameters of the BFRY distribution via probabilistic inference, we are able to automatically select the appropriate power law behavior from the data. In order to further scale our inference procedure, we adopt stochastic gradient ascent routines where the gradients are computed on minibatches (i.e., subsets) of the edges in the graph.

## Authors

• 14 publications
• 6 publications
• 84 publications
• 2 publications
• 16 publications
• ### A Bayesian model for sparse graphs with flexible degree distribution and overlapping community structure

We consider a non-projective class of inhomogeneous random graph models ...
10/03/2018 ∙ by Juho Lee, et al. ∙ 0

• ### Sampling and Inference for Beta Neutral-to-the-Left Models of Sparse Networks

Empirical evidence suggests that heavy-tailed degree distributions occur...
07/09/2018 ∙ by Benjamin Bloem-Reddy, et al. ∙ 4

• ### Adapting Stochastic Block Models to Power-Law Degree Distributions

Stochastic block models (SBMs) have been playing an important role in mo...
04/05/2019 ∙ by Maoying Qiao, et al. ∙ 8

• ### Linear Work Generation of R-MAT Graphs

R-MAT is a simple, widely used recursive model for generating `complex n...
05/09/2019 ∙ by Lorenz Hübschle-Schneider, et al. ∙ 0

• ### The Quotient of Normal Random Variables And Application to Asset Price Fat Tails

The quotient of random variables with normal distributions is examined a...
02/13/2018 ∙ by Carey Caginalp, et al. ∙ 0

• ### Sparse Allreduce: Efficient Scalable Communication for Power-Law Data

Many large datasets exhibit power-law statistics: The web graph, social ...
12/11/2013 ∙ by Huasha Zhao, et al. ∙ 0

• ### Beyond the Chinese Restaurant and Pitman-Yor processes: Statistical Models with Double Power-law Behavior

Bayesian nonparametric approaches, in particular the Pitman-Yor process ...
02/13/2019 ∙ by Fadhel Ayed, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In statistical applications, random graphs serve as Bayesian models for network data, that is, data consisting of objects and the observed linkages between them. Here we will focus on models for random simple graphs (that is, graphs with edges that take binary values), which are appropriate for applications where we observe either the presence or absence of links between objects in the network. For example, in social networks, nodes may represent individuals and a link (i.e., a nonzero value of an edge) could represent friendship. In a protein-protein interaction network, nodes may represent proteins and links could represent an observed physical or chemical interaction between proteins. Many domains involving network data (including social and protein-protein interaction networks) have been shown to exhibit power law, i.e., heavy-tailed, degree distributions (Barabási & Albert, 1999). Models for random graphs with power law degree distributions, also called scale-free random graphs, have therefore become one of the most actively studied areas of graph theory and network science (Bollobás et al., 2001; Albert & Barabási, 2002; Dorogovtsev & Mendes, 2002). In this paper we present a model for simple, scale-free random graphs, which we apply as a probabilistic model for several network datasets.

The model we present in this paper is a special case of the generalized random graph defined by Britton et al. (2006), and studied further by van der Hofstad (2016, Ch. 6), which outlines a framework for defining scale-free random graphs, but does not provide practical constructions, much less algorithms for performing statistical inference on the model components given data. Here we provide one such practical construction, along with a variational inference routine (Jordan et al., 1999) for efficient posterior inference. What’s more, our construction readily generalizes to include the structure of latent factors/clusters, as captured by the popular stochastic blockmodels (Nowicki & Snijders, 2001; Airoldi et al., 2009), while maintaining power law behavior in the graph.

Applying Bayesian inference algorithms on network datasets is a challenge because likelihood computations, in general, scale with the number of edges in the graph, which is in a network with nodes. To help overcome these difficulties, we follow Hoffman et al. (2013) and develop a stochastic variational inference algorithm in which we approximate many likelihood computations on only subsets of the data, called minibatches. In the case of a network dataset, the minibatches are comprised of subsets of edges in the graph.

We apply this inference procedure to several network datasets that are commonly observed to possess power law structure. Our experiments show that accurately capturing this power law structure improves performance on tasks predicting missing edges in the networks.

## 2 Bayesian models for simple graphs

We represent a simple graph with nodes by an adjacency matrix , where if there is a link between nodes and and otherwise. Here we will only consider undirected graphs, in which case represents a symmetric matrix. Furthermore, we do not allow self links, so the diagonal entries in are meaningless. Most probabilistic models for simple graphs take the entries in to be conditionally independent Bernoulli random variables; in particular, for every , let be the (random) probability of a link between nodes and , and let . For every simple graph , we may then write the likelihood for the parameters given as

 P(X=x∣p)=∏i

where in our case it should be clear that the product is only over such that and . Random simple graphs date back to the Erdös–Rényi model, which may be reviewed, along with the more general theory of random graphs, in the text by Bollobás (1998). A random graph is called scale-free when the fraction of nodes in the network having connections to other nodes behaves like for large values of and some exponent . More precisely, let denote the (random) degree of node , for every . Then is (asymptotically) scale-free when, for every node ,

 P{Dn,i=k}∼ck−τ, as n→∞,

for some constant , a power law exponent , and sufficiently large. Here the notation denotes that the ratio in the specified limit.

In order to model scale-free random graphs, Britton et al. (2006) suggested reparameterizing the model in Section 2

by a sequence of odds ratios

, for every , which factorize as , for some . The node-specific factors are then modeled as for some sequence of nonnegative random variables and where . In a series of results, (Britton et al., 2006, Thms. 3.1 & 3.2) and (van der Hofstad, 2016, Cor. 6.11 & Thm. 6.13) assert conditions on the random variables so that the limiting distribution of the degrees

is a mixed Poisson distribution. We will further detail these previous results in

Section 4. The distribution of is interpreted here as a prior distribution for the degree of node , and if its distribution has heavy tails, then so will the distribution of . Conversely, if the distribution of does not have heavy tails, then neither will the distribution of the degrees . We explore this alternative in Section 7.

Previous authors did not suggest any particular choices for the distribution of , and so we elect to model them with BFRY random variables (Bertoin et al., 2006; Devroye & James, 2014), which have a heavy-tailed distribution and have recently played a role in the construction of several power law models in Bayesian statistics. Other heavy tailed distributions, such as those exhibited by log normal random variables, may also be used to model the , and these options may be explored. One benefit of the BFRY distribution is that the thickness of its tails, and thus the power law behavior of the resulting graph, may be straightforwardly controlled by the discount parameter .

## 3 A generalized random graph

Consider the model from the previous section, parameterized by the odds ratios . Define

 G(r)\vcentcolon=∏i

and note that the conditional likelihood in Section 2 may be rewritten in terms of the degrees as

The random simple graph is called a generalized random graph, and we will henceforth write .

Let , which we call the discount parameter, and let be a sequence of positive values satisfying

 limn→∞Cn=∞andlimn→∞Cαn/n=0.

Let the weights be i.i.d. with density

 fn(w)∝w−α−1(1−e−w)1{0≤w≤Cn}.

(These are truncated BFRY random variables and will be discussed, along with a method for simulation, in Section 3.1.) Then the corresponding generalized random graph has an (asymptotic) power law degree distribution with power law exponent . We summarize this construction in the following theorem:

###### Theorem 3.1.

For every , let be i.i.d. with density and let be the degrees of the generalized random graph , where is the sequence of odds ratios

 ri,j=WiWj/L,i

and . Then the following hold:

1. [nolistsep]

2. For , , for every node and for some constant , as .

3. For any , the collection are asymptotically independent, as .

This construction is closely related to the model described by van der Hofstad (2016, Thm. 6.13), and the proof of creftype 3.1, which is provided in the supplementary material, follows analogously to the results by Britton et al. (2006, Thms. 3.1 & 3.2). Note that the power law exponent of the graph (as described by Section 2) is determined by the parameter , and takes values in . While power law exponents in has often been suggested in the past, it has more recently been shown that exponents within the range of our model is more appropriate in many domains (van der Hofstad, 2016, Ch. 1); (Crane & Dempsey, 2015).

### 3.1 Truncated BFRY random variables

A random variable with density function given by Section 3 is a ratio of gamma and beta random variables, upper truncated at . In particular let

 g ∼ gamma(1−α,1)andb ∼ beta(α,1),

be independent, then the ratio has density on (by construction), which is known as the Bertoin-Fujita-Roynette-Yor (BFRY) distribution (Bertoin et al., 2006; Devroye & James, 2014)

and has been used in the construction of power law models in some recent applications in machine learning

(James et al., 2015; Lee et al., 2016). The random variable is then obtained by upper truncating the random variable at . By our requirements on the sequence (c.f. Section 3), the density function of approaches the density function of the BFRY random variable as , that is,

 limn→∞fn(w)=αΓ(1−α)w−α−1(1−e−w),

which is heavy-tailed with infinite moments. It is straightforward to simulate these truncated BFRY random variables by repeatedly simulating

and as in Section 3.1, accepting as a sample when .

The truncation of at produces a random variable with finite mean (for ), which is essential when constructing the generalized random graph and motivates the construction by van der Hofstad (2016, Thm. 6.13) alluded to earlier; see Section 4. For simplicity, one could take , but the flexibility to set this parameter allows us to control other properties of the model. For example, in the next section we show how to vary this truncation level to control the sparsity of the graph.

### 3.2 Controlling power law and sparsity in the graph

The discount parameter controls the power law behavior of the graph, where decreasing results in heavier tails in the degree distribution of the nodes in the graph. We can visualize this behavior by simulating graphs at different values of . In Fig. 1, we set and show the number of nodes of varying degrees in two simulated graphs, one with and one with .

The degree distribution of the nodes in a graph of course affects the sparsity of the graph; to characterize this relationship, we can upper bound the expected number of links in the graph as follows:

###### Theorem 3.2.

Let be the number of positive edges in the graph. Then

The derivation of this result is provided in the supplementary material. While varying can thus control the sparsity of the graph in addition to the power law behavior, we often want to decouple these behaviors, in which case we could parameterize the truncation level as , for some sparsity parameter . Note the restriction must be enforced in order to ensure that the conditions in Section 3 are satisfied. In this case, the bound in creftype 3.2 becomes . The interpretation here is that increasing the upper bound increases the likelihood that any particular node will link to others, but does not affect the (asymptotic) power law characterized by creftype 3.1. In Fig. 2, we display the average number of positive edges in graphs that were simulated with fixed and varying values of the sparsity parameter . We note that in simulations, we encountered numerical issues in regimes.

## 4 Related work

Referring to the construction for generalized random graphs in Section 2, Britton et al. (2006, Thm. 3.1) shows that when the weights have finite first and second moments, then the limiting distribution of the degree is a mixed Poisson distribution. Most such distributions are light-tailed, however, in which case the degrees will not exhibit power law behavior. Britton et al. (2006, Thm. 3.2) therefore provides an alternative construction in which may have infinite moments (so that it may exhibit a heavy tail), which results in a graph with a power law exponent of . Finally, van der Hofstad (2016, Thm. 6.13) suggests yet another construction where the are upper truncated to be of order , where is the number of nodes in the graph. The resulting random variables therefore have finite moments, yet exhibit a heavy tail, and the resulting random graph has a heavy tailed degree distribution with an arbitrary power law exponent. None of these results suggest a particular choice for the distribution of , however, and so we have elected to use BFRY random variables (which are heavy tailed) that are upper truncated (so that they have finite moments). We note that the requirements on our truncation level (c.f. Section 3) is less strict than the criterion of the van der Hofstad (2016, Thm. 6.13) construction.

The reader may consult the surveys by Bollobás & Riordan (2003); Albert & Barabási (2002); Dorogovtsev & Mendes (2002) for a background on scale-free random graphs, which is too large to review here. While these models are numerous, the following recent pieces of work in the Bayesian statistics and machine learning communities may be of interest to the reader: Caron & Fox (2014); Veitch & Roy (2015); Crane & Dempsey (2016); Cai & Broderick (2015). This collection of work discusses power law degree distributions, albeit in some cases in multi-graphs (i.e., graphs with nonnegative integer-valued edges) and in some cases the power law behavior is not characterized, only numerically observed in simulations. Many of these models can be seen to invoke their power law properties from the Pitman–Yor process (Pitman & Yor, 1997) (or related stochastic processes), where the extent of this behavior is controlled by the discount parameter of the Pitman–Yor model, which, like the BFRY distribution, is related to a stable subordinator of index .

## 5 Incorporating latent factors

Latent factor models for relational data assume that a set of latent clusters underlie the network. For example, in a social network, the latent factors could be the unobserved hobbies or interests of individuals, which determine the observed friendships in the network. Bayesian models for latent factors in relational data are widespread, with some of the most popular based on stochastic blockmodels

, where models for unsupervised learning, or clustering, are used to infer the latent factors

(Nowicki & Snijders, 2001; Kemp et al., 2006; Airoldi et al., 2009; Miller et al., 2009). In this section, we present extensions of the generalized random graph that incorporate latent factors by scaling the odds ratios, while maintaining their power law degree distribution.

We will first provide a general result showing how to incorporate random scaling variables into the model, followed by specific examples that model these scaling variables with latent clusters. Let the odds ratios in the generalized random graph be given by for some . Note that as and as , and so the edge-specific weight simply scales the link probability. The random graph then has the likelihood

 P(X=x∣r)=G(r)−1∏i

where the normalization term in Section 3 is now

where the final equality follows simply because . So constructed, the odds ratios will influence the link probabilities in the generalized random graph, but will not affect the power law behavior of the degree distributions (under some assumptions on the random variables ). We summarize this construction in the following theorem, the proof for which is provided in the supplementary material:

###### Theorem 5.1.

Let be i.i.d. random variables with density function (in Section 3). Let be a collection of uniformly bounded random variables, where, for every , the collection is exchangeable. Let be the degrees of the random graph , where is the sequence of odds ratios

 ri,j=Ai,jWiWj/L,i

and where . Then the degrees satisfy statements (1) and (2) in creftype 3.1.

For example, we may construct stochastic blockmodels, such as those introduced by Nowicki & Snijders (2001), as follows: For every , let be a random variable taking values in , indicating which one (and only one) of different factors to associate with node . We want the latent cluster assignments for two nodes and to influence their link probability, which we could capture with a set of parameters , for . Then the parameter could represent, or influence, the probability of a link between nodes and . Taking a Bayesian approach, the indicator variables may be modeled with a Dirichlet-categorical conjugate distribution and their values may be inferred via probabilistic inference. An example of such a model could be summarized as follows: Let

 Zi ∼ categorical(π), i≤n, (1) π ∼ Dirichlet(c/K), where c>0, (2) θℓ,k ∼ gamma(aθ,bθ), ℓ,k≤K, (3) Ai,j =θZi,Zj, i

and construct the random graph as in creftype 5.1. Kemp et al. (2006) developed a nonparametric extension of a similar model that in a sense takes the limit , allowing an appropriate number of clusters to be automatically inferred from the data. In this case, the marginal law of the indicator variables is given by a Chinese restaurant process (with concentration parameter ).

Several generalizations of the stochastic blockmodel allow the clusters underlying the network to overlap, leading to mixed membership stochastic blockmodels (Airoldi et al., 2009) or the related latent feature relational models (Miller et al., 2009). To capture this structure, we may generalize the indicators to now represent a binary

-vector with entry

indicating node is associated with cluster , now called a feature, and otherwise. One example of such a model could be summarized as follows:

 Zi,k ∼ Bernoulli(pk), i≤n,k≤K, (5) pk ∼ beta(c,cγ/K), k≤K,and c,γ>0, (6) θℓ,k ∼ gamma(aθ,bθ), ℓ,k=1,2,…, (7) Ai,j =∑k,ℓθk,ℓZi,kZj,ℓ, i

and construct the random graph as in creftype 5.1. Miller et al. (2009) derived a nonparametric extension of this model that in a sense takes the limit , in which case the marginal law of the vectors is that of an Indian buffet process (with mass parameter and concentration parameter ) (Ghahramani et al., 2007).

## 6 Variational inference

We derive a variational Bayesian inference algorithm (Jordan et al., 1999) that approximates the (optimal state of the) posterior distribution of the model components, given a network dataset. We approximate the required gradients in this procedure with stochastic gradient ascent (Bottou, 2010; Hoffman et al., 2013), computed on minibatches (i.e., subsets) of edges in the graph.

### 6.1 The variational lower bound

In variational inference, we approximate the posterior distribution on the latent variables with a variational distribution , the parameters of which are fit to maximize the following lower bound on the marginal likelihood

where is the likelihood function computed as in Section 3, and is the prior on represented by the density function in Section 3. The (non random) discount parameter is inferred by corresponding gradient ascent updates maximizing the likelihood of the model, which is described in Section 6.4.

We specify a mean field variational distribution . We considered several approximations for the marginals

including truncated BFRY and truncated gamma distributions, however, in our experiments we found that the following

rectified gamma distribution performed well:

independently for every , where and denote the shape and rate parameters of the gamma distribution, respectively, and the notation emphasizes that this formula holds under the variational distribution .

We maximize the lower bound on the right hand side of Section 6.1 by stochastic gradient ascent, where on the -th step of the algorithm, we make the following updates to the parameters in parallel

 θ(t+1)i←θ(t)i+ρt∇θiEq(W;θ(t))[L(X,W;θ(t))],

for and some sequence of positive numbers satisfying the Robbins–Monro criterion (Robbins & Monro, 1951) and and where

where denotes the observed edges (both links and non-links) in the dataset. We cannot evaluate the expectation (with respect to the rectified gamma distributions ) analytically, and so we elect to use a particular Monte Carlo approximation of this gradient detailed by Knowles (2015), which was developed for gamma variational distributions and easily applies to the rectified gamma case.

Briefly, for every , create the collection of Monte Carlo samples from the variational distribution as follows: Independently for , let , and set , where and

is the inverse of the cumulative distribution function for a gamma random variable. For convenience, we recall that

 Fa,b(x)=∫x0baΓ(a)ta−1e−btdt.

For every , the gradient with respect to the parameters is then approximated by

 ∇θkEq(W;θ)[L(X,W;θ)]≈1S∑s∇WkL(X,W(s);θ)∇θkψ(z(s)k;θk),

where

. This estimator is unbiased and has low enough variance that often a single sample suffices for the approximation

(Salimans & Knowles, 2013; Kingma & Welling, 2014). The gradient of is nonzero only when , in which case we may immediately obtain the partial derivative with respect to the rate parameter; in particular, we have

The partial derivative with respect to the shape parameter does not have a closed form solution and must be approximated. Different approximation routines are suggested by Knowles (2015) for different regimes of the shape parameter , and we found these approximations to be accurate and efficient in our experiments.

### 6.3 Minibatches of edges in the graph

Computing the required gradients in Section 6.2 may be done in parallel, and this computation, whether performed analytically or with automatic differentiation methods, scales with the number of edges in the graph. This can be prohibitive for many network datasets, and we therefore introduce a further approximation where this gradient is evaluated on subsets (a.k.a. minibatches) of the dataset, a technique from stochastic gradient ascent (Bottou, 2010) adopted in the context of variational Bayesian inference by Hoffman et al. (2013). In the case of a network dataset, we may select minibatches that are subsets of the observed edges in the graph. In particular, write the gradient of Section 6.2 with respect to the variable (which is required by Section 6.2) as

 ∇WkL(W(s);θ)=∑(i,j)∈Eg(i,j)(X,W(s);k),

where

is the gradient that ignores all but one edge of the graph. We may therefore compute the unbiased estimate of this gradient

on a minibatch of the observed edges.

### 6.4 Inference on the parameters α and β

Without good prior knowledge of how to set the discount parameter and the sparsity parameter controlling the power law and sparsity behaviors of the graph, respectively, we infer their values from the data. First consider the discount parameter, which we infer with gradient ascent. After every update to the latent variables , we fix them to their mean under the distribution , i.e., where , and take a step in the direction of the gradient

which is straightforward to derive from the density function in Section 3, and where the normalization term

 Zα,β\vcentcolon=∫Cn0w−α−1(1−e−w)dw

is a function of and , if we let as suggested in Section 3.2. We do not have a closed form solution for this term when , and, unfortunately, inference on model parameters where the likelihood is difficult to evaluate is a challenging problem; for example, see the approaches taken by Murray et al. (2006) on such problems, which those authors call doubly intractable distributions. Accurate inference for is important in our model, because it controls the power law behavior of the graph. In our experiments, we approximate the gradient in Section 6.4 for (fixed ) by approximating (via Section 6.4) and with line integrals. In the Section 7, we demonstrate that this approximation works well in various regimes of , with slight overestimation for moderate values.

Similar approaches to infer may be derived with finite difference approximations; we did not find these approaches successful in our experiments, however, and so we instead select by cross validation.

## 7 Experiments

We first demonstrate how the inference procedure in Section 6.4 can correctly differentiate between various regimes of . We ran an experiment where for each value , we simulated 10 datasets from the model with nodes, while fixing . For each simulated dataset, we ran an instance of the inference routine with randomly initialized. In Fig. 3, we show the trace plots of alpha during each instance of the inference routine. For comparison, the true values of are also shown as horizontal dashed lines. We can see that the inference routine can correctly distinguish between these different regimes of , with slight overestimation in the moderate regime. Interestingly, despite random initializations of , the algorithm always immediately inflates to around 0.9, and then slowly decreases this value during inference, regardless of what value of generated the data.

We next demonstrate that accurately capturing power law structures in datasets will improve predictive performance. While fixing , we simulate three network datasets with 5,000 nodes from our model with discount parameters and , respectively, which therefore exhibit increasingly lighter-tailed degree distributions. The generated graphs have 117,300, 32,925, and 9,460 links, respectively. To establish a baseline model that does not exhibit power law degree distributions but is otherwise comparable to our model, we implement the generalized random graph where the node-specific weights are constructed from the gamma random variables , for some positive parameter , i.i.d. for every node . Note that the parameter controls the sparsity of the generated graph; larger values of imply denser graphs. It follows analogously to creftype 5.1 that

 P{Dn,i=k}∼kθ−12k+θ,

for , as . This model therefore does not exhibit power law behavior, as desired. We refer to this model as “Gamma” and the power law graph model as “BFRY”.

We ran an experiment holding out 20% of the edges in the simulated graphs as test sets, training the two models on the remaining 80% of the edges. We used a mini-batch size of 5,000 edges (note that the training dataset corresponds to almost 10 million observed edges). We ran each inference procedure for 20,000 steps of stochastic gradient ascent updates, using Adam (Kingma & Ba, 2015) to adjust the learning rates at each step. We repeated each experiment 5 times, each time holding out a different test set and using a different random initialization. Again, for this experiment we fixed . In Table 1 we report a mean log-likelihood metric for the test datasets, where the metric for each run is obtained by averaging the test log-likelihoods across the states for the last 4,000 steps of the inference procedure; the displayed intervals are at standard deviation about the metric, from across the 5 repeats. We also report a max log-likelihood metric, which simply records the maximum test log-likelihood across the last 4,000 steps of the inference procedure, instead of the average. The best performing method is highlighted in bold (which in each case was the BFRY model).

In each case, we see that the BFRY model achieves higher test log-likelihood metrics than the Gamma model, as expected, implying that accurately capturing a power law degree distribution improves predictive performance (when power law behavior is truly present in the network). In Table 3, we report the inferred values for , which were reasonably accurate, though we see slight overestimation for some regimes, as seen in the demonstration earlier. For the baseline Gamma model, we optimized the hyperparameter using gradient ascent maximizing the evidence lower bound of the model (c.f. Section 6.1), and the inferred values are also reported in Table 3.

Next, we ran similar experiments on the following network datasets, each of which are expected to exhibit power law degree distributions:

• [nolistsep]

• ‘USTop500Airports’: 500 nodes, 2,980 links

• ‘openflights’: 7,976 nodes, 15,243 links

• ‘polblogs’: 1,490 nodes, 9,517 links

Where appropriate, we saved only the upper triangular parts of the adjacency matrices. The ‘USTop500Airports’ dataset contains the (undirected, unweighted) flight connections between the 500 busiest US airports. The similar, though much larger, ‘openflights’ dataset contains the flight connections between non-US airports. Scale-free networks have been proposed for such traffic networks, detailed for these datasets by Colizza et al. (2007). The ‘polblogs’ dataset contains the links between political blogs (judged by hyperlinks between the front webpages of the blogs) in the period leading up to the 2004 US presidential election, which is observed to exhibit power law degree distributions by Adamic & Glance (2005). The ‘Facebook107’ dataset contains “friendships” between users of a Facebook app, collected by Leskovec & McAuley (2012); social networks are widely studied for their power law degree distributions.

For both the Gamma and BFRY models, we ran our variational inference procedure for 20,000 steps on each dataset. As before, we repeated the experiment 5 times for each network, each time holding out a different 20% of the edges in the network as a testing set. We selected the value of from among the grid

with 5-fold cross validation on the training set. We set the minibatch size to be equal to the number of nodes in the graph; for example, we used minibatches of 1,490 edges for the polblog dataset. The evaluation metrics on the test datasets are summarized in

Table 2, and the inferred hyperparameter values are reported in Table 3. We see that the BFRY model once again outperforms the Gamma baseline model, according to the test log-likelihood metrics.

Probabilistic inference on by the BFRY model provides some of the most interesting analyses here. With (underflowing our machine’s precision), the Facebook107 social network has the degree distribution with the heaviest tails, followed by the USTop500Airports traffic network with , the polblog citation network with , and the openflights network has the lightest tailed degree distribution with .

## 8 Future work

Future work could focus on implementing the latent factor modeling generalizations presented in Section 5, which are natural assumptions in many domains where networks are expected to exhibit power law degree distributions. Alternative approaches to inference on the sparsity parameter should also be explored, since controlling the sparsity in the graph was important for good predictive performance.

## Acknowledgements

The authors thank Remco van der Hofstad for helpful advice and anonymous reviewers for helpful feedback. J. Lee and S. Choi were partly supported by an Institute for Information & Communications Technology Promotion (IITP) grant, funded by the Korean government (MSIP) (No.2014-0-00147, Basic Software Research in Human-level Lifelong Machine Learning (Machine Learning Center)) and Naver, Inc. C. Heaukulani undertook this work in part while a visiting researcher at the Hong Kong University of Science and Technology, who along with L. F. James was funded by grant rgc-hkust 601712 of the Hong Kong Special Administrative Region.

## References

• Adamic & Glance (2005) Adamic, L. A. and Glance, N. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery, pp. 36–43, 2005.
• Airoldi et al. (2009) Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. Mixed membership stochastic blockmodels. In Advances in Neural Information Processing Systems, 2009.
• Albert & Barabási (2002) Albert, R. and Barabási, A-L. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.
• Barabási & Albert (1999) Barabási, A. and Albert, R. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
• Bertoin et al. (2006) Bertoin, J., Fujita, T., Roynette, B., and Yor, M. On a particular class of self-decomposable random variables: the durations of bessel excursions straddling independent exponential times. Probability and Mathematical Statistics, 26:315–366, 2006.
• Bollobás (1998) Bollobás, B. Random graphs. Springer, 1998.
• Bollobás & Riordan (2003) Bollobás, B. and Riordan, O. M. Mathematical results on scale-free random graphs. Handbook of graphs and networks: from the genome to the internet, pp. 1–34, 2003.
• Bollobás et al. (2001) Bollobás, B., Riordan, O., Spencer, J., and Tusnády, G. The degree sequence of a scale-free random graph process. Random Structures & Algorithms, 18(3):279–290, 2001.
• Bottou (2010) Bottou, L.

Large-scale machine learning with stochastic gradient descent.

In COMPSTAT, 2010.
• Britton et al. (2006) Britton, T., Deijfen, M., and Martin-Löf, A. Generating simple random graphs with prescribed degree distribution. Journal of Statistical Physics, 124(6):1377–1397, 2006.
• Cai & Broderick (2015) Cai, D. and Broderick, T. Completely random measures for modeling power laws in sparse graphs. In NIPS 2015 Workshop on Networks in the Social and Information Sciences, 2015.
• Caron & Fox (2014) Caron, F. and Fox, E. B. Sparse graphs using exchangeable random measures. arXiv preprint arXiv:1401.1137, 2014.
• Colizza et al. (2007) Colizza, V., Pastor-Satorras, R., and Vespignani, A. Reaction–diffusion processes and metapopulation models in heterogeneous networks. Nature Physics, 3(4):276–282, 2007.
• Crane & Dempsey (2015) Crane, H. and Dempsey, W. Atypical scaling behavior persists in real world interaction networks. arXiv preprint arXiv:1509.08184, 2015.
• Crane & Dempsey (2016) Crane, H. and Dempsey, W. Edge exchangeable models for network data. arXiv preprint arXiv:1603.04571, 2016.
• Devroye & James (2014) Devroye, L. and James, L. F. On simulation and properties of the stable law. Statistical methods & applications, 23(3):307–343, 2014.
• Dorogovtsev & Mendes (2002) Dorogovtsev, S. N. and Mendes, J. F. F. Evolution of networks. Advances in physics, 51(4):1079–1187, 2002.
• Ghahramani et al. (2007) Ghahramani, Z., Griffiths, T. L., and Sollich, P. Bayesian nonparametric latent feature models. Bayesian Statistics, 8:201–226, 2007. See also the discussion and rejoinder.
• Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. W. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
• James et al. (2015) James, L. F., Orbanz, P., and Teh, Y. W. Scaled subordinators and generalizations of the Indian buffet process. arXiv preprint arXiv:1510.07309, 2015.
• Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
• Kemp et al. (2006) Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., and Ueda, N. Learning systems of concepts with an infinite relational model. In AAAI, 2006.
• Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: a method for stochastic optimization. In ICLR, 2015.
• Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In ICLR, 2014.
• Knowles (2015) Knowles, D. A. Stochastic gradient variational Bayes for gamma approximating distributions. arXiv preprint arXiv:1509.01631, 2015.
• Lee et al. (2016) Lee, J., James, L. F., and Choi, S. Finite-dimensional BFRY priors and variational Bayesian inference for power law models. In Advances In Neural Information Processing Systems, pp. 3162–3170, 2016.
• Leskovec & McAuley (2012) Leskovec, J. and McAuley, J. J. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems 25, 2012.
• Miller et al. (2009) Miller, K., Jordan, M. I., and Griffiths, T. L. Nonparametric latent feature models for link prediction. In Advances in neural information processing systems, 2009.
• Murray et al. (2006) Murray, I., Ghahramani, Z., and MacKay, D. J. C. Mcmc for doubly-intractable distributions. In UAI, 2006.
• Nowicki & Snijders (2001) Nowicki, K. and Snijders, T. A. B. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.
• Pitman & Yor (1997) Pitman, J. and Yor, M. The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator. The Annals of Probability, pp. 855–900, 1997.
• Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
• Salimans & Knowles (2013) Salimans, T. and Knowles, D. A.

Fixed-form variational posterior approximation through stochastic linear regression.

Bayesian Analysis, 8(4):837–882, 2013.
• van der Hofstad (2016) van der Hofstad, R. Random graphs and complex networks: Volume 1. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2016.
• Veitch & Roy (2015) Veitch, V. and Roy, D. M. The class of random graphs arising from exchangeable random measures. arXiv preprint arXiv:1512.03099, 2015.

## 9 Proofs

We prove creftype 3.1 and creftype 5.1 in the paper. First consider the following redefinition of our model with slightly different notation; let be a random variable constrained on , with density

 fn(dw)=1Znw−α−1(1−e−w)1{0

where is a sequence of positive numbers satisfying

 limn→∞Cn=∞,limn→∞Cαn/n=0.

Note that as , and so the sequence of densities converges pointwise to the density of the BFRY distribution

 f(w)=αΓ(1−α)w−α−1(1−e−w)1{w>0},

and converges in distribution to a BFRY random variable. Let be i.i.d. copies of . A random simple graph is then defined to be a collection of Bernoulli random variables as follows:

 P{Xij=1∣ri,j}=ri,j1+ri,j,ri,j=UiUj,Ui=Wn,i√Ln,

where . We will write , where .

We begin with a sequence of Lemmas. Define a sequence of random variables , for every , by

 Vs,n\vcentcolon=WnCs−αn.

Let be i.i.d. copies of , and denote the empirical mean of these copies by

 ¯Vs,n\vcentcolon=1nn∑i=1Vs,n,i.

The expectation of is finite for all , and is computed as

where is the lower incomplete gamma function.

Let denote convergence in probability. The following lemma is a standard mean convergence result:

, as .

###### Proof.

For all , by Chebyshev’s inequality and the condition in Section 9,

as , as desired. ∎

The following lemma will be used to study various higher order moments in later results:

###### Lemma 9.2.

For ,

 Ms,n\vcentcolon=∑ni=1Wsn,i(∑ni=1Wn,i)sP→0,as n→∞.
###### Proof.

We have

 Ms,n=nCs−αn¯Vs,nnsCs−sαn¯Vs1,n=(Cαnn)s−1¯Vs,n¯Vs1,n.

As , the first factor on the right hand side clearly converges to zero (c.f. Section 9), and, by creftype 9.1, the second term converges to a constant in probability. ∎

Recall that is the degree of the -th node in the graph , given by Section 9. The following result will show up in later calculations involving the probability generating function (PGF) of the degree random variables :

###### Lemma 9.3.

For every collection with , for ,

 E[n∏i=1tDn,ii∣Wn,1=w1,…,Wn,n=wn]=∏i

for positive .

###### Proof.

The proof is given by Britton et al. (2006). ∎

The following result studies a representation of the PGF of the degree random variables and their higher order moments:

###### Lemma 9.4.

Fix a node . Define

 Fn,k(t;wk)\vcentcolon=∏i≠kLn,−k+wk+twkWn,iLn,−k+wk+wkWn,i,for |t|≤1, and wk>0,

where . Note that the -th derivative exists for all . For all , the following hold:

1. [nolistsep]

2. is uniformly bounded, for all ;

3. , as .

###### Proof.

In the case , is trivially bounded by 1 since . By the Taylor series expansion , we have

By creftype 9.1,

 Ln,−kLn,−k+wk=¯V1,n,−k¯V1,n,−k+wk/(n−1)/C1−αnP→1,

where is the empirical mean in Section 9 excluding the element . Furthermore, by creftype 9.2,

where is computed without . Combining, we have

 Fn,k(t;wk)P→exp{(t−1)wk}.

Before proceeding for , we define

 Qr,n,k(t;wk)\vcentcolon=∑i≠kWrn,i(Ln,−k+wk+twkWn,i)r,

for all . One can easily see that for all . For , we have

 ∑i≠kWn,iLn,−k+wk+twkCn≤Q1,n,k(t;wk)≤1,

and

Hence, by the squeeze theorem, . For , we have

 0≤Qr,n,k(t;wk)≤Mr,n,−kP→0,

by creftype 9.2. Hence, we have for .

Now we show that

 F(s)n,k(t;wk)=wkF(s−1)n,k(t;wk)Q1,n,k(t;wk)+s∑r=2as,rF(s−r)n,k(t;wk)Qr,n,k(t;wk),

for some constants for all and . We proceed by the mathematical induction. For ,

Now by the inductive hypothesis,

where

 as+1,2=as,2−w2k,as+1,r=as,r−as,r−1(r−1)wk for r≥2,

so the inductive argument holds.

Having (9), by mathematical induction, we can easily show that is uniformly bounded for all . Moreover,

 F(1)n,k(t;wk)=wkFn,k(t;wk)Q1,n,k(t;wk)P→wkexp{(t−1)wk},

by (9) and (9). Combining this with (9), by mathematical induction, we can show that for all ,

 F(s)n,k(t;wk)P→wskexp{(t−1)wk}.

We will now use our collected results to analyze the asymptotic distribution of the degree random variables; the following result characterizes this distribution:

###### Lemma 9.5.

Fix a node . Given , for some , the degree of node converges in distribution to a Poisson random variable with rate , as .

###### Proof.

The PGF of is given by

 E[tDn,k∣Wn,k=wk]=E[Fn,k(t;wk)],for |t|≤1.

Note that these expectations are under the -field generated by . For all , we will derive the limit of , as , which we note is given by the -th order derivatives of the PGF in Section 9, evaluated at the argument . It therefore suffices to show that , as , for all . By creftype 9.4, we know that is uniformly bounded and that , as . Therefore, by uniform integrability,

 limn→∞E[F(s)n,k(t;wk)]=E[limn→∞F(s)n,k(t;wk)]=wskexp{(t−1)wk}.

We are now ready to prove the main theorems in the paper.

###### Proof of creftype 3.1.

We will first verify that, for , for every node and for some constant as . By creftype 9.5, conditioned on , the degree converges in distribution to a Poisson random variable with rate . Then by dominated convergence,

By the asymptotics of the Gamma function, for , we have

 limn→∞P{Dn,k=y}=cy−1−α,

for some constant .

Next we show that, for any finite , the collection of random variables are asymptotically independent, as . We compute the (joint) probability generating function of , with for . By creftype 9.3,

Given , by a similar argument as in the proof of creftype 9.4, one can easily show that

 m∏j=i+1Ln,m+1:n+ℓn,1:m+titjwiWn,jLn,m+1:n+ℓn,1:m+wiWn,jP→1,as n→∞,

and

 n∏j=m+1Ln,m+1:n+ℓn,1:m+tiwiWn,jLn,m+1:n+ℓn,1:m+wiWn,jP→exp{(ti−1)wi},as n→∞.

Hence, again by a similar argument as in the proof of