1 Introduction
In statistical applications, random graphs serve as Bayesian models for network data, that is, data consisting of objects and the observed linkages between them. Here we will focus on models for random simple graphs (that is, graphs with edges that take binary values), which are appropriate for applications where we observe either the presence or absence of links between objects in the network. For example, in social networks, nodes may represent individuals and a link (i.e., a nonzero value of an edge) could represent friendship. In a proteinprotein interaction network, nodes may represent proteins and links could represent an observed physical or chemical interaction between proteins. Many domains involving network data (including social and proteinprotein interaction networks) have been shown to exhibit power law, i.e., heavytailed, degree distributions (Barabási & Albert, 1999). Models for random graphs with power law degree distributions, also called scalefree random graphs, have therefore become one of the most actively studied areas of graph theory and network science (Bollobás et al., 2001; Albert & Barabási, 2002; Dorogovtsev & Mendes, 2002). In this paper we present a model for simple, scalefree random graphs, which we apply as a probabilistic model for several network datasets.
The model we present in this paper is a special case of the generalized random graph defined by Britton et al. (2006), and studied further by van der Hofstad (2016, Ch. 6), which outlines a framework for defining scalefree random graphs, but does not provide practical constructions, much less algorithms for performing statistical inference on the model components given data. Here we provide one such practical construction, along with a variational inference routine (Jordan et al., 1999) for efficient posterior inference. What’s more, our construction readily generalizes to include the structure of latent factors/clusters, as captured by the popular stochastic blockmodels (Nowicki & Snijders, 2001; Airoldi et al., 2009), while maintaining power law behavior in the graph.
Applying Bayesian inference algorithms on network datasets is a challenge because likelihood computations, in general, scale with the number of edges in the graph, which is in a network with nodes. To help overcome these difficulties, we follow Hoffman et al. (2013) and develop a stochastic variational inference algorithm in which we approximate many likelihood computations on only subsets of the data, called minibatches. In the case of a network dataset, the minibatches are comprised of subsets of edges in the graph.
We apply this inference procedure to several network datasets that are commonly observed to possess power law structure. Our experiments show that accurately capturing this power law structure improves performance on tasks predicting missing edges in the networks.
2 Bayesian models for simple graphs
We represent a simple graph with nodes by an adjacency matrix , where if there is a link between nodes and and otherwise. Here we will only consider undirected graphs, in which case represents a symmetric matrix. Furthermore, we do not allow self links, so the diagonal entries in are meaningless. Most probabilistic models for simple graphs take the entries in to be conditionally independent Bernoulli random variables; in particular, for every , let be the (random) probability of a link between nodes and , and let . For every simple graph , we may then write the likelihood for the parameters given as
where in our case it should be clear that the product is only over such that and . Random simple graphs date back to the Erdös–Rényi model, which may be reviewed, along with the more general theory of random graphs, in the text by Bollobás (1998). A random graph is called scalefree when the fraction of nodes in the network having connections to other nodes behaves like for large values of and some exponent . More precisely, let denote the (random) degree of node , for every . Then is (asymptotically) scalefree when, for every node ,
for some constant , a power law exponent , and sufficiently large. Here the notation denotes that the ratio in the specified limit.
In order to model scalefree random graphs, Britton et al. (2006) suggested reparameterizing the model in Section 2
by a sequence of odds ratios
, for every , which factorize as , for some . The nodespecific factors are then modeled as for some sequence of nonnegative random variables and where . In a series of results, (Britton et al., 2006, Thms. 3.1 & 3.2) and (van der Hofstad, 2016, Cor. 6.11 & Thm. 6.13) assert conditions on the random variables so that the limiting distribution of the degreesis a mixed Poisson distribution. We will further detail these previous results in
Section 4. The distribution of is interpreted here as a prior distribution for the degree of node , and if its distribution has heavy tails, then so will the distribution of . Conversely, if the distribution of does not have heavy tails, then neither will the distribution of the degrees . We explore this alternative in Section 7.Previous authors did not suggest any particular choices for the distribution of , and so we elect to model them with BFRY random variables (Bertoin et al., 2006; Devroye & James, 2014), which have a heavytailed distribution and have recently played a role in the construction of several power law models in Bayesian statistics. Other heavy tailed distributions, such as those exhibited by log normal random variables, may also be used to model the , and these options may be explored. One benefit of the BFRY distribution is that the thickness of its tails, and thus the power law behavior of the resulting graph, may be straightforwardly controlled by the discount parameter .
3 A generalized random graph
Consider the model from the previous section, parameterized by the odds ratios . Define
and note that the conditional likelihood in Section 2 may be rewritten in terms of the degrees as
The random simple graph is called a generalized random graph, and we will henceforth write .
Let , which we call the discount parameter, and let be a sequence of positive values satisfying
Let the weights be i.i.d. with density
(These are truncated BFRY random variables and will be discussed, along with a method for simulation, in Section 3.1.) Then the corresponding generalized random graph has an (asymptotic) power law degree distribution with power law exponent . We summarize this construction in the following theorem:
Theorem 3.1.
For every , let be i.i.d. with density and let be the degrees of the generalized random graph , where is the sequence of odds ratios
and . Then the following hold:

[nolistsep]

For , , for every node and for some constant , as .

For any , the collection are asymptotically independent, as .
This construction is closely related to the model described by van der Hofstad (2016, Thm. 6.13), and the proof of creftype 3.1, which is provided in the supplementary material, follows analogously to the results by Britton et al. (2006, Thms. 3.1 & 3.2). Note that the power law exponent of the graph (as described by Section 2) is determined by the parameter , and takes values in . While power law exponents in has often been suggested in the past, it has more recently been shown that exponents within the range of our model is more appropriate in many domains (van der Hofstad, 2016, Ch. 1); (Crane & Dempsey, 2015).
3.1 Truncated BFRY random variables
A random variable with density function given by Section 3 is a ratio of gamma and beta random variables, upper truncated at . In particular let
be independent, then the ratio has density on (by construction), which is known as the BertoinFujitaRoynetteYor (BFRY) distribution (Bertoin et al., 2006; Devroye & James, 2014)
and has been used in the construction of power law models in some recent applications in machine learning
(James et al., 2015; Lee et al., 2016). The random variable is then obtained by upper truncating the random variable at . By our requirements on the sequence (c.f. Section 3), the density function of approaches the density function of the BFRY random variable as , that is,which is heavytailed with infinite moments. It is straightforward to simulate these truncated BFRY random variables by repeatedly simulating
and as in Section 3.1, accepting as a sample when .The truncation of at produces a random variable with finite mean (for ), which is essential when constructing the generalized random graph and motivates the construction by van der Hofstad (2016, Thm. 6.13) alluded to earlier; see Section 4. For simplicity, one could take , but the flexibility to set this parameter allows us to control other properties of the model. For example, in the next section we show how to vary this truncation level to control the sparsity of the graph.
3.2 Controlling power law and sparsity in the graph
The discount parameter controls the power law behavior of the graph, where decreasing results in heavier tails in the degree distribution of the nodes in the graph. We can visualize this behavior by simulating graphs at different values of . In Fig. 1, we set and show the number of nodes of varying degrees in two simulated graphs, one with and one with .
The degree distribution of the nodes in a graph of course affects the sparsity of the graph; to characterize this relationship, we can upper bound the expected number of links in the graph as follows:
Theorem 3.2.
Let be the number of positive edges in the graph. Then
The derivation of this result is provided in the supplementary material. While varying can thus control the sparsity of the graph in addition to the power law behavior, we often want to decouple these behaviors, in which case we could parameterize the truncation level as , for some sparsity parameter . Note the restriction must be enforced in order to ensure that the conditions in Section 3 are satisfied. In this case, the bound in creftype 3.2 becomes . The interpretation here is that increasing the upper bound increases the likelihood that any particular node will link to others, but does not affect the (asymptotic) power law characterized by creftype 3.1. In Fig. 2, we display the average number of positive edges in graphs that were simulated with fixed and varying values of the sparsity parameter . We note that in simulations, we encountered numerical issues in regimes.
4 Related work
Referring to the construction for generalized random graphs in Section 2, Britton et al. (2006, Thm. 3.1) shows that when the weights have finite first and second moments, then the limiting distribution of the degree is a mixed Poisson distribution. Most such distributions are lighttailed, however, in which case the degrees will not exhibit power law behavior. Britton et al. (2006, Thm. 3.2) therefore provides an alternative construction in which may have infinite moments (so that it may exhibit a heavy tail), which results in a graph with a power law exponent of . Finally, van der Hofstad (2016, Thm. 6.13) suggests yet another construction where the are upper truncated to be of order , where is the number of nodes in the graph. The resulting random variables therefore have finite moments, yet exhibit a heavy tail, and the resulting random graph has a heavy tailed degree distribution with an arbitrary power law exponent. None of these results suggest a particular choice for the distribution of , however, and so we have elected to use BFRY random variables (which are heavy tailed) that are upper truncated (so that they have finite moments). We note that the requirements on our truncation level (c.f. Section 3) is less strict than the criterion of the van der Hofstad (2016, Thm. 6.13) construction.
The reader may consult the surveys by Bollobás & Riordan (2003); Albert & Barabási (2002); Dorogovtsev & Mendes (2002) for a background on scalefree random graphs, which is too large to review here. While these models are numerous, the following recent pieces of work in the Bayesian statistics and machine learning communities may be of interest to the reader: Caron & Fox (2014); Veitch & Roy (2015); Crane & Dempsey (2016); Cai & Broderick (2015). This collection of work discusses power law degree distributions, albeit in some cases in multigraphs (i.e., graphs with nonnegative integervalued edges) and in some cases the power law behavior is not characterized, only numerically observed in simulations. Many of these models can be seen to invoke their power law properties from the Pitman–Yor process (Pitman & Yor, 1997) (or related stochastic processes), where the extent of this behavior is controlled by the discount parameter of the Pitman–Yor model, which, like the BFRY distribution, is related to a stable subordinator of index .
5 Incorporating latent factors
Latent factor models for relational data assume that a set of latent clusters underlie the network. For example, in a social network, the latent factors could be the unobserved hobbies or interests of individuals, which determine the observed friendships in the network. Bayesian models for latent factors in relational data are widespread, with some of the most popular based on stochastic blockmodels
, where models for unsupervised learning, or clustering, are used to infer the latent factors
(Nowicki & Snijders, 2001; Kemp et al., 2006; Airoldi et al., 2009; Miller et al., 2009). In this section, we present extensions of the generalized random graph that incorporate latent factors by scaling the odds ratios, while maintaining their power law degree distribution.We will first provide a general result showing how to incorporate random scaling variables into the model, followed by specific examples that model these scaling variables with latent clusters. Let the odds ratios in the generalized random graph be given by for some . Note that as and as , and so the edgespecific weight simply scales the link probability. The random graph then has the likelihood
where the normalization term in Section 3 is now
where the final equality follows simply because . So constructed, the odds ratios will influence the link probabilities in the generalized random graph, but will not affect the power law behavior of the degree distributions (under some assumptions on the random variables ). We summarize this construction in the following theorem, the proof for which is provided in the supplementary material:
Theorem 5.1.
Let be i.i.d. random variables with density function (in Section 3). Let be a collection of uniformly bounded random variables, where, for every , the collection is exchangeable. Let be the degrees of the random graph , where is the sequence of odds ratios
and where . Then the degrees satisfy statements (1) and (2) in creftype 3.1.
For example, we may construct stochastic blockmodels, such as those introduced by Nowicki & Snijders (2001), as follows: For every , let be a random variable taking values in , indicating which one (and only one) of different factors to associate with node . We want the latent cluster assignments for two nodes and to influence their link probability, which we could capture with a set of parameters , for . Then the parameter could represent, or influence, the probability of a link between nodes and . Taking a Bayesian approach, the indicator variables may be modeled with a Dirichletcategorical conjugate distribution and their values may be inferred via probabilistic inference. An example of such a model could be summarized as follows: Let
(1)  
(2)  
(3)  
(4) 
and construct the random graph as in creftype 5.1. Kemp et al. (2006) developed a nonparametric extension of a similar model that in a sense takes the limit , allowing an appropriate number of clusters to be automatically inferred from the data. In this case, the marginal law of the indicator variables is given by a Chinese restaurant process (with concentration parameter ).
Several generalizations of the stochastic blockmodel allow the clusters underlying the network to overlap, leading to mixed membership stochastic blockmodels (Airoldi et al., 2009) or the related latent feature relational models (Miller et al., 2009). To capture this structure, we may generalize the indicators to now represent a binary
vector with entry
indicating node is associated with cluster , now called a feature, and otherwise. One example of such a model could be summarized as follows:(5)  
(6)  
(7)  
(8) 
and construct the random graph as in creftype 5.1. Miller et al. (2009) derived a nonparametric extension of this model that in a sense takes the limit , in which case the marginal law of the vectors is that of an Indian buffet process (with mass parameter and concentration parameter ) (Ghahramani et al., 2007).
6 Variational inference
We derive a variational Bayesian inference algorithm (Jordan et al., 1999) that approximates the (optimal state of the) posterior distribution of the model components, given a network dataset. We approximate the required gradients in this procedure with stochastic gradient ascent (Bottou, 2010; Hoffman et al., 2013), computed on minibatches (i.e., subsets) of edges in the graph.
6.1 The variational lower bound
In variational inference, we approximate the posterior distribution on the latent variables with a variational distribution , the parameters of which are fit to maximize the following lower bound on the marginal likelihood
where is the likelihood function computed as in Section 3, and is the prior on represented by the density function in Section 3. The (non random) discount parameter is inferred by corresponding gradient ascent updates maximizing the likelihood of the model, which is described in Section 6.4.
We specify a mean field variational distribution . We considered several approximations for the marginals
including truncated BFRY and truncated gamma distributions, however, in our experiments we found that the following
rectified gamma distribution performed well:independently for every , where and denote the shape and rate parameters of the gamma distribution, respectively, and the notation emphasizes that this formula holds under the variational distribution .
6.2 Stochastic gradient ascent
We maximize the lower bound on the right hand side of Section 6.1 by stochastic gradient ascent, where on the th step of the algorithm, we make the following updates to the parameters in parallel
for and some sequence of positive numbers satisfying the Robbins–Monro criterion (Robbins & Monro, 1951) and and where
where denotes the observed edges (both links and nonlinks) in the dataset. We cannot evaluate the expectation (with respect to the rectified gamma distributions ) analytically, and so we elect to use a particular Monte Carlo approximation of this gradient detailed by Knowles (2015), which was developed for gamma variational distributions and easily applies to the rectified gamma case.
Briefly, for every , create the collection of Monte Carlo samples from the variational distribution as follows: Independently for , let , and set , where and
is the inverse of the cumulative distribution function for a gamma random variable. For convenience, we recall that
For every , the gradient with respect to the parameters is then approximated by
where
. This estimator is unbiased and has low enough variance that often a single sample suffices for the approximation
(Salimans & Knowles, 2013; Kingma & Welling, 2014). The gradient of is nonzero only when , in which case we may immediately obtain the partial derivative with respect to the rate parameter; in particular, we haveThe partial derivative with respect to the shape parameter does not have a closed form solution and must be approximated. Different approximation routines are suggested by Knowles (2015) for different regimes of the shape parameter , and we found these approximations to be accurate and efficient in our experiments.
6.3 Minibatches of edges in the graph
Computing the required gradients in Section 6.2 may be done in parallel, and this computation, whether performed analytically or with automatic differentiation methods, scales with the number of edges in the graph. This can be prohibitive for many network datasets, and we therefore introduce a further approximation where this gradient is evaluated on subsets (a.k.a. minibatches) of the dataset, a technique from stochastic gradient ascent (Bottou, 2010) adopted in the context of variational Bayesian inference by Hoffman et al. (2013). In the case of a network dataset, we may select minibatches that are subsets of the observed edges in the graph. In particular, write the gradient of Section 6.2 with respect to the variable (which is required by Section 6.2) as
where
is the gradient that ignores all but one edge of the graph. We may therefore compute the unbiased estimate of this gradient
on a minibatch of the observed edges.
6.4 Inference on the parameters and
Without good prior knowledge of how to set the discount parameter and the sparsity parameter controlling the power law and sparsity behaviors of the graph, respectively, we infer their values from the data. First consider the discount parameter, which we infer with gradient ascent. After every update to the latent variables , we fix them to their mean under the distribution , i.e., where , and take a step in the direction of the gradient
which is straightforward to derive from the density function in Section 3, and where the normalization term
is a function of and , if we let as suggested in Section 3.2. We do not have a closed form solution for this term when , and, unfortunately, inference on model parameters where the likelihood is difficult to evaluate is a challenging problem; for example, see the approaches taken by Murray et al. (2006) on such problems, which those authors call doubly intractable distributions. Accurate inference for is important in our model, because it controls the power law behavior of the graph. In our experiments, we approximate the gradient in Section 6.4 for (fixed ) by approximating (via Section 6.4) and with line integrals. In the Section 7, we demonstrate that this approximation works well in various regimes of , with slight overestimation for moderate values.
Similar approaches to infer may be derived with finite difference approximations; we did not find these approaches successful in our experiments, however, and so we instead select by cross validation.
7 Experiments
true  model  max test loglikel  avg test loglikel 

BFRY  57323.19 91.62  57675.72 31.71  
Gamma  71341.90 116.82  71841.66 47.38  
BFRY  21077.62 79.64  21289.75 34.23  
Gamma  24430.38 73.06  24701.06 11.31  
BFRY  7894.67 41.84  8027.42 51.08  
Gamma  8511.48 22.45  8601.50 15.42 
dataset  model  max test loglikel  avg test loglikel 

500Air  BFRY  1628.51 10.46  1654.20 6.79 
Gamma  1842.10 3.97  1870.35 0.28  
polblogs  BFRY  474.67 32.20  503.20 37.85 
Gamma  555.24 18.27  596.78 0.78  
Fb107  BFRY  18098.38 20.50  18209.94 12.86 
Gamma  18403.66 31.76  18568.05 2.79  
openfl  BFRY  16561.13 137.89  16947.70 177.21 
Gamma  17475.52 31.97  17746.79 6.65 
true  true  true  500Air  polblogs  Fb107  openfl  

BFRY –  0.33 0.00  0.53 0.00  0.68 0.00  0.23 0.03  0.64 0.06  0.00 0.00  0.67 0.21 
Gamma –  5.29 0.01  1.42 0.00  0.51 0.00  5.10 0.01  0.66 0.00  33.58 0.01  0.47 0.00 
BFRY –  –  –  –  1.08 0.16  1.40 0.00  0.80 0.0  1.28 0.10 
Inferred hyperparameters in the experiments.
We first demonstrate how the inference procedure in Section 6.4 can correctly differentiate between various regimes of . We ran an experiment where for each value , we simulated 10 datasets from the model with nodes, while fixing . For each simulated dataset, we ran an instance of the inference routine with randomly initialized. In Fig. 3, we show the trace plots of alpha during each instance of the inference routine. For comparison, the true values of are also shown as horizontal dashed lines. We can see that the inference routine can correctly distinguish between these different regimes of , with slight overestimation in the moderate regime. Interestingly, despite random initializations of , the algorithm always immediately inflates to around 0.9, and then slowly decreases this value during inference, regardless of what value of generated the data.
We next demonstrate that accurately capturing power law structures in datasets will improve predictive performance. While fixing , we simulate three network datasets with 5,000 nodes from our model with discount parameters and , respectively, which therefore exhibit increasingly lightertailed degree distributions. The generated graphs have 117,300, 32,925, and 9,460 links, respectively. To establish a baseline model that does not exhibit power law degree distributions but is otherwise comparable to our model, we implement the generalized random graph where the nodespecific weights are constructed from the gamma random variables , for some positive parameter , i.i.d. for every node . Note that the parameter controls the sparsity of the generated graph; larger values of imply denser graphs. It follows analogously to creftype 5.1 that
for , as . This model therefore does not exhibit power law behavior, as desired. We refer to this model as “Gamma” and the power law graph model as “BFRY”.
We ran an experiment holding out 20% of the edges in the simulated graphs as test sets, training the two models on the remaining 80% of the edges. We used a minibatch size of 5,000 edges (note that the training dataset corresponds to almost 10 million observed edges). We ran each inference procedure for 20,000 steps of stochastic gradient ascent updates, using Adam (Kingma & Ba, 2015) to adjust the learning rates at each step. We repeated each experiment 5 times, each time holding out a different test set and using a different random initialization. Again, for this experiment we fixed . In Table 1 we report a mean loglikelihood metric for the test datasets, where the metric for each run is obtained by averaging the test loglikelihoods across the states for the last 4,000 steps of the inference procedure; the displayed intervals are at standard deviation about the metric, from across the 5 repeats. We also report a max loglikelihood metric, which simply records the maximum test loglikelihood across the last 4,000 steps of the inference procedure, instead of the average. The best performing method is highlighted in bold (which in each case was the BFRY model).
In each case, we see that the BFRY model achieves higher test loglikelihood metrics than the Gamma model, as expected, implying that accurately capturing a power law degree distribution improves predictive performance (when power law behavior is truly present in the network). In Table 3, we report the inferred values for , which were reasonably accurate, though we see slight overestimation for some regimes, as seen in the demonstration earlier. For the baseline Gamma model, we optimized the hyperparameter using gradient ascent maximizing the evidence lower bound of the model (c.f. Section 6.1), and the inferred values are also reported in Table 3.
Next, we ran similar experiments on the following network datasets, each of which are expected to exhibit power law degree distributions:

[nolistsep]

‘USTop500Airports’: 500 nodes, 2,980 links

‘openflights’: 7,976 nodes, 15,243 links

‘polblogs’: 1,490 nodes, 9,517 links

‘Facebook107’: 1,034 nodes, 26,749 links
Where appropriate, we saved only the upper triangular parts of the adjacency matrices. The ‘USTop500Airports’ dataset contains the (undirected, unweighted) flight connections between the 500 busiest US airports. The similar, though much larger, ‘openflights’ dataset contains the flight connections between nonUS airports. Scalefree networks have been proposed for such traffic networks, detailed for these datasets by Colizza et al. (2007). The ‘polblogs’ dataset contains the links between political blogs (judged by hyperlinks between the front webpages of the blogs) in the period leading up to the 2004 US presidential election, which is observed to exhibit power law degree distributions by Adamic & Glance (2005). The ‘Facebook107’ dataset contains “friendships” between users of a Facebook app, collected by Leskovec & McAuley (2012); social networks are widely studied for their power law degree distributions.
For both the Gamma and BFRY models, we ran our variational inference procedure for 20,000 steps on each dataset. As before, we repeated the experiment 5 times for each network, each time holding out a different 20% of the edges in the network as a testing set. We selected the value of from among the grid
with 5fold cross validation on the training set. We set the minibatch size to be equal to the number of nodes in the graph; for example, we used minibatches of 1,490 edges for the polblog dataset. The evaluation metrics on the test datasets are summarized in
Table 2, and the inferred hyperparameter values are reported in Table 3. We see that the BFRY model once again outperforms the Gamma baseline model, according to the test loglikelihood metrics.Probabilistic inference on by the BFRY model provides some of the most interesting analyses here. With (underflowing our machine’s precision), the Facebook107 social network has the degree distribution with the heaviest tails, followed by the USTop500Airports traffic network with , the polblog citation network with , and the openflights network has the lightest tailed degree distribution with .
8 Future work
Future work could focus on implementing the latent factor modeling generalizations presented in Section 5, which are natural assumptions in many domains where networks are expected to exhibit power law degree distributions. Alternative approaches to inference on the sparsity parameter should also be explored, since controlling the sparsity in the graph was important for good predictive performance.
Acknowledgements
The authors thank Remco van der Hofstad for helpful advice and anonymous reviewers for helpful feedback. J. Lee and S. Choi were partly supported by an Institute for Information & Communications Technology Promotion (IITP) grant, funded by the Korean government (MSIP) (No.2014000147, Basic Software Research in Humanlevel Lifelong Machine Learning (Machine Learning Center)) and Naver, Inc. C. Heaukulani undertook this work in part while a visiting researcher at the Hong Kong University of Science and Technology, who along with L. F. James was funded by grant rgchkust 601712 of the Hong Kong Special Administrative Region.
References
 Adamic & Glance (2005) Adamic, L. A. and Glance, N. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery, pp. 36–43, 2005. URL http://www.cise.ufl.edu/research/sparse/matrices/Newman/polblogs.
 Airoldi et al. (2009) Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. Mixed membership stochastic blockmodels. In Advances in Neural Information Processing Systems, 2009.
 Albert & Barabási (2002) Albert, R. and Barabási, AL. Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47, 2002.
 Barabási & Albert (1999) Barabási, A. and Albert, R. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
 Bertoin et al. (2006) Bertoin, J., Fujita, T., Roynette, B., and Yor, M. On a particular class of selfdecomposable random variables: the durations of bessel excursions straddling independent exponential times. Probability and Mathematical Statistics, 26:315–366, 2006.
 Bollobás (1998) Bollobás, B. Random graphs. Springer, 1998.
 Bollobás & Riordan (2003) Bollobás, B. and Riordan, O. M. Mathematical results on scalefree random graphs. Handbook of graphs and networks: from the genome to the internet, pp. 1–34, 2003.
 Bollobás et al. (2001) Bollobás, B., Riordan, O., Spencer, J., and Tusnády, G. The degree sequence of a scalefree random graph process. Random Structures & Algorithms, 18(3):279–290, 2001.

Bottou (2010)
Bottou, L.
Largescale machine learning with stochastic gradient descent.
In COMPSTAT, 2010.  Britton et al. (2006) Britton, T., Deijfen, M., and MartinLöf, A. Generating simple random graphs with prescribed degree distribution. Journal of Statistical Physics, 124(6):1377–1397, 2006.
 Cai & Broderick (2015) Cai, D. and Broderick, T. Completely random measures for modeling power laws in sparse graphs. In NIPS 2015 Workshop on Networks in the Social and Information Sciences, 2015.
 Caron & Fox (2014) Caron, F. and Fox, E. B. Sparse graphs using exchangeable random measures. arXiv preprint arXiv:1401.1137, 2014.
 Colizza et al. (2007) Colizza, V., PastorSatorras, R., and Vespignani, A. Reaction–diffusion processes and metapopulation models in heterogeneous networks. Nature Physics, 3(4):276–282, 2007. URL https://sites.google.com/site/cxnets/usairtransportationnetwork.
 Crane & Dempsey (2015) Crane, H. and Dempsey, W. Atypical scaling behavior persists in real world interaction networks. arXiv preprint arXiv:1509.08184, 2015.
 Crane & Dempsey (2016) Crane, H. and Dempsey, W. Edge exchangeable models for network data. arXiv preprint arXiv:1603.04571, 2016.
 Devroye & James (2014) Devroye, L. and James, L. F. On simulation and properties of the stable law. Statistical methods & applications, 23(3):307–343, 2014.
 Dorogovtsev & Mendes (2002) Dorogovtsev, S. N. and Mendes, J. F. F. Evolution of networks. Advances in physics, 51(4):1079–1187, 2002.
 Ghahramani et al. (2007) Ghahramani, Z., Griffiths, T. L., and Sollich, P. Bayesian nonparametric latent feature models. Bayesian Statistics, 8:201–226, 2007. See also the discussion and rejoinder.
 Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. W. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 James et al. (2015) James, L. F., Orbanz, P., and Teh, Y. W. Scaled subordinators and generalizations of the Indian buffet process. arXiv preprint arXiv:1510.07309, 2015.
 Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
 Kemp et al. (2006) Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., and Ueda, N. Learning systems of concepts with an infinite relational model. In AAAI, 2006.
 Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: a method for stochastic optimization. In ICLR, 2015.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Autoencoding variational Bayes. In ICLR, 2014.
 Knowles (2015) Knowles, D. A. Stochastic gradient variational Bayes for gamma approximating distributions. arXiv preprint arXiv:1509.01631, 2015.
 Lee et al. (2016) Lee, J., James, L. F., and Choi, S. Finitedimensional BFRY priors and variational Bayesian inference for power law models. In Advances In Neural Information Processing Systems, pp. 3162–3170, 2016.
 Leskovec & McAuley (2012) Leskovec, J. and McAuley, J. J. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems 25, 2012. URL https://snap.stanford.edu/data/egonetsFacebook.html.
 Miller et al. (2009) Miller, K., Jordan, M. I., and Griffiths, T. L. Nonparametric latent feature models for link prediction. In Advances in neural information processing systems, 2009.
 Murray et al. (2006) Murray, I., Ghahramani, Z., and MacKay, D. J. C. Mcmc for doublyintractable distributions. In UAI, 2006.
 Nowicki & Snijders (2001) Nowicki, K. and Snijders, T. A. B. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.
 Pitman & Yor (1997) Pitman, J. and Yor, M. The twoparameter Poisson–Dirichlet distribution derived from a stable subordinator. The Annals of Probability, pp. 855–900, 1997.
 Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

Salimans & Knowles (2013)
Salimans, T. and Knowles, D. A.
Fixedform variational posterior approximation through stochastic linear regression.
Bayesian Analysis, 8(4):837–882, 2013.  van der Hofstad (2016) van der Hofstad, R. Random graphs and complex networks: Volume 1. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2016. URL http://www.win.tue.nl/~rhofstad/NotesRGCN.pdf.
 Veitch & Roy (2015) Veitch, V. and Roy, D. M. The class of random graphs arising from exchangeable random measures. arXiv preprint arXiv:1512.03099, 2015.
9 Proofs
We prove creftype 3.1 and creftype 5.1 in the paper. First consider the following redefinition of our model with slightly different notation; let be a random variable constrained on , with density
where is a sequence of positive numbers satisfying
Note that as , and so the sequence of densities converges pointwise to the density of the BFRY distribution
and converges in distribution to a BFRY random variable. Let be i.i.d. copies of . A random simple graph is then defined to be a collection of Bernoulli random variables as follows:
where . We will write , where .
We begin with a sequence of Lemmas. Define a sequence of random variables , for every , by
Let be i.i.d. copies of , and denote the empirical mean of these copies by
The expectation of is finite for all , and is computed as
where is the lower incomplete gamma function.
Let denote convergence in probability. The following lemma is a standard mean convergence result:
Lemma 9.1.
, as .
Proof.
The following lemma will be used to study various higher order moments in later results:
Lemma 9.2.
For ,
Proof.
We have
As , the first factor on the right hand side clearly converges to zero (c.f. Section 9), and, by creftype 9.1, the second term converges to a constant in probability. ∎
Recall that is the degree of the th node in the graph , given by Section 9. The following result will show up in later calculations involving the probability generating function (PGF) of the degree random variables :
Lemma 9.3.
For every collection with , for ,
for positive .
Proof.
The proof is given by Britton et al. (2006). ∎
The following result studies a representation of the PGF of the degree random variables and their higher order moments:
Lemma 9.4.
Fix a node . Define
where . Note that the th derivative exists for all . For all , the following hold:

[nolistsep]

is uniformly bounded, for all ;

, as .
Proof.
In the case , is trivially bounded by 1 since . By the Taylor series expansion , we have
By creftype 9.1,
where is the empirical mean in Section 9 excluding the element . Furthermore, by creftype 9.2,
where is computed without . Combining, we have
Before proceeding for , we define
for all . One can easily see that for all . For , we have
and
Hence, by the squeeze theorem, . For , we have
by creftype 9.2. Hence, we have for .
Now we show that
for some constants for all and . We proceed by the mathematical induction. For ,
Now by the inductive hypothesis,
where
so the inductive argument holds.
Having (9), by mathematical induction, we can easily show that is uniformly bounded for all . Moreover,
by (9) and (9). Combining this with (9), by mathematical induction, we can show that for all ,
∎
We will now use our collected results to analyze the asymptotic distribution of the degree random variables; the following result characterizes this distribution:
Lemma 9.5.
Fix a node . Given , for some , the degree of node converges in distribution to a Poisson random variable with rate , as .
Proof.
The PGF of is given by
Note that these expectations are under the field generated by . For all , we will derive the limit of , as , which we note is given by the th order derivatives of the PGF in Section 9, evaluated at the argument . It therefore suffices to show that , as , for all . By creftype 9.4, we know that is uniformly bounded and that , as . Therefore, by uniform integrability,
∎
We are now ready to prove the main theorems in the paper.
Proof of creftype 3.1.
We will first verify that, for , for every node and for some constant as . By creftype 9.5, conditioned on , the degree converges in distribution to a Poisson random variable with rate . Then by dominated convergence,
By the asymptotics of the Gamma function, for , we have
for some constant .
Next we show that, for any finite , the collection of random variables are asymptotically independent, as . We compute the (joint) probability generating function of , with for . By creftype 9.3,
Given , by a similar argument as in the proof of creftype 9.4, one can easily show that
and
Hence, again by a similar argument as in the proof of creftype 9.4, we have