Parameter Estimation Procedures for Exponential-Family Random Graph Models on Count-Valued Networks: A Comparative Simulation Study

11/03/2021
by Peng Huang, et al.
University of California, Irvine

The exponential-family random graph models (ERGMs) have emerged as an important framework for modeling social and other networks. ERGMs for valued networks are less well-studied than their unvalued counterparts, and pose particular computational challenges. Networks with edge values on the non-negative integers (count-valued networks) are an important such case, with applications ranging from migration and trade flow data to data on frequency of interactions and encounters. Here, we propose an efficient maximum pseudo-likelihood estimation (MPLE) scheme for count-valued ERGMs, and compare its performance with existing Contrastive Divergence (CD) and Monte Carlo Maximum Likelihood Estimation (MCMLE) approaches via a simulation study based on migration flow networks in two U.S. states. Our results suggest that edge value variance is a key factor in method performance, with high-variance edges posing a particular challenge for CD. MCMLE can work well but requires careful seeding in the high-variance case, and the MPLE itself performs well when edge variance is high.




1 Introduction

Binary relations - relations in which edges can be approximated as simply “present” or “absent” - form the backbone of the social network field, with decades of theoretical, methodological, and empirical progress in understanding their structure and function. Valued relations, while by no means neglected, are less well-understood, and our tools for studying them less well-developed. Yet, the “strength of social ties” is at the core of many scientific questions in a range of social settings [21]. Examples of valued relations include the frequency of interaction in interpersonal contact networks [5], communication volume within and among organizations [16, 12], encounters among non-human animals [19], and trade flows among nations [48]. The need for rich information about social relations is particularly acute for networks involving interactions among aggregate entities such as nations, geographical areas, gangs, or formal organizations: because ties in such networks are themselves frequently aggregations of lower-level interactions, it is frequently the case that one’s interest is not in the mere existence of trade, migration, homicide, communication, or other interactions, but in their volume, frequency, or other quantitative features. In such settings, modeling edge states is of considerable substantive importance.

The earliest statistical modeling of valued networks was accomplished via network regression methods [30]; these provide only least-squares estimates of covariate effects, although autocorrelation-robust null hypothesis tests for such effects are well-known [13], and some generalization via generalized linear models (GLMs) and related techniques is possible. Some forms of dependence can, further, be controlled semi-parametrically using latent structure models [e.g., 37, 25, 46, 1], allowing estimation of covariate effects while accounting for unobserved mechanisms that can be written in terms of mixing on unobserved variables. Parametric models for valued graphs with general classes of dependence effects have been longer in coming, the current state of the art being exponential family random graph models (ERGMs) defined on sets of valued graphs [15, 31, 33] (but see also Robins et al. [38] for a pioneering example using categorical data and pseudo-likelihood estimation). Although ERGMs for valued graphs are not complete in the sense that they are for unvalued graphs (i.e., for most types of edge values, it is not always possible to write an arbitrary distribution on the order-n valued graphs in ERGM form), they are still highly general families, able to flexibly specify a wide range of effects. Since their introduction, they have been applied in a number of settings, ranging from networks of collaboration in government, and networks of migration flows, to networks of functional connectivity between brain regions [26, 40, 43, 49].

Notwithstanding their broad applicability, parameter estimation for ERGMs is often computationally demanding, a problem that is especially acute for valued networks. The major computational cost comes from the normalizing factor in the ERGM likelihood function, which is generally an intractable function involving the sum or integral of an exponentiated potential over the set of all possible network configurations. Although much is made of the fact that these sums have too many elements to explicitly evaluate [except in the case of extremely small unvalued graphs, e.g. 45], this is not the major obstacle to computation: rather, the difficulty arises from the extreme roughness (i.e., high variance) of the exponentiated potential over the support, which (in the absence of an explicit analytical solution) renders naive attempts at numerical approximation ineffective. This problem can be amplified in the valued case, particularly where edge values vary greatly; valued edges can also pose challenges for some approximate estimation procedures that are successful in the case of unvalued ERGMs, as they must now explore a larger per-edge state space. This high cost of estimation puts a priority on computationally efficient approximation methods. However, there has to date been no systematic study of how well such methods perform, either in terms of computational efficiency or quality of estimation.

This paper provides a look at this issue, evaluating estimation quality and computational cost for a number of alternative valued ERGM estimation techniques. We focus here on ERGMs for count-valued networks, i.e., relations whose edges take values on the unbounded non-negative integers, evaluating estimators via a simulation study based on intercounty migration-flow networks in two U.S. states. We vary the variance of edge values and the (node) size of the network to simulate different data structures. The methods examined include the two currently implemented “standard” strategies - contrastive divergence [CD; 32] and Markov chain Monte Carlo maximum likelihood estimation [MCMLE; 29] - as well as one approach not previously used in this setting, maximum pseudo-likelihood estimation (MPLE). MPLE is a workhorse approximation method in the binary ERGM case [42], but requires special implementation measures for the count-data case, and to our knowledge has not previously been used for count-data ERGMs with general dependence. For all methods, we evaluate computational speed, bias, variability, accuracy, calibration of estimated standard errors, and confidence coverage.

Our experiment reveals that the edge value variance has a substantial impact on the performance of computational methods for count-valued ERGMs. For small-edge-variance networks, all of the methods studied here offer high-quality point estimation, although their second-order properties vary. Specifically, CD greatly overestimates standard errors, while MPLE underestimates them for dependence terms but overestimates them for non-dependence terms [as has been seen in binary ERGMs, e.g. 35, 17]. All methods have roughly comparable wallclock time in this regime, although MPLE with parallel computing and subsampling can be the fastest. Overall, we find that MCMLE offers the best performance for small-edge-variance data, at some computational cost. When the edge variance becomes large, however, CD and CD-MCMLE (i.e., MCMLE seeded by CD) show very poor performance on all desiderata we evaluate, while MPLE and MPLE-MCMLE return high-quality estimates; the cost of MCMLE becomes much higher in this regime, however (especially for larger networks), while MPLE remains computationally efficient. Our findings thus suggest that MPLE-seeded MCMLE is the safer choice for high-quality estimation, with the MPLE itself a potentially viable approximation (particularly for large graphs with large edge variance) when MCMLE is computationally infeasible.

The remainder of the paper proceeds as follows. Section 2 briefly reviews ERGMs for valued networks, with applicable estimation strategies discussed in Section 3. Our simulation study design is described in Section 4, with results reported in Section 5. Section 6 discusses implications for method selection, and Section 7 concludes the paper.

2 Count-Valued ERGMs

An ERGM family for count data can be written as

(1)   P(Y = y | X; θ) = h(y) exp(θᵀ g(y, X)) / κ_{h,g}(θ, X),   y ∈ 𝒴,

where y is a realization of the network random variable Y on support 𝒴, the elements of 𝒴 being graphs whose edges take values on the set ℕ₀ = {0, 1, 2, …}. (Here, we further assume that 𝒴 is a subset of the order-n count-valued graphs, though generalization is possible.) g(y, X) is a vector of sufficient statistics, determined by exogenous covariates X and the graph state y, with corresponding parameter vector θ; κ_{h,g}(θ, X) = Σ_{y′∈𝒴} h(y′) exp(θᵀ g(y′, X)) is the normalizing factor. Finally, h is the reference measure, which defines the limiting behavior of the model as θ → 0. Often tacitly taken to be constant for binary ERGMs, the reference measure is essential for valued ERGMs, as it determines the marginal distribution of edge values under the reference [31]. Leaving h(y) = 1 leads to a marginal Boltzmann (geometric) baseline distribution of edge values, while choosing

(2)   h(y) = ∏_{(i,j)∈𝕐} 1 / y_{ij}!,

where 𝕐 is the set of edge variables and y_{ij} is the value of the (i, j) edge, leads to a marginal Poisson baseline distribution of edge values. Other choices are also possible, some of which may have specific substantive interpretations [see e.g. 10, 11, for examples in the binary case].
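To make the role of the reference measure concrete, the following Python sketch (an illustration of the math above, not part of any estimation package; function and variable names are ours) computes the marginal pmf of a single edge under a model with only a Sum term, in which case edges are independent: a constant reference yields the geometric (Boltzmann) baseline, while the factorial reference of Eq. 2 yields the Poisson baseline.

```python
import math

def edge_pmf(theta, log_h, kmax=60):
    """Marginal pmf of a single edge value under a count ERGM with only a
    Sum term (statistic = the edge value itself), reference measure
    h(k) = exp(log_h(k)), and support truncated to {0, ..., kmax}."""
    w = [math.exp(log_h(k) + theta * k) for k in range(kmax + 1)]
    z = sum(w)  # truncated normalizing factor
    return [wk / z for wk in w]

theta = -0.5  # must be negative for the constant reference, so the series converges

# Constant reference h(k) = 1: geometric baseline with success prob 1 - e^theta
geom = edge_pmf(theta, lambda k: 0.0)

# Poisson reference h(k) = 1/k!: Poisson baseline with mean e^theta
pois = edge_pmf(theta, lambda k: -math.lgamma(k + 1))
```

The same Sum coefficient thus induces very different edge-value distributions depending on the reference, which is why the choice of h cannot be left tacit in the valued case.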

As with binary ERGMs, we may specify the conditional probability that a given edge variable will take a specified value. Again interpreting y and Y as random adjacency matrices, let Y_{−ij} (respectively y_{−ij}) refer to the set of all edge variables other than the (i, j)th, and let the notation y_{ij=v} refer to the network formed by y with the (i, j)th edge variable set to value v. Then we have

(3)   Pr(Y_{ij} = v | Y_{−ij} = y_{−ij}) = h(y_{ij=v}) exp(θᵀ g(y_{ij=v}, X)) / Σ_{k∈ℕ₀} h(y_{ij=k}) exp(θᵀ g(y_{ij=k}, X))

(4)   = [ Σ_{k∈ℕ₀} (h(y_{ij=k}) / h(y_{ij=v})) exp(θᵀ [g(y_{ij=k}, X) − g(y_{ij=v}, X)]) ]^{−1}.

While the derivation is identical to the binary case (as can be appreciated by noting that Eq. 3 would reduce to the usual logistic form if the edge values were restricted to {0, 1}), we note the computationally important difference that the conditional edge probability itself now has a non-trivial normalizing factor. In the general case, this has no analytical solution, and since it involves an infinite sum it cannot be explicitly evaluated otherwise. Although this does not impact e.g. the acceptance calculations for typical Markov chain Monte Carlo (MCMC) algorithms (since the conditional odds of one graph versus another do not depend upon either normalizing factor), it does affect computation for the MPLE (which does depend on the conditional edge probability). Here, we formulate a finite sum approximation to Eq. 3 for MPLE, as described below.
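The truncated-sum idea can be sketched in a few lines of Python (an illustration with hypothetical names, not the paper's implementation): approximate the denominator of Eq. 3 by summing k over {0, …, kmax}, working on the log scale for stability. In the dyad-independent special case of a lone Sum term with Poisson reference, the conditional distribution is exactly Poisson(e^θ), which gives an easy correctness check.

```python
import math

def conditional_pmf(v, theta, delta, log_h, kmax):
    """Truncated-sum approximation to Eq. 3: probability that one edge takes
    value v, given changescores delta(k) = g(y_{ij=k}, X) - g(y_{ij=0}, X)
    and log reference measure log_h(k), summing k over {0, ..., kmax}."""
    logw = [log_h(k) + sum(t * d for t, d in zip(theta, delta(k)))
            for k in range(kmax + 1)]
    m = max(logw)  # log-sum-exp for numerical stability
    log_z = m + math.log(sum(math.exp(lw - m) for lw in logw))
    return math.exp(logw[v] - log_z)

# Check against the dyad-independent case: Sum term + Poisson reference
# should give an exactly Poisson(e^theta) conditional distribution.
theta = [1.2]                          # Sum coefficient (illustrative value)
delta = lambda k: [k]                  # change in the Sum statistic
log_h = lambda k: -math.lgamma(k + 1)  # Poisson reference measure
lam = math.exp(1.2)
p2 = conditional_pmf(2, theta, delta, log_h, kmax=80)
```

Here p2 should closely match the Poisson pmf exp(−λ)λ²/2 with λ = e^{1.2}, since the tail beyond kmax = 80 is negligible at this mean.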

3 Estimation Strategies for Count-Valued ERGMs

While many approaches to parameter estimation are possible, we focus here on approximations to the maximum likelihood estimator (MLE). Here, we briefly review the strategies employed, including special considerations for the count-data case.

3.1 Monte Carlo Maximum Likelihood Estimation

There are currently two widely used schemes for MCMC-based maximum likelihood estimation: stochastic approximation [41, 47], which is based on attempting to match the expected sufficient statistics to their observed values (exploiting the coincidence of method-of-moments and maximum likelihood estimators for exponential families); and the Geyer-Thompson method [20, 28] (supplemented in current implementations by Hummel stepping [27]), which uses an importance sampling scheme to directly optimize the log-likelihood surface. We here employ the latter, in its statnet implementation [28, 34].

MCMLE methods are the current gold-standard techniques for ERGM maximum likelihood estimation, with good theoretical properties [41, 22] and strong performance in simulation studies for binary networks [17]. An important bottleneck impacting the use of MCMLE, however, is the ability to produce relatively high-quality draws from the specified ERGM distribution (without which, the algorithms will not converge correctly). While it is known that conventional MCMC algorithms can in principle mix arbitrarily slowly [41, 7], in actual practice this problem has been observed primarily in badly specified models that are degenerate or near-degenerate, and hence of limited relevance in typical social network applications [29]. That said, estimation time can still become long on very large networks, particularly for models with strong edgewise dependence.

This cost issue becomes more acute for valued ERGMs, especially where edge values are highly variable. Intuitively, good MCMC mixing requires the Markov chain to explore the space of high-probability graphs, whose size increases substantially when edge values vary over a large range. For instance, for a simple random walk MCMC algorithm that proposes perturbing edges at random,¹ a number of toggles at least proportional to the number of edge variables is needed to ensure that every edge variable in an unvalued graph has a high probability of having the “opportunity” to change state. If edges typically vary over some interval of order r, then a similar random walk scheme that increments or decrements edge values will need on the order of r² updates per edge variable for each edge to have the “opportunity” to cover its range of values (an unbiased random walk requires roughly r² steps to traverse a range of size r). For networks with large counts (e.g., migrant-flow networks), one can easily obtain r in the thousands, in which case simulation costs can rapidly become prohibitive. Although this problem can be alleviated by coarsening edge values to a much smaller range [as done e.g. by 49, 50], this both loses information and distorts the resulting model (since e.g. coarsening artificially reduces the entropy of the graph distribution). In principle, improved MCMC algorithms offer a better way to address this problem in the long term, but at present our experience has been that current implementations do not scale well for high-variance count models. As we show below, the estimation quality of MCMLE is challenged by MPLE for high-variance valued ERGMs, and its computational cost greatly increases in this regime.

¹ Practical implementations often use slightly different proposals, but the basic intuition carries.
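The r² scaling invoked above can be checked with a toy simulation (illustrative Python, unrelated to any estimation code): the mean number of ±1 steps a simple random walk needs to first reach distance r from its start is r² exactly, so quadrupling r's square should roughly quadruple the hitting time.

```python
import random

def mean_hitting_time(r, trials=400, seed=42):
    """Average number of +/-1 steps for a simple random walk started at 0
    to first reach |position| = r (a proxy for "covering" a range of size r).
    The theoretical expectation is exactly r**2."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pos, steps = 0, 0
        while abs(pos) < r:
            pos += rng.choice((-1, 1))
            steps += 1
        total += steps
    return total / trials
```

With r in the thousands, each edge variable thus needs millions of proposals merely to traverse its plausible range, which is the source of the cost blow-up described above.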

3.2 Contrastive Divergence

One alternative to numerical approximation of either expected statistics or log-likelihood ratios (as in MCMLE) is to use a local approximation to the gradient of the likelihood in regions of the support “near” the observed data. This is the essential idea behind contrastive divergence (CD) [24], a method originally introduced in the machine learning literature for scalable inference that is particularly well-suited to ERGMs and other exponential families [4, 32]. CD can be employed for both valued and unvalued graphs, and greatly reduces computational time by using only very short MCMC chains starting at the observed data, depending on neither sample convergence nor burn-in. It is, however, an approximate technique that optimizes a function closely related to the pseudo-likelihood [4], and thus shares some of the drawbacks of the MPLE. Here, we compare CD with other estimators for count-valued ERGMs, showing that it has good first-order performance but poor calibration for small-edge-variance data, with estimation quality deteriorating badly when edge variance is high.

3.3 Maximum Pseudo-Likelihood Estimation

Although MPLE has not to our knowledge been studied or implemented for count-valued ERGMs, it is an otherwise well-known technique, having been the first practical method for general ERGM estimation [42]. Maximum pseudo-likelihood estimation optimizes the product of the conditional likelihoods of each edge variable (the eponymous pseudo-likelihood [6]). In the unvalued case, this reduces to a logistic regression problem, allowing the MPLE to be obtained using standard regression algorithms [a fact that was instrumental in its early adoption, see e.g. 3]. In the special case of edgewise independent ERGMs, the MPLE coincides with the MLE; this ceases to be true for dependence models, though the MPLE is generally close enough to the MLE to be used as a standard method for initializing MCMLE estimators, and its first-order performance on large networks can be very good [2, 39]. Because it does not fully account for interactions among edge variables, the pseudo-likelihood function tends to be excessively concentrated, leading to poor calibration of standard error estimates [as shown in binary ERGMs: 35, 17]. However, MPLE computation can be quite efficient, further aided by the facts that (1) the pseudo-likelihood itself can be approximated by subsampling edge variables (rather than computing on all of them), and (2) the calculations in question are embarrassingly parallel (making it possible to greatly reduce wallclock time on multi-core CPUs). We hence consider this approach here, as one of the methods to be compared.

As noted above, MPLE computation in the count-data context is more complex than in the binary case, and to our knowledge it has not been previously studied for count-valued ERGMs with dyadic dependence. We thus consider it here in greater detail. As in the binary case, the MPLE is defined by

(5)   θ̂_MPLE = argmax_θ Σ_{(i,j)∈𝕐} log Pr(Y_{ij} = y_{ij} | Y_{−ij} = y_{−ij}; θ),

where the conditional probabilities in question are given by Eqs. 3 and 4. Per Eq. 4, these latter conditionals depend upon a sum over the possible edge states of products of two factors: one involves the ratio of the reference measure at the observed edge value versus its alternative values, and the other involves the exponentiated difference in sufficient statistics between the observed network and the same network with the focal edge taking on alternative values. For the former, we observe that (in the case of the Poissonian reference) we have

(6)   h(y_{ij=k}) / h(y) = y_{ij}! / k!,

while the latter is simply exp(θᵀ Δ_{ij}(k)), where

Δ_{ij}(k) = g(y_{ij=k}, X) − g(y, X)

is the “generalized” changescore. There is not, in general, a simple form for the sum of these terms over all k ∈ ℕ₀. However, we observe that the ratio of Eq. 6 falls very rapidly (as y_{ij}!/k!) for k > y_{ij}, and it is hence possible in practice to approximate the infinite sum by truncation. More generally, we recommend the following techniques for improving computational performance:

3.3.1 Pre-caching of ratios and differences

We note that neither the ratio of reference measures nor the changescores depend upon θ. Considerable computational savings can hence be had by pre-computing the ratios of Eq. 6 and the changescore values for the necessary range of values on each edge. This carries a storage cost that scales with the product of the range and the number of edge variables used, but avoids frequent recalculation of these (expensive) quantities on each pseudo-likelihood evaluation.
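A minimal Python sketch of this caching scheme (illustrative structure and names; the paper's actual implementation is in R/Rcpp) separates the θ-free quantities, computed once per edge, from the cheap θ-dependent dot products recomputed on each evaluation:

```python
import math

def build_edge_cache(y_obs, delta, kmax):
    """Pre-compute, for one edge variable, the quantities that are reused
    on every pseudo-likelihood evaluation: the log reference-measure ratios
    log h(y_{ij=k}) - log h(y) under the Poisson reference (Eq. 6, i.e.
    lgamma(y_obs+1) - lgamma(k+1)), and the generalized changescores
    delta(k) relative to the observed network, for k in {0, ..., kmax}."""
    log_href = [math.lgamma(y_obs + 1) - math.lgamma(k + 1)
                for k in range(kmax + 1)]
    deltas = [delta(k) for k in range(kmax + 1)]
    return log_href, deltas

def edge_logcondlik(theta, log_href, deltas, y_obs):
    """Conditional log-likelihood of the observed edge value from cached
    terms; only the theta-dependent dot products are recomputed here."""
    logw = [lh + sum(t * d for t, d in zip(theta, dl))
            for lh, dl in zip(log_href, deltas)]
    m = max(logw)  # log-sum-exp
    return logw[y_obs] - (m + math.log(sum(math.exp(lw - m) for lw in logw)))

# Dyad-independent sanity check: Sum term only, observed value 2
lh, dl = build_edge_cache(2, lambda k: [k - 2], kmax=80)
ll = edge_logcondlik([0.3], lh, dl, 2)  # should equal log Poisson(e^0.3) pmf at 2
```

Only `edge_logcondlik` runs inside the optimizer's inner loop; the factorial ratios and changescores are paid for once.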

3.3.2 Edge sum truncation and/or coarsening

Per Eq. 3, the conditional log-likelihood of each edge variable involves a sum over k ∈ ℕ₀. As noted above, we may approximate this sum by instead evaluating it over {0, …, M}, where M is large enough for the omitted terms to be dominated by the decline in the reference-measure ratio. Where the marginal distributions of the edge variables can be approximated as even roughly Poissonian, choosing M as a small multiple of the maximum observed edge value is an extremely conservative approach. (This is based on the observation that a Poisson random variable with expectation λ has a 99.9% quantile of approximately λ + 3.1√λ, assuming conservatively that the expectation is the maximum observed value.) The multiplier can be further reduced as the maximum observed value increases, because the quantile grows only as λ + O(√λ).

Truncation from above is adequate for small networks, or networks with low edge counts. However, when edge counts become extremely large, considerable computational effort may be wasted in computing conditional probabilities for small values of k when the observed value is large, or for large k when the observed value is small. Using the same Poissonian approximation, we may further improve performance by working with the edgewise doubly-truncated sum over {L_{ij}, …, M_{ij}}, with the bounds L_{ij} and M_{ij} chosen from lower and upper quantiles of a Poisson distribution centered on the observed edge value. Because it is common to have network effects that can strongly suppress edges, however, we also recommend retaining some very small edge values as a buffer. Valued social networks usually have right-skewed distributions of edge values, so adding a few small edge values can also effectively increase the coverage of the empirical distribution without a significant increase in computational load. Our code by default retains the terms for a small set of low values of k (although we encourage extending this buffer whenever possible), leading to sums over the union of the low-value buffer and {L_{ij}, …, M_{ij}}. This retains far fewer terms per sampled edge than the singly-truncated scheme, which is often a substantial savings as edge values become large.
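A sketch of how such a doubly-truncated support might be constructed (illustrative Python; the quantile constant and low-value buffer size are assumptions for illustration, not the paper's settings):

```python
import math

def support_window(y_obs, z=3.1, low_buffer=5):
    """Approximate doubly-truncated support for an edge observed at y_obs,
    treating the edge as roughly Poisson(y_obs): keep values within about
    z standard deviations (sqrt(y_obs)) of y_obs, plus a small buffer of
    low values {0, ..., low_buffer} to allow for strong edge suppression.
    z ~ 3.1 corresponds to roughly a 99.9% normal-approximation quantile."""
    lo = max(0, int(y_obs - z * math.sqrt(y_obs)))
    hi = int(math.ceil(y_obs + z * math.sqrt(y_obs)))
    return sorted(set(range(0, low_buffer + 1)) | set(range(lo, hi + 1)))

w = support_window(8000)  # a few hundred values instead of 8000+
```

Because the window width grows only as √y_obs, the savings relative to the singly-truncated scheme increase with the observed count.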

When dealing with extremely large counts, storing and computing even the doubly-truncated set of terms can become prohibitive (particularly if many edge variables are needed for adequate statistical power). In such cases, coarsening is another option. To coarsen, we select evenly spaced values from the edge support, and compute the associated contributions to the edge sum only for these terms. We also track the gap (in terms of the number of “skipped” values) between subsequent calculated terms, and weight each computed term by the number of elements in its gap; this is equivalent to approximating the sum via a step function, with knots at the computed values. Our experience with this method has been promising, although problems can ensue if the sum becomes heavily concentrated on a range of terms lying between adjacent knots. We thus do not employ this technique in this paper, although we offer it as a promising target for future research. Of course, other approximation methods are also possible (e.g., integral approximations), and may be useful in the large-count regime.
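The step-function rule can be sketched as follows (illustrative Python; as noted above, this technique is not used in the paper's reported experiments). Each evaluated knot stands in for all the skipped values up to the next knot, so a slowly varying summand is approximated well at a fraction of the cost:

```python
def coarsened_sum(f, kmax, n_knots):
    """Approximate sum_{k=0}^{kmax} f(k) by evaluating f only at n_knots
    evenly spaced knot values, weighting each evaluated term by the number
    of values it stands in for (a left step-function rule)."""
    knots = [round(i * kmax / (n_knots - 1)) for i in range(n_knots)]
    total = 0.0
    for j, k in enumerate(knots):
        nxt = knots[j + 1] if j + 1 < len(knots) else kmax + 1
        total += f(k) * (nxt - k)  # knot k stands in for values [k, nxt)
    return total

# A slowly varying summand: 51 evaluations in place of 1001
smooth = coarsened_sum(lambda k: 1.0 / (1.0 + k / 100.0), 1000, 51)
exact = sum(1.0 / (1.0 + k / 100.0) for k in range(1001))
```

The failure mode mentioned above is visible in this formulation: if nearly all of the mass of the summand falls between two adjacent knots, the step function can badly misestimate the total.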

3.3.3 Edge variable sampling

Although exact calculation of the pseudo-likelihood requires evaluating the conditional log-likelihood of every edge variable, the log pseudo-likelihood can easily be approximated by random sampling of edge variables; this reduces both storage and computational cost. We have experimented with uniform random sampling, as well as weighted (i.e., importance) sampling schemes analogous to the “Tie-No Tie” proposal method frequently used in ERGM MCMC [36] and a scheme that attempts to sample the full range of edge values uniformly. Simulation experiments suggested little if any benefit to importance sampling versus uniform random sampling (see Section 8.2 in the Appendix), however, and we thus do not pursue this direction here.

3.3.4 Parallel evaluation of conditional log-likelihoods

Because the log of the pseudo-likelihood is linearly separable, its calculation is an embarrassingly parallel problem. In practice, we divide sampled edge variables into batches, and calculate their conditional log-likelihoods independently on different cores. Since MPLE computation is dominated by the pseudo-likelihood calculation, this leads to wallclock time reductions that scale as the inverse of the number of available cores. This (combined with edge variable sampling) can make the MPLE an attractive choice for very large valued networks, especially when many cores are available.
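The batch decomposition can be sketched as follows (illustrative Python with a stand-in per-edge term; a thread pool is used only for portability here, since in interpreted languages a process pool or compiled code, such as the Rcpp-based implementation used in this paper, is what actually buys wallclock speedup):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def batch_loglik(batch):
    """Sum of (stand-in) conditional log-likelihood terms for one batch of
    sampled edge variables; in a real fit each term would be an evaluation
    of the Eq. 3 conditional for one edge."""
    return sum(-math.log(1.0 + v) for v in batch)

def parallel_pseudo_loglik(edge_values, n_workers=4):
    """Because the log pseudo-likelihood is a sum over edge variables, it
    splits into independent batches, one per worker, whose contributions
    are simply added ("embarrassingly parallel")."""
    batches = [edge_values[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(batch_loglik, batches))
```

By construction the batched result equals the serial sum (up to floating-point reordering), so parallelization changes cost but not the estimate.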

Taken together, the above allow the MPLE to be used even for large networks with highly dispersed counts (although not all of these techniques are needed when counts are less variable, or on smaller networks). As we show, MPLE is very fast and the resulting estimator can have low bias and high accuracy for valued networks; it offers high-quality calibration of uncertainty when the edge variance is large, but is prone to overconfidence for dependence terms (i.e., underestimation of the standard error) and conservatism for non-dependence terms when the edge variance is small.

4 Study design

We evaluate the above estimation techniques via a parameter recovery study, in which we generate networks from a realistic generative model based on an initial fit to real-world social networks, fit models to the simulated draws using each respective technique, and then examine the properties of the resulting estimators. Our generative model was created by fitting a valued ERGM to an empirical case (see below) using MCMLE; we then obtained 500 high-quality draws from the fitted ERGM using MCMC. For each draw, we obtained point and standard error estimates from each of the three study methods (MPLE, CD, MCMLE), evaluating the results with respect to wallclock estimation time, bias, variance of the estimator, overall accuracy, and calibration (accuracy of estimated standard errors and confidence coverage). All modeling and analysis was performed using statnet [23], specifically using the ergm 3.10.4 [28], ergm.count 3.4.0 [34], and sna 2.4 [9] libraries. Our MPLE implementation also made use of the Rcpp library [18]. The following subsections detail the data and model used, the setup of the estimation procedures, and the performance metrics.

4.1 Case study and model definition

We based our experiment on the migration-flow network among the 29 counties comprising the U.S. state of Utah over the period 2011-2015, as measured by the American Community Survey [44]. Since the variance of edge values influences both the computational load and the structure of the support space, we construct two datasets to examine how edge variance alters the performance of estimation methods for valued ERGMs. The first dataset uses the count of migrants between each directed pair of counties as the edge value, which ranges over the integers from 0 to 8,023 (standard deviation 542). Given its high variance in edge values, we call it the “difficult” case. For comparison, the edge value in the other dataset is the coarsened migrant count, where we take the natural logarithm of the migrant count (plus one) and round it to the nearest integer. Its edge values range from 0 to 9 (standard deviation 2.2), and we call it the “easy” case, with small edge variance. We also construct a third dataset of coarsened migrant counts among the 100 North Carolina counties, to compare with the easy case above in an investigation of how network size influences estimation of valued ERGMs. Results for the third dataset are reported in Section 8.1 in the Appendix.

Our ground-truth model is created by fitting a count-valued ERGM to the above networks. The sufficient statistics for the model contain a Sum term (the intercept), a Nonzero term (i.e., the count of nonzero edges), three exogenous covariates, and one dependence term (a Mutual term). The exogenous covariates are the population sizes of the sending and receiving counties (called Nodeocov and Nodeicov, respectively) and the distance between counties (Edgecov), all on natural log scale. The dependence term, mutuality, measures the reciprocity of the network, defined (in the “min” form for count data) as

(7)   g_Mutual(y) = Σ_{i<j} min(y_{ij}, y_{ji}).

Although our aim here is to produce a deliberately simple model for purposes of evaluation (as opposed to a substantively detailed model of migration), our choice of statistics was informed by prior work on migration, and previous empirical analyses on migration-flow networks in particular [8, 26, 49, 51].
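For concreteness, the min-form mutuality of Eq. 7 (the form used for count data in ergm.count's Mutual term) can be computed directly from an adjacency matrix; the following is an illustrative Python fragment with a small hand-checkable example:

```python
def mutual_min(y):
    """Min-form mutuality for a count-valued directed network: the sum,
    over unordered pairs {i, j}, of the smaller of the two directed edge
    values, so a pair contributes only to the extent flow is reciprocated."""
    n = len(y)
    return sum(min(y[i][j], y[j][i])
               for i in range(n) for j in range(i + 1, n))

y = [[0, 3, 0],
     [5, 0, 2],
     [1, 0, 0]]
# pairs: {0,1}: min(3,5)=3; {0,2}: min(0,1)=0; {1,2}: min(2,0)=0
```

On this example the statistic is 3: only the fully reciprocated portion of the 0-1 flow counts.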

Estimation for the ground-truth model was performed by MCMLE (seeded by CD), using the ergm function (with an MCMC interval of 1 million for the difficult case and 100 thousand for the easy case).² Simulation of draws from the fitted model was performed with the simulate function via MCMC (an interval of 100 million for the difficult case, and 10 million for the easy case). Diagnostics of the simulated networks indicate low serial autocorrelation, and the mean of the sufficient statistics from the simulated networks matches that of the observed networks.

² By default, statnet sets the MCMC burn-in to 16 times the length of the interval. We use this default setting unless otherwise specified.

4.2 Methods under evaluation

As described above, we estimate parameters from the simulated network draws using three procedures: Contrastive Divergence (CD), Monte Carlo Maximum Likelihood Estimation (MCMLE), and Maximum Pseudo-Likelihood Estimation (MPLE). Estimation for each method was performed as follows.

For CD, we use the default settings of the ergm.count package, performing 8 Metropolis-Hastings steps with one proposal per step. We also tried using more steps, or more proposals within each step, for CD; since this did not significantly improve its estimation quality (see Section 8.3 in the Appendix), we keep the most time-efficient setting in our comparison. For MCMLE, we made three adjustments to the default settings in ergm.count. First, we set the proposal distribution of the Metropolis-Hastings algorithm in the MCMC to be random, where every dyad has an equal chance of being toggled. By default, the proposal distribution in statnet favors toggling nonzero edges over nulls, with the rationale that (binary) social networks tend to be sparse; we remove this penalty towards nulls since the binary density of the network (0.5) is high in migration networks (removing the need for biased proposals), and the random proposal reduces the computational time of the MCMLE. Second, we set the maximum number of iterations in the MCMLE algorithm (i.e., importance sampling/estimation cycles) to 500, to offer maximum opportunity for the MCMLE to converge. When MCMLE still failed to converge, we reran the entire estimation procedure a second time, and report the fraction of these failures in the results section. Third, the default setting for the MCMC thinning interval is 1,024, which we follow for the easy (i.e., small-edge-variance) case. This chain length turned out to be too short for MCMLE to have good mixing properties in the difficult (i.e., large-edge-variance) case, so we increased it to 10,000 (which was found to be adequate in this setting). Lastly, ergm.count by default uses the point estimates from CD as starting values for MCMLE (here referred to as CD-MCMLE), which we employ for the easy/small-edge-variance case. For the difficult/large-edge-variance case, since CD-MCMLE failed to offer satisfactory estimation, we also tried seeding MCMLE with the MPLE (sampling 50%, or 406 dyads) and compare the resulting MPLE-MCMLE with CD-MCMLE.

For MPLE, we use the procedure described in Section 3.3. For the edge support truncation, we employed the conservative setting in which the support of every edge ranges from 0 to a fixed multiple of the maximum edge value in the network, with different multipliers for the easy/small-edge-variance and difficult/large-edge-variance cases. One can further speed up estimation by employing edgewise truncation of the edge support, but we found non-edgewise truncation to be fast enough in this small network. To examine the effect of edge sample size on estimation quality, we consider model fits using 406, 609, and 812 edge variables, comprising 50%, 75%, and 100% of the edges, respectively. We also consider the impact of multiple cores on execution time, calculating the wallclock time for models estimated using 1, 4, and 20 cores, respectively. In addition to random sampling, we also experimented with other sampling protocols. These did not significantly alter the performance of MPLE (see Section 8.2 in the Appendix), and we therefore focus on the randomly subsampled case.

All models were fit on a 44-core server, with 256GB RAM. The processors are dual Intel Xeon E5-2699 2.2GHz CPUs (22 cores/CPU). Estimation using R 3.6.3 was performed on Ubuntu 20.04.1. All procedures reported are based on a single core, except for the multi-core MPLE conditions.

4.3 Evaluation criteria

Since the methods of interest involve speed-quality trade-offs, it is necessary to evaluate these two dimensions simultaneously. To evaluate computational cost, we compute the wall-clock time for each method, as the mean number of seconds needed to fit the target model to a simulated network. Since the speed of the MPLE depends on the number of parallel processes, we repeat the process using 1, 4, and 20 cores, respectively.

To evaluate estimation quality, we consider the bias, variance, overall accuracy, and calibration of each estimator using the following metrics.

We first compute the absolute relative bias (ARB) of estimators for each coefficient, using

ARB(\hat{\theta}) = | (\hat{\theta} - \theta) / \theta |,    (8)

taking the average across experiments, where \hat{\theta} is the estimator and \theta is the true value from the model that simulated the networks. The smaller the ARB, the less bias is introduced by the estimation procedure.

We also compute the variability of each estimator, via the (true) standard error of the estimated coefficient. Formally,

SE(\hat{\theta}) = \sqrt{ \sum_{i=1}^{n} (\hat{\theta}_i - \bar{\hat{\theta}})^2 / (n - 1) },    (9)

where n is the number of simulated networks and \bar{\hat{\theta}} is the mean of the estimates. The smaller the variability, the more efficient and precise the estimator is.

While bias and variance are each important, we are also interested in the total accuracy of the estimator (the extent to which it deviates, on average, from the true value). We measure this via the root-mean-square error (RMSE), i.e.

RMSE(\hat{\theta}) = \sqrt{ \sum_{i=1}^{n} (\hat{\theta}_i - \theta)^2 / n }.    (10)

The smaller the RMSE, the more accurate the estimator is on average.

Finally, we consider how well calibrated each estimator is, in terms of the associated estimates of uncertainty. To evaluate the bias in our second moment estimate, we compare the true standard error and the estimated standard error using

calibration = \log( \widehat{SE}(\hat{\theta}) / SE(\hat{\theta}) ).    (11)

A positive number suggests the method is conservative, while a negative number suggests the method is overconfident. We also examine confidence coverage, specifically the proportion of cases in which the nominal 95% confidence interval (CI) for each parameter actually covers the true coefficient. Specifically, the coverage rate is computed by

coverage = (1/n) \sum_{i=1}^{n} 1\{ |\hat{\theta}_i - \theta| \le z \widehat{SE}_i \},    (12)

where z \approx 1.96 for a 95% CI and large degrees of freedom. The closer the coverage is to 95%, the better calibrated the estimate is. Coverage rates above 95% suggest that the method is conservative, while coverage rates below 95% suggest that the method is overconfident.
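The five evaluation criteria can be computed directly from the replicate fits. Below is a small Python sketch of one plausible implementation; the exact aggregation choices (averaging the absolute relative bias across replicates, and measuring calibration as the log ratio of the mean estimated SE to the true SE) are our reading of Equations (8)-(12), not code from the paper.

```python
import numpy as np

def evaluation_metrics(est, se_est, true_val, z=1.96):
    """Evaluation criteria for one coefficient across replicate fits:
    absolute relative bias (8), variability/true SE (9), RMSE (10),
    calibration as a log ratio of estimated to true SE (11), and
    95% CI coverage (12). `est` and `se_est` hold the point estimates
    and reported standard errors from each simulated network."""
    est, se_est = np.asarray(est, float), np.asarray(se_est, float)
    arb = np.mean(np.abs((est - true_val) / true_val))
    variability = np.std(est, ddof=1)                 # true SE of estimator
    rmse = np.sqrt(np.mean((est - true_val) ** 2))
    calibration = np.log(np.mean(se_est) / variability)
    coverage = np.mean(np.abs(est - true_val) <= z * se_est)
    return {"arb": arb, "variability": variability, "rmse": rmse,
            "calibration": calibration, "coverage": coverage}
```

With this log-ratio form of Equation (11), a calibration score of 2 corresponds to an estimated standard error of e^2, roughly 7.4 times its true value, matching the interpretation used in the results below.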

5 Results

5.1 The small-edge-variance (easy) case

Figures 1 and 2 display the performance of each method on the small-edge-variance data. Panel A in Figure 1 shows the absolute relative bias (ARB) of the coefficient estimates. It suggests that all methods produce very little numerical bias, less than 2.1% across all covariates and methods. Biases from the MPLE generally decrease with more edges sampled, and both the full MPLE (i.e., with all edges sampled) and MCMLE have the smallest bias, less than 1% across all covariates. This finding is consistent with research on binary ERGMs finding that the MPLE introduces little bias in first-order estimation [17, 39]. The lack of appreciable bias is an encouraging sign, suggesting that point estimates of valued ERGMs for small-edge-variance data are easy to obtain with any of the methods we tested.

Figure 1: Bias, variability, and RMSE of small-edge-variance data
Figure 2: Calibration, confidence coverage, and wall-clock time of small-edge-variance data

We also evaluate the (im)precision or variability of estimation (sometimes called efficiency), i.e. the true standard error of each estimator. Panel B in Figure 1 shows that the variability of the MPLE decreases with more edges sampled, and that the full MPLE and MCMLE are the most efficient methods. In general, the variability of the estimators is similar across methods, suggesting comparable efficiency for the small-edge-variance data.

We then evaluate the total accuracy of estimation using the root-mean-square error (RMSE). A more holistic metric, the accuracy measurement combines the bias and variability evaluated above, with smaller RMSE preferred. Panel C of Figure 1 displays RMSE scores, whose distribution is almost identical to the variability scores in Panel B. The similarity between RMSE and variability reveals that biases contribute very little to the total RMSE, with accuracy being dominated by variability. For valued ERGMs on small-edge-variance data, methods with good variability scores thus also have decent accuracy. The full MPLE and MCMLE have the smallest RMSE, though the RMSEs of all methods under evaluation differ only mildly.

Besides performance in coefficient estimation, we also evaluate performance in estimating uncertainties, shown in the first two panels of Figure 2. Panel D displays the calibration of each method, an indicator of the bias in standard error estimation: a positive calibration score indicates overestimation of uncertainty, and a negative score indicates underestimation. Noticeably, CD overestimates standard errors for all covariates by a large margin; CD’s calibration scores are all above 2, indicating that the estimated standard error is more than 7 (e² ≈ 7.4) times its true value. We experimented with different tuning parameters for CD, but its calibration did not improve to a reasonable level (see Section 8.3 in the Appendix). Hence, we conclude that in this small-edge-variance case, CD is too conservative to offer useful information about the uncertainties of covariates for valued ERGMs.

For the MPLE, we find that it underestimates the uncertainty of the dependence term (mutual), but overestimates uncertainties for the non-dependence terms, though the degree of inflation is smaller. Previous simulation studies found similar patterns for the MPLE on binary ERGMs [35, 17]. Interestingly, the bias in standard error estimation increases with the sample size for the MPLE. This is in part because the bias in estimating statistical uncertainty is trivial when numerical uncertainty is the main source of uncertainty for the MPLE (as is the case with small sample sizes); as numerical uncertainty decreases with more edges sampled, the bias in statistical uncertainty becomes non-negligible. The calibration of the MPLE is not as far off as that of CD, but caveats are warranted when interpreting its standard error estimates for dependence terms.

While CD and the MPLE show varying degrees of error in their standard error estimates, MCMLE is almost perfectly calibrated, suggesting that MCMLE is the best method for standard error estimation on small-edge-variance data.

Another metric that incorporates uncertainty estimation is confidence coverage: the proportion of model fits in which the true value of a given coefficient is covered by the estimated 95% confidence interval (CI), shown in Panel E of Figure 2, where the dotted line is the 95% reference line and the solid line represents 100%. The figure tells a similar story to Panel D’s calibration scores, since coverage performance is largely determined by the calibration of uncertainty when coefficient bias is small. CD overestimates standard errors so much that its CIs always cover the true value, making them conservative but uninformative. The MPLE’s CIs cover the true values more than 95% of the time for non-dependence terms, but under-cover the true value for the dependence term, with this deviation becoming larger as sample size increases; even so, CI performance remains reasonable for the dependence term (above 80% for a nominal 95% CI). MCMLE, on the other hand, has coverage rates extremely close to 95%, showing its characteristic calibration advantage for small-edge-variance data in valued ERGM estimation.

Lastly, Panel F of Figure 2 displays the wall-clock time of each method. As expected, the wall-clock time for computing the MPLE can be greatly reduced by using a sample of edges to approximate the joint pseudolikelihood function, or by using multiple cores to calculate conditional likelihoods. The fastest methods are CD and the MPLE using 20 processors. Interestingly, MCMLE performs reasonably in estimation time, being just one order of magnitude slower than the fastest methods. This, together with its excellent estimation quality, suggests that MCMLE is the ideal method for valued ERGM estimation on small-edge-variance data.

5.2 The large-edge-variance (difficult) case

Figures 3 and 4 display the performance of each method on the large-edge-variance data. For contrastive divergence, looking across the metrics (all displayed in log scale) in Figure 3, we find that CD introduces very large numerical biases (Panel A), has far higher variance than the other methods (Panel B), and produces much larger errors (Panel C). Together, these suggest that CD’s coefficient estimates are uninformative and inefficient for large-edge-variance data. Panel D in Figure 4 reveals that CD overestimates uncertainties for every covariate except the nonzero term, whose uncertainty is underestimated by a large margin; Panel E shows that CD’s confidence intervals rarely cover the true coefficient values except for the dependence term, whose coverage might be attributed to the overestimation of its uncertainty. Together, these indicate that CD fails to offer useful estimates of both coefficients and standard errors for the large-edge-variance data. Section 8.3 in the Appendix shows that increasing the tuning parameters does not significantly alter CD’s poor performance.

Figure 3: Bias, variability, and RMSE of large-edge-variance data
Figure 4: Calibration, confidence coverage, and wall-clock time of large-edge-variance data

Results for CD-MCMLE suggest that following CD with MCMLE generally improves estimation quality, but performance is still far from ideal. For point estimation, CD-MCMLE estimators overall introduce less bias, variance, and error than CD, usually by a large margin. However, CD-MCMLE still shows a remarkable 214% bias in estimating the nonzero term, whose variance and error under CD-MCMLE are even higher than under CD. Uncertainty estimation by CD-MCMLE is also poorly calibrated, with underestimated standard errors for all covariates. Its confidence intervals likewise show less than optimal coverage. These results indicate that CD-MCMLE is not an effective estimation approach for valued ERGMs when the edge variance is large.

The MPLE, by contrast, shows excellent performance on all metrics compared to CD and CD-MCMLE. The MPLE shows very little bias; the largest bias appears in the estimate of the dependence term, and is still less than 0.5% with only 406 dyads sampled (smaller even than the biases of any estimator in the small-edge-variance data). The variances and errors of the MPLE are also far smaller than those of CD or CD-MCMLE, and they decrease as more edges are sampled (though sample size has little impact on these metrics). Figure 4 further reveals that the MPLE is impressively well calibrated for uncertainty (including for the dependence term), with at most 2.5% bias in standard error estimation. The tiny bias in both coefficient and standard error estimation puts the MPLE’s confidence coverage very close to the 95% benchmark; the only noticeable deviation occurs for the nonzero term, where the full MPLE’s CIs cover the true value 90% of the time, slightly below the 95% benchmark.

Overall, we thus see that, for the large-edge-variance case, CD and CD-MCMLE perform poorly, while the MPLE offers high-quality estimates. It is striking that MCMLE performs excellently for the small-edge-variance data, but seems to fail badly when the edge variance becomes large. One possibility is that the high bias of CD for large-variance data leads the MCMLE to start from a point far from the maximizer, and the Geyer-Thompson method underlying MCMLE is known to work poorly in regimes far from the maximizer [27]. Indeed, CD-MCMLE failed to converge within 500 iterations for 3 of the 500 sample networks in the first round, and was given a second attempt to achieve convergence. Since the MPLE offers accurate point estimates, we experimented with using the 406-dyad MPLE as starting values for MCMLE (called here MPLE-MCMLE).

Seeding MCMLE with the MPLE greatly improves its performance compared to CD-MCMLE. MPLE-MCMLE has the smallest bias, variability, and RMSE, as Figure 3 shows. Figure 4 suggests that MPLE-MCMLE also calibrates uncertainty well, and offers confidence coverage very close to the 95% benchmark. This reveals that, as with ERGMs for binary networks, the good performance of (Geyer-Thompson) MCMLE relies heavily on a good seed for valued ERGMs. When the edge variance is large, seeding MCMLE with the MPLE is substantially better than seeding with CD, given the difference in their estimation quality.

Panel F in Figure 4 shows the wall-clock time of the different methods, with the main figure displaying it on the original scale and the inset in the upper right on the log scale. We find that CD-MCMLE is orders of magnitude slower than the rest of the methods, and that seeding MCMLE with the MPLE not only improves its estimation quality, but also reduces the computational time by a factor of nearly 15. That said, MPLE-MCMLE is still about 10 times slower than the MPLE when the latter uses only 1 processor. Based on the previous experiments, the MPLE can be made orders of magnitude faster still by employing multiple processors on large networks, and is therefore the fastest method for estimating valued ERGMs with large edge variance.

The last piece of the puzzle for the large-edge-variance data concerns the comparison between MPLE and MPLE-MCMLE. A closer look at point estimation performance suggests that MPLE-MCMLE has lower bias, variance, and error than the MPLE, but the difference is tiny, and all MPLE variants already perform very well on these dimensions. Their calibration scores are close to each other, while MPLE-MCMLE has better confidence coverage for the dependence term, which is almost the only noticeable flaw of the MPLE. In terms of wall-clock time, MPLE-MCMLE is at least an order of magnitude slower than the MPLE when multi-core computation is not available. In general, the MPLE offers high-quality estimates in the cases studied here, with MPLE-MCMLE offering even better estimates at greatly increased computational cost. This suggests a strategy of employing the MPLE as a default tool for estimating count-valued ERGM parameters on large-edge-variance data, with strategies such as increasing the MPLE sample size or following the MPLE with MCMLE employed to improve estimation quality when computationally feasible.

6 Discussion

Overall, our experiments reveal that the variance of edge values makes a substantial difference in the performance of estimation methods for count-valued ERGMs. Meanwhile, supplemental experiments suggest that network size does not alter the relative advantages of these methods, as we see similar performance on the small-edge-variance Utah data (N=29) and on the small-edge-variance North Carolina data (N=100) (see Section 8.1 in the Appendix). For small-edge-variance data, all methods perform very well in point estimation, while CD greatly overestimates uncertainty, and the MPLE underestimates the uncertainties of dependence terms but overestimates them for non-dependence terms. Wall-clock times of the methods are close to each other, with the fastest methods being CD and the MPLE with subsampling and parallel computing.

For large-edge-variance data, CD fails to offer reliable estimates of both coefficients and standard errors, and so does MCMLE seeded by CD. The MPLE, on the other hand, offers high-quality estimates of both coefficients and, surprisingly, standard errors as well. MCMLE seeded by the MPLE offers even better estimation, although it is an order of magnitude slower than even the slowest MPLE configuration; the MPLE itself can be further accelerated by subsampling edge variables and/or parallel computing.

These results suggest that for small-edge-variance data, MCMLE can serve as the standard estimation method, given its excellent estimation quality and reasonably fast computation. CD, as the most computationally efficient method, can be used for tasks that only require point estimates of coefficients, such as exploratory analysis, prediction, or estimating models for network simulation. For tasks where MCMLE is too expensive, such as high-dimensional models or large networks, the MPLE with subsampling and parallel computing is a feasible option, although caveats apply when interpreting its standard error estimates. For networks with large edge variance, the MPLE can be the go-to method, with reliable first- and second-order estimates and the fastest computation. If time allows, one can increase the MPLE sample size, or refine the MPLE with MCMLE, to further enhance estimation quality.

The experiments reveal that MCMC-based estimation methods perform worse as the edge variance becomes large. One potential reason is their difficulty in reproducing the dichotomized density of the networks (i.e., the proportion of nonzero edges). With a larger range of edge values, toggles between zero and nonzero values become less likely, and the difficulty of matching the target density increases. (This is, of course, a special case of zero-inflation, a common phenomenon in count data models.) This is reflected in the traceplots of CD-MCMLE for the large-edge-variance data, where the count of nonzero edges in our MCMC simulations was always far from the observed value. The issue can be worsened by a mismatch between the MCMC algorithm design and the typical properties of valued networks. Compared to their binary counterparts, valued social networks are frequently denser (in dichotomized density). However, the MCMC algorithms in existing software are optimized for sparse, unvalued social networks. We observed an improvement in computational time when switching from the proposal that favors toggling nonzero edges to the random proposal, which offers the highest probability of toggling empty edges among the existing algorithms. To improve the performance of MCMC-based methods for valued ERGMs, future research could consider experimenting with MCMC algorithms that pay more attention to toggling between zero and nonzero values, such as proposals that favor toggling zero-valued edges.

As with any simulation study, one trades off the “realism” of performance on a realistic case against some degree of generality. Although we vary the network size and the distribution of edge values to emulate different application settings, we cannot rule out the possibility that some methods studied here may perform better or worse under other conditions. We encourage future research using simulation studies based on different use cases and model specifications, including non-Poissonian reference measures. Given that the variance of edge values plays an important role in the performance of valued ERGM estimation, it would be of interest to experiment with a more fine-grained classification of the scale of edge variance, in search of a rule of thumb for when MCMLE or MPLE would be the better choice.

Our study also suggests the continuing relevance and necessity of the MPLE for ERGM methodology. Our implementation of the MPLE for valued ERGMs enables estimation for large-edge-variance data in feasible time and with high-quality results. With good overall accuracy, high speed, and excellent tunability, the MPLE would be an excellent general-use estimation method for valued ERGMs if its calibration could be improved for small-edge-variance networks. Our findings suggest the value of work on methods that may further improve the calibration of the MPLE in the count-valued case; such advances may build on methods shown to improve calibration for binary ERGMs, such as bootstrap resampling [14, 39] and regularization [17].

7 Conclusion

ERGMs, especially for valued networks, can be computationally expensive to estimate. In search of a fast and reliable computational method, we implemented the MPLE for count-valued ERGMs, and performed a comparative simulation study of three methods: CD, MCMLE, and MPLE. We found that the variance of edge values is critical in determining the performance of computational methods for valued ERGMs, while network size is not. For small-edge-variance networks, point estimates are easy to obtain with any of the methods, while CD greatly overestimates uncertainties, and the MPLE underestimates them for dependence terms and overestimates them for non-dependence terms. All methods have similar wall-clock times. For large-edge-variance networks, CD and CD-MCMLE perform poorly, while MPLE and MPLE-MCMLE offer strong performance in estimating both coefficients and uncertainties, with the MPLE being orders of magnitude faster than MPLE-MCMLE.

On the basis of this study, we recommend that researchers choose computational methods based on the variance of edge values. For small-edge-variance data, MCMLE should be the default method, although CD is suitable for point estimation; the MPLE is well-suited for large networks and high-dimensional models, especially when many processors are available, but caveats apply when interpreting its standard errors. For large-edge-variance networks, the MPLE is a solid method, and researchers can choose the edge sample size and decide whether to follow the MPLE with MCMLE based on the computational resources at hand and the required estimation quality. In conclusion, this paper offers an MPLE implementation that enables estimation of valued ERGMs for networks with large size or high edge variance, along with guidance for choosing and tuning computational methods for valued ERGM estimation.

Acknowledgements

This research was supported by NSF award SES-1826589.

References

  • Aicher et al. [2014] Aicher, C., Jacobs, A. Z. and Clauset, A. (2014) Learning latent block structure in weighted networks. Journal of Complex Networks, 3, 221–248.
  • An [2016] An, W. (2016) Fitting ERGMs on big networks. Social Science Research, 59, 107–119.
  • Anderson et al. [1999] Anderson, C., Wasserman, S. and Crouch, B. (1999) A p* primer: Logit models for social networks. Social Networks, 21, 37–66.
  • Asuncion et al. [2010] Asuncion, A., Liu, Q., Ihler, A. and Smyth, P. (2010) Particle filtered MCMC-MLE with connections to contrastive divergence. In International Conference on Machine Learning (ICML).
  • Bernard et al. [1979] Bernard, H. R., Killworth, P., and Sailer, L. (1979) Informant accuracy in social networks IV: A comparison of clique-level structure in behavioral and cognitive network data. Social Networks, 2, 191–218.
  • Besag [1974] Besag, J. (1974) Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society: Series B (Methodological), 36, 192–225.
  • Bhamidi et al. [2011] Bhamidi, S., Bresler, G. and Sly, A. (2011) Mixing time of exponential random graphs. The Annals of Applied Probability, 2146–2170.
  • Boyle et al. [2014] Boyle, P., Halfacree, K. and Robinson, V. (2014) Exploring Contemporary Migration. Abingdon, United Kingdom: Routledge.
  • Butts [2008] Butts, C. T. (2008) Social Network Analysis with sna. Journal of Statistical Software, 24, 1–51.
  • Butts [2019] — (2019) A dynamic process interpretation of the sparse ERGM reference model. Journal of Mathematical Sociology, 43, 40–57.
  • Butts [2020] — (2020) A dynamic process reference model for sparse networks with reciprocity. Journal of Mathematical Sociology, forthcoming.
  • Butts et al. [2007] Butts, C. T., Petrescu-Prahova, M. and Cross, B. R. (2007) Responder communication networks in the World Trade Center Disaster: Implications for modeling of communication within emergency settings. Journal of Mathematical Sociology, 31, 121–147.
  • Dekker et al. [2007] Dekker, D., Krackhardt, D. and Snijders, T. A. B. (2007) Sensitivity of MRQAP tests to collinearity and autocorrelation conditions. Psychometrika, 72, 563–581.
  • Desmarais and Cranmer [2012a] Desmarais, B. and Cranmer, S. (2012a) Statistical mechanics of networks: Estimation and uncertainty. Physica A: Statistical Mechanics and its Applications, 391, 1865–1876.
  • Desmarais and Cranmer [2012b] Desmarais, B. A. and Cranmer, S. J. (2012b) Statistical Inference for Valued-Edge Networks: The Generalized Exponential Random Graph Model. PLoS ONE, 7, e30136.
  • Drabek et al. [1981] Drabek, T. E., Tamminga, H. L., Kilijanek, T. S. and Adams, C. R. (1981) Managing Multiorganizational Emergency Responses: Emergent Search and Rescue Networks in Natural Disaster and Remote Area Settings. No. Monograph 33 in Program on Technology, Environment, and Man. Boulder, CO: Institute of Behavioral Sciences, University of Colorado.
  • van Duijn et al. [2009] van Duijn, M. A. J., Gile, K. J. and Handcock, M. S. (2009) A framework for the comparison of maximum pseudo-likelihood and maximum likelihood estimation of exponential family random graph models. Social Networks, 31, 52–62.
  • Eddelbuettel et al. [2011] Eddelbuettel, D., François, R., Allaire, J., Ushey, K., Kou, Q., Russel, N., Chambers, J. and Bates, D. (2011) Rcpp: Seamless r and c++ integration. Journal of statistical software, 40, 1–18.
  • Faust [2011] Faust, K. (2011) Animal social networks. The SAGE handbook of social network analysis, 148, 166.
  • Geyer and Thompson [1992] Geyer, C. J. and Thompson, E. A. (1992) Constrained Monte Carlo Maximum Likelihood for Dependent Data. Journal of the Royal Statistical Society: Series B (Methodological), 54, 657–683.
  • Granovetter [1973] Granovetter, M. S. (1973) The Strength of Weak Ties. American Journal of Sociology, 78, 1360–1380.
  • Handcock [2003] Handcock, M. S. (2003) Statistical models for social networks: Inference and degeneracy. In Dynamic Social Network Modeling and Analysis (eds. R. Breiger, K. M. Carley and P. Pattison), 229–240. Washington, DC: National Academies Press.
  • Handcock et al. [2008] Handcock, M. S., Hunter, D. R., Butts, C. T., Goodreau, S. M. and Morris, M. (2008) statnet: Software Tools for the Representation, Visualization, Analysis and Simulation of Network Data. Journal of Statistical Software, 24, 1548–7660.
  • Hinton [2002] Hinton, G. E. (2002) Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14, 1771–1800.
  • Hoff et al. [2002] Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002) Latent space approaches to social network analysis. Journal of the American Statistical Association, 97, 1090–1098.
  • Huang and Butts [2020] Huang, P. and Butts, C. T. (2020) Intercounty Migration in an Immobile Era: A Network Perspective. Annual Meeting of the American Sociological Association.
  • Hummel et al. [2012] Hummel, R. M., Hunter, D. R. and Handcock, M. S. (2012) Improving Simulation-Based Algorithms for Fitting ERGMs. Journal of Computational and Graphical Statistics, 21, 920–939. URL: https://doi.org/10.1080/10618600.2012.679224.
  • Hunter et al. [2008] Hunter, D. R., Handcock, M. S., Butts, C. T., Goodreau, S. M. and Morris, M. (2008) ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks. Journal of statistical software, 24, nihpa54860.
  • Hunter et al. [2012] Hunter, D. R., Krivitsky, P. N. and Schweinberger, M. (2012) Computational statistical methods for social network models. Journal of Computational and Graphical Statistics, 21, 856–882.
  • Krackhardt [1988] Krackhardt, D. (1988) Predicting with networks: Nonparametric multiple regression analyses of dyadic data. Social Networks, 10, 359–382.
  • Krivitsky [2012] Krivitsky, P. N. (2012) Exponential-family random graph models for valued networks. Electronic journal of statistics, 6, 1100–1128.
  • Krivitsky [2017] — (2017) Using contrastive divergence to seed Monte Carlo MLE for exponential-family random graph models. Computational Statistics & Data Analysis, 107, 149–161.
  • Krivitsky and Butts [2013] Krivitsky, P. N. and Butts, C. T. (2013) Modeling valued networks with statnet. The Statnet Development Team.
  • Krivitsky et al. [2012] Krivitsky, P. N., Handcock, M. S. and Hunter, D. R. (2012) Package ‘ergm.count’. Journal of Statistics, 6, 1100–1128.
  • Lubbers and Snijders [2007] Lubbers, M. J. and Snijders, T. A. B. (2007) A comparison of various approaches to the exponential random graph model: A reanalysis of 102 student networks in school classes. Social Networks, 29, 489–507.
  • Morris et al. [2008] Morris, M., Handcock, M. S. and Hunter, D. R. (2008) Specification of exponential-family random graph models: Terms and computational aspects. Journal of Statistical Software, 24, 1–24.
  • Nowicki and Snijders [2001] Nowicki, K. and Snijders, T. A. B. (2001) Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96, 1077–1087.
  • Robins et al. [1999] Robins, G. L., Pattison, P. E. and Wasserman, S. (1999) Logit models and logistic regressions for social networks, iii. valued relations. Psychometrika, 64, 371–394.
  • Schmid and Desmarais [2017] Schmid, C. S. and Desmarais, B. A. (2017) Exponential random graph models with big networks: Maximum pseudolikelihood estimation and the parametric bootstrap. In 2017 IEEE International Conference on Big Data (Big Data), 116–121.
  • Simpson et al. [2013] Simpson, S. L., Bowman, F. D. and Laurienti, P. J. (2013) Analyzing complex functional brain networks: Fusing statistics and network science to understand the brain. Statistics surveys, 7, 1–36.
  • Snijders [2002] Snijders, T. A. B. (2002) Markov Chain Monte Carlo Estimation of Exponential Random Graph Models. Journal of Social Structure, 3, 1–40.
  • Strauss and Ikeda [1990] Strauss, D. and Ikeda, M. (1990) Pseudolikelihood Estimation for Social Networks. Journal of the American Statistical Association, 85, 204–212.
  • Ulibarri and Scott [2017] Ulibarri, N. and Scott, T. A. (2017) Linking Network Structure to Collaborative Governance. Journal of Public Administration Research and Theory, 27, 163–181.
  • U.S. Census Bureau [2018] U.S. Census Bureau (2018) County-to-County Migration Flows: 2011-2015 ACS. https://www.census.gov/data/tables/2015/demo/geographic-mobility/county-to-county-migration-2011-2015.html. Accessed: 1/15/2019.
  • Vega Yon et al. [2021] Vega Yon, G. G., Slaughter, A. and de la Haye, K. (2021) Exponential random graph models for little networks. Social Networks, 64, 225–238.
  • Vu et al. [2013] Vu, D. Q., Hunter, D. R. and Schweinberger, M. (2013) Model-based clustering of large networks. The Annals of Applied Statistics, 7, 1010 – 1039.
  • Wang et al. [2009] Wang, P., Robins, G. and Pattison, P. (2009) PNet: Program for the Simulation and Estimation of Exponential Random Graph (p*) Models. The University of Melbourne.
  • Ward et al. [2013] Ward, M. D., Ahlquist, J. S. and Rozenas, A. (2013) Gravity’s Rainbow: A dynamic latent space model for the world trade network. Network Science, 1, 95–118.
  • Windzio [2018] Windzio, M. (2018) The network of global migration 1990–2013. Social Networks, 53, 20–29.
  • Windzio et al. [2019] Windzio, M., Teney, C. and Lenkewitz, S. (2019) A network analysis of intra-EU migration flows: how regulatory policies, economic inequalities and the network-topology shape the intra-EU migration space. Journal of Ethnic and Migration Studies, 1–19.
  • Zipf [1946] Zipf, G. K. (1946) The P1 P2/D Hypothesis: On the Intercity Movement of Persons. American Sociological Review, 11, 677–686.

8 Appendix

8.1 Performance on a larger network with small-edge-variance

We also conducted the experiment on the coarsened migrant counts among 100 North Carolina counties, following the same procedure as in the main experiments. Having found that the variance of edge values makes a substantial difference in the performance of estimation methods for valued ERGMs, this experiment, in comparison with the small-edge-variance Utah migration data, aims to study the influence of network size. (Ideally, we would also experiment with large networks with large edge variance. Unfortunately, this turned out to be prohibitively computationally expensive for a simulation study, and we were unable to pursue it.) The North Carolina migration network has 100 nodes, and its binary density is 0.36, lower than that of the Utah migration network. The edge values range from 0 to 4, obtained by taking the base-10 logarithm of the migrant count and rounding to the nearest integer (standard deviation 0.81). We made three modifications to the setting for the small-edge-variance Utah case, both to better fit the North Carolina data and as robustness checks. Since the larger network has lower binary density, we use the default tie-dyad sampling proposal for MCMC in MCMLE estimation (which favors sampling nonzero edges for toggling), whereas the Utah data used the random sampling proposal. We also remove the nonzero term and replace it with a dependence term: a valued version of the mixed-2-star, which calculates the total “migration current” flowing through (into and out of) each node [26]. Lastly, the larger network contains more dyads, so for MPLE we set the edge sample size at 1,000, 5,000, and 9,900, taking 10%, 50%, and 100% of the total dyads, respectively.
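The log10 coarsening of the migrant counts described above can be sketched as follows; the handling of zero flows (mapped to 0) is our assumption, since both zero flows and very small counts coarsen to 0 under rounding.

```python
import numpy as np

def coarsen_counts(flows):
    """Coarsen raw migrant counts to the rounded base-10 log magnitude;
    zero flows stay 0. Note that counts of 1-3 also coarsen to 0, since
    round(log10(x)) = 0 for x in [1, 3]."""
    flows = np.asarray(flows, dtype=float)
    out = np.zeros_like(flows)
    pos = flows > 0
    out[pos] = np.rint(np.log10(flows[pos]))
    return out.astype(int)
```

For example, flows of 0, 9, 100, and 20,000 migrants coarsen to magnitudes 0, 1, 2, and 4, respectively, producing the 0-4 edge-value range reported for the North Carolina network.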

Figure 5: Bias, variability, and RMSE for small-edge-variance, large network data
Figure 6: Calibration, confidence coverage, and wall-clock time for small-edge-variance, large network data

Figures 5 and 6 show the results of the experiment on the small-edge-variance, large network. The results generally resemble those for the small-edge-variance small network. Figure 5 indicates that all methods introduce very little numerical bias; the variance of MPLE decreases as more edges are sampled, and full MPLE and MCMLE are the most efficient. RMSE again follows a distribution very similar to that of the variability, suggesting that bias contributes little to the overall error. Panel D of Figure 6 shows that CD also greatly overestimates the uncertainty of all covariates in the large network; MPLE again slightly overestimates the uncertainty of the non-dependence terms but underestimates the uncertainty of both dependence terms. As a result, CD's 95% confidence intervals cover the true values all the time, while MPLE over-covers the true values of the non-dependence terms and under-covers those of the dependence terms. MCMLE again offers high-quality uncertainty estimates and confidence coverage. Lastly, the wall-clock times suggest that MCMLE remains reasonably economical, and CD is the fastest method. The main difference from the small network case is that MPLE is relatively slower, owing to the larger sample of edge variables. The larger sample size also reduces the bias of MPLE, making the 1000-dyad MPLE less biased than CD. Overall, we conclude that in our experiment, network size makes little difference in the performance of estimation methods for valued ERGMs.

8.2 Performance of MPLE with Different Sampling Methods

Figure 7 shows the accuracy, as measured by RMSE, of MPLE using three edge-variable sampling methods in the small-edge-variance case. The methods are random sampling, which is the one reported in the Results section; TNT (“Tie-No-Tie”) sampling, which places equal total weight on sampling a zero-valued (empty) edge versus a nonzero-valued edge; and “flatval” sampling, which places equal total weight on each possible edge value. The figure suggests that these three sampling methods introduce similar levels of error, with slightly lower accuracy for flatval sampling when 406 dyads are sampled. We observed similar performance across the other dimensions we evaluated, and conclude that the choice of sampling method does not substantially alter the performance of MPLE for estimation of valued ERGMs.

Figure 7: Accuracy of MPLE with Different Sampling Methods
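The three schemes can be sketched as weighted dyad sampling (a hypothetical illustration; the function and variable names are ours, and the exact weighting used in the implementation may differ):

```python
import random
from collections import Counter

def sample_dyads(values, n, method="random", rng=None):
    """Sample n dyad indices for MPLE under three weighting schemes.

    - "random":  uniform weight on every dyad
    - "tnt":     half the total weight on zero-valued (empty) dyads,
                 half on nonzero-valued dyads
    - "flatval": equal total weight on each distinct edge value
    """
    rng = rng or random.Random(42)
    if method == "random":
        w = [1.0] * len(values)
    elif method == "tnt":
        n_zero = sum(v == 0 for v in values)
        n_pos = len(values) - n_zero
        w = [0.5 / n_zero if v == 0 else 0.5 / n_pos for v in values]
    elif method == "flatval":
        counts = Counter(values)
        w = [1.0 / (len(counts) * counts[v]) for v in values]
    else:
        raise ValueError(method)
    return rng.choices(range(len(values)), weights=w, k=n)
```

Under "tnt", the few nonzero dyads of a sparse network are sampled far more often than under uniform sampling, and "flatval" further balances the sample across distinct edge values.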

8.3 Estimation Quality of Contrastive Divergence with Different Parameters

Section 5 showed that Contrastive Divergence (CD) is uninformative in standard error estimation for the small-edge-variance data, and uninformative in coefficient estimation for the large-edge-variance data. To determine whether tuning would improve its performance, we varied the two main parameters of CD and examined the results. The main tuning parameters for CD are steps, the number of Metropolis-Hastings steps, and multiplicity, the number of proposals per step. The default setting in the ergm.count package is 8 steps and a multiplicity of 1, which was the setting reported in Section 5. Here we tried increasing either or both parameters tenfold. Table 1 shows calibration and timing of the CD variants compared to MCMLE for the small-edge-variance data. It suggests that increasing steps does help reduce CD's overestimation of uncertainty (while increasing multiplicity does not), yet CD still overestimates the standard errors by a factor of more than 2.5. Meanwhile, MCMLE has much better calibration while consuming a similar amount of time. Hence, we conclude that for small-edge-variance data, it is faster and better to remedy CD's conservatism by following it with MCMLE than by increasing the steps or multiplicity of CD.

Table 2 reports absolute relative bias (in percent) and wall-clock time of the CD variants compared to MPLE with 406 sampled dyads and no parallel computing, for the large-edge-variance data. It indicates that increasing the steps or multiplicity of CD does not consistently reduce the bias for each covariate. The table also shows that all CD variants have much greater biases than MPLE for every covariate, while MPLE's computation time is on the same scale as CD's and can be further reduced by using more than one processor. MPLE thus offers much better coefficient estimation than CD, with no time penalty, for the large-edge-variance data. In summary, this section shows that increasing the tuning parameters does not significantly improve CD's overestimation of uncertainty for the small-edge-variance data, or its large coefficient-estimation bias for the large-edge-variance data.

                       ------ Contrastive Divergence ------    MCMLE
Steps                     8       80        8       80
Multiplicity              1        1       10       10
Sum                    2.50     1.32     3.05     2.05          0.03
Nonzero                2.05     1.25     2.82     0.96          0.09
Edgecov                2.55     1.34     3.09     2.03         -0.04
Nodeicov               2.34     1.30     3.01     1.87         -0.02
Nodeocov               2.37     1.23     3.05     1.90          0.01
Mutual                 2.03     1.13     2.85     1.61          0.01
Wall-clock time (s)    1.08    11.68     3.42    46.58         13.55
Table 1: Calibration and time of CD vs. MCMLE for small-edge-variance data
                       ------ Contrastive Divergence ------    MPLE
Steps                     8       80        8       80      (406 dyads,
Multiplicity              1        1       10       10        1 core)
Sum (%)                87.6    129.2    137.6    101.3          0.2
Nonzero (%)           377.7     13.3      4.9    108.5          0.1
Edgecov (%)            94.2    186.7    110.2     70.8          0.1
Nodeicov (%)           68.4    119.1    114.8     89.7          0.1
Nodeocov (%)           68.0    116.3    116.6     89.5          0.1
Mutual (%)             76.9    237.5     65.1     37.1          0.4
Wall-clock time (s)    31.8     39.7     34.7    103.3         79.7
Table 2: Bias and time of CD vs. MPLE for large-edge-variance data
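For intuition, the roles of steps and multiplicity can be sketched on a toy count-valued model with a single "sum" statistic (a schematic of generic CD-k, not the ergm.count implementation; all names, the toy model, and the handling of multiple proposals are our assumptions):

```python
import math
import random

def cd_update(theta, y_obs, steps=8, multiplicity=1, lr=0.05, rng=None):
    """One contrastive-divergence parameter update for a toy model
    p(y) proportional to exp(theta * sum(y)) over nonnegative counts.

    A short Metropolis-Hastings chain of `steps` steps starts at the
    observed network; each step draws `multiplicity` candidate +/-1
    edge perturbations and attempts one of them. The parameter then
    moves along the observed-minus-sampled statistic difference.
    """
    rng = rng or random.Random(1)
    y = list(y_obs)                          # chain starts at the data
    for _ in range(steps):
        proposals = []
        for _ in range(multiplicity):
            i = rng.randrange(len(y))
            delta = rng.choice([-1, 1])
            if y[i] + delta >= 0:            # keep counts nonnegative
                proposals.append((i, delta))
        if not proposals:
            continue
        i, delta = rng.choice(proposals)     # attempt one candidate
        # Metropolis acceptance for the change in the sum statistic
        if math.log(rng.random()) < theta * delta:
            y[i] += delta
    # gradient ascent on the CD objective: observed minus sampled stats
    return theta + lr * (sum(y_obs) - sum(y))
```

In this schematic, more steps let the chain drift further from the observed network, so the stochastic gradient better approximates the likelihood gradient at proportional cost, whereas multiplicity only adds candidates within each step, consistent with the pattern in Tables 1 and 2.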