1 Introduction
Binary relations (relations in which edges can be approximated as simply "present" or "absent") form the backbone of the social network field, with decades of theoretical, methodological, and empirical progress in understanding their structure and function. Valued relations, while by no means neglected, are less well understood, and our tools for studying them are less well developed. Yet the "strength of social ties" is at the core of many scientific questions in a range of social settings [21]. Examples of valued relations include the frequency of interaction in interpersonal contact networks [5], communication volume within and among organizations [16, 12], encounters among nonhuman animals [19], and trade flows among nations [48]. The need for rich information about social relations is particularly acute for networks involving interactions among aggregate entities such as nations, geographical areas, gangs, or formal organizations: because ties in such networks are themselves frequently aggregations of lower-level interactions, one's interest often lies not in the mere existence of trade, migration, homicide, communication, or other interactions, but in their volume, frequency, or other quantitative features. In such settings, modeling edge states is of considerable substantive importance.
The earliest statistical modeling of valued networks was accomplished via network regression methods [30]; these provide only least-squares estimates of covariate effects, although autocorrelation-robust null hypothesis tests for such effects are well known [13], and some generalization via generalized linear models (GLMs) and related techniques is possible. Some forms of dependence can, further, be controlled semiparametrically using latent structure models [e.g., 37, 25, 46, 1], allowing estimation of covariate effects while accounting for unobserved mechanisms that can be written in terms of mixing on unobserved variables. Parametric models for valued graphs with general classes of dependence effects have been longer in coming, the current state of the art being exponential family random graph models (ERGMs) defined on sets of valued graphs
[15, 31, 33] (but see also Robins et al. [38] for a pioneering example using categorical data and pseudolikelihood estimation). Although ERGMs for valued graphs are not complete in the sense that they are for unvalued graphs (i.e., for most types of edge values, it is not always possible to write an arbitrary distribution on the order-$n$ valued graphs in ERGM form), they are still highly general families, able to flexibly specify a wide range of effects. Since their introduction, they have been applied in a number of settings, ranging from networks of collaboration in government and networks of migration flows to networks of functional connectivity between brain regions [26, 40, 43, 49].

Notwithstanding their broad applicability, parameter estimation for ERGMs is often computationally demanding, a problem that is especially acute for valued networks. The major computational cost comes from the normalizing factor in the ERGM likelihood function, which is generally an intractable function involving the sum or integral of an exponentiated potential over the set of all possible network configurations. Although much is made of the fact that these sums have too many elements to evaluate explicitly [except in the case of extremely small unvalued graphs, e.g. 45], this is not the major obstacle to computation: rather, the difficulty arises from the extreme roughness (i.e., high variance) of the exponentiated potential over the support, which (in the absence of an explicit analytical solution) renders naive attempts at numerical approximation ineffective. This problem can be amplified in the valued case, particularly where edge values vary greatly; valued edges can also pose challenges for some approximate estimation procedures that are successful in the case of unvalued ERGMs, as they must now explore a larger per-edge state space. This high cost of estimation puts a priority on computationally efficient approximation methods.
To date, however, there has been no systematic study of how well such methods perform, either in terms of computational efficiency or quality of estimation.
This paper provides a look at this issue, evaluating estimation quality and computational cost for a number of alternative valued-ERGM estimation techniques. We focus here on ERGMs for count-valued networks, i.e., relations whose edges take values on the unbounded non-negative integers, evaluating estimators via a simulation study based on intercounty migration-flow networks in two U.S. states. We vary the variance of edge values and the (node) size of the network to simulate different data structures. The methods examined include the two currently implemented "standard" strategies (contrastive divergence [CD; 32] and Markov chain Monte Carlo maximum likelihood estimation [MCMLE; 29]) as well as one approach not previously used in this setting, maximum pseudolikelihood estimation (MPLE). MPLE is a workhorse approximation method in the binary ERGM case [42], but requires special implementation measures for the count-data case, and to our knowledge it has not previously been used for count-data ERGMs with general dependence. For all methods, we evaluate computational speed, bias, variability, accuracy, and calibration of estimated standard errors and confidence coverage.
Our experiment reveals that the edge-value variance has a substantial impact on the performance of computational methods for count-valued ERGMs. For small-edge-variance networks, all methods studied here offer high-quality point estimation, although their second-order properties vary. Specifically, CD greatly overestimates standard errors, while MPLE underestimates them for dependence terms but overestimates them for non-dependence terms [as has been seen in binary ERGMs, e.g. 35, 17]. All methods have roughly comparable wall-clock time in this regime, although MPLE with parallel computing and subsampling can be the fastest. Overall, we find that MCMLE offers the best performance for small-edge-variance data, at some computational cost. When the edge variance becomes large, however, CD and CD-MCMLE (i.e., MCMLE seeded by CD) show very poor performance on all desiderata we evaluate, while MPLE and MPLE-MCMLE return high-quality estimates; the cost of MCMLE becomes much higher in this regime, however (especially for larger networks), while MPLE remains computationally efficient. Our findings thus suggest that MPLE-seeded MCMLE is the safer choice for high-quality estimation, with the MPLE itself a potentially viable approximation (particularly for large graphs with large edge variance) when MCMLE is computationally infeasible.
The remainder of the paper proceeds as follows. Section 2 briefly reviews ERGMs for valued networks, with applicable estimation strategies discussed in Section 3. Our simulation study design is described in Section 4, with results reported in Section 5. Section 6 discusses implications for method selection, and Section 7 concludes the paper.
2 Count-Valued ERGMs
An ERGM family for count data can be written as,
(1) 	$P_{h,g}\left(Y=y;\theta\right) = \frac{h(y)\exp\left(\theta^\top g(y,X)\right)}{\kappa_{h,g}(\theta,X)}, \qquad y \in \mathcal{Y},$

where $y$ is a realization of the network random variable $Y$ on support $\mathcal{Y}$, the elements of $\mathcal{Y}$ being graphs whose edges take values on the set $\mathcal{S}$, and $\kappa_{h,g}(\theta,X)=\sum_{y'\in\mathcal{Y}} h(y')\exp\left(\theta^\top g(y',X)\right)$ is the normalizing factor. (Here, we further assume that $\mathcal{Y}$ is a subset of the order-$n$ count-valued graphs, though generalization is possible.) $g$ is a vector of sufficient statistics, determined by exogenous covariates $X$ and the graph state $y$, with corresponding parameter vector $\theta$. Finally, $h$ is the reference measure, which defines the limiting behavior of the model as $\theta \to 0$. Often tacitly taken to be constant for binary ERGMs, the reference measure is essential for valued ERGMs, as it determines the marginal distribution of edge values under the reference [31]. Leaving $h$ constant leads to a marginal Boltzmann baseline distribution, while choosing

(2) 	$h(y) = \prod_{(i,j)\in\mathbb{Y}} \frac{1}{y_{ij}!},$

where $\mathbb{Y}$ is the set of edge variables and $y_{ij}$ is the value of the $(i,j)$ edge, leads to a marginal Poisson baseline distribution of edge values. Other choices are also possible, some of which may have specific substantive interpretations [see e.g. 10, 11, for examples in the binary case].
As with binary ERGMs, we may specify the conditional probability that a given edge variable will take a specified value. Again interpreting $Y$ and $y$ as random adjacency matrices, let $Y_{-ij}$ (respectively $y_{-ij}$) refer to the set of all edge variables other than the $(i,j)$th, and let the notation $y_{ij\to k}$ refer to the network formed by $y$ with the $(i,j)$th edge variable set to value $k$. Then we have

(3) 	$\Pr\left(Y_{ij}=y_{ij} \mid Y_{-ij}=y_{-ij}\right) = \frac{h(y)\exp\left(\theta^\top g(y,X)\right)}{\sum_{k\in\mathcal{S}} h(y_{ij\to k})\exp\left(\theta^\top g(y_{ij\to k},X)\right)}$

(4) 	$= \left[\sum_{k\in\mathcal{S}} \frac{h(y_{ij\to k})}{h(y)}\exp\left(\theta^\top\left(g(y_{ij\to k},X)-g(y,X)\right)\right)\right]^{-1}.$

While the derivation is identical to the binary case (as can be appreciated by noting that Eq. 3 would reduce to the usual logistic form if $\mathcal{S}$ were restricted to $\{0,1\}$), we note the computationally important difference that the conditional edge probability itself now has a nontrivial normalizing factor. In the general case, this has no analytical solution, and since it involves an infinite sum it cannot be explicitly evaluated otherwise. Although this does not impact e.g. the acceptance calculations for typical Markov chain Monte Carlo (MCMC) algorithms (since the conditional odds of one graph versus another do not depend upon either normalizing factor), it does affect computation for the MPLE (which does depend on the conditional edge probability). Here, we formulate a finite-sum approximation to Eq. 3 for the MPLE, as described below.

3 Estimation Strategies for Count-Valued ERGMs
While many approaches to parameter estimation are possible, we focus here on approximations to the maximum likelihood estimator (MLE). Below, we briefly review the strategies employed, including special considerations for the count-data case.
3.1 Monte Carlo Maximum Likelihood Estimation
There are currently two widely used schemes for MCMC-based maximum likelihood estimation: stochastic approximation [41, 47], which is based on attempting to match the expected sufficient statistics to their observed values (exploiting the coincidence of method-of-moments and MLE for exponential families); and the Geyer-Thompson method [20, 28] (supplemented in current implementations by Hummel stepping [27]), which uses an importance sampling scheme to directly optimize the log-likelihood surface. We here employ the latter, in its statnet implementation [28, 34].

MCMLE methods are the current gold-standard techniques for ERGM maximum likelihood estimation, with good theoretical properties [41, 22] and strong performance in simulation studies for binary networks [17]. An important bottleneck impacting the use of MCMLE, however, is the ability to produce relatively high-quality draws from the specified ERGM distribution (without which the algorithms will not converge correctly). While it is known that conventional MCMC algorithms can in principle mix arbitrarily slowly [41, 7], in actual practice this problem has been observed primarily in badly specified models that are degenerate or near-degenerate, and hence of limited relevance in typical social network applications [29]. That said, estimation time can still become long on very large networks, particularly for models with strong edgewise dependence.
This cost issue becomes more acute for valued ERGMs, especially where edge values are highly variable. Intuitively, good MCMC mixing requires the Markov chain to explore the space of high-probability graphs, whose size increases substantially when edge values vary over a large range. For instance, for a simple random walk MCMC algorithm that proposes perturbing edges at random (practical implementations often use slightly different proposals, but the basic intuition carries), on the order of $N\log N$ toggles may be needed (by a coupon-collector argument) to ensure that every one of the $N$ edge variables in an unvalued graph has a high probability of having the "opportunity" to change state. If edges typically vary over some interval of order $m$, then a similar random walk scheme that increments or decrements edge values will need correspondingly many further steps for each edge to have the "opportunity" to cover its range of values. For networks with large counts (e.g., migrant-flow networks), $m$ can easily reach into the thousands, in which case simulation costs can rapidly become prohibitive. Although this problem can be alleviated by coarsening edge values to a much smaller range [as done e.g. by 49, 50], this both loses information and distorts the resulting model (since, e.g., coarsening artificially reduces the entropy of the graph distribution). In principle, improved MCMC algorithms offer a better way to address this problem in the long term, but at present our experience has been that current implementations do not scale well for high-variance count models. As we show below, the estimation quality of MCMLE is challenged by that of MPLE for high-variance valued ERGMs, and its computational cost greatly increases in this regime.
3.2 Contrastive Divergence
One alternative to numerical approximation of expected statistics or of log-likelihood ratios (as in MCMLE) is to use a local approximation to the gradient of the likelihood in regions of the support "near" the observed data. This is the essential idea behind contrastive divergence (CD) [24], a method originally introduced in the machine learning literature for scalable inference that is particularly well suited to ERGMs and other exponential families [4, 32]. CD can be employed for both valued and unvalued graphs, and it greatly reduces computational time by using only very short MCMC chains starting at the observed data, depending on neither sample convergence nor burn-in. It is, however, an approximate technique that optimizes a function closely related to the pseudolikelihood [4], and thus shares some of the drawbacks of the MPLE. Here, we compare CD with other estimators for count-valued ERGMs, showing that it has good first-order performance but poor calibration for small-edge-variance data, with estimation quality deteriorating badly when edge variance is high.

3.3 Maximum Pseudolikelihood Estimation
Although maximum pseudolikelihood estimation (MPLE) has not to our knowledge been studied or implemented for count-valued ERGMs, it is an otherwise well-known technique (being the first practical method for general ERGM estimation [42]). MPLE optimizes the product of the conditional likelihoods of each edge variable (the eponymous pseudolikelihood [6]). In the unvalued case, this reduces to a logistic regression problem, allowing the MPLE to be obtained using standard regression algorithms [a fact that was instrumental in its early adoption, see e.g. 3]. In the special case of edgewise independent ERGMs, the MPLE coincides with the MLE; this ceases to be true for dependence models, though the MPLE is generally close enough to the MLE to be used as a standard method for initializing MCMLE estimators, and its first-order performance on large networks can be very good [2, 39]. Because it does not fully account for interactions among edge variables, the pseudolikelihood function tends to be excessively concentrated, leading to poor calibration of standard error estimates [as shown in binary ERGMs: 35, 17]. However, MPLE computation can be quite efficient, further aided by the fact that (1) the pseudolikelihood itself can be approximated by subsampling edge variables (rather than computing on all of them), and (2) the calculations in question are embarrassingly parallel (making it possible to greatly reduce wall-clock time on multicore CPUs). We hence consider this approach here, as one of the methods to be compared.

As noted above, MPLE computation in the count-data context is more complex than in the binary case, and to our knowledge it has not been previously studied for count-valued ERGMs with dyadic dependence. We thus consider it here in greater detail. As in the binary case, the MPLE is defined by
(5) 	$\tilde{\theta} = \arg\max_{\theta} \prod_{(i,j)\in\mathbb{Y}} \Pr\left(Y_{ij}=y_{ij} \mid Y_{-ij}=y_{-ij}; \theta\right),$

where the conditional probabilities in question are given by Eqs. 3 and 4. Per Eq. 4, these latter conditionals depend upon a sum over the possible edge states of products of two factors: one involves the ratio of the reference measure at the observed edge value versus its alternative values, and the other involves the exponentiated difference in sufficient statistics between the observed network and the same network with the focal edge taking on alternative values. For the former, we observe that, in the case of the Poissonian reference,

(6) 	$\frac{h(y_{ij\to k})}{h(y)} = \frac{y_{ij}!}{k!},$

while the latter is simply $\exp\left(\theta^\top \Delta_{ij}^{k}\right)$, where

$\Delta_{ij}^{k} = g(y_{ij\to k}, X) - g(y, X)$

is the "generalized" change-score. There is not, in general, a simple form for the sum of these terms over all $k\in\mathcal{S}$. However, we observe that the ratio of Eq. 6 falls very rapidly (as roughly $1/k!$) for $k > y_{ij}$, and it is hence possible in practice to approximate the infinite sum by truncation. More generally, we recommend the following techniques for improving computational performance:
3.3.1 Precaching of ratios and differences
We note that neither the ratio of reference measures nor the change-scores depend upon $\theta$. Considerable computational savings can hence be had by precomputing the ratios of Eq. 6 and the change-score values $\Delta_{ij}^{k}$ for the necessary range of $k$ values on each edge. This carries a storage cost that scales with the product of the range and the number of edge variables used, but avoids frequent recalculation of these (expensive) quantities on each pseudolikelihood evaluation.
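A minimal sketch of the precaching idea (function names and data layout are our own, not those of the paper's Rcpp implementation): the log reference ratios and change-score vectors are computed once per edge, after which each evaluation of an edge's conditional log-likelihood at a new $\theta$ reduces to a dot product and a log-sum-exp.

```python
import math
import numpy as np

def precache_edge(y_obs, support, change_score):
    """Precompute, for one edge with observed count y_obs:
    - log h(y_{ij->k}) - log h(y) for each candidate k (Poisson reference),
    - the change-score vectors Delta_k = g(y_{ij->k}) - g(y).
    Neither quantity depends on theta."""
    log_href = np.array([math.lgamma(y_obs + 1) - math.lgamma(k + 1)
                         for k in support])
    deltas = np.array([change_score(k) for k in support])
    return log_href, deltas

def edge_cond_loglik(theta, log_href, deltas):
    """Conditional log-likelihood of the observed edge value (Eq. 3),
    computed from the cached terms.  The observed value's own summand is
    exp(0) = 1 (ratio 1, zero change-score), included via the support."""
    terms = log_href + deltas @ theta  # log of each summand in Eq. 4's sum
    return -np.logaddexp.reduce(terms)
```

As a sanity check, for an edgewise-independent model whose only statistic is the sum of edge values, the change-score at candidate value $k$ is just $k - y_{ij}$, and the routine recovers the (truncated) Poisson log-likelihood with mean $e^{\theta}$.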
3.3.2 Edge sum truncation and/or coarsening
Per Eq. 3, the conditional log-likelihood of each edge variable involves a sum over $k\in\mathcal{S}$. As noted above, we may approximate this sum by instead evaluating it over $k\in\{0,\ldots,k_{\max}\}$, where $k_{\max}$ is large enough for the omitted terms to be dominated by the decline in the ratio of Eq. 6. Where the marginal distributions of the edge variables can be approximated as even roughly Poissonian, choosing $k_{\max}$ to be a generous multiple of the maximum observed edge value is an extremely conservative approach. (This is based on the observation that a Poisson random variable with expectation $\lambda$ has a 99.9% quantile of roughly $\lambda + 3.09\sqrt{\lambda}$, taking conservatively the expectation to be the maximum observed value.) The multiple can be further reduced as the maximum observed value increases (because the quantile grows only sublinearly relative to $\lambda$).

Truncation from above is adequate for small networks, or networks with low edge counts. However, when edge counts become extremely large, considerable computational effort may be wasted in computing conditional probabilities for small values of $k$ when the observed value is large, or for large $k$ when the observed value is small. Using the same Poissonian approximation, we may further improve performance by working with the edgewise doubly-truncated sum over $k\in\{k_{\min}^{(ij)},\ldots,k_{\max}^{(ij)}\}$, with the bounds chosen from lower and upper Poisson quantiles anchored at the observed edge value. Because it is common to have network effects that can strongly suppress edges, however, we also recommend retaining some very small edge values as a buffer. Valued social networks usually have right-skewed distributions of edge values, so adding a few small edge values can also effectively increase the coverage of the empirical distribution without a significant increase in computational load. Our code by default retains a small buffer of the lowest edge values (although we encourage extending it whenever possible), so that the number of retained terms per sampled edge grows only on the order of $\sqrt{y_{ij}}$, which is often a substantial savings as values become large.

When dealing with extremely large counts, storing and computing even this reduced number of terms can become prohibitive (particularly if many edge variables are needed for adequate statistical power). In such cases, coarsening is another option. To coarsen, we select evenly spaced values from the truncated support, and compute the associated contributions to the edge sum only for these terms. We also record the gap (in terms of the number of "skipped" values) between subsequent calculated terms, and weight each computed term by the number of elements in its gap; this is equivalent to approximating the sum via a step function, with knots at the computed values. Our experience with this method has been promising, although problems can ensue if the sum becomes heavily concentrated on a range of terms lying between adjacent knots. We thus do not employ this technique in this paper, although we offer it as a promising target for future research. Of course, other approximation methods are also possible (e.g., integral approximations), and these may be useful in the large-count regime.
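A hypothetical sketch of the doubly-truncated edge support (the window half-width and the number of retained small values are illustrative choices of ours, not the defaults of the paper's code):

```python
import math

def truncated_support(y_obs, width_sd=5.0, n_small=3):
    """Poisson-motivated window around the observed count y_obs, plus a few
    very small values retained as a buffer against edge-suppressing effects.
    width_sd and n_small are illustrative, not the paper's defaults."""
    sd = math.sqrt(max(y_obs, 1.0))          # Poisson sd ~ sqrt(mean)
    lo = max(0, int(y_obs - width_sd * sd))
    hi = int(math.ceil(y_obs + width_sd * sd))
    window = set(range(lo, hi + 1))
    window.update(range(min(n_small, lo)))   # keep 0, 1, ... as a buffer
    return sorted(window)
```

For an observed count of 10,000 this retains on the order of a thousand candidate values (plus the small-value buffer) rather than the full range from 0 upward, illustrating the savings described above.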
3.3.3 Edge variable sampling
Although exact calculation of the pseudolikelihood requires visiting every edge variable, the log pseudolikelihood can easily be approximated by random sampling of edge variables; this reduces both storage and computational cost. We have experimented with uniform random sampling, as well as with weighted (i.e., importance) sampling schemes analogous to the "Tie-No-Tie" proposal method frequently used in ERGM MCMC [36] and a scheme that attempts to sample the full range of edge values uniformly. Simulation experiments suggested little if any benefit to importance sampling versus random sampling (see Section 8.2 in the Appendix), however, and we thus do not pursue this direction here.
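The uniform subsampling estimator can be sketched as follows (our own naming, with `edge_loglik` standing in for the per-edge conditional log-likelihood of Eq. 3); rescaling the sampled sum by $N/m$ keeps it an unbiased estimate of the full log pseudolikelihood:

```python
import random

def sampled_log_pseudolik(edges, edge_loglik, frac=0.5, seed=0):
    """Approximate the log pseudolikelihood from a uniform sample of
    edge variables, rescaled so the estimate is unbiased for the full sum."""
    rng = random.Random(seed)
    m = max(1, int(round(frac * len(edges))))
    sample = rng.sample(edges, m)
    return (len(edges) / m) * sum(edge_loglik(e) for e in sample)
```

With `frac=1.0` this reduces exactly to the full log pseudolikelihood, which makes the rescaling easy to verify.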
3.3.4 Parallel evaluation of conditional loglikelihoods
Because the log of the pseudolikelihood is linearly separable, its calculation is an embarrassingly parallel problem. In practice, we divide sampled edge variables into batches and calculate their conditional log-likelihoods independently on different cores. Since MPLE computation is dominated by the pseudolikelihood calculation, this leads to wall-clock time reductions that scale as the inverse of the number of available cores. This (combined with edge variable sampling) can make the MPLE an attractive choice for very large valued networks, especially when many cores are available.
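The batching structure can be sketched as below (our illustration: the paper's implementation parallelizes the Rcpp-based per-edge computation across cores, whereas this thread-based sketch only demonstrates the separability of the sum and the batch layout):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def edge_loglik(edge):
    """Placeholder per-edge term standing in for the Eq. 3 computation."""
    i, j, y = edge
    return -0.1 * y - math.lgamma(y + 1)

def batch_sum(batch):
    return sum(edge_loglik(e) for e in batch)

def parallel_log_pseudolik(edges, n_workers=4):
    """Split the sampled edge variables into batches, evaluate each batch's
    partial sum independently, and add the partial sums."""
    batches = [edges[b::n_workers] for b in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(batch_sum, batches))
```

Because the partial sums are independent, the parallel result agrees exactly with the sequential sum, and with a true multi-process or compiled backend the wall-clock time falls roughly in proportion to the number of cores.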
Taken together, the above allow the MPLE to be used even for large networks with highly dispersed counts (although not all of these techniques are needed when counts are less variable, or on smaller networks). As we show, MPLE is very fast, and the resulting estimator can have low bias and high accuracy for valued networks; it offers high-quality calibration of uncertainty when the edge variance is large, but is prone to overconfidence for dependence terms (i.e., underestimation of the standard error) and to conservatism for non-dependence terms when the edge variance is small.
4 Study design
We evaluate the above estimation techniques via a parameter recovery study, in which we generate networks from a realistic generative model based on an initial fit to real-world social networks, fit models to the simulated draws using each respective technique, and then examine the properties of the resulting estimators. Our generative model was created by fitting a valued ERGM to an empirical case (see below) using MCMLE; we then obtained 500 high-quality draws from the fitted ERGM using MCMC. For each draw, we obtained point and standard error estimates from each of the three study methods (MPLE, CD, MCMLE), evaluating the results with respect to wall-clock estimation time, bias, variance of the estimator, overall accuracy, and calibration (accuracy of estimated standard errors and confidence coverage). All modeling and analysis was performed using statnet [23], specifically the ergm 3.10.4 [28], ergm.count 3.4.0 [34], and sna 2.4 [9] libraries. Our MPLE implementation also made use of the Rcpp library [18]. The following subsections detail the data and model used, the setup of the estimation procedures, and the performance metrics.
4.1 Case study and model definition
We based our experiment on the migration-flow network among the 29 counties comprising the U.S. state of Utah over the period 2011-2015, as measured by the American Community Survey [44]. Since the variance of edge values influences both the computational load and the structure of the support space, we construct two datasets to examine how edge variance alters the performance of estimation methods for valued ERGMs. The first dataset uses the count of migrants between each directed pair of counties as the edge value, which ranges from 0 to 8,023 (standard deviation 542). Given its high edge-value variance, we call it the "difficult case." For comparison, the edge value in the second dataset is the coarsened migrant count, where we take the natural logarithm of the migrant count (plus one) and round it to the nearest integer. Its edge values range from 0 to 9 (standard deviation 2.2), and we call it the "easy case," with small edge variance. We also construct a third dataset of coarsened migrant counts among the 100 North Carolina counties, to compare with the easy case above in an investigation of how network size influences estimation of valued ERGMs. Results for the third dataset are reported in Section 8.1 in the Appendix.

Our ground-truth model is created by fitting a count-valued ERGM to the above networks. The sufficient statistics for the model contain a Sum term (the intercept), a Nonzero term (i.e., the count of nonzero edges), three exogenous covariates, and one dependence term (a Mutual term). The exogenous covariates are the population sizes of the sending and receiving counties (called Nodeocov and Nodeicov, respectively) and the distance between counties (Edgecov), all on the natural log scale. The dependence term, mutuality, measures the reciprocity of the network, defined as
(7) 	$\mathrm{Mutual}(y) = \sum_{i<j} \min\left(y_{ij}, y_{ji}\right).$
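As an illustration, the "min" form of the valued mutuality statistic (the form implemented in ergm.count, which we assume is the one used here) can be computed directly from the count adjacency matrix:

```python
import numpy as np

def mutual_min(y):
    """Valued mutuality ("min" form): sum over unordered pairs i < j of
    min(y_ij, y_ji).  Strongly reciprocated dyads contribute large values."""
    y = np.asarray(y)
    return int(np.triu(np.minimum(y, y.T), k=1).sum())
```

For the small example below, the three dyads contribute min(5, 2) = 2, min(0, 1) = 0, and min(7, 3) = 3, for a total of 5.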
Although our aim here is to produce a deliberately simple model for purposes of evaluation (as opposed to a substantively detailed model of migration), our choice of statistics was informed by prior work on migration, and previous empirical analyses on migrationflow networks in particular [8, 26, 49, 51].
Estimation for the ground-truth model was performed by MCMLE (seeded by CD), using the ergm function (with an MCMC interval of 1 million for the difficult case and 100 thousand for the easy case). (By default, statnet sets the MCMC burn-in to 16 times the length of the interval; we use this default setting unless otherwise specified.) Simulation of draws from the fitted model was performed via the simulate function with MCMC (an interval of 100 million for the difficult case and 10 million for the easy case). Diagnostics of the simulated networks indicate low serial autocorrelation, and the means of the sufficient statistics from the simulated networks match those of the observed networks.
4.2 Methods under evaluation
As described above, we estimate parameters from the simulated network draws using three procedures: Contrastive Divergence (CD), Monte Carlo Maximum Likelihood Estimation (MCMLE), and Maximum Pseudolikelihood Estimation (MPLE). Estimation for each method was performed as follows.
For CD, we use the default settings of the ergm.count package, performing 8 Metropolis-Hastings steps and raising one proposal in each step. We also tried using more steps or more proposals within each step for CD; since this did not significantly improve estimation quality (see Section 8.3 in the Appendix), we keep the most time-efficient setting in our comparison. For MCMLE, we made three adjustments to the default settings in ergm.count. First, we set the proposal distribution of the Metropolis-Hastings algorithm in the MCMC to be random, with every dyad having an equal chance of being toggled. By default, the proposal distribution in statnet favors toggling nonzero edges over nulls, with the rationale that (binary) social networks tend to be sparse; we remove this penalty towards nulls since the binary density of the migration networks (0.5) is high (removing the need for biased proposals), and the random proposal reduces the computational time of the MCMLE. Second, we set the maximum number of iterations in the MCMLE algorithm (i.e., importance sampling/estimation cycles) to 500, to offer maximum opportunity for the MCMLE to converge. When MCMLE still failed to converge, we reran the entire estimation procedure a second time, and we report the fraction of these failures in the results section. Third, the default MCMC thinning interval is 1,024, which we follow for the easy (small-edge-variance) case. This chain length turned out to be too short for MCMLE to have good mixing properties in the difficult (large-edge-variance) case, so we increased it to 10,000 (which was found to be adequate in this setting). Lastly, ergm.count by default uses the point estimates from CD as starting values for MCMLE (here referred to as CD-MCMLE), which we employ for the easy/small-edge-variance case. For the difficult/large-edge-variance case, since CD-MCMLE failed to offer satisfactory estimation, we also tried seeding MCMLE with MPLE (sampling 50%, or 406 dyads), and we compare the resulting MPLE-MCMLE with CD-MCMLE.
For MPLE, we use the procedure described in Section 3.3. For the edge-support truncation, we employed the conservative setting in which the support of every edge ranges from 0 to a fixed multiple of the maximum edge value in the network, with the multiple chosen separately for the easy/small-edge-variance and difficult/large-edge-variance cases. One can further speed up estimation by employing edgewise truncation of the edge support, but we found non-edgewise truncation to be fast enough for this small network. To examine the effect of edge sample size on estimation quality, we consider model fits using 406, 609, and 812 edge variables, comprising 50%, 75%, and 100% of the edges, respectively. We also consider the impact of multiple cores on execution time, calculating the wall-clock time for models estimated using 1, 4, and 20 cores. In addition to random sampling, we also experimented with other sampling protocols. These did not significantly alter the performance of MPLE (see Section 8.2 in the Appendix), and we therefore focus on the randomly subsampled case.
All models were fit on a 44-core server with 256GB RAM. The processors are dual Intel Xeon E5-2699 2.2GHz CPUs (22 cores/CPU). Estimation using R 3.6.3 was performed on Ubuntu 20.04.1. All procedures reported are based on a single core, except for the multicore MPLE conditions.
4.3 Evaluation criteria
Since the methods of interest involve speed-quality tradeoffs, it is necessary to evaluate these two dimensions simultaneously. To evaluate computational cost, we compute the wall-clock time for each method, as the mean seconds needed to fit the target model to a simulated network. Since the speed of MPLE depends on the number of parallel processes, we repeat the process using 1, 4, and 20 cores, respectively.
To evaluate estimation quality, we consider the bias, variance, overall accuracy, and calibration of each estimator using the following metrics.
We first compute the absolute relative bias (ARB) of the estimators for each coefficient, using

(8) 	$\mathrm{ARB} = \left|\frac{\hat{\theta} - \theta}{\theta}\right|,$

taking the average across experiments, where $\hat{\theta}$ is the estimator and $\theta$ is the true value from the model that simulated the networks. The smaller the ARB, the less bias is introduced by the estimation procedure.
We also compute the variability of each estimator, via the (true) standard error of the estimated coefficient. Formally,

(9) 	$\mathrm{SE}(\hat{\theta}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_i - \bar{\hat{\theta}}\right)^2},$

where $\hat{\theta}_i$ is the estimate from the $i$th of $n$ simulated networks and $\bar{\hat{\theta}}$ is their mean. The smaller the variability, the more efficient and precise the estimator.
While bias and variance are each important, we are also interested in the total accuracy of the estimator (the extent to which it deviates, on average, from the true value). We measure this via the root-mean-square error (RMSE), i.e.

(10) 	$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_i - \theta\right)^2}.$

The smaller the RMSE, the more accurate the estimator is on average.
Finally, we consider how well calibrated each estimator is, in terms of the associated estimates of uncertainty. To evaluate the bias in our second-moment estimate, we compare the true standard error and the estimated standard error using

(11) 	$\frac{\overline{\widehat{\mathrm{SE}}}(\hat{\theta}) - \mathrm{SE}(\hat{\theta})}{\mathrm{SE}(\hat{\theta})},$

where $\overline{\widehat{\mathrm{SE}}}(\hat{\theta})$ is the mean estimated standard error across experiments. A positive number suggests the method is conservative, while a negative number suggests the method is overconfident. We also examine confidence coverage, specifically the proportion of cases in which the nominal 95% confidence interval (CI) for each parameter actually covers the true coefficient. Specifically, the coverage rate is computed by

(12) 	$\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left\{\hat{\theta}_i - z\,\widehat{\mathrm{SE}}(\hat{\theta}_i) \le \theta \le \hat{\theta}_i + z\,\widehat{\mathrm{SE}}(\hat{\theta}_i)\right\},$

where $z \approx 1.96$ for a 95% CI with large degrees of freedom. The closer the coverage to 95%, the better calibrated the estimate. Coverage rates above 95% suggest that the method is conservative, while rates below 95% suggest that the method is overconfident.
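The metrics above can be sketched in a few lines (our illustration; `est` and `se` hold the replicate point and standard-error estimates for a single coefficient, and the standard-error calibration of Eq. 11 is computed as a relative difference):

```python
import numpy as np

def evaluate(est, se, truth, z=1.96):
    """Evaluation metrics for one coefficient across simulation replicates:
    ARB (Eq. 8), true SE (Eq. 9), RMSE (Eq. 10), relative SE bias (Eq. 11),
    and 95% CI coverage (Eq. 12)."""
    est, se = np.asarray(est, float), np.asarray(se, float)
    arb = np.mean(np.abs((est - truth) / truth))
    true_se = np.std(est)                       # SD of estimates across runs
    rmse = np.sqrt(np.mean((est - truth) ** 2))
    se_bias = (np.mean(se) - true_se) / true_se  # >0 conservative, <0 overconfident
    covered = (est - z * se <= truth) & (truth <= est + z * se)
    return dict(ARB=arb, SE=true_se, RMSE=rmse,
                SE_bias=se_bias, coverage=covered.mean())
```

Note that when the mean of the estimates equals the truth (no bias), RMSE and the true SE coincide, which matches the observation in Section 5 that accuracy is dominated by variability when bias is negligible.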
5 Results
5.1 The small-edge-variance (easy) case
Figures 1 and 2 display the performance of each method on the small-edge-variance data. Panel A in Figure 1 shows the absolute relative bias (ARB) of the coefficient estimates. It suggests that all methods produce very little numerical bias: less than 2.1% across all covariates and methods. Biases from the MPLE generally decrease as more edges are sampled, and both the full MPLE (i.e., with all edges sampled) and MCMLE have the smallest bias, less than 1% across all covariates. This finding is consistent with research on binary ERGMs showing that the MPLE introduces little bias in first-order estimation [17, 39]. The lack of appreciable bias is encouraging, suggesting that point estimates for valued ERGMs on small-edge-variance data are easy to obtain with any of the methods tested.
We also evaluate the (im)precision or variability of estimation (sometimes called efficiency), i.e., the true standard error of each estimator. Panel B in Figure 1 shows that the variability of the MPLE decreases as more edges are sampled, and that the full MPLE and MCMLE are the most efficient methods. In general, the variabilities of the estimators are close to one another, suggesting that all methods have similar efficiency on the small-edge-variance data.
We then evaluate the total accuracy of estimation using the root-mean-square error (RMSE). A more holistic metric, RMSE combines the bias and variability evaluated above, with smaller values preferred. Panel C of Figure 1 displays the RMSE scores, whose distribution is almost identical to the variability scores in Panel B. This similarity reveals that bias contributes very little to the total RMSE, with accuracy dominated by variability. For valued ERGMs on small-edge-variance data, methods with good variability scores thus also have decent accuracy. The full MPLE and MCMLE have the smallest RMSE, though the RMSEs of all methods under evaluation differ only mildly.
Besides performance in coefficient estimation, we also evaluate performance in uncertainty estimation, shown in the first two panels of Figure 2. Panel D displays the calibration of each method, an indicator of bias in standard error estimation: a positive calibration score indicates overestimation of the uncertainty, and a negative score indicates underestimation. Noticeably, CD overestimates standard errors for all covariates by a large margin; CD's calibration scores are all above 2, indicating that the estimated standard error is more than 7 (e² ≈ 7.4) times its true value. We experimented with different tuning parameters for CD, but its calibration did not improve to a reasonable level (see Section 8.3 in the Appendix). Hence, we conclude that in this small-edge-variance case, CD is too conservative to offer useful information about the uncertainties of covariates for valued ERGMs.
For the MPLE, we find that it underestimates the uncertainty of the dependence term (mutual), but overestimates the uncertainties of the non-dependence terms, though the degree of inflation is smaller. Previous simulation studies found similar patterns for the MPLE on binary ERGMs [35, 17]. Interestingly, the bias in standard error estimation increases with the sample size for the MPLE. This is in part because the bias in statistical uncertainty is negligible when numerical uncertainty is the main source of uncertainty for the MPLE (as is the case with small sample sizes); as numerical uncertainty decreases with more edges sampled, the bias in statistical uncertainty becomes non-negligible. The calibration of the MPLE is not as far off as that of CD, but caution is warranted when interpreting its standard error estimates for dependence terms.
While CD and the MPLE show varying degrees of error in standard error estimation, MCMLE is almost perfectly calibrated, suggesting that MCMLE is the best method for standard error estimation on small-edge-variance data.
Another metric that incorporates uncertainty estimation is confidence coverage, the proportion of model fits in which the true value of a given coefficient is covered by the estimated 95% confidence interval (CI), shown in Panel E of Figure 2, where the dotted line is the 95% reference line and the solid line represents 100%. The figure tells a similar story to Panel D's calibration score, since coverage performance is largely determined by the calibration of uncertainty when the bias of the coefficient is small. CD overestimates standard errors so much that its CIs always cover the true value, making them conservative but uninformative. The MPLE's CIs cover the true values more than 95% of the time for non-dependence terms, but undercover the true value for the dependence term, with this deviation growing as sample size increases; even so, the CI performance remains reasonable for the dependence term (above 80% coverage for a nominal 95% CI). MCMLE, on the other hand, has coverage rates extremely close to 95%, showing its characteristic calibration advantage for valued-ERGM estimation on small-edge-variance data.
Lastly, Panel F of Figure 2 displays the wall-clock time of each method. As expected, the wall-clock time for computing the MPLE can be greatly compressed by using a sample of edges to approximate the joint pseudolikelihood function, or by using multiple cores to calculate the conditional likelihoods. The fastest methods are CD and the MPLE using 20 processors. Interestingly, MCMLE has reasonable estimation time, just one order of magnitude slower than the fastest methods. This, together with its excellent estimation quality, suggests that MCMLE is the ideal method for valued-ERGM estimation on small-edge-variance data.
5.2 The large-edge-variance (difficult) case
Figures 3 and 4 display the performance of each method on the large-edge-variance data. For contrastive divergence, looking across the metrics (all displayed on the log scale) in Figure 3, we find that CD introduces very large numerical biases (Panel A), has variance far higher than the rest of the methods (Panel B), and produces much larger errors than the other methods (Panel C). Together, these suggest that CD's coefficient estimation is uninformative and inefficient for large-edge-variance data. Panel D in Figure 4 reveals that CD overestimates the uncertainty of every covariate except the nonzero term, whose uncertainty it underestimates by a large margin; Panel E shows that CD's confidence intervals rarely cover the true coefficient values except for the dependence term, which might be attributed to its overestimation of that term's uncertainty. Together, these indicate that CD fails to offer useful estimates of both coefficients and standard errors for the large-edge-variance data. Section 8.3 in the Appendix shows that increasing the tuning parameters does not significantly alter CD's poor performance.
Results for CD-MCMLE suggest that following CD with MCMLE generally improves estimation quality, but its performance is still far from ideal. For point estimation, CD-MCMLE estimators overall introduce less bias, variance, and error than CD, usually by a large margin. However, CD-MCMLE still shows a remarkable 214% bias in estimating the nonzero term, and its variance and error for that term are even higher than CD's. The uncertainty estimates of CD-MCMLE are also poorly calibrated, with underestimated standard errors for all covariates, and its confidence intervals show less-than-optimal coverage. These results indicate that CD-MCMLE is not an effective estimation approach for valued ERGMs when the edge variance is large.
The MPLE, by contrast, shows excellent performance on all metrics compared to CD and CD-MCMLE. The MPLE shows very little bias; the largest bias appears in the estimate of the dependence term, and is still less than 0.5% with only 406 dyads sampled (smaller than the bias of any estimator in the small-edge-variance case). The variances and errors of the MPLE are also much smaller than those of CD or CD-MCMLE, and they decrease as more edges are sampled (though sample size has little impact on these metrics). Figure 4 further reveals that the MPLE is impressively well calibrated for uncertainty (including for the dependence term), with at most a 2.5% bias in standard error estimation. The tiny bias in both coefficient and standard error estimation keeps the MPLE's confidence coverage very close to the 95% benchmark; the only noticeable deviation occurs for the nonzero term, where the full MPLE's CIs cover the true value 90% of the time, slightly below the 95% benchmark.
Overall, we thus see that, for the large-edge-variance case, CD and CD-MCMLE perform poorly, while the MPLE offers high-quality estimates. It is striking that MCMLE performs excellently on the small-edge-variance data, but seems to fail badly when the edge variance becomes large. One possibility is that the high bias of CD for large-variance data leads MCMLE to start from a point far from the maximizer, and the Geyer–Thompson method underlying MCMLE is known to work poorly in regimes far from the maximizer [27]. Indeed, CD-MCMLE failed to converge within 500 iterations for 3 of the 500 sampled networks in the first round, and was given another attempt to reach convergence. Since the MPLE offers accurate point estimates, we experimented with using the 406-dyad MPLE as starting values for MCMLE (here called MPLE-MCMLE).
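The sensitivity of iterative likelihood maximization to its starting value can be illustrated well outside the ERGM setting. The toy sketch below is our own illustration, not the Geyer–Thompson algorithm: it runs Newton–Raphson for a simple Poisson-rate MLE and counts iterations from a seed near versus far from the maximizer.

```python
import numpy as np

def newton_poisson_rate(y, eta0, tol=1e-10, max_iter=200):
    """Newton-Raphson MLE of a Poisson rate lambda = exp(eta).

    Score: sum(y) - n*exp(eta); observed information: n*exp(eta).
    Returns the estimate of eta and the number of iterations taken.
    """
    y = np.asarray(y, dtype=float)
    n, s = len(y), y.sum()
    eta = eta0
    for it in range(1, max_iter + 1):
        lam = np.exp(eta)
        step = (s - n * lam) / (n * lam)   # score divided by information
        eta += step
        if abs(step) < tol:
            break
    return eta, it

y = [3, 5, 7, 5, 4, 6, 5, 5, 5, 5]                           # mean 5, so the MLE is eta = log 5
eta_good, it_good = newton_poisson_rate(y, np.log(5) + 0.1)  # seed near the maximizer
eta_bad, it_bad = newton_poisson_rate(y, np.log(5) - 3.0)    # seed far from it
# The far seed overshoots badly and then crawls back, taking many more iterations.
```

The analogy is loose (Monte Carlo noise in the MCMLE gradient makes matters worse still), but it conveys why a badly biased CD seed can be so costly.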
Seeding MCMLE with the MPLE greatly improves its performance compared to CD-MCMLE. MPLE-MCMLE has the smallest bias, variability, and RMSE, as Figure 3 shows. Figure 4 suggests that MPLE-MCMLE also calibrates uncertainty well, and offers confidence coverage very close to the 95% benchmark. This reveals that, as with ERGMs for binary networks, good performance of (Geyer–Thompson) MCMLE relies heavily on a good seed for valued ERGMs. When the edge variance is large, seeding MCMLE with the MPLE is significantly better than seeding it with CD, given the stark difference in their estimation quality.
Panel F in Figure 4 shows the wall-clock times of the different methods, with the main panel displaying them on the original scale and the upper-right inset on the log scale. We find that CD-MCMLE is orders of magnitude slower than the rest of the methods, and that seeding MCMLE with the MPLE not only improves its estimation quality, but also reduces the computational time by a factor of nearly 15. That said, MPLE-MCMLE is still about 10 times slower than the MPLE when the latter uses only one processor. Based on the previous experiments, the MPLE can be made orders of magnitude faster by employing multiple processors on large networks, and is therefore the fastest method for estimating valued ERGMs with large edge variance.
The last piece of the puzzle for the large-edge-variance data concerns the comparison between the MPLE and MPLE-MCMLE. A closer look at point estimation performance suggests that MPLE-MCMLE has lower bias, variance, and error than the MPLE, but the differences are tiny, and all MPLE variants already perform very well on these dimensions. Their calibration scores are close to each other, while MPLE-MCMLE performs better on confidence coverage for the dependence term, almost the only noticeable flaw of the MPLE. In terms of wall-clock time, MPLE-MCMLE is at least an order of magnitude slower than the MPLE when multicore computation is unavailable. In general, the MPLE offers high-quality estimates in the cases studied here, with MPLE-MCMLE offering even better estimation at greatly increased computational cost. This suggests a strategy of employing the MPLE as the default tool for estimating count-valued ERGM parameters on large-edge-variance data, with strategies such as increasing the MPLE sample size or following the MPLE with MCMLE employed to improve estimation quality when computationally feasible.
6 Discussion
Overall, our experiments reveal that the variance of edge values makes a substantial difference in the performance of estimation methods for count-valued ERGMs. In turn, supplemental experiments suggest that network size does not alter the relative advantages of these methods, as we see similar performance on the small-edge-variance Utah data (N=29) and the small-edge-variance North Carolina data (N=100) (see Section 8.1 in the Appendix). For small-edge-variance data, all methods perform very well in point estimation, while CD greatly overestimates uncertainty, and the MPLE underestimates the uncertainties of dependence terms but overestimates them for non-dependence terms. The wall-clock times of the methods are close to one another, the fastest being CD and the MPLE with subsampling and parallel computing.
For large-edge-variance data, CD fails to offer reliable estimates of either coefficients or standard errors, and so does MCMLE seeded by CD. The MPLE, on the other hand, offers high-quality estimates of coefficients and, surprisingly, of standard errors as well. MCMLE seeded by the MPLE offers even better estimation, although it is an order of magnitude slower than even the slowest MPLE configuration, and the MPLE can be further accelerated by subsampling edge variables and/or parallel computing.
The results suggest that for small-edge-variance data, MCMLE can serve as the standard estimation method, considering its excellent estimation quality and reasonably fast computation. CD, as the most computationally efficient method, can be utilized for tasks that only require point estimates of coefficients, such as exploratory analysis, prediction, or estimating models for network simulation. For tasks too expensive for MCMLE, such as high-dimensional models or large networks, the MPLE with subsampling and parallel computing can be a feasible option, although caution is warranted when interpreting its standard error estimates. As for networks with large edge variance, the MPLE can be the go-to method, with reliable first- and second-order estimates and the fastest computation. If time allows, one can increase the MPLE sample size, or refine the MPLE with MCMLE, for further gains in estimation quality.
The experiment reveals that MCMC-based estimation methods perform worse as the edge variance becomes large. One potential reason is their difficulty in reproducing the dichotomized density of the networks (i.e., the proportion of nonzero edges). With a larger range of edge values, toggles between zero and nonzero values become less likely, and the difficulty of matching the target density increases. (This is, of course, a special case of zero-inflation, a common phenomenon in count data models.) This is reflected in the trace plots of CD-MCMLE for the large-edge-variance data, where the count of nonzero edges in our MCMC simulations was always far from the observed value. The issue can be worsened by a mismatch between MCMC algorithm design and the typical properties of valued networks. Compared to their binary counterparts, valued social networks are frequently denser (in dichotomized density), yet the MCMC algorithms in existing software are optimized for sparse, unvalued social networks. We observed an improvement in computational time when switching from the toggling proposal that favors nonzero edges to the random proposal, which offers the highest probability of toggling empty edges among the existing algorithms. To improve the performance of MCMC-based methods for valued ERGMs, future research could experiment with MCMC algorithms that pay more attention to toggling between zero and nonzero values, such as proposals that favor toggling zero-valued edges.
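To make the proposal-design point concrete, the toy sketch below (our own illustration; neither scheme is literally one of ergm.count's samplers) compares how often a uniform "random" dyad proposal versus a stylized tie-biased proposal lands on an empty dyad, the kind of move needed to change a network's dichotomized density.

```python
import random

def choose_dyad(values, scheme, rng):
    """Pick a dyad to perturb under two stylized proposal schemes.

    values: dict mapping dyad id -> current edge value.
    'random' picks any dyad uniformly; 'tie-biased' first flips a coin
    between the nonzero dyads and all dyads, mimicking proposals tuned
    for sparse binary networks.
    """
    dyads = list(values)
    if scheme == "tie-biased":
        nonzero = [d for d in dyads if values[d] > 0]
        if nonzero and rng.random() < 0.5:
            return rng.choice(nonzero)
    return rng.choice(dyads)

# A mostly-zero valued network: dyads 0-9 have value 1, the rest are 0.
values = {d: (1 if d < 10 else 0) for d in range(100)}
rng = random.Random(0)
empty_hits = {
    scheme: sum(choose_dyad(values, scheme, rng) >= 10 for _ in range(5000))
    for scheme in ("random", "tie-biased")
}
# 'random' proposes empty dyads ~90% of the time; 'tie-biased' only ~45%.
```

On a dense valued network the imbalance reverses, which is consistent with the improvement we saw from the random proposal here.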
As with any simulation study, one trades off the “realism” of performance on a realistic case against some degree of generality. Although we vary the network size and the distribution of edge values to emulate different application settings, we cannot rule out the possibility that some methods studied here may perform better or worse under other conditions. We encourage future simulation studies based on different use cases and model specifications, including non-Poissonian reference measures. Given that the variance of edge values plays an important role in the performance of valued-ERGM estimation, it would be of interest to experiment with a more fine-grained classification of the scale of edge variance, in search of a rule of thumb for when MCMLE or the MPLE is the better choice.
Our study also suggests the continuing relevance of the MPLE to ERGM methodology. Our implementation of the MPLE for valued ERGMs enables estimation for large-edge-variance data in feasible time and with high-quality results. With good overall accuracy, high speed, and excellent tunability, the MPLE would be an excellent general-use estimation method for valued ERGMs if its calibration could be improved for small-edge-variance networks. Our findings suggest the value of work on methods that may further improve the calibration of the MPLE in the count-valued case; such advances may build on methods shown to help calibration for binary ERGMs, such as bootstrap resampling [14, 39] and regularization [17].
7 Conclusion
ERGMs, especially for valued networks, can be computationally expensive to estimate. In search of a fast and reliable computational method, we implemented the MPLE for count-valued ERGMs, and performed a comparative simulation study of three methods: CD, MCMLE, and MPLE. We found that the variance of edge values is critical in determining the performance of computational methods for valued ERGMs, while network size is not. For small-edge-variance networks, point estimates are easy to obtain with any of the methods, while CD greatly overestimates uncertainties and the MPLE underestimates them for dependence terms and overestimates them for non-dependence terms; all methods have similar wall-clock times. For large-edge-variance networks, CD and CD-MCMLE perform poorly, while the MPLE and MPLE-MCMLE offer strong performance in estimating both coefficients and uncertainties, with the MPLE orders of magnitude faster than MPLE-MCMLE.
On the basis of this study, we recommend that researchers choose computational methods based on the variance of edge values. For small-edge-variance data, MCMLE should be the default method, although CD is suitable for point estimates; the MPLE is well suited to large networks and high-dimensional models, especially with a large number of available processors, but caution is warranted when interpreting its standard errors. For large-edge-variance networks, the MPLE is a solid method, and researchers can choose the edge-sample size, and decide whether to follow the MPLE with MCMLE, based on the computational resources at hand and the required estimation quality. In conclusion, this paper offers an MPLE implementation that enables computation of valued ERGMs for networks with large size or high edge variance, along with guidance for choosing and tuning computational methods for valued-ERGM estimation.
Acknowledgements
This research was supported by NSF award SES1826589.
References
 Aicher et al. [2014] Aicher, C., Jacobs, A. Z. and Clauset, A. (2014) Learning latent block structure in weighted networks. Journal of Complex Networks, 3, 221–248.
 An [2016] An, W. (2016) Fitting ERGMs on big networks. Social Science Research, 59, 107–119.
 Anderson et al. [1999] Anderson, C., Wasserman, S. and Crouch, B. (1999) A p* primer: Logit models for social networks. Social Networks, 21, 37–66.
 Asuncion et al. [2010] Asuncion, A., Liu, Q., Ihler, A. and Smyth, P. (2010) Particle filtered MCMC-MLE with connections to contrastive divergence. In International Conference on Machine Learning (ICML).
 Bernard et al. [1979] Bernard, H. R., Killworth, P. and Sailer, L. (1979) Informant accuracy in social networks IV: A comparison of clique-level structure in behavioral and cognitive network data. Social Networks, 2, 191–218.
 Besag [1974] Besag, J. (1974) Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society: Series B (Methodological), 36, 192–225.
 Bhamidi et al. [2011] Bhamidi, S., Bresler, G. and Sly, A. (2011) Mixing time of exponential random graphs. The Annals of Applied Probability, 2146–2170.
 Boyle et al. [2014] Boyle, P., Halfacree, K. and Robinson, V. (2014) Exploring Contemporary Migration. Abingdon, United Kingdom: Routledge.
 Butts [2008] Butts, C. T. (2008) Social Network Analysis with sna. Journal of Statistical Software, 24, 1–51.
 Butts [2019] — (2019) A dynamic process interpretation of the sparse ERGM reference model. Journal of Mathematical Sociology, 43, 40–57.
 Butts [2020] — (2020) A dynamic process reference model for sparse networks with reciprocity. Journal of Mathematical Sociology, forthcoming.
 Butts et al. [2007] Butts, C. T., PetrescuPrahova, M. and Cross, B. R. (2007) Responder communication networks in the World Trade Center Disaster: Implications for modeling of communication within emergency settings. Journal of Mathematical Sociology, 31, 121–147.
 Dekker et al. [2007] Dekker, D., Krackhardt, D. and Snijders, T. A. B. (2007) Sensitivity of MRQAP tests to collinearity and autocorrelation conditions. Psychometrika, 72, 563–581.
 Desmarais and Cranmer [2012a] Desmarais, B. and Cranmer, S. (2012a) Statistical mechanics of networks: Estimation and uncertainty. Physica A: Statistical Mechanics and its Applications, 391, 1865–1876.
 Desmarais and Cranmer [2012b] Desmarais, B. A. and Cranmer, S. J. (2012b) Statistical Inference for Valued-Edge Networks: The Generalized Exponential Random Graph Model. PLoS ONE, 7, e30136.
 Drabek et al. [1981] Drabek, T. E., Tamminga, H. L., Kilijanek, T. S. and Adams, C. R. (1981) Managing Multiorganizational Emergency Responses: Emergent Search and Rescue Networks in Natural Disaster and Remote Area Settings. No. Monograph 33 in Program on Technology, Environment, and Man. Boulder, CO: Institute of Behavioral Sciences, University of Colorado.
 van Duijn et al. [2009] van Duijn, M. A. J., Gile, K. J. and Handcock, M. S. (2009) A framework for the comparison of maximum pseudolikelihood and maximum likelihood estimation of exponential family random graph models. Social Networks, 31, 52–62.
 Eddelbuettel et al. [2011] Eddelbuettel, D., François, R., Allaire, J., Ushey, K., Kou, Q., Russel, N., Chambers, J. and Bates, D. (2011) Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40, 1–18.
 Faust [2011] Faust, K. (2011) Animal social networks. The SAGE handbook of social network analysis, 148, 166.
 Geyer and Thompson [1992] Geyer, C. J. and Thompson, E. A. (1992) Constrained Monte Carlo Maximum Likelihood for Dependent Data. Journal of the Royal Statistical Society: Series B (Methodological), 54, 657–683.
 Granovetter [1973] Granovetter, M. S. (1973) The Strength of Weak Ties. American Journal of Sociology, 78, 1360–1380.
 Handcock [2003] Handcock, M. S. (2003) Statistical models for social networks: Inference and degeneracy. In Dynamic Social Network Modeling and Analysis (eds. R. Breiger, K. M. Carley and P. Pattison), 229–240. Washington, DC: National Academies Press.
 Handcock et al. [2008] Handcock, M. S., Hunter, D. R., Butts, C. T., Goodreau, S. M. and Morris, M. (2008) statnet: Software Tools for the Representation, Visualization, Analysis and Simulation of Network Data. Journal of Statistical Software, 24, 1548–7660.
 Hinton [2002] Hinton, G. E. (2002) Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14, 1771–1800.
 Hoff et al. [2002] Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002) Latent space approaches to social network analysis. Journal of the American Statistical Association, 97, 1090–1098.
 Huang and Butts [2020] Huang, P. and Butts, C. T. (2020) Intercounty Migration in an Immobile Era: A Network Perspective. Annual Meeting of the American Sociological Association.
 Hummel et al. [2012] Hummel, R. M., Hunter, D. R. and Handcock, M. S. (2012) Improving Simulation-Based Algorithms for Fitting ERGMs. Journal of Computational and Graphical Statistics, 21, 920–939. URL: https://doi.org/10.1080/10618600.2012.679224.
 Hunter et al. [2008] Hunter, D. R., Handcock, M. S., Butts, C. T., Goodreau, S. M. and Morris, M. (2008) ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks. Journal of Statistical Software, 24, nihpa54860.
 Hunter et al. [2012] Hunter, D. R., Krivitsky, P. N. and Schweinberger, M. (2012) Computational statistical methods for social network models. Journal of Computational and Graphical Statistics, 21, 856–882.
 Krackhardt [1988] Krackhardt, D. (1988) Predicting with networks: Nonparametric multiple regression analyses of dyadic data. Social Networks, 10, 359–382.
 Krivitsky [2012] Krivitsky, P. N. (2012) Exponential-family random graph models for valued networks. Electronic Journal of Statistics, 6, 1100–1128.
 Krivitsky [2017] — (2017) Using contrastive divergence to seed Monte Carlo MLE for exponentialfamily random graph models. Computational Statistics & Data Analysis, 107, 149–161.
 Krivitsky and Butts [2013] Krivitsky, P. N. and Butts, C. T. (2013) Modeling valued networks with statnet. The Statnet Development Team.
 Krivitsky et al. [2012] Krivitsky, P. N., Handcock, M. S. and Hunter, D. R. (2012) Package ‘ergm.count’. R package documentation.
 Lubbers and Snijders [2007] Lubbers, M. J. and Snijders, T. A. B. (2007) A comparison of various approaches to the exponential random graph model: A reanalysis of 102 student networks in school classes. Social Networks, 29, 489–507.
 Morris et al. [2008] Morris, M., Handcock, M. S. and Hunter, D. R. (2008) Specification of exponentialfamily random graph models: Terms and computational aspects. Journal of Statistical Software, 24, 1–24.
 Nowicki and Snijders [2001] Nowicki, K. and Snijders, T. A. B. (2001) Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96, 1077–1087.
 Robins et al. [1999] Robins, G. L., Pattison, P. E. and Wasserman, S. (1999) Logit models and logistic regressions for social networks, III. Valued relations. Psychometrika, 64, 371–394.
 Schmid and Desmarais [2017] Schmid, C. S. and Desmarais, B. A. (2017) Exponential random graph models with big networks: Maximum pseudolikelihood estimation and the parametric bootstrap. In 2017 IEEE International Conference on Big Data (Big Data), 116–121.
 Simpson et al. [2013] Simpson, S. L., Bowman, F. D. and Laurienti, P. J. (2013) Analyzing complex functional brain networks: Fusing statistics and network science to understand the brain. Statistics surveys, 7, 1–36.
 Snijders [2002] Snijders, T. A. B. (2002) Markov Chain Monte Carlo Estimation of Exponential Random Graph Models. Journal of Social Structure, 3, 1–40.
 Strauss and Ikeda [1990] Strauss, D. and Ikeda, M. (1990) Pseudolikelihood Estimation for Social Networks. Journal of the American Statistical Association, 85, 204–212.
 Ulibarri and Scott [2017] Ulibarri, N. and Scott, T. A. (2017) Linking Network Structure to Collaborative Governance. Journal of Public Administration Research and Theory, 27, 163–181.
 U.S. Census Bureau [2018] U.S. Census Bureau (2018) County-to-County Migration Flows: 2011–2015 ACS. https://www.census.gov/data/tables/2015/demo/geographicmobility/countytocountymigration20112015.html. Accessed: 1/15/2019.
 Vega Yon et al. [2021] Vega Yon, G. G., Slaughter, A. and de la Haye, K. (2021) Exponential random graph models for little networks. Social Networks, 64, 225–238.
 Vu et al. [2013] Vu, D. Q., Hunter, D. R. and Schweinberger, M. (2013) Modelbased clustering of large networks. The Annals of Applied Statistics, 7, 1010 – 1039.
 Wang et al. [2009] Wang, P., Robins, G. and Pattison, P. (2009) PNet: Program for the Simulation and Estimation of Exponential Random Graph (p*) Models. The University of Melbourne.
 Ward et al. [2013] Ward, M. D., Ahlquist, J. S. and Rozenas, A. (2013) Gravity’s Rainbow: A dynamic latent space model for the world trade network. Network Science, 1, 95–118.
 Windzio [2018] Windzio, M. (2018) The network of global migration 1990–2013. Social Networks, 53, 20–29.
 Windzio et al. [2019] Windzio, M., Teney, C. and Lenkewitz, S. (2019) A network analysis of intra-EU migration flows: how regulatory policies, economic inequalities and the network topology shape the intra-EU migration space. Journal of Ethnic and Migration Studies, 1–19.
 Zipf [1946] Zipf, G. K. (1946) The P1 P2/D Hypothesis: On the Intercity Movement of Persons. American Sociological Review, 11, 677–686.
8 Appendix
8.1 Performance on a larger network with small edge variance
We also conducted the experiment on coarsened migration counts among the 100 North Carolina counties, following the same procedure as in the main experiments. Having found that the variance of edge values makes a substantial difference in the performance of estimation methods for valued ERGMs, this experiment, in comparison with the small-edge-variance Utah migration data, aims to study the influence of network size. (Ideally, we would also experiment with large networks with large edge variance; unfortunately, this proved prohibitively computationally expensive for a simulation study, and we were unable to pursue it.) The North Carolina migration network has 100 nodes, and its binary density of 0.36 is lower than that of the Utah migration network. Edge values range from 0 to 4, obtained by taking the base-10 logarithm of the migration count and rounding to the nearest integer (standard deviation 0.81). We made three modifications to the setting used for the small-edge-variance Utah case, both to better fit the North Carolina data and as robustness checks. Since the larger network has lower binary density, we use the default TieDyad sampling proposal for the MCMC in MCMLE estimation (which favors sampling nonzero edges for toggling), whereas the Utah experiments used the random sampling proposal. We also remove the nonzero term and replace it with a dependence term: a valued version of mixed2star, which calculates the total “migration current” flowing through (into and out of) each node [26]. Lastly, the larger network contains more dyads, so for the MPLE we set the edge sample size at 1000, 5000, and 9900, i.e., 10%, 50%, and 100% of the total dyads, respectively.
Figures 5 and 6 show the results of the experiment on the small-edge-variance large network. The results generally resemble those for the small-edge-variance small network. Figure 5 indicates that all methods introduce very little numerical bias; the variance of the MPLE decreases as more edges are sampled, and the full MPLE and MCMLE are the most efficient. RMSE, again, follows a distribution very similar to that of the variability, suggesting that bias contributes little to the total error. Panel D in Figure 6 shows that CD also greatly overestimates the uncertainty of all covariates in the large network; the MPLE again slightly overestimates the uncertainty of the non-dependence terms, but underestimates the uncertainty of both dependence terms. As a result, CD's 95% confidence intervals cover the true value all of the time, while the MPLE overcovers the true values for the non-dependence terms and undercovers those of the dependence terms. MCMLE, again, offers high-quality uncertainty estimates and confidence coverage. Lastly, the wall-clock times suggest that MCMLE remains reasonably fast, and that CD is the fastest method. The main difference from the small-network case is that the MPLE is relatively slower, because of the larger sample of edge variables. The larger sample size also reduces the bias of the MPLE, making the 1000-dyad MPLE less biased than CD. Overall, we conclude that in our experiment, network size makes little difference in the performance of estimation methods for valued ERGMs.
8.2 Performance of MPLE with Different Sampling Methods
Figure 7 shows the accuracy, as measured by RMSE, of the MPLE using three edge-variable sampling methods in the small-edge-variance case. The methods are random sampling, which is the one reported in the Results section; TNT (“tie–no-tie”) sampling, which places equal weight on sampling a zero-valued (empty) edge and a nonzero-valued edge; and “flatval” sampling, which places equal weight on each possible edge value. The figure suggests that the three sampling methods introduce similar levels of error, with slightly lower accuracy for flatval sampling when 406 dyads are sampled. We observed similar agreement on the other dimensions we evaluate, and conclude that the sampling method does not materially alter the performance of the MPLE for valued-ERGM estimation.
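For concreteness, the three weighting schemes can be sketched as follows. This is a simplified reading of the description above (function names and the with-replacement draw are ours; the internal implementations may differ):

```python
import random
from collections import Counter

def dyad_weights(values, scheme):
    """Per-dyad sampling weights for MPLE edge-variable subsampling.

    values: dict mapping dyad -> observed edge value.
    'random' is uniform over dyads; 'tnt' splits total weight evenly
    between zero-valued and nonzero-valued dyads; 'flatval' splits total
    weight evenly across the distinct observed edge values.
    """
    if scheme == "random":
        return {d: 1.0 / len(values) for d in values}
    if scheme == "tnt":
        n_nonzero = sum(v > 0 for v in values.values())
        n_zero = len(values) - n_nonzero
        return {d: 0.5 / (n_nonzero if v > 0 else n_zero)
                for d, v in values.items()}
    if scheme == "flatval":
        counts = Counter(values.values())  # how many dyads hold each value
        return {d: 1.0 / (len(counts) * counts[v])
                for d, v in values.items()}
    raise ValueError(f"unknown scheme: {scheme}")

def sample_dyads(values, n, scheme, rng):
    """Draw n dyads (with replacement, for simplicity) under a scheme."""
    dyads = list(values)
    w = dyad_weights(values, scheme)
    return rng.choices(dyads, weights=[w[d] for d in dyads], k=n)
```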
8.3 Estimation Quality of Contrastive Divergence with Different Parameters
Section 5 showed that Contrastive Divergence (CD) is uninformative in standard-error estimation for the small-edge-variance data and uninformative in coefficient estimation for the large-edge-variance data. To determine whether tuning CD's parameters would improve its performance, we varied its two main parameters and examined the results. These parameters are steps, the number of Metropolis-Hastings steps, and multiplicity, the number of proposals per step. The default setting for CD in the ergm.count package is 8 steps and multiplicity 1, which was the setting reported in Section 5. Here we tried increasing either or both parameters tenfold. Table 1 shows calibration and timing of the CD variants compared to MCMLE for the small-edge-variance data. It suggests that increasing steps does help reduce CD's overestimation of uncertainty (while increasing multiplicity helps almost not at all), yet CD still overestimates the standard errors by more than 2.5 times. Meanwhile, MCMLE achieves much better calibration while consuming a similar amount of time. Hence, we conclude that for the small-edge-variance data, it is faster and better to remedy the conservatism of CD by following it with MCMLE than to increase CD's steps or multiplicity.
Table 2 reports absolute relative bias (in percent) and wall-clock time of the CD variants compared to MPLE with 406 sampled dyads and no parallel computing, for the large-edge-variance data. It indicates that increasing the steps or multiplicity of CD does not consistently reduce the bias for each covariate. The table also shows that every CD variant has much greater bias than MPLE on every covariate, while MPLE's computation time is on the same scale as CD's and can be reduced substantially by using more than one processor. MPLE, compared to CD, therefore offers much better coefficient estimation without a time penalty for the large-edge-variance data. In summary, this section shows that increasing the tuning parameters does not meaningfully improve CD, either in its overestimation of uncertainty for the small-edge-variance data or in its large coefficient-estimation bias for the large-edge-variance data.
Table 1: Standard-error calibration and wall-clock time of CD variants and MCMLE (small-edge-variance data).

                      Contrastive Divergence            MCMLE
Steps                 8      80     8      80
Multiplicity          1      1      10     10
Sum                   2.50   1.32   3.05   2.05     0.03
Nonzero               2.05   1.25   2.82   0.96     0.09
Edgecov               2.55   1.34   3.09   2.03     0.04
Nodeicov              2.34   1.30   3.01   1.87     0.02
Nodeocov              2.37   1.23   3.05   1.90     0.01
Mutual                2.03   1.13   2.85   1.61     0.01
Wall-clock time (s)   1.08   11.68  3.42   46.58    13.55
Table 2: Absolute relative bias (%) and wall-clock time of CD variants and MPLE (406 sampled dyads, one core) for the large-edge-variance data.

                      Contrastive Divergence            MPLE
Steps                 8      80     8      80       (406 dyads,
Multiplicity          1      1      10     10        1 core)
Sum (%)               87.6   129.2  137.6  101.3     0.2
Nonzero (%)           377.7  13.3   4.9    108.5     0.1
Edgecov (%)           94.2   186.7  110.2  70.8      0.1
Nodeicov (%)          68.4   119.1  114.8  89.7      0.1
Nodeocov (%)          68     116.3  116.6  89.5      0.1
Mutual (%)            76.9   237.5  65.1   37.1      0.4
Wall-clock time (s)   31.8   39.7   34.7   103.3     79.7
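The roles of the steps and multiplicity parameters can be sketched with a generic contrastive-divergence gradient step for an exponential family. This is a simplified Python illustration, not the ergm.count code: the function names are ours, and we treat multiplicity simply as extra Metropolis-Hastings proposals per step, whereas the actual package's proposal machinery is more elaborate.

```python
import numpy as np

def cd_gradient(theta, y_obs, stats, propose, log_ref,
                steps=8, multiplicity=1, rng=None):
    """One contrastive-divergence gradient estimate for a model
    p(y) proportional to exp(theta . stats(y) + log_ref(y)).

    Illustrative sketch: starting from the observed network y_obs,
    run `steps` Metropolis-Hastings steps, each considering
    `multiplicity` proposals; the CD gradient is the difference
    between observed and sampled sufficient statistics.
    """
    rng = rng or np.random.default_rng()
    y = y_obs.copy()
    logp = lambda s: theta @ stats(s) + log_ref(s)
    cur = logp(y)
    for _ in range(steps):
        for _ in range(multiplicity):
            y_prop = propose(y, rng)
            new = logp(y_prop)
            # Standard Metropolis accept/reject on the log scale.
            if np.log(rng.random()) < new - cur:
                y, cur = y_prop, new
    return stats(y_obs) - stats(y)
```

Because the chain is restarted at the observed network and run only briefly, CD's gradient (and hence its implied Fisher information) is biased toward the data, which is consistent with the miscalibrated standard errors observed above; longer runs (more steps) shrink, but do not eliminate, that bias.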