1 Introduction
The exponential family random graph modeling (ERGM) framework (holland1981exponential; frank1986markov; snijders2006new; hunter2006inference), known in older work as the $p^*$ framework (wasserman1996logit), has emerged as an important approach to the statistical analysis of social network data, providing a highly general way of specifying distributions on graphs and allowing the complex dependence structure of edges in a network to be specified in terms of local structural properties (robins2007recent). A wide variety of features have been proposed as potential instantiations of the different types of driving forces governing the formation of social networks (morris2008specification), with the potential to accommodate an increasingly rich range of network types. At the same time, however, poor specifications can lead to unrealistic model behavior (handcock2003assessing; schweinberger2011instability), and in practice considerable domain expertise can be required to select terms that implement the correct dependence structure for a specific setting (lusher2013exponential).
Choosing among competing model specifications can be viewed as a model selection problem, making information criteria such as the Akaike Information Criterion (AIC) (akaike1973information) and the Bayesian Information Criterion (BIC) (schwarz1978estimating) natural choices for adjudication. However, these criteria rely on a number of theoretical assumptions that are frequently problematic in a network modeling context. First, edge variables in typical network models are non-independent, making it difficult to determine the effective sample size needed for size-corrected AIC and BIC calculations (hunter2008goodness); indeed, at this time the theoretical justification for these criteria is unclear in the case of models for single networks with dyadic dependence (though see kolaczyk2015question; schweinberger2017exponential, for some possible directions). Second, likelihood calculations for complex ERGMs rely on stochastic approximations (e.g., bridge sampling) that become expensive for large networks and in settings where high precision is needed. This is not a barrier to parameter estimation (which typically relies on quantities, such as likelihood ratios between identically specified models with similar parameters, that can be computed more precisely), but it makes fine distinctions between the likelihoods of similarly performing but differently specified models difficult.
The theoretical (if not computational) challenges noted above can be avoided in a Bayesian context by performing model selection via Bayes factors (raftery1995bayesian). ERGM applications to date require fairly expensive, high-quality posterior simulation using methods such as the Reversible Jump Exchange Algorithm (caimo2013bayesian), a tailored version of the conventional Reversible Jump MCMC algorithm (green1995reversible) combined with the exchange algorithm (Murray:2006:MDD:3020419.3020463) by caimo2011bayesian to deal with the double intractability of the posterior density, as the ERGM likelihood cannot in general be computed analytically. This approach has the advantage of being theoretically principled in the fixed-sample-size setting, but the need to obtain a high-quality approximation of the Bayes factor can be computationally demanding. Promising directions in this area include efforts like those of bouranis2018bayesian to reduce the computational cost by adjusting the posterior samples yielded by an analytically tractable approximation of the ERGM likelihood, e.g., the pseudolikelihood. However, the magnitude adjustment step of this approach requires approximating the true ERGM likelihood evaluated at the MLE, which raises the same issue encountered in the calculation of the AIC/BIC; thus, an efficient and easily used method for obtaining Bayes factors for general ERGMs remains elusive. Moreover, the Bayes factor itself is not always an ideal tool for model selection. The Bayes factor (and its multi-model generalizations) answers the question, “which of a set of proposed models is more likely to be the true data generating process?” assuming that the models being evaluated are a priori equally probable and that one of them is correct. While the equal probability assumption (i.e., uniform model priors) can be adjusted, the hidden assumption that one proposed model is correct (or, at least, that among incorrect models the model “more likely” to be correct is also “better”) is not entirely innocent (Bernardo94; spiegelhalter2002bayesian). When no available model is correct, the model preferred by the Bayes factor may or may not have other desirable properties (e.g., better predictive performance for some task of interest), and indeed the Bayes factor may heavily weight aspects of model performance that are not, in practice, those most valued by the analyst. Relatedly, the Bayes factor can be sensitive to model features, such as the tail weight of the parameter priors, that are often chosen on semi-arbitrary grounds (and that in practice often have little impact on estimation). This creates the risk that model selection will be unduly influenced by choices of the analyst that are difficult to constrain and that are otherwise of minimal substantive importance. Moreover, the approach is only applicable to Bayesian inference, which is not at present widely used for ERGMs due to computational challenges. Thus, while the Bayes factor can be an important tool in the network analyst’s arsenal, it also poses considerable difficulty in practice.
As an attempt to compensate for the difficulties of likelihood-based criteria, goodreau2007advances; hunter2008goodness introduced graph-level “goodness-of-fit” (GOF) plots to assess the fit of ERGMs, in which several graph-level statistics (e.g., degree distributions, edgewise shared partner distributions) of observed networks are compared against those of networks simulated from the estimated model. The underlying idea is that draws from a fitted ERGM should have structural properties similar to those of the observed network, and, in particular, that network properties not explicitly used to fit the model (“out-of-model” properties) should be reproduced by those that were (“in-model” properties). The properties used to assess a model are usually chosen on substantive grounds, though some efforts have been made to suggest relatively “generic” statistics of broad utility (e.g., hunter2008goodness; wang2013exponential1; wang2013exponential2; shore2015spectral). While this approach has proven useful in practice, it is properly a model adequacy checking strategy rather than a model selection strategy: it provides ways to identify performance deficiencies in a chosen model, but it does not provide a general rubric for choosing among competing models. Likewise, the GOF approach is not designed to provide strong information regarding the predictive performance of a fitted model. Rather, it only answers the question of how well networks drawn from a model fit to a specific data set reproduce other features of that data set. This is useful for detecting when a fitted model is incapable of producing realistic behavior, but it does not establish that the model will predict well (either in the context of extrapolation to new structures or interpolation of held-out data).
While it could perhaps be argued that predictive performance is not always a major concern for ERGMs, lack of predictive power at the very least suggests limitations of a model that should be borne in mind when using it. Moreover, predictive performance is clearly a consideration in many applications. For example, studies of international conflict (hoff2004modeling; maoz2006structural) or bill cosponsorship (fowler2006connecting; cranmer2011inferential) are concerned with consequential relations for which predictions are of significant interest. These could include, e.g., the ability to forecast future network states from past network states (or merely from covariates), or conditional prediction of edge states given covariates and/or information on other edges. In the latter case, performance involving specific edge states (as opposed, e.g., to unlabeled properties such as the degree variance) is of obvious importance: it matters whether a predicted conflict is between, say, China and the United States versus Bulgaria and Croatia. A model that successfully reproduced the degree distribution for a given year’s conflict network might not perform well at predicting the degree distribution for the next year’s network, nor at predicting who will be in conflict with whom. Such a model might be judged satisfactory via conventional adequacy checks, but its value for understanding global conflict would be questionable at best.
The above suggests more explicit predictive metrics as potentially useful tools for model selection (as well as model assessment). In many fields, cross-validation (CV) techniques have been fruitfully used in this role, allowing one to assess how well the predictions from a model generalize to a new data set; flexible, easily understood, and able to be linked directly with performance outcomes of substantive interest, CV methods are well-adapted to model selection (see arlot2010survey, for a recent review). Typically, CV divides the data set into a training set (to which the model is fit) and a test set (against which the fitted model’s predictions are evaluated under prespecified loss functions), selecting the model with the smallest estimated loss. Many variants of this procedure exist (e.g., leave-one-out CV, $k$-fold CV, bootstrap CV, etc.), but all share the common feature of assessing predictive performance on a data subset that was held out during parameter estimation. Classical CV for regression models with independent and identically distributed (i.i.d.) data was proposed as early as geisser1975predictive, and CV procedures have been tailored for the purpose of performance evaluation in latent variable modeling of relational data (e.g., hoff2008modeling; dabbs2016comparison; li2016network; chen2018network), where edge variables are conditionally independent given latent variables and hence can be straightforwardly held out. Likewise, there is work on applying CV to ERGMs where multiple networks from the same population model are available, and entire networks can be held out (stewart2019multilevel). Such work suggests considerable potential utility in applying CV to the more typical setting of general ERGMs on single graph realizations. A difficulty with standard CV techniques in a general ERGM setting is the direct dependence of edge variables, which makes it impossible to simply omit edge variables without changing the underlying probability model.^{1} Intuitively, the presence of an edge variable in such a model is itself informative, and this information must be retained for predictions to be meaningful. This same issue arises in the context of ERGM estimation from networks with missing edge data, where the presence of edge variables must still be accounted for even when their states are unknown.

^{1} This is related to the inconsistency of dependence models under naive subsampling, when the presence of unmeasured vertices is not accounted for (shalizi2013consistency); procedures that do allow consistent estimation are discussed by schweinberger2017exponential.
handcock2010modeling introduced an estimation scheme for handling such data in the case of ignorable missingness, which resolves this difficulty by integrating over the unknown states of the missing edge variables (and thereby preserving the impact of their interactions with the variables whose states are observed). The missing data case suggests the key to obtaining CV-like procedures for ERGMs: while one cannot meaningfully hold out edge variables, one can hold out the states (i.e., whether a given edge variable contains an edge or a null) of edge variables, retaining their presence but treating them as missing data. Building on this intuition, wang2016multiple proposed a held-out evaluation scheme for evaluating model-based imputation that was analogous to CV, which they dubbed Held-Out Predictive Evaluation (HOPE). Unlike CV, the edge variables in the validation set under HOPE are only marked as missing (i.e., NA) in the model training phase, instead of being completely eliminated. Thus, the trained model accounts for the presence of the edge variables, but it is not given information on their states. Testing is then performed by predicting the held-out edge states from the fitted model, conditional on the edge variables that are not held out. Since the core techniques needed to perform this procedure (estimation with missing edge data and conditional simulation) are supported in standard ERGM software, it can be used without the need for custom software implementations or other special considerations.
While HOPE was originally introduced in the context of imputation assessment, it is a general CV analogue, and the idea of holding out a portion of the data as missing can be used for a wide range of evaluation tasks. For instance, koskinen2018outliers proposed a model-based approach for the identification of influential nodes in a network by assessing the sensitivity of the estimated parameters when all edge variables associated with the corresponding node are held out. In this paper, we introduce the use of HOPE to perform model selection for ERGMs, with an emphasis on simple metrics and procedures that are applicable to a wide range of network data. Using HOPE, researchers can gain information on how well the model is able to predict held-out portions of the data from other edge observations; where the model performs poorly on held-out data, which can point to weaknesses in parameterization; and how one model’s predictive performance compares to others. Because such predictive assessments automatically correct for overfitting (which by definition improves in-sample performance while harming out-of-sample performance), they can be quantitatively compared across specifications in a way that other metrics often cannot. Taken together, these assessments can assist researchers in improving models and facilitate the comparison of fit across multiple models, thereby supporting model selection.
The remainder of the paper is structured as follows. In section 2, we review the ERGM framework, with special emphasis on the estimation of ERGMs in the presence of missing data, along with the notation necessary for the proposed methodology. In section 3, we elaborate on the idea of HOPE and introduce two general strategies for holding out data, along with a set of metrics that measure a model’s performance based on several structural features and at different granularities. Section 4 provides two data examples that demonstrate the effectiveness of HOPE and compare it to existing approaches. Finally, we conclude with a discussion and potential directions for future work.
2 ERGM estimation and simulation process
Consider a random graph $G = (V, Y)$ of fixed order $n$, where $V$ is the node (vertex) set and $Y$ is the stochastic adjacency matrix of $G$, such that $Y_{ij}$ is the variable representing the state of an edge variable from node $i$ to node $j$. For valued networks, edge variables represent the edges’ strengths or frequencies, e.g., trade flows between countries or numbers of interactions between individuals. In most social networks, self-edges are absent or not of substantive interest, and therefore we assume $Y_{ii} = 0$ in this paper (though ERGMs can still be fit to networks with loops, and HOPE is still applicable). Lower case $y$ will represent the realized value of the edge variable matrix. Letting $\mathcal{Y}$ be the set of all possible values of such matrices, an ERGM form for the probability mass function of $Y$ is given by
$$P_{\theta}(Y = y) = \frac{h(y)\exp\{\eta(\theta)^{\top} g(y)\}}{\kappa(\theta)}, \qquad y \in \mathcal{Y}, \qquad (2.1)$$

where

$$\kappa(\theta) = \sum_{y' \in \mathcal{Y}} h(y')\exp\{\eta(\theta)^{\top} g(y')\}. \qquad (2.2)$$
The user-defined statistic $g(y)$ can incorporate any finite set of network statistics, ranging from exogenous covariate effects (e.g., homophily based on nodes’ characteristics) to endogenous network effects (e.g., transitive closure), represented as counts of the corresponding local features in the observed network. The model parameter vector is $\theta$, and its mapping to the canonical parameters is $\eta(\theta)$. For notational convenience, we assume that all covariates are implicitly incorporated into $g$. The reference measure $h$ determines the support and the basic shape of the corresponding ERGM distribution (see krivitsky2012exponential, for an illustration). While the above can be applied to a wide range of relational data types, our HOPE procedure is generally applicable to any network model for which the estimation of parameters in the presence of missing data is feasible. To avoid notational and related complications caused by valued edges, however, we illustrate the key idea of HOPE using canonical binary ERGMs with counting measure (i.e., $h(y) = 1$ and $\eta(\theta) = \theta$). While various approaches can be used to estimate $\theta$ given an observation from $Y$, we focus here on maximum likelihood estimation (MLE), presently the most widely used approach. Importantly, the MLE can be obtained when part of the data (i.e., some edge states) is ignorably missing, which is a critical component of HOPE. To estimate $\theta$ where $y$ is possibly only partially observed, we follow the scheme introduced by handcock2010modeling for constructing the face-value likelihood based solely on the observed part of the data,
$$L(\theta; y_{\mathrm{obs}}) = \sum_{y' \in \mathcal{Y}(y_{\mathrm{obs}})} P_{\theta}(Y = y'), \qquad (2.3)$$

where $y_{\mathrm{obs}}$ is the observed portion of $y$, and $\mathcal{Y}(y_{\mathrm{obs}})$ represents all configurations of the missing portion of the data that are concordant with $y_{\mathrm{obs}}$. Note that (2.3) is a valid likelihood, as it correctly describes the probability of the observed data marginalized across the unobserved portion, and we can estimate the model parameters under relatively general conditions (see handcock2010modeling, for more details).
Maximizing (2.3) with respect to $\theta$ yields the MLE, $\hat{\theta}$, from which one can conduct conditional ERGM simulations (handcock2010modeling) to predict (or impute) the states of missing edge variables given the states of observed (i.e., retained) edge variables,

$$P_{\hat{\theta}}(Y_{\mathrm{mis}} = y_{\mathrm{mis}} \mid Y_{\mathrm{obs}} = y_{\mathrm{obs}}) = \frac{P_{\hat{\theta}}(Y = (y_{\mathrm{obs}}, y_{\mathrm{mis}}))}{\sum_{y' \in \mathcal{Y}(y_{\mathrm{obs}})} P_{\hat{\theta}}(Y = y')}, \qquad (2.4)$$

where $y_{\mathrm{mis}}$ denotes the states of the missing edge variables.
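To make (2.1)–(2.2) concrete, the following sketch (our illustration, not code from any ERGM package) computes the pmf of a toy ERGM by brute-force enumeration of the normalizing constant. With the edge count as the only statistic, counting reference measure, and $\eta(\theta) = \theta$, the ERGM reduces to a homogeneous Bernoulli graph with edge probability $e^{\theta}/(1 + e^{\theta})$, which the final assertion checks.

```python
# Brute-force illustration of the ERGM pmf (2.1)-(2.2) on a toy undirected
# 3-node graph, using only the edge count as the sufficient statistic g(y).
import itertools
import math

def ergm_pmf(theta, n=3):
    """Return {dyad-state tuple: probability} over all undirected graphs on n nodes."""
    dyads = list(itertools.combinations(range(n), 2))
    weights = {}
    for states in itertools.product([0, 1], repeat=len(dyads)):
        g = sum(states)                        # sufficient statistic: edge count
        weights[states] = math.exp(theta * g)  # h(y) = 1 (counting measure)
    kappa = sum(weights.values())              # normalizing constant (2.2)
    return {y: w / kappa for y, w in weights.items()}

theta = 0.5
pmf = ergm_pmf(theta)
p_edge = math.exp(theta) / (1 + math.exp(theta))
# Probability of the graph with exactly one specific edge present equals
# the corresponding Bernoulli-graph probability p(1-p)^2:
one_edge = (1, 0, 0)
bernoulli_prob = p_edge * (1 - p_edge) ** 2
assert abs(pmf[one_edge] - bernoulli_prob) < 1e-12
```

Enumeration is only feasible for tiny graphs, of course; for realistic networks the intractability of $\kappa(\theta)$ is exactly why the simulation-based methods discussed below are needed.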
The software package MPNet (wang2014mpnet) and the ergm package (hunter2008ergm) of the statnet (handcock2008statnet) software suite for R (R2018) implement simulation-based algorithms for approximating the MLE of $\theta$ under model (2.3). Additionally, both allow simulation from the estimated parameters with or without missing data (the former being used in goodness-of-fit assessments).
In this paper, we implement the HOPE procedure in R using the ergm package. To run conditional simulations from ERGMs, the observed part of the graph can be fixed using the constraints argument of the simulate.ergm function. The procedure is repeated for each set of held-out edge variables in the observed data, and the simulated networks are then evaluated using criteria that capture several aspects of predictive performance. This makes HOPE analogous to cross-validation, in that we are evaluating how well we can predict data that were not used for model training.
3 Held-Out Predictive Evaluation
In this section, we first describe two possible ways that relational data can be held out from an adjacency matrix. We then elaborate on the implementation procedure for HOPE. Finally, we introduce some generic metrics that can be used to evaluate a model’s performance at different granularities.
3.1 Hold-out strategies and general procedure
To perform HOPE for the purpose of model selection, the entire index set of edge variables, $\mathcal{I}$ (of size $n(n-1)$ if $G$ is directed, $\binom{n}{2}$ otherwise), is partitioned into $K$ subsets, $\mathcal{I}_1, \ldots, \mathcal{I}_K$, where $\bigcup_{k=1}^{K} \mathcal{I}_k = \mathcal{I}$ and $\mathcal{I}_j \cap \mathcal{I}_k = \emptyset$ for $j \neq k$. Intuitively, our procedure operates by holding out the values of edge variables in one subset while fitting to the remainder, using the resulting estimate to predict the values of the held-out data. (Compare with conventional cross-validation, in which the variables themselves, and not merely their values, are held out. While this distinction is immaterial for independence models, it is consequential for typical ERGMs.) While other schemes are also feasible, two natural options for holding out network data are random sampling of edges and removal of all edge values associated with a randomly chosen vertex. For convenience of notation, we assume below that the network size $n$ is an even number; however, generalization to the odd-numbered case is immediate.

Random edge removal: $\mathcal{I}$ is randomly divided into non-overlapping, equally-sized batches of $\binom{n}{2}/K$ edge variables (if the graph is undirected) or $n(n-1)/K$ edge variables (if the graph is directed), with each batch being held out in turn. A similar strategy was used by hoff2008modeling for the purpose of selecting the optimal number of dimensions in a latent space structure. It is worth mentioning that leave-one-out can be viewed as a special case of random edge removal with $K = |\mathcal{I}|$, as each edge variable forms a batch by itself and is held out successively.

Node removal: All edge variables involving a particular node are simultaneously held out. koskinen2018outliers used a similar strategy to identify influential observations (nodes) under the ERGM framework.
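Both schemes amount to choices of partition of the dyad index set. A small illustrative sketch (function names are ours; undirected dyads represented as pairs $(i, j)$ with $i < j$):

```python
# Sketch of the two hold-out schemes: random edge removal partitions the
# dyad index set into K near-equal folds, while node removal holds out all
# dyads incident to a chosen vertex. Assumes an undirected graph on nodes
# 0..n-1; these helpers are illustrative, not from any ERGM package.
import itertools
import random

def random_edge_folds(n, K, seed=0):
    """Partition the undirected dyad index set into K non-overlapping folds."""
    dyads = list(itertools.combinations(range(n), 2))
    rng = random.Random(seed)
    rng.shuffle(dyads)
    return [dyads[k::K] for k in range(K)]   # round-robin over the shuffled list

def node_fold(n, v):
    """All dyads incident to node v, held out together under node removal."""
    return [(min(v, u), max(v, u)) for u in range(n) if u != v]

folds = random_edge_folds(n=6, K=3)
assert sum(len(f) for f in folds) == 15      # C(6,2) dyads, none dropped
assert len(node_fold(6, 0)) == 5             # node 0 touches n-1 = 5 dyads
```

Node removal yields $n$ natural folds (one per vertex), whereas random edge removal lets the analyst trade off the number of folds $K$ against per-fold refitting cost.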
The general procedure for HOPE then proceeds as follows. For $k = 1, \ldots, K$:
1. Fit an ERGM to the partially observed data based on (2.3), with the edge variables in $\mathcal{I}_k$ marked as missing, which in turn gives the corresponding held-out-data MLE, $\hat{\theta}_{(k)}$.
2. Simulate $S$ graphs from the conditional distribution defined in (2.4), with corresponding adjacency matrices $y^{(1)}, \ldots, y^{(S)}$. Note that for any $s$ and any $(i, j) \notin \mathcal{I}_k$, $y^{(s)}_{ij} = y_{ij}$, because all edge variables outside $\mathcal{I}_k$ are fixed when simulating from the conditional distribution (2.4).
3. Evaluate the ability of the model to accurately predict the held-out data under any of several error metrics. The choice of metrics will depend on which structural features users deem substantively important for their model.
When the graph is large, a subset of the $K$ folds can be considered to reduce the computational burden, with metrics being calculated only on these subsets. In such cases, both uniformly random and stratified subset selection approaches may be considered; the former is simpler (and always applicable), but the latter may lead to a more efficient approximation of the complete-partition solution.
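The three-step loop can be mocked end-to-end. Real applications perform step 1 via an ERGM fit with the held-out edge states marked NA (e.g., in R's ergm package) and step 2 via conditional ERGM simulation; in the sketch below, both are stood in for by a homogeneous Bernoulli graph (the simplest specification considered in Section 4), so only the hold-out/refit/simulate/score structure carries over. All names are illustrative.

```python
# Minimal mock of the HOPE loop, with a Bernoulli graph standing in for the
# ERGM fit (step 1) and conditional simulation (step 2).
import itertools
import random

def hope_bernoulli(y, n, K=5, S=200, seed=1):
    """Return a smoothed marginal edge probability for each dyad, estimated
    while that dyad's state was held out."""
    rng = random.Random(seed)
    dyads = list(itertools.combinations(range(n), 2))
    rng.shuffle(dyads)
    folds = [dyads[k::K] for k in range(K)]
    phat = {}
    for fold in folds:
        held = set(fold)
        obs = [d for d in dyads if d not in held]
        p = sum(y[d] for d in obs) / len(obs)      # step 1: "fit" to observed states
        for d in fold:                             # step 2: simulate held-out states
            s = sum(rng.random() < p for _ in range(S))
            phat[d] = (s + 0.5) / (S + 1)          # smoothed marginal probability
    return phat

# Toy data: a 10-node graph with expected density 0.3
data_rng = random.Random(42)
n = 10
y = {d: int(data_rng.random() < 0.3) for d in itertools.combinations(range(n), 2)}
phat = hope_bernoulli(y, n)
assert set(phat) == set(y)   # every dyad's state was held out exactly once
```

In step 3, the resulting `phat` (and the raw simulated draws) would then be scored against the true held-out states using the metrics of the next subsection.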
3.2 Error metrics
It is natural to assess the predictive accuracy of a model at three levels: the dyad, the node, and the graph as a whole. Here, we outline a number of broadly useful choices of error metrics at each level of assessment.
3.2.1 Dyad level metrics
Dyad-level metrics focus on how well the model performs in predicting the presence and absence of edges. Our suggested measures are based on two types of edge predictions. First, we may simply consider whether an edge is predicted to be present or absent in any given simulated draw. Comparing the true states of the held-out edge variables (present/absent) with their imputed states in each iteration leads to an aggregate confusion matrix, a two-by-two table of true versus predicted values over all held-out edge variables and simulated draws. While the confusion matrix is pleasingly direct, it does not discriminate between differences in predicted probabilities and may be information-poor in sparse networks. For that reason, we also employ marginal probability estimates of edge presence, as computed from the held-out models. Specifically, for each held-out edge variable $(i,j)$, we estimate the marginal probability of an edge by $\hat{p}_{ij} = (s_{ij} + 1/2)/(S + 1)$, where $s_{ij}$ is the number of simulated draws in which the edge is present; this is equivalent to the posterior mean of the binomial probability under a Jeffreys prior. The use of a local Bayesian estimate here provides shrinkage in the case of highly sparse predictions, and ensures that predictive summaries remain well-behaved even for poor models. It is worth mentioning that the marginal probability can be calculated exactly via change scores (snijders2006new) under leave-one-out, but we do not pursue this direction, as experiments show that $\hat{p}_{ij}$ is usually a good approximation to the marginal probability when $S$ is sufficiently large (e.g., $S = 500$). From the above, we may then assess predictive performance by applying standard accuracy assessments to the confusion matrix and/or $\hat{p}$. In this paper, we employ predictive accuracies for edges (Edge ACC), nulls (Null ACC), and all edge variables (Overall ACC) based on the confusion matrix, as well as the total squared loss (TSL), $\sum_{(i,j)} (y_{ij} - \hat{p}_{ij})^2$, based on $\hat{p}$. These metrics are widely used for binary data, easily interpreted, and broadly comparable across models. As discussed below, they constitute a natural starting point for model selection.
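A minimal sketch of these dyad-level summaries, assuming the Jeffreys-smoothed estimate $\hat{p} = (s + 1/2)/(S + 1)$ for each held-out dyad (function and key names are ours):

```python
# Dyad-level summaries: accuracies from the aggregate confusion matrix and
# total squared loss (TSL) from the smoothed marginal probabilities.
# Inputs: y[d] in {0, 1} gives the true held-out state of dyad d, and
# sims[d] lists that dyad's simulated states across S conditional draws.
def dyad_metrics(y, sims):
    S = len(next(iter(sims.values())))
    conf = {(t, p): 0 for t in (0, 1) for p in (0, 1)}   # (true, predicted) counts
    tsl = 0.0
    for d, truth in y.items():
        for pred in sims[d]:
            conf[(truth, pred)] += 1
        phat = (sum(sims[d]) + 0.5) / (S + 1)            # Jeffreys posterior mean
        tsl += (truth - phat) ** 2
    edge_acc = conf[(1, 1)] / max(conf[(1, 1)] + conf[(1, 0)], 1)
    null_acc = conf[(0, 0)] / max(conf[(0, 0)] + conf[(0, 1)], 1)
    overall = (conf[(1, 1)] + conf[(0, 0)]) / (S * len(y))
    return {"EdgeACC": edge_acc, "NullACC": null_acc,
            "OverallACC": overall, "TSL": tsl}

# Two held-out dyads, four simulated draws each:
m = dyad_metrics({"a": 1, "b": 0}, {"a": [1, 1, 0, 1], "b": [0, 0, 1, 0]})
assert m["EdgeACC"] == 0.75 and m["NullACC"] == 0.75 and m["OverallACC"] == 0.75
```

Reporting Edge ACC and Null ACC separately matters in sparse networks, where a model can attain high overall accuracy by predicting nulls everywhere.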
3.3 Node level metrics
Node level metrics provide information on how well (or poorly) the model can reproduce the roles and positions of the nodes in the observed graph, when each node’s edge variables are successively held out. If a model provides a plausible approximation to the mechanism giving rise to the observed network, then one would anticipate that the discrepancy between the node level metrics of the observed and simulated networks would be minimal. However, this will likely depend on the specific node metric, since some are more sensitive than others to minor changes in structure (e.g., degree versus betweenness). Because holding out the state of even one edge variable could potentially alter the roles/positions of all the nodes in a graph, the discrepancy is here calculated for all nodes across all simulations.
Given any node-level centrality measure $\varphi$, it is natural to assess the discrepancy between the vector of observed node level indices, $\varphi(y)$, and those of the simulated networks, $\varphi(y^{(s)})$. We quantify the degree of agreement via the coefficient of reliability,

$$R_{\varphi} = 1 - \frac{\mathrm{MSE}_{\varphi}}{\mathrm{TSS}_{\varphi}}, \qquad (3.1)$$

where $\mathrm{TSS}_{\varphi} = \sum_{i=1}^{n} \left(\varphi_i(y) - \bar{\varphi}(y)\right)^2$ is the total sum of squares with respect to the node level index, measuring the overall variation in the observed values, and the mean squared error (MSE) of the node level index is given by $\mathrm{MSE}_{\varphi} = \frac{1}{nS}\sum_{s=1}^{S}\sum_{i=1}^{n} \left(\varphi_i(y^{(s)}) - \varphi_i(y)\right)^2$. The reliability metric is similar to the coefficient of determination ($R^2$) in multiple regression and the reliability measures widely employed in psychometric contexts (e.g., gulliksen1987theory).
The coefficient of reliability standardizes the amount of predictive error in a model’s node level index relative to the overall variation in the observed values; it can thus be thought of as assessing the extent to which the out-of-sample centrality scores are sufficiently accurate to discriminate between central and peripheral vertices in the observed graph. We focus on degree, betweenness, and eigenvector centralities, since these are commonly used metrics in analyzing social network structure, but other node-level indices (see, e.g., wasserman1994social) are certainly feasible. In this context, each index may be motivated as follows:
Degree centrality quantifies the number of ties that each node has to all other nodes within the network. A node with a higher degree centrality has more ties to the other nodes in the network; accuracy with respect to this index thus indicates that the model is able to distinguish more active vertices from those with fewer relationships.

Eigenvector centrality measures the degree of membership of a given node in the largest core/periphery structure in the graph.^{2} The eigenvector centrality is also the best one-dimensional approximation of the graph structure (in a least-squares sense), and accuracy in reproducing it indicates the extent to which the model is able to recover the broadest structural features of the graph.

^{2} Except in very rare cases for which the graph adjacency matrix lacks a principal eigenvalue. In such circumstances, eigenvector centrality is a signed indicator of membership in the two largest core/periphery structures (positive versus negative).
Betweenness centrality (freeman1978centrality) measures the extent to which a node lies on unique shortest paths between other nodes in the network. A node with a high betweenness centrality acts as a bridge between other parts of the network. Betweenness is highly sensitive to the path structure in the network, and accuracy on this index thus indicates a model that is able to preserve distances among nodes.
Similar to $R^2$, the $R_{\varphi}$ statistic has an upper limit of 1, and it is equal to 1 if and only if the proposed model reproduces networks that are perfectly correlated with the observed network with respect to nodes’ roles and positions. A smaller value indicates that the model is less able to recapitulate the observed network’s structure. If $R_{\varphi}$ is negative, then the predicted node-level indices deviate substantially from the observed network’s values even after adjusting for the variation across the observed values.
It should be noted that, like any other reliability metric, $R_{\varphi}$ measures relative discrimination and hence is sensitive to the underlying variation in the observed centrality score. Where there is little variation in centrality (and hence little opportunity for discrimination), it may be better to use the raw $\mathrm{MSE}_{\varphi}$ values than $R_{\varphi}$. The choice depends on whether the objective is to measure direct accuracy versus the ability to discriminate between high and low centrality nodes per se; in some cases, it may be worthwhile to examine both metrics.
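The reliability coefficient (3.1) is straightforward to compute. The sketch below uses TSS as the total sum of squares of the observed index and MSE as the squared prediction error averaged over nodes and simulations (our reconstruction, consistent with the $R^2$ analogy in the text; names are ours):

```python
# Reliability coefficient R_phi = 1 - MSE_phi / TSS_phi for a node-level
# centrality index phi, given observed values and simulated replicates.
def reliability(obs, sims):
    """obs: length-n vector of observed centralities;
    sims: list of length-n vectors, one per simulated network."""
    n = len(obs)
    mean_obs = sum(obs) / n
    tss = sum((o - mean_obs) ** 2 for o in obs)          # variation in observed index
    mse = sum((s[i] - obs[i]) ** 2
              for s in sims for i in range(n)) / (n * len(sims))
    return 1.0 - mse / tss

obs = [1.0, 2.0, 3.0, 4.0]
assert reliability(obs, [obs, obs]) == 1.0    # perfect reproduction
assert reliability(obs, [[2.5] * 4]) < 1.0    # constant prediction scores lower
```

Note that a model predicting the observed mean for every node earns $R_{\varphi} < 1$ but not necessarily $R_{\varphi} \le 0$, since the MSE here averages over nodes and simulations rather than being centered like the TSS.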
3.4 Graph level metrics
A third class of selection metrics quantifies how well (or poorly) the proposed model reproduces the observed network’s graph-level structural features. Specifically, we here measure the discrepancies between the observed network’s and the simulated networks’ betweenness and degree centralization indices (freeman1978centrality). The centralization score is defined by

$$C_{\varphi}(y) = \sum_{i=1}^{n} \left[\max_{j} \varphi_j(y) - \varphi_i(y)\right],$$

with $\varphi$ being any generic centrality score. $C_{\varphi}(y)$ can be rewritten as $n\left[\max_{j} \varphi_j(y) - \bar{\varphi}(y)\right]$, clarifying that the centralization is proportional to the difference between the maximum and mean centrality scores (i.e., the extent to which one vertex is much more central than average). In this paper, all centralization scores are normalized by the corresponding theoretical maximum centralization scores.
For any generic graph level metric $\psi$, we consider the root mean squared error (RMSE) as the discrepancy measure,

$$\mathrm{RMSE}_{\psi} = \sqrt{\frac{1}{S}\sum_{s=1}^{S}\left(\psi(y^{(s)}) - \psi(y)\right)^2}. \qquad (3.2)$$
The RMSEs for degree and betweenness centralization then respectively indicate the extent to which the model is able to capture the upper tails of these two distributions. Poor performance on these measures may suggest that the model lacks the ability to identify nodes that deviate substantially from the norm, a possible indicator that additional covariates are required (assuming that such identification is desired).
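As an illustration, the following sketch computes Freeman degree centralization, normalized by its theoretical maximum for an undirected graph (attained by the star, giving $(n-1)(n-2)$), together with the RMSE of (3.2); function names are ours.

```python
# Graph-level discrepancy: Freeman degree centralization (normalized by the
# star-graph maximum) compared between observed and simulated networks.
import math

def degree_centralization(adj):
    """adj: symmetric 0/1 matrix as a list of lists (no self-loops)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    raw = sum(max(deg) - d for d in deg)   # sum of gaps below the maximum degree
    return raw / ((n - 1) * (n - 2))       # max value attained by the star graph

def rmse(obs_stat, sim_stats):
    """Root mean squared error of a graph-level statistic across simulations."""
    return math.sqrt(sum((s - obs_stat) ** 2 for s in sim_stats) / len(sim_stats))

# A 4-node star is maximally centralized:
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
assert degree_centralization(star) == 1.0
```

The same `rmse` helper applies unchanged to betweenness centralization or any other graph-level statistic $\psi$.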
3.5 Remarks on Metrics and Hold-out Strategies
The proposed metrics cover a wide range of predictive outcomes at multiple scales, but they are not exhaustive. Our aim is to provide a suite of simple measures that are of general use. Importantly, HOPE can easily be adapted to assess the quality of models based on alternative structural features or error metrics. In fact, we encourage researchers to carefully consider their modeling objectives when deciding which metrics to use, as well as which hold-out strategy to employ. Alternative metrics may provide distinct or even contrasting information, since particular models may be better at some prediction tasks than others. Moreover, different ways of holding out data may yield distinct results, since each both provides different information to the model during the estimation phase and poses a different prediction task during the imputation phase. Comparing performance across tasks thus gives useful insights into a model’s strengths and weaknesses.
3.6 Efficient implementation of HOPE in parallel
It is worth noting that the proposed HOPE procedure is highly parallelizable: once the edge variables are partitioned, each subset can be run independently. Specifically, we define the running time of HOPE as a function of the number of available computing cores $c$ and the number of folds $K$,

(3.3)

(3.4)

where the ERGM fitting time depends upon the model complexity and also on the number of held-out edge variables. In fact, the ERGM fitting time dominates the total run time, and hence running in parallel can lead to an approximately $\min(c, K)$-fold reduction.
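The fold-level parallelism can be sketched as follows; `run_fold` is a hypothetical stand-in for the per-fold fit-and-simulate step (which in practice would invoke an ERGM fit, e.g., via an R session), and the executor simply distributes the $K$ independent folds over the available workers.

```python
# Sketch of fold-level parallelism for HOPE: folds are independent once the
# partition is fixed, so they can be dispatched concurrently.
from concurrent.futures import ThreadPoolExecutor

def run_fold(k):
    # Hypothetical stand-in for: fit an ERGM with fold k's edge states set
    # to NA, simulate S conditional draws, and return the resulting scores.
    return {"fold": k, "score": 0.0}

def hope_parallel(K, workers):
    # map() preserves fold order in the returned results regardless of the
    # order in which workers finish.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_fold, range(K)))

results = hope_parallel(K=4, workers=2)
assert [r["fold"] for r in results] == [0, 1, 2, 3]
```

A thread pool suffices when each fold blocks on an external process (such as an R subprocess); CPU-bound Python implementations would use a process pool instead.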
4 Case Studies
In this section, we examine the effectiveness of HOPE as a tool for model selection using two well-known data sets from the social network literature: Lazega’s collaboration network between lawyers (lazega1999multiplexity; lazega2001collegial) and an adolescent friendship network of girls from the “Teenage Friends and Lifestyle Study” data set (michell2000smoke). Table 1 shows that these two networks differ substantially in their densities and degree distributions: the collaboration network is denser and has a less skewed degree distribution, while the friendship network is sparser and has a more skewed degree distribution. These differences pose distinct challenges for HOPE, as shown below, while also demonstrating the general utility of the method.
Lazega Lawyer  Teenage Friends  
Network size  36  50 
Number of edges  115  74 
Directed  No  No 
Network density  0.18  0.06 
Mean degree  6.39  2.96 
SD degree  4.18  1.83 
Skewness degree  0.29  0.65 
Number of isolates  2  3 
Transitivity  0.39  0.42 
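The density and mean-degree entries in Table 1 follow from the standard formulas for undirected networks, which can be checked directly (a Python sketch; the function names are ours):

```python
def density(n_nodes, n_edges):
    """Undirected density: observed edges over n(n-1)/2 possible dyads."""
    return n_edges / (n_nodes * (n_nodes - 1) / 2)

def mean_degree(n_nodes, n_edges):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2 * n_edges / n_nodes

# Lazega collaboration network: 36 partners, 115 edges
print(round(density(36, 115), 2), round(mean_degree(36, 115), 2))  # -> 0.18 6.39
# Teenage friendship network: 50 girls, 74 edges
print(round(density(50, 74), 2), round(mean_degree(50, 74), 2))    # -> 0.06 2.96
```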
4.1 Collaboration Network Between Lawyers
Lazega’s lawyer data contain the collaboration relationships between partners in a New England law firm and have been used in a number of illustrative analyses (e.g., snijders2006new; hunter2006inference; caimo2013bayesian). The data set includes several nodal covariates: the Office that the lawyer works at (out of three, 1=Boston; 2=Hartford; 3=Providence); the Practice of the lawyer (out of two, 1=litigation; 2=corporate); the Gender (1=Male; 2=Female) of the lawyer; and each lawyer’s Seniority within the firm, which is their rank order of entry into the firm (lower numbers indicate more senior partners, ranging from 1 to 36). The summary statistics in Table 1 indicate that the degree distribution is relatively even and not highly skewed. Additionally, Figure 1 illustrates a high level of homophily based on office and (to a lesser extent) practice.
We consider five competing model specifications, ranging from the simplest stochastic model in which all edge variables are hypothesized to be i.i.d. realizations of a Bernoulli random variable (i.e., a homogeneous Bernoulli graph) to models that separately contain dyadic dependence terms and covariate terms. The final two models (Model 4 and Model 5) contain both types of terms, and in particular, Model 5 specifies a more granular relationship between nodes’ covariates. These models are listed below as the terms used in the ergm function (see the Appendix for more details on these network statistics):
Model 1 : edges

Model 2 : edges + gwesp(0.75, fixed=T)

Model 3 : edges + nodecov("Seniority") + nodecov("Practice") + nodematch("Practice") + nodematch("Gender") + nodematch("Office")

Model 4 : edges + gwesp(0.75, fixed=T) + nodecov("Seniority") + nodecov("Practice") + nodematch("Practice") + nodematch("Gender") + nodematch("Office")

Model 5 : edges + gwesp(0.75, fixed=T) + nodecov("Seniority") + nodecov("Practice") + nodematch("Practice", diff=T, keep=2) + nodematch("Gender", diff=T) + nodematch("Office", diff=T, keep=c(1,2))
The models above incrementally build upon the preceding models to generate several competing models between which HOPE can adjudicate. Model 2 builds upon model 1 by accounting for the dependence among edge variables with the inclusion of the geometrically weighted edgewise shared partner (GWESP) statistic with a fixed decay parameter of 0.75, which captures the tendency for local clustering. Similarly, model 3 builds upon model 1 by including nodal covariates that capture the effects of nodes’ attributes on their propensities to form ties overall and with similar types of nodes (though this is still a dyadic independence model). Model 4 contains all of the terms from the previous three models. Model 5 contains all of the terms in model 4 as well as additional terms for differential homophily on three of the nodes’ attributes: Gender, Office, and Practice. This final model moves towards a saturated model by replacing the uniform homophily effects with differential homophily effects: the former generates a single network statistic counting ties between nodes that are equal on a given attribute, while the latter generates a separate network statistic for each value of that attribute. These models represent a range of specifications that one could consider for this network. While other models could be posited, we employ these to illustrate how HOPE can adjudicate between possible specifications.
Model 1  Model 2  Model 3  Model 4  Model 5  
Edges  
GWESP, decay = 0.75  
Main Seniority  
Main Practice  
Homophily Practice  
Homophily Gender  
Homophily Office  
Homophily Practice (Litigation)  
Homophily Gender (Male)  
Homophily Gender (Female)  
Homophily Office (Boston)  
Homophily Office (Hartford)  
AIC  600.8  524.3  513.8  472.0  470.0 
BIC  605.2  533.2  540.5  503.2  510.0 
Table 2 provides a summary of the estimated parameters along with the nominal AIC/BIC when each model is fit to the fully observed data. Based on the models fit to the full network, model 4 is preferred according to the nominal BIC and model 5 by the AIC. Additionally, model 3 (with node covariates) is preferred over model 2 based on the AIC, but model 2 seems to be better according to the BIC. Based on the AIC/BIC, one thus has conflicting information regarding which model to select for this network, while knowing little about the specific strengths and weaknesses of the competing models. HOPE provides an alternate basis for model selection, which does not depend on the assumptions underlying the AIC or BIC, and allows models to be compared based on their predictive performance. We note that the AIC favors the most complicated model (i.e., model 5), but model 4 has the smallest BIC of all of the candidate models. Such disagreement is warranted: the AIC, by design, attempts to choose a model at least as large as the true model, while under the BIC the probability of selecting a fixed, most parsimonious, and true model tends to one as the sample size tends to infinity (e.g., nishii1984asymptotic; haughton1988choice; haughton1989size).
We implement HOPE on each of the models specified above, with a fixed sample size for the conditional simulations. The calculated metrics under each model specification and held-out strategy are provided in Table 3. When random samples of edges are held out, we set k to the same number of edge variables as under the node held-out strategy, to make the metrics comparable. Throughout, we do not attempt to interpret the magnitudes of these values but focus on the relative differences and rankings between models. Focusing on the dyad-level metrics, the substantial differences between model 1 and model 2 suggest that accounting for the dependencies between edge variables is crucial, and the importance of incorporating nodal covariate information is indicated by the noticeable gap between the dyad-level metrics of model 1 and model 3. Model 4, which contains both nodal covariates and dyadic dependence terms, is associated with the highest predictive accuracy for nulls (Null ACC), the highest overall predictive accuracy (Overall ACC), and the smallest total squared loss (TSL) for all of the held-out strategies. Although model 5 produces the highest predictive accuracy for edges (Edge ACC), it does not perform as well as model 4 on the other dyad-level metrics. Overall, this presents evidence in favor of model 4, which is more parsimonious than model 5 yet still performs better on all of the other dyad-level metrics.
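The dyad-level metrics can be computed from the held-out edge states and the tie probabilities estimated from the conditional simulations. The following Python sketch assumes that Edge ACC and Null ACC are the classification accuracies among true ties and true non-ties respectively (with imputed probabilities thresholded at 0.5) and that TSL is the summed squared loss; the function name and the illustrative inputs are ours, not the paper's code.

```python
def dyad_metrics(y_true, p_hat, threshold=0.5):
    """Dyad-level summaries from held-out states y_true (0/1) and
    model-based tie probabilities p_hat (e.g., the fraction of
    conditional simulations in which each held-out dyad is tied)."""
    pred = [int(p >= threshold) for p in p_hat]
    edge_idx = [i for i, t in enumerate(y_true) if t == 1]
    null_idx = [i for i, t in enumerate(y_true) if t == 0]
    return {
        "edge_acc": sum(pred[i] == 1 for i in edge_idx) / len(edge_idx),
        "null_acc": sum(pred[i] == 0 for i in null_idx) / len(null_idx),
        "overall_acc": sum(q == t for q, t in zip(pred, y_true)) / len(y_true),
        "tsl": sum((t - p) ** 2 for t, p in zip(y_true, p_hat)),
    }

# Toy example: two true ties and three true non-ties
m = dyad_metrics(y_true=[1, 1, 0, 0, 0], p_hat=[0.9, 0.3, 0.1, 0.6, 0.2])
print(m["edge_acc"], m["overall_acc"])  # -> 0.5 0.6
```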
If the primary objective of the proposed model is to explain phenomena that depend on variations in nodes’ positions/roles, then node-level and graph-level metrics are better suited. The models are quite comparable in recapitulating nodes’ degree and eigenvector centralities as well as the network’s degree centralization. However, the more complex models (models 4 and 5) better predict nodes’ betweenness centrality and the network’s betweenness centralization index. The rankings of the models are largely consistent across the different metrics and held-out strategies. Taken together, HOPE suggests that model 4 is the preferred model (for most types of analyses), since it performs no worse than model 5 while having the clear advantage of being more parsimonious.
It is worth mentioning that particular held-out strategies may be preferred depending on both modeling objectives and choices of target metric. Under leave-1-out, where each edge variable is held out and assessed separately, the marginal probability of each edge variable can be calculated exactly via change scores (snijders2006new), allowing simple and accurate estimates of dyad-level metrics. However, because many node-level and graph-level metrics vary little under single-edge changes, leave-1-out tends not to provide sufficient information on predictive performance with respect to these measures to discriminate between similar models. By contrast, omitting larger numbers of edge states (via either the node held-out or leave-k-out strategies) poses a more difficult prediction problem, providing greater discriminatory power for comparing models’ abilities to preserve higher-order structural features. Similarly, the information provided by the leave-k-out and node held-out strategies differs due to the concentration of held-out edge states on specific nodes in the latter case (making node held-out well suited for, e.g., comparing models that are intended to capture vertex-level structure). Unsurprisingly, the calculated dyad-level metrics under leave-k-out are found to better approximate those of leave-1-out, as the missing data are more evenly distributed. Overall, leave-k-out appears to be the best default choice for HOPE for general purposes, while the other held-out strategies are also useful in certain scenarios.
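The exactness of leave-1-out comes from the fact that the full conditional of a single edge variable in an ERGM reduces to a logistic function of the change scores. A minimal Python sketch (the coefficient values and change-score entries below are hypothetical, chosen only for illustration):

```python
from math import exp

def tie_probability(theta, change_score):
    """Exact conditional probability that one edge variable equals 1,
    given the rest of the network: logistic(theta . delta), where delta
    is the change in the sufficient statistics from toggling the dyad
    from 0 to 1 (the 'change score')."""
    lp = sum(t * d for t, d in zip(theta, change_score))
    return 1.0 / (1.0 + exp(-lp))

# Hypothetical fit: edges coefficient -2.0, GWESP coefficient 0.8;
# toggling this dyad adds one edge and raises GWESP by 1.5
print(round(tie_probability([-2.0, 0.8], [1.0, 1.5]), 3))  # -> 0.31
```

Because this probability is available in closed form given the observed remainder of the network, no conditional simulation is needed for leave-1-out dyad-level metrics.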
Dyad-level  Node-level  Graph-level  

Model  AIC  BIC  Type  Edge ACC  Null ACC  Overall ACC  TSL  Degree  Betweenness  Eigen  RMSE, Betweenness Cen.  RMSE, Degree Cen. 
1  600.8  605.2  Leave-1-out  0.192  0.817  0.703  92.55  0.999  0.996  0.999  0.005  0.005 
2  523.0  531.9  Leave-1-out  0.388  0.867  0.779  69.067  0.999  0.998  0.999  0.004  0.004 
3  513.8  540.5  Leave-1-out  0.306  0.845  0.747  80.701  0.999  0.997  0.999  0.005  0.004 
4  471.8  502.9  Leave-1-out  0.43  0.875  0.794  66.334  0.999  0.998  0.999  0.004  0.004 
5  470.0  510.0  Leave-1-out  0.434  0.873  0.793  66.54  0.999  0.998  0.999  0.004  0.004 
1  600.8  605.2  Leave-k-out  0.18  0.817  0.701  94.747  0.963  0.871  0.962  0.024  0.026 
2  523.0  531.9  Leave-k-out  0.37  0.863  0.773  71.45  0.973  0.932  0.969  0.02  0.023 
3  513.8  540.5  Leave-k-out  0.307  0.846  0.748  80.951  0.968  0.894  0.966  0.021  0.023 
4  471.8  502.9  Leave-k-out  0.419  0.872  0.789  67.412  0.975  0.935  0.971  0.02  0.022 
5  470.0  510.0  Leave-k-out  0.422  0.871  0.789  67.842  0.975  0.936  0.971  0.02  0.022 
1  600.8  605.2  Node  0.179  0.816  0.7  94.992  0.943  0.874  0.938  0.021  0.021 
2  523.0  531.9  Node  0.215  0.838  0.724  88.786  0.931  0.915  0.921  0.018  0.021 
3  513.8  540.5  Node  0.3  0.841  0.742  83.656  0.948  0.885  0.947  0.019  0.02 
4  471.8  502.9  Node  0.337  0.863  0.767  76.428  0.945  0.92  0.94  0.018  0.02 
5  470.0  510.0  Node  0.345  0.86  0.766  77.780  0.945  0.921  0.939  0.018  0.02 

The original TSL is rescaled for the purpose of comparison, as each edge variable is held out twice over the course of the node held-out procedure.
4.2 Teenage Friendship Network
The teenage friendship network contains undirected friendship ties between young women; following bouranis2018bayesian, we use the intersection of reported ties and symmetrize the adjacency matrix. This network is a subset of the Teenage Friends and Lifestyle Study data set (michell2000smoke), which contains three years of friendships from 1995 to 1997. We use the data from the first year along with two nodal attributes: smoke, which takes the values 1 (non-smoker), 2 (occasional smoker), and 3 (regular smoker); and drugs for cannabis use, which takes the values 1 (non-drug user), 2 (tried once), 3 (occasional use), and 4 (regular use). A prior analysis of these data (bouranis2018bayesian) dichotomized drugs such that 1 indicates individuals who never used drugs or tried them once, and 2 indicates occasional or regular drug use. We use only the dichotomized version of the variable, which we refer to as drugs_binary. Figure 2 depicts the friendship network and individuals’ frequencies of smoking and drug use. As shown, most friendships are homophilous with respect to smoking and drug use, with one individual bridging two distinct groups of the largest component.
Again, we evaluate five models that incrementally build upon one another, similar to the models considered for the lawyers’ collaboration network. Model 5 includes a differential homophily term for smoking that is absent from models 3 and 4. The decay parameters of the GWESP and GWDEG terms are fixed at log(2) and 0.8, respectively, following the models used in bouranis2018bayesian.

Model 1: edges

Model 2: edges + gwesp(log(2), fixed=T) + gwdegree(0.8, fixed=T)

Model 3: edges + nodematch("drugs_binary", diff=T)

Model 4: edges + gwesp(log(2), fixed=T) + gwdegree(0.8, fixed=T) + nodematch("drugs_binary", diff=T)

Model 5: edges + gwesp(log(2), fixed=T) + gwdegree(0.8, fixed=T) + nodematch("drugs_binary", diff=T) + nodematch("smoke", diff=T)
Model 1  Model 2  Model 3  Model 4  Model 5  
Edges  
Homophily drug (Never/Tried Once)  
Homophily drug (Occasional/Regular Use)  
GWESP, decay = log(2)  
GWDEG, decay = 0.8  
Homophily smoke (Nonsmoker)  
Homophily smoke (Occasional Smoker)  
Homophily smoke (Regular Smoker)  
AIC  560.8  453.6  535.1  438.7  443.7 
BIC  565.9  469.0  550.4  464.2  484.6 
Table 4 provides a summary of the estimated parameters along with the nominal AIC/BIC when each model is fit to the fully observed data. Based on the models fit to the full network, model 4 is preferred according to both the nominal AIC and BIC. For each model specified above, we implement HOPE with a fixed sample size for the conditional simulations. Table 5 displays the calculated metrics for each model under the different held-out strategies, along with the AIC and BIC when fit to the original network. When random samples of edges are held out, we set k to the same number of edge variables as under the node held-out strategy, to make the metrics comparable. In contrast to the lawyer network, including the dyadic dependence terms improves the fit of the models to the friendship network considerably more than including nodal covariates does, according to the AIC/BIC.
The dyad-level metrics unanimously indicate that model 4 is as good as, if not better than, all other models. It is worth noting that model 2 is itself a compelling model, as the gaps between the dyad-level metrics for model 2 and model 4 are minimal, suggesting that dyadic dependence terms play a more central role in explaining the formation of this friendship network. Similar to the collaboration network, the most complex model (model 5) does not lead to better predictive performance at the dyad level.
Larger variations in the node-level and graph-level metrics are present in this case. This is partly due to the sparsity of the friendship network, which makes the misplacement of a single edge have a more substantial effect on structural features. Such variation can be more informative for the purpose of model comparison. We find that all of the models, including the dyadic independence models (models 1 and 3), are fairly effective in reproducing nodes’ degree and eigenvector centralities. However, the dyadic independence models perform less well at recovering measures such as betweenness, compared to the models with terms that account for dyadic dependence. This may be because the combination of sparsity and local clustering that characterizes the network (and to which betweenness is sensitive) cannot be easily captured without GWESP.
The differences between the results for this network and the Lazega network are revealing. Note that, relative to the baseline homogeneous Bernoulli graph, the improvement produced by the covariate effects alone is not as substantial as that produced by the dependence terms. In fact, the predictive performance of model 2 is almost comparable to those of the models with both covariates and dyadic dependence terms under leave-k-out sampling. This suggests a dominant role for endogenous network effects in the formation of the teenage friendship network, as opposed to the strong role of covariates in structuring collaboration among lawyers. In addition, the substantially different performance of model 2 under the node held-out and leave-k-out strategies further suggests the importance of dyadic dependence in network structure, as local structure around each vertex appears to be largely emergent and hence difficult to predict without access to other edge variables on the same vertex. By contrast, if tie formation were dominated by observed covariate effects, we would see much less differentiation between leave-k-out and node held-out performance. Such performance comparisons thus provide a useful tool for inferring the relative importance of different mechanisms in determining network structure.
Dyad-level  Node-level  Graph-level  

Model  AIC  BIC  Type  Edge ACC  Null ACC  Overall ACC  TSL  Degree  Betweenness  Eigen  RMSE, Betweenness Cen.  RMSE, Degree Cen. 
1  560.8  565.9  Leave-1-out  0.072  0.930  0.878  69.30  0.999  0.986  0.997  0.011  0.002 
2  453.6  469.0  Leave-1-out  0.480  0.965  0.936  41.064  0.999  0.998  0.998  0.004  0.001 
3  535.1  550.4  Leave-1-out  0.088  0.948  0.896  66.477  0.999  0.989  0.997  0.010  0.002 
4  438.7  464.2  Leave-1-out  0.483  0.968  0.938  39.982  0.999  0.998  0.998  0.004  0.001 
5  443.7  484.6  Leave-1-out  0.480  0.966  0.937  40.684  0.999  0.998  0.998  0.004  0.001 
1  560.8  565.9  Leave-k-out  0.062  0.939  0.886  69.528  0.932  0.355  0.895  0.064  0.013 
2  453.6  469.0  Leave-k-out  0.457  0.962  0.932  43.365  0.958  0.892  0.906  0.027  0.009 
3  535.1  550.4  Leave-k-out  0.086  0.941  0.889  68.007  0.934  0.35  0.896  0.065  0.012 
4  438.7  464.2  Leave-k-out  0.463  0.964  0.933  42.664  0.959  0.884  0.909  0.028  0.009 
5  443.7  484.6  Leave-k-out  0.463  0.963  0.933  42.897  0.959  0.884  0.911  0.027  0.009 
1  560.8  565.9  Node  0.058  0.94  0.886  140.246  0.927  0.477  0.857  0.056  0.01 
2  453.6  469.0  Node  0.076  0.94  0.887  136.863  0.923  0.864  0.766  0.035  0.011 
3  535.1  550.4  Node  0.081  0.941  0.889  137.486  0.927  0.483  0.856  0.055  0.01 
4  438.7  464.2  Node  0.101  0.943  0.892  132.847  0.925  0.859  0.779  0.035  0.011 
5  443.7  484.6  Node  0.097  0.943  0.892  135.706  0.922  0.859  0.778  0.035  0.011 

The original TSL is rescaled for the purpose of comparison, as each edge variable is held out twice over the course of the node held-out procedure.
5 Discussion and Conclusion
In this paper, we employ a technique analogous to cross-validation for models of dependent, relational data: Held-Out Predictive Evaluation (HOPE). Researchers can use HOPE to compare different model specifications of ERGMs and then select the best model for their purpose. HOPE is flexible with respect to how data are held out, and allows models to be evaluated based on an array of predictive metrics. Those we have delineated are by no means exhaustive, and we hope that researchers will consider other metrics in the future based on their specific research questions, such as the goodness-of-fit metrics related to a graph’s spectrum of shore2015spectral if, e.g., macroscopic community structure is of substantial interest. Taken together, HOPE complements existing model adequacy checks while addressing a current limitation for model selection. Unlike other model selection tools (e.g., AIC or the Bayes factor), HOPE is always straightforward to implement, and can be easily extended to ERGMs in which edges are counts, ranks, or continuous quantities (see, e.g., krivitsky2012exponential; krivitsky2017exponential; desmarais2012statistical, respectively) rather than binary relations. Researchers can also use HOPE to diagnose different models’ predictive abilities. This allows model selection to proceed based on tasks that are more appropriate to the analyst’s modeling objectives.
The application of HOPE to two networks with vastly different structures indicates the importance of properly choosing the held-out strategy and the metrics that will be used to select the model. Based on the empirical findings in this paper, the leave-1-out strategy has the potential to yield the most reliable estimates of the dyad-level metrics, but can generate node-level and graph-level metrics that show too little variation across models, as some structural measures are fairly robust and hence remain unchanged given the misplacement of a single edge variable. Holding out each node’s edge variables can alter the network structure to a large extent, and hence facilitates a more fruitful comparison of node-level and graph-level metrics, but the information it delivers regarding the predictive accuracy of edges can be drastically different from that of the other held-out strategies for networks that are sparse or contain imbalanced covariates. The leave-k-out strategy provides a better general-purpose strategy, since more held-out edge variables are present in each batch than under leave-1-out, and they are more evenly distributed rather than being concentrated on the dyads containing one of the vertices, which can make the assessments dependent on nodes’ idiosyncrasies. As a general starting point, we suggest setting k equal to n-1 (or, in the directed case, 2(n-1)), to facilitate comparison with node held-out performance.
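The two fold-construction schemes can be sketched as follows (a Python illustration; the function names are ours). With k = n-1, each leave-k-out batch holds out as many edge variables as a node fold, but spreads them across the whole network rather than concentrating them on one vertex:

```python
import random
from itertools import combinations

def leave_k_out_folds(n, k, seed=0):
    """Shuffle all n(n-1)/2 undirected dyads and slice them into
    batches of (at most) k held-out edge variables each."""
    dyads = list(combinations(range(n), 2))
    random.Random(seed).shuffle(dyads)
    return [dyads[i:i + k] for i in range(0, len(dyads), k)]

def node_heldout_folds(n):
    """One fold per node: the n-1 dyads incident to that node."""
    return [[tuple(sorted((v, u))) for u in range(n) if u != v]
            for v in range(n)]

# n = 36 (the lawyer network): 630 dyads in 18 leave-k-out batches of 35,
# versus 36 node folds of 35 dyads each (every dyad appears in two).
print(len(leave_k_out_folds(36, 35)), len(node_heldout_folds(36)))  # -> 18 36
```

Note that under the node scheme each edge variable appears in exactly two folds (one per endpoint), which is why the TSL must be rescaled for comparability, as noted beneath Tables 3 and 5.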
As noted, HOPE is distinct from other model selection and adequacy techniques. Simulation-based goodness-of-fit approaches (hunter2008goodness) are widely used for adequacy checking, but are not well suited to model selection. The method developed by koskinen2018outliers can successfully identify outliers that influence coefficients’ magnitude and significance, but it focuses on quantifying how outliers affect parameter estimates rather than on a model’s predictive performance. The AIC and BIC scores are intended for model selection, but are difficult or impossible to calculate accurately in the case of dependence models (with only nominal data degrees of freedom currently known) and draw upon asymptotic arguments of dubious validity in typical ERGM applications. Bayes factors have better theoretical properties, but are even more difficult to compute. All of these methods have assets, and we do not argue against their use. However, HOPE offers distinct gains in feasibility and interpretability, and it has the advantage of focusing attention on models’ out-of-sample predictive abilities. As such, it moves away from coefficient tests by illuminating which effects are predictive rather than merely significant. The models for the collaboration network show that several of the heterogeneous homophily terms in model 5 are significant, and when fit to the full network, this model has a better AIC value than model 4, which includes only uniform homophily terms. Yet, the predictive metrics illuminate the fact that these effects are ephemeral and, more importantly, do not improve the model’s predictive accuracy relative to model 4. HOPE’s ability to illuminate such differences speaks to the ongoing controversy about significance and p-values (goodman2019getting; kuffner2019p; mcshane2019abandon). In our view, terms in a model should typically improve the ability of the model to predict the phenomena of interest rather than simply being statistically significant; by turn, significant effects that add nothing to prediction should be viewed with suspicion. In sum, HOPE is complementary to extant methods of model selection and model adequacy checking. We believe that HOPE will prove to be a useful tool for researchers selecting between multiple model specifications of ERGMs or other statistical network models with complex dependencies.
Acknowledgments
This work was supported in part by NSF awards IIS1526736 and DMS1361425.
Appendix A
Here, we list the definitions of the sufficient statistics used in Section 4. All of the terms are given for an undirected network, since that is the case for all of the data sets used; see morris2008specification for a more comprehensive list of sufficient statistics for ERGMs.
Edge statistic (Edges)

$g(y) = \sum_{i<j} y_{ij}$

Geometrically weighted edgewise shared partner (GWESP) statistic

$g(y; \tau) = e^{\tau} \sum_{k=1}^{n-2} \left\{1 - (1 - e^{-\tau})^{k}\right\} EP_k(y),$

where $EP_k(y)$ is the number of connected pairs that have exactly $k$ common neighbors; this is a measure of local clustering in a network. The decay parameter $\tau$ controls the relative contribution of higher shared-partner counts to the GWESP statistic.
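A direct computation of GWESP from this definition can be sketched in Python (for model fitting one would use the ergm package's implementation; the function name is ours):

```python
from math import exp
from itertools import combinations  # not required, kept for clarity

def gwesp(edges, n, tau):
    """GWESP: e^tau * sum_k [1 - (1 - e^-tau)^k] * EP_k(y), where EP_k
    counts tied ('edgewise') pairs with exactly k common neighbors."""
    nbrs = {v: set() for v in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    total = 0.0
    for i, j in edges:                     # only tied pairs contribute
        k = len(nbrs[i] & nbrs[j])         # shared partners of the edge
        if k > 0:
            total += 1.0 - (1.0 - exp(-tau)) ** k
    return exp(tau) * total

# In a triangle every edge has exactly one shared partner (EP_1 = 3),
# and the weights cancel, so GWESP is 3 for any decay value.
print(gwesp([(0, 1), (1, 2), (0, 2)], n=3, tau=0.75))
```

A path (no shared partners on any edge) yields a GWESP of zero, which is why the statistic rewards transitive closure rather than edge count alone.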
Geometrically weighted degree statistic (GWDEG)

$g(y; \tau) = e^{\tau} \sum_{k=1}^{n-1} \left\{1 - (1 - e^{-\tau})^{k}\right\} D_k(y),$

where $D_k(y)$ is the number of nodes that have exactly $k$ neighbors. The decay parameter $\tau$ controls the relative contribution of higher degrees to the GWDEG statistic.
Uniform homophily effect (nodematch)

A categorical covariate $x$ is defined for each node in the graph. The term counts the total number of edges between nodes with the same value of $x$:

$g(y, x) = \sum_{i<j} y_{ij}\,\mathbb{1}\{x_i = x_j\}$

Differential homophily effect (nodematch, diff=T)

A categorical covariate $x$ is defined for each node in the graph and can take $p$ distinct values; a separate statistic is generated for each possible value $v$ of $x$:

$g_v(y, x) = \sum_{i<j} y_{ij}\,\mathbb{1}\{x_i = x_j = v\}$

Edge-level covariate effect (edgecov)

A covariate $w$ is measured for each pair of nodes in the graph:

$g(y, w) = \sum_{i<j} y_{ij}\, w_{ij}$

Node covariate effect (nodecov)

A numerical covariate $x$ is measured for every node in the graph:

$g(y, x) = \sum_{i<j} y_{ij}\,(x_i + x_j)$