Selection of Exponential-Family Random Graph Models via Held-Out Predictive Evaluation (HOPE)

Statistical models for networks with complex dependencies pose particular challenges for model selection and evaluation. In particular, many well-established statistical tools for selecting between models assume conditional independence of observations and/or conventional asymptotics, and their theoretical foundations are not always applicable in a network modeling context. While simulation-based approaches to model adequacy assessment are now widely used, there remains a need for procedures that quantify a model's performance in a manner suitable for selecting among competing models. Here, we propose to address this issue by developing a predictive evaluation strategy for exponential family random graph models that is analogous to cross-validation. Our approach builds on the held-out predictive evaluation (HOPE) scheme introduced by Wang et al. (2016) to assess imputation performance. We systematically hold out parts of the observed network to: evaluate how well the model is able to predict the held-out data; identify where the model performs poorly based on which data are held-out, indicating e.g. potential weaknesses; and calculate general summaries of predictive performance that can be used for model selection. As such, HOPE can assist researchers in improving models by indicating where a model performs poorly, and by quantitatively comparing predictive performance across competing models. The proposed method is applied to model selection problem of two well-known data sets, and the results are compared to those obtained via nominal AIC and BIC scores.



There are no comments yet.


page 1

page 2

page 3

page 4


EPP: interpretable score of model predictive power

The most important part of model selection and hyperparameter tuning is ...

On the consistency between model selection and link prediction in networks

A principled approach to understand network structures is to formulate g...

Leave Zero Out: Towards a No-Cross-Validation Approach for Model Selection

As the main workhorse for model selection, Cross Validation (CV) has ach...

Rescaling and other forms of unsupervised preprocessing introduce bias into cross-validation

Cross-validation of predictive models is the de-facto standard for model...

Latent space projection predictive inference

Given a reference model that includes all the available variables, proje...

A new approach in model selection for ordinal target variables

This paper introduces a novel approach to assess model performance for p...

A multiple testing framework for diagnostic accuracy studies with co-primary endpoints

Major advances have been made regarding the utilization of artificial in...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The exponential family random graph modeling (ERGM) framework (holland1981exponential; frank1986markov; snijders2006new; hunter2006inference) (known in older work by the term (wasserman1996logit)) has emerged as an important approach to the statistical analysis of social network data, providing a highly general way of specifying distributions on graphs and allowing the complex dependence structure of edges in a network to be specified in terms of local structural properties (robins2007recent). A wide variety of features have been proposed as potential instantiations of the different types of driving forces governing the formation of social networks (morris2008specification), with the potential to accommodate an increasingly rich range of network types. At the same time, however, poor specifications can lead to unrealistic model behavior (handcock2003assessing; schweinberger2011instability), and in practice considerable domain expertise can be required to select terms that implement the correct dependence structure for a specific setting (lusher2013exponential).

Choosing among competing model specifications can be viewed as a model selection problem, making information criteria such as Akaike Information Criterion (AIC) (akaike1973information) and Bayesian Information Criterion (BIC)(schwarz1978estimating) natural choices for adjudication. However, these criteria rely on a number of theoretical assumptions that are frequently problematic in a network modeling context. First, edge variables in typical network models are non-independent, making it difficult to determine the effective sample size needed for size-corrected AIC and BIC calculations (hunter2008goodness); indeed, at this time the theoretical justification for these criteria is unclear in the case of models for single networks with dyadic dependence (though see kolaczyk2015question; schweinberger2017exponential, for some possible directions)

. Second, likelihood calculations for complex ERGMs rely on stochastic approximations (e.g., bridge sampling) that become expensive for large networks and where high precision is needed. This is not a barrier to parameter estimation (which typically relies on quantities such as likelihood ratios between identically specified models with similar parameters that can be more precisely computed), but makes fine distinctions between the likelihoods of similarly performing but differently specified models difficult.

The theoretical (if not computational) challenges noted above can be avoided in a Bayesian context by performing model selection by Bayes factors

(raftery1995bayesian). ERGM applications to date require fairly expensive, high-quality posterior simulation using methods such as the Reversible Jump Exchange Algorithm (caimo2013bayesian), a tailored version of the conventional Reversible Jump MCMC algorithm (green1995reversible) combined with the exchange algorithm (Murray:2006:MDD:3020419.3020463) by caimo2011bayesian to deal with the double intractability of the posterior density, as the ERGM likelihood cannot be computed analytically in general. This approach has the advantage of being theoretically principled in the setting of fixed sample size, but the need to obtain a high-quality approximation of the Bayes factor can be computationally demanding. Promising directions in this area include efforts like those of bouranis2018bayesian to reduce the computational cost by adjusting the posterior samples yielded by an analytically tractable approximation of the ERGM likelihood, e.g. the pseudolikelihood. However, its magnitude adjustment step requires approximating the true ERGM likelihood evaluated at MLE, which raises the same issue encountered in the calculation of AIC/BIC; thus, an efficient and easily used method for obtaining Bayes factors for general ERGMs remains elusive. Moreover, the Bayes factor itself is not always an ideal tool for model selection. The Bayes factor (and its multi-model generalizations) provides an answer to the question, “which of a set of proposed models is more likely to be the true data generating process?” assuming that the models being evaluated are a priori

equally probable and that one is correct. While the equal probability assumption (i.e., uniform model priors) can be adjusted, the hidden assumption that one proposed model is correct – or, at least, that among incorrect models the model “more likely” to be correct is also “better” – is not entirely innocent

(Bernardo94; spiegelhalter2002bayesian). When no available model is correct, the model preferred by the Bayes factor may or may not have other desirable properties (e.g., better predictive performance for some task of interest), and indeed the Bayes factor may heavily weight aspects of model performance that are not in practice those most valued by the analyst.

Relatedly, the Bayes factor can be sensitive to model features, such as the tail weight of the parameter priors, that are often chosen on semi-arbitrary grounds (and that in practice often have little impact on estimation). This creates the risk that model selection will be unduly influenced by choices of the analyst that are difficult to constrain and that are otherwise of minimal substantive importance. Moreover, the approach is only applicable to Bayesian inference, which is not at present widely used for ERGMs due to computational challanges. Thus, while the Bayes factor can be an important tool in the network analyst’s arsenal, it also poses considerable difficulty in practice.

As an attempt to compensate for the difficulties of likelihood-based criteria, goodreau2007advances; hunter2008goodness introduced graph-level “goodness-of-fit” (GOF) plots to assess the fit of ERGMs, in which several graph-level statistics (e.g., degree distributions, edgewise shared partner distributions) of observed networks are compared against those of simulated networks from the estimated model. The underlying idea is that draws from a fitted ERGM should have structural properties similar to the observed one, and, in particular, that network properties not explicitly used to fit the model (“out of model” properties) should be reproduced by those that were used (“in model” properties). The properties used to assess a model are usually chosen on substantive grounds, though some efforts have been made to suggest relatively “generic” statistics of broad utility (e.g. hunter2008goodness; wang2013exponential1; wang2013exponential2; shore2015spectral). While this approach has proven useful in practice, it is properly a model adequacy checking strategy rather than a model selection strategy: it provides ways to identify performance deficiencies in a chosen model, but it does not provide a general rubric for choosing among competing models. Likewise, the GOF approach is not designed to provide strong information regarding the predictive performance of a fitted model. Rather, it only answers the question of how well networks drawn from a model fit to a specific data set reproduce other features of that data set. This is useful for detecting when a fitted model is incapable

of producing realistic behavior, but it does not establish that the model will predict well (either in the context of extrapolation to new structures or interpolation of held-out data).

While it could perhaps be argued that predictive performance is not always a major concern for ERGMs, lack of predictive power at the very least suggests limitations of a model that should be borne in mind when using it. Moreover, predictive performance is clearly a consideration in many applications. For example, studies of international conflict (hoff2004modeling; maoz2006structural) or bill cosponsorship (fowler2006connecting; cranmer2011inferential)

are concerned with consequential relations for which predictions are of significant interest. These could include e.g. the ability to forecast future network states from past network states (or merely from covariates), or conditional prediction of edge states given covariates and/or information on other edges. In the latter case, performance involving specific edge states (as opposed e.g. to unlabeled properties such as the degree variance) is of obvious importance: it is important to know whether a predicted conflict is between e.g. China and United States versus Bulgaria and Croatia. A model that successfully reproduced the degree distribution for a given year’s conflict network might not perform well at predicting the degree distribution for the next year’s network, nor at predicting who will be in a conflict with whom. Such a model might be judged satisfactory via conventional adequacy checks, but its value for understanding global conflict would be questionable at best.

The above suggest more explicit predictive metrics as potentially useful tools for model selection (as well as model assessment). In many fields, cross-validation (CV) techniques have been fruitfully used in this role, allowing one to assess how well the predictions from a model generalize to a new data set; flexible, easily understood, and able to be linked directly with performance outcomes of substantive interest, CV methods are well-adapted to model selection (see arlot2010survey, for a recent review). Typically, CV divides the data set into a training set (to which the model is fit) and a test set

(against which the fitted model’s predictions are evaluated under pre-specified loss functions), selecting the model with the smallest estimated loss. Many variants of this procedure exist (e.g., leave-one-out CV,

-fold CV, bootstrap CV, etc.), but all share the common feature of assessing predictive performance on a data subset that was held out during parameter estimation. Classical CV for regression models with independent and identically distributed (i.i.d.) data was proposed as early as geisser1975predictive, and CV procedures have been tailored for the purpose of performance evaluation in latent variable modeling of relational data (e.g., hoff2008modeling; dabbs2016comparison; li2016network; chen2018network), where edge variables are conditionally independent given latent variables and hence can be straightforwardly held out. Likewise, there is work on applying CV to ERGMs where multiple networks from the same population model are available, and entire networks can be held out (stewart2019multilevel). Such work suggests considerable potential utility in applying CV to the more typical setting of general ERGMs on single graphs realizations.

A difficulty with standard CV techniques in a general ERGM setting is the direct dependence of edge variables, which makes it impossible to simply omit edge variables without changing the underlying probability model.111This is related to the inconsistency of dependence models under naive subsampling, when the presence of unmeasured vertices is not accounted for (shalizi2013consistency); procedures that do allow consistent estimation are discussed by schweinberger2017exponential. Intuitively, the presence of an edge variable in such a model is itself informative, and this information must be retained for predictions to be meaningful. This same issue arises in the context of ERGM estimation from networks with missing edge data, where the presence of edge variables must still be accounted for even when their states are unknown. handcock2010modeling introduced an estimation scheme for handling such data in the case of ignorable missingness, which resolves this difficulty by integrating over the unknown states of the missing edge variables (and thereby preserving the impact of their interactions with the variables whose states are observed). The missing data case suggests the key to obtaining CV-like procedures for ERGMs: while one cannot meaningfully hold out edge variables, one can hold out the states (i.e., whether a given edge variable contains an edge or a null) of edge variables, retaining their presence but treating them as missing data. Building on this intuition, wang2016multiple proposed a held-out evaluation scheme for evaluating model-based imputation that was analogous to CV, which they dubbed Held-Out Predictive Evaluation (HOPE). Unlike CV, the edge variables in the validation set under HOPE are only marked as missing (i.e. NA) in the model training phase, instead of being completely eliminated. Thus, the trained model accounts for the presence of the edge variables, but it is not given information on their states. Testing is then performed by conditional prediction of the held-out edge states from the fitted model conditional on the edge variables that are not held-out. Since the core techniques needed to perform this procedure (estimation with missing edge data and conditional procedure) are supported in standard ERGM software, it can be used without the need for custom software implementations or other special considerations.

While HOPE was originally introduced in the context of imputation assessment, it is a general CV-analogue and the idea of holding out a portion of data as missing can be used for a wide range of evaluation tasks. For instance, koskinen2018outliers proposed a model-based approach for the identification of influential nodes in a network by assessing the sensitivity of estimated parameters when all edge variables associated with the corresponding node is held out. In this paper, we introduce the use of HOPE to perform model selection for ERGMs, with an emphasis on simple metrics and procedures that are applicable to a wide range of network data. Using HOPE, researchers can gain information on how well the model is able to predict held-out portions of the data from other edge observations; where the model performs poorly based on held-out data, which can point to weaknesses in parameterization; and how one model’s predictive performance compares to others. Because such predictive assessments automatically correct for overfitting (which by definition improves in-sample performance while harming out-of-sample performance), they can be quantitatively compared across specifications in a way that other metrics often cannot. Taken together, these assessments can assist researchers in improving models and facilitate the comparison of fit across multiple models, thus facilitating model selection.

The remainder of the paper is structured as follows. In section 2, we review the ERGM framework with special emphasis on the estimation of ERGM in the presence of missing data, accompanied by some notation necessary for the purpose of the proposed methodology. In section 3, we elaborate on the idea of HOPE and introduce two general strategies for holding-out data along with a set of metrics that measure a model’s performance based on several structural features and at different granularities. Section 4 provides two data examples that demonstrate the effectiveness of HOPE and compare it to existing approaches. Finally, we conclude with a discussion and potential directions for future work.

2 ERGM estimation and simulation process

Consider a random graph of fixed order , where is the node (vertex) set and is the stochastic adjacency matrix of , such that is the variable representing the state of an edge variable from node to node . For valued networks, edge variable represents the edges’ strengths or frequencies, e.g. trade flows between countries or numbers of interactions between individuals. In most social networks, self-edges are absent or not of substantive interest, and therefore we assume in this paper (though ERGMs can still be fit to networks with loops, and HOPE is still applicable). Lower case will represent the realized value of the edge variable. Letting be the set of all possible values of such matrices, an ERGM form for the probability mass function of is given by




The user-defined statistic

can incorporate any finite network statistics ranging from exogenous covariates (e.g., homophily based on nodes’ characteristics) to endogenous network effects (e.g., transitive closure) that are represented as counts of those local features in the observed network. The model parameter vector is

and its mapping to canonical parameters is . For notational convenience, we assume that all covariates are implicitly incorporated into . The reference measure determines the support and the basic shape of corresponding ERGM distribution (see krivitsky2012exponential, for an illustration). While the above can be applied to a wide range of relational data types, our HOPE procedure is generally applicable to any network model for which the estimation of parameters in the presence of missing data is feasible. To avoid notational and related complications caused by valued edges, however, we illustrate the key idea of HOPE using canonical binary ERGMs with counting measures (i.e., and ).

While various approaches can be used to estimate given an observation from , we focus here on the maximum likelihood estimation (MLE, presently the most widely used class of answer). Importantly, the MLE can be obtained when part of the data (i.e., edge states) are ignorably missing, which is a critical component of HOPE. To estimate where is possibly only partially observed, we follow the scheme introduced by handcock2010modeling for constructing the face-value likelihood based solely on the observed part of the data,


where is the observed portion of , and represents all configurations for missing portion of the data that are concordant to . Note (2.3) is a valid likelihood as it correctly describes the probability of the observed data marginalizing across the unobserved portion, and we can estimate the model parameters under relatively general conditions (see handcock2010modeling, for more details).

Maximizing (2.3) with respect to yields the MLE, , from which one can conduct conditional ERGM simulations (handcock2010modeling) to predict (or impute) the states of missing edge variables given the states of observed (i.e., retained) edge variables,


where .

The software package MPNet (wang2014mpnet) and the ergm package (hunter2008ergm) of the statnet (handcock2008statnet) software suite for R (R2018) have implemented simulation-based algorithms for approximating the MLE of under model (2.3). Additionally, both allow simulations from the estimated parameters with or without missing data (the former is used in goodness-of-fit assessments).

In this paper, we implement the HOPE procedure in R using the ergm package. To run conditional simulations from ERGMs, the observed part of the graph can be fixed using the constraint argument in the simulate.ergm function. The procedure is repeated for each set of held-out edge variables in observed data, and the simulated networks are then evaluated based on several criteria from various aspects. This makes HOPE analogous to cross-validation in that we are evaluating how well we can predict data that are not used for model training.

3 Held-out Predictive Evaluation

In this section, we first describe two possible ways that relational data can be held-out from an adjacency matrix. We then elaborate on the implementation procedure for HOPE. Finally, we introduce some generic metrics that can be used to evaluate a model’s performance at different granularities.

3.1 Held-out strategies and general procedure

To perform HOPE for the purpose of model selection, the entire index set of edge variables (, if is directed) is partitioned into subsets, , where

. Intuitively, our procedure operates by holding out the values of edge variables in one subset while fitting to the remainder, using the resulting estimate to predict the values of the held-out data. (Compare with conventional cross-validation, in which the variables themselves - and not merely their values - are held out. While this distinction is immaterial for independence models, it is consequential for typical ERGMs.) While other schemes are also feasible, two natural options for holding out network data are random sampling of edges and removal of all edge values associated with a randomly chosen vertex. For convenience of notation, we assume below that network size is an even number; however, generalization to the odd-numbered case is immediate.

  1. Random edge removal: is randomly divided into non-overlapping, equally-sized batches of size edge variables (if the graph is undirected), edge variables are selected (if the graph is directed), with each batch being held-out in turn. A similar strategy was used by hoff2008modeling for the purpose of selecting the optimal number of dimensions in latent space structure. It is worth mentioning that leave-one-out can be argued as a special case for random sample where , as each edge variable forms a batch by itself and is held out successively.

  2. Node removal: All edge variables involving a particular node are simultaneously held-out. koskinen2018outliers used a similar strategy to identify influential observations (nodes) under the ERGM framework.

The general procedure for HOPE then proceeds as follows. For :

1. Fit an ERGM to partially observed data based on (2.3), which in turn gives the corresponding held-out data MLE, .

2. Simulate graphs from the conditional distribution defined in (2.4) as with their corresponding adjacency matrices . Note, that for any , , because all edge variables in are fixed when simulating from the conditional distribution (2.4).

3. Evaluate the ability of the model to accurately predict the held-out data under any of several error metrics. The choice of which metrics to use will depend on which structural features users deem substantively important for their model.

When the graph is large, a subset of , namely , can be considered to reduce the computational burden, with metrics being calculated only on these subsets. In such cases, both uniformly random and stratified subset selection approaches may be considered; the former is simpler (and always applicable), but the latter may lead to more efficient approximation of the complete-partition solution.

3.2 Error metrics

It is natural to assess the predictive accuracy of a model at three levels: the dyad, the node, and the graph as a whole. Here, we outline a number of broadly useful choices of error metrics at each level of assessment.

3.2.1 Dyad level metrics

Dyad-level metrics focus on how well the model performs in predicting the presence and absence of edges. Our suggested measures are based on two types of edge predictions. First, we may simply consider whether an edge is predicted to be present or absent in any given simulated draw. Comparing the true states of held-out edges in each iteration (present/absent) with their imputed states leads to an aggregate confusion matrix, , a two-by-two table of true versus predicted values over all

. While the confusion matrix is pleasingly direct, it does not discriminate between differences in predicted probabilities and may be information-poor in sparse networks. For that reason, we also employ marginal probability estimates of edge presence, as computed from held-out models. Specifically, for

, we estimate the marginal probability by , which is equivalent to the posterior mean for the binomial probability under a Jeffrey’s prior; the use of a local Bayesian estimate here provides shrinkage in the case of highly sparse predictions, and ensures that predictive summaries remain well-behaved even for poor models. It is worth mentioning that the marginal probability can be calculated exactly via change scores (snijders2006new) under leave-1-out, but we do not pursue this direction as experiments show that is usually a good approximation to the marginal probability when is sufficiently large (e.g. 500).

From the above, we may then assess predictive performance by applying standard accuracy assessments to and/or . In this paper, we employ predictive accuracies for edges (Edge ACC), nulls (Null ACC), and edge variables (Overall ACC) based on , as well as the total squared loss (TSL),

based on . These metrics are widely used for binary data, easily interpreted, and broadly comparable across models. As discussed below, they constitute a natural starting point for model selection.

3.3 Node level metrics

Node level metrics provide information on how well (or poorly) the model can reproduce the roles and positions of the nodes in the observed graph, when each node’s edge variables are successively held out. If a model provides a plausible approximation to the mechanism giving rise to the observed network, then one would anticipate that the discrepancy between the node level metrics of the observed and simulated networks would be minimal. However, this will likely depend on the specific node metric, since some are more sensitive to minor changes in the structure than others (e.g.; degree versus betweenness). Because holding out the state of even one edge variable could potentially alter the roles/positions of all the nodes in a graph, the discrepancy is here calculated for all nodes across all simulations.

Given any node-level centrality measure, , it is natural to assess the discrepancy between the vector of observed node level indices and those of simulated, . We quantify the degree of agreement via the coefficient of reliability,


where is the total sum of squares with respect to node level index measuring the overall variation in the observed values and the mean squared error (MSE) of node level index is given by

. The reliability metric is similar to the coefficient of determination () in multiple regression and the reliability measures widely employed in psychometric contexts (e.g. gulliksen1987theory)

. The coefficient of reliability standardizes the amount of predictive error in a model’s node level index relative to the overall variation in the observed values; it can thus be thought of as assessing the extent to which the out-of-sample centrality scores are sufficiently accurate to discriminate between central and peripheral vertices in the observed graph. We focus on degree, betweenness, and eigenvector centralities since they are commonly used metrics in analyzing social networks’ structures, but other node-level indices

(see, e.g. wasserman1994social) are certainly feasible. In this context, each index may be motivated as follows:

  • Degree centrality quantifies the number of ties that each node has to all other nodes within the network. A node with a higher degree centrality has more ties to the other nodes in the network; accuracy with respect to this index thus indicates that the model is able to distinguish more active vertices from those with fewer relationships.

  • Eigenvector centrality measures the degree of membership of a given node in the largest core/periphery structure in the graph.222

    Except in very rare cases for which the graph adjacency matrix lacks a principal eigenvalue. In such circumstances, eigenvector centrality is a signed indicator of membership in the two largest core/periphery structures (positive versus negative).

    The eigenvector centrality is also the best one-dimensional approximation of the graph structure (in a least-squares sense), and accuracy in reproducing it indicates the extent to which the model is able to recover the broadest structural features of the graph.

  • Betweenness centrality (freeman1978centrality) measures the extent to which a node lies on unique shortest paths between other nodes in the network. A node with a high betweenness centrality acts as a bridge between other parts of the network. Betweenness is highly sensitive to the path structure in the network, and accuracy on this index thus indicates a model that is able to preserve distances among nodes.

Similar to , the statistic has an upper limit of , and it is equal to if and only if the proposed model reproduces networks that are perfectly correlated with the observed networks with respect to nodes’ roles and positions. A smaller value indicates that the model is less able to recapitulate the observed network’s structure. If is negative, then the predicted node-level indices deviate from the observed networks’ values substantively even after adjusting for the variation across the observed values.

It should be noted that, like any other reliability metric, measures relative discrimination and hence is sensitive to the underlying variation in the observed centrality score. Where there is little variation in centrality (and hence little opportunity for discrimination), it may be better to use the raw values than . The choice depends on whether the objective is to measure direct accuracy, versus ability to discriminate between high and low centrality nodes per se; in some cases, it may be worthwhile to examine both metrics.

3.4 Graph level metrics

A third class of selection metrics quantify how well (or poorly) the proposed model reproduces the observed networks’ graph-level structural features. Specifically, we here measure the discrepancies between the observed network and the simulated networks’ betweenness and degree centralization indices (freeman1978centrality). The centralization score is defined by

with being any generic centrality score. can be re-written as , clarifying that the centrality is proportional to the difference between the maximum and mean centrality scores (i.e., the extent to which one vertex is much more central than average). In this paper, all centralization scores are normalized by the corresponding theoretical maximum centralization scores.

For any generic graph level metric , we consider the root mean squared error (RMSE) as the discrepancy measure,


The RMSEs for degree and betweenness centralization then respectively indicate the extent to which the model is able to capture the upper tails of these two distributions. Poor performance on these measures may suggest that the model lacks the ability to identify nodes that deviate substantially from the norm - a possible indicator that additional covariates are required (assuming that such identification is desired).

3.5 Remarks on Metrics and Held-out Strategies

The proposed metrics cover a wide range of predictive outcomes at multiple scales, but they are not exhaustive. Our aim is to provide a suite of simple measures that are of general use. Importantly, HOPE can easily be adapted to assess the quality of models based on alternative structural features or error metrics. In fact, we encourage researchers to carefully consider their modeling objectives when deciding which metrics to use as well as which held-out data strategy to employ. Alternative metrics may provide distinct or even contrasting information, since particular models may be better at some prediction tasks than others. Moreover, different ways of holding out data may yield distinct results, since each both provides different information to the model during the estimation phase and poses a different prediction task during the imputation phase. Comparing performance across tasks thus gives useful insights into a model’s strengths and weaknesses.

3.6 Efficient implementation of HOPE in parallel

It is worth noting that the proposed HOPE procedure is highly parallel. Once the edge variables are partitioned, then each subset can be run independently. Specifically, we define the running time of HOPE as a function of the number of available computing cores , number of folds ,


where ERGM fitting time is dependent upon the model complexity and also the number of held-out edge variables. In fact, the ERGM fitting time dominates the total run time and hence running in parallel can lead to an approximately -fold reduction.

4 Case Studies

In this section, we examine the effectiveness of HOPE as a tool for model selection using two well-known data sets from the social network literature: Lazega’s collaboration network between lawyers (lazega1999multiplexity; lazega2001collegial) and an adolescent friendship network of girls from the “Teenage Friends and Lifestyle Study" data set (michell2000smoke). Table 1

shows that these two networks substantially differ in their densities and degree distributions, as the collaboration network is denser and has a less skewed degree distribution, while the friendship network is more sparse and has a more skewed degree distribution. These differences produce distinct challenges for HOPE as shown below, while also demonstrating the general utility of the method.

Lazega Lawyer Teenage Friends
Network size 36 50
Number of edges 115 74
Directed No No
Network density 0.18 0.06
Mean degree 6.39 2.96
SD degree 4.18 1.83
Skewness degree 0.29 0.65
Number of isolates 2 3
Transitivity 0.39 0.42
Table 1: Descriptive statsitics for Both Networks

4.1 Collaboration Network Between Lawyers

Lazega’s lawyer data contain the collaboration relationships between partners in a New England law firm and have been used in a number of illustrative analyses (e.g., snijders2006new; hunter2006inference; caimo2013bayesian). The data set includes several nodal covariates: the Office that the lawyer works at (out of three, 1=Boston; 2=Hartford; 3=Providence); the Practice of the lawyer (out of two, 1=litigation; 2=corporate); the Gender (1=Male; 2=Female) of the lawyer; and each lawyer’s Seniority within in the firm, which is their rank order of entry into the firm (lower numbers indicate more senior partners, ranging from 1 to 36). The summary statistics in Table 1 indicate that the degree distribution is relatively even and not highly skewed. Additionally, Figure 1 illustrates a high level of homophily based on office and (to a lesser extent) practice.

Figure 1: Lazega’s lawyers collaboration network. Colors indicate the different offices (White = Boston, Black = Hartford, Red = Providence); size indicates gender, men (smaller) and women (larger); shape indicates practice, ligitation (triangle), corporate (square).

We consider five competing model specifications that range from the simplest stochastic model in which all edge variables are hypothesized as i.i.d realizations of a Bernoulli random variable (i.e. homogeneous Bernoulli graph) to models that separately contain dyadic dependent terms and covariate terms. The final two models (Model 4 and Model 5) contain both types of terms, and in particular, Model 5 specifies a more granular relationship between nodes’ covariates. These models are listed below as the terms used in

ergm function (see the Appendix for more details of these network statistics) –

  • Model 1 : edges

  • Model 2 : edges + gwesp(0.75)

  • Model 3 : edges + nodecov(Seniority) + nodecov(Practice) + nodematch("Practice") + nodematch(Gender) + nodematch(Office)

  • Model 4 : edges + gwesp(0.75) + nodecov(Seniority) + nodecov(Practice) + nodematch(Practice) + nodematch(Gender) + nodematch(Office)

  • Model 5 : edges + gwesp(0.75) + nodecov(Seniority) + nodecov(Practice) + nodematch(Practice,diff=T,keep=2) + nodematch(Gender,diff=T) + nodematch(Office,diff=T,keep=c(1,2))

The models above incrementally build upon the preceding models to generate several competing models between which HOPE can adjudicate. Model 2 builds upon model 1 by accounting for the dependence among edge variables with the inclusion of the geometrically edgewise shared partner (GWESP) statistic with a fixed decay parameter , which captures the tendency for local clustering. Similarly, model 3 builds upon model 1 by including nodal covariates that capture the effects of nodes’ attributes on their propensities to form ties overall and with similar types of nodes (though this is still a dyadic independent model). Model 4 contains all of the terms from the previous three models. Model 5 contains all of the terms in model 4 as well as additional terms for differential homophily for three of the nodes’ attributes: Gender, Office, and Practice. This final model moves towards a saturated model by replacing the uniform homophily effects with differential homophily effects. The former generates a network statistic of all nodes that are equal for a given attribute, while the latter generates a separate network statistic for each of the values for a given attribute. These models represent a range of specifications that one could consider for this network. While other models could be posited, we employ these to illustrate how HOPE can adjudicate between possible specifications.

Model 1 Model 2 Model 3 Model 4 Model 5
Main Seniority
Main Practice
Homophily Practice
Homophily Gender
Homophily Office
Homophily Practice (Litigation)
Homophily Gender (Male)
Homophily Gender (Female)
Homophily Office (Boston)
Homophily Office (Hartford)
AIC 600.8 524.3 513.8 472.0 470.0
BIC 605.2 533.2 540.5 503.2 510.0
, ,
Table 2: Estimated Coefficients for Models Fit to the Lazega Lawyers’ Collaboration Network

Table 2 provides a summary of the estimated parameters along with the nominal AIC/BIC when each model was fit to the fully observed data. Based on the models fit to the full network, model 4 is preferred according to the nominal BIC and model 5 by the AIC. Additionally, model 3 (with node covariates) is preferred over model 2 based on the AIC, but model 2 seems to be better according to the BIC. Based on the AIC/BIC, one has conflicting information regarding which model to select for this network while knowing little about the specific strengths and weaknesses of the competing models. HOPE provides an alternate basis for model selection, which does not depend on the assumptions underlying the AIC or BIC, and allows model comparison based on their predictive performances. We note that the AIC favors the most complicated model (i.e. model 5), but model 4 has the smallest BIC compared to all of the candidate models. Such disagreement is warranted as the AIC, by design, attempts to choose a model as large as the true model, while the probability of selecting a fixed, most parsimonious, and true model tends to as the sample size tends to (e.g. nishii1984asymptotic; haughton1988choice; haughton1989size) based on the BIC.

We implement HOPE on each of the models specified above, with as the sample size for the conditional simulations. The calculated metrics under each model specification and held-out strategy are provided in Table 3. When random samples of edges are held-out, we set (the same number of edge variables as under node-held-out) to make the metrics comparable to those of node held-out. Throughout, we will not attempt to interpret the magnitudes of these values but focus on the relative difference and ranking between models. Focusing on the dyad-level metrics, the substantial differences between model 1 and model 2 suggest that accounting for the dependencies between edge variables is crucial, and the importance of incorporating nodal covariate information is indicated by the noticeable gap between the dyad-level metrics of model 1 and model 3. Model 4, which contains both nodal covariates and dyadic dependent terms, is associated with the highest predictive accuracy of nulls (Null ACC), overall predictive accuracy (Overall ACC) and smallest total squared loss (TSL), for all of the held-out strategies. Although model 5 produces the highest predictive accuracy of edges (Edge ACC), it does not perform as well as model 4 on the other dyad-level metrics. Overall, this presents evidence in favor of model 4, which is more parsimonious than model 5 yet still performs better on all of the other all dyad-level metrics.

If the primary objective of the proposed model is to explain phenomena that depend on variations in nodes’ positions/roles, then node-level and graph-level metrics are better suited. The models are quite comparable in recapitulating nodes’ degree and eigenvector centralities as well as the network’s degree centralization. However, the more complex models (models 4 and 5) better predict nodes’ betweenness centrality and the network’s betweenness centralization index. The rankings between the models are largely consistent across the different metrics and held-out strategies. Taken together, HOPE suggests that model 4 is the preferred model (for most types of analyses) since it performs at least no worse than model 5, while having the clear advantage of being more parsimonious.

It is worth mentioning that particular held-out strategies may be preferred depending on both modeling objectives and choices of target metric. Under leave--out, where each edge variable is held-out and assessed separately, the marginal probability of edge variables can be calculated exactly via change scores (snijders2006new), allowing simple and accurate estimates of dyad-level metrics. However, because many node and graph-level metrics vary little under single edge changes, leave-1-out tends not to provide sufficient information on predictive performance with respect to these measures to discriminate between similar models. By contrast, omitting larger numbers of edge states (either via node-held-out or leave--out strategies) poses a more difficult prediction problem, providing greater discriminatory power for comparing models’ abilities to preserve higher-order structural features. Similarly, the information provided by leave--out and node-held-out strategies differ due to the concentration of held-out edge states on specific nodes in the latter case (making node-held-out well-suited for e.g. comparing models that are intended to capture vertex-level structure). Unsurprisingly, the calculated dyad-level metrics under leave--out are found to approximate those of leave-1-out better, as the missing data are more evenly distributed. Overall, leave--out appears to be the candidate for default choice of HOPE for general purposes, while other held-out strategies are also useful in certain scenarios.

Dyad-level Node-level Graph-level
Model AIC BIC Type Edge ACC Null ACC Overall ACC TSL Degree Betweenness Eigen RMSE, Between Cen. RMSE, Degree Cen.
1 600.8 605.2 Leave-1-out 0.192 0.817 0.703 92.55 0.999 0.996 0.999 0.005 0.005
2 523.0 531.9 Leave-1-out 0.388 0.867 0.779 69.067 0.999 0.998 0.999 0.004 0.004
3 513.8 540.5 Leave-1-out 0.306 0.845 0.747 80.701 0.999 0.997 0.999 0.005 0.004
4 471.8 502.9 Leave-1-out 0.43 0.875 0.794 66.334 0.999 0.998 0.999 0.004 0.004
5 470.0 510.0 Leave-1-out 0.434 0.873 0.793 66.54 0.999 0.998 0.999 0.004 0.004
1 600.8 605.2 Leave--out 0.18 0.817 0.701 94.747 0.963 0.871 0.962 0.024 0.026
2 523.0 531.9 Leave--out 0.37 0.863 0.773 71.45 0.973 0.932 0.969 0.02 0.023
3 513.8 540.5 Leave--out 0.307 0.846 0.748 80.951 0.968 0.894 0.966 0.021 0.023
4 471.8 502.9 Leave--out 0.419 0.872 0.789 67.412 0.975 0.935 0.971 0.02 0.022
5 470.0 510.0 Leave--out 0.422 0.871 0.789 67.842 0.975 0.936 0.971 0.02 0.022
1 600.8 605.2 Node 0.179 0.816 0.7 94.992 0.943 0.874 0.938 0.021 0.021
2 523.0 531.9 Node 0.215 0.838 0.724 88.786 0.931 0.915 0.921 0.018 0.021
3 513.8 540.5 Node 0.3 0.841 0.742 83.656 0.948 0.885 0.947 0.019 0.02
4 471.8 502.9 Node 0.337 0.863 0.767 76.428 0.945 0.92 0.94 0.018 0.02
5 470.0 510.0 Node 0.345 0.86 0.766 77.780 0.945 0.921 0.939 0.018 0.02
  • The original TSL is scaled by for the purpose of comparison, as each edge variable is held-out twice in the entire process under node held-out.

Table 3: HOPE Metrics for 5 Models Fit to Lazega’s Lawyers’ Collaboration Network

4.2 Teenage Friendship Network

The teenage friendship network contains undirected friendship ties between young women333Following (bouranis2018bayesian), we use the intersection of reported ties and symmetrize the adjacency matrix.. This network is a subset of the Teenage Friends and Lifestyle Study data set (michell2000smoke), which has three years of friendships from 1995 to 1997. We use the data from the first year along with two nodal attributes: smoke, which takes the values 1 (non-smoker), 2 (occasional smoker), and 3 (regular smoker); and drugs for cannabis use, which takes the values 1 (non-drug user), 2 (tried once), 3 (occasional) and 4 (regular use). Prior analysis on these data (bouranis2018bayesian) dichotomized the drugs such that 1 indicates individuals that never used drugs or tried once; and 2 indicates occasional or regular drug use. We only use the dichotomized version of the variable, which we refer to as drugs_binary. Figure 2 depicts the friendship network and individuals’ frequencies of smoking and drug use. As shown, most friendships are homophilous based on smoking and drug use, with one individual bridging two distinct groups of the largest component.

Figure 2: Teenage friendship network. Colors indicate individuals’ different smoking patterns (white = non-smoker; black = occasional smoker; and red = regular smoker); size indicates individuals’ frequencies of drug use (larger nodes indicate greater frequencies).

Again, we evaluate five models that incrementally build upon the previous models, similar to the models considered for the lawyers’ collaboration network. Model 5 includes a differential homophily term for smoking that is absent from Models 3 and 4. The decay parameters of GWESP and GWDEG terms are fixed as and following the models used in bouranis2018bayesian.

  • Model 1: edges

  • Model 2: edges + gwesp(log(2)) + gwdegree(0.8)

  • Model 3: edges + nodematch( drugs_binary, diff=T )

  • Model 4: edges + gwesp(log(2)) + gwdegree(0.8) + nodematch( drugs_binary, diff=T )

  • Model 5: edges + gwesp(log(2)) + gwdegree(0.8) + nodematch( drugs_binary, diff=T ) + nodematch(smoke, diff=T)

Model 1 Model 2 Model 3 Model 4 Model 5
Homophily drug (Never/Tried Once)
Homophily drug (Occasional/Regular Use)
Homophily smoke (Nonsmoker)
Homophily smoke (Occasional Smoker)
Homophily smoke (Regular Smoker)
AIC 560.8 453.6 535.1 438.7 443.7
BIC 565.9 469.0 550.4 464.2 484.6
, ,
Table 4: Estimated Coefficients for Models Fit to the Teenage Friendship Network

Table 4 provides a summary of the estimated parameters along with the nominal AIC/BIC when each model was fit to the fully observed data. Based on the models fit to the full network, model 4 is preferred according to the nominal AIC/BIC. For each model specified above, we implement HOPE with as the sample size for the conditional simulations. Table 5 displays the calculated metrics for each model using the different held-out strategies, along with the AIC and BIC when fit to the original network. When random samples of edges are held-out, we set (the same number of edge variables as under node-held-out) to make the metrics comparable to those of node held-out. In contrast to the lawyer network, the inclusion of the dyadic dependent term improves the fit of the models to the friendship network significantly more than including nodal covariates according to the AIC/BIC.

The dyad-level metrics unanimously indicate that model 4 is as good if not better than all other models. It is worth noting that model 2 is itself a compelling model as the gaps between the dyad-level metrics for model 2 and model 4 are minimal, suggesting that dyadic dependent terms play a more central role in explaining the formation of this friendship network. Similar to the collaboration network, the most complex model (model 5) does not lead to a better predictive performance at the dyad level.

Larger variations on the node-level and graph-level metrics are present in this case. This is partly due to the sparsity of this friendship network, which makes the misplacement of a single edge generate a more substantial effect on the structural features. Such variation can be more informative for the purpose of model comparison. We find that all of the models, including the dyadic independence models (models 1 and 3), are fairly effective in reproducing nodes’ degree and eigenvector centralities. However, these models perform less well at recovering measures such as betweenness, compared to other models with terms that account for dyadic dependence. This may be because the combination of sparsity and local clustering that characterizes the network (and to which betweenness is sensitive) cannot be easily captured without GWESP.

The differences between the results of this network and the Lazega network are revealing. Note that the improvement produced by the covariate effects alone is not as substantial as that by dependence terms compared to the baseline homogeneous Bernoulli graph. In fact, the predictive performance of model 2 is almost comparable to those of models with both covariates and dyadic dependent terms under leave--out sampling. This suggests a dominant role of endogenous network effects in the formation of the teenage friendship network, as opposed to the strong role for covariates in structuring collaboration among lawyers. In addition, the substantially different performances of model 2 under node and leave--out held-out strategies further suggests the importance of dyadic dependence in network structure, as local structure about each vertex appears to be largely emergent and hence difficult to predict without access to other edge variables on the same vertex. By contrast, if tie formation were dominated by observed covariate effects, we would see much less differentiation between leave--out and node-held-out performance. Such performance comparisons thus provide a useful tool for inferring the relative importance of different mechanisms in determining network structure.

Dyad-level Node-level Graph-level
Model AIC BIC Type Edge ACC Null ACC Overall ACC TSL Degree Betweenness Eigen RMSE, Between Cen. RMSE, Degree Cen.
1 560.8 565.9 Leave-1-Out 0.072 0.930 0.878 69.30 0.999 0.986 0.997 0.011 0.002
2 453.6 469.0 Leave-1-Out 0.480 0.965 0.936 41.064 0.999 0.998 0.998 0.004 0.001
3 535.1 550.4 Leave-1-Out 0.088 0.948 0.896 66.477 0.999 0.989 0.997 0.010 0.002
4 438.7 464.2 Leave-1-Out 0.483 0.968 0.938 39.982 0.999 0.998 0.998 0.004 0.001
5 443.7 484.6 Leave-1-Out 0.480 0.966 0.937 40.684 0.999 0.998 0.998 0.004 0.001
1 560.8 565.9 Leave--Out 0.062 0.939 0.886 69.528 0.932 0.355 0.895 0.064 0.013
2 453.6 469.0 Leave--Out 0.457 0.962 0.932 43.365 0.958 0.892 0.906 0.027 0.009
3 535.1 550.4 Leave--Out 0.086 0.941 0.889 68.007 0.934 0.35 0.896 0.065 0.012
4 438.7 464.2 Leave--Out 0.463 0.964 0.933 42.664 0.959 0.884 0.909 0.028 0.009
5 443.7 484.6 Leave--Out 0.463 0.963 0.933 42.897 0.959 0.884 0.911 0.027 0.009
1 560.8 565.9 Node 0.058 0.94 0.886 140.246 0.927 0.477 0.857 0.056 0.01
2 453.6 469.0 Node 0.076 0.94 0.887 136.863 0.923 0.864 0.766 0.035 0.011
3 535.1 550.4 Node 0.081 0.941 0.889 137.486 0.927 0.483 0.856 0.055 0.01
4 438.7 464.2 Node 0.101 0.943 0.892 132.847 0.925 0.859 0.779 0.035 0.011
5 443.7 484.6 Node 0.097 0.943 0.892 135.706 0.922 0.859 0.778 0.035 0.011
  • The original TSL is scaled by for the purpose of comparison, as each edge variable is held-out twice in the entire process under node held-out.

Table 5: HOPE Metrics for 5 Models Fit to Teenage Friendship Network

5 Discussion and Conclusion

In this paper, we employ a technique analogous to cross-validation for models with dependent, relational data: Held-Out Predictive Evaluation (HOPE). Researchers can use HOPE to compare different model specifications of ERGMs and then select the best model for their purpose. HOPE is flexible to multiple ways of holding-out data, and allows models to be evaluated based on an array of predictive metrics. Those we have delineated are by no means exhaustive, and we hope that researchers will consider other metrics in the future based on their specific research questions, such as goodness-of-fit metrics related to a graph’s spectrum by shore2015spectral if e.g. macroscopic community structure is of substantial interest. Taken together, HOPE complements existing model adequacy checks while addressing a current limitation for model selection. Unlike other model selection tools (e.g.; AIC or the Bayes factor), HOPE is always straightforward to implement, and can be easily extended to ERGMs in which edges are counts, ranks, or continuous (see e.g., krivitsky2012exponential; krivitsky2017exponential; desmarais2012statistical, respectively) rather than just binary relations. Researchers can also use HOPE to diagnose different models’ predictive abilities. This allows model selection to proceed based on tasks that are more appropriate to the analyst’s modeling objectives.

The application of HOPE to two different networks with vastly different structures indicates the importance of properly choosing the held-out strategy and metrics that will be used to select the model. Based on empirical findings in this paper, the leave-1-out strategy has the potential to yield the most reliable estimates of the dyad-level metrics, but can generate node-level and graph-level metrics that have too little variation across models, as some structural measures are fairly robust and hence remain unchanged given the misplacement of a single edge variable. Holding out each node’s edge variables can alter the network structure to a large extent, and hence facilitate a more fruitful comparison of node-level and graph-level metrics, but the information it delivers regarding the predictive accuracy of edges can be drastically different from the other held-out strategies for networks that are sparse or contain imbalanced covariates. The leave--out strategy provides a better general purpose strategy, since more held-out edge variables are present in each batch than leave-1-out, and they are more evenly distributed rather than all of the dyads containing on of the vertices, which can make the assessments dependent on nodes’ idiosyncracies. As a general starting point, we suggest setting equal to (or, in the directed case, ), to facilitate comparison with node-held-out performance.

As noted, HOPE is distinct from other model selection and adequacy techniques. Simulation-based goodness-of-fit approaches (hunter2008goodness) are widely used for adequacy checking, but are not well-suited to model selection. The method developed by koskinen2018outliers

can successfully identify outliers that influence coefficients’ magnitude and significance, but it focuses on quantifying how outliers affect parameter estimates for a model rather than a model’s predictive performance. The AIC and BIC scores


intended for model selection, but are difficult or impossible to calculate accurately in the case of dependence models (with only nominal data degrees of freedom currently known) and draw upon asymptotic arguments of dubious validity in typical ERGM applications. Bayes factors have better theoretical properties, but are even more difficult to compute. All of these methods have assets, and we do not argue against their use. However, HOPE offers distinct gains in feasibility and interpretability, and it has the advantage of focusing attention on models’ out-of-sample predictive abilities. As such, it moves away from coefficient tests by illuminating which effects are

predictive rather than significant. The models for the collaboration network show that several of the heterogeneous homophily terms in model 5 are significant, and when fit to the full network, this model has a better AIC value than model 4, which only includes uniform homophilly terms. Yet, the predictive metrics illuminate the fact that these effects are ephemeral, and more importantly, they do not improve the model’s predictive accuracy compared to model 4. HOPE’s ability to illuminate these differences speaks to the ongoing controversy about significance and -values (goodman2019getting; kuffner2019p; mcshane2019abandon). In our view, terms in a model should typically improve the ability of the model to predict the phenomena of interest rather than simply being statistically significant and, by turn, significant effects that add nothing to prediction should be viewed with suspicion.

In sum, HOPE is complementary to extant methods of model selection and model adequacy checks. We believe that the strengths of HOPE will prove to be a useful tool for research when selecting between multiple model specification of ERGMs or other statistical network models with complex dependencies.


This work was supported in part by NSF awards IIS-1526736 and DMS-1361425.

Appendix A

Here, we list the definitions of the sufficient statistics used in Section 4. All of the terms are in the context of an undirected network, since that is the case for all of the data sets used. For more details, see morris2008specification for a more comprehensive list of sufficient statistics for ERGMs.

Edge statistic (Edges)

Geometrically weighted edgewise shared partner (GWESP) statistic

where is the number of connected pairs that have exactly common neighbors, which is a measure of local clustering in a network. The decay parameter controls the relative contribution of to the GWESP statistic.

Geometrically weighted degree statistic (GWDEG)

where is the number of nodes who have exactly neighbors. The decay parameter controls the relative contribution of to the GWDEG statistic.

Uniform homophily effect (nodematch)

Categorical covariate is defined for each node in the graph. The term counts the total number of edges between nodes with the same values of ,

Differential homophily effect (nodematch)

Categorical covariate is defined for each node in the graph, which can take distinct values, for any given possible value of where

Edge-level covariate effect (edgecov)

Covariate is measured for each pair of nodes in the graph,

Node covariate (nodecov)

Covariate is measured for every node in the graph, which can take numerical values