In online social networks, it is common to use predictions of node categories to estimate measures of homophily and other relational properties. However, online social network data often lacks basic demographic information about the nodes. Researchers must rely on predicted node attributes to estimate measures of homophily, but little is known about the validity of these measures. We show that estimating homophily in a network can be viewed as a dyadic prediction problem, and that homophily estimates are unbiased when dyad-level residuals sum to zero in the network. Node-level prediction models, such as the use of names to classify ethnicity or gender, do not generally have this property and can introduce large biases into homophily estimates. Bias occurs due to error autocorrelation along dyads. Importantly, node-level classification performance is not a reliable indicator of estimation accuracy for homophily. We compare estimation strategies that make predictions at the node and dyad levels, evaluating performance in different settings. We propose a novel "ego-alter" modeling approach that outperforms standard node and dyad classification strategies. While this paper focuses on homophily, results generalize to other relational measures which aggregate predictions along the dyads in a network. We conclude with suggestions for research designs to study homophily in online networks. Code for this paper is available at https://github.com/georgeberry/autocorr.
Researchers have long sought to understand the pattern [Marsden, 1987, McPherson et al., 2006], causes [Kossinets and Watts, 2009], and consequences [Blau, 1977] of homophily [Coleman, 1958], or the tendency for like to associate with like. Measuring the similarity of nodes along racial [Marsden, 1987, McPherson et al., 2006, Smith et al., 2014, Cesare et al., 2017b, Mollica et al., 2003, Wimmer and Lewis, 2010], ethnic [Hofstra et al., 2017], gender [Messias et al., 2017, Thelwall, 2009, Choudhury, 2011], social status [Kossinets and Watts, 2009], cultural [Lewis et al., 2012], emotional [Himelboim et al., 2016], political [Halberstam and Knight, 2016, Bakshy et al., 2015, Colleoni et al., 2014], and socioeconomic [DiPrete et al., 2011] lines is a core area of research in network science [McPherson et al., 2001]. In online networks, understanding the structure of homophily is crucial for understanding echo chambers [Barberá et al., 2015], access to information, and network integration [Karimi et al., 2018].
Online social networks present a particular challenge for understanding this fundamental aspect of networks: demographic and attitudinal information is often absent. The common strategy for addressing this is to predict demographic or attitudinal attributes [Cesare et al., 2017b, Messias et al., 2017] based on publicly available information such as names, profile photos, text, or other information [Barberá, 2016, Al Zamal et al., 2012, Hofstra and de Schipper, 2018, Choudhury, 2011, Messias et al., 2017]. This information is combined with ground truth labels (known values of the category of interest for a set of individuals), and a supervised learning classifier [Molina and Garip, 2019] is then used to predict the node category for all nodes in the network.
Although predicted node attributes are widely used to empirically measure homophily and other relational properties [Cesare et al., 2017b, Messias et al., 2017, Himelboim et al., 2016, Boutyline and Willer, 2017, Hobbs et al., 2016, Bakshy et al., 2015, Colleoni et al., 2014, Choudhury, 2011], there is a lack of theoretical methodological work investigating when and to what extent the predictions produce reasonable estimates [Berry et al., 2018]. The most common strategy is to choose a model which maximizes a node-level measure of classification performance. Because of the complexities of networks, this criterion is not sufficient for reliable estimation of relational measures such as homophily.
In this paper, we formalize the homophily estimation problem as a dyadic prediction problem. This expression clarifies the difficulty in using node-level predictions to draw larger inferences: node residuals are multiplied along edges, magnifying a node’s residual in proportion to its degree, and opening the door to residual autocorrelation along dyads. Theoretically, we should expect correlated errors along edges due to latent homophily [DellaPosta et al., 2015], or the correlation of unobserved factors in the network. For example, name-based classifiers [Hofstra and de Schipper, 2018, Choudhury, 2011, Hobbs et al., 2016] have frequently been used for gender classification. If a name-based method codes “Leslie” as “woman” because this is more common in the population, yet for a specific network community the name “Leslie” tends to indicate “man”, model errors will be correlated with the network and can bias overall gender homophily estimates. This issue is compounded by the highly skewed degree distributions of online networks [Kato et al., 2012], which introduce the additional possibility that misclassification of high degree nodes will bias the overall estimate.
We show that dyad-level predictions produce unbiased homophily estimates. However, such estimates are often high-variance for a given ground truth labeling budget. (This fact suggests that even when possessing the “ideal” random edge sample with labels, modeling should be employed as a variance reduction technique.) This motivates a two-step modeling procedure (called “ego-alter”) which predicts the category of a node from the perspective of each one of its network neighbors. This allows incorporating network information beyond the ego level, while still using standard modeling tools such as logistic regression. This ego-alter approach is theoretically less biased than a node-level model, and across a range of simulations outperforms both node-level and dyad-level models in overall error. Figure 1 visually compares these three approaches. While we primarily study homophily with node demographics in mind, results here apply to a wide range of networked outcomes, such as estimating the fraction of people belonging to a certain race/ethnicity who experience hate speech in their social media feeds [Davidson et al., 2019].
We study the average fraction of ego networks composed of ingroup members (visualized in Figure 2). Average ego network composition has been extensively studied in sociology, primarily in research concerning the General Social Survey network module [Marsden, 1987, McPherson et al., 2006]. The average ego network composition measures what the network tends to look like from the perspective of members of a given group. For instance, Black respondents to the GSS have been found to have higher average racial heterogeneity in their core discussion networks than White respondents [Marsden, 1987]. (We choose this measure of homophily instead of Coleman’s homophily measure [Coleman, 1958] because it is not dominated by high degree nodes, although the Appendix shows that results for the average egonet measure apply to Coleman’s measure as well.)
Average egonet composition can be written as a sum over ego networks, taking into account the category of both ego and alter. Let $y_i$ indicate the category of node $i$; for instance, in the case of racial homophily, category $a$ may indicate Black, category $b$ may indicate White, and so on. Without loss of generality, assume that we are studying a binary outcome where $y_i \in \{a, b\}$. For compactness, we write $y_i^a$ to denote the indicator $\mathbb{1}(y_i = a)$. Then the average fraction of group $a$’s ego networks which are composed of alters in group $b$ (Figure 2) can be written,
$$H_{ab} = \frac{1}{N_a} \sum_{i} \frac{1}{d_i} \sum_{j \in n(i)} y_i^a \, y_j^b, \qquad (1)$$
where $d_i$ indicates the degree of node $i$, $n(i)$ is a function which returns the neighbors of $i$, and $N_a$ indicates the size of group $a$, $N_a = \sum_i y_i^a$. For example, if $H_{aa} = 0.7$, it means that an average ego network for group $a$ is composed of 70% ingroup members.
Note that Equation 1 can be rewritten as a sum over dyads in the network by rearranging the summation,
$$H_{ab} = \frac{1}{N_a} \sum_{(i,j) \in E} \frac{1}{d_i} \, y_i^a \, y_j^b, \qquad (2)$$
where $(i, j) \in E$ are edges in graph $G$. Rewriting the edge-level outcome $y_i^a y_j^b$ as a single random variable $y_{ij}^{ab}$ provides an expression in terms of edge categories. Estimating $H_{ab}$ can therefore be considered either a node or dyadic prediction task. Note that in the node case, predictions are multiplied.
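The equivalence between the node-sum and dyad-sum forms can be checked numerically. The following is a minimal sketch on a hypothetical five-node toy network (the graph, the group labels, and the function names are illustrative assumptions, not part of the paper's code); it assumes every node has at least one neighbor and that links are stored in both directions.

```python
def egonet_share_by_nodes(neighbors, group, a, b):
    """Node-sum form: average over group-a egos of the share of alters in group b."""
    n_a = sum(1 for i in neighbors if group[i] == a)
    total = 0.0
    for i, alters in neighbors.items():
        if group[i] == a:
            total += sum(1 for j in alters if group[j] == b) / len(alters)
    return total / n_a


def egonet_share_by_dyads(neighbors, group, a, b):
    """Dyad-sum form: one term (1 / d_i) * 1[i in a] * 1[j in b] per directed dyad."""
    n_a = sum(1 for i in neighbors if group[i] == a)
    total = 0.0
    for i, alters in neighbors.items():
        for j in alters:
            total += (1.0 / len(alters)) * (group[i] == a) * (group[j] == b)
    return total / n_a


# Hypothetical toy network: undirected links stored in both directions.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
group = {0: "a", 1: "a", 2: "b", 3: "a", 4: "b"}

h_nodes = egonet_share_by_nodes(neighbors, group, "a", "a")
h_dyads = egonet_share_by_dyads(neighbors, group, "a", "a")
print(h_nodes, h_dyads)
```

In this toy example the three group-a egos have ingroup shares of 1/2, 1/2, and 0, so both forms give one third.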
Equation 2 shows how homophily can be estimated with knowledge of edge categories $y_{ij}^{ab}$. Assume edge categories are obtained for a random sample of the edges $S \subset E$, and features $X_{ij}$ correlated with edge categories are available both for the sample $S$ and the population $E$. Assuming random sampling simplifies the argument, and we discuss deviations from this assumption in the Appendix.
Assume a model predicting $y_{ij}^{ab}$ is chosen which has the property that the residuals $e_{ij} = y_{ij}^{ab} - \hat{y}_{ij}^{ab}$ sum to zero in the population and are uncorrelated with features, $\sum_{(i,j) \in E} e_{ij} = 0$ and $\sum_{(i,j) \in E} X_{ij} e_{ij} = 0$. (For instance, ordinary least squares and logistic regression both have this property, as does any model with mean squared error loss.) Then this model trained on the sample $S$ provides an unbiased estimate of homophily in the population given that inverse ego degree $1/d_i$ is included in $X_{ij}$ as a feature.
The reason for including $1/d_i$ as a variable can be seen by the following argument. First, recall the conditional expectation function (CEF) expression [Angrist and Pischke, 2009]: $y_{ij}^{ab} = E[y_{ij}^{ab} \mid X_{ij}] + \epsilon_{ij}$, where $E[y_{ij}^{ab} \mid X_{ij}]$ can be estimated with a model such as OLS. An estimator for Equation 2 can be written in terms of predictions,
$$\hat{H}_{ab} = \frac{1}{N_a} \sum_{(i,j) \in E} \frac{1}{d_i} \, \hat{y}_{ij}^{ab} \qquad (3)$$
$$\hat{H}_{ab} = H_{ab} - \frac{1}{N_a} \sum_{(i,j) \in E} \frac{1}{d_i} \, e_{ij}. \qquad (4)$$
This indicates that the homophily estimate will be unbiased when the sum of residuals term is zero.
When $\sum_{(i,j) \in E} \frac{1}{d_i} e_{ij} = 0$, the expectation of the estimate equals the true value, $E[\hat{H}_{ab}] = H_{ab}$.
Since we assumed a model is used where $\sum_{(i,j) \in E} e_{ij} = 0$ and $\sum_{(i,j) \in E} X_{ij} e_{ij} = 0$, if $1/d_i$ is included in $X_{ij}$ then $\sum_{(i,j) \in E} \frac{1}{d_i} e_{ij} = 0$.
This argument concerns model residuals, not error terms. No assumptions have been made about causality, true functional form, or predictive accuracy. With a random edge sample and OLS, $1/d_i$ is the only required variable in $X_{ij}$ for an unbiased estimate, although this can produce a high variance estimate. Including robust predictive features is therefore still important for variance reduction and to address cases of non-random sampling.
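The role of the inverse degree feature can be illustrated with a small numerical check. This is a sketch under stated assumptions rather than the paper's code: the dyad labels are hypothetical synthetic draws, the model is a one-feature OLS fit in closed form, and the check is made on the data the model was fit to. It shows that when inverse ego degree enters the model alongside an intercept, the degree-weighted residual sum vanishes, whereas an intercept-only model gives no such guarantee.

```python
import random


def fit_line(x, y):
    """Closed-form OLS for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(0)
# Hypothetical dyads: the ego degree of each dyad and a binary dyad label
# whose probability depends on inverse ego degree (an arbitrary assumption).
ego_degree = [rng.randint(1, 20) for _ in range(500)]
inv_deg = [1.0 / d for d in ego_degree]
labels = [1.0 if rng.random() < 0.3 + 0.4 * w else 0.0 for w in inv_deg]

# Model with intercept and inverse ego degree as the only feature.
b0, b1 = fit_line(inv_deg, labels)
residuals = [y - (b0 + b1 * w) for w, y in zip(inv_deg, labels)]
weighted_residual_sum = sum(w * e for w, e in zip(inv_deg, residuals))

# Intercept-only model: residuals sum to zero, but the degree-weighted
# sum is not constrained to be zero.
mean_label = sum(labels) / len(labels)
weighted_sum_intercept_only = sum(w * (y - mean_label) for w, y in zip(inv_deg, labels))

print(weighted_residual_sum, weighted_sum_intercept_only)
```

The first printed value is zero up to floating point error because of the OLS normal equations; the second depends on the data.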
Sampling and labeling limitations often make collecting a large set of ground-truth dyads infeasible. In this situation, node-level ground truth information can be employed to estimate homophily. We present a two-step modeling strategy which we term “ego-alter” which uses only node-level ground truth information, reduces bias over a standard node-level model, and reduces variance compared to the edge model in Section 3. The ego-alter approach is biased, although the magnitude of bias in simulations we examine below is generally substantially less than a node-level model.
The ego-alter model is a hybrid approach: it uses dyadic features to predict the node-level ground truth of ego $i$ and alter $j$ separately, producing one prediction per edge for both ego and alter (see Figure 1 for a visual representation). This has the benefit of reducing bias in two ways: first, predictions for ego and alter are improved by including a richer set of features, which improves prediction accuracy; second, dyadic residual autocorrelation is reduced.
Since $y_{ij}^{ab} = y_i^a y_j^b$, $H_{ab}$ can be estimated with the product of node predictions,
$$\hat{H}_{ab} = \frac{1}{N_a} \sum_{(i,j) \in E} \frac{1}{d_i} \, \hat{y}_i^a \, \hat{y}_j^b$$
$$\hat{H}_{ab} = \frac{1}{N_a} \sum_{(i,j) \in E} \frac{1}{d_i} \, (y_i^a - e_i)(y_j^b - e_j),$$
where $e_i$ and $e_j$ are the ego and alter residuals and the second line follows by substituting the CEF. This can be expressed as the true homophily value plus two bias terms,
$$\hat{H}_{ab} = H_{ab} + B_1 + B_2, \qquad B_1 = -\frac{1}{N_a} \sum_{(i,j) \in E} \frac{1}{d_i} \, e_i \, y_j^b, \qquad B_2 = -\frac{1}{N_a} \sum_{(i,j) \in E} \frac{1}{d_i} \, \hat{y}_i^a \, e_j. \qquad (5)$$
The bias terms $B_1$ and $B_2$ both indicate dyadic correlation of the model residuals with neighbor outcomes. Assuming $1/d_i$ is included as a model feature, $B_1$ and $B_2$ can be thought of similarly: when model residuals are correlated with neighbor outcomes, the terms will be non-zero. This can happen when models produce pockets of similar errors in the network due to unobserved, network-correlated features. When inverse ego degree is not included as a model feature, the bias terms become substantially more complex because of the interaction between degree, errors by degree, and errors along dyads.
Equation 5 indicates that the estimate equals its true value when $B_1 = B_2 = 0$ or when $B_1 = -B_2$. Note that $B_1 = -B_2$ is unlikely: residuals for $i$ and $j$ have similar correlations with neighbor true outcomes, since both ego and alter models score the entire network. (This is confirmed by simulations, where $B_1$ and $B_2$ tend to have similar values.)
Note that $B_2$ is the result of combining two terms, since $-y_i^a e_j + e_i e_j = -(y_i^a - e_i) e_j = -\hat{y}_i^a e_j$. This suggests an “augmented” ego-alter model: first, fit an ego model for $y_i^a$, and then fit an alter model for $y_j^b$ which includes the ego predictions $\hat{y}_i^a$ as a feature. This, in expectation, eliminates the $B_2$ term and reduces bias assuming $B_1$ and $B_2$ have the same signs.
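The algebra behind the two bias terms can be verified per dyad. The sketch below assumes residuals are defined as outcome minus prediction and checks, on hypothetical random values, that the product of ego and alter predictions equals the true dyad outcome minus an ego-residual term and minus an alter-residual term weighted by the ego prediction. All variable names are illustrative.

```python
import random

rng = random.Random(1)
max_gap = 0.0
for _ in range(1000):
    # Hypothetical true indicators for ego and alter, and model predictions.
    y_i, y_j = rng.randint(0, 1), rng.randint(0, 1)
    p_i, p_j = rng.random(), rng.random()
    # Residuals defined as outcome minus prediction (an assumption).
    e_i, e_j = y_i - p_i, y_j - p_j

    product_of_predictions = p_i * p_j
    # True dyad outcome, minus ego residual times alter outcome,
    # minus ego prediction times alter residual.
    decomposition = y_i * y_j - e_i * y_j - p_i * e_j
    max_gap = max(max_gap, abs(product_of_predictions - decomposition))

print(max_gap)
```

Because the identity holds dyad by dyad, it also holds after weighting each dyad by inverse ego degree and summing over the network.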
For each node $i$, the probability of belonging to the focal group is simulated from two features, $x_i$ and $z_i$, which represent individual and network-correlated features, respectively. The individual-level feature $x_i$ is drawn from a normal distribution, while the network feature is the maximum of the individual feature among the neighbors of each node $i$: $z_i = \max_{j \in n(i)} x_j$. $z_i$ is then standardized to follow a normal distribution. This creates outcomes correlated along some dyads in the network, where nodes with large values of $x$ “influence” neighbors. If $z_i$ is omitted, model residuals will be correlated along dyads and bias homophily estimates. The level of homophily generated by these parameters is moderate: the average ego network for the focal group contains 59% of nodes in the same group, while Coleman’s homophily index for that group is 0.14.
We choose the network feature $z_i$ to be the maximum value of the individual feature among the alters of ego $i$ to provide a challenging setting for models: the true response is determined at the ego-network level yet models operate at the node or dyad levels, meaning a dyadic regression cannot capture the true functional form of the data generating process. This both approximates real-world scenarios where the data generating process is unknown, and demonstrates the argument about bias in Section 3. Four alternative simulation specifications are examined in the Appendix, with qualitatively similar results to this simulation.
Networks with 4000 nodes are generated according to a preferential attachment graph [Barabasi and Albert, 1999] with five links per node and a powerlaw exponent . Links are considered bidirected. Preferential attachment graphs have high degree disparities, providing a challenging setting for the estimation task considered here, since model errors on individual high degree nodes can bias estimates. We conduct simulations for both node and edge sampling, selecting 20% of nodes or 2.5% of edges randomly as ground truth cases. This produces roughly 800 ground truth nodes for both types of sampling. Note that sampling nodes into the ground truth set provides some ground truth dyads (and vice versa), meaning both node and dyad models can be fit with either type of sampling.
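A simulation of this kind can be sketched with the standard library alone. The block below is an illustrative approximation, not the paper's code: the graph generator is a simple Barabasi-Albert style routine written for this example, standardization is a plain z-score, and the logistic link with unit coefficients on both features is a hypothetical choice, since the exact data generating equation is not reproduced here.

```python
import math
import random


def preferential_attachment(n, m, rng):
    """Each new node links to m distinct existing nodes chosen proportionally to degree."""
    neighbors = {i: set() for i in range(n)}
    endpoints = []  # every node appears once per incident link
    # Start from a fully connected core of m + 1 nodes.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            neighbors[i].add(j)
            neighbors[j].add(i)
            endpoints += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            neighbors[new].add(t)
            neighbors[t].add(new)
            endpoints += [new, t]
    return neighbors


def standardize(values):
    """Z-score: subtract the mean and divide by the standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


rng = random.Random(0)
n_nodes, links_per_node = 4000, 5
graph = preferential_attachment(n_nodes, links_per_node, rng)

# Individual feature, and network feature = max of the individual feature among neighbors.
x = [rng.gauss(0.0, 1.0) for _ in range(n_nodes)]
z = standardize([max(x[j] for j in graph[i]) for i in range(n_nodes)])

# Hypothetical outcome model: logistic link with unit coefficients (an assumption).
prob = [1.0 / (1.0 + math.exp(-(x[i] + z[i]))) for i in range(n_nodes)]
y = [1 if rng.random() < p else 0 for p in prob]

# A 20% random node sample as ground truth.
ground_truth_nodes = rng.sample(range(n_nodes), n_nodes // 5)
print(len(ground_truth_nodes), sum(y) / n_nodes)
```

With 4000 nodes, the 20% node sample yields 800 ground truth nodes, matching the labeling budget described above.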
Using the ground truth sample to estimate a model, we classify all edges and estimate homophily across 500 simulation runs. Model performance is measured in two ways: bias and absolute error. Writing $\hat{H}$ for the estimate and $H$ for the true value in a given run, bias is the average of $(\hat{H} - H)/H$ across all simulation runs, and represents the systematic deviation from the true value. Absolute error is the average absolute error relative to the true underlying value, or the average of $|\hat{H} - H|/H$ across all simulation runs. It captures how far estimates tend to be from the true value.
Since both absolute error and bias are normalized, they have the interpretation of “percent error.” The bias-variance tradeoff means that we should not expect the method with the lowest bias to also have the lowest absolute error.
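The distinction between the two metrics can be made concrete with a short sketch. The function names and the three example runs are hypothetical; each run is an (estimate, true value) pair, and both metrics are normalized by the true value as described above.

```python
def normalized_bias(runs):
    """Average of (estimate - truth) / truth over simulation runs."""
    return sum((est - truth) / truth for est, truth in runs) / len(runs)


def normalized_abs_error(runs):
    """Average of |estimate - truth| / truth over simulation runs."""
    return sum(abs(est - truth) / truth for est, truth in runs) / len(runs)


# Hypothetical runs: errors cancel on average, yet individual estimates miss the truth.
runs = [(0.55, 0.59), (0.63, 0.59), (0.59, 0.59)]
bias = normalized_bias(runs)
abs_error = normalized_abs_error(runs)
print(bias, abs_error)
```

Here the bias is zero up to floating point error while the absolute error is roughly 4.5%, which is why a method with the lowest bias need not have the lowest absolute error.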
We evaluate three types of models: node, dyad, and ego-alter (see Figure 1 for a depiction), all fit with logistic regression. For the node model, we examine models with and without network features. For the ego-alter model, we examine both the basic version and the “augmented” version. This gives a total of five models, each specified as a main-effects linear model.
In the ego-alter models, we predict $i$’s category for each neighbor $j$ separately, using features of both $i$ and $j$ in the prediction. Ego and alter degree are included in models because they tend to reduce the bias and variance of estimates and are available to researchers conducting network studies.
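The data layout implied by the ego-alter approach can be sketched as one row per directed dyad, carrying features of both ego and alter. The following is an illustrative sketch, not the paper's implementation: the row fields, the function names, and the toy network are assumptions, the group size is treated as known, and "oracle" predictors stand in for fitted logistic regressions so that the arithmetic can be checked.

```python
def ego_alter_rows(neighbors, features):
    """One row per directed dyad (i, j), with features of both ego i and alter j."""
    rows = []
    for i, alters in neighbors.items():
        for j in alters:
            rows.append({
                "ego": i,
                "alter": j,
                "x_ego": features[i],
                "x_alter": features[j],
                "inv_deg_ego": 1.0 / len(alters),
                "deg_alter": len(neighbors[j]),
            })
    return rows


def ego_alter_estimate(rows, ego_pred, alter_pred, n_a):
    """(1 / N_a) * sum over dyads of (1 / d_i) * P(ego in a) * P(alter in b)."""
    return sum(r["inv_deg_ego"] * ego_pred(r) * alter_pred(r) for r in rows) / n_a


# Hypothetical toy network and node feature.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
group = {0: "a", 1: "a", 2: "b", 3: "a", 4: "b"}
features = {0: 0.9, 1: 0.7, 2: -0.4, 3: 0.2, 4: -1.1}

rows = ego_alter_rows(neighbors, features)

# Oracle predictors return the true indicator; a fitted model would return a probability.
oracle_ego = lambda r: 1.0 if group[r["ego"]] == "a" else 0.0
oracle_alter = lambda r: 1.0 if group[r["alter"]] == "a" else 0.0

estimate = ego_alter_estimate(rows, oracle_ego, oracle_alter, n_a=3)
print(len(rows), estimate)
```

With error-free predictions the estimate reproduces the true ingroup share for this toy network (one third); with fitted models, each row would receive separate ego and alter probabilities.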
As shown in Figure 3, the default approach of using a node-level classifier with no network features performs poorly. Homophily is underestimated by between 10% and 20%, with average absolute error of about the same magnitude. Even when accounting for the inverse degree term $1/d_i$, the node-level approach still underestimates homophily by around 3%. This large reduction in bias indicates the importance of including network information in the model predicting node categories, while the remaining bias indicates the limitations of a node-level approach in the presence of network-correlated outcomes.
In this simulation, homophily is under-estimated. This indicates that errors tend to be positively correlated along dyads, increasing the sum of residuals term in Equation 4 and reducing the overall homophily estimate. In other words, there are pockets of the network where the model errors are similar. An alternative scenario exists where a model produces negatively correlated dyadic errors and an over-estimate of homophily. An instance where this happens is residual-degree correlation in the network. When high degree nodes have positive residuals and low degree nodes have negative residuals, the overall residual term in Equation 4 can be negative and cause an over-estimate of homophily. (An instance of this phenomenon can be seen in Section 8.1, with the simulation called “degree.”)
A dyadic model produces an unbiased estimate of homophily, according with the argument in Section 3. However, the dyadic approach does not produce the lowest absolute error. Despite a small amount of bias (around 1%), the ego-alter approach produces lower absolute error on average than the dyadic approach. In alignment with the theoretical argument in Section 4, including ego predictions in the alter model reduces bias by about 20% on average.
While this simulation uses random sampling, the ego-alter model is also more robust to deviations from random sampling compared to other methods, as shown in the Appendix. In the presence of non-random edge samples, an edge model can be brittle. One potential corrective is weighting the ground truth data, but the often high-dimensional nature of edge features risks large design effects due to the curse of dimensionality [Iacus et al., 2012]. Additionally, a “meta-network-correlation” problem can arise, where errors in weights are network correlated.
A more practical approach is to employ a modeling strategy such as ego-alter which can more flexibly learn class probabilities in a network-aware way. While we intentionally restrict models here to logistic regression with only main effects, more flexible functional forms can also be employed to better approximate class probabilities within subgroups.
Machine learning models are usually evaluated on observation-level performance metrics such as precision, recall, and area under the curve (AUC). When using predictions to estimate an aggregate such as homophily, strong observation-level performance is encouraging but not sufficient for high-quality estimates of the aggregate. An error-free model will by definition produce a perfect estimate of homophily, but even models with strong out of sample observation-level performance can make dyad-autocorrelated errors that bias homophily estimates.
This can be seen clearly in Figure 4, which plots model performance against bias in estimating homophily. (Only models which produce node-level category predictions can be evaluated this way, which necessitates excluding the edge model.) Models differ only slightly on traditional performance measures, yet produce large differences in homophily bias. The best model’s AUC is 0.8% better than the worst model’s, yet it has a bias reduction of 95% (worst: 17.6% bias; best: 0.8% bias).
Note that a meta-analysis of research on demographic classification on social media [Cesare et al., 2017a] found a median accuracy of 0.81 for predicting race/ethnicity, while simulations presented here have an average accuracy of around 0.77. This indicates that similar biases may be present with the type of classification performance found in real-world tasks.
When studying homophily in online communities, researchers can potentially improve the quality of estimates in five ways: including network information in models, using the ego-alter modeling strategy, improving model flexibility, sampling edges, and using cross-validation to check for the presence of network-residual correlation.
First and most importantly, network information should be incorporated into prediction models. Evidence from Sections 5 and 8.1 indicates the single largest improvement in model performance comes from including degree information (inverse ego degree) in models. The specific information to include is dependent on the estimand, and can even extend to behavioral information when outcomes such as political affiliation are studied in networks. We give an example of applying the process from Section 3 to new estimands in the Appendix (Sections 8.2 and 8.3).
Second, the simulation results consistently demonstrate that the ego-alter modeling strategy performs well both in terms of bias and absolute error. This is true even in the presence of a non-random ground truth sample. Since the ego-alter strategy is new, we recommend that researchers present results from both a node-level model and an ego-alter model, with network information included for both models.
Third, and closely related to the choice of modeling strategy is the choice of model itself: a logistic regression with only main effects in the presence of a non-random sample can produce large biases, as seen with the edge model and non-random edge sampling in the Appendix (Table 1). A more flexible model can mitigate this by better learning conditional class probabilities, although the performance will depend on having sufficient ground truth data.
Fourth, edges should be sampled instead of nodes when possible. A consistent finding of our simulations is that for the same labeling budget, a random edge sample outperforms a random node sample in terms of absolute error. In practical settings, such as Twitter, it is often much easier to randomly sample nodes than edges. One strategy for edge sampling is to use a result from the respondent driven sampling literature [Salganik and Heckathorn, 2004] that a random walk through an undirected network approximates an edge sample (see [Berry et al., 2018] for a discussion in the context of online networks). While this may not be feasible in some research settings, researchers may want to consider edge sampling if a random walk approach is possible.
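The random walk idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: the network is undirected and connected, links are stored in both directions, and practical concerns such as burn-in, thinning, and API rate limits are ignored. The function name and toy network are hypothetical.

```python
import random


def random_walk_edges(neighbors, n_steps, rng, start=None):
    """Record the dyad traversed at each step of a simple random walk."""
    current = start if start is not None else rng.choice(list(neighbors))
    sampled = []
    for _ in range(n_steps):
        nxt = rng.choice(neighbors[current])  # uniform choice among neighbors
        sampled.append((current, nxt))
        current = nxt
    return sampled


# Hypothetical toy network: undirected links stored in both directions.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
rng = random.Random(0)
walk = random_walk_edges(neighbors, n_steps=50, rng=rng)
print(walk[:5])
```

In the long run, a walk of this kind visits nodes in proportion to their degree and then picks a neighbor uniformly, so each directed dyad is traversed with approximately equal frequency.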
Finally, researchers can obtain an estimate of network residual correlation by using cross-validation (see the discussion in [Molina and Garip, 2019] for a brief introduction to cross validation; see Chapter 7 of [Hastie et al., 2008] for a more extensive discussion). Cross validation splits the training data into a number of folds (usually 5 or 10), and uses all but one fold to train a model, with the held-out fold used to evaluate the model. This proceeds in a round-robin fashion so that the entire training set is scored in a way approximating out-of-sample prediction. In the context of homophily estimation, estimating the residual term in Equation 4 can provide important information about network residual correlation. This can be accomplished in a cross validation setting by dividing up all dyads in the training set into folds, and performing cross validation on the ground truth dyads. If the held-out residuals, weighted by inverse ego degree and summed over the training set, are not close to zero, then models may need adjustment before providing reliable estimates of homophily. This strategy does not ensure unbiased homophily estimates, particularly in the presence of non-random ground truth sampling, but it does provide a useful diagnostic.
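A cross-validated version of this diagnostic might look like the sketch below. It is an illustration under assumptions, not the paper's procedure: the dyad labels are hypothetical synthetic draws, the model is a one-feature OLS on inverse ego degree fit in closed form, and the diagnostic is reported as the mean degree-weighted held-out residual.

```python
import random


def fit_line(x, y):
    """Closed-form OLS for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cv_weighted_residual(inv_deg, labels, k=5, seed=0):
    """Mean of (1 / d_i) * held-out residual over all ground truth dyads."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [t for t in idx if t not in held]
        b0, b1 = fit_line([inv_deg[t] for t in train], [labels[t] for t in train])
        for t in held_out:
            total += inv_deg[t] * (labels[t] - (b0 + b1 * inv_deg[t]))
    return total / len(labels)


rng = random.Random(0)
# Hypothetical ground truth dyads with ego degree and a binary dyad label.
ego_degree = [rng.randint(1, 20) for _ in range(500)]
inv_deg = [1.0 / d for d in ego_degree]
labels = [1.0 if rng.random() < 0.3 + 0.4 * w else 0.0 for w in inv_deg]

diagnostic = cv_weighted_residual(inv_deg, labels)
print(diagnostic)
```

A value far from zero relative to its run-to-run variability would suggest that model errors are related to degree or network position; how large is too large is a judgment call that this sketch does not settle.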
We have examined the problem of estimating homophily when predictions must be used for node attributes. While the problem is challenging, the results we present indicate that homophily can be studied in online networks when classification performance is strong and network information is incorporated directly into models.
The strategies outlined here also provide a pathway for the measurement of other network-level properties. Examples are triadic properties, such as social closure by demographic group. In studies of dynamic network processes such as contagion, models to reduce measurement error [Berry et al., 2019] may benefit from the results here. In the case of signed or multiplex networks, the distribution of different types of edges across groups may be important. Similarly to homophily estimation, consideration of how model errors intersect with graph properties is important for reliable use of predictions in networks.
In addition to the simulation presented in the main text, we examine four additional simulations. These simulations further demonstrate the strengths and weaknesses of the approaches considered in the main text. We describe these simulations by the data generating process for .
Independent: , where . In the independent simulation, node categories depend only on a node-level feature which is uncorrelated with the network.
Degree: , where and , standardized. In this simulation, node categories are a function of neighbor degree, meaning nodes with high-degree neighbors are more likely to have .
Main: the simulation described in Section 5 of the main text.
Unobserved: , where , and , standardized. is a standard normal unobserved variable which causes outcomes to be correlated in the network.
Sampled: This simulation samples nodes into the ground truth set according to degree and node features. is generated identically to the main simulation. For the edge simulation, dyads are sampled into the ground truth set by,
For the node simulation, nodes are sampled into the ground truth set by,
The variables are constants chosen to make the total number of ground truth nodes or dyads equivalent to the sampling fractions for the other simulations.
Tables 1 and 2 present all simulations by all models, including a “no model” category using just the ground truth observations for comparison. The best performing model for each simulation (each row) is bolded.
Researchers often care about actions in addition to node characteristics. For instance, what is the fraction of content seen broken down by gender of the content author? Addressing this question is important for examining visibility [Karimi et al., 2017] by gender online, and requires combining information about node gender and action.
In this case, Equation 2 is modified with a variable indicating some action of alter . Assume represents number of messages sent by , and indicates that is a woman.
In words, Equation 7 represents the average fraction of messages seen by members of group which come from members of group . When incorporating the additional variable into the equation, we can apply the same logic as Section 3 to obtain an unbiased estimate: incorporate into the predictive model, instead of alone.
We studied average egonet composition in the main text, but another popular measure of homophily is Coleman’s homophily index [Coleman, 1958]. This measure studies the fraction of within group links from the perspective of a certain group, relative to the proportion expected by chance.
The challenge is estimating the proportion of within-group links from the perspective of a given group . This can be done in a manner similar to Equation 2,
This turns out to be a simpler version of the egonet estimand considered in the main text, and can be addressed with similar modeling strategies.
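For reference, Coleman's index can be computed from the within-group link proportion as sketched below. This uses a common formulation of the index and takes the group's share of nodes as the chance baseline; both choices are assumptions on our part, as is the toy network, and the function name is hypothetical.

```python
def coleman_index(neighbors, group, a):
    """Within-group link share for group a, rescaled against a chance baseline."""
    within = 0
    total = 0
    for i, alters in neighbors.items():
        if group[i] != a:
            continue
        for j in alters:
            total += 1
            within += group[j] == a
    p = within / total  # proportion of group-a links that stay within the group
    w = sum(1 for i in group if group[i] == a) / len(group)  # chance baseline (assumed)
    if p >= w:
        return (p - w) / (1 - w)
    return (p - w) / w


# Hypothetical toy network: undirected links stored in both directions.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
group = {0: "a", 1: "a", 2: "b", 3: "a", 4: "b"}

index_a = coleman_index(neighbors, group, "a")
print(index_a)
```

Unlike the average egonet measure, the proportion here pools all links of the group, so high degree nodes contribute more to the result.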
We thank Thomas Davidson, Mario Molina, Pablo Barberá, Christopher Cameron, and the members of the Cornell Social Dynamics Lab for helpful comments and discussions.