A Correlated Network Scale-up Model: Finding the Connection Between Subpopulations

by   Ian Laga, et al.

Aggregated relational data (ARD), formed from "How many X's do you know?" questions, is a powerful tool for learning important network characters with incomplete network data. Compared to traditional survey methods, ARD is attractive as it does not require a sample from the target population and does not ask respondents to self-reveal their own status. This is helpful for studying hard-to-reach populations like female sex workers who may be hesitant to reveal their status. From December 2008 to February 2009, the Kiev International Institute of Sociology (KIIS) collected ARD from 10,866 respondents to estimate the size of HIV-related subpopulations in Ukraine. To analyze this data, we propose a new ARD model which incorporates respondent and subpopulation covariates in a regression framework and adds a correlation structure to the responses. The resulting size estimates of those most-at-risk of HIV infection can improve the HIV response efficiency in Ukraine. Additionally, the proposed model allows us to better understand two network features: 1. What characteristics affect who respondents know, and 2. How is knowing someone from one group related to knowing people from other groups. These features can allow researchers to better recruit marginalized individuals into the prevention and treatment programs.



There are no comments yet.


page 20


Thirty Years of The Network Scale up Method

Estimating the size of hard-to-reach populations is an important problem...

A sample size heuristic for network scale-up studies

The network scale-up method (NSUM) is a survey-based method for estimati...

Probabilistic HIV Recency Classification – A Logistic Regression without Labeled Individual Level Training Data

Accurate HIV incidence estimation based on individual recent infection s...

Consistently estimating graph statistics using Aggregated Relational Data

Aggregated Relational Data, known as ARD, capture information about a so...

Interpreting Models by Allowing to Ask

Questions convey information about the questioner, namely what one does ...

Can gender inequality be created without inter-group discrimination?

Understanding human societies requires knowing how they develop gender h...

Noisy Pooled PCR for Virus Testing

Fast testing can help mitigate the coronavirus disease 2019 (COVID-19) p...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Hard-to-reach populations are groups of people that are not easily sampled by commonly used surveys, potentially due to their stigmatized status (e.g. female sex workers) or their infeasibility to be reached (e.g. people who committed suicide). There is a long history of developing methods to estimate the sizes of hard-to-reach populations, such as direct survey estimates, capture-recapture, enumeration, and venue-based sampling, but still no method has emerged as the gold standard. UNAIDS/WHO outlined strengths and weaknesses of many of the methods (UNAIDS and WHO, 2010). Direct estimates typically use random surveys of the general population and calculate what percent of respondents belong to the hard-to-reach population, but are inefficient or lead to biased results for small and hard-to-reach populations. For reasonable sample sizes, many of these surveys don’t even reach the hard-to-reach populations, making it impossible to estimate the population size. Other methods require working with members of the hard-to-reach populations directly, which can lead to more accurate and precise estimates, but it is often difficult to contact populations directly that desire to stay hidden due to poor treatment and negative social stigma.

Originally motivated by estimating the size of people who have died in the Mexico earthquake (Bernard et al., 1989), the network scale-up method (NSUM) avoids the need for samples from the hard-to-reach population entirely, making it more convenient to implement and shining a light on the scale of impossible-to-reach population. NSUM uses aggregated relational data (ARD), which contains the answers to surveys with questions of the form “How many X’s do you know?” The ARD is collected from the general population, rather than from the target population.

After being introduced to UNAIDS as a promising method to estimate most-at-risk people for HIV infection, many countries/cities have attempted implementing ARD surveys in their general population. One of the largest surveys is the 2009 Ukraine survey, in which the Kiev International Institute of Sociology (KIIS) collected ARD from 10,866 respondents aged 14 and above from December 2008 to February 2009 to estimate the size of eight HIV-related subpopulations in Ukraine. The ultimate goal of the survey is to improve HIV response efficiency in Ukraine with the help of accurate size estimates. The authors relied on the NSUM to estimate these population sizes since existing methods like multiplier method and capture-recapture were too resource-intensive to obtain accurate estimates for all of Ukraine and would require studies in at least 60 settlements to obtain national size estimates (Paniotto et al., 2009).

Despite the attractive features of NSUM, many researchers found that the size estimates from NSUM differed greatly from other methods and that the uncertainty intervals were too small (Salganik et al., 2011). Without a gold standard, the reliability of this relatively new method is being doubted. How to assess the NSUM results becomes one of the biggest stumbling blocks of it being widely used.

The basic NSUM model assumes that the number of people that respondent knows in subpopulation , denoted by , follows


where is the degree (or total number of people that respondent knows), is the size of subpopulation , and is the total population size (Bernard et al., 1989, 1991)

. In words, this model assumes that the probability that a member of respondent

’s social network belongs to subpopulation is proportional to the prevalence of in the general population. could be estimated given ’s, ’s and .

Furthermore, is typically unknown and difficult to estimate directly, requiring the models to estimate both and , either sequentially or simultaneously. The estimation of relies on survey questions about “known populations” (e.g. people named John or postal workers, where the population sizes are known through census or other means), while the estimation of the unknown relies on survey questions about “unknown populations” (e.g. female sex workers), where the population size is unknown.

Early frequentist models provided a solid foundation for quickly and easily estimating degrees and subpopulation sizes from ARD surveys (Killworth et al. (1998a), Killworth et al. (1998b)). Recent Bayesian models have improved size estimates and answered important scientific questions about social networks. Zheng et al. (2006) included additional overdispersion in the model through a negative binomial overdispersion parameter, both better capturing the variability in the data than the existing methods and providing an estimate of the variation in respondents’ propensities to know someone in each group. For example, the authors found that localized groups like American Indians and homeless persons had the highest overdispersion estimates, indicating that people are likely to either know no members of those populations or know far more members than the average person. Later, Maltiel et al. (2015) aimed to model the NSUM biases (barrier effects, transmission error, and recall error) directly through the priors, again further improving size estimates and estimating the strength of each bias within the subpopulations. Most recently, Teo et al. (2019) included respondent covariates both about the respondent (e.g. age, gender) and how the respondent felt about each unknown subpopulation (e.g. what level-of-respect do you feel towards female sex workers) to adjust size estimates and study how these covariates influenced the number of people the respondents knew in each subpopulation. One finding of their study in Singapore was that younger respondent knew more male clients of female sex workers, men who have sex with men, and drug users than older respondents, and no female respondents reported knowing any female sex workers. However, the authors needed to fit two separate models, one for male respondents and one for female respondents, to account for the behavioral differences. More importantly, the model ignored the extra variability in the data and it resulted in small uncertainty intervals, similar to the Killworth et al. (1998b) estimates. We refer readers to Laga et al. (2021) for a more complete review of the existing NSUM models and ARD properties.

Until now, all models assume that after controlling for the overdispersion, biases, and covariates, the responses for a single participant is independent across all subpopulations. However, we conjecture that this is not the case. Zheng et al. (2006) observed that the residuals from their model were correlated, and respondents who knew individuals who had suffered from one negative experience (e.g. suicide or rape) were more likely to know individuals who suffered from other negative experiences. We aim to properly model the correlation structure to further improve NSUM estimates and answer the sociological question of how different subpopulations are related.

In the Ukraine survey, information about the respondents’ demographics and their acquaintance to multiple known and unknown subpopulations is collected. To better utilize all of the information and learn more about the connections among subpopulations, we propose a new ARD model that accounts for overdispersion, decomposes the biases, and incorporates respondent characteristics, while also capturing the correlations between subpopulations. Our regression framework allows for more flexibility and ease-of-use than the existing approaches and provides quantitative measures of how covariates influence the number of people known in both known and unknown groups. The correlation estimates from our model provide insight into how social networks form and can hint at how different subpopulations overlap in society. In addition, we propose various measures to assess the reliability of the model estimates.

This paper is organized as follows. First, Section 2 describes the Ukraine dataset. We introduce our proposed NSUM models in Section 3. The benefits and limitations of the models are discussed and our modeling choices are explained. We also show how the overall bias term in our model can be deconstructed into the three NSUM biases. In Section 4, we review existing ARD diagnostics and introduce several diagnostics that are common in network and other literature but new to the NSUM literature. We fit our proposed model to the Ukraine data set in Section 5. We establish empirical properties of our model in Section 6 and discuss practical advice for future collection and analysis of ARD in Section 7. Final remarks and discussion are found in Section 8.

2 Ukraine Data

Of the Eastern European countries, Ukraine has the second highest rate of new HIV infections in the WHO European Region, motivating the study of key populations (European Centre for Disease Prevention and Control/WHO Regional Office for Europe, 2017; Paniotto et al., 2009). From December 2008 to February 2009, the Kiev International Institute of Sociology interviewed 10,866 respondents aged 14 and above, asking “How many X’s do you know?” questions about 22 known groups and 8 unknown groups (Paniotto et al., 2009). The 8 unknown groups in the ARD survey include people injecting drugs over the last 12 months (IDUs), women providing sexual services for payment over the last 12 months (FSW), men providing sexual services for payment over the last 12 months (MSW), men who have sex with men (MSM), people living with HIV, IDUs sexual partners, FSW clients, and MSMs’ women sexual partners. We consider 4 of the 8 unknown subpopulations (FSW, MSW, MSM, and IDUs) since these subpopulations belong to the World Health Organization’s list of main key population groups vulnerable to HIV (World Health Organization and others, 2016). Examples of the known groups include men aged 20-30, women who gave birth to a child in 2007, and men who served sentences in places of imprisonment in 2007. Supplementary Table 1 lists the known and unknown groups and the sizes of the known groups.

In addition to the ARD (), respondents were also asked demographic questions about their gender, age, education, nationality, profession, and whether they had access to the internet (), as well as “what level of respect” the respondent believed there was in Ukraine for each group on a 1-5 Likert scale, where 1 represents very low level of respect (). After removing respondents with missing responses, the remaining sample has 8,451 respondents, which is 77.77% of the original data set. Furthermore, based on the accuracy of leave-one-out size estimates for the known groups, Paniotto et al. (2009) recommended keeping only 11 of the 22 known groups, so for our analysis, and .

There are significant differences between the distributions of responses across subpopulations. Figure 1 shows frequency barplots for three subpopulations, women aged 70+, kids, and IDUs. For women aged 70+ and kids, the average responses are very similar (9.99 and 10.17, respectively). However, compared to women aged 70+, respondents tend to know either 0 kids or many kids – there were nearly 1.7 times as many respondents who knew 0 kids than respondents who knew 0 women aged 70+. In this situation, the distribution of kids known by the respondents is more overdispersed than the distribution of women aged 70+ known by the respondents. For the hard-to-reach populations, the overdispersion is even more significant. Most respondents know 0 IDUs (92.6%), while some respondents report knowing 40, 50, 60, and even 130 IDUs. The distribution of responses indicates that there are likely significant barrier effects for certain populations like IDUs that violate the random mixing assumption.

Figure 1: Barplot of the observed Ukraine ARD for women aged 70+ (a), kids (b), and injection drug users (c). For visualization purposes only, the barplot for 30 represents responses greater than and equal to 30.

3 Models

In this section, we introduce our correlated NSUM model, discuss its properties, and describe the computation techniques used to estimate the model parameters. We first introduce the ARD notation used in the remainder of the manuscript. Given an ARD survey with respondents about subpopulations, the number of people that respondent reports knowing in group is . For ease of notation, we will assume that there are known subpopulations and only one unknown subpopulation, and they are ordered such that the subpopulation is unknown (although in practice, there may be any number of unknown subpopulations). Then, the ARD is an matrix, where is the number of respondents in the survey, and is known while is unknown. The respondent demographics (e.g. age, gender, occupation, etc.) are denoted by , where is the information about respondent for variable , . In some surveys, respondents are also asked how they feel about members in group using questions of the form “what is your level of respect towards group ?” The exact phrasing of the question can vary, but the answers are denoted with entries for respondent and group . The key distinction between and is that is a respondent-only feature while contains information about the interaction between the respondents and the subpopulations. All columns of and are centered to have mean zero.

3.1 The Correlated NSUM Model

Our correlated NSUM model is written as


where after scaling, represents the degree of respondent , represents the prevalence of group , and correct for the demographic and population-specific information, and represents the bias between respondent and group . We complete the formulation with the priors,


This parameterization is such that , a property shared by the Gamma prior in the Zheng et al. (2006) overdispersed model. The normal prior on comes from the evidence by Maltiel et al. (2015) that the distribution of degrees seems to follow a lognormal distribution. The half-Cauchy priors on and are recommended by Gelman (2006) to restrict the parameters away from very large values. Note that and are not sampled, and are only transformations of the sampled parameters . We separate the biases into the different terms (barrier, tranmission, and recall) after all parameters have been estimated, and these details are shown in Supplementary Section 1.

3.2 Computation

Parameters are estimated using Markov Chain Monte Carlo (MCMC). As presented above, the MCMC implementations would have trouble producing unbiased results without prohibitively long sampling chains. The hierarchical form of the model for the bias terms suffers from very inefficient MCMC sampling. Specifically, when the

are all close to one another, will shrink towards 0. Then, since is small, can only take very small MCMC steps, keeping both close to one another and close to 0. To break this dependence between and , the model can be reparameterized through parameter expansion

, which allows the size of the residuals to move independently of the variance parameters

(Van Dyk and Meng, 2001; Liu, 2003; Gelman, 2004; Gelman et al., 2008). Expanding our model, we reparameterize the bias as


where represents the lower-triangular Cholesky decomposition of the correlation matrix and . By sampling instead of the directly, the values of can comfortably jump around even when is small because of the additional error terms .

4 Diagnostics

Diagnostics to evaluate how well the assumed model fits the data have been largely underdeveloped for models involving ARD. Leave-one-out (LOO) subpopulation size estimates have been considered in almost all NSUM literature. In addition to LOO, Zheng et al. (2006) used standardized residuals from their model (defined as ) to study the residual correlation between subpopulations and the relationship between the residuals and the individual-level predictors. McCormick and Zheng (2012) and Maltiel et al. (2015) also used absolute relative error of the estimates to evaluate the performance of models under different assumptions, although this line of thinking has limited applicability to real data because the true is unknown for the hard-to-reach populations. In this section, we review existing informative diagnostics and propose several new diagnostics to evaluate the goodness of fit. We apply the diagnostics to the real data analysis in Section 5

4.1 Leave-one-out

A leave-one-out procedure can be implemented to evaluate the subpopulation size estimates. For models where the known subpopulation sizes are fixed in the estimation procedure (e.g. the Killworth MLE estimator), the known sizes are assumed to be unknown one at a time and estimated using the remaining known subpopulations. Thus, the model needs to be refit for each known subpopulation. For models in which all subpopulation sizes are estimands (e.g. the overdispersed model and our correlated model), the LOO procedure does not require re-estimating the parameters. Instead, since the subpopulation sizes are scaled using known sizes, the known size of the left out subpopulation is excluded when scaling the size estimates. The estimates are then compared to the truth, either visually or using some metric.

4.2 Posterior predictive checks

Posterior predictive checks are a staple of Bayesian model diagnostics, but have remained largely unexplored in models for ARD. Zheng et al. (2006) looked at posterior predictive checks to evaluate the performance of the model in accurately estimating the correct proportions of responses for equal to each non-negative integer. They concluded their model underestimated the proportion of respondents who know exactly one person in each group of the McCarty data set, and that data was “heaped” around nice numbers, like 10. We develop the use of posterior predictive checks and present here several more informative checks for ARD models and recommend their use in evaluating model fit for future studies.

Probability mass function: First, we recommend replicating the probability mass function diagnostic from Zheng et al. (2006). We especially recommend looking at and , as these values don’t typically suffer from rounding, but can inform the user about unexpected behavior in the data (e.g. fewer people know 0 doctors than expected from the number of doctors because only respondents who don’t have a family doctor will report knowing 0 doctors).

Conditionally-positive mean and standard deviation

: Second, we recommend plotting the mean of the positive responses against the standard deviation of the positive responses, grouped by each subpopulation (i.e. the columns of

). It is natural to look at the mean and standard deviation of the responses for each group, but posterior predictive checks are most informative when they capture features of the data not directly addressed by the model. Thus, we instead choose the mean and standard deviation conditional on the data or the posterior replicate being positive since NSUM models do not estimate these parameters directly. The mean and standard deviation for subpopulation are calculated by

where is the number of positive responses in group , i.e. .

Correlation: Posterior predictive checks can also be used to measure the ability to capture the correlation between a respondent’s responses across subpopulations. For each posterior sample, the sample correlation is calculated, producing pairwise correlations between each subpopulation. The distribution of pairwise correlations across the posterior samples are then compared to the pairwise correlations of the original Ukraine data. For two given subpopulations and , the pairwise correlation is calculated by

4.3 Surrogate residuals

Given their broader use, we consider surrogate residuals separately. Residuals for discrete data often have limited use because of the unusual visual appearance of the residuals and the lack of any theoretical distribution for the residuals under the correctly specified model. Liu and Zhang (2018) introduced the surrogate residuals, which solve these problems by generating continuous residuals that follow a known distribution under the null distribution, allowing the user to interpret the residuals in a familiar way. For ARD models, surrogate residuals are most useful because they can be used to detect anomalies between the observed data and the assumed distribution, inform variable selection, and diagnose the relationship between the response and the covariates. While our model is not in a class of cumulative link regression models as was the primary focus of Liu and Zhang (2018), the authors provided two methods to calculate surrogate residuals for general models. Specifically, the authors showed under their jittering strategy (B) that given an outcome , jittered surrogate variables , and residuals , the residuals have properties and

under the null hypothesis


Calculating the surrogate residuals in our setting is not obvious. Liu and Zhang (2018) proposed the surrogate residuals under a frequentist paradigm, while our model is inherently Bayesian. To the best of our knowledge, the surrogate residuals have only been applied to one Bayesian model (Park et al., 2021). In this case the model was ordinal, and the authors provided little detail on how to calculate the residuals. Thus, we introduce our approach for calculating surrogate residuals under the Bayesian paradigm. While the formulation and theorem are stated simply, actually calculating the residuals is difficult since is typically analytically intractable. However, it is straightforward to estimate using Monte Carlo approximations. In Algorithm 1 we outline our approach for calculating the residuals , which relies on samples from the posterior predictive distribution. Note that the algorithm is given for a single and the process should be repeated for all combinations of and .

Result: Surrogate residuals
Set equal to the number of posterior samples
for each ind in 1: do
       Calculate ;
       Calculate ;
       Simulate ;
       Calculate ;
       Calculate ;
       Simulate ;
       Simulate ;
end for
Calculate ;
Calculate ;
Algorithm 1 Bayesian Surrogate Residuals

In our case, (

) is the CDF of the Poisson distribution evaluated at the simulated responses (observed responses). Note that under the Bayesian paradigm, for each observation

, there is actually a vector of surrogate residuals

. Thus, we can study the posterior distribution of the surrogate residuals, and we recommend looking at the behavior of the surrogate residuals across multiple posterior samples. Alternatively, we could use a point estimate for (e.g. posterior mean) to calculate a single vector of surrogate residuals. This approach is more similar to the original frequentist approach, but doesn’t capture the uncertainty of the parameter estimates.

5 Ukraine ARD Results

We fit our correlated NSUM model to the Ukraine data and compare our model to the Zheng et al. (2006) overdispersed model. We used RStan to implement both models (Stan Development Team, 2019). Only one group was name-related, so we choose to include all known groups in the scaling term. The R code for the Maltiel et al. (2015) model provided in the Maltiel and Baraff (2015) R package and the JAGS code for the Teo et al. (2019) model were prohibitively slow to model the large Ukraine ARD, so results from these two models were not reported here.

5.1 Model performance


: First, we show the accuracy and validity of our subpopulation size estimates using LOO. The estimates with corresponding 95% credible intervals are shown in Figure

2. Overall, our correlated model is typically closer to the true sizes, and among 7 cases that our point estimates are outside the intervals of the overdispersed model, 5 of them are closer to the truth. The correlated model produces larger subpopulation size credible intervals intervals, consistent with the increased complexity of the model. This is desirable since NSUM models have historically underestimated the variability in the data, placing too much trust in the resulting size estimates. Additionally, both models produce smaller credible intervals for the larger subpopulations where the responses contain more information.

Many authors have noted that models typically overestimate large subpopulations and underestimate small subpopulations (Zheng et al., 2006; McCormick et al., 2007; Maltiel et al., 2015). For the correlated and overdispersed models with the Ukraine data set, this does not seem to be the case.

Figure 2: Subpopulation leave-one-out estimates. The subpopulations are ordered from largest to smallest size, and the relative error represents . The overdispersed model is shown with a dotted line and our correlated model with a solid line.

Conditionally-positive mean and standard deviation

: Next, we consider the joint distribution of positive mean and positive standard deviation for in-sample data. The results are shown in Supplementary Figure 1. First, the overdispersed model typically has a larger uncertainty. While the correlated model does not perfectly capture the conditional mean and standard deviations for all subpopulations, it displays a significant improvement over the overdispersed model. The overdispersed model typically captures the mean of the data but underestimates the standard deviations, while the correlated model sometimes underestimates the mean but captures the standard deviation).

Probability mass function: For the probability mass function, we consider 9 different probabilities in Supplementary Figure 2 where the true proportions are plotted against the simulated proportions with 95% credible intervals included. We denote the overdispersed model using empty shapes, the correlated model using filled shapes, the known subpopulations as circles, and the unknown subpopulations as triangles. We find that for , both models perform similarly across all subpopulations. The correlated model consistently estimates fewer zeros than observed while the overdispersed model estimates more zeros than observed in the data set. We also observe the heaping pointed out by Zheng et al. (2006), especially for and . Perhaps surprisingly, is estimated fairly accurately for our correlated model while is estimated poorly for both models. Our correlated model also does a better job of capturing larger responses, as indicated by the plot for . Overall, the correlated model again outperforms the overdispersed model.

Correlation: While we do include correlated random effects in our model, we do not explicitly model the correlation of the responses. Thus, we still consider the posterior predictive p-values of the correlation of the responses, as shown in Supplementary Figure 3 with the p-values in the left plot corresponding to the overdispersed model and in the right plot corresponding to the correlated model. Under the true model, we typically expect the posterior predictive p-values to be be concentrated near 0.5. It is clear from the clustering of the overdispersed p-values around 0 and 1 that the model does not produce responses that share the same correlated as the observed data. One the other hand, the clustering around 0.5 for our correlated model indicates that the correlated random effects are accurately capturing the correlation between subpopulations in the data set.

Surrogate residuals

: Finally, we review some results from the surrogate residuals. While the surrogate residuals can be used in many ways, in this section we show their ability to detect missing covariates, as is common in ordinary linear regression. Figure

3 shows one realization of the surrogate residuals corresponding to a single randomly chosen posterior sample against the standardized age covariate for the subpopulations men aged 20-30 and IDUs. Since there is a vector of surrogate residuals for each posterior sample, to visualize the variability in the residuals across these samples, loess curves for 25 random samples were are added for reference. In general, there is very little variability across the posterior samples, except in areas with very few observations (respondents aged older than 80). For men aged 20-30, the overdispersed model residuals are highly correlated with age, resulting in a quadratic loess curve. On the other hand, the residuals for the correlated model are mostly uncorrelated with age, indicating that the regression structure in the correlated model does accurately capture the relationship between age and the responses. There is no indication that higher order terms are needed.

For IDUs, we see that the relationship between the residuals and age is much weaker, and even the loess curve for the overdispersed model is relatively flat with a slight downward trend. For the correlated model the loess curve is almost absolutely flat. In this case, age seems to play a larger role in the responses for men aged 20-30 than it does for IDUs, which is expected. Not shown here is a QQ plot of the surrogate residuals against the uniform distribution, where both the overdispersed and correlated models perform well, although the correlated model produces slightly better QQ plots.

Figure 3: Surrogate residuals from the overdispersed model for subpopulations men aged 20-30 (a) and IDU (c) and from the correlated model for subpopulations men aged 20-30 (b) and IDU (d). Loess smoothing curves are plotted over residuals for reference.

5.2 Results

Our model performs well with respect to all considered diagnostics, showing the reliability of our model. We first show the results of the size estimates for the hidden populations and corresponding 95% credible intervals in Table 1. We notice that the size estimates from our correlated model are considerably larger than the estimates from the overdispersed model, but are very close to the size estimates from the adjusted estimates from Paniotto et al. (2009) which weighted observations based on the level of respect responses. While there is no gold standard, we believe the estimates from our correlated model is much closer to the truth based on the inclusion of additional information and the performance of the models in the previous diagnostics.

The correlated model credible intervals are also much wider than the other two models. This is a desired outcome, since simpler models often lead organizations to put too much confidence in size estimates, while the uncertainty around hard-to-reach population sizes is typically very large.

Subpopulation Overdispersed Correlated Ukraine Adjusted
(23,100 - 31,700)
(43,300 - 129,000)
(65,000 - 93,000)
(1,340 - 2,820)
(1,450 - 10,300)
(2,800 - 5,200)
(4,530 - 7,680)
(4,910 - 28,900)
(10,000 - 17,000)
(63,600 - 73,000)
(212,000 - 573,000)
(285,000 - 389,000)
Table 1: Estimated unknown subpopulation sizes and 95% credible intervals. Values are rounded to the nearest 100.

Next, we consider the covariate parameters estimates, and . Table 2 includes the group parameters corresponding to age, age, and level of respect. Overall, the parameter estimates are consistent with the expected results. For example, older respondents are more likely to know people 70 or older and younger people are more likely to know kids and people aged 20 to 30. Similarly, older respondents know more people who died in 2007. For the hard-to-reach populations, younger respondents are more likely to know people in all unknown groups.

Subpopulation Age Age Level of Respect
Men 20-30 -0.38* -0.19* 0.05*
Men 70+ 0.22* -0.05* 0.04*
Female 20-30 -0.38* -0.16* 0.08*
Female 70+ 0.25* -0.01 0.05*
Kids -0.31* 0.00 0.12*
Invalids 0.28* -0.08* -0.03
Died 2007 0.47* -0.06* NA
Pavlo -0.18* 0.00 0.18*
Prisoners 2007 -0.27* -0.26* 0.21*
Divorced Men 2007 -0.24* -0.34* -0.02
Birth 2007 -0.17* -0.12* 0.16*
FSW -0.74* -0.07 0.26*
MSW -0.86* 0.12 0.02
MSM -1.05* -0.28 0.62*
IDU -0.60* -0.31* 0.09
Table 2: Table of group parameter estimates for the correlated NSUM model. Significance at is denoted by *

Regarding level of respect, all significant parameters are positive, consistent with the belief that respondents with a more positive perception of a subpopulation will tend to know more people from the subpopulation. The two largest significant parameters for known groups are for prisoners and kids, which are perhaps more likely to have significant barrier effects. The parameters for FSW and MSM are the largest across all groups, which is also unsurprising.

We also report the estimates for the non-group level parameters (gender, education, nationality, profession, and access to internet). The parameter estimates corresponding to male was less than 0.01, 0.20 for Ukraine, 0.13 for employed, 0.09 for access to internet, 0.17 for secondary education, and 0.19 for vocational education, with a baseline of candidate of sciences or doctor of sciences. All parameters except for male were significant at . While previous studies have found that men have larger network sizes than women, for the populations in this study, there is not a significant difference between the number of people reported between the gender of the respondents. The other global covariates are consistent with previous findings and are expected. The network literature has shown that employed and educated individuals typically have larger network sizes. While we are not aware of literature that studies how access to internet affects network sizes, our finding is not surprising since access to internet typically means the individual can reach a broader range of contacts, for example through email.

Finally, we can also look at the estimated correlations, the novel feature of our model. The correlation matrix is shown in Figure 4

and is sorted using a hierarchical clustering algorithm. Our model produced many expected correlations, e.g. respondents who knew more men aged 70+ also knew more women aged 70+. For example, respondents who know more men and women aged 20 to 30 are also less likely to know people who died in 2007 than those respondents who know more men and women aged 70+. The results also highlight some interesting relationships that are less obvious. First, we find that respondents who know more men named Pavlo are less likely to know young men, and even less likely to know young women. Without census information about the birth records of men named Pavlo in Ukraine, we can guess that Pavlo is more common among older men in Ukraine.

Consistent with the residual analysis in Zheng et al. (2006) for the McCarty et al. (2001) data that found residuals for groups associated with negative events were highly correlated, the key populations of the Ukraine survey were all highly correlated with one another, prisoners, and to a lesser extent divorced men. Also consistent with the findings in Teo et al. (2019) that younger participants knew more IDU and MSM than older participants, even after accounting for age in our model, we still find that men and women aged 20 to 30 is more positively correlated with IDU and MSM than men and women aged 70+, which are negatively correlated with the hard-to-reach populations.

It is important to note that our correlation estimates correspond to the correlation of the reported number of connections. Therefore, some of the correlation between the hard-to-reach populations may be an indication of a respondent’s willingness to answer questions truthfully. That is, if respondents are unwilling to truthfully divulge how many FSW they know, then they may also be unwilling to answer honestly about IDU, MSW, and MSM, leading to two groups of people: those who are willing to report knowing members of hard-to-reach populations and those not willing, potentially increasing the observed correlation.

Figure 4: Estimated correlation matrix for the Ukraine data, arranged by a hierarchical clustering algorithm.

6 Model Misspecification

It is important to understand how violations of the correlated model affect parameter estimates, specifically the estimates. In practice, the number of respondents in ARD surveys varies widely from around 200 to over 10,000. Thus, the performance of the model under several different sample sizes is studied. The model is studied under both simulation for small and large sample sizes (below) and under asymptotic theory using numerical approaches (Supplementary Section 4). In this simulation study, we generate data from 5 models of varying degrees of misspecification. These allow us to quantify the effect of each model assumption, i.e. each set of covariates and the correlation structure.

For our simulations, we created data sets from the correlated model in Equation (2) using the covariates from the Ukraine data set with parameters estimated from the Ukraine results. Specifically, , ,

, and all hyperparameters were fixed at the posterior means of the real data analysis MCMC chains. For each simulated data set, new network sizes

and biases were sampled from their prior distributions with the estimated hyperparameters. The covariates and were sampled from the observed Ukraine covariates with replacement. We performed the simulation at five different respondent sample sizes, 100, 300, 1000, 3000, and 10,000.

We fit five different models, summarized by Table 3, where Model 5 represents the data generating model. The results are shown in Figure 5 for four known groups and the four unknown groups. In all cases, Model 5 (the true model) converges to the true value as the sample size increases. For Model 4, the effect that missing the covariate has on the estimate is directly proportional to how large is. For men aged 20-30, is large and thus affects the parameter estimate more than for the unknown subpopulations where is small. The same trend is true for Model 3 and . In all cases, there is a much weaker relationship between the responses and than the responses and , suggesting that for the Ukraine data set, excluding does not have a large effect on the subpopulation size estimates. It is thus important to phrase the level of respect questions well so that contains more information, a point we expand on in Section 7.

For Model 2, assuming no correlation in the presence of correlation also biases the results. The difference between Model 2 and Model 5 is most clear in the larger and highly correlated populations, e.g. men aged 20-30 and women aged 20-30. In both cases, assuming no correlation leads to underestimating the prevalence. The bias from the missing correlation structure seems to diminish, but not disappear entirely, as the number of respondents increases to 10,000.

X? Z? Correlated?
Model 1 No No No
Model 2 Yes Yes No
Model 3 Yes No Yes
Model 4 No Yes Yes
Model 5 Yes Yes Yes
Table 3: Summary of models fit in simulation study

Figure 5: Estimated values under different model assumptions.

7 Practical Advice

In this section, we offer guidance on how to better collect and analyze ARD.

Level of respect: The first suggestion is to phrase “level of respect” questions to maximize the correlation between the level of respect and the ARD. In Ukraine, the level of respect question was phrased as, “What level of respect do following groups have in Ukraine…”. The phrasing focuses on how people in Ukraine feel about the groups rather than how the respondent feels, leading to a weaker relationship between the level of respect and ARD responses. However, in Teo et al. (2019), their measure of respondent social acceptability rating of the hard-to-reach populations results in a closer connection between the ARD and the level of respect. For the three shared hard-to-reach groups (FSW, MSM, and IDU), the pairwise correlations in Ukraine between the number of people the respondent reports knowing and their level of respect for the three groups are 0.038, 0.031, and 0.020, respectively. On the other hand, the pairwise correlations in Singapore are 0.040, 0.129, and 0.131, respectively. Thus while level of respect does not seem to be correlated strongly to the number of FSW respondents know in either sample, the level of respect does seem to be a relatively strong predictor for MSM and IDU in the Singapore data. While there may not always be a connection between ARD and level of respect questions, it is important to ensure that questions are phrased in order to detect as much correlation as possible.

Similarly, it is important to collect the level of respect questions for all subpopulations. While the Teo et al. (2019) level of respect responses are more correlated with the ARD than in the Ukraine dataset, the authors only included information about the hard-to-reach populations. This results in a loss of information. In our analysis, we were able to account for the level of respect for all subpopulations, further improving the results.

Covariates of respondents: Second, it is important to ask respondents about themselves. We have shown through both the simulation study and the numerical analysis that ignoring important covariates can bias subpopulation size estimates. It is impossible to account for the covariate effect for ARD surveys that do not collect such information. Additionally, it is difficult to tell prior to collection which covariates are important, so survey questions should capture multiple respondent properties, including age, social behavior, and wealth.

Inclusion of similar subpopulations: Finally, we recommend including more known groups which face similar stigma to hard-to-reach populations. Zheng et al. (2006) found that other populations associated with negative experiences (e.g. prisoners, homicide victims, rape victims, people who have committed suicide, and people who were in auto accidents) were also highly correlated. While including correlated groups may improve the size estimates in our correlated model, understanding how connected different subpopulations are is also important. If groups are identified as being highly correlated with hard-to-reach populations, future researchers can better understand the social networks of members of hard-to-reach populations, making future survey sampling easier and more efficient.

8 Discussion

Aggregated relational data (ARD) is an extremely useful tool to not only population size estimation, but also to learn about properties of social networks. Many models have been developed to better capture the behavior of the data and account for the different sources of bias. One major limitation of the models is that uncertainty estimates are far too small, so researchers are too confident in their estimates. We improve upon these models by incorporating covariate and subpopulation coefficients in an easy to understand manner, addressing the empirical correlation between subpopulations, and providing a better fit and more reasonable uncertainty estimates for model parameters. Another benefit of our model is that we make very few assumptions about the biases in the model, allowing the data to drive the parameter estimates. The proposed various model diagnostics are also useful in other general settings when there is not a ground-truth to compare with. Satisfactory diagnostic results will increase the credibility of the new NSUM estimates and make them more acceptable by decision makers.

From this Ukraine ARD, we have identified the relationship between several predictors and the number of people known by respondents in both known and unknown groups, as well as produced reliable size intervals and the associated uncertainty intervals of the hard-to-reach populations associated with HIV prevalence. Our estimates align with the adjusted estimates of Paniotto et al. (2009) and can be used to inform government HIV prevention policy. Of the four key populations we considered, we estimated that there are nearly four times as many injection drug users as there are FSW, MSW, and MSM. This is consistent with other studies that estimated IDUs and their sexual partners make up 64% of people living with HIV in Ukraine (Des Jarlais et al., 2009). Combined with estimates of HIV prevalence among key populations, our size estimates illustrate the severity of the HIV epidemic in Ukraine.

Our analysis hints at, but does not explicitly model, the increased risk in individuals that belong to more than one key population. Our correlated model has shown that respondents who report knowing more people in one hidden population are more likely to know people in the other hidden populations. World Health Organization and others (2011) estimated that the HIV prevalence among female sex workers who inject drugs is around 43%, while only around 8.5% in female sex workers who do not inject drugs. Based on these relationships, it is clear that is it not sufficient to understand only the behavior of these key populations as a whole, but it is also necessary better understand the relationship between populations in order to effectively lower new HIV infections.

We believe that future ARD models should head in two directions. The first direction is to better exploit the relationship between populations, as illustrated by both our estimated correlation matrix and the covariate effects. It is clear that some individuals are closer to the key populations than others, either because of their age, gender, and similar characteristics, or because of their existing social networks. We would be able to better understand the properties and behavior of the key populations if we were able to survey respondents who were more familiar with the populations of interest.

The second direction is to connect ARD to more traditional network analysis, and a few researchers have already started exploring this area. Breza et al. (2017) showed when ARD is sufficient to estimate node- or graph-level statistics, removing the need for expensive complete network data. Breza et al. (2019) extended the theory and provided justification for why ARD can be used in place of complete networks. If future research can use ARD instead of complete networks, the cost of the studies will be greatly reduced, providing additional financial capital for other studies.


  • H. R. Bernard, E. C. Johnsen, P. D. Killworth, and S. Robinson (1991) Estimating the size of an average personal network and of an event subpopulation: some empirical results. Social Science Research 20 (2), pp. 109–121. Cited by: §1.
  • H. R. Bernard, E. C. Johnsen, P. D. Killworth, and S. Robinson (1989) Estimating the size of an average personal network and of an event subpopulation. In The Small World, pp. 159–175. Cited by: §1, §1.
  • E. Breza, A. G. Chandrasekhar, T. H. McCormick, and M. Pan (2017) Using aggregated relational data to feasibly identify network structure without network data. Technical report National Bureau of Economic Research. Cited by: §8.
  • E. Breza, A. G. Chandrasekhar, T. H. McCormick, and M. Pan (2019) Consistently estimating graph statistics using aggregated relational data. arXiv preprint arXiv:1908.09881. Cited by: §8.
  • D. C. Des Jarlais, K. Arasteh, S. Semaan, and E. Wood (2009) HIV among injecting drug users: current epidemiology, biologic markers, respondent-driven sampling, and supervised-injection facilities. Current Opinion in HIV and AIDS 4 (4), pp. 308. Cited by: §8.
  • European Centre for Disease Prevention and Control/WHO Regional Office for Europe (2017) HIV/AIDS surveillance in Europe 2017 – 2016 data. ECDC. Cited by: §2.
  • A. Gelman, D. A. Van Dyk, Z. Huang, and J. W. Boscardin (2008) Using redundant parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics 17 (1), pp. 95–122. Cited by: §3.2.
  • A. Gelman (2004) Parameterization and Bayesian modeling. Journal of the American Statistical Association 99 (466), pp. 537–545. Cited by: §3.2.
  • A. Gelman (2006) Prior distributions for variance parameters in hierarchical models (comment on article by browne and draper). Bayesian Analysis 1 (3), pp. 515–534. Cited by: §3.1.
  • P. D. Killworth, E. C. Johnsen, C. McCarty, G. A. Shelley, and H. R. Bernard (1998a) A social network approach to estimating seroprevalence in the united states. Social Networks 20 (1), pp. 23–50. Cited by: §1.
  • P. D. Killworth, C. McCarty, H. R. Bernard, G. A. Shelley, and E. C. Johnsen (1998b) Estimation of seroprevalence, rape, and homelessness in the united states using a social network approach. Evaluation Review 22 (2), pp. 289–308. Cited by: §1.
  • I. Laga, L. Bao, and X. Niu (2021) Thirty years of the network scale-up method. Journal of the American Statistical Association 116 (535), pp. 1548–1559. Cited by: §1.
  • C. Liu (2003) Alternating subspace-spanning resampling to accelerate markov chain monte carlo simulation. Journal of the American Statistical Association 98 (461), pp. 110–117. Cited by: §3.2.
  • D. Liu and H. Zhang (2018) Residuals and diagnostics for ordinal regression models: a surrogate approach. Journal of the American Statistical Association 113 (522), pp. 845–854. Cited by: §4.3, §4.3.
  • R. Maltiel and A. J. Baraff (2015) NSUM: network scale up method. Note: R package version 1.0 External Links: Link Cited by: §5.
  • R. Maltiel, A. E. Raftery, T. H. McCormick, and A. J. Baraff (2015) Estimating population size using the network scale up method. The Annals of Applied statistics 9 (3), pp. 1247. Cited by: §1, §3.1, §4, §5.1, §5.
  • C. McCarty, P. D. Killworth, H. R. Bernard, E. C. Johnsen, and G. A. Shelley (2001) Comparing two methods for estimating network size. Human Organization 60 (1), pp. 28–39. Cited by: §5.2.
  • T. H. McCormick, T. Zheng, et al. (2007) Adjusting for recall bias in “How many X’s do you know?” surveys. In Proceedings of the Joint Statistical Meetings, Cited by: §5.1.
  • T. H. McCormick and T. Zheng (2012) Latent demographic profile estimation in hard-to-reach groups. The Annals of Applied Statistics 6 (4), pp. 1795. Cited by: §4.
  • V. Paniotto, T. Petrenko, O. Kupriyanov, and O. Pakhok (2009) Estimating the size of populations with high risk for HIV using the network scale-up method. Ukraine: Kiev International Institute of Sociology. Cited by: §1, §2, §2, §5.2, §8.
  • J. Park, J. Y. Choi, J. Lee, and M. Kyung (2021)

    Bayesian approach to multivariate component-based logistic regression: analyzing correlated multivariate ordinal data

    Multivariate Behavioral Research, pp. 1–18. Cited by: §4.3.
  • M. J. Salganik, D. Fazito, N. Bertoni, A. H. Abdo, M. B. Mello, and F. I. Bastos (2011) Assessing network scale-up estimates for groups most at risk of HIV/AIDS: evidence from a multiple-method study of heavy drug users in Curitiba, Brazil. American Journal of Epidemiology 174 (10), pp. 1190–1196. Cited by: §1.
  • Stan Development Team (2019) RStan: the R interface to Stan. Note: R package version 2.19.2 External Links: Link Cited by: §5.
  • A. K. J. Teo, K. Prem, M. I. Chen, A. Roellin, M. L. Wong, H. H. La, and A. R. Cook (2019) Estimating the size of key populations for HIV in Singapore using the network scale-up method. Sexually Transmitted Infections 95 (8), pp. 602–607. Cited by: §1, §5.2, §5, §7, §7.
  • UNAIDS and WHO (2010) Guidelines on estimating the size of populations most at risk to HIV. Geneva, Switzerland: World Health Organization, pp. 51. Cited by: §1.
  • D. A. Van Dyk and X. Meng (2001) The art of data augmentation. Journal of Computational and Graphical Statistics 10 (1), pp. 1–50. Cited by: §3.2.
  • World Health Organization and others (2011) Global HIV/AIDS response: epidemic update and health sector progress towards universal access: progress report 2011. Cited by: §8.
  • World Health Organization and others (2016) Consolidated guidelines on HIV prevention, diagnosis, treatment and care for key populations. Cited by: §2.
  • T. Zheng, M. J. Salganik, and A. Gelman (2006) How many people do you know in prison? Using overdispersion in count data to estimate social structure in networks. Journal of the American Statistical Association 101 (474), pp. 409–423. Cited by: §1, §1, §3.1, §4.2, §4.2, §4, §5.1, §5.1, §5.2, §5, §7.

Supplementary Section  1 Ukraine Data

Subpopulation Known Size
Men aged 20-30 4,088,438
Men aged 15-17 935,153
Men above 70 1,328,606
Women aged 20-30 3,966,956
Women aged 15-17 899,443
Women above 70 2,993,846
Children (boys and girls) aged 10-13 1,869,098
Moldavians 258,619
Romanians 150,989
Poles 144,130
Jews 104,186
Romany 47,587
1st group invalids 278,195
Doctors of any speciality 222,884
People who died in 2007 762,877
Men named Pavlo 323,863
Men who served sentences in places of imprisonment in 2007 108,511
Men who officially divorced in 2007 178,364
Women who gave birth to a child in 2007 472,657
Doctors and Candidates of Science who received a scientific degree in Ukraine over the last 15 years 69,471
Nurse women, nurse men, aid-men and aid-women 487,148
Militiamen 273,200
Supplementary Table 1: Ukraine subpopulations and sizes

Supplementary Section  2 Bias Decomposition

In this section, we show how to decompose our bias estimates into the transmission, barrier, and recall effects. While our main focus is on subpopulation size estimates, the effect of each bias is also an important question. We have found that without overly strict assumptions on the structure of the biases, estimating each individual bias separately is infeasible, but imposing incorrect assumptions can negatively affect the other parameter estimates. Therefore, we estimate only an overall bias term that accounts for all sources of bias and decompose the bias into the individual biases after fitting the model, leaving the other model parameters unaffected by the bias assumptions. The form of our bias gives the model more flexibility in capturing the heterogeneity present in the data.

Our decomposition approach assumes the probability component of the binomial likelihood of the maltiel2015estimating recall error model is equivalent to the mean parameter of our Poisson likelihood. The recall error model is given by:


Then, we fit a Bayesian regression model with the posterior means of our probabilities as the response. Specifically, we first calculate the posterior means as


where is the number of posterior samples. Then, using the prior assumptions from maltiel2015estimating, we assume the model


where and are fixed from prior information and flat non-informative priors are placed on the remaining hyperparameters. The above priors can easily be adapted to suit new prior information and different assumptions about the structure of the biases.

Supplementary Section  3 Supplementary Figures

Supplementary Figure 1: Posterior predictive checks for conditionally-positive mean and standard deviation. The mean and standard deviation of posterior samples is shown across subpopulation for the overdispersed model (cyan), our correlated model (pink), and for the observed data (black square). For visualizations, only 500 posterior samples are shown for each model.

Supplementary Figure 2: Posterior predictive checks for the probability mass function and associated 95% posterior credible intervals. The x-axis and y-axis represents the proportion of simulated responses and observed responses that are equal to the integer values considered, respectively. The mean and 95% credible interval of the proportions over the posterior samples is shown using the shape and error bars.

Supplementary Figure 3: Posterior predictive checks for subpopulation correlations for the overdispersed model (left) and the correlated model (right).

Supplementary Section  4 Asymptotic Misspecification

In this section, we implement a theoretical asymptotic analysis to study the influence of global

covariates and correlation. We use the pioneering work of white1982maximum to numerically calculate the asymptotic bias of the size estimate parameter under these two forms of misspecification. Specifically, given a vector of parameter values that satisfies


where the expectation is taken with respect to the true parameter vector , then . Additionally,

Using this setup, we can calculate the theoretical asymptotic distribution of parameters for a misspecified model. Unfortunately, is analytically intractable in our case, necessitating a numerical solution to Equation (4). heagerty2001misspecified faced the same problem under a Binomial generalized linear model. Here we adopt a similar numerical study for our Poisson likelihood.

Supplementary Section  4.1 Missing Covariates

First we study the effect of missing a covariate. In this case the distribution of network sizes and random effects is identical ( , and ), but the likelihood for the true model is given by , whereas the misspecified model is given by . We set , , and and vary , , and together. Here, and with probability 0.5 and with probability 0.5.

The results under different random effect variances are shown in Table 2. While there is numerical imprecision for the asymptotic results, it is clear that both the prevalence and standard deviation estimates are asymptotically biased when fitting a misspecified model which ignores important covariates since the relative errors are many times larger than for the true model. Interestingly, the asymptotic bias of the parameters does not seem to depend on the variance of random effects.

True 1 0 -1 3 1 3
Misspecified 4 -6 120 89 -101 -86
True 0 0 1 0 0 0
Misspecified -1 -6 112 -59 -84 -760
True 0 0 1 -9 1 2
Misspecified 4 -6 120 -18 -56 -50
True 0 0 -1 0 1 0
Misspecified 4 -8 128 -36 -54 -42
Supplementary Table 2: Asymptotic relative bias of the prevalence and standard deviation of the random effects under a model with misspecified covariates. Relative bias is defined as .

Supplementary Section  4.2 Missing Correlation

Next, we study the effect of excluding the correlation structure. The likelihood and distribution of network sizes for both the true and misspecified model is given by where , . For the true model, , whereas . For all simulations, , , and , where and is specified below. The correlations were chosen to roughly mimic the observed correlation in the Ukraine dataset, where some populations were moderately correlated, while other populations were essentially uncorrelated with the others.

Table 3 presents the asymptotic bias of the prevalence and standard deviation parameters when fitting an uncorrelated model when the true model is correlated. We find that ignoring the correlation does not seem to asymptotically bias prevalence or standard deviation estimates. In all cases of the misspecified model, the relative error is less than 1%. In most cases, this error is larger than the error found under the true model, due to the numerical imprecision of the 4-dimensional integral required to calculate the asymptotic bias. We stress that while inference is likely unaffected asymptotically, the simulation study showed that the results are affected for a finite number of respondents, and estimating the correlation is still important because it answers the scientific questions of how different social groups connect.

True -73 -48 -1 -753 130 -351
Misspecified 6 32 -27 -120 50 -221
True -53 121 -48 434 -112 12
Misspecified -1 -10 -8 410 -10 -84
True -33 12 112 -333 -101 145
Misspecified -1 32 28 27 70 -90
True -21 5 10 -212 -150 182
Misspecified -9 -42 2 16 -145 -190
Supplementary Table 3: Asymptotic relative bias of the prevalence and standard deviation of the random effects under a model with misspecified correlation. Relative bias is defined as . True model relative bias is shown to evaluate numerical imprecision.

apalike NSUM_bib2