Random sampling is an effective way for researchers to learn about a population of interest. Characteristics of the sample can be generalized to the population of interest using standard statistical theory. However, traditional random sampling may be impossible when the target population is hidden, e.g. drug users, men who have sex with men, sex workers and homeless people. Due to concerns like privacy, stigmatization, discrimination, and criminalization, members of hidden populations may be reluctant to participate in a survey. In addition to the hidden populations for which no sampling “frame” exists, there are rare populations of great research interest, marked by their tiny fractions in the entire population. It is unlikely that random sampling from the general population would accrue a reasonable sample from a very rare population. However, it is often the case that members of hidden or rare populations know each other socially. This suggests that social referral would be an effective method for accruing a large sample. To this end, researchers have developed a variety of social link-tracing survey designs. The most popular is respondent-driven sampling (RDS) [Heckathorn1997].
In RDS, a small set of initial subjects, known as “seeds” are selected, not necessarily simultaneously, from the target population. Subjects are given a few coupons, each tagged with a unique identifier, which they can use to recruit other eligible subjects. Participants are given a reward for being interviewed and for each eligible subject they recruit using their coupons. Each subject reports their degree, the number of others whom they know in the target population. No subject is permitted to enter the study twice and the date and time of each interview is recorded.
While RDS can be an effective recruitment method, it reveals only incomplete social network data to researchers. Any ties between recruited subjects along which no recruitment took place remain unobserved. The social network of recruited subjects is of great interest to sociologists, epidemiologists and public health researchers since it may induce dependencies in the outcomes of sampled individuals. Fortunately, RDS reveals information about the underlying social network that can be used to (approximately) reconstruct it. By leveraging the time series of recruitments, the degrees of recruited subjects, coupon information, and who recruited whom, it is possible to interpret the induced subgraph of RDS respondents as a simple random graph model [Crawford2016].
In this paper, we introduce a flexible and expressive stochastic model of RDS recruitment on a partially observed network structure. We derive the likelihood of the observed time series; the model admits any edgewise inter-recruitment time distribution. We propose a stochastic optimization algorithm Render
(REspoNdent-Driven nEtwork Reconstruction) to estimate unknown parameters and the underlying social network. We conduct extensive experiments, on synthetic and real data, to confirm the accuracy and the reconstruction performance ofRender. In particular, we apply Render to reconstruct the network of injection drug users from an RDS study in St. Petersburg, Russia.
RDS has been modeled as a random walk with replacement on a graph at its equilibrium distribution [Heckathorn1997, Salganik and Heckathorn2004, Volz and Heckathorn2008, Goel and Salganik2009, Gile and Handcock2010]
, and under this argument the sampling probability is proportional to degree, which is the basis for an estimator of the population mean[Heckathorn2002, Salganik2006]. In this paper, we adopt an approach that focuses on the network structure estimable from the RDS sample using recruitment information [Crawford2016]
. crawford2014graphical (2016) assumes that edge-wise inter-recruitment times follow the exponential distribution, but this approach is relatively inflexible. In many other contexts, dynamic or random processes can reveal structural information of an underlying, partially observed, network[Kramer et al.2009, Shandilya and Timme2011, Linderman and Adams2014]. Network reconstruction and the edge prediction problem have been studied for diffusion processes where multiple realizations of the process on the same network are available [Liben-Nowell and Kleinberg2007, Gomez Rodriguez, Leskovec, and Krause2010, Gomez Rodriguez et al.2011]. In the case of RDS, reconstruction is particularly challenging because only a single realization of the diffusion process is observed. Furthermore, we must account for the role of coupons in recruitment as they intricately introduce bias. However, in contrast to general diffusion processes over graphs, some important network topological information is revealed by RDS. In this study, we leverage all the available data routinely collected during real-world RDS studies.
We conform to the following notation throughout the paper. Suppose that is a real-valued function and that
is a vector. Thenis a vector of the same size as vector and the -th entry of is denoted by , whose value is given by . The transposes of matrix and vector are written as and , respectively. And let be the all-ones column vector.
Dynamics of Respondent-Driven Sampling
We characterize the social network of the hidden population as a finite undirected simple graph with no parallel edges or self-loops. Members of the hidden population are vertices; implies that and know each other. Using RDS, researchers recruit members from the hidden population into the study. The time-varying recruitment-induced subgraph is a nested collection of subgraphs of , where is the termination time of the study. For all , is a subgraph of . Here, is the null graph since there are no subjects in the study initially. For simplicity, we write for , and call this the recruitment-induced subgraph or induced subgraph unless we explicitly specify the time . The vertex set of the time-varying recruitment-induced subgraph at time (i.e., ) denotes the members in the hidden population that are known to the study by time . The subgraph is induced by the vertex set ; i.e., . The time-varying recruitment-induced subgraph evolves in the following way [Crawford2016].
At time , researchers recruit a subject in the population as a seed. This subject is included in the vertex set of for all . Researchers may provide this subject with coupons to recruit its neighbors.
Once recruited into this study (either by researchers or its neighbors) at time , subjects currently holding coupons will attempt to recruit their yet-unrecruited neighbors. The inter-recruitment time along each edge connecting a recruiter with an unrecruited neighbor is i.i.d. with cdf . Recruitment happens when a neighbor is recruited into the study and is provided with a number of coupons. A successful recruitment costs the recruiter one coupon.
The directed recruitment graph is , where is the set of members in the study at the final stage. For two subjects , if and only if recruits . Note that the subjects recruited by researchers (i.e., the seeds) have zero in-degree in . Let denote the cardinality of (equivalently ). For simplicity, we label the subject recruited in the th recruitment event by . The labels of the subjects in the study are . The vector of recruitment times is , where is the recruitment time of subject . In shorthand, we write for . Let be the coupon matrix whose element if subject has at least one coupon just before the th recruitment event, and zero otherwise. The rows and columns of the coupon matrix are ordered by subjects’ recruitment time. The degree vector is , where is the degree of in . At time (where for ), if a subject has at least one coupon and at least one neighbor not in the study, we term it a recruiter at time ; if a subject has not entered the study and has at least one neighbor with at least one coupon, we term it a potential recruitee at time . We assume that the observed data from a RDS recruitment process consist of .
Likelihood of Recruitment Time Series
We assume that the inter-recruitment time along an edge connecting a recruiter and potential recruitee is i.i.d. with cdf . Let , where is the random inter-recruitment time and . We write for the corresponding conditional pdf. Let be the conditional survival function and be the conditional hazard function.
We now derive a closed-form expression for the likelihood of the recruitment time series . In what follows, let be the set of seeds, and let denote the adjacency matrix representation of the recruitment-induced subgraph at the final stage ; thus we use and interchangeably throughout this paper.
(Proof in Appendix). Let and be the recruiter set and potential recruitee set just before time , respectively, and be the set of seeds. The following statements with respect to the likelihood of the recruitment time series hold.
The likelihood of the recruitment time series is given by
Let and be column vectors of size defined as and , and let and be matrices, defined as and . Furthermore, we form matrices and , where denotes the Hadamard (entrywise) product. We let
where denotes the lower triangular part (diagonal elements inclusive) of a square matrix. Then, the log-likelihood of the recruitment time series can be written as
Network Reconstruction Problem
Given the observed time series, we seek to reconstruct the binary symmetric, zero-diagonal adjacency matrix of and the parameter ( is the parameter space) that maximizes . Recall that we use and interchangeably throughout this paper. We have
where is the prior distribution for . The constraint for the parameter is obvious: must reside in the parameter space ; i.e., . Now we discuss the constraint for the adjacency matrix —we require that the adjacency matrix must be compatible.
We seek the matrix that maximizes the probability . However, we know that the directed recruitment subgraph, if viewed as an undirected graph, must be a subgraph of the true recruitment-induced subgraph. Let be the adjacency matrix of the recruitment subgraph when it is viewed as an undirected graph; i.e., the entry of is if a direct recruitment event occurs between and (either recruits or recruits ), and is otherwise. We require that be greater than or equal to entrywise. Recall that every subject in the study reports its degree; thus the adjacency matrix should also comply with the degree constraints. Following [Crawford2016], we say that a symmetric, binary and zero-diagonal matrix is a compatible adjacency matrix if entrywise, and entrywise.
Now we formulate this problem as an optimization problem.
Given the observed data , we seek an adjacency matrix (symmetric, binary and zero-diagonal) and a parameter value that
Alternating Inference of and
Given the observed data , we wish to infer the adjacency matrix of the recruitment-induced graph and the distribution parameter . Given , the maximum likelihood estimator (MLE) for is
Similarly, given the true parameter , the MLE for the adjacency matrix is given by
In practice, both the parameter and the true recruitment-induced subgraph are unknown and need estimation. However, we can alternately estimate and . This is what Render (presented in Algorithm 1) does. Each iteration is divided into two steps: -step and -step.
Estimation of using simulated annealing
The -step of Render solves
equivalently, it suffices to solve
We employ a simulated-annealing-based method to estimate the adjacency matrix (presented in Algorithm 2). Let the energy function be
where is the temperature. We specify a cooling schedule, which is a sequence of positive numbers that satisfy , where is the temperature in the -th iteration. Note that the -th iteration of the simulated annealing procedure has a compatible adjacency matrix as its state.
Algorithm 3 specifies which state (compatible adjacency matrix) the algorithm should transition into in the next iteration. Concretely, in each iteration of Algorithm 2, it randomly proposes an edge that connects vertices and . If the edge does not appear in and it still conforms to the degree constraint if we add the edge, then we simply add it to . In contrast, if the edge appears in and it still conforms to the subgraph constraint if we remove the edge, then we simply remove it from . If neither condition is satisfied, the algorithm tries again with a new proposal. We prove in the appendix that the space of compatible adjacency matrices is connected by the proposal method.
The proposed compatible adjacency matrix is accepted as the state for the next iteration with probability , where equals
otherwise, the matrix remains the state for the next iteration. The term is called the likelihood ratio and the term is called the proposal ratio. To implement Algorithm 2, we have to compute the likelihood ratio and the proposal ratio efficiently. In fact, they can be evaluated efficiently in a recursive manner.
(Proof in Appendix). The proposal ratio is given by
where and . Here, we let denote the -entry of the matrix .
The same way, we can find the likelihood ratio in a recursive manner.
(Proof in Appendix). If we view in Theorem 1 as a function of the adjacency matrix , denoted by , then the recurrence relation between and is as follows:
Here, we assign the minus sign “” in “” if , and assign the plus sign “” in “” if . By the same convention, the likelihood ratio is given by
Estimation of distribution parameter
In a -step in Algorithm 1, we have to solve the optimization problem . If the parameter space is a subset of the -dimensional Euclidean space , this problem can be solved using off-the-shelf solvers, e.g., fminsearch in MATLAB.
We evaluated the proposed method in two aspects, the reconstruction performance of the recruitment-induced subgraph and the parameter estimation of the edgewise inter-recruitment time model. Let be the adjacency matrix of the true recruitment-induced subgraph and be the estimate. We define the true and false positive rates (TPR and FPR) as and Fig. 1 illustrates an example of the reconstruction procedure. We simulated a RDS process over the Project 90 graph [Woodhouse et al.1994] with power-law edgewise inter-recruitment time distribution, whose shape parameter and scale parameter Fifty subjects are recruited in this process. We show clockwise from top left: the true recruitment-induced subgraph , the observed recruitment graph , the estimated network with recruitments as blue arrows and dashed lines as inferred edges, and the estimated recruitment-induced subgraph . The TPR equals and the FPR equals . The estimated parameter and .
Impact of distribution parameter
We simulated 50 RDS over the Project 90 graph with inter-recruitment time distribution (parametrized by the shape and scale) for each , , and . Thus a total number of 250 RDS processes are simulated and the mean inter-recruitment time is fixed to be . The reconstruction performance is illustrated in Fig. 2. Each point on the receiver operating characteristic (ROC) plane represents a reconstruction accuracy performance of a simulated RDS process. Points with the same marker have the same inter-recruitment time distribution parameter. From the figure, we can observe that there is no significant sign of separation of points with different inter-recruitment time distribution parameters. Reconstruction accuracy is robust to the distribution parameter.
Impact of distribution model
We simulated 50 RDS process over the Project 90 graph with inter-recruitment time distribution . The recruitment-induced subgraph is reconstructed via the model of the true inter-recruitment distribution and the exponential distribution , respectively. The TPR and FPR of each dataset are presented in Fig. 3
. The TPR is always higher than the FPR, which reaffirms the effectiveness of our proposed reconstruction method. For each dataset, the TPR and FPR under the true and the exponential models are very close to each other. We observe that there is no significant reconstruction skewness incurred by mis-specification of the inter-recruitment time distribution model.
We simulated 200 RDS processes over the Project 90 graph with edgewise inter-recruitment time distribution (parametrized by the shape and scale parameters) for each , , and . We used the method in the “Estimation of distribution parameter” section. We assess the bias of the estimated shape parameter , which is given by . Fig. 4 shows the distribution of the bias using Tukey boxplots. In Fig. 4, the red boxes depict the distribution of the biases of the estimated shape parameters inferred through the estimated adjacency matrix, while the green boxes illustrate the distribution of those inferred given the true adjacency matrix. The horizontal axis shows the value of the true shape parameter.
With respect to the red boxes (those based on the estimated adjacency matrices), The middle line of each box is very close to zero and thus the estimation is highly accurate. The interquartile range (IQR) , which measures the deviation of the biases, declines as the shape parameter decreases. Even for the box with largest deviation (i.e., the box with ), the IQR is approximately and of the biases reside in the interval . Compared with the parameter estimation via the true adjacency matrix, this estimator based on the estimated adjacency matrix is biased to some degree. In Fig. 4, we can observe that it underestimates .
Then consider the green boxes (those based on the true adjacency matrix). Similar to those based on the estimated adjacency matrix, the deviation of this estimator declines as the value of the shape parameter decreases. The middle line of each box is noted to coincide perfectly with zero bias line, which suggests that this estimator is unbiased given the true adjacency matrix.
Experiments on Real Data
We also apply Render to data from an RDS study of drug users in St. Petersburg, Russian Federation. We use Render to infer the underlying social network structure of the drug users in this study. Since it could be confusing to visualize the whole inferred network, we only show the inferred subgraph of the largest community of the network, as presented in Fig. 5. The blue arrows represent the edges in the recruitment subgraph that indicates the recruiter and recruitee. Gray dashed edges are inferred from the data.
In this paper, we precisely formulated the dynamics of RDS as a continuous-time diffusion process over the underlying graph. We derived the likelihood for the recruitment time series under an arbitrarily recruitment time distribution. As a result, we develop an efficient stochastic optimization algorithm, Render, that identifies the optimum network that best explains the collected data. We then supported the performance of Render through an exhaustive set of experiments on both synthetic and real data.
FWC was supported by NIH/NCATS grant KL2 TR000140 and NIMH grant P30MH062294. LC would like to thank Zeya Zhang and Yukun Sun for their encouragement.
- [Chen, Crawford, and Karbasi2015] Chen, L.; Crawford, F. W.; and Karbasi, A. 2015. Seeing the unseen network: Inferring hidden social ties from respondent-driven sampling. arXiv preprint arXiv:1511.04137.
- [Crawford2016] Crawford, F. W. 2016. The graphical structure of respondent-driven sampling. Sociological Methodology to appear.
- [Gile and Handcock2010] Gile, K. J., and Handcock, M. S. 2010. Respondent-driven sampling: An assessment of current methodology. Sociological Methodology 40(1):285–327.
[Goel and Salganik2009]
Goel, S., and Salganik, M. J.
Respondent-driven sampling as Markov chain Monte Carlo.Statistics in Medicine 28(17):2202–2229.
[Gomez Rodriguez et al.2011]
Gomez Rodriguez, M.; Balduzzi, D.; Schölkopf, B.; Scheffer, G. T.; et al.
Uncovering the temporal dynamics of diffusion networks.
28th International Conference on Machine Learning (ICML 2011), 561–568. International Machine Learning Society.
- [Gomez Rodriguez, Leskovec, and Krause2010] Gomez Rodriguez, M.; Leskovec, J.; and Krause, A. 2010. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 1019–1028. ACM.
- [Heckathorn1997] Heckathorn, D. D. 1997. Respondent-driven sampling: a new approach to the study of hidden populations. SOCIAL PROBLEMS-NEW YORK- 44:174–199.
- [Heckathorn2002] Heckathorn, D. D. 2002. Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Social Problems 49(1):11–34.
- [Kramer et al.2009] Kramer, M. A.; Eden, U. T.; Cash, S. S.; and Kolaczyk, E. D. 2009. Network inference with confidence from multivariate time series. Physical Review E 79(6):061916.
- [Liben-Nowell and Kleinberg2007] Liben-Nowell, D., and Kleinberg, J. 2007. The link-prediction problem for social networks. Journal of the American society for information science and technology 58(7):1019–1031.
- [Linderman and Adams2014] Linderman, S., and Adams, R. 2014. Discovering latent network structure in point process data. In Proceedings of The 31st International Conference on Machine Learning, 1413–1421.
- [Salganik and Heckathorn2004] Salganik, M. J., and Heckathorn, D. D. 2004. Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology 34(1):193–240.
- [Salganik2006] Salganik, M. J. 2006. Variance estimation, design effects, and sample size calculations for respondent-driven sampling. Journal of Urban Health 83(1):98–112.
- [Shandilya and Timme2011] Shandilya, S. G., and Timme, M. 2011. Inferring network topology from complex dynamics. New Journal of Physics 13(1):013004.
- [Volz and Heckathorn2008] Volz, E., and Heckathorn, D. D. 2008. Probability based estimation theory for respondent driven sampling. Journal of Official Statistics 24(1):79.
- [Woodhouse et al.1994] Woodhouse, D. E.; Rothenberg, R. B.; Potterat, J. J.; Darrow, W. W.; Muth, S. Q.; Klovdahl, A. S.; Zimmerman, H. P.; Rogers, H. L.; Maldonado, T. S.; Muth, J. B.; et al. 1994. Mapping a social network of heterosexuals at high risk for HIV infection. Aids 8(9):1331–1336.
Proof of Theorem 1
(1) We consider the recruitment of subject . Let be the set of recruiters of subject just before time and be the set of recruitees of recruiter just before time . Suppose that . The inter-recruitment time between and its potential recruiter is denoted and is greater than conditional on previous recruitment of . Let
be the random variable of next recruiter andbe the random variable of next recruitee. Let denote the event . We have
Thus the likelihood is
Now suppose that .
Thus the likelihood is
Therefore the entire likelihood is
(2) The log-likelihood is
And the number of recruitees of recruiter just before time is given by
The term in the log-likelihood is given by
The term in the log-likelihood is given by
Thus the log-likelihood is
Proof of connectedness of the space of compatible adjacency matrices
Consider two compatible adjacency matrices and . Their Hadamard product , which corresponds to the maximum common subgraph of and , is also compatible. We can remove the edges in that do not appear in one at a time, and thus arrive at state . Then we add the edges in that do not appear in once at a time, and thus finally arrive at state . All intermediate adjacency matrices in this procedure are compatible. ∎
Proof of Theorem 2
In fact, the total number of possible edges that can be added for is given by and the total number of possible edges that can be removed for is given by . Thus we have the proposal distribution
Similarly, we have
Thus we obtain the proposal ratio:
Proof of Theorem 3
Let denote the matrix whose -entry is one and all other entries are zeros. We have
where we assign the minus sign “” in “” if the edge that connects subjects and is added to and we assign the plus sign “” in “” if that edge is removed from . And can be re-written as
where in (3) we use the fact that is symmetric and denotes the strictly upper triangular part (diagonal entries exclusive) of a matrix. Thus
The -entry of is
Hence we have
Similarly, if we view in Theorem 1 as a function of the adjacency matrix , denoted by , we have