Introduction
Random sampling is an effective way for researchers to learn about a population of interest. Characteristics of the sample can be generalized to the population of interest using standard statistical theory. However, traditional random sampling may be impossible when the target population is hidden, e.g. drug users, men who have sex with men, sex workers and homeless people. Due to concerns like privacy, stigmatization, discrimination, and criminalization, members of hidden populations may be reluctant to participate in a survey. In addition to the hidden populations for which no sampling “frame” exists, there are rare populations of great research interest, marked by their tiny fractions in the entire population. It is unlikely that random sampling from the general population would accrue a reasonable sample from a very rare population. However, it is often the case that members of hidden or rare populations know each other socially. This suggests that social referral would be an effective method for accruing a large sample. To this end, researchers have developed a variety of social linktracing survey designs. The most popular is respondentdriven sampling (RDS) [Heckathorn1997].
In RDS, a small set of initial subjects, known as “seeds” are selected, not necessarily simultaneously, from the target population. Subjects are given a few coupons, each tagged with a unique identifier, which they can use to recruit other eligible subjects. Participants are given a reward for being interviewed and for each eligible subject they recruit using their coupons. Each subject reports their degree, the number of others whom they know in the target population. No subject is permitted to enter the study twice and the date and time of each interview is recorded.
While RDS can be an effective recruitment method, it reveals only incomplete social network data to researchers. Any ties between recruited subjects along which no recruitment took place remain unobserved. The social network of recruited subjects is of great interest to sociologists, epidemiologists and public health researchers since it may induce dependencies in the outcomes of sampled individuals. Fortunately, RDS reveals information about the underlying social network that can be used to (approximately) reconstruct it. By leveraging the time series of recruitments, the degrees of recruited subjects, coupon information, and who recruited whom, it is possible to interpret the induced subgraph of RDS respondents as a simple random graph model [Crawford2016].
In this paper, we introduce a flexible and expressive stochastic model of RDS recruitment on a partially observed network structure. We derive the likelihood of the observed time series; the model admits any edgewise interrecruitment time distribution. We propose a stochastic optimization algorithm Render
(REspoNdentDriven nEtwork Reconstruction) to estimate unknown parameters and the underlying social network. We conduct extensive experiments, on synthetic and real data, to confirm the accuracy and the reconstruction performance of
Render. In particular, we apply Render to reconstruct the network of injection drug users from an RDS study in St. Petersburg, Russia.Related Work
RDS has been modeled as a random walk with replacement on a graph at its equilibrium distribution [Heckathorn1997, Salganik and Heckathorn2004, Volz and Heckathorn2008, Goel and Salganik2009, Gile and Handcock2010]
, and under this argument the sampling probability is proportional to degree, which is the basis for an estimator of the population mean
[Heckathorn2002, Salganik2006]. In this paper, we adopt an approach that focuses on the network structure estimable from the RDS sample using recruitment information [Crawford2016]. crawford2014graphical (2016) assumes that edgewise interrecruitment times follow the exponential distribution, but this approach is relatively inflexible. In many other contexts, dynamic or random processes can reveal structural information of an underlying, partially observed, network
[Kramer et al.2009, Shandilya and Timme2011, Linderman and Adams2014]. Network reconstruction and the edge prediction problem have been studied for diffusion processes where multiple realizations of the process on the same network are available [LibenNowell and Kleinberg2007, Gomez Rodriguez, Leskovec, and Krause2010, Gomez Rodriguez et al.2011]. In the case of RDS, reconstruction is particularly challenging because only a single realization of the diffusion process is observed. Furthermore, we must account for the role of coupons in recruitment as they intricately introduce bias. However, in contrast to general diffusion processes over graphs, some important network topological information is revealed by RDS. In this study, we leverage all the available data routinely collected during realworld RDS studies.Problem Formulation
We conform to the following notation throughout the paper. Suppose that is a realvalued function and that
is a vector. Then
is a vector of the same size as vector and the th entry of is denoted by , whose value is given by . The transposes of matrix and vector are written as and , respectively. And let be the allones column vector.Dynamics of RespondentDriven Sampling
We characterize the social network of the hidden population as a finite undirected simple graph with no parallel edges or selfloops. Members of the hidden population are vertices; implies that and know each other. Using RDS, researchers recruit members from the hidden population into the study. The timevarying recruitmentinduced subgraph is a nested collection of subgraphs of , where is the termination time of the study. For all , is a subgraph of . Here, is the null graph since there are no subjects in the study initially. For simplicity, we write for , and call this the recruitmentinduced subgraph or induced subgraph unless we explicitly specify the time . The vertex set of the timevarying recruitmentinduced subgraph at time (i.e., ) denotes the members in the hidden population that are known to the study by time . The subgraph is induced by the vertex set ; i.e., . The timevarying recruitmentinduced subgraph evolves in the following way [Crawford2016].

At time , researchers recruit a subject in the population as a seed. This subject is included in the vertex set of for all . Researchers may provide this subject with coupons to recruit its neighbors.

Once recruited into this study (either by researchers or its neighbors) at time , subjects currently holding coupons will attempt to recruit their yetunrecruited neighbors. The interrecruitment time along each edge connecting a recruiter with an unrecruited neighbor is i.i.d. with cdf . Recruitment happens when a neighbor is recruited into the study and is provided with a number of coupons. A successful recruitment costs the recruiter one coupon.
The directed recruitment graph is , where is the set of members in the study at the final stage. For two subjects , if and only if recruits . Note that the subjects recruited by researchers (i.e., the seeds) have zero indegree in . Let denote the cardinality of (equivalently ). For simplicity, we label the subject recruited in the th recruitment event by . The labels of the subjects in the study are . The vector of recruitment times is , where is the recruitment time of subject . In shorthand, we write for . Let be the coupon matrix whose element if subject has at least one coupon just before the th recruitment event, and zero otherwise. The rows and columns of the coupon matrix are ordered by subjects’ recruitment time. The degree vector is , where is the degree of in . At time (where for ), if a subject has at least one coupon and at least one neighbor not in the study, we term it a recruiter at time ; if a subject has not entered the study and has at least one neighbor with at least one coupon, we term it a potential recruitee at time . We assume that the observed data from a RDS recruitment process consist of .
Likelihood of Recruitment Time Series
We assume that the interrecruitment time along an edge connecting a recruiter and potential recruitee is i.i.d. with cdf . Let , where is the random interrecruitment time and . We write for the corresponding conditional pdf. Let be the conditional survival function and be the conditional hazard function.
We now derive a closedform expression for the likelihood of the recruitment time series . In what follows, let be the set of seeds, and let denote the adjacency matrix representation of the recruitmentinduced subgraph at the final stage ; thus we use and interchangeably throughout this paper.
Theorem 1.
(Proof in Appendix). Let and be the recruiter set and potential recruitee set just before time , respectively, and be the set of seeds. The following statements with respect to the likelihood of the recruitment time series hold.

The likelihood of the recruitment time series is given by
where .

Let and be column vectors of size defined as and , and let and be matrices, defined as and . Furthermore, we form matrices and , where denotes the Hadamard (entrywise) product. We let
where denotes the lower triangular part (diagonal elements inclusive) of a square matrix. Then, the loglikelihood of the recruitment time series can be written as
Network Reconstruction Problem
Given the observed time series, we seek to reconstruct the binary symmetric, zerodiagonal adjacency matrix of and the parameter ( is the parameter space) that maximizes . Recall that we use and interchangeably throughout this paper. We have
where is the prior distribution for . The constraint for the parameter is obvious: must reside in the parameter space ; i.e., . Now we discuss the constraint for the adjacency matrix —we require that the adjacency matrix must be compatible.
We seek the matrix that maximizes the probability . However, we know that the directed recruitment subgraph, if viewed as an undirected graph, must be a subgraph of the true recruitmentinduced subgraph. Let be the adjacency matrix of the recruitment subgraph when it is viewed as an undirected graph; i.e., the entry of is if a direct recruitment event occurs between and (either recruits or recruits ), and is otherwise. We require that be greater than or equal to entrywise. Recall that every subject in the study reports its degree; thus the adjacency matrix should also comply with the degree constraints. Following [Crawford2016], we say that a symmetric, binary and zerodiagonal matrix is a compatible adjacency matrix if entrywise, and entrywise.
Problem Statement
Now we formulate this problem as an optimization problem.
Problem.
Given the observed data , we seek an adjacency matrix (symmetric, binary and zerodiagonal) and a parameter value that
Alternating Inference of and
Given the observed data , we wish to infer the adjacency matrix of the recruitmentinduced graph and the distribution parameter . Given , the maximum likelihood estimator (MLE) for is
Similarly, given the true parameter , the MLE for the adjacency matrix is given by
In practice, both the parameter and the true recruitmentinduced subgraph are unknown and need estimation. However, we can alternately estimate and . This is what Render (presented in Algorithm 1) does. Each iteration is divided into two steps: step and step.
Estimation of using simulated annealing
The step of Render solves
equivalently, it suffices to solve
We employ a simulatedannealingbased method to estimate the adjacency matrix (presented in Algorithm 2). Let the energy function be
where is the temperature. We specify a cooling schedule, which is a sequence of positive numbers that satisfy , where is the temperature in the th iteration. Note that the th iteration of the simulated annealing procedure has a compatible adjacency matrix as its state.
Algorithm 3 specifies which state (compatible adjacency matrix) the algorithm should transition into in the next iteration. Concretely, in each iteration of Algorithm 2, it randomly proposes an edge that connects vertices and . If the edge does not appear in and it still conforms to the degree constraint if we add the edge, then we simply add it to . In contrast, if the edge appears in and it still conforms to the subgraph constraint if we remove the edge, then we simply remove it from . If neither condition is satisfied, the algorithm tries again with a new proposal. We prove in the appendix that the space of compatible adjacency matrices is connected by the proposal method.
The proposed compatible adjacency matrix is accepted as the state for the next iteration with probability , where equals
otherwise, the matrix remains the state for the next iteration. The term is called the likelihood ratio and the term is called the proposal ratio. To implement Algorithm 2, we have to compute the likelihood ratio and the proposal ratio efficiently. In fact, they can be evaluated efficiently in a recursive manner.
Theorem 2.
(Proof in Appendix). The proposal ratio is given by
where and . Here, we let denote the entry of the matrix .
The same way, we can find the likelihood ratio in a recursive manner.
Theorem 3.
(Proof in Appendix). If we view in Theorem 1 as a function of the adjacency matrix , denoted by , then the recurrence relation between and is as follows:
(1) 
Here, we assign the minus sign “” in “” if , and assign the plus sign “” in “” if . By the same convention, the likelihood ratio is given by
Estimation of distribution parameter
In a step in Algorithm 1, we have to solve the optimization problem . If the parameter space is a subset of the dimensional Euclidean space , this problem can be solved using offtheshelf solvers, e.g., fminsearch in MATLAB.
Experiments
We evaluated the proposed method in two aspects, the reconstruction performance of the recruitmentinduced subgraph and the parameter estimation of the edgewise interrecruitment time model. Let be the adjacency matrix of the true recruitmentinduced subgraph and be the estimate. We define the true and false positive rates (TPR and FPR) as and Fig. 1 illustrates an example of the reconstruction procedure. We simulated a RDS process over the Project 90 graph [Woodhouse et al.1994] with powerlaw edgewise interrecruitment time distribution, whose shape parameter and scale parameter Fifty subjects are recruited in this process. We show clockwise from top left: the true recruitmentinduced subgraph , the observed recruitment graph , the estimated network with recruitments as blue arrows and dashed lines as inferred edges, and the estimated recruitmentinduced subgraph . The TPR equals and the FPR equals . The estimated parameter and .
Reconstruction Performance
Impact of distribution parameter
We simulated 50 RDS over the Project 90 graph with interrecruitment time distribution (parametrized by the shape and scale) for each , , and . Thus a total number of 250 RDS processes are simulated and the mean interrecruitment time is fixed to be . The reconstruction performance is illustrated in Fig. 2. Each point on the receiver operating characteristic (ROC) plane represents a reconstruction accuracy performance of a simulated RDS process. Points with the same marker have the same interrecruitment time distribution parameter. From the figure, we can observe that there is no significant sign of separation of points with different interrecruitment time distribution parameters. Reconstruction accuracy is robust to the distribution parameter.
Impact of distribution model
We simulated 50 RDS process over the Project 90 graph with interrecruitment time distribution . The recruitmentinduced subgraph is reconstructed via the model of the true interrecruitment distribution and the exponential distribution , respectively. The TPR and FPR of each dataset are presented in Fig. 3
. The TPR is always higher than the FPR, which reaffirms the effectiveness of our proposed reconstruction method. For each dataset, the TPR and FPR under the true and the exponential models are very close to each other. We observe that there is no significant reconstruction skewness incurred by misspecification of the interrecruitment time distribution model.
Parameter Estimation
We simulated 200 RDS processes over the Project 90 graph with edgewise interrecruitment time distribution (parametrized by the shape and scale parameters) for each , , and . We used the method in the “Estimation of distribution parameter” section. We assess the bias of the estimated shape parameter , which is given by . Fig. 4 shows the distribution of the bias using Tukey boxplots. In Fig. 4, the red boxes depict the distribution of the biases of the estimated shape parameters inferred through the estimated adjacency matrix, while the green boxes illustrate the distribution of those inferred given the true adjacency matrix. The horizontal axis shows the value of the true shape parameter.
With respect to the red boxes (those based on the estimated adjacency matrices), The middle line of each box is very close to zero and thus the estimation is highly accurate. The interquartile range (IQR) , which measures the deviation of the biases, declines as the shape parameter decreases. Even for the box with largest deviation (i.e., the box with ), the IQR is approximately and of the biases reside in the interval . Compared with the parameter estimation via the true adjacency matrix, this estimator based on the estimated adjacency matrix is biased to some degree. In Fig. 4, we can observe that it underestimates .
Then consider the green boxes (those based on the true adjacency matrix). Similar to those based on the estimated adjacency matrix, the deviation of this estimator declines as the value of the shape parameter decreases. The middle line of each box is noted to coincide perfectly with zero bias line, which suggests that this estimator is unbiased given the true adjacency matrix.
Experiments on Real Data
We also apply Render to data from an RDS study of drug users in St. Petersburg, Russian Federation. We use Render to infer the underlying social network structure of the drug users in this study. Since it could be confusing to visualize the whole inferred network, we only show the inferred subgraph of the largest community of the network, as presented in Fig. 5. The blue arrows represent the edges in the recruitment subgraph that indicates the recruiter and recruitee. Gray dashed edges are inferred from the data.
Conclusion
In this paper, we precisely formulated the dynamics of RDS as a continuoustime diffusion process over the underlying graph. We derived the likelihood for the recruitment time series under an arbitrarily recruitment time distribution. As a result, we develop an efficient stochastic optimization algorithm, Render, that identifies the optimum network that best explains the collected data. We then supported the performance of Render through an exhaustive set of experiments on both synthetic and real data.
Acknowledgements
FWC was supported by NIH/NCATS grant KL2 TR000140 and NIMH grant P30MH062294. LC would like to thank Zeya Zhang and Yukun Sun for their encouragement.
References
 [Chen, Crawford, and Karbasi2015] Chen, L.; Crawford, F. W.; and Karbasi, A. 2015. Seeing the unseen network: Inferring hidden social ties from respondentdriven sampling. arXiv preprint arXiv:1511.04137.
 [Crawford2016] Crawford, F. W. 2016. The graphical structure of respondentdriven sampling. Sociological Methodology to appear.
 [Gile and Handcock2010] Gile, K. J., and Handcock, M. S. 2010. Respondentdriven sampling: An assessment of current methodology. Sociological Methodology 40(1):285–327.

[Goel and Salganik2009]
Goel, S., and Salganik, M. J.
2009.
Respondentdriven sampling as Markov chain Monte Carlo.
Statistics in Medicine 28(17):2202–2229. 
[Gomez Rodriguez et al.2011]
Gomez Rodriguez, M.; Balduzzi, D.; Schölkopf, B.; Scheffer, G. T.; et al.
2011.
Uncovering the temporal dynamics of diffusion networks.
In
28th International Conference on Machine Learning (ICML 2011)
, 561–568. International Machine Learning Society.  [Gomez Rodriguez, Leskovec, and Krause2010] Gomez Rodriguez, M.; Leskovec, J.; and Krause, A. 2010. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 1019–1028. ACM.
 [Heckathorn1997] Heckathorn, D. D. 1997. Respondentdriven sampling: a new approach to the study of hidden populations. SOCIAL PROBLEMSNEW YORK 44:174–199.
 [Heckathorn2002] Heckathorn, D. D. 2002. Respondentdriven sampling II: deriving valid population estimates from chainreferral samples of hidden populations. Social Problems 49(1):11–34.
 [Kramer et al.2009] Kramer, M. A.; Eden, U. T.; Cash, S. S.; and Kolaczyk, E. D. 2009. Network inference with confidence from multivariate time series. Physical Review E 79(6):061916.
 [LibenNowell and Kleinberg2007] LibenNowell, D., and Kleinberg, J. 2007. The linkprediction problem for social networks. Journal of the American society for information science and technology 58(7):1019–1031.
 [Linderman and Adams2014] Linderman, S., and Adams, R. 2014. Discovering latent network structure in point process data. In Proceedings of The 31st International Conference on Machine Learning, 1413–1421.
 [Salganik and Heckathorn2004] Salganik, M. J., and Heckathorn, D. D. 2004. Sampling and estimation in hidden populations using respondentdriven sampling. Sociological Methodology 34(1):193–240.
 [Salganik2006] Salganik, M. J. 2006. Variance estimation, design effects, and sample size calculations for respondentdriven sampling. Journal of Urban Health 83(1):98–112.
 [Shandilya and Timme2011] Shandilya, S. G., and Timme, M. 2011. Inferring network topology from complex dynamics. New Journal of Physics 13(1):013004.
 [Volz and Heckathorn2008] Volz, E., and Heckathorn, D. D. 2008. Probability based estimation theory for respondent driven sampling. Journal of Official Statistics 24(1):79.
 [Woodhouse et al.1994] Woodhouse, D. E.; Rothenberg, R. B.; Potterat, J. J.; Darrow, W. W.; Muth, S. Q.; Klovdahl, A. S.; Zimmerman, H. P.; Rogers, H. L.; Maldonado, T. S.; Muth, J. B.; et al. 1994. Mapping a social network of heterosexuals at high risk for HIV infection. Aids 8(9):1331–1336.
Appendix
Proof of Theorem 1
Proof.
(1) We consider the recruitment of subject . Let be the set of recruiters of subject just before time and be the set of recruitees of recruiter just before time . Suppose that . The interrecruitment time between and its potential recruiter is denoted and is greater than conditional on previous recruitment of . Let
be the random variable of next recruiter and
be the random variable of next recruitee. Let denote the event . We haveand
Thus the likelihood is
Now suppose that .
Thus the likelihood is
Therefore the entire likelihood is
(2) The loglikelihood is
And the number of recruitees of recruiter just before time is given by
The term in the loglikelihood is given by
The term in the loglikelihood is given by
Thus the loglikelihood is
∎
Proof of connectedness of the space of compatible adjacency matrices
Proof.
Consider two compatible adjacency matrices and . Their Hadamard product , which corresponds to the maximum common subgraph of and , is also compatible. We can remove the edges in that do not appear in one at a time, and thus arrive at state . Then we add the edges in that do not appear in once at a time, and thus finally arrive at state . All intermediate adjacency matrices in this procedure are compatible. ∎
Proof of Theorem 2
Proof.
In fact, the total number of possible edges that can be added for is given by and the total number of possible edges that can be removed for is given by . Thus we have the proposal distribution
Similarly, we have
(2) 
Thus we obtain the proposal ratio:
∎
Proof of Theorem 3
Proof.
Let denote the matrix whose entry is one and all other entries are zeros. We have
where we assign the minus sign “” in “” if the edge that connects subjects and is added to and we assign the plus sign “” in “” if that edge is removed from . And can be rewritten as
(3)  
where in (3) we use the fact that is symmetric and denotes the strictly upper triangular part (diagonal entries exclusive) of a matrix. Thus
The entry of is
Thus
Hence we have
(4) 
Similarly, if we view in Theorem 1 as a function of the adjacency matrix , denoted by , we have
Comments
There are no comments yet.