## 1 Introduction

The network reconstruction problem, also known as the network inference problem [10, 5, 19, 7, 20, 21, 11, 24, 1, 13, 18], arises naturally in a variety of scenarios and has been the focus of great research interest. In the most general setting, we assume there is an underlying unknown graph structure that represents the connections between network subjects, and that we can only observe single or multiple diffusion processes over the graph. Usually propagation of the diffusion process can only occur over network edges; however, there exist many hidden ties untraversed or unrevealed by the diffusion processes, and the goal is to infer such hidden ties. This network reconstruction problem arises in several empirical topic areas:

Blogosphere. Millions of authors in the worldwide blogosphere write articles every day, each triggering a diffusion process of reposts over the underlying blog network structure. The diffusion process initiated by an article can be represented by a directed tree. The observed data consist of multiple directed trees and it is of great interest to understand the underlying structure of information flow [9]. Following inference of the network, researchers may apply community detection algorithms to, e.g., aggregate and further analyze blog sites of different political views.

Online social networks. Weibo is a Twitter-like microblogging service in China [8] where users post microblogs and repost those from other users they follow. An explicit repost chain, which indicates the sequence of users that a post passes through, is attached to each repost on Weibo. Similarly, each post initiates a diffusion process. By observing several realizations of diffusion processes, researchers seek to understand the underlying social and information network structure.

Respondent-driven sampling. Respondent-driven sampling (RDS) is a chain-referral peer recruitment procedure that is widely used in epidemiology for studying hidden and hard-to-reach human populations when random sampling is impossible [14]. RDS is commonly used in studies of men who have sex with men, homeless people, sex workers, drug users, and other groups at high risk for HIV infection [25]. An RDS recruitment process is also a diffusion process over an unknown social network structure, and the diffusion tree (who recruited whom) is revealed by the observed process. In addition, when a subject enters the survey, she reports her total number of acquaintances in the population, or graph-theoretically speaking, her degree in the underlying network. Understanding the underlying network structure is a topic of great interest to epidemiologists and sociologists who wish to study the transmission of infectious diseases and the propagation of health-related behaviors in the networks of high-risk groups [4].

However, in contrast to the aforementioned scenarios where multiple diffusion realizations are available over the same network, in RDS we can only observe a single realization due to limited financial, temporal, and human resources. As a result, network reconstruction from RDS data is particularly challenging and only heuristic algorithms are known. Crawford [4] assumes that the recruitment time along any recruitment link is exponentially distributed and thus models RDS as a continuous-time diffusion process. Chen et al. [3] relax the requirement of exponentially distributed recruitment times and extend the model to arbitrary distributions. Both works use a simulated-annealing-based heuristic to find the most likely configuration.

As a general strategy, for a particular diffusion model, a likelihood function can be derived that measures the probability of a diffusion realization. In this way, the network inference problem can be cast as an optimization problem, in which the researcher seeks the topology that maximizes the likelihood. Unfortunately, the derived likelihood functions are usually intractable to maximize efficiently with respect to the graph, and can be computationally prohibitive even to evaluate. To address this challenge, approximate solutions have been proposed as an efficient alternative [10, 11]. For instance, Gomez-Rodriguez et al. [10], instead of maximizing the likelihood, derived an alternative heuristic formulation by considering only the most likely propagation tree (still an NP-hard problem) rather than all possible trees, and showed that a greedy algorithm finds a near-optimal solution. This approach enjoys good empirical results when many realizations of the diffusion process can be observed.

In this paper, we consider the challenging instance of network inference where only one realization of the diffusion process is observed. As a motivating empirical example, we study the network reconstruction problem for RDS data and propose Vine (Variational Inference for Network rEconstruction), a computationally efficient variational inference algorithm. Our major contributions are summarized as follows.

Proof of log-submodularity and a variational inference algorithm. We show that under a realistic model of RDS diffusion, the likelihood function is log-submodular. Using variational inference methods, we approximate the submodular function with affine functions and obtain tight upper and lower bounds for the partition function. We then estimate the most probable network configuration, which is the maximizer of the likelihood, as well as the marginal probability of each edge.

Relaxation of constraints. The optimization problem of the RDS likelihood (as shown later) is constrained. First, the observed diffusion results in a directed subgraph and the inferred network must be a supergraph of the diffusion process. Second, for each subject, their degree in the reconstructed subgraph cannot exceed their total network degree. The first constraint is easy to incorporate while the second precludes efficient computation of partition functions of the likelihood (or any linear approximations). We address this challenge by introducing penalty terms in the objective function. This way, the constrained reconstruction problem becomes unconstrained and admits the use of variational methods.

Flexibility for possibly inexact reported degrees. One may not assume that the degrees reported by recruited subjects are exact, because subjects may be unable to accurately recall the number of people they know in the target population. We note that the aforementioned relaxation of the second constraint accommodates possible mismatch between the reported and true degrees by introducing an additional term that penalizes their deviation, seeking to preserve the relative accuracy of the reported degrees.

High reconstruction performance and time efficiency using a single realization of RDS diffusion. As our experiments show, Vine achieves significantly higher inference accuracy than the simulated-annealing baseline [3] while running orders of magnitude faster. Notably, this accuracy is achieved from the observation of a single diffusion realization, in sharp contrast to previous work that assumes multiple diffusion realizations.

The rest of the paper is organized as follows. In Section 2, we focus on network reconstruction for RDS data and formulate it as an optimization problem. We present our method in Section 3. Experimental results are presented in Section 4. All proofs are presented in Appendices C through G. Additionally, we discuss the connection between RDS and other diffusion processes in Appendix H.

## 2 Network Reconstruction for RDS Data

is the degree vector, where

is the total degree of node in . In Fig. (f)f, is the recruitment time vector, where is the recruitment time of node . In Fig. (g)g, is the coupon matrix; its -entry is if node has at least one coupon just before the th recruitment event and is otherwise. The observed data consist of .

We use the following notational convention throughout this paper. The symbol denotes the all-ones column vector. If is a real-valued function and is a vector, then is a vector of the same size as , we denote the -th entry of by , and . The transposes of matrix and vector are written as and , respectively.

The objective of RDS is to obtain a sample from a population for which random sampling is impossible. The network structure of the underlying inaccessible population is modeled as an undirected simple graph , where each vertex represents an individual and edges represent the intra-population connections. The sample obtained via RDS is denoted by . Let be the number of subjects recruited into the study by the end of the RDS process.

In contrast to random sampling, RDS is a chain-referral process that operates by propagating a diffusion process over the edges of the target social network. Subjects enter the study one at a time. Recruitment (diffusion) is carried out either by the researchers directly or by subjects already in the study. A subject recruited directly by the researchers is called a *seed*. Let be the set of all seed nodes. Seed nodes need not be recruited simultaneously; however, at least one seed must enter the study at the initial stage of the experiment in order to initiate the chain-referrals over the underlying network. We label the subjects in the order they enter the study; node is the -th subject to enter the study. When a subject enters the study (either via the researchers directly or via other subjects already in the study), she is given several coupons with which to recruit other members (each recruitment costs one coupon). Each coupon is marked with a unique ID that can be traced back to the recruiter. Subjects are given a reward for being interviewed and for recruiting other eligible subjects. The date/time of every subject's recruitment is recorded, and every subject reports their total number of acquaintances (their network *degree*). Let be the recorded recruitment time of subject and be the reported degree of subject in the population (in the graph ). Let the recruitment time and degree vectors be and , respectively.

Once a subject recruits another subject, the (directed) link between them is revealed. The direction simply indicates who recruited whom. Furthermore, no participant may enter the study more than once; thus no subject can recruit others who are already in the study. We can form a directed graph, called the recruitment graph , that has the same vertex set as and reflects the recruitment links; if and only if subject directly recruits . The above requirements imply that the recruitment graph is a disjoint union of rooted directed arborescences (a directed graph is a rooted directed arborescence with root if for every vertex , there exists a unique directed path from to ) [12], where each root corresponds to a seed node. Equivalently, an arborescence is a directed, rooted tree in which all edges point away from the root. We illustrate an example of in Fig. (d)d, where the red links form the recruitment graph and there are two disjoint arborescences with roots and , respectively; the two roots correspond to the two seed nodes.
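The arborescence property above can be verified programmatically as a structural sanity check: every non-seed node must have exactly one recruiter, and following recruiters upward from any node must terminate at a seed without revisiting a node. A minimal sketch, where the node labels, edges, and seed set are toy illustrations rather than data from the paper:

```python
# Check that a recruitment graph is a disjoint union of rooted arborescences:
# seeds have no incoming edge, every recruit has exactly one recruiter, and
# walking up the recruiter chain from any node reaches a seed (no cycles).

def is_arborescence_forest(nodes, edges, seeds):
    parents = {v: [u for (u, w) in edges if w == v] for v in nodes}
    for v in nodes:
        if v in seeds:
            if parents[v]:            # a seed is a root: no incoming edge
                return False
        elif len(parents[v]) != 1:    # every recruit has exactly one recruiter
            return False
    for v in nodes:                   # walk up to a seed, detecting cycles
        seen = set()
        while v not in seeds:
            if v in seen:
                return False
            seen.add(v)
            v = parents[v][0]
    return True

nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 3), (1, 4), (2, 5), (4, 6)]   # (recruiter, recruitee)
seeds = {1, 2}
print(is_arborescence_forest(nodes, edges, seeds))                 # True
print(is_arborescence_forest(nodes, edges + [(5, 2)], seeds))      # False: seed 2 gains a parent
```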

The first subject enters the study at time . At some time , any subject in the study who has at least one coupon (recall that each recruitment costs one coupon and that one cannot recruit without a coupon) and has at least one acquaintance not in the study (i.e., at least one neighbor in who is not already recruited) is called a *recruiter* at time ; accordingly, any subject who is not in the study and is connected to at least one recruiter is called a *potential recruitee* or a *susceptible subject* at time ; and the edge between a recruiter and a potential recruitee is said to be *susceptible* at time . Let and be the recruiter set and potential recruitee set just before time , respectively. Similarly, if subject is a recruiter just before time , then denotes the set of potential recruitees connected to subject just before time ; if is a potential recruitee just before time , then denotes the set of recruiters connected to subject just before time .

In what follows, we model RDS as a continuous-time stochastic process on the edges of a hidden graph. Our goal is to estimate the induced subgraph connecting the sampled vertices . To do this, we construct a flexible model for this process and derive its likelihood, conditional on an underlying graph. The inference problem is to find the graph that maximizes this likelihood, subject to the constraint that the graph must be compatible with the observed degrees in the data. We start with making the following assumptions:

###### Assumption 1.

Upon entering the study, each subject is given coupons and begins to recruit other members (if any) immediately.

###### Assumption 2.

Inter-recruitment times between any recruiter and its potential recruitees are i.i.d. continuous random variables with cumulative distribution function (cdf) parametrized by . In fact, Assumption 2 can be relaxed to the case where inter-recruitment times are independent but not necessarily identically distributed; for simplicity, we assume that they are i.i.d.
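Under Assumptions 1 and 2 with exponentially distributed inter-recruitment times (the special case modeled in [4]), the memoryless property makes simulation particularly simple: at every recruitment event, the next recruitment occurs along a uniformly random susceptible edge, independent of elapsed time. A minimal simulation sketch; the graph, seed, coupon count, and sample size below are illustrative assumptions, not values from the paper:

```python
# Sketch of an RDS diffusion under Assumptions 1-2 with exponential
# inter-recruitment times. With i.i.d. exponential edge clocks, the rate only
# sets the time scale, so the recruitment ORDER can be simulated by picking a
# uniformly random susceptible edge at each step (memorylessness).
import random

def simulate_rds(neighbors, seed, n_samples, coupons=3, rng=None):
    rng = rng or random.Random(0)
    recruited = {seed}
    coupon_count = {seed: coupons}
    recruitment_edges = []
    while len(recruited) < n_samples:
        # susceptible edges: recruiter with a coupon -> unrecruited neighbor
        susceptible = [(u, v) for u in recruited if coupon_count[u] > 0
                       for v in neighbors[u] if v not in recruited]
        if not susceptible:
            break  # process dies out before reaching the target sample size
        u, v = rng.choice(susceptible)
        recruited.add(v)
        coupon_count[u] -= 1
        coupon_count[v] = coupons
        recruitment_edges.append((u, v))
    return recruited, recruitment_edges

# small symmetrized circulant graph, purely for illustration
neighbors = {i: set() for i in range(8)}
for i in range(8):
    for j in ((i + 1) % 8, (i + 3) % 8):
        neighbors[i].add(j)
        neighbors[j].add(i)

recruited, tree = simulate_rds(neighbors, seed=0, n_samples=6)
print(len(recruited), len(tree))  # sample size and number of recruitment links
```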

If is a random variable with cdf , we have and let . We write

for the conditional probability density function (pdf). Let

be the conditional survival function and be the conditional hazard function. Recall that the set of all subjects in the study is denoted by . The recruitment graph has the same vertex set as and indicates who recruited whom: if subject recruits subject . Note that subject can recruit subject only if there is an edge in the underlying network that connects and . The coupon matrix has a in entry if subject has at least one coupon just before the -th recruitment event, and a zero otherwise. In addition, we define another matrix , the adjacency matrix of the undirected version of , obtained by replacing all directed edges with undirected edges.

###### Assumption 3.
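For concreteness, the survival and hazard functions take a particularly simple form when the inter-recruitment times are exponential: the survival function is S(t) = exp(−λt) and the hazard is the constant λ. A quick numeric check of these identities; the rate λ = 0.5 is arbitrary:

```python
# Survival and hazard for an exponential(lam) inter-recruitment time:
# S(t) = 1 - F(t) = exp(-lam*t), hazard h(t) = f(t)/S(t) = lam (constant).
import math

lam = 0.5
F = lambda t: 1 - math.exp(-lam * t)    # cdf
f = lambda t: lam * math.exp(-lam * t)  # pdf
S = lambda t: 1 - F(t)                  # survival function
h = lambda t: f(t) / S(t)               # hazard function

for t in (0.1, 1.0, 5.0):
    assert abs(S(t) - math.exp(-lam * t)) < 1e-12
    assert abs(h(t) - lam) < 1e-12      # the exponential has constant hazard
print("exponential hazard is constant at", lam)
```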

The observed data from an RDS process consists of .

Our goal is to infer the induced subgraph , denoted by , which encodes all connections among the subjects in the study. We also use to denote the adjacency matrix of ; throughout this paper, and are used interchangeably. Clearly, the undirected version of must be a subgraph of , so must be greater than or equal to entrywise; formally,

This will be a constraint in the optimization problem specified later in Problem 1. Fig. 1 shows an example of an RDS process including its unobserved and observed parts.

Recall that denotes the set of seeds and let . The likelihood of the recruitment time series is given by

(1) |

(The proof of Eq. 1 is presented in Appendix C.) The above model was originally derived in [3].

We can represent the log-likelihood in a more compact way using linear algebra. Prior to this, we need some notation. Let and be column vectors of size such that and is the number of pendant edges of subject (the reported total degree of subject minus the number of its neighbors in ), i.e.,

and let and be matrices, defined as and Furthermore, we form matrices and , where denotes the Hadamard (entrywise) product. We let

where the log of a vector is taken entrywise and denotes the lower triangular part (diagonal elements inclusive) of a square matrix. Then the log-likelihood can be written as

(2) |

(The proof of Eq. 2 is presented in Appendix D.) To adopt a Bayesian approach, we consider maximizing the joint posterior distribution

where and are the prior distributions of and , respectively. The network inference problem for RDS data thus reduces to the maximization of . Our main observation in this paper is that the log-likelihood function is submodular, which opens the possibility of rigorous analysis and variational inference.

If we assume that the reported degrees of subjects are exact, then the vector should be set to , and it must be non-negative entrywise. In practice, however, the reported degree of a subject may only be an approximation, though we assume the true degree does not deviate excessively from the reported one. To be more flexible, we allow to be any non-negative integer-valued vector, in which case the true degree vector is . We penalize excessive deviation of from . To be precise, we define the prior distribution as

(3) |

where is applied entrywise and is a multivariate (-dimensional) convex function that is non-decreasing in each argument whenever that argument is non-negative. We can now formulate our inference problem as an optimization problem.

###### Problem 1.

Given the observed data , we seek an adjacency matrix (symmetric, binary and zero-diagonal) and a parameter value that

Problem 1 can be solved by alternately maximizing the likelihood with respect to and . We set an initial guess for the parameter . In the -th iteration (), setting in Problem 1, we optimize the objective function over (this step is called the -step), denoting the maximizer by ; then, setting in Problem 1, we optimize the objective function over (this step is called the -step), denoting the maximizer by . The interested reader is referred to Algorithm 1 in [3]. Note that the parameter space is usually a subset of Euclidean space, and the optimization problem in the -step can be solved with off-the-shelf solvers. As a result, we focus on the -step; equivalently, we study how to solve Problem 1 assuming that is known.
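The alternating scheme can be sketched abstractly: each iteration performs one step maximizing over the discrete configuration with the parameter fixed, then one step maximizing over the parameter with the configuration fixed. The sketch below uses a toy objective and brute-force maximization in both steps purely to illustrate the control flow; it is not the paper's likelihood, and the grid and initial guess are arbitrary:

```python
# Alternating maximization: fix theta and maximize over the binary
# configuration a (the "A-step"), then fix a and maximize over theta
# (the "theta-step"), and repeat until the pair stabilizes.
import itertools

def toy_objective(a, theta):
    # stand-in for the log-posterior; NOT the RDS likelihood
    return -sum((ai - theta) ** 2 for ai in a) - 0.1 * sum(a)

thetas = [i / 10 for i in range(11)]          # discretized parameter space
configs = list(itertools.product([0, 1], repeat=4))

theta = 0.9                                   # initial guess
for _ in range(5):
    a = max(configs, key=lambda c: toy_objective(c, theta))   # A-step
    theta = max(thetas, key=lambda t: toy_objective(a, t))    # theta-step
print(a, theta)  # converges to a = (1, 1, 1, 1), theta = 1.0
```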

## 3 Proposed Method

We now present a network reconstruction algorithm, based on submodular variational inference, for respondent-driven sampling data. This method is referred to as Vine in this paper. We first introduce the definition of a submodular function [17, 15].

###### Definition 1.

A pseudo-Boolean function is *submodular* if , we have , where and denote entrywise logical conjunction and disjunction, respectively.

We can trivially identify the domain with , the power set of . Thus a pseudo-Boolean function can also be viewed as a set function . We will view from these two perspectives interchangeably throughout this paper. If we view as a set function, it is submodular if for every subset , we have . An equivalent definition is that is submodular if for every and , we have ; this is also known as the “diminishing returns” property because the marginal gain when an element is added to a subset is no less than the marginal gain when it is added to its superset.
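The lattice form of submodularity, f(A) + f(B) ≥ f(A ∪ B) + f(A ∩ B), can be checked exhaustively on a small ground set. The sketch below does so for a coverage function, a standard submodular example; the covered items are arbitrary illustrations:

```python
# Brute-force check of submodularity, f(A) + f(B) >= f(A | B) + f(A & B),
# for a small coverage function.
from itertools import chain, combinations

ground = [0, 1, 2]
covers = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}}

def f(S):
    # coverage: number of distinct items covered by the chosen sets
    return len(set.union(set(), *(covers[i] for i in S)))

subsets = list(chain.from_iterable(combinations(ground, r)
                                   for r in range(len(ground) + 1)))
for A in subsets:
    for B in subsets:
        A_, B_ = set(A), set(B)
        assert f(A_) + f(B_) >= f(A_ | B_) + f(A_ & B_)
print("coverage function is submodular on all", len(subsets) ** 2, "pairs")
```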

A pseudo-Boolean or set function is *log-submodular* if is submodular; it is *modular* if (if viewed as a pseudo-Boolean function) or, equivalently, (if viewed as a set function), where is called the weight of the element . It is *affine* if , where is modular and is some fixed real number; similarly, it is *log-affine* if is affine.

### 3.1 Removal of constraint in Problem 1

The formulation of Problem 1 makes clear that the network reconstruction problem is a constrained optimization problem. Recall that we have two constraints. The first is that the reconstructed subgraph must contain all edges already revealed by the RDS process. This constraint is natural: if a direct recruitment occurs between two subjects, then they must know each other in the underlying social network. The second is that the degree of each subject in the reconstructed subgraph must be bounded by the degree that the subject reports. In this section, we remove both constraints and cast the problem as an unconstrained one. The first constraint is easy to remove by considering only the edges unrevealed by the RDS process. The second is replaced by a penalty function, which allows some room for the degree in the inferred subgraph to deviate from the reported degree. After relaxing the two constraints, Problem 1 becomes an unconstrained problem amenable to submodular variational inference (to be discussed in Section 3.3).

Specifically, the first constraint requires some entries of to be ; if the -entry of (denoted by ) is , so is . Only the rest of the entries of can either be or and are the free entries. We collect the free entries in a binary vector

and view as a function of . In fact, there is a one-to-one correspondence between and . In this way, we remove the constraint . We now discuss how to relax the second (degree) constraint to allow small deviations from the (usually approximate) reported degrees.

Representing as a binary vector and thereby the objective function as a pseudo-Boolean function. We notice that is also a function of ; however, is an integer-valued vector rather than a binary vector. We consider representing as a binary vector and thereby casting into a pseudo-Boolean function. We observe that is bounded entrywise; to be precise, , , where . We can form an matrix such that the -th row of is the binary representation of ; formally,

(4) |

In this way, we represent as a pseudo-Boolean function of and . Let and define . Therefore is a pseudo-Boolean function of , whose dimension is . The likelihood function defines a probability measure over , , where is the normalizing constant.
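The binary encoding of the bounded integer vector in Eq. 4 can be sketched directly: each entry is expanded into a fixed-width row of bits, so the pendant-degree variables can be appended to the edge variables in a single binary vector. The bit ordering, widths, and example values below are illustrative assumptions:

```python
# Encode a bounded non-negative integer vector d as rows of bits (cf. Eq. 4),
# least-significant bit first: d[i] = sum_k B[i][k] * 2**k.

def to_bits(value, width):
    return [(value >> k) & 1 for k in range(width)]

def from_bits(bits):
    return sum(b << k for k, b in enumerate(bits))

d = [5, 0, 3, 7]                       # pendant-degree bounds per subject
width = max(d).bit_length()            # 3 bits suffice here
B = [to_bits(di, width) for di in d]   # one row of bits per subject
assert [from_bits(row) for row in B] == d   # encoding is invertible
print(B)  # [[1, 0, 1], [0, 0, 0], [1, 1, 0], [1, 1, 1]]
```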

### 3.2 Submodularity of log-likelihood function

Theorem 1 below shows that is log-submodular. We know that a submodular function can be approximated by affine functions from above and below. Due to the log-submodularity of , it can be approximated by two log-affine functions from above and below. The partition function of a probability distribution proportional to a log-affine function can be computed in closed form; therefore this distribution can be computed exactly, and we obtain upper and lower bounds on ; we can conduct variational inference via these two bounds. We elaborate on this in Section 3.3.

###### Theorem 1 (Proof in Appendix E).

The function is log-submodular; equivalently, there exists a submodular function such that for every .

Normalizing into . Ideally, we want a submodular function to be *normalized*; i.e., , or equivalently if viewed as a set function. Thus we define ; defined this way, is a normalized submodular function (a submodular function minus a constant). In addition, we define , and we have . Note that the probability measure is proportional to (up to a constant factor) and that differs from only by a constant factor. Therefore the probability measure remains proportional to ; thus is also a likelihood function and defines the same probability measure over as does. As a result, the probability measure can be expressed as , where is the normalizing constant, or the *partition function*.

### 3.3 Variational inference

Using a variational method, we consider bounding from above and below with affine functions. We want to find modular functions and and two real numbers and such that for all . If this holds for all , then we have the following inequality between log-partition functions: . We define the partition function of the affine function as . Using this notation, we have . Note that from this we may also obtain bounds on the marginal probability of each element . To be precise, if is sampled from the distribution , then the marginal probability satisfies . We may also use or as a surrogate for and perform inference via these two affine functions.

Suppose that we already have two affine functions and bounding from above and below. By Lemma 1 in [6], the log-partition function for in the unconstrained case is , where is the weight of element . Thus we have

So our goal is to find the upper- and lower-bound affine functions.
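The closed-form log-partition function of a log-affine model cited above from Lemma 1 of [6] can be verified numerically on a small instance: for p(x) ∝ exp(c + w·x) over x ∈ {0,1}^n, log Z = c + Σᵢ log(1 + exp(wᵢ)), and the marginals factorize into independent Bernoullis. The weights and offset below are arbitrary:

```python
# Closed-form log-partition function of a log-affine model over {0,1}^n,
# verified against brute-force enumeration; marginals are sigmoids of weights.
import itertools
import math

w = [0.7, -1.2, 0.3]
c = 0.5

closed_form = c + sum(math.log1p(math.exp(wi)) for wi in w)
brute = math.log(sum(math.exp(c + sum(wi * xi for wi, xi in zip(w, x)))
                     for x in itertools.product([0, 1], repeat=len(w))))
assert abs(closed_form - brute) < 1e-12

# marginals factorize: P(x_0 = 1) = 1 / (1 + exp(-w_0))
p0 = sum(math.exp(c + sum(wi * xi for wi, xi in zip(w, x)))
         for x in itertools.product([0, 1], repeat=len(w))
         if x[0] == 1) / math.exp(brute)
assert abs(p0 - 1 / (1 + math.exp(-w[0]))) < 1e-12
print(round(closed_form, 6))
```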

Lower-bound affine function. We define

and

where , , and . Then we have the affine function that assigns to a weight of . Let and .

###### Proposition 1 (Proof in Appendix F).

The affine function is a lower-bound function of the submodular function ; for all , .

Upper-bound affine function. We may find an upper-bound affine function for a submodular function via its supergradients. The set of supergradients of a submodular function at [16] is defined as

(5) |

If a modular function is a supergradient of at , then the affine function is an upper bound of . The corresponding log-partition function is

We consider three families of supergradients: the grow (), shrink (), and bar () supergradients at . Let us view as a set function, where . We define , where . These three supergradients are defined as follows. If , then and ; if , then and

###### Proposition 2 (Proof in Appendix G).

The modular functions , and are supergradients of the submodular function at .

Define the modular function

By Lemma 4 in [6], we know that these two optimization problems are equivalent:

The right-hand side is an unconstrained submodular minimization problem, which can be solved efficiently [17]. By solving this problem, we obtain a supergradient at and thus know its partition function . Then we compute the partition functions of the grow and shrink supergradients at and let be the one with the smallest partition function. The upper-bound affine function is then .
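The grow supergradient and its induced affine upper bound can be checked by enumeration on a small submodular function. The sketch below uses a coverage function and the grow definition from [16] (g(j) = f(j | X) for j ∉ X, and g(j) = f(j | V ∖ {j}) for j ∈ X); the ground set and covered items are illustrative:

```python
# Grow supergradient of a submodular set function f at X. The affine function
# f(X) + sum_{j in Y\X} g(j) - sum_{j in X\Y} g(j) upper-bounds f(Y) for all Y;
# checked here by enumeration on a small coverage function.
from itertools import chain, combinations

V = [0, 1, 2, 3]
covers = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}, 3: {"a", "d"}}

def f(S):
    return len(set.union(set(), *(covers[j] for j in S)))

def grow_supergradient(X):
    g = {}
    for j in V:
        if j in X:
            g[j] = f(set(V)) - f(set(V) - {j})   # gain of j on top of V \ {j}
        else:
            g[j] = f(X | {j}) - f(X)             # gain of j on top of X
    return g

X = {0, 2}
g = grow_supergradient(X)
subsets = [set(s) for s in chain.from_iterable(combinations(V, r)
                                               for r in range(len(V) + 1))]
for Y in subsets:
    bound = f(X) + sum(g[j] for j in Y - X) - sum(g[j] for j in X - Y)
    assert f(Y) <= bound + 1e-12
print("grow supergradient upper-bounds f at all", len(subsets), "subsets")
```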

Once we have an affine approximation of the submodular function, we may perform inference via the affine function. Suppose that the affine function is , which can be either an upper-bound or a lower-bound approximation. Recall that and , where and .

We can select a threshold and obtain by thresholding, . Thus we obtain an inferred adjacency matrix from . The proposed method Vine is presented in Algorithm 1.

## 4 Experiment

In this section, we evaluate the proposed variational inference algorithm experimentally. By varying from to , we obtain a series of inferred adjacency matrices . Suppose that the true adjacency matrix of is . The reconstruction performance of an inferred adjacency matrix is measured by the true positive rate (TPR) and the false positive rate (FPR), defined as and , where is the number of subjects. We plot the TPR and FPR of each on the ROC plane and obtain the ROC curve.

Figs. (a)a and (b)b show an example of the reconstruction result. In this example, we simulated an RDS process over a real-world network, the Project 90 graph, which represents the community structure of heterosexuals at high risk for HIV infection [25], with inter-recruitment time distribution (exponential distribution with rate ). In the simulation, we choose , and a single seed subject at the initial stage; each subject is given 3 coupons; 1176 missing edges are to be inferred. In practice, the sample size is usually fixed in advance (according to the researchers' study plan).

In Fig. (a)a, the blue ROC curve corresponds to the upper-bound inference. We choose to be the norm. The red curve is a baseline reconstruction given by estimating by . Since must be a subgraph of , the FPR of is zero; the red curve is obtained by connecting the point of the TPR of on the vertical axis to the point . The performance is quantified by the area under the ROC curve (larger is better). In this example, the area under the blue curve is and that under the red curve is ; the area under the blue curve is greater. With the best threshold, the algorithm achieves a TPR of 90% at an FPR of only 10%. In Fig. (b)b, the blue curve is the ROC curve of the lower-bound inference. We choose to be the norm. The red curve is a baseline given by . The area under the blue curve is , which is greater than that under the red curve.

Since the lower-bound approximation is obtained from a greedy algorithm while the upper-bound approximation is the solution to an optimization problem, we focus on the upper-bound inference in what follows.
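The TPR/FPR sweep and the area computation described above can be sketched as follows. The scores and labels are toy values rather than outputs of Vine, and the area under the curve is computed with the trapezoid rule:

```python
# Sweep a threshold over scored candidate edges, collect (FPR, TPR) points,
# and compute the area under the resulting ROC curve by the trapezoid rule.

def roc_points(scores, labels, thresholds):
    pts = []
    P = sum(labels)              # number of true edges
    N = len(labels) - P          # number of non-edges
    for t in thresholds:
        pred = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(pred, labels))
        fp = sum(p and not l for p, l in zip(pred, labels))
        pts.append((fp / N, tp / P))        # (FPR, TPR)
    return sorted(pts)

def auc(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2   # trapezoid between adjacent points
    return area

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]     # toy edge scores
labels = [1, 1, 0, 1, 0, 0]                 # 1 = true edge
thresholds = [i / 10 for i in range(11)]
points = roc_points(scores, labels, thresholds)
print(round(auc(points), 3))
```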

### 4.1 Experiments on Facebook network

Recall that in Eq. 3, can be any non-decreasing convex function. We may let be times the norm (, ). We now study the influence of different choices of .

Influence of . First we fix and vary from to . We simulated RDS processes over the Facebook network [22]. For each , we measure the area under the ROC curve of the upper-bound inference for each RDS dataset and illustrate the distribution with a Tukey boxplot, shown in Fig. (a)a. We also record the improvement in the area under the ROC curve of the upper-bound inference over that of ; boxplots are given in Fig. (b)b. We observe that the variational inference algorithm achieves remarkably high accuracy when .

Influence of . We let be and vary from to . RDS processes are simulated over the Facebook network. For each , we measure the area under the ROC curve of the upper-bound inference for each RDS dataset and illustrate the distribution with a Tukey boxplot in Fig. (c)c. We also record the improvement in the area under the ROC curve of the upper-bound inference over that of ; the corresponding boxplots are presented in Fig. (d)d. From Figs. (c)c and (d)d, we observe that accuracy improves as increases from to , while increasing further from to lowers the accuracy of the variational inference.

### 4.2 Experiments on large Project 90 and Epinion networks

We compare Vine with the simulated-annealing-based method proposed in [3] (referred to as SimAnn). Since SimAnn gives only a single point on the ROC plane rather than a curve, the reconstruction performance in this set of experiments is quantified by the distance from the output point to the upper left corner . The algorithm with the smaller distance is considered to attain better performance. We apply both methods to the large Project 90 network [25] and the Epinions social network [23]. The Epinions network characterizes the trust relationships between users of the general consumer review site Epinions.com.

Fig. (e)e shows the distance to the upper left corner versus the number of edges in , where the value is chosen to minimize the distance from the Vine ROC-curve point to the upper left corner ; we vary from tens of edges to edges. We generate many RDS processes with different sample sizes, sort them according to the number of edges in , and examine the reconstruction performance on these datasets. In this way, we plot how the reconstruction performance varies with the number of edges in . Fig. (g)g presents the distance to the upper left corner versus the number of subjects in the sample (the number of nodes in ). Vine outperforms SimAnn significantly on both datasets. Fig. (f)f presents the running time (in seconds) versus the number of edges in . SimAnn was implemented in C++, while Vine was written in the Julia language and can be parallelized. Vine runs three orders of magnitude faster than SimAnn when contains more edges, and Vine is more scalable to large graphs.

Fig. (h)h shows the distance to the upper left corner versus the number of seeds. We vary the number of seeds from 10 to 100 while fixing the sample size. Both algorithms achieve better reconstruction performance with more seeds, and for the same number of seeds, Vine attains remarkably better performance than SimAnn.

Fig. 4 visualizes the reconstruction result for a subnetwork of size 40 of the Epinions network. Only 6 recruitment links are revealed in ; out of edges (approximately ) are successfully recovered.

## References

- [1] A. Anandkumar, A. Hassidim, and J. Kelner. Topology discovery of sparse random graphs with few participants. ACM SIGMETRICS Performance Evaluation Review, 39(1):253–264, 2011.
- [2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
- [3] L. Chen, F. W. Crawford, and A. Karbasi. Seeing the unseen network: Inferring hidden social ties from respondent-driven sampling. In AAAI, pages 1174–1180, 2016.
- [4] F. W. Crawford. The graphical structure of respondent-driven sampling. Sociological Methodology, 46(1):187–211, 2016.
- [5] H. Daneshmand, M. Gomez-Rodriguez, L. Song, and B. Schölkopf. Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In ICML, 2014.
- [6] J. Djolonga and A. Krause. From MAP to marginals: Variational inference in Bayesian submodular models. In NIPS, December 2014.
- [7] M. Farajtabar, M. Gomez Rodriguez, M. Zamani, N. Du, H. Zha, and L. Song. Back to the past: Source identification in diffusion networks from partially observed cascades. In AISTATS, pages 232–240, 2015.
- [8] Q. Gao, F. Abel, G.-J. Houben, and Y. Yu. A comparative study of users’ microblogging behavior on Sina Weibo and Twitter. In UMAP, pages 88–101. Springer, 2012.
- [9] M. Gomez-Rodriguez, J. Leskovec, D. Balduzzi, and B. Schölkopf. Uncovering the structure and temporal dynamics of information propagation. Network Science, 2(1):26–65, 2014.
- [10] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In SIGKDD, pages 1019–1028. ACM, 2010.
- [11] M. Gomez-Rodriguez, J. Leskovec, and B. Schölkopf. Structure and dynamics of information pathways in on-line media. In WSDM, 2013.
- [12] G. Gordon and E. McMahon. A greedoid polynomial which distinguishes rooted arborescences. Proceedings of the AMS, 107(2):287–298, 1989.
- [13] S. Hanneke and E. P. Xing. Network completion and survey sampling. In AISTATS, pages 209–215, 2009.
- [14] D. D. Heckathorn. Respondent-driven sampling: a new approach to the study of hidden populations. Social Problems, 44(2):174–199, 1997.
- [15] R. Iyer and J. Bilmes. Polyhedral aspects of submodularity, convexity and concavity. arXiv preprint arXiv:1506.07329, 2015.
- [16] R. Iyer, S. Jegelka, and J. A. Bilmes. Fast semidifferential-based submodular function optimization. In ICML, 2013.
- [17] S. Jegelka, H. Lin, and J. A. Bilmes. On fast approximate submodular minimization. In NIPS, pages 460–468, 2011.
- [18] M. Kim and J. Leskovec. The network completion problem: Inferring missing nodes and edges in networks. In SDM, volume 11, pages 47–58, 2011.
- [19] M. A. Kramer, U. T. Eden, S. S. Cash, and E. D. Kolaczyk. Network inference with confidence from multivariate time series. Physical Review E, 79(6):061916, 2009.
- [20] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.
- [21] S. Linderman and R. Adams. Discovering latent network structure in point process data. In ICML, pages 1413–1421, 2014.
- [22] J. J. McAuley and J. Leskovec. Learning to discover social circles in ego networks. In NIPS, pages 548–556, 2012.
- [23] M. Richardson, R. Agrawal, and P. Domingos. Trust management for the semantic web. In ISWC, pages 351–368. Springer, 2003.
- [24] S. G. Shandilya and M. Timme. Inferring network topology from complex dynamics. New Journal of Physics, 13(1):013004, 2011.
- [25] D. E. Woodhouse, R. B. Rothenberg, J. J. Potterat, W. W. Darrow, et al. Mapping a social network of heterosexuals at high risk for HIV infection. AIDS, 8(9):1331–1336, 1994.

## Appendix

### C Likelihood of Recruitment Time Series

We consider the recruitment of the i-th subject. Recall from Section 2 that, just before each recruitment time, we keep track of the set of potential recruiters of each unrecruited subject and the set of potential recruitees of each recruiter.

We compute the likelihood of the i-th recruitment event (the recruitment of the i-th subject) in two cases: the subject enters the study via recruitment by a subject already in the study (in this case, the i-th subject is not a seed node), or via direct recruitment by the researchers (in this case, the i-th subject is a seed node).

Suppose the i-th subject is not a seed. Conditional on the previous recruitment events, the inter-recruitment time between the i-th subject and each of her potential recruiters exceeds the time already elapsed. Consider the random variables giving the next recruiter and the next recruitee, namely the subject who will be labeled as the i-th subject; we note here that the identity of the i-th subject is in fact random. We condition on this event throughout the derivation below.
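The randomness of the next recruiter-recruitee pair can be made concrete with a small simulation sketch. We assume independent exponential waiting times on susceptible links, a common modeling choice for continuous-time RDS recruitment (e.g., in [4]); the function name and the `rate` parameter are ours, for illustration only.

```python
import random

def next_recruitment(susceptible_edges, rate, rng=random):
    """One step of the recruitment race: draw an independent
    Exponential(rate) waiting time on each susceptible
    (recruiter, recruitee) edge; the edge achieving the minimum
    determines the next recruiter, the next recruitee, and the
    waiting time until the next recruitment event."""
    waits = {edge: rng.expovariate(rate) for edge in susceptible_edges}
    winner = min(waits, key=waits.get)
    return winner, waits[winner]
```

With a single susceptible edge the outcome is deterministic up to the waiting time; with several edges, both the recruiter and the recruitee of the next event are random, as noted above.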

We first compute the probability that a particular subject is the next (i-th) recruitee, that a particular recruiter is her recruiter, and that the inter-recruitment time between them is at least the elapsed time, conditional on the event above. Intuitively, we are computing the tail probability of the recruitment time of the i-th subject. We condition on this event because, having observed the first i − 1 recruitment events, we know that for every possible recruiter-recruitee pair in the next (i-th) recruitment event, the inter-recruitment time must be at least the elapsed time (otherwise the recruitment of that recruitee by that recruiter would have happened earlier, and the pair would no longer appear among the potential recruiters and recruitees). We have

(6)

Since the i-th recruitment event is that this recruiter recruits this recruitee, the inter-recruitment time along this link must be the minimum among the times along all other susceptible links. Therefore, in (6) we consider the waiting times along all competing links, and we require that

which means exactly that the recruitment time along this link is the minimum (smaller than the recruitment time along every other susceptible link).
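Under the exponential waiting-time assumption (our hedged reading of the model; the symbols `m`, `rate`, and `s` are ours), the joint event that one particular susceptible link attains the minimum and that its waiting time is at least s has a closed form: integrating rate·e^(−rate·w) for the winning link times e^(−rate·(m−1)·w) for the m − 1 surviving links over w ≥ s gives (1/m)·e^(−rate·m·s).

```python
import math
import random

def edge_wins_tail(m, rate, s):
    """P(a fixed one of m i.i.d. Exponential(rate) edge waiting
    times is the minimum AND that minimum is >= s).
    Closed form: (1/m) * exp(-rate * m * s)."""
    return math.exp(-rate * m * s) / m

def monte_carlo_check(m, rate, s, n=50_000, seed=0):
    """Crude Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        ws = [rng.expovariate(rate) for _ in range(m)]
        if ws[0] == min(ws) and ws[0] >= s:  # edge 0 wins and >= s
            hits += 1
    return hits / n
```

Note that summing this quantity over all m links gives e^(−rate·m·s), the tail probability of the minimum itself, which is exactly the marginalization step carried out below.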

Then we marginalize the probability in (6) over all possible recruitee-recruiter combinations. Recall that any current potential recruitee could be the i-th subject, and any of her potential recruiters could be the recruiter; therefore we sum over all such recruitee-recruiter pairs.

Using the notation introduced in Section 2 to rewrite and simplify the above expression, we obtain the likelihood of the i-th recruitment event in the non-seed case.
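Putting the pieces together, here is a hedged sketch of the per-event likelihood factor for a non-seed recruitment, again assuming i.i.d. Exponential(rate) waiting times on the m currently susceptible links (our notation, not the paper's; the paper's final expression uses its own Section 2 notation).

```python
import math

def log_event_likelihood(m, rate, wait):
    """Log-density of the i-th recruitment event: a particular
    susceptible link fires after waiting time `wait`, beating the
    other m - 1 links.  Density: rate * exp(-rate * wait) for the
    winning link, times exp(-rate * (m - 1) * wait) for the
    survivors, i.e. rate * exp(-rate * m * wait)."""
    return math.log(rate) - rate * m * wait

def log_series_likelihood(events, rate):
    """Sum the per-event factors over an observed recruitment
    series given as (num_susceptible_links, waiting_time) pairs."""
    return sum(log_event_likelihood(m, rate, w) for m, w in events)
```

The product of these per-event factors over the whole recruitment time series gives the likelihood used for inference.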
