Many phenomena can be modeled as information propagation in networks over time. Prominent examples include the spread of a disease through a population, the transmission of information through a distributed network, and the diffusion of scientific discoveries in academic networks. In all these scenarios, an isolated risk can become disastrous once it is amplified through diffusion in the network. Source detection is therefore critical for preventing the spread of malicious information and reducing the potential damage incurred.
In this paper, we study the source inference problem: given that a message has diffused through a network, can we tell which node is the source of the diffusion from observations collected at some time? The solution to this problem helps answer many questions with a common theme: Which computer was the first to be infected by a computer virus? Who first spread a piece of fake news in an online social network? Where is the origin of an epidemic? Which paper is the origin of a scientific rumor on a specific topic in an academic citation network?
While finding the source node has these important applications, the problem is known to be highly challenging, especially in complex networks. Prior studies mainly focus on the topology of the infected subgraph. Under the assumption that a full or partial snapshot of the infected nodes is observed at some time, topology-based estimators (such as rumor centrality and Jordan center) have been proposed under various diffusion models (Shah and Zaman, 2010, 2011, 2012; Zhu and Ying, 2016b; Chen et al., 2016; Luo et al., 2017; Nguyen et al., 2016; Zhu and Ying, 2014; Luo et al., 2014; Zhu et al., 2017; Zhu and Ying, 2016a). These estimators, unfortunately, often suffer from poor source detection accuracy and a high cost of obtaining the snapshot. Later, metadata such as the timestamps of infected nodes and the direction from which a node gets infected were exploited in the hope of improving localization precision (Pinto et al., 2012; Zhu et al., 2016; Tang et al., 2018). However, these methods typically assume a Gaussian-distributed transmission delay for each edge, which may be impractical for many applications such as the Bitcoin P2P network (Fanti and Viswanath, 2017) and mobile phone networks (Wang et al., 2009). In these networks, the transmission delay for each edge has been verified to follow a geometric distribution.
In this paper, we adopt the discrete-time susceptible-infected (SI) model. The network is assumed to be an undirected graph. Initially, only one node is infected at some unknown time. The infection then diffuses in the network via random interactions between neighboring nodes. We wish to locate the source node from an observation that consists of a set of sampled nodes together with their first infection timestamps. The sampled nodes are chosen uniformly at random. Given these partial timestamps, the question is which node is the information source.
To infer the information source from limited timestamps, one may seek a solution via a maximum likelihood (ML) estimator, as is widely done in prior work. However, such an estimator incurs exponential complexity. Instead, we develop an infection path based estimator in which the source is the root node of the most likely time labeled cascading tree consistent with the observed timestamps. In a tree graph, by establishing an equivalence between the infection path based estimator and a linear integer program, the estimator can be computed efficiently via message passing. In a general graph, to overcome the difficulty of searching an exponential number of infection paths, we incorporate a time labeled BFS heuristic to approximate the infection path based estimator using linear integer programming.
We remark that in our problem of interest, only limited timestamps and the locations of the sampled nodes are considered as the observation. This setting has many practical advantages over those using snapshot and direction information (Shah and Zaman, 2011; Luo et al., 2017; Zhu and Ying, 2016b). First, it is time consuming, and sometimes impossible, to collect the full snapshot of the infected nodes at some time. For example, Twitter’s streaming API only allows a small percentage of the full stream of tweets to be crawled. Second, the direction from which a susceptible node gets infected is sometimes hard to obtain. For example, in a flu outbreak a person often cannot tell with certainty who infected him or her. The same holds for anonymous social networks (Fanti et al., 2017, 2016), where the direction information is hidden. Finally, sampled nodes with timestamps contain more information than a partial snapshot, and are easy to access in most scenarios (such as online social networks).
The primary contributions are summarized as follows:
We propose an infection path based estimator to approximate the maximum likelihood estimator for detecting the information source. In a tree graph, this estimator is equivalent to a linear integer program that can be solved efficiently via message passing. By exploiting the properties of the linear integer program, we find a reduced search region that remarkably improves time efficiency. In a general graph, a time labeled BFS heuristic is incorporated to approximate the infection path based estimator.
We define a novel concept called candidate path to assist the analysis of the error distance between the true source and the estimated source on an arbitrary tree. Under the assumption that the limited timestamps are sampled uniformly at random, we provide a lower bound on the cumulative distribution function of this error distance by utilizing the conditional independence property on infinite regular trees. To the best of our knowledge, this is the first estimator with a provable performance guarantee under limited timestamps.
Extensive simulations over various networks are performed to verify the performance of the infection path based estimator. The error distance over regular trees is found to be within a constant and decreases as the degree of the tree becomes larger.
2. System Model
2.1. Infection Diffusion Model
Consider an undirected graph $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges of the form $(u, v)$ for some nodes $u$ and $v$ in $V$. We use the susceptible-infected (SI) model from epidemiology to characterize the infection diffusion process. Suppose that time is slotted. Let $I(t)$ denote the set of infected nodes at the end of time-slot $t$. Initially, only one node $s^* \in V$ gets infected at the beginning of some time-slot $t_0$. Thus $s^* \in I(t_0)$ and $I(t) = \emptyset$ for $t < t_0$. At the beginning of each time-slot $t > t_0$, each infected node attempts independently to infect each of its susceptible neighbors with success probability $q$. We define the first infection timestamp $t_v$ of node $v$ as the time-slot in which the state of node $v$ changes from susceptible to infected. Formally, $t_v$ is given by $t_v = \min\{t : v \in I(t)\}$.
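The diffusion process described above can be illustrated with a short simulation. This is a minimal sketch, not code from the paper: the function name `simulate_si`, the adjacency-dictionary representation, and the parameters `q` (per-slot infection success probability), `t0` (starting time-slot), and `num_slots` are assumptions made for illustration. It also assumes that a node infected in one time-slot starts its own infection attempts in the next slot.

```python
import random


def simulate_si(adj, source, q, t0=0, num_slots=20, seed=0):
    """Simulate discrete-time SI diffusion on an undirected graph.

    adj:    dict mapping each node to a list of its neighbors (assumed format)
    source: the node infected at time-slot t0
    q:      probability that one infection attempt succeeds in a slot
    Returns a dict mapping each infected node to its first infection timestamp.
    """
    rng = random.Random(seed)
    first_infected = {source: t0}
    for t in range(t0 + 1, t0 + num_slots + 1):
        newly = set()
        # Every node infected before slot t attempts each susceptible neighbor.
        for u in first_infected:
            for v in adj[u]:
                if v not in first_infected and rng.random() < q:
                    newly.add(v)
        # Nodes infected in slot t only start attempting in slot t + 1.
        for v in newly:
            first_infected[v] = t
    return first_infected
```

Under this sketch, the delay along each edge is the number of attempts until the first success, so it is at least one slot and grows as `q` shrinks. Sampling a subset of the returned dictionary gives the kind of partial timestamps that the rest of the paper works with.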
2.2. The Source Inference Problem
Under the above SI-based infection diffusion model, we would like to locate the source node using some observations of the infection diffusion process. We denote the observations until some time-slot $t$ as $\mathcal{O}$, the detailed specification of which will be given in Section 2.3. The source inference problem can be formulated as the maximum-a-posteriori (MAP) estimation problem
$$\hat{s} = \arg\max_{s \in V} \Pr(s \mid \mathcal{O}),$$
where $\hat{s}$ is the inferred source node. Since we do not know a priori from which source the diffusion started, it is natural to assume a uniform prior probability of the source node among all nodes. Under this setup, the MAP estimation is equivalent to the maximum likelihood (ML) estimation problem given by
$$\hat{s} = \arg\max_{s \in V} \Pr(\mathcal{O} \mid s).$$
2.3. Detection Model
At some time-slot, we realize that an infection has diffused through the network. In order to estimate the source node, we first sample some nodes and obtain their first infection timestamps. We then use a source localization algorithm to infer the source node. Thus, the source inference consists of two stages: 1) sampling nodes and 2) estimating the source using the sampled timestamps.
In this paper we do not address how the nodes are sampled, but focus on source detection given the sampled nodes and their timestamps. Using these observations, the ML estimator can be written as
However, the likelihood in Eq.(1) is difficult to compute in general. To see this, we first define the cascading tree and the labeled cascading tree, which describe the diffusion paths from a source node to a set of destination nodes.
Definition 2.1 (Cascading Tree).
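Given a source node $s$ and a set of destination nodes $D$ in graph $G$, the cascading tree $T$ is a directed subtree in $G$ rooted at $s$ satisfying
$T$ spans the nodes in $D$, i.e., $D \subseteq V(T)$;
For any $v \in V(T)$, if $d^{\mathrm{out}}_T(v) = 0$ then $v \in D$;
$d^{\mathrm{in}}_T(s) = 0$ and $d^{\mathrm{in}}_T(v) = 1$ for any $v \in V(T) \setminus \{s\}$,
where $d^{\mathrm{out}}_T(v)$ and $d^{\mathrm{in}}_T(v)$ are the out-degree and in-degree of node $v$ in directed subtree $T$, respectively. The set of cascading trees for source node $s$ and destination nodes $D$ is denoted as $\mathcal{T}(s, D)$.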
Definition 2.2 (Labeled Cascading Tree).
Given any cascading tree , consider any mapping from its nodes to time domains where denotes the first infection timestamp of node . We call a permitted timestamp for cascading tree if for each node . The cascading tree associated with permitted timestamps is called labeled cascading tree . The set of labeled cascading tree for source node and destination nodes is denoted as .
To understand the above two definitions in the context of diffusion process, as shown in Figure 1 we consider a grid graph in which two possible cascading trees and are highlighted. The node refers to the root node 1 of the cascading trees, and sampled nodes . In each cascading tree, the parent node of represents the node from which first gets infected. The cascading tree with permitted timestamps recovers the infection process starting from node 1.
Based on labeled cascading tree, the likelihood in Eq.(1) could be decomposed as
where . It is challenging to compute the likelihood in Eq.(2) because the summation is taken over all labeled cascading trees, and even counting the number of permitted labeled cascading trees has been shown to be #P-hard (Brightwell and Winkler, 1991).
3. Infection Path Based Source Localization
In our approximate solution, we shall treat both the infection starting time and the labeled cascading tree as variables to be jointly estimated with source node. After sampling nodes , in second stage, we want to identify the infection path that most likely leads to , i.e.,
where denotes the set of all permitted labeled cascading trees which are consistent with observed timestamps . The source node associated with is then viewed as the source node. We call the estimated source node infection path based estimator because it is the source node of the most likely time labeled cascading tree that explains the observed limited timestamps.
However, the optimization problem in Eq.(3) is still not easy to solve because of the large number of possible cascading trees involved. Below, we propose a two-step solution. First, we fix the cascading tree rooted at a candidate node and maximize the likelihood of the infection path over all permitted timestamps to find the most likely time labeled cascading tree. Second, we maximize the likelihood of the infection path over all possible cascading trees to find the most likely infection path. This gives an exact solution for general trees and a heuristic for general graphs.
3.1. Infection Path Likelihood Computation in General Trees
In this section we solve the first step, i.e., compute the most likely permitted timestamps associated with the cascading tree that are consistent with the observations , given by
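Let $\tau_{uv}$ denote the transmission delay for edge $(u, v)$ under the infection diffusion model. Clearly, $\{\tau_{uv}\}$ is a collection of i.i.d. random variables following a geometric distribution, i.e., $\Pr(\tau_{uv} = k) = (1-q)^{k-1} q$ for $k = 1, 2, \dots$, where $q$ is the per-slot infection success probability. The logarithm of the likelihood can be decomposed in terms of the delays $\tau_{uv}$ on a general tree as follows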
where is a directed edge in from to . Given the cascading tree , both and are fixed for all permitted timestamps . By combining Eq.(4) and Eq.(5) we can easily verify that the optimization problem in Eq.(4) is equivalent to following linear integer programming (LIP):
where is a collection of timestamps for nodes in . Note that the LIP(6) may be infeasible, in which case there is no permitted timestamps for the cascading tree under the constraints of partial timestamps . In other words, the infeasibility of LIP(6) indicates that the probability for any timestamps is 0 given partial timestamps .
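As a reading aid, the link between the geometric delays and the linear integer program can be written out explicitly. The block below is a sketch under stated assumptions rather than a reproduction of Eq.(5) and LIP(6): the symbols $\tau_{uv}$, $t_v$, $S$ (sampled nodes), $\hat{t}_v$ (observed timestamps), and the constant $C$ are notation chosen here, and $C$ is assumed to collect every term that does not depend on the timestamps for a fixed cascading tree $T$.

```latex
% Geometric delay on a tree edge (u, v), with \tau_{uv} = t_v - t_u:
\begin{align*}
\Pr(\tau_{uv} = k) &= (1-q)^{k-1} q, \qquad k = 1, 2, \dots \\
\log \Pr(\text{delays on } T)
  &= \log(1-q) \sum_{(u,v) \in E(T)} (t_v - t_u) + C .
\end{align*}
% Since \log(1-q) < 0, maximizing the log-likelihood over permitted
% timestamps amounts to minimizing the total delay:
\begin{align*}
\min_{t} \quad & \sum_{(u,v) \in E(T)} (t_v - t_u) \\
\text{s.t.} \quad & t_v - t_u \ge 1 \quad \forall (u,v) \in E(T), \\
 & t_v = \hat{t}_v \quad \forall v \in S, \\
 & t_v \in \mathbb{Z} \quad \forall v \in V(T).
\end{align*}
```

Under these assumptions the program is infeasible exactly when some observed timestamp cannot be reached with a delay of at least one slot per edge, which matches the infeasibility remark above.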
Note that the objective function of LIP(6) is the sum of transmission delays over all edges of . The intuition of LIP(6) is to minimize the total transmission delays over all edges of under the constraints of limited timestamps . If we plug the constraints into the objective function of LIP(6), then
where are the in-degree and out-degree of node , respectively, on cascading tree . Note that for any node , since is non-leaf node. According to the definition of the cascading tree, we must have . It implies that . Therefore, to minimize the objective function of LIP(6), we shall assign the largest possible timestamps to nodes in .
This can be done by having each node pass two messages up to its parent. The first message is the virtual timestamp of node , which we denote as . The second message is the aggregate of the transmission delays of the edges , which we denote as . Here refers to the directed subtree of that is rooted at and points away from . The details of message passing are included in Algorithm 1, the time complexity of which is . And the optimality of message passing in solving LIP(6) is established in Proposition 3.1.
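The bottom-up idea can be sketched in a few lines. Algorithm 1 itself is not reproduced in this section, so the following is a hypothetical illustration rather than the authors' procedure: it assumes every leaf of the cascading tree is a sampled node, that each edge needs a delay of at least one slot, and that every unsampled node should receive the largest timestamp its children allow. The function name `latest_feasible_timestamps` and the `children` and `observed` dictionaries are made up for this example.

```python
def latest_feasible_timestamps(children, root, observed):
    """Assign the latest feasible timestamps on a rooted cascading tree.

    children: dict mapping a node to its list of children (edges point
              away from the root); leaves may be absent from the dict
    observed: dict mapping each sampled node to its observed timestamp
    Returns (timestamps, total_delay), or None if no labeling satisfies
    the observed timestamps with a delay of at least one slot per edge.
    """
    # Pre-order traversal; reversing it visits children before parents.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))

    t, total_delay = {}, 0
    for u in reversed(order):
        kids = children.get(u, [])
        if not kids:
            if u not in observed:
                raise ValueError(f"leaf {u!r} must be a sampled node")
            t[u] = observed[u]
            continue
        # Message from the children: the latest slot u may be infected in.
        latest = min(t[c] for c in kids) - 1
        if u in observed:
            if observed[u] > latest:
                return None  # observed timestamps are inconsistent
            t[u] = observed[u]
        else:
            t[u] = latest
        # Aggregate the delays on the edges from u to its children.
        total_delay += sum(t[c] - t[u] for c in kids)
    return t, total_delay
```

For a root `r` with children `a` and `b`, where `a` has a sampled child `c` observed at 4 and `b` is sampled at 5, the sketch labels `a` with 3 and `r` with 2, giving a total delay of 5. Each node is visited once, so the pass is linear in the size of the tree.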
Proposition 3.1 (Optimality of Algorithm 1).
The proof is included in Appendix A.
3.2. Source Localization on a Tree
After computing the most likely timestamp for a fixed cascading tree, according to infection path based estimator in Eq.(3) we need to search over all cascading trees to find the most likely labeled cascading tree . When the underlying graph is a tree, there is only one cascading tree rooted at node since no cycle exists. Then the estimator is simply
where the inner maximization over is to find the most likely labeled cascading tree given , and the outer maximization over is to identify the source with most likely infection path.
To reduce the search region, we partition the underlying tree according to the infection path likelihood . As shown in Figure 2, the underlying tree is partitioned into four disjoint regions: , , , and . In the following we will show in three steps that
The first step is to show . Observe that in Figure 2, where is the minimum Steiner tree spanning in the underlying tree.
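On a tree, the minimum Steiner tree spanning a set of sampled nodes is simply the smallest subtree containing them, and it can be found by repeatedly pruning leaves that are not sampled. The sketch below illustrates this standard pruning idea; the function name `steiner_subtree` and the adjacency-dictionary input are assumptions for illustration and do not come from the paper.

```python
def steiner_subtree(adj, sampled):
    """Return the node set of the minimal subtree of a tree spanning `sampled`.

    adj:     dict mapping each node of an undirected tree to its neighbors
    sampled: iterable of nodes that must be kept
    """
    sampled = set(sampled)
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    alive = set(adj)
    # Start from unsampled nodes that are leaves (or isolated).
    prune = [u for u in adj if degree[u] <= 1 and u not in sampled]
    while prune:
        u = prune.pop()
        alive.discard(u)
        for v in adj[u]:
            if v in alive:
                degree[v] -= 1
                # v became an unsampled leaf, so it can be pruned as well.
                if degree[v] <= 1 and v not in sampled:
                    prune.append(v)
    return alive
```

For the path 0-1-2-3-4 with an extra leaf 5 attached to node 2 and sampled nodes 1 and 3, the function keeps the nodes 1, 2, and 3.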
Lemma 3.2 ().
When the underlying graph is a tree, for any true source , any infection probability , and any observed partial timestamps , we have
for any node .
Now we assume that LIP(6) for cascading tree is feasible, and its optimal value is given by
where is the virtual timestamp of node . According to the definition of the cascading tree, . Since and is a directed tree without cycle, there must be a node connecting node with other node in . Such node can be found by . And then and . Note that cascading tree is minimum Steiner tree whose edges are directed. And where denotes the subtree of that is rooted at and points away from . According to Appendix A.1, we have for any node where is the virtual timestamp of node when running Algorithm 1 for cascading tree . Then
The second step is to show that for any node . We first give some definitions that could help characterize , , and .
Definition 3.3 ().
When the underlying graph is a tree, for each node , we define the distance between and with respect to sampled nodes to be the number of sampled nodes on path , i.e.,
Lemma 3.4 ().
When the underlying graph is a tree, for any node , we have
If , there are at least two distinct nodes such that . It implies that . Now consider the LIP(6) for cascading tree . Assume that is one permitted timestamps satisfying all the constraints of LIP(6) for cascading tree . For node and we have and . Note that
which violates the fact that . This contradiction indicates that LIP(6) for cascading tree is infeasible which means that
The third step is to show that . It suffices to argue that for any node , there is a node such that . Note that for any node , and . It suffices to argue Lemma 3.5.
Lemma 3.5 ().
When the underlying graph is a tree, for any node , if and we have
where is the unique sampled node on path .
We defer the proof of Lemma 3.5 to Appendix B. Combining Lemmas 3.2, 3.4, and 3.5, we conclude that the likelihood of the time labeled cascading tree is larger when it is rooted at nodes around the true source, as stated in Proposition 3.6.
Proposition 3.6 ().
When the underlying graph is a tree, we have
According to Proposition 3.6, we could reduce the search region from to for infection path based estimator. However, it seems to be impractical due to lack of prior knowledge of where the true source is. Therefore, we seek for another region such that and could be obtained from partial timestamps , the sampled set , and topology of underlying tree graph. Intuitively, the region should be close to the sampled node with the minimum timestamp. We verify this intuition in Lemma 3.7 and define the region in Proposition 3.8.
Lemma 3.7 ().
Let denote any sampled node with minimum observed timestamp (ties broken arbitrarily), then .
Since is a sampled node with minimum observed timestamp, there cannot be any other sampled node on the path . Therefore, which implies that . ∎
Proposition 3.8 ().
Let be the sampled node with minimum observed timestamp. Let , then
It suffices to prove that . We consider two cases.
(1) Consider the case where , then and . Apparently .
(2) Consider the case where . For any node , if then therefore which implies that . If , then there must exist at least one sampled node such that node is on the path . Note that and , therefore . ∎
Note that the in Proposition 3.8 could be computed via breadth-first search starting from . The details are given in Algorithm 2. Note that the most time consuming part is breadth-first search starting from node , therefore the time complexity of Algorithm 2 is . Given , we could find the infection path based estimator
using message-passing algorithm. The details are shown in Algorithm 3, the time complexity of which is .
3.3. Source Localization on General Graphs
Locating the source on a general graph is challenging because there is an exponential number of possible cascading trees for each node. To avoid such a combinatorial explosion, we follow a time labeled BFS heuristic, which is presented in Algorithm 4. Starting from a node , we do a breadth-first search to construct a time labeled BFS tree. Specifically, we assign each node a time label . Initially if the starting node , we set . Otherwise, which represents an extremely small value. When a node is explored from a directed edge , if and we add directed edge to BFS tree and set . If we still add directed edge to BFS tree and set . The whole process terminates either when all the edges are explored or when are included in the BFS tree. Note that the resulting BFS tree may not contain all the sampled nodes . Intuitively, it is less likely for to be the source if contains fewer sampled nodes. Therefore, we use a threshold to rule out those “unlikely” nodes. In practice, the threshold needs to be tuned to avoid the extreme case where all nodes are ruled out. Since a breadth-first search is executed for each node, the time complexity is .
4. Performance Guarantee
Although the infection path based estimator in Eq.(3) is only an approximation of the original ML estimator, we prove in this section that it can still achieve provably good performance under certain topologies. Specifically, we assume that the underlying graph is a tree, and we present the performance guarantee for the source localization algorithm in terms of the distribution of the error distance, that is, the distance between the true source and the estimated source on the tree. Assuming that the true source is given, we introduce a topological concept called candidate path and show that the infection path based estimator is always on that path. By means of the candidate path, we are able to analyze the distribution of the error distance under the assumption that the sampled nodes are chosen uniformly at random.
4.1. Candidate Path
According to Proposition 3.6, the infection path based estimator is
therefore, the estimated source even though we do not know in prior. If we look at the definition of
it is easy to find that only depends on the topology of , and . If we could utilize the observed timestamps , it is possible to define a tighter region that could help us analyze the distribution of . Especially, if we have
Proposition 4.1 ().
When the underlying graph is a tree, if , we have .
If , then , it implies that . For any node , , then by Lemma 3.5
From now on we assume that .
Lemma 4.2 ().
When the underlying graph is a tree, if , the infection path based estimator is
It suffices to prove that
For any node , if then . If , then there must be a node such that . Therefore, for any node , we have . ∎
Definition 4.3 (Anchor Node of ).
For true source node , we define its anchor node as
Definition 4.4 (Candidate Path).
The candidate path is defined as the intersection of paths from anchor node to sampled node , i.e.,
where is given by