Information Source Detection with Limited Time Knowledge

This paper investigates the problem of utilizing network topology and partial timestamps to detect the information source in a network. Canonical maximum likelihood estimation (MLE) of the source is prohibitively costly because the number of possible infection paths is exponential. Our main idea is therefore to approximate the MLE by an alternative infection path based estimator, which identifies the most likely infection path that is consistent with the observed timestamps. The source node associated with that infection path is taken as the estimated source v̂. We first study tree topologies, where transforming the infection path based estimator into a linear integer program yields a reduced search region that remarkably improves time efficiency. Within this reduced search region, the estimator v̂ provably always lies on a path that we term the candidate path. This notion enables us to analyze the distribution of d(v^∗,v̂), the error distance between v̂ and the true source v^∗, on an arbitrary tree, and thereby to obtain, for the first time in the literature, a provable performance guarantee for the estimator under limited timestamps. Specifically, on the infinite g-regular tree with uniformly sampled timestamps, we obtain a refined performance guarantee in the sense of a constant-bounded d(v^∗,v̂). By virtue of a time labeled BFS tree, the estimator still performs fairly well when extended to more general graphs. Experiments on both synthetic and real datasets further demonstrate the superior performance of our proposed algorithms.

1. Introduction

Many phenomena can be modeled as information propagation in networks over time. Prevalent examples include the spread of a disease through a population, the transmission of information through a distributed network, and the diffusion of scientific discoveries in an academic network. In all these scenarios, the consequences can be disastrous once an isolated risk is amplified through diffusion in the network. Source detection is therefore critical for preventing the spread of malicious information and reducing the potential damage incurred.

In this paper, we study the source inference problem: given that a message has been diffused in a network, can we tell which node is the source of the diffusion, given some observations at a certain time? The solution to this problem can help us answer many questions of a common theme: Which computer was the first one infected by a computer virus? Who first spread the fake news in an online social network? Where is the origin of an epidemic? And which paper started a scientific rumor on a specific topic in an academic citation network?

While finding the source node has these important applications, the problem is known to be highly challenging, especially in complex networks. Prior studies mainly focus on the topology of the infected subgraph. Under the assumption that a full or partial snapshot of the infected nodes is observed at some time, topology based estimators (such as rumor centrality and the Jordan center) have been proposed under various diffusion models (Shah and Zaman, 2010, 2011, 2012; Zhu and Ying, 2016b; Chen et al., 2016; Luo et al., 2017; Nguyen et al., 2016; Zhu and Ying, 2014; Luo et al., 2014; Zhu et al., 2017; Zhu and Ying, 2016a). These estimators, unfortunately, often suffer from poor source detection accuracy and a high cost of obtaining the snapshot. Later on, metadata such as the timestamps of infected nodes and the direction from which a node gets infected was exploited in the hope of improving the localization precision (Pinto et al., 2012; Zhu et al., 2016; Tang et al., 2018). However, these works typically assume a Gaussian-distributed transmission delay for each edge, which may be impractical for many applications such as the Bitcoin P2P network (Fanti and Viswanath, 2017) and mobile phone networks (Wang et al., 2009). In these networks, the transmission delay for each edge has been verified to follow a geometric distribution.

In this paper, we adopt the discrete-time susceptible-infected (SI) model. The network is assumed to be an undirected graph. Initially, only one node is infected at some unknown time. The infection then diffuses in the network via random interactions between neighboring nodes. We wish to locate the source node from an observation that consists of a set of sampled nodes together with their first infection timestamps. The sampled nodes are drawn uniformly at random. Given these partial timestamps, the question is which node is the information source.

In order to infer the information source using limited timestamps, one may seek a solution via an ML estimator, as is widely done in prior art. However, such an estimator incurs exponential complexity. Instead, we develop an infection path based estimator in which the source is the root node of the most likely time labeled cascading tree consistent with the observed timestamps. In a tree graph, by establishing an equivalence between the infection path based estimator and a linear integer program, the estimator can be computed efficiently via message passing. In a general graph, to overcome the difficulty of searching an exponential number of infection paths, we incorporate a time labeled BFS heuristic to approximate the infection path based estimator using linear integer programming.

We remark that in our problem of interest, only limited timestamps and the locations of the sampled nodes are considered as the observation. This setting has several practical advantages over those using snapshot and direction information (Shah and Zaman, 2011; Luo et al., 2017; Zhu and Ying, 2016b). First, it is time consuming, and sometimes impossible, to collect the full snapshot of the infected nodes at some time. For example, Twitter’s streaming API only allows a small percentage of the full stream of tweets to be crawled. Second, the direction from which a susceptible node gets infected is sometimes hard to obtain. For example, in a flu outbreak a person often cannot tell with certainty who infected him or her. The same goes for anonymous social networks (Fanti et al., 2017, 2016), where the direction information is hidden. Finally, sampled nodes with timestamps contain more information than a partial snapshot and are easy to access in most scenarios (such as online social networks).

The primary contributions are summarized as follows:

  • We propose an infection path based estimator to approximate the maximum likelihood estimator in detecting the information source. In a tree graph, this estimator is equivalent to a linear integer programming that can be efficiently solved via message passing approaches. By exploiting the property of linear integer programming, we find a reduced search region that remarkably improves the time efficiency. In a general graph, a time labeled BFS heuristic is incorporated to approximate the infection path based estimator.

  • We define a novel concept called candidate path to assist the analysis of the error distance d(v^∗,v̂) between the true source v^∗ and the estimated source v̂ on an arbitrary tree. Under the assumption that the limited timestamps are sampled uniformly at random, we provide a lower bound on the cumulative distribution function of d(v^∗,v̂) by utilizing the conditional independence property on infinite g-regular trees. To the best of our knowledge, this is the first estimator with a provable performance guarantee under limited timestamps.

  • Extensive simulations over various networks are performed to verify the performance of the infection path based estimator. The error distance over g-regular trees is found to be within a constant and decreases when becomes larger.

The rest of this paper is organized as follows: We describe the system model in Section 2. The algorithm for computing the estimator is presented in Section 3. We discuss the performance of the estimator in Section 4. Simulations and experiments are shown in Section 5, and we conclude in Section 6.

2. System Model

2.1. Infection Diffusion Model

Consider an undirected graph where is the set of nodes and is the set of edges of the form for some node and in . We use the susceptible-infected SI model in epidemiology to characterize the infection diffusion process. Suppose that time is slotted. Let denote the set of infected nodes at the end of time-slot . Initially only one node gets infected at the beginning of some time-slot . Thus and for . At the beginning of each time-slot , each infected node attempts independently to infect each of its susceptible neighbors with success probability . We define the first infection timestamp of node as the time-slot in which the state of node changes from susceptible to infected. Formally, is given by
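The diffusion process described in this subsection can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: the function name `simulate_si`, the adjacency-dictionary representation, the finite `horizon`, and the convention that the source is infected in time-slot 0 are all assumptions made for illustration. In each slot, every node that was infected before the slot tries each susceptible neighbor independently with success probability `p`, and a node's first infection timestamp is the slot in which it first becomes infected.

```python
import random

def simulate_si(adj, source, p, horizon, seed=0):
    """Discrete-time SI diffusion on an undirected graph.

    adj: dict mapping each node to a list of its neighbors (assumed format)
    Returns a dict {node: first infection time-slot}; the source gets slot 0.
    """
    rng = random.Random(seed)
    first_infected = {source: 0}
    for t in range(1, horizon + 1):
        newly = set()
        # Only nodes infected before slot t may transmit during slot t.
        for u in list(first_infected):
            for v in adj[u]:
                # One independent attempt per (infected, susceptible) pair.
                if v not in first_infected and rng.random() < p:
                    newly.add(v)
        for v in newly:
            first_infected[v] = t
    return first_infected
```

Sampling a subset of the keys of the returned dictionary, together with their values, gives one way to produce the partial timestamps that serve as the observation in the rest of the paper.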

2.2. The Source Inference Problem

Under the above SI-based infection diffusion model, we would like to locate the source node using some observations of the infection diffusion process. We denote the observations until some time-slot as , the detailed specification of which will be given in Section 2.3. The source inference problem can be formulated as the maximum-a-posteriori (MAP) estimation problem as

where

is the inferred source node. Since we do not know a priori from which source the diffusion started, it is natural to assume a uniform prior probability of the source node among all nodes

. Following this set up, the MAP estimation is equivalent to maximum likelihood (ML) estimation problem given by
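The step from MAP to ML estimation can be written out explicitly. The display equations of this subsection were lost in extraction, so the notation below is assumed rather than taken from the original: V denotes a finite node set, \mathcal{O} the observations, and \hat{v} the inferred source. With a uniform prior \Pr(v) = 1/|V|, Bayes' rule gives

```latex
\begin{align*}
\hat{v}_{\mathrm{MAP}}
  &= \arg\max_{v \in V} \Pr(v \mid \mathcal{O}) \\
  &= \arg\max_{v \in V} \frac{\Pr(\mathcal{O} \mid v)\,\Pr(v)}{\Pr(\mathcal{O})} \\
  &= \arg\max_{v \in V} \Pr(\mathcal{O} \mid v)
   = \hat{v}_{\mathrm{ML}},
\end{align*}
```

since neither \Pr(v) nor \Pr(\mathcal{O}) depends on the candidate v.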

2.3. Detection Model

At some time-slot , we realize that an infection has diffused in network . In order to estimate the source node , we first sample some nodes and obtain their first infection timestamps . We then use a source localization algorithm to infer the source node. Thus, source inference consists of two stages: 1) sampling and 2) estimating the source using .

In this paper we do not address the sampling of nodes , but focus on source detection given and . Using the observations , the ML estimator can be written as

(1)

However, the likelihood in Eq.(1) is difficult to compute in general. To see this, we first give definitions of cascading tree and labeled cascading tree, which explain the diffusion path from a source node to any other destination nodes.

Definition 2.1 (Cascading Tree).

Given a source node and a set of destination nodes in graph , the cascading tree is a directed subtree in rooted at satisfying

  1. spans nodes , i.e., ;

  2. For any , if then ;

  3. and for any .

where and are the out-degree and in-degree, respectively, in the directed subtree . The set of cascading trees for source node and destination nodes is denoted as .

Definition 2.2 (Labeled Cascading Tree).

Given any cascading tree , consider any mapping from its nodes to time domains where denotes the first infection timestamp of node . We call a permitted timestamp for cascading tree if for each node . The cascading tree associated with permitted timestamps is called a labeled cascading tree . The set of labeled cascading trees for source node and destination nodes is denoted as .

To understand the above two definitions in the context of diffusion process, as shown in Figure 1 we consider a grid graph in which two possible cascading trees and are highlighted. The node refers to the root node 1 of the cascading trees, and sampled nodes . In each cascading tree, the parent node of represents the node from which first gets infected. The cascading tree with permitted timestamps recovers the infection process starting from node 1.

Figure 1. Illustration of (labeled) cascading tree.

Based on labeled cascading tree, the likelihood in Eq.(1) could be decomposed as

(2)

where . It is challenging to compute the likelihood in Eq.(2) because the summation is taken over all labeled cascading trees, and even counting the number of permitted labeled cascading trees has been shown to be #P-hard (Brightwell and Winkler, 1991).

As an alternative, in Section 3 we will propose an approximate solution that jointly estimates and labeled cascading tree together. This approach, as will be further demonstrated in Section 4, leads to provably good performance for tree topologies.

3. Infection Path Based Source Localization

In our approximate solution, we treat both the infection starting time and the labeled cascading tree as variables to be jointly estimated with the source node. After sampling nodes , in the second stage, we want to identify the infection path that most likely leads to , i.e.,

(3)

where denotes the set of all permitted labeled cascading trees which are consistent with observed timestamps . The source node associated with is then viewed as the source node. We call the estimated source node infection path based estimator because it is the source node of the most likely time labeled cascading tree that explains the observed limited timestamps.

However, the optimization problem in Eq.(3) is still not easy to solve due to the large number of possible cascading trees involved. Below, we propose a two-step solution. First, we fix the cascading tree rooted at node , and maximize the likelihood of the infection path over all permitted timestamps to find the most likely time labeled cascading tree. Second, we maximize the likelihood of the infection path over all possible cascading trees to find the most likely infection path . This gives an exact solution for general trees and a heuristic for general graphs.

3.1. Infection Path Likelihood Computation in General Trees

In this section we solve the first step, i.e., compute the most likely permitted timestamps associated with the cascading tree that are consistent with the observations , given by

(4)

Let denote the transmission delay for edge under the infection diffusion model. It is obvious that is a collection of i.i.d. random variables following a geometric distribution, i.e., for . The logarithm of the likelihood can be decomposed in terms of in a general tree as follows

(5)

where is a directed edge in from to . Given the cascading tree , both and are fixed for all permitted timestamps . By combining Eq.(4) and Eq.(5) we can easily verify that the optimization problem in Eq.(4) is equivalent to the following linear integer programming (LIP):

(6)

where is a collection of timestamps for nodes in . Note that the LIP(6) may be infeasible, in which case there are no permitted timestamps for the cascading tree under the constraints of partial timestamps . In other words, the infeasibility of LIP(6) indicates that the probability for any timestamps is 0 given partial timestamps .

Note that the objective function of LIP(6) is the sum of transmission delays over all edges of . The intuition of LIP(6) is to minimize the total transmission delays over all edges of under the constraints of limited timestamps . If we plug the constraints into the objective function of LIP(6), then

(7)

where are the in-degree and out-degree of node , respectively, on cascading tree . Note that for any node , since is non-leaf node. According to the definition of the cascading tree, we must have . It implies that . Therefore, to minimize the objective function of LIP(6), we shall assign the largest possible timestamps to nodes in .
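The two facts used above can be checked numerically: under i.i.d. geometric edge delays the log-likelihood of a labeled cascading tree is an affine, decreasing function of the total delay, and the total delay can be rewritten as a weighted sum of node timestamps with weight equal to in-degree minus out-degree. The sketch below is illustrative only. The function names, the edge-list representation, and the geometric form Pr(delay = k) = (1 - p)^(k - 1) p for k ≥ 1 are assumptions, since the corresponding formulas were lost in extraction; it also assumes 0 < p < 1.

```python
import math

def path_log_likelihood(edges, t, p):
    """Log-likelihood of a labeled cascading tree under geometric edge delays.

    edges: list of directed edges (u, v), meaning u infected v
    t:     dict node -> first infection timestamp
    """
    total = 0.0
    for u, v in edges:
        delay = t[v] - t[u]
        if delay < 1:
            return -math.inf            # not a permitted labeling
        total += math.log(p) + (delay - 1) * math.log(1 - p)
    return total

def total_delay_by_degree(edges, t):
    """Total delay written as sum over nodes of t[v] * (in_degree - out_degree)."""
    coeff = {}
    for u, v in edges:
        coeff[u] = coeff.get(u, 0) - 1  # each outgoing edge contributes -t[u]
        coeff[v] = coeff.get(v, 0) + 1  # each incoming edge contributes +t[v]
    return sum(t[v] * c for v, c in coeff.items())
```

Because log(1 - p) is negative, a smaller total delay always means a larger likelihood for a fixed tree, which is why the maximization reduces to minimizing the objective of LIP(6).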

This can be done by having each node pass two messages up to its parent. The first message is the virtual timestamp of node , which we denote as . The second message is the aggregate of the transmission delays of the edges , which we denote as . Here refers to the directed subtree of that is rooted at and points away from . The details of message passing are included in Algorithm 1, the time complexity of which is . The optimality of message passing in solving LIP(6) is established in Proposition 3.1.

Proposition 3.1 (Optimality of Algorithm 1).

Algorithm 1 returns empty if and only if LIP(6) is infeasible. If LIP(6) is feasible, the aggregate delays at the source node is the optimal value of LIP(6), and the virtual timestamp of node is

where denotes the subtree of that is rooted at and points away from .

The proof is included in Appendix A.

Note that after solving LIP(6) for cascading tree , the maximum likelihood of with respect to is

(8)

if the output of Algorithm 1 is not empty.

Input:  Cascading tree with partial timestamps .
Output:  The aggregate delays , and virtual timestamps .
1:  for  in  do
2:     if  is a leaf then
3:        , ;
4:     else
5:        ;
6:        if  then
7:           if  then
8:              return  None.
9:           else
10:              ;
11:           end if
12:        end if
13:        ;
14:     end if
15:  end for
16:  return  .
Algorithm 1 Message-passing to solve LIP(6)
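Since the symbols in the listing above did not survive extraction, the following Python sketch gives one reading of the message-passing procedure, reconstructed from the surrounding text rather than copied from the original. It assumes that a leaf reports its observed timestamp and zero aggregate delay, that an internal node takes the latest slot allowed by its children (the smallest child timestamp minus one) unless its own observation is earlier, and that the procedure reports infeasibility when an observation is later than that bound. The names `message_passing`, `children`, and `observed` are hypothetical.

```python
def message_passing(children, root, observed):
    """Return (aggregate_delay_at_root, virtual_timestamps), or None if infeasible.

    children: dict node -> list of child nodes in the directed cascading tree
    observed: dict sampled node -> observed first infection timestamp
    Assumes every leaf of the cascading tree is a sampled node.
    """
    tau, agg = {}, {}
    # Pre-order traversal; its reverse visits every child before its parent.
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    for u in reversed(order):
        kids = children.get(u, [])
        if not kids:                      # leaf: timestamp is observed
            tau[u], agg[u] = observed[u], 0
            continue
        latest = min(tau[c] for c in kids) - 1   # latest slot the children allow
        if u in observed:
            if observed[u] > latest:
                return None               # observation contradicts the children
            latest = observed[u]
        tau[u] = latest
        # Delays below u plus the delays on the edges from u to its children.
        agg[u] = sum(agg[c] + tau[c] - tau[u] for c in kids)
    return agg[root], tau
```

For example, with `children = {0: [1], 1: [2, 3]}` and `observed = {2: 5, 3: 7}`, node 1 receives virtual timestamp 4, the root receives 3, and the aggregate delay at the root is 5.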

3.2. Source Localization on a Tree

After computing the most likely timestamp for a fixed cascading tree, according to infection path based estimator in Eq.(3) we need to search over all cascading trees to find the most likely labeled cascading tree . When the underlying graph is a tree, there is only one cascading tree rooted at node since no cycle exists. Then the estimator is simply

(9)

where the inner maximization over is to find the most likely labeled cascading tree given , and the outer maximization over is to identify the source with most likely infection path.

To reduce the search region, we partition the underlying tree according to the infection path likelihood . As shown in Figure 2, the underlying tree is partitioned into four disjoint regions: , , , and . In the following we will show in three steps that

(10)
Figure 2. Partition of underlying tree graph according to infection path likelihood.

The first step is to show . Observe that in Figure 2, where is the minimum Steiner tree spanning in the underlying tree.

Lemma 3.2 ().

When the underlying graph is a tree, for any true source , any infection probability , and any observed partial timestamps , we have

(11)

for any node .

Proof.

Apparently Eq.(11) holds when

in which case the LIP(6) for cascading tree is infeasible.

Now we assume that LIP(6) for cascading tree is feasible, and its optimal value is given by

where is the virtual timestamp of node . According to the definition of the cascading tree, . Since and is a directed tree without cycle, there must be a node connecting node with other node in . Such node can be found by . And then and . Note that cascading tree is minimum Steiner tree whose edges are directed. And where denotes the subtree of that is rooted at and points away from . According to Appendix A.1, we have for any node where is the virtual timestamp of node when running Algorithm 1 for cascading tree . Then

(12)

The second step is to show that for any node . We first give some definitions that could help characterize , , and .

Definition 3.3 ().

When the underlying graph is a tree, for each node , we define the distance between and with respect to sampled nodes to be the number of sampled nodes on path , i.e.,

(13)
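Definition 3.3 counts sampled nodes along the unique path between two nodes of a tree. A minimal sketch of this count is given below. The function name, the adjacency-dictionary format, and the choice to include both endpoints of the path in the count are assumptions, because the formula in Eq.(13) was lost in extraction.

```python
def sampled_distance(adj, u, v, sampled):
    """Number of sampled nodes on the unique tree path from u to v.

    adj:     dict node -> list of neighbors in a connected tree
    sampled: set of sampled nodes
    Both endpoints are counted if they are sampled (an assumption).
    """
    parent = {u: None}
    stack = [u]
    while stack:                      # DFS from u records each node's parent
        x = stack.pop()
        for y in adj[x]:
            if y not in parent:
                parent[y] = x
                stack.append(y)
    count, x = 0, v
    while x is not None:              # walk back from v to u along parents
        count += x in sampled
        x = parent[x]
    return count
```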

Note that for any node , we have as shown in Figure 2. To prove for any node , it suffices to argue Lemma 3.4.

Lemma 3.4 ().

When the underlying graph is a tree, for any node , we have

(14)

if .

Proof.

If , there are at least two distinct nodes such that . It implies that . Now consider the LIP(6) for cascading tree . Assume that is one permitted timestamps satisfying all the constraints of LIP(6) for cascading tree . For node and we have and . Note that

which violates the fact that . This contradiction indicates that LIP(6) for cascading tree is infeasible which means that

The third step is to show that . It suffices to argue that for any node , there is a node such that . Note that for any node , and . It suffices to argue Lemma 3.5.

Lemma 3.5 ().

When the underlying graph is a tree, for any node , if and we have

(15)

where is the unique sampled node on path .

We defer the proof of Lemma 3.5 to Appendix B. Combining Lemmas 3.2, 3.4, and 3.5, we conclude that the likelihood of the time labeled cascading tree is larger when it is rooted at nodes around the true source, as stated in Proposition 3.6.

Proposition 3.6 ().

When the underlying graph is a tree, we have

(16)

where .

When revisiting Figure 2, it is easy to observe that is exactly in Proposition 3.6 which proves the inequality (10).

According to Proposition 3.6, we could reduce the search region from to for the infection path based estimator. However, this is impractical because we have no prior knowledge of where the true source is. Therefore, we seek another region such that and could be obtained from partial timestamps , the sampled set , and the topology of the underlying tree graph. Intuitively, the region should be close to the sampled node with the minimum timestamp. We verify this intuition in Lemma 3.7 and define the region in Proposition 3.8.

Lemma 3.7 ().

Let denote any sampled node with minimum observed timestamp (ties broken arbitrarily), then .

Proof.

Since is a sampled node with minimum observed timestamp, there cannot be any other sampled node on the path . Therefore, which implies that . ∎

Proposition 3.8 ().

Let be the sampled node with minimum observed timestamp. Let , then

(17)
Proof.

It suffices to prove that . We consider two cases.

(1) Consider the case where , then and . Apparently .

(2) Consider the case where . For any node , if then therefore which implies that . If , then there must exist at least one sampled node such that node is on the path . Note that and , therefore . ∎

Note that the in Proposition 3.8 could be computed via breadth-first search starting from . The details are given in Algorithm 2. Note that the most time consuming part is breadth-first search starting from node , therefore the time complexity of Algorithm 2 is . Given , we could find the infection path based estimator

(18)

using message-passing algorithm. The details are shown in Algorithm 3, the time complexity of which is .

Input:  Underlying tree , sampled nodes with partial timestamps .
Output:  Reduced search space .
1:  , ties broken arbitrarily;
2:  Construct cascading tree via breadth-first search;
3:  , put children of on cascading tree into an empty queue ;
4:  while  is not empty do
5:     
6:     if  then
7:        , put children of on cascading tree into ;
8:     end if
9:  end while
10:  return  .
Algorithm 2 Find Reduced Search Space
Input:  Underlying tree , sampled nodes with partial timestamps .
Output:  The estimated source node .
1:  Construct reduced search space using Algorithm 2;
2:  ;
3:  for  in  do
4:     Construct cascading tree via BFS.
5:     Run Algorithm 1 for cascading tree . If the output is empty, ;
6:  end for
7:   where is the output of Alg.1
8:  return  
Algorithm 3 Source Localization on General Tree
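The estimator on a tree can be sketched end to end in Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the path likelihood is the product of independent geometric edge delays with 0 < p < 1, it builds the cascading tree rooted at each candidate by pruning branches that contain no sampled node, and by default it searches every node of the tree instead of the reduced search space of Algorithm 2. The names `best_path_loglik`, `locate_source`, and the optional `candidates` argument (through which a reduced region could be supplied) are hypothetical. The recursion suits small trees only.

```python
import math

def best_path_loglik(adj, root, observed, p):
    """Log-likelihood of the most likely labeled cascading tree rooted at root."""
    def visit(u, parent):
        # Returns (virtual timestamp, aggregate delay, edge count) for the part
        # of the cascading tree below u, or None if no sampled node lies there.
        kids = []
        for c in adj[u]:
            if c != parent:
                r = visit(c, u)
                if r is not None:
                    kids.append(r)
        if not kids:
            return (observed[u], 0, 0) if u in observed else None
        tau = min(k[0] for k in kids) - 1
        if u in observed:
            if observed[u] > tau:
                raise ValueError("no permitted timestamps")
            tau = observed[u]
        delay = sum(k[1] + k[0] - tau for k in kids)
        edges = sum(k[2] + 1 for k in kids)
        return tau, delay, edges

    try:
        result = visit(root, None)
    except ValueError:
        return -math.inf                  # infeasible: likelihood is zero
    if result is None:
        return -math.inf                  # no sampled node at all
    _, delay, edges = result
    return edges * math.log(p) + (delay - edges) * math.log(1 - p)

def locate_source(adj, observed, p, candidates=None):
    """Return the candidate whose best labeled cascading tree is most likely."""
    nodes = list(adj) if candidates is None else list(candidates)
    return max(nodes, key=lambda v: best_path_loglik(adj, v, observed, p))
```

On the path 0-1-2-3-4 with both end nodes observed at time-slot 3, the middle node 2 is the unique maximizer: the end nodes are infeasible as sources, and nodes 1 and 3 require a larger total delay.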

3.3. Source Localization on General Graphs

Locating the source on a general graph is challenging because there is an exponential number of possible cascading trees for each node. To avoid such a combinatorial explosion, we follow a time labeled BFS heuristic, presented in Algorithm 4. Starting from a node , we do a breadth-first search to construct a time labeled BFS tree. Specifically, we assign each node a time label . Initially if the starting node , we set . Otherwise, which represents an extremely small value. When a node is explored from a directed edge , if and we add directed edge to BFS tree and set . If we still add directed edge to BFS tree and set . The whole process terminates either when all the edges are explored or when are included in the BFS tree. Note that the resulting BFS tree may not contain all the sampled nodes . Intuitively, it is less likely for to be the source if contains fewer sampled nodes. Therefore we use a threshold to rule out those “unlikely” nodes. In practice the threshold needs to be tuned to avoid the extreme case where all nodes are ruled out. Since a breadth-first search is executed for each node, the time complexity is .

Input:  Underlying graph , sampled nodes with partial timestamps , a threshold to be tuned.
Output:  The estimated source node .
1:  Initialize search space ;
2:  for  in  do
3:     Construct a time labeled BFS tree rooted at node .
4:     if  then
5:        
6:     else
7:        Compute aggregate delays of node on tree using message passing Algorithm 1.
8:     end if
9:  end for
10:  .
11:  return  .
Algorithm 4 Source Localization on General Graph

4. Performance Guarantee

Although the infection path based estimator in Eq.(3) is only an approximation of the original ML estimator, we prove in this section that it still achieves provably good performance under certain topologies. Specifically, in this section we assume the underlying graph is a tree , and we present the performance guarantee for the source localization algorithm on tree in terms of the distribution of , which is the distance between the true source and the estimated source on tree . Assuming that the true source is given, we introduce a topological concept called candidate path and show that the infection path based estimator is always on that path. By means of the candidate path, we are able to analyze the distribution of under the assumption that is uniformly sampled.

4.1. Candidate Path

According to Proposition 3.6, the infection path based estimator is

(19)

therefore, the estimated source even though we do not know a priori. If we look at the definition of

it is easy to find that only depends on the topology of , and . If we could utilize the observed timestamps , it is possible to define a tighter region that could help us analyze the distribution of . Especially, if we have

Proposition 4.1 ().

When the underlying graph is a tree, if , we have .

Proof.

If , then , it implies that . For any node , , then by Lemma 3.5

From now on we assume that .

Lemma 4.2 ().

When the underlying graph is a tree, if , the infection path based estimator is

(20)

where .

Proof.

It suffices to prove that

For any node , if then . If , then there must be a node such that . Therefore, for any node , we have . ∎

Definition 4.3 (Anchor Node of ).

For true source node , we define its anchor node as

(21)
Definition 4.4 (Candidate Path).

The candidate path is defined as the intersection of paths from anchor node to sampled node , i.e.,

(22)

where is given by

(23)