In social networks, a rumor spreads like an infectious disease. In fact, it can be modeled as an infectious disease [2, 3]. Most studies of rumors (or infectious diseases) analyze the mechanisms by which a rumor spreads in a given network [4, 5].
Unlike studies of this type, we address the rumor source identification problem introduced by Shah and Zaman. The goal of this problem is to find the origin node of a rumor (the rumor source) in a network, given the set of nodes that possess the rumor. If the rumor source can be detected, it becomes possible, for example, to find a vulnerable node that spreads a computer virus or to rank websites for a search engine. For this problem, Shah and Zaman introduced the optimal estimator and analyzed its correct detection probability for several types of networks. This probability asymptotically goes to one for a very special network called a geometric tree (see [3, Sec. IV.D]). However, they showed analytically or experimentally that the probability is asymptotically not high, or goes to zero, for many other networks such as regular trees, small-world networks, and scale-free networks. Here, a regular tree is a network that has no cycles and in which all nodes have the same degree, i.e., the same number of edges connected to a node.
Although the optimal estimator may not find the rumor source, it actually selects a node near the rumor source. This fact is known experimentally (cf. [3, Sect. V.B] and [6, Sect. 8]) but, to the best of our knowledge, has not been established analytically. In this paper, we focus on this fact and establish it analytically. In particular, we focus on regular trees and show that, with quite high probability, the rumor source is within the distance “” from the node selected by the optimal estimator, where the distance is the number of edges on the unique path connecting two nodes. This is shown through the probability distribution of the distance between the rumor source and the selected node.
II Rumor Source Identification Problem
In this section, we introduce the rumor source identification problem and review some known results on this problem.
Let be an undirected and connected graph. Let denote the set of nodes and denote the set of edges of the graph . We denote the edge connecting two nodes by the set of nodes . In this paper, we consider the case where is a regular tree, that is, the graph does not have any cycle, and all nodes have the same degree. (The line graph () is not considered in this paper because this case is somewhat difficult to treat in a unified manner; however, the essential argument for this case is the same as for the case where .) We assume that the number of nodes is countably infinite in order to avoid boundary effects.
A rumor spreads in a given regular tree . Initially, only one node (the rumor source) possesses the rumor. A node possessing the rumor passes it to its adjacent nodes, and these nodes keep it forever. For , let be a real-valued random variable (RV) that represents the rumor spreading time from the node to the node after gets the rumor. In this model, the spreading times of are represented as if , and if . This spreading model is sometimes called the susceptible-infected (SI) model.
Suppose that we observe a network consisting of infected nodes in the graph at some time. Since the rumor spreads only to adjacent nodes, this network is a connected subgraph of . We denote the RV of this network by and its realization by . We only know the observed network and do not know the realization of the spreading times on the edges. The goal of the rumor source identification problem is then to find the rumor source among the given .
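The spreading process and the observed infected subgraph can be illustrated with a short event-driven simulation. This is only a sketch under stated assumptions: the spreading times are taken to be i.i.d. exponential with a hypothetical `rate` parameter, nodes of the infinite regular tree are labeled by tuples describing the path from the source, and the function names `neighbors` and `simulate_si` are illustrative rather than notation from the paper.

```python
import heapq
import random


def neighbors(node, d):
    """Neighbors of `node` in the infinite d-regular tree.

    Nodes are tuples encoding the path from the root (); this labeling
    is an illustrative choice, not notation from the paper.
    """
    if not node:  # the root has d children and no parent
        return [(i,) for i in range(d)]
    children = [node + (i,) for i in range(d - 1)]
    return [node[:-1]] + children  # parent plus d - 1 children


def simulate_si(d, n, rng, rate=1.0):
    """Run the SI model from the source () until n nodes are infected.

    Assumption: spreading times on edges are i.i.d. Exp(rate).
    Returns the infected nodes in order of infection.
    """
    source = ()
    infected = [source]
    seen = {source}
    events = []  # heap of (infection time, node)
    for u in neighbors(source, d):
        heapq.heappush(events, (rng.expovariate(rate), u))
    while len(infected) < n:
        t, v = heapq.heappop(events)
        # In a tree, each node is scheduled exactly once, by the unique
        # neighbor that lies on its path to the source.
        infected.append(v)
        seen.add(v)
        for u in neighbors(v, d):
            if u not in seen:
                heapq.heappush(events, (t + rng.expovariate(rate), u))
    return infected


if __name__ == "__main__":
    order = simulate_si(d=3, n=10, rng=random.Random(0))
    print(order)
```

The returned list always forms a connected subtree containing the source, which matches the observation above that the infected network is a connected subgraph. An observer in the identification problem would see only the set of infected nodes, not the order or the times.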
For this problem, the optimal estimator is the maximum likelihood (ML) estimator (cf. ) defined as
where ties are broken uniformly at random and is the probability of observing under the SI model assuming that is the rumor source. For this optimal estimator, let be the correct detection probability when a graph of infected nodes is observed, i.e., . Shah and Zaman showed the asymptotic behavior of in the following theorem.
Theorem 1 ([7, Theorem 3.1])
For a regular tree with degree , it holds that
where is the regularized incomplete beta function defined as and is the Gamma function.
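For readers who want to evaluate the regularized incomplete beta function numerically, the following is a minimal sketch based on its standard definition, I_x(a, b) = B(x; a, b) / B(a, b) with B(a, b) = Γ(a)Γ(b)/Γ(a + b). The function name, the substitution u = t^a, and the use of Simpson's rule are choices made for this illustration; the accuracy is best when a ≤ 1, and a library routine would normally be preferable.

```python
import math


def regularized_incomplete_beta(x, a, b, intervals=2000):
    """Approximate I_x(a, b) for 0 <= x < 1 and a, b > 0.

    Uses the substitution u = t**a, which turns the integral of
    t**(a-1) * (1-t)**(b-1) over [0, x] into (1/a) times the integral
    of (1 - u**(1/a))**(b-1) over [0, x**a], removing the singularity
    at t = 0. The latter is evaluated with composite Simpson's rule.
    """
    if not (0.0 <= x < 1.0 and a > 0 and b > 0):
        raise ValueError("requires 0 <= x < 1 and a, b > 0")
    if intervals % 2:
        intervals += 1  # Simpson's rule needs an even number of intervals

    def integrand(u):
        return (1.0 - u ** (1.0 / a)) ** (b - 1.0)

    upper = x ** a
    h = upper / intervals
    total = integrand(0.0) + integrand(upper)
    for k in range(1, intervals):
        total += (4 if k % 2 else 2) * integrand(k * h)
    incomplete = total * h / 3.0 / a
    complete = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return incomplete / complete


if __name__ == "__main__":
    print(regularized_incomplete_beta(0.5, 1.0, 2.0))
```

As a sanity check on the definition, I_x(1, 2) equals 1 − (1 − x)^2 in closed form, so the value at x = 1/2 is 3/4.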
According to this theorem, when , . Moreover, it rapidly converges to as goes to infinity (cf. [7, Corollary 1 and Figure 3]). This means that, unfortunately, the correct detection probability is not very high for regular trees.
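To make the estimator of this section concrete, the sketch below computes candidate estimates on a small observed tree. It relies on an assumption that is not stated in this section: the characterization from Shah and Zaman's rumor-centrality work that, on regular trees, the likelihood is maximized exactly at the nodes whose removal leaves no component with more than half of the infected nodes. The adjacency-dictionary input and the names `subtree_sizes`, `rumor_centers`, and `ml_estimate` are hypothetical.

```python
import random


def subtree_sizes(adj, root):
    """Sizes of the subtrees hanging off each neighbor of `root`.

    `adj` maps each infected node to the list of its infected neighbors
    and is assumed to describe a tree.
    """
    sizes = {}
    for nb in adj[root]:
        count, stack = 0, [(nb, root)]
        while stack:  # iterative DFS moving away from `root`
            node, parent = stack.pop()
            count += 1
            stack.extend((c, node) for c in adj[node] if c != parent)
        sizes[nb] = count
    return sizes


def rumor_centers(adj):
    """Nodes at which every neighboring subtree has size at most n / 2.

    Assumption: on a regular tree these nodes are the maximizers of the
    likelihood (Shah and Zaman's rumor-center characterization).
    """
    n = len(adj)
    return [v for v in adj
            if all(2 * s <= n for s in subtree_sizes(adj, v).values())]


def ml_estimate(adj, rng):
    """Pick one maximizer, breaking ties uniformly at random."""
    return rng.choice(rumor_centers(adj))


if __name__ == "__main__":
    path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    print(ml_estimate(path, random.Random(0)))
```

This version recomputes subtree sizes from scratch for every node, which is quadratic in the number of infected nodes; that is acceptable for small illustrations, and a single bottom-up pass would be the natural choice for larger trees.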
III Main Results
In this section, we show that the ML estimator can select a node near the rumor source with high probability.
To this end, we clarify the probability distribution of the distance between the rumor source and the node selected by the ML estimator. We denote this probability by and define it as
where and denotes the distance between nodes and in the graph . Note that .
When , we can obtain a closed-form expression for the asymptotic behavior of , as stated in the next theorem.
Let . Then, for any , we have
We denote the rising factorial by . The next theorem gives tight upper and lower bounds of for more general degrees.
For any , , and , we have
, and for any .
These theorems imply that the ML estimator can select a node near the rumor source with high probability. This is clear from the next corollary and its numerical results (Fig. 1).
Let . Then, for any , we have
More generally, for any , , and , we have
Here, and denote the right-hand side of (1).
Since , Fig. 1 gives almost exact numerical results of . We note that numerical results for other degrees are almost the same (see Fig. 2). Thus, these results show that the rumor source is within the distance from the node selected by the ML estimator with quite high probability. We note that Khim and Loh [6, Corollary 2] gave another lower bound of . However, it is much looser than our bound and is zero at least when the values of the parameters and are within the range shown in Fig. 1 and Fig. 2.
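The distance distribution discussed here can also be explored empirically. The following Monte Carlo sketch is not the method used to produce the figures; it is an illustration under two assumptions: the spreading times are memoryless, so that each newly infected node is uniform over the current boundary nodes, and the ML estimate is a node whose neighboring subtrees all contain at most half of the infected nodes (ties broken uniformly). All names and parameter values are hypothetical.

```python
import random
from collections import Counter


def grow_infection(d, n, rng):
    """Parent array of an n-node infection tree; node 0 is the source.

    `boundary` holds one entry per uninfected neighbor of the infected
    set, storing the infected node it is attached to.
    """
    parent = [-1]
    boundary = [0] * d
    while len(parent) < n:
        i = rng.randrange(len(boundary))  # uniform boundary node
        boundary[i], boundary[-1] = boundary[-1], boundary[i]
        p = boundary.pop()
        new = len(parent)
        parent.append(p)
        boundary.extend([new] * (d - 1))  # its d - 1 uninfected neighbors
    return parent


def ml_distance(parent, rng):
    """Distance from the source (node 0) to the selected balanced node."""
    n = len(parent)
    size = [1] * n  # size of the subtree below each node, rooted at 0
    for v in range(n - 1, 0, -1):  # parents always have smaller indices
        size[parent[v]] += size[v]
    depth = [0] * n
    largest_child = [0] * n
    for v in range(1, n):
        depth[v] = depth[parent[v]] + 1
        largest_child[parent[v]] = max(largest_child[parent[v]], size[v])
    centers = [v for v in range(n)
               if 2 * max(largest_child[v], n - size[v]) <= n]
    return depth[rng.choice(centers)]


def distance_distribution(d, n, trials, seed=0):
    """Empirical distribution of the source-to-estimate distance."""
    rng = random.Random(seed)
    counts = Counter(ml_distance(grow_infection(d, n, rng), rng)
                     for _ in range(trials))
    return {k: counts[k] / trials for k in sorted(counts)}


if __name__ == "__main__":
    print(distance_distribution(d=3, n=201, trials=2000))
```

Summing the returned frequencies up to a given distance gives an empirical counterpart of the cumulative probabilities in the corollary, subject to sampling error and to the finite number of infected nodes.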
IV Proofs of Theorems
In this section, we prove our main theorems. We will denote -length sequences of RVs and their realizations by and , respectively. For the sake of brevity, we denote by and by .
For any node in a regular tree with degree , there are neighbors. Thus, there are subtrees rooted at these neighbors with the parent node . In other words, the regular tree is divided into these subtrees and the node . Let be the number of infected nodes in the th subtree among those subtrees (). When is not the rumor source, let the th subtree contain the rumor source . Note that, if is an infected node, we have . The next lemma is the key to proving our main theorems.
For a node , let . Then, we have
We denote the set of nodes with distance from the rumor source by . Note that the number of elements of is . Then, can be represented as
where the last equality comes from Lemma 1.
On the other hand, let be the sequence of RVs each representing the th infected node, where with probability 1. Then, we have . This implies that the event is equal to the event . Hence, we have
where . We also have
Thus, we need to obtain closed-form expressions of and .
IV-A Closed-Form Expression of
Let be the set of neighboring nodes of in the graph . Suppose that the nodes in the set are infected with a rumor and that no other nodes are infected. Then, we denote by the set of boundary nodes that may be infected next by the infected nodes, i.e., . Let be the set of ordered sequences of nodes on possible paths of infection, i.e., , where . Since are independent and have the memoryless property, the node to be infected next is uniformly selected from the boundary nodes at each step. Hence, we have for any and ,
Let be the (shortest) path from the rumor source to . Then, for and , the th infected node is if and only if the following event occurs for some such that :
where and . Hence, if and , we have
where (a) comes from the chain rule of probability, and (b) comes from Appendix A.
The remaining case is that and . In this case, we have
Consequently, (7) holds for any and .
IV-B Closed-Form Expression of
Suppose that the th infected node is . Since we consider a regular tree, has neighboring nodes . Let be the number of infected nodes of the subtree rooted at with the parent node after is infected. Let the subtree rooted at contain the rumor source. Thus, at the time that is infected, it holds that . From then on, the node to be infected next is uniformly selected from the boundary nodes at each step. We note that for all , and . Then, the numbers are drawn according to Pólya’s urn model with colors of balls (cf.  and ): Initially, balls of color are in the urn, where if and if . At each step, a single ball is drawn uniformly from the urn. Then, the drawn ball is returned with additional balls of the same color. This drawing process is repeated.
corresponds to the number of times that balls of color are drawn. According to [11, Chap. 4], when the total number of drawn balls is , the joint distribution of is given by
where and . We note that the above probability only depends on , and .
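The urn scheme described above can be simulated directly. The sketch below is a generic Pólya urn and does not reproduce the specific initial counts or the number of added balls from the text; the arguments `initial` (balls per color at the start) and `extra` (balls added after each draw) are hypothetical parameters, and the values in the usage example are purely illustrative.

```python
import random


def polya_urn_draws(initial, extra, draws, rng):
    """Simulate a Polya urn and return how often each color was drawn.

    initial: list with the starting number of balls of each color.
    extra:   number of additional balls of the drawn color that are
             returned together with the drawn ball.
    draws:   total number of draws.
    """
    balls = list(initial)
    drawn = [0] * len(balls)
    for _ in range(draws):
        # Draw a single ball uniformly: a color is chosen with
        # probability proportional to its current number of balls.
        color = rng.choices(range(len(balls)), weights=balls)[0]
        drawn[color] += 1
        balls[color] += extra
    return drawn


if __name__ == "__main__":
    # Illustrative values only: three colors, one extra ball per draw.
    print(polya_urn_draws([1, 1, 3], extra=1, draws=20,
                          rng=random.Random(0)))
```

Repeating the simulation many times and tabulating the returned vectors gives an empirical joint distribution of the draw counts, which can be compared against a closed-form expression once concrete parameters are fixed.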
Now, by definition, we have