Inferring the Underlying Structure of Information Cascades

10/12/2012 ∙ by Bo Zong, et al. ∙ The Regents of the University of California

In social networks, information and influence diffuse among users as cascades. While the importance of studying cascades has been recognized in various applications, it is difficult to observe the complete structure of cascades in practice. Moreover, much less is known about how to infer cascades based on partial observations. In this paper we study the cascade inference problem following the independent cascade model, and provide a full treatment from complexity to algorithms: (a) We propose the idea of consistent trees as the inferred structures for cascades; these trees connect source nodes and observed nodes with paths satisfying the constraints from the observed temporal information. (b) We introduce metrics to measure the likelihood of consistent trees as inferred cascades, as well as several optimization problems for finding them. (c) We show that the decision problems for consistent trees are in general NP-complete, and that the optimization problems are hard to approximate. (d) We provide approximation algorithms with performance guarantees on the quality of the inferred cascades, as well as heuristics. We experimentally verify the efficiency and effectiveness of our inference algorithms, using real and synthetic data.




I Introduction

In various real-life networks, users frequently exchange information and influence each other. The information (e.g., messages, articles, recommendation links) is typically created from a user and spreads via links among users, leaving a trace of its propagation. Such traces are typically represented as trees, namely, information cascades, where (a) each node in a cascade is associated with the time step at which it receives the information, and (b) an edge from a node to another indicates that a user propagates the information to and influences its neighbor [4, 12].

A comprehensive understanding and analysis of cascades benefit various emerging applications in social networks [6, 16], viral marketing [1, 9, 27], and recommendation networks [24]. In order to model the propagation of information, various cascade models have been developed [8, 31, 33]. Among the most widely used models is the independent cascade model [16], where each node has only one chance to influence its inactive neighbors, and each node is influenced by at most one of its neighbors independently. Nevertheless, it is typically difficult to observe the entire cascade in practice, due to noisy graphs with missing data or data privacy policies [21, 29]. It is important to develop techniques that can infer the cascades using partial information. Consider the following example.

Fig. 1: A cascade of an Ad (partially observed) in a social network from user Ann, and two of its possible tree representations.
Example 1

The graph in Fig. 1 depicts a fraction of a social network (e.g., Twitter), where each node is a user, and each edge represents an information exchange. For example, a weighted edge from Ann represents that the user Ann sends an advertisement (Ad) about a released product (e.g., “Iphone 4s”) to a neighbor with the probability given by the weight. To identify the impact of an Ad strategy, a company would like to know the complete cascade starting from their agent Ann. Due to data privacy policies, the observed information may be limited: (a) at time step 0, Ann posts an Ad about “Iphone 4s”; (b) at the next time step, Bill is influenced by Ann and retweets the Ad; (c) by a later time step, the Ad reaches Mary, and Mary retweets it. As seen, the information diffuses from one user to his or her neighbors with different probabilities, represented by the weighted edges. Note that the cascade unfolds as a tree, rooted at the node Ann.

To capture the entire topological information of the cascades, we need to make inferences in the graph-time domain. Given the above partially observed information, two such inferred cascades are shown as trees in Fig. 1. One illustrates a cascade where each path from the source Ann to an observed node has a length that exactly equals the time step at which the observed node is influenced; the other illustrates a cascade where any path from Ann to an observed node has a length no greater than the observed time step when the node is influenced, due to possible delay in observation, e.g., Mary is known to be influenced by (instead of exactly at) a given time step. The inferred cascades provide useful information about the missing links and users that are important in the propagation of the information.

The above example highlights the need to make reasonable inferences about the cascades, according to only the partial observations of influenced nodes and the times at or by which they are influenced. Although cascade models and a set of related problems, e.g., influence maximization, have been widely studied, much less is known about how to infer the cascade structures, including complexity bounds and approximation algorithms.

Contributions. We investigate the cascade inference problem, where cascades follow the widely used independent cascade model. To the best of our knowledge, this is the first work towards inferring cascades as general trees following the independent cascade model, based on partial observations.

(a) We introduce the notions of (perfect and bounded) consistent trees in Section II. These notions capture the inferred cascades by incorporating connectivity and time constraints in the partial observations. To provide a quantitative measure of the quality of inferred cascades, we also introduce two metrics in Section II, based on (i) the size of the consistent trees, and (ii) the likelihood when a diffusion function of the network graph is taken into account, respectively. These metrics give rise to two optimization problems, referred to as the minimum consistent tree problem and minimum weighted consistent tree problem.

(b) We investigate the problems of identifying perfect and bounded consistent trees, for given partial observations, in Section III and Section IV, respectively. These problems are variants of the inference problem.

(i) We show that these problems are all NP-complete. Worse still, the optimization problems are hard to approximate: unless P = NP, it is not possible to approximate the problems within any constant ratio.

(ii) Nevertheless, we provide approximation and heuristic algorithms for these problems. For bounded trees, the problems are approximable within a ratio determined by the size of the partial observation and the minimum and maximum probabilities on the graph edges, and we provide such polynomial-time approximation algorithms. For perfect trees, we show that it is already NP-hard to even find a feasible solution; however, we provide an efficient heuristic using a greedy strategy. Finally, we address a practical special case of the perfect tree problems that is approximable within a ratio determined by the diameter of the graph, which is typically small in practice.

(c) We experimentally verify the effectiveness and the efficiency of our algorithms in Section V, using real-life data and synthetic data. We show that our inference algorithms can efficiently infer cascades with satisfactory accuracy.

Related work. We categorize related work as follows.

Cascade Models. To capture the behavior of cascades, a variety of cascade models have been proposed [2, 13, 15, 17, 18], such as the Susceptible/Infected (SI) model [2], the decreasing cascade model [17], the triggering model [16], the shortest path model [19], and the Susceptible/Infected/Recovered (SIR) model [18]. In this paper, we assume that the cascades follow the independent cascade model [13], which is one of the most widely studied models (the shortest path model [19] is one of its special cases).

Cascade Prediction. There has been recent work on cascade prediction and inference, with the emphasis on global properties (e.g., cascade nodes, width, size) [5, 11, 20, 23, 29, 31, 33] with the assumption of missing data and partial observations. The problem of identifying and ranking influenced nodes is addressed in [20, 23], but the topological inference of the cascades is not considered. Wang et al. [33] proposed a diffusive logistic model to capture the evolution of the density of active users at a given distance over time, and demonstrated the prediction ability of this model. Nevertheless, the structural information about the cascade is not addressed. Song et al. [31] studied the probability of a user being influenced by a given source. In contrast, we consider a more general inference problem where there are multiple observed users, who are influenced at different time steps from the source. Fei et al. [11] studied social behavior prediction and the effect of information content. In particular, their goal is to predict actions on an article based on the training dataset. Budak et al. [5] investigated the optimization problem of minimizing the number of the possible influencing nodes following a specified cascade model, instead of predicting cascades based on partial observations.

All the above works focus on predicting the nodes and their behavior in the cascades. In contrast, we propose approaches to infer both the nodes and the topology of the cascades in the graph-time domain.

Network Inference. Another body of work studies the network inference problem, which focuses on inferring network structures from cascades observed over an unknown network, instead of inferring cascade structures as trees [10, 14]. Manuel et al. [14] propose techniques to infer the structure of a network over which cascades flow, based on observations of the time at which each node is affected by a cascade. A similar network inference problem is addressed in [10], where the cascades are modeled as (Markov random walk) networks. The main differences between our work and theirs are that (a) we use consistent trees to describe possible cascades allowing partial observations; and (b) we focus on inferring the structure of cascades as trees instead of the backbone networks.

Closer to our work is the work by Sadikov et al. [29], which considers the prediction of cascades modeled as k-trees, a balanced tree model. Global properties of cascades such as size and depth are predicted based on the incomplete cascade. In contrast to their work, (a) we model cascades as general trees instead of balanced k-trees; (b) while Sadikov et al. [29] assume the partial cascade is also a k-tree and predict only the properties of the original cascade, we infer the nodes as well as the topology of the cascades from only a set of nodes and their activation times, using much less available information; and (c) the temporal information (e.g., time steps) in the partial observations is not considered in [29].

II Consistent Trees

We start by introducing several notions.

Diffusion graph. We denote a social network as a directed graph G = (V, E, f), where (a) V is a finite set of nodes, and each node u ∈ V denotes a user; (b) E ⊆ V × V is a finite set of edges, where each edge (u, v) ∈ E denotes a social connection via which the information may diffuse from u to v; and (c) f is a diffusion function which assigns to each edge (u, v) ∈ E a value f(u, v) ∈ (0, 1], the probability that node u influences v.

Fig. 2: Tree representations of a partial observation with three observation points: several of the depicted trees are consistent trees, while one is not.

Cascades. We first review the independent cascade model [16]. We say an information propagates over a graph G following the independent cascade model if (a) at any time step, each node in G is in exactly one of the three states active, newly active, and inactive; (b) a cascade starts from a source node s being newly active at time step 0; (c) a newly active node u at time step t has only one chance to influence its inactive neighbors, such that at time t + 1, (i) if v is an inactive neighbor of u, v becomes newly active with probability f(u, v); and (ii) the state of u changes from newly active to active, and u cannot influence any neighbors afterwards; and (d) each inactive node can be influenced by at most one of its newly active neighbors independently, where the neighbors’ attempts are sequenced in an arbitrary order. Once a node is active, it cannot change its state.

Based on the independent cascade model, we define a cascade over graph G = (V, E, f) as a directed tree T = (V_T, E_T, s, t) where (a) V_T ⊆ V and E_T ⊆ E; (b) s ∈ V_T is the source node from which the information starts to propagate; and (c) t is a function which assigns to each node v ∈ V_T a time step t(v), representing that v is newly active at time step t(v). Intuitively, a cascade is a tree representation of the “trace” of the information propagation from a specified source node to a set of influenced nodes.

Indeed, one may verify that any cascade from a source s following the independent cascade model is a tree rooted at s.
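To make the model concrete, the following is a minimal simulation sketch of the independent cascade model. The encoding (a dict mapping edges to probabilities) and the function name are illustrative assumptions, not part of the paper.

```python
import random

def simulate_ic(edges, source, seed=None):
    """Simulate one run of the independent cascade model.

    edges: dict mapping (u, v) -> probability that u influences v.
    Returns (active_time, parent): the time step at which each node became
    newly active, and the parent pointers forming the cascade tree.
    """
    rng = random.Random(seed)
    out = {}  # adjacency list with probabilities
    for (u, v), p in edges.items():
        out.setdefault(u, []).append((v, p))
    active_time = {source: 0}  # the source is newly active at step 0
    parent = {}
    frontier = [source]        # nodes that are newly active this step
    step = 0
    while frontier:
        step += 1
        next_frontier = []
        for u in frontier:
            for v, p in out.get(u, []):
                # each inactive node is influenced by at most one neighbor;
                # u gets only one chance to influence v
                if v not in active_time and rng.random() < p:
                    active_time[v] = step
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier  # u is now active and influences no one else
    return active_time, parent
```

With all probabilities set to 1.0 the run is deterministic, which makes the tree structure easy to inspect.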

Example 2

The graph G in Fig. 1 depicts a social graph. The two trees in Fig. 1 are possible cascades following the independent cascade model. For instance, after issuing an ad of “Iphone 4s”, Ann at time step 0 becomes “newly active”. Bill and Jack retweet the ad at the next time step: Ann becomes “active”, while Bill and Jack turn “newly active”. The process repeats until the ad reaches Mary. The trace of the information propagation forms a cascade.

As remarked earlier, it is often difficult to observe the entire structure of a cascade in practice. We model the observed information for a cascade as a partial observation.

Partial observation. Given a cascade T = (V_T, E_T, s, t), a pair (v, t) is an observation point if v ∈ V_T is known (observed) to be newly active at or by time step t. A partial observation X is a set of observation points. Specifically, X is a complete observation if for any v ∈ V_T, there is an observation point (v, t(v)) ∈ X. To simplify the discussion, we also assume that (s, 0) ∈ X, where s is the source node. The techniques developed in this paper can easily be adapted to the case where the source node is unknown.

We are now ready to introduce the idea of consistent trees.

II-A Consistent trees

Given a partial observation X of a graph G = (V, E, f), a bounded consistent tree T = (V_T, E_T) w.r.t. X is a directed subtree of G with root s, such that for every (v, t) ∈ X, v ∈ V_T, and s reaches v within t hops, i.e., there exists a path of length at most t from s to v in T. Specifically, we say a consistent tree is a perfect consistent tree if for every (v, t) ∈ X with v ∈ V_T, there is a path of length exactly t from s to v.

Intuitively, consistent trees represent possible cascades which conform to the independent cascade model, as well as to the partial observation. Note the following: (a) the path from the root s to a node v in a bounded consistent tree is not necessarily a shortest path from s to v in G, as observed in [22]; (b) perfect consistent trees model cascades when the partial observation is accurate, i.e., each time t in an observation point (v, t) is exactly the time when v is newly active; in contrast, in bounded consistent trees, an observation point (v, t) indicates that node v is newly active by the time step t, due to possible delays in the information propagation, as observed in [6].
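As a sanity check on these definitions, the sketch below tests whether a candidate tree is a bounded or perfect consistent tree. The child-to-parent encoding and the function name are illustrative assumptions; this is a direct translation of the definitions, not the paper's inference algorithm.

```python
def is_consistent(tree_parent, source, observation, perfect=False):
    """Check whether a candidate tree is a (perfect) consistent tree.

    tree_parent: dict child -> parent, describing a tree rooted at source.
    observation: dict node -> observed time step t.
    A bounded consistent tree needs depth(v) <= t for each observed v;
    a perfect consistent tree needs depth(v) == t exactly.
    """
    def depth(v):
        d = 0
        while v != source:
            if v not in tree_parent:
                return None  # v is not connected to the source in the tree
            v = tree_parent[v]
            d += 1
        return d

    for v, t in observation.items():
        d = depth(v)
        if d is None:
            return False
        if perfect and d != t:
            return False
        if not perfect and d > t:
            return False
    return True
```

For example, a chain s → a → b is perfectly consistent with observing b exactly at step 2, and bounded-consistent with observing b by step 3.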

Example 3

Recall the graph G in Fig. 1, together with the partial observation X of a cascade in G consisting of the observation points for Ann, Bill, and Mary. One of the trees in Fig. 1 is a perfect consistent tree w.r.t. X, while the other is a bounded consistent tree w.r.t. X.

Now consider the trees in Fig. 2. One may verify that (a) several of them are bounded consistent trees w.r.t. X; (b) among these, some are also perfect consistent trees w.r.t. X, while others are not; and (c) one tree is not a consistent tree at all, as there is no path from the source Ann to Mary of length no greater than the time step constrained by Mary’s observation point.

II-B Cascade inference problem

We introduce the general cascade inference problem. Given a social graph G and a partial observation X, the cascade inference problem is to determine whether there exists a consistent tree T w.r.t. X in G.

There may be multiple consistent trees for a partial observation, so one often wants to identify the best consistent tree. We next provide two quantitative metrics to measure the quality of the inferred cascades. Let G = (V, E, f) be a social graph, and X be a partial observation.

Minimum weighted consistent trees. In practice, one often wants to identify the consistent trees that are most likely to be the real cascades. Recall that each edge (u, v) in a given network G carries a value f(u, v) assigned by a diffusion function f, which indicates the probability that u influences v. Based on f, we introduce a likelihood function as a quantitative metric for consistent trees.

Likelihood function. Given a graph G = (V, E, f), a partial observation X and a consistent tree T = (V_T, E_T) w.r.t. X, the likelihood of T, denoted as L(T), is defined as:

L(T) = ∏_{(u,v) ∈ E_T} f(u, v)

Following common practice, we opt to use the log-likelihood metric, where

w(T) = −log L(T) = ∑_{(u,v) ∈ E_T} −log f(u, v)

Given G and X, a natural problem is to find the consistent tree of the maximum likelihood in G w.r.t. X. Using log-likelihood, the minimum weighted consistent tree problem is to identify the consistent tree with the minimum w(T), which in turn has the maximum likelihood.
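In code, the log-likelihood weight is a one-liner over the tree's edges; the edge-probability map f is an assumed encoding for illustration.

```python
import math

def log_likelihood_weight(tree_edges, f):
    """w(T): sum of -log f(u, v) over the tree's edges.

    Minimizing w(T) maximizes the likelihood L(T) = prod f(u, v),
    since -log is monotonically decreasing on (0, 1]."""
    return sum(-math.log(f[e]) for e in tree_edges)
```

For instance, a two-edge tree with probabilities 0.5 and 0.25 has likelihood 1/8 and weight log 8.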

Minimum consistent trees. Instead of weighted consistent trees, one may simply want to find the minimum structure that represents a cascade [25]. The minimum consistent tree, as a special case of the minimum weighted consistent tree, depicts the smallest cascade with the fewest communication steps to pass the information to all the observed nodes. In other words, the metric favors the consistent trees, consistent with the given partial observation, that have the fewest edges.

Given G and X, the minimum consistent tree problem is to find the minimum consistent trees in G w.r.t. X.

In the following sections, we investigate the cascade inference problem, and the related optimization problems using the two metrics. We investigate the problems for perfect consistent trees in Section III, and for bounded consistent trees in Section IV, respectively.

III Cascades as perfect trees

As remarked earlier, when the partial observation is accurate, one may want to infer the cascade structure via perfect consistent trees. The minimum (resp. minimum weighted) perfect consistent tree problem is to find the perfect consistent trees with minimum size (resp. minimum weight) as the quality metric.

Though it is desirable to have efficient polynomial-time algorithms to identify perfect consistent trees, these search problems are nontrivial.

Proposition 1

Given a graph G and a partial observation X, (a) it is NP-complete to determine whether there is a perfect consistent tree w.r.t. X in G; and (b) the minimum perfect consistent tree and minimum weighted perfect consistent tree problems are NP-complete and APX-hard.

One may verify Proposition 1(a) by a reduction from the Hamiltonian path problem [32], which is to determine whether there is a simple path visiting every node in a graph. Following this, one can verify that the minimum and minimum weighted perfect consistent tree problems are NP-complete as an immediate result.

Proposition 1(b) shows that the two optimization problems are hard to approximate. The APX class [32] consists of NP optimization problems that can be approximated by a polynomial-time (PTIME) algorithm within some positive constant ratio. The APX-hard problems are those to which every APX problem can be reduced. Hence, computing a minimum (weighted) perfect consistent tree is among the hardest problems that allow PTIME algorithms with a constant approximation ratio.

It is known that if there is an approximation-preserving reduction [32] from a problem P1 to a problem P2, and if P1 is APX-hard, then P2 is APX-hard [32]. To see Proposition 1(b), we may construct such a reduction from the minimum directed Steiner tree problem. An instance of the directed Steiner tree problem consists of a graph, a set of required nodes, a set of Steiner nodes, a source node, and a function which assigns to each node a positive weight; the problem is to find a minimum weighted tree rooted at the source node, such that it contains all the required nodes and possibly some of the Steiner nodes. We show such a reduction exists. Since the minimum directed Steiner tree problem is APX-hard, the minimum weighted perfect consistent tree problem is APX-hard.

III-A Bottom-up searching algorithm

Given the above intractability and approximation hardness results, we introduce a heuristic algorithm for the minimum weighted perfect consistent tree problem. The idea is to (a) generate a “backbone network” of G which contains all the nodes and edges that could possibly form a perfect consistent tree, using a set of pruning rules, and rank the observed nodes in X in descending order of their time steps, and (b) perform a bottom-up evaluation for each time step in X using a locally optimal strategy, following the descending order of the time steps.

Backbone network. We consider pruning strategies to remove the nodes and edges that cannot appear in any perfect consistent tree, given a graph G = (V, E, f) and a partial observation X. We define a backbone network G_X = (V_X, E_X), where

  • V_X consists of the nodes of G that may lie on a path from the source s to some observed node u whose length equals the observed time step of u; and

  • E_X is the set of edges of G induced by V_X.

Intuitively, G_X includes all the possible nodes and edges that may appear in a perfect consistent tree for a given partial observation. In order to construct G_X, a set of pruning rules can be developed as follows: if for a node v and every observed node u in a cascade with time step t, the shortest distance from s to v plus the shortest distance from v to u exceeds t, then v and all the edges connected to v can be removed from G.
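The pruning rule can be sketched with breadth-first distances, keeping a node only if it satisfies the necessary distance condition for at least one observed node. The adjacency-list encoding and the reversed list `radj` are illustrative assumptions.

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src via breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def backbone_nodes(adj, radj, source, observation):
    """Keep a node v only if it can lie on some time-respecting path:
    dist(source, v) + dist(v, u) <= t for at least one observed (u, t).
    radj is the reversed adjacency list, used for distances *to* u."""
    d_from_s = bfs_dist(adj, source)
    keep = set()
    for u, t in observation.items():
        d_to_u = bfs_dist(radj, u)  # distances to u in the original graph
        for v, dv in d_from_s.items():
            if v in d_to_u and dv + d_to_u[v] <= t:
                keep.add(v)
    return keep
```

Note this is only the necessary condition from the pruning rule; surviving nodes may still fail to appear in any perfect consistent tree.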

Input: graph G and partial observation X.
Output: a perfect consistent tree T in G.
1. initialize tree T with the observed nodes, setting level(v) := t for each (v, t) ∈ X;
2. compute the set of nodes within t_max hops of the source s, where t_max is the maximum time step in X;
3. if some observed node falls outside this set then return ∅;
4.–8. prune the nodes and edges that cannot appear in any perfect consistent tree, using the pruning rules;
9. construct the backbone network G_X;
10. sort the observation points in X in descending order of their time steps;
11. for each time step t in X, following the descending order, do
12.–14.   construct the bipartite graph for level t;
15.   invoke the local-search procedure and expand T with the returned forest;
16. if T is a tree then return T;
17. return ∅;
Procedure. Input: a bipartite graph, the set of observed nodes at the current level, the set of candidate parent nodes, and a time step t. Output: a forest F.
1. F := ∅;
2. construct F as a minimum weighted Steiner forest covering the observed nodes as the required nodes;
3. for each tree in F do
4.   set the level of its root to t − 1;
5. return F;

Fig. 3: The algorithm: initialization, pruning and local searching

Algorithm. The algorithm, as shown in Fig. 3, consists of the following steps:

Initialization (line 1). The algorithm starts by initializing a tree T, inserting all the observation points into T. Each node in T is assigned a level equal to its time step in X. The edge set of T is initially empty.

Pruning (lines 2-10). The algorithm then constructs a backbone network G_X with the pruning rules (lines 2-9). It initializes a node set containing the nodes within t_max hops of the source node s, where t_max is the maximum time step in X (line 2). If there exists some observed node that is not in this set, the algorithm returns ∅, since there is no path from s reaching that node within its observed time step (line 3). It further removes the redundant nodes and edges that are not in any perfect trees, using the pruning rules (lines 5-8). The network G_X is then constructed at line 9. The partial observation X is also sorted w.r.t. the time steps (line 10).

Bottom-up local searching (lines 11-17). Following a bottom-up greedy strategy, the algorithm processes each observation point as follows. For each time step t in X, it generates a (bipartite) graph. (a) It initializes a node set as the union of three sets of nodes (line 12): (i) the nodes in the observation points with time step t, (ii) the nodes in the current perfect consistent tree T with level t, and (iii) the union of the parents of the nodes in the first two sets. (b) It constructs an edge set which consists of the edges from the parent nodes to the nodes in the first two sets. (c) It then generates the bipartite graph from these node and edge sets. After the bipartite graph is constructed, the algorithm invokes a procedure to compute a “part” of the perfect tree T, namely an optimal solution for the part of the graph which contains all the observed nodes with time step t. It expands T with the returned partial tree (line 15). The above process (lines 11-15) repeats for each t until all the nodes in X are processed. The algorithm then checks if the constructed T is a tree. If so, it returns T (line 16). Otherwise, it returns ∅ (line 17). The above procedure is illustrated in Fig. 4.

Procedure. Given a (bipartite) graph and two sets of nodes (the observed nodes and their candidate parents), the procedure computes a set of trees F with the minimum total weight (line 2), such that (a) each tree in F is a two-level tree with a root among the candidate parents and leaves among the observed nodes, (b) the leaves of any two trees in F are disjoint, and (c) the trees contain all the observed nodes as leaves. For each tree in F, the procedure assigns its root a level one less than the current time step (line 4). F is then returned as a part of the entire perfect consistent tree (line 5). In practice, we may either employ linear programming, or an algorithm for the Steiner tree problem (e.g., [28]), to compute F.
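When only edge weights matter, one simple greedy sketch of this per-level step assigns each observed node the candidate parent with the minimum −log-likelihood edge, then groups the choices into a forest. This is a simplification of the Steiner-forest computation, with illustrative names and encodings.

```python
import math

def level_forest(parents_of, f):
    """For each observed node, pick the candidate parent whose edge has
    the minimum -log f weight, and group the choices into a forest.

    parents_of: dict mapping each observed node to its candidate parents.
    f: dict mapping (parent, child) edges to probabilities.
    Returns a dict root -> list of children (a two-level forest)."""
    forest = {}
    for child, candidates in parents_of.items():
        best = min(candidates, key=lambda p: -math.log(f[(p, child)]))
        forest.setdefault(best, []).append(child)
    return forest
```

Since −log is decreasing, picking the minimum-weight edge is the same as picking the most probable parent for each observed node.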

Fig. 4: The bottom-up searching in the backbone network
Example 4

The cascade in Fig. 1, as a minimum weighted perfect consistent tree, can be inferred by the algorithm as illustrated in Fig. 4. It first initializes a tree with the node Mary. It then constructs the bipartite graph induced by the edges entering Mary. Intuitively, the three nodes that are the parents of Mary are the possible nodes which passed her the message at her observed time step. The algorithm then selects the tree with the maximum probability, which is a single edge from Mike, and adds it to the current tree. Following Mike, it keeps choosing the optimal tree structure for each level, and identifies the node Jack. The process repeats until the algorithm reaches the source Ann. It then returns the perfect consistent tree as the inferred cascade from the partial observation X.

Correctness. The algorithm either returns ∅, or correctly computes a perfect consistent tree w.r.t. the partial observation X. Indeed, one may verify that (a) the pruning rules only remove the nodes and edges that are not in any perfect consistent tree w.r.t. X, and (b) the algorithm has the loop invariant that at each iteration (lines 11-15), it always constructs a part of a perfect tree as a forest.

Complexity. The algorithm runs in polynomial time, dominated by the cost of the local-search procedure. Indeed, (a) the initialization and preprocessing phase (lines 1-9) takes time linear in the size of G, (b) the sorting phase takes O(|X| log |X|) time, and (c) the bottom-up construction is bounded by the total cost of the Steiner forest computations, which is polynomial if an approximation algorithm is used [28]. In our experimental study, we utilize efficient linear programming to compute the optimal Steiner forest.

The algorithm can easily be adapted to the problem of finding the minimum perfect consistent trees, where each edge has a unit weight.

Perfect consistent SP trees. The independent cascade model may be an overkill for real-life applications, as observed in [7, 19]. Instead, one may identify the consistent trees which follow the shortest path model [19], where cascades propagate along shortest paths. We define a perfect shortest path (SP) tree rooted at a given source node s as a perfect consistent tree such that for each observation point (v, t) of the tree, t equals the shortest distance from s to v; in other words, the path from s to v in the tree is a shortest path in G. The minimum weighted (resp. minimum) consistent tree problem for SP trees is to identify the SP trees with the maximum likelihood (resp. minimum size).

Proposition 2

Given a graph G and a partial observation X, (a) it is in PTIME to find an SP tree w.r.t. X; (b) the minimum and minimum weighted consistent tree problems for perfect SP trees are NP-hard and APX-hard; (c) the minimum weighted problem is approximable within a ratio determined by the diameter of G and the maximum and minimum probabilities assigned by the diffusion function f.

We next provide an approximation algorithm for the minimum weighted problem for SP trees. Given a graph G and a partial observation X, the algorithm (not shown) first constructs the backbone graph G_X as in the bottom-up algorithm. It then treats the observed nodes as required nodes and the remaining nodes of G_X as Steiner nodes and, using the log-likelihood function as the weight function, approximately computes an undirected minimum Steiner tree. If the directed counterpart of this tree in G_X is not a tree, the algorithm transforms it into one: for each node v with more than one parent, it (a) connects s and v via a shortest path, and (b) removes the redundant edges attached to v. It then returns the result as an SP tree.

One may verify that (a) the returned tree is a perfect SP tree w.r.t. X, (b) its weight is bounded by the stated approximation ratio times the optimal weight, and (c) the algorithm runs in polynomial time, leveraging the approximation algorithm for the Steiner tree problem [32]. Moreover, the algorithm can be used for the minimum consistent tree problem for SP trees, where each edge in G has the same weight, achieving an approximation ratio determined by the diameter of G.

IV Cascades as bounded trees

In this section, we investigate the cascade inference problems for bounded consistent trees. In contrast to the intractable counterpart in Proposition 1(a), the problem of finding a bounded consistent tree for a given graph G and a partial observation X is in PTIME.

Proposition 3

For a given graph G and a partial observation X, there is a bounded consistent tree T in G w.r.t. X if and only if for each (v, t) ∈ X, dist(s, v) ≤ t, where dist(s, v) is the distance from s to v in G.

Indeed, one may verify the following: (a) if there is a node v where dist(s, v) > t, then no path to v satisfies the time constraint and no consistent tree exists; (b) if dist(s, v) ≤ t for each observation point (v, t), a shortest path tree rooted at s with each node in X as an internal node or a leaf is a bounded consistent tree. Thus, determining whether there is a bounded consistent tree takes O(|V| + |E|) time, via a traversal of G from s.
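Proposition 3 yields a direct linear-time feasibility test via breadth-first search. The adjacency-list encoding and function name below are illustrative assumptions.

```python
from collections import deque

def has_bounded_consistent_tree(adj, source, observation):
    """A bounded consistent tree exists iff dist(source, v) <= t
    for every observation point (v, t)."""
    dist = {source: 0}
    q = deque([source])
    while q:  # plain BFS from the source
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return all(v in dist and dist[v] <= t for v, t in observation.items())
```

For a chain s → a → b, observing b by step 2 is feasible, while observing b by step 1 is not.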

Given a graph G and a partial observation X, the minimum weighted bounded consistent tree problem is to identify the bounded consistent tree w.r.t. X with the minimum weight w(T) (see Section II).

Theorem 1

Given a graph G and a partial observation X, the minimum weighted bounded consistent tree problem is

  • NP-complete and APX-hard; and

  • approximable within a ratio determined by the maximum and minimum probabilities assigned by the diffusion function f over G.

We can prove Theorem 1(a) as follows. First, the decision version of the problem is to determine whether there exists a bounded consistent tree with weight no greater than a given bound. The problem is obviously in NP. To show the lower bound, one may give a polynomial-time reduction from the exact 3-cover problem (X3C). Second, to see the approximation hardness, one may verify that there exists an approximation-preserving reduction from the minimum directed Steiner tree problem.

We next provide a polynomial-time algorithm for the problem. The algorithm runs in linear time w.r.t. the size of G, with the performance guarantee stated in Theorem 1(b).

Input: graph G and partial observation X.
Output: a bounded consistent tree T in G.
1. tree T := ({s}, ∅);
2. compute the bounded DAG of s in G;
3. for each time step in X do
4.   for each node v observed at that time step and not yet in T do
5.     if dist(s, v) exceeds the observed time step then return ∅;
6.     find a path from s to v with the minimum weight;
7.     extend T with the path;
8. return T as a bounded consistent tree;

Fig. 5: The algorithm: searching bounded consistent trees via a top-down strategy

Algorithm. The algorithm is illustrated in Fig. 5. Given a graph G and a partial observation X, it first initializes a tree T with the single source node s (line 1). It then computes the bounded directed acyclic graph (DAG) [3] of the source node s, i.e., the DAG induced by the nodes and edges visited by a traversal of G from s up to t_max hops, where t_max is the maximum time step of the observation points in X (line 2). Following a top-down strategy, for each node v of an observation point, the algorithm then (a) selects a path with the minimum weight from s to v, and (b) extends the current tree T with the path (lines 3-7). If for some observation point (v, t), dist(s, v) > t, then the algorithm returns ∅ (line 5). Otherwise, the tree T is returned (line 8) after all the observation points in X are processed.

Correctness and complexity. One may verify that the algorithm either correctly computes a bounded consistent tree T, or returns ∅. For each node v in an observation point, a path of minimum weight is selected using a greedy strategy, and the top-down strategy guarantees that the paths form a consistent tree. The algorithm runs in O(|V| + |E|) time, since it visits each edge at most once following a traversal.

We next show the approximation ratio in Theorem 1(b). Observe that for a single node v in X, (a) the total weight of the selected path from s to v is no greater than its length times −log f_min, where f_min is the minimum edge probability; and (b) the weight of the counterpart of v in the optimal tree is no less than −log f_max, where f_max is the maximum edge probability. Thus the weight of each selected path is within a bounded factor of that of its counterpart in the optimal tree. As there are in total |X| such nodes, the ratio over the entire tree follows. Theorem 1(b) thus follows.
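The top-down strategy can be sketched as follows, using a hop-limited Bellman-Ford relaxation to find a minimum −log-likelihood path within each node's time bound. Names and encodings are illustrative; the paper's algorithm additionally guarantees that the selected paths merge into a single tree, which this sketch does not enforce.

```python
import math

def hop_bounded_best_path(f, source, target, max_hops):
    """Min-weight (sum of -log f) path from source to target using at most
    max_hops edges: Bellman-Ford limited to max_hops relaxation rounds.
    Path reconstruction via pred assumes probabilities below 1 (positive
    edge weights), so the predecessor chain terminates."""
    dist = {source: 0.0}
    pred = {}
    for _ in range(max_hops):
        nxt = dict(dist)  # relax from the previous round only
        for (u, v), p in f.items():
            w = -math.log(p)
            if u in dist and dist[u] + w < nxt.get(v, math.inf):
                nxt[v] = dist[u] + w
                pred[v] = u
        dist = nxt
    if target not in dist:
        return None  # target unreachable within max_hops hops
    path = [target]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return list(reversed(path)), dist[target]

def bounded_tree(f, source, observation):
    """Top-down heuristic: for each observed node (by ascending time step),
    attach a hop-bounded minimum-weight path from the source."""
    edges = set()
    for v, t in sorted(observation.items(), key=lambda x: x[1]):
        res = hop_bounded_best_path(f, source, v, t)
        if res is None:
            return None  # dist(source, v) > t: no bounded consistent tree
        path, _ = res
        edges.update(zip(path, path[1:]))
    return edges
```

With edge probabilities {s→a: 0.5, a→b: 0.5, s→b: 0.1}, a bound of 2 hops on b yields the more likely two-hop path, while a bound of 1 hop forces the direct edge.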

Minimum bounded consistent tree. We have considered the likelihood function as a quantitative metric for the quality of the bounded consistent trees. As remarked earlier, one may simply want to identify the bounded consistent trees of the minimum size. Given a social graph G and a partial observation X, the minimum bounded consistent tree problem is to identify the bounded consistent tree with the minimum size, i.e., the total number of nodes and edges. The problem is a special case of the minimum weighted bounded consistent tree problem, and its main result is summarized as follows.

Proposition 4

The problem is (a) NP-complete, (b) APX-hard, and (c) approximable within , where  is the size of the partial observation .

Proposition 4(a) and 4(b) can both be shown by constructing reductions from the  problem, which is NP-complete and APX-complete [32].

Despite the hardness, the problem can be approximated within  in polynomial time by applying the algorithm  over an instance where each edge has a unit weight. This completes the proof of Proposition 4(c).

V Experiments

We next present an experimental study of our proposed methods. Using both real-life and synthetic data, we conduct two sets of experiments to evaluate (a) the effectiveness of the proposed algorithms, and (b) the efficiency and the scalability of  and .

Fig. 6: Precision and recall of the inference algorithms over Enron email cascades (PCT: panels a-b; BCT: panels c-d) and retweet cascades (PCT: panels e-f; BCT: panels g-h)

Experimental setting. We used real-life data to evaluate the effectiveness of our methods, and synthetic data to conduct an in-depth analysis of scalability by varying the parameters of the cascades and partial observations.

(a) Real-life graphs and cascades. We used the following real-life datasets. (i) Enron email cascades. The Enron email dataset consists of a social graph of  nodes and  edges, where a node is a user, and two nodes are connected if there is an email message between them. We tracked the forwarded messages with the same subject and obtained cascades of depth no less than  with more than  nodes. (ii) Retweet cascades (RT). The Twitter dataset [35] contains more than  million posts from more than  million users, covering a period of 7 months from June 2009. We extracted the retweet cascades of the identified hashtags [35]. To guarantee that a cascade represents the propagation of a single hashtag, we removed retweet cascades containing multiple hashtags. In the end, we obtained cascades of depth more than , with node sizes ranging from  to . Moreover, we used the EM algorithm from [30] to estimate the diffusion function.

(b) Synthetic cascades. We generated a set of synthetic cascades unfolding in an anonymized Facebook social graph, which exhibits properties such as a power-law degree distribution, a high clustering coefficient, and positive assortativity [34]. The diffusion function is constructed by randomly assigning real numbers between  and  to the edges of the network. The generating process is controlled by the cascade size . We randomly chose a node as the source of the cascade; by simulating the diffusion process following the independent cascade model, we then generated cascades and assigned time steps.
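A minimal sketch of this generation step, assuming the standard independent cascade semantics (each newly activated node gets exactly one chance to activate each inactive neighbor); the function name and input format are hypothetical:

```python
import random

def simulate_ic(adj, source, seed=None):
    """Generate one cascade under the independent cascade model.
    adj maps u -> list of (v, p), where p is the activation
    probability from the diffusion function.
    Returns (parent, step): the cascade tree and the time step
    at which each node was activated."""
    rng = random.Random(seed)
    parent, step = {source: None}, {source: 0}
    frontier = [source]
    t = 0
    while frontier:
        t += 1
        nxt = []
        for u in frontier:
            for v, p in adj.get(u, []):
                # One activation attempt per inactive neighbor.
                if v not in parent and rng.random() < p:
                    parent[v], step[v] = u, t
                    nxt.append(v)
        frontier = nxt
    return parent, step
```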

(c) Partial observation. For both real-life and synthetic cascades, we define the uncertainty of a cascade in terms of the number of nodes in the cascade and the size of the partial observation : the larger the fraction of removed nodes, the higher the uncertainty. We remove nodes from the given cascades until the desired uncertainty is reached, and collect the remaining nodes and their time steps as .
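Assuming the uncertainty is the fraction of cascade nodes removed, this observation-generation step can be sketched as follows (names hypothetical; the source node, with time step 0, is always kept observed):

```python
import random

def partial_observation(steps, uncertainty, seed=None):
    """steps maps each cascade node to its time step.
    Remove nodes uniformly at random until roughly the requested
    fraction of the cascade is unobserved, keeping the source.
    Returns the partial observation as a node -> time-step dict."""
    rng = random.Random(seed)
    removable = [v for v, t in steps.items() if t > 0]  # never drop the source
    n_remove = int(round(uncertainty * len(steps)))
    removed = set(rng.sample(removable, min(n_remove, len(removable))))
    return {v: t for v, t in steps.items() if v not in removed}
```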

(d) Implementation. We implemented the following, all in C++: (i) the algorithms ,  and ; (ii) two linear programming algorithms  and , which identify the optimal weighted bounded consistent trees and the optimal perfect consistent trees, respectively, using linear programming; (iii) two randomized algorithms  and , developed to randomly choose trees from the given graphs:  follows a similar strategy to , except that for each level the Steiner forest is randomly selected (see Section III); like ,  runs on bounded directed acyclic graphs, but selects edges randomly; and (iv) to compare different implementations of , an algorithm  that uses a greedy strategy to choose the Steiner forest at each level (see Section III). We used LP_solve 5.5 as the linear programming solver.

We used a machine powered by an Intel(R) Core  GHz CPU with  GB of RAM, running Ubuntu 10.10. Each experiment was repeated multiple times, and the average is reported here.

Fig. 7: Efficiency and scalability over synthetic cascades (panels a-d: varying the cascade size and the uncertainty)
TABLE I: Accuracy of the inference algorithms over real cascades (Enron and Twitter)

Experimental results. We next present our findings.

Effectiveness of consistent trees. In the first set of experiments, using real life cascades, we investigated the accuracy and the efficiency of our cascade inference algorithms.

(a) Given a set T of real-life cascades, for each cascade we computed an inferred cascade according to a partial observation with uncertainty . Denote the nodes in the partial observation as . Intuitively, the precision is the fraction of the inferred nodes (beyond the observed ones) that belong to the original cascade, while the recall is the fraction of the missing nodes that are recovered by the inference.
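Under these definitions, the two metrics can be computed as follows (a sketch with hypothetical names, treating cascades as node sets):

```python
def precision_recall(inferred, observed, missing):
    """inferred: node set of the inferred cascade;
    observed: the partial observation X;
    missing: ground-truth nodes removed from the real cascade.
    Precision = fraction of newly inferred nodes that are truly missing;
    recall = fraction of missing nodes recovered by the inference."""
    new_nodes = inferred - observed  # nodes added beyond the observation
    hit = new_nodes & missing        # correctly recovered nodes
    precision = len(hit) / len(new_nodes) if new_nodes else 1.0
    recall = len(hit) / len(missing) if missing else 1.0
    return precision, recall
```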

For Enron email cascades, Fig. 6(a) and Fig. 6(b) show the accuracy of  and  for inferring cascades while the uncertainty is varied from  to  ( does not scale over the Enron dataset and thus is not shown). (i)  outperforms  and  on both precision and recall. (ii) When the uncertainty increases, both the precision and the recall of the three algorithms decrease. In particular,  successfully infers cascade nodes with precision no less than  and recall no less than  even when  of the nodes in the cascades are removed. Using the same setting, the performance of  and  is shown in Fig. 6(c) and Fig. 6(d), respectively. (i) Both  and  outperform , and their precision and recall decrease as the uncertainty increases. (ii)  performs better than . In particular, both  and  successfully infer the cascade nodes with precision no less than  and recall no less than , even when  of the nodes in the cascades are removed.

For retweet cascades, the precision and the recall of  and  are shown in Fig. 6(e) and Fig. 6(f), respectively. While the uncertainty increases from  to , (i)  outperforms  and , and (ii) the performance of all the algorithms decreases. In particular,  successfully infers the nodes with precision more than  and recall more than  when the uncertainty is . Similarly, the precision and the recall of  and  are presented in Fig. 6(g) and Fig. 6(h), respectively. As  does not scale on retweet cascades, its performance is not shown. While the uncertainty increases, the precision and the recall of the algorithms decrease. For all settings,  outperforms ; in particular,  correctly infers the nodes with precision no less than  and recall no less than  when the uncertainty is .

(b) To further evaluate the structural similarity of the original and inferred cascades described in (a), we also evaluated (i) the precision over the nodes with the same topological order in both cascades, and (ii) the corresponding recall, following the metric for measuring graph similarity [26]. The average results are shown in Table I for cascades of fixed depth. As shown in the table, for , the average precision is above , and the average recall is above , over both datasets. Better still, the results hold even when we set = . For , the precision and recall are above  and , respectively. For , the precision and recall are almost consistent across both datasets; however, for , the precision and recall of the inferred Enron cascades are higher than those of the inferred retweet cascades. The gap might result from the different diffusion patterns of the two datasets: more than  of the cascades in the Enron dataset have structures contained in the directed acyclic graphs of , while fewer than  of the retweet cascades follow the assumed graph structures of .

Efficiency over real datasets. In all the tests over real datasets,  and  take less than  second.  does not scale for retweet cascades, while  does not scale for either dataset. On the other hand, while  takes less than  seconds to infer all the Enron cascades, it takes less than  seconds to infer Twitter cascades where = , and  seconds when = . Indeed, the average node degree in the Twitter network is , while that in the Enron dataset is ; as such, it takes more time for  to infer cascades in the denser Twitter network. In our tests, the efficiency of all the algorithms is not sensitive w.r.t. changes to the uncertainty.

Efficiency and scalability over synthetic datasets. In the second set of experiments, we evaluated the efficiency and the scalability of our algorithms using synthetic cascades.

(a) We first evaluated the efficiency and scalability of , and compared it with  and .

Fixing the uncertainty , we varied the cascade size from  to . Fig. 7(c) shows that  scales well with the size of the cascade; indeed, it takes only  seconds to infer cascades with  nodes.

Fixing the size , we varied the uncertainty from  to . Fig. 7(d) illustrates that while all three algorithms become more efficient as the uncertainty grows,  is more sensitive to the change. All three algorithms scale well with the uncertainty.

As  does not scale well, its performance is not shown in Fig. 7(c) and Fig. 7(d).

(b) Using the same setting, we evaluated the performance of , compared with  and .

Fixing the uncertainty and varying the cascade size, the results are reported in Fig. 7(a). First,  outperforms , and is almost as efficient as the randomized algorithm . For the cascade of  nodes,  takes less than  second to infer the structure, while  takes nearly  seconds. Second, while  is not sensitive to the change,  is much more sensitive.

Fixing the cascade size and varying the uncertainty, Fig. 7(b) shows the performance of the three algorithms.  and  are less sensitive to the change than . This is because  and  identify bounded consistent trees by constructing shortest paths from the source to the observed nodes; when the maximum depth of the observation points is fixed, the total number of nodes and edges visited by  and  is not sensitive to the change.

Summary. We can summarize the results as follows. (a) Our inference algorithms infer cascades effectively. For example, the original cascades and the ones inferred by  have structural similarity higher than  in both real-life datasets. (b) Our algorithms scale well with the size of the cascades and the uncertainty, and seldom exhibit their worst-case complexity. For example, even for cascades with  nodes, all of our algorithms take less than two seconds.

VI Conclusion

TABLE II: Summary: complexity and approximability. Each of the consistent tree problems (for both bounded and perfect trees) is NP-complete and APX-hard; the corresponding approximation bounds and running times are given in the sections above.

In this paper, we investigated the cascade inference problem based on partial observations. We proposed the notion of consistent trees for capturing inferred cascades, namely, bounded consistent trees and perfect consistent trees, as well as quantitative metrics based on either minimizing the size of the inferred structure or maximizing its overall likelihood. We established the intractability and hardness results for the optimization problems, as summarized in Table II. Despite the hardness, we developed approximation and heuristic algorithms for these problems, with performance guarantees on inference quality. We verified the effectiveness and efficiency of our techniques using real-life and synthetic cascades. Our experimental results show that our methods can efficiently and effectively infer the structure of information cascades.


  • [1] D. Arthur, R. Motwani, A. Sharma, and Y. Xu. Pricing strategies for viral marketing on social networks. Internet and Network Economics, pages 101–112, 2009.
  • [2] N. Bailey. The mathematical theory of infectious disease and its applications. 1975.
  • [3] J. Bang-Jensen and G. Z. Gutin. Digraphs: Theory, Algorithms and Applications. Springer, 2008.
  • [4] S. Bikhchandani, D. Hirshleifer, and I. Welch. A theory of fads, fashion, custom, and cultural change as informational cascades. Journal of political Economy, pages 992–1026, 1992.
  • [5] C. Budak, D. Agrawal, and A. El Abbadi. Limiting the spread of misinformation in social networks. In WWW, 2011.
  • [6] M. Cha, F. Benevenuto, Y.-Y. Ahn, and P. K. Gummadi. Delayed information cascades in flickr: Measurement, analysis, and modeling. Computer Networks, 56(3):1066–1076, 2012.
  • [7] W. Chen, C. Wang, and Y. Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In KDD, 2010.
  • [8] K. Dave, R. Bhatt, and V. Varma. Modelling action cascades in social networks. In AAAI, 2011.
  • [9] P. Domingos and M. Richardson. Mining the network value of customers. In KDD, 2001.
  • [10] M. Eslami, H. R. Rabiee, and M. Salehi. Dne: A method for extracting cascaded diffusion networks from social networks. In The Third IEEE International Conference on Social Computing, 2011.
  • [11] H. Fei, R. Jiang, Y. Yang, B. Luo, and J. Huan. Content based social behavior prediction: a multi-task learning approach. In CIKM, 2011.
  • [12] J. Goldenberg, B. Libai, and E. Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, pages 211–223, August 2001.
  • [13] J. Goldenberg, B. Libai, and E. Muller. Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth. Marketing Letters, pages 211–223, 2001.
  • [14] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. TKDD, 5(4):21, 2012.
  • [15] M. Granovetter. Threshold models of collective behavior. American Journal of Sociology, 83(6), 1978.
  • [16] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003.
  • [17] D. Kempe, J. Kleinberg, and E. Tardos. Influential nodes in a diffusion model for social networks. In ICALP, 2005.
  • [18] W. O. Kermack and A. G. Mckendrick. A contribution to the mathematical theory of epidemics. Proc R Soc Lond A, 115:700–721, 1927.
  • [19] M. Kimura and K. Saito. Tractable models for information diffusion in social networks. PKDD, pages 259–271, 2006.