With the emergence of large-scale online social networking applications in the last decade, influence maximization in online social networks has been widely regarded as one of the fundamental and popular problems in social data management and analytics. In the seminal paper by Kempe et al., this problem is defined as follows. Given a social network $G$ as well as an influence propagation (or cascade) model, the problem of influence maximization (im) is to find a set of $k$ initial users (referred to as seeds) so that they eventually influence the largest number of individuals (referred to as influence spread) in $G$. Effective solutions to the im problem open up opportunities for commercial companies to design intelligent recommendation systems and viral marketing strategies.
Kempe et al. proved that the im problem is NP-hard, and presented an elegant greedy approximation algorithm applicable to several popular cascade models, including the independent cascade (ic) model. A key strength of this algorithm lies in its guarantee that the influence spread is within $(1 - 1/e)$ of the optimal influence spread, where $e$ is the base of the natural logarithm. Since then a large body of work (e.g., [3, 4, 5, 6]) has been proposed to improve the efficiency of im techniques as well as the quality of influence spread. Variants of this classical im problem have also been proposed in recent times, such as topic-aware im, conformity-aware im, and competitive im. A recent study provides a uniform benchmark to evaluate these classical im solutions. In summary, this elegant work by Kempe et al. has had significant influence on the research community. (Footnote 1: This work has garnered over 4,800 citations in Google Scholar and received the test-of-time award in ACM KDD 2014.)
I-A Limitations of the Definition of Classical IM Problem
The classical im problem and its solution are grounded on the following implicit assumption. Assume that it takes some time for the influence spread from a seed set to reach the largest number of nodes in a social network. This propagation time is assumed to be small, so that the topology of the network can be assumed to remain static while influence propagates. Consequently, the topology of the network is completely known during the propagation process. This is important as the dynamics of influence propagation for all cascade models in the classical im problem demand that the neighbors of a node are known. For example, consider the popular ic model. In this model, we start with an initial set of active nodes, and the influence propagation unfolds in discrete steps: when a node $u$ becomes active at step $t$, it gets a single chance to activate each of its inactive neighbors $v$ with probability $p$. If $u$ is successful in activating $v$, then $v$ will become active in step $t+1$. This process continues until no more activations are possible. Clearly, successful realization of this propagation process requires that the neighbors of each node are known so that influence can be propagated to its inactive neighbor(s).
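The ic dynamics described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `simulate_ic`, the adjacency-dictionary representation, and the use of a single uniform activation probability `p` for every edge are assumptions made here, not details taken from the paper.

```python
import random


def simulate_ic(graph, seeds, p, rng):
    """Run one independent-cascade realization and return the active nodes.

    graph: dict mapping each node to a list of its out-neighbors (assumed format).
    seeds: iterable of initially active nodes.
    p:     activation probability, assumed identical for every edge.
    rng:   a random.Random instance, so that runs are reproducible.
    """
    active = set(seeds)
    frontier = list(seeds)  # nodes that became active in the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            # u gets exactly one chance to activate each inactive neighbor.
            for v in graph.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier  # stop once no new node was activated
    return active
```

Note that the loop reads `graph.get(u, ())` at every step, which makes the dependence on a fully known and static neighborhood explicit: if edges appear while the cascade is running, this simulation never sees them.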
A large volume of subsequent work on influence maximization (e.g., [3, 4, 6, 8]) also implicitly or explicitly makes the above assumption, as it is built on top of the classical im problem. Unfortunately, recent studies reveal that the aforementioned assumption may not hold in practice, as the time taken for influence propagation is significant, during which the topology of these networks evolves rapidly. For instance, one study tested the spread of web advertisements through emails and websites and found that on average it takes 1.5 days for an intermediary node to propagate the messages, and that the spread does not reach the largest scope until at least 8 propagations. That is, each cascade of web advertisement may take about two weeks or more. Meanwhile, it has been reported that the number of active users of Facebook increased from just a million in 2004 to 1 billion in 2012, an average growth of 8.57% per month. Similarly, the number of active users of Twitter increased from 100 million in September 2011 to 200 million in December 2012, a growth of 4.73% per month. In particular, it has been shown that the number of nodes for Answers, Delicious and LinkedIn grows quadratically in the number of elapsed weeks, while for Flickr it grows exponentially. In summary, the above studies show that influence propagation can take a significant amount of time (several weeks) to reach the largest scope, during which social networks evolve.
Due to the aforementioned mismatch between the characteristics of real-world social networks and assumption made by the classical im problem, the quality of seeds selected by a state-of-the-art im algorithm is adversely impacted. In particular, the seeds selected from may not maximize the spread of influence due to the evolutionary nature of during influence propagation process. To elaborate further, suppose the influence propagation of takes time and terminates when there is no other node that can be activated. Meanwhile, the social network at time point evolves to during . Note that the classical im problem aims to compute from at ignoring its evolution during . Importantly, may not exhibit the maximal expected influence in . That is, the seed set in may not necessarily be identical to . Note that it is not possible in practice to run a state-of-the-art im algorithm on at time to get and then select seeds from as , which is the “best” seed set in that exhibits the maximal influence in . This is because it is unrealistic to assume that the complete topology of is known at time . We further illustrate this problem with the following example.
Suppose we wish to select one seed () at time on the social network depicted in Fig. 1(a). A state-of-the-art im algorithm will run on to select as the seed. As influence propagation may take time, assume that during this period evolves to as shown in Fig. 1(b). Specifically, two new nodes (i.e., and ) and three edges (i.e., , and ) are added during this time period. Consequently, may not influence the most number of nodes after the completion of influence propagation as during this period the topology of has evolved to . In fact, is a better choice as it may influence more nodes than after .
At first glance, it may seem that this problem can be easily addressed by running the im technique on instead of , which may result in seed set . Unfortunately, it is difficult at time to predict the topology of after time (i.e., ) in order to run an im technique on the latter! In other words, the complete topology of the network at time is unknown at time . Observe that this problem occurs regardless of when the im algorithm is run. For instance, suppose it is run at time on to select . However, now may have evolved to (Fig. 1(c)) during the influence propagation process and as a result may not be the optimal seed for maximizing influence anymore.
Fundamentally, the manifestation of this problem is due to the maximization of influence on a network instance at time (which is consistent with the classical im problem definition) instead of discovering seeds that maximize influence on a future instance of the network (i.e., at time ) assuming that influence propagation takes time. However, as remarked earlier it is difficult to know the exact topology of the future network at time .
I-B Can Recent IM Efforts on Dynamic Networks Address the Limitations?
Recently, several efforts have studied the im problem in the context of dynamic or temporal social networks [15, 16, 17, 18, 19]. At first glance, it may seem that the aforementioned problem of classical im can be addressed by deploying these techniques as they consider evolutionary nature of the underlying network. Unfortunately, this is not the case as these techniques either assume that the topology of the network is completely known at a specific time point or are oblivious to the impact of influence propagation time on the network state. Broadly, these techniques repeatedly run classical im algorithms (or their incremental versions) at different time points in order to find up-to-date seed sets. However, this strategy cannot address the aforementioned limitation of the classical im problem regardless of the way time points are separated or the choice of im algorithm. For instance, suppose at time evolves to at time where . Intuitively, one may select at time in order to maximize the influence at a different temporal state. However, can only assure that the influence is maximized in . As remarked earlier, as influence propagates from , the network evolves as well. Consequently, whenever influence of reaches the largest scope, the network may have evolved from to . Hence, repeatedly selecting seeds using a conventional im algorithm or its variant cannot lead to a superior quality seed set that maximizes influence at a future time point.
As an example, consider the MaxG algorithm in , which ignores the impact of influence propagation time. In other words, it assumes that the topology of the whole network can be easily observed at any timestamp, which is consistent with the assumption made by classical im as discussed earlier. It is worth noting that if the time consumed by the influence propagation process is not ignored, the probability of update operation has to be decayed even if the marginal gain increases. This is because the later a new node is interchanged into , the lesser is the time available for it to propagate to the influence scope as expected. In fact, it has been argued in , selecting same seeds at different timepoints may result in different influence spread in a dynamic network. The approach in  firstly separates a time period into several equal-length intervals (of length ) based on the entire evolution period of the network. Then, an algorithm is presented to select seeds at in order to maximize the influence at . Unfortunately, it demands the future network state as input in order to compute . For instance, in order to select in at time in Fig. 1(a), it requires at time (Fig. 1(c)) as input. Obviously, it is unrealistic to assume that is completely known at .
This paper makes four contributions. First, we theoretically prove that if the aforementioned assumption related to the evolutionary nature of a social network and the impact of influence propagation time on seed selection is jettisoned by the classical im problem, then the approximation guarantee for greedy algorithms, namely that the influence spread is within $(1 - 1/e)$ of the optimal influence spread, no longer holds. Note that a large number of subsequent works [3, 8, 22] have used this guarantee as the building block to design new algorithms and derive new results.
Second, we revisit the classical im problem and redefine it as the proteus-im (Propagation Time-conscious Influence Maximization) problem by jettisoning the aforementioned assumption made by the former. (Footnote 2: The name honors Proteus, a sea god in Greek mythology, noted for his ability to assume different forms and to prophesy. The proteus-im problem discovers seeds from a social network that assumes a different form from the current instance at the end of the influence propagation process.) Intuitively, it is defined as follows. Given a network at the current time that may evolve into a different network at a target time, the goal of the proteus-im problem is to select seeds at the current time such that information spread from these seeds can reach the largest scope in the target network instead of the current one. (Footnote 3: We shall elaborate in Section III the justification for the choice of the network from which the seed set is selected.)
Observe that the proteus-im problem differs from the classical im problem in the following ways. Firstly, we assume that the underlying network evolves during the influence propagation time and that the complete topology of the target network is unknown at the current time. Secondly, the seeds are selected in a network (the current network) whose topology is not identical to the one in which influence finally propagates to the largest scope (the target network). In comparison, in the classical im problem these two network topologies are assumed to be identical. Thirdly, the influence propagation path in our problem may consist of nodes and edges that are absent from the current network. In comparison, the influence propagation path in classical im, although randomly distributed, is strictly sampled from the edges in the current network. We also prove that the proteus-im problem is NP-hard and that the expected influence is submodular.
Third, we propose a greedy algorithm called proteus-genie to address the proteus-im problem. Specifically, it selects nodes at the current time whose expected influence at the target time is maximal. A distinguishing feature of the algorithm is that it takes into account the evolution of the underlying network during the influence propagation process. Note that this is a challenging problem, as we cannot make the unrealistic assumption that the topology of the network at the target time is completely known a priori. To tackle this challenge, we resort to a popular network evolution model called the Forest Fire Model (ffm) to predict the topology of the network at the target time. (Footnote 4: As our approach is loosely coupled with the network evolution model, other models can also be adopted in this regard.) To the best of our knowledge, ffm has never been utilized in the context of im. Specifically, proteus-genie iteratively selects nodes with the largest marginal gain in expected influence, taking into account the evolution of the network predicted by ffm. The proposed greedy algorithm can be time consuming for large networks, as in each iteration we need to simulate network evolution and then select the next optimal seed node. Hence, we propose a Reverse Reachable (rr) set-based algorithm called proteus-seer, which significantly reduces the running time while preserving similar influence spread quality. It first selects the number of instances by utilizing a recent classical im technique and then iteratively predicts that many instances of the target network. We select candidate seeds from each instance and aggregate them to finally select the top-$k$ seeds.
Fourth, we investigate the performance of proteus-genie and proteus-seer on real-world social networks. Our experimental study reveals that, as predicted by theory, algorithms designed for the proteus-im problem consistently outperform state-of-the-art classical im techniques in terms of influence spread quality for all datasets, even when the underlying network has changed slightly during influence propagation (i.e., may be small). Interestingly, our results emphasize that it is not necessary to possess a complete and accurate knowledge of the topology of to achieve such superior performance. Note that this is important as assuming such complete knowledge renders the im problem unrealistic. Additionally, proteus-seer significantly reduces the running time while preserving similar result quality.
|A social network at time|
|The number of seeds|
|Independent cascade probability|
|The expected influence function in static network|
|The number of influenced nodes through edges|
|The expected influence of dynamic network at time|
|Forward burning probability|
|Backward burning ratio|
|Number of rounds of simulation in computing expected influence|
|Number of rounds for simulating the evolution|
|Number of predicted instances in proteus-seer|
|The set of nodes in that can reach|
I-D Paper Organization
The rest of this paper is organized as follows. We review classical im techniques in Section II. We formally define the proteus-im problem in Section III. Sections IV and V present the proteus-genie and proteus-seer algorithms to address this problem. We present the experimental results in Section VI. Finally, we conclude this paper in the last section.
II Classical Influence Maximization Problem
In this section, we review related work on the classical influence maximization (im) problem for both static and dynamic networks. Table I describes the key notations used in this paper.
II-A IM in Static Networks
Kempe et al. were the first to consider choosing the seeds for the im problem as a discrete optimization problem. In their seminal paper, they defined the classical influence maximization problem as follows.
[Classical Influence Maximization Problem] Let $G = (V, E)$ be a network and $\sigma(S)$ be the expected influence of a set of nodes $S \subseteq V$ under a given cascade model, measured by the number of nodes that are eventually influenced. Then given a budget $k$, the influence maximization (im) problem aims to select a seed set $S^*$ ($S^* \subseteq V$, $|S^*| = k$) such that the expected influence spread is maximized, which can be formally described as
$$S^* = \arg\max_{S \subseteq V,\, |S| = k} \sigma(S).$$
Kempe et al. proposed a general greedy algorithm that returns near-optimal results (i.e., within $(1 - 1/e)$ of the optimal influence spread). Since then a large body of im techniques [1, 3, 8, 24, 25] has been reported in the literature to improve efficiency, scalability, and influence spread quality. As highlighted in Section I, the classical im algorithms assume that the topology of the underlying network is completely known and that it does not evolve during influence propagation. Hence, they suffer from the limitations discussed earlier, leading to relatively poorer quality of influence spread (detailed in Section VI).
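For concreteness, the general greedy (hill-climbing) strategy can be sketched as follows. This is a minimal sketch under stated assumptions rather than the original implementation: the expected influence is estimated by Monte Carlo simulation of the ic model with one uniform probability `p`, and the names `estimate_spread` and `greedy_im`, the adjacency-dictionary format, and the default number of simulation rounds are all illustrative choices.

```python
import random


def estimate_spread(graph, seeds, p, rounds, rng):
    """Monte Carlo estimate of the expected ic influence of `seeds`."""
    total = 0
    for _ in range(rounds):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            next_frontier = []
            for u in frontier:
                for v in graph.get(u, ()):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        total += len(active)
    return total / rounds


def greedy_im(graph, k, p, rounds=100, seed=0):
    """Pick up to k seeds, each time adding the node with the best estimated spread."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    chosen = []
    for _ in range(k):
        best, best_spread = None, -1.0
        # Sorted iteration keeps tie-breaking deterministic.
        for v in sorted(nodes - set(chosen)):
            spread = estimate_spread(graph, chosen + [v], p, rounds, rng)
            if spread > best_spread:
                best, best_spread = v, spread
        if best is None:  # fewer than k nodes available
            break
        chosen.append(best)
    return chosen
```

Every call to `estimate_spread` uses the same fixed `graph`, which is exactly the static-topology assumption that the remainder of this paper questions.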
The seed set in classical im problem (Definition 1) may not exhibit the largest expected influence when the underlying network evolves during influence propagation. In other words, , where , , such that .
The approximation guarantee that the influence spread is within $(1 - 1/e)$ of the optimal influence spread for greedy hill-climbing-based classical im algorithms does not hold when the underlying network evolves during influence propagation.
We can easily design a network evolution scenario showing that the guarantee eventually fails to hold. Without loss of generality, suppose consists of nodes and only one edge. Let evolve in the following way: at each step, it replicates a copy of . For instance, consists of and , which are isomorphic but disconnected from each other; consists of and , and so on. Thus, () will consist of disconnected copies of . Suppose we are maximizing the influence spread using greedy hill-climbing-based classical im algorithms in . According to the quality guarantee, , and obviously . Therefore . However, we can easily find that , which can be achieved by selecting a seed from each of disconnected copies of in . Obviously, , , that is, .
Remark. Kempe et al. showed in  that a non-progressive (i.e., nodes can switch from inactive to active state and vice versa) im problem can be reduced to a progressive (i.e., nodes can only switch from inactive to active state but not vice versa) case in a different graph. Unfortunately, when the underlying network evolves during influence propagation, the im problem cannot be transformed to a non-progressive case (and subsequently to progressive case) due to the following reasons. They designed a new concept, namely layered graph, which is defined as follows. Given a graph and time limit , a layered graph on contains a copy for each node in and each time step . Firstly, in a non-progressive case, no matter how many layers in the layered graph , the topology of the network in each layer is completely known. However, in reality a social network does not satisfy this property. That is, if we model the evolving network into a layered graph, then the topology of the network in each layer is unknown and these networks in different layers are different. Secondly, the influence in non-progressive case is measured by the sum over the number of time steps that all nodes are active. However, in the presence of evolution it cannot be measured in this way. On one hand, is not fixed in our problem setting; on the other hand, the influence should be measured as the number of active nodes at a target time (i.e., the end of propagation in progressive im) instead of summing over the number of steps the nodes are active.
II-B IM in Dynamic Networks
Recently, there have been increasing efforts to address the im problem in dynamic networks. Zhuang et al. proposed an algorithm called MaxG to select seed nodes at a specific time step. It utilizes a heuristic probing strategy such that at a target time step, it only needs to probe a limited number of nodes, whose change in the local connections can best uncover the actual influence propagation process. As remarked in Section I-B, it assumes that the topology of the whole network can be easily observed at any timepoint. The same limitation also exists in related work that focuses on tracking influential nodes. More recently, an index model based on rr sets was proposed to answer influence maximization queries at any temporal state during network evolution. Similar to the above efforts, this work also suffers from two key drawbacks. Firstly, it assumes that every atomic evolution step (e.g., single node/edge addition) can be fully observed at any timepoint, which is unrealistic in practice. Secondly, it ignores the influence propagation time and fails to anticipate the network state during influence propagation. Consequently, any answer it returns (i.e., a set of seeds) to an influence maximization query at a given time may not necessarily generate the expected influence cascade, as the network in which influence eventually propagates is typically not the same as the one at query time, based on which the query is answered.
Aggarwal et al.  studied the problem of selecting seed nodes at time , such that a piece of information propagated from these nodes will spread to the largest scope (i.e., number of nodes) at time , taking into account that the network may evolve during the period from to . However, as discussed in Section I-B, it assumes that the complete topology of the final network where influence eventually propagates to the largest scope is known and seeds are selected from this “known” network.
III Propagation Time-conscious Influence Maximization
In this section, we revisit this decade-old im problem and redefine it to address the aforementioned limitation. We begin by introducing some terminology that we shall be using in this paper. Then, we formally redefine the classical im problem as propagation time-conscious im problem.
We model a social network as directed graph , where nodes in represent individuals in the network and edges in represent relationships between them. The order of is and its size is . Recall that traditional im assumes influence propagates between nodes according to a specific cascade model and selects nodes in as seeds to spread a piece of information such that the information will be propagated to the maximal number of other nodes. However, such influence propagation can take time in reality (which can be several weeks). During this time, the social network may evolve from at time to at time . We refer to as current network and as target network. Correspondingly, and are referred to as current time and target time, respectively. For the sake of generality, we assume that is given by the user as it is application and network dependent. We assume and as most real-world social networks grow over time. Furthermore, and . We denote the expected influence at time (i.e., the number of influenced nodes at ) for seeds under a given cascade model as . For ease of exposition, in the sequel, we assume the independent cascade (ic) model, where influence propagates according to an independent probability along any edge , for influence propagation as it is one of the most popular model in the literature. However, our proposed problem is also applicable to other types of cascade models.
III-B Redefining IM Problem
The classical influence maximization problem (Definition 1) ignores the influence propagation time which can be significant in reality, during which the underlying social network may evolve. Hence, we formally redefine this classical influence maximization problem as follows.
[Propagation Time-conscious Influence Maximization Problem] Let be the current network at time and be the budget. Let be the influence propagation time during when evolves to where and . Then, the goal of Propagation Time-conscious Influence Maximization (proteus-im) Problem is to select a set of seed nodes () at such that the expected influence spread is maximized at assuming that the complete topology of is unknown at . That is,
Observe that according to the above definition, seeds are selected from current instance of the network instead of future instances of the network i.e., . This is because it is difficult to know at which users may potentially join or leave a social network in the future (before ), how will they be connected to other users, and whether they will be part of the seeds. In fact, as remarked earlier, it is unrealistic to assume accurate and complete topological knowledge of future instances of the social network (i.e., ) at time . Hence, given that influence propagation may take time, it is more realistic to choose a seed set (i.e., users who currently exist in the social network) in order to maximize the expected influence spread in the target network . Also, observe that in the classical im problem, as the topology of the network is assumed to be static since is negligible.
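The expected influence function at an arbitrary time $t$ for a node set $S$ under the ic model, namely $\sigma_t(S)$, defined in Definition 2 is submodular.
Let $R(v, X)$ be the set of nodes that can be reached from $v$ on a path comprising the live edges $X$ at time $t$, and let $\sigma_X(S)$ be the number of nodes that can be reached from $S$ through $X$. In other words, $\sigma_X(S) = |\bigcup_{v \in S} R(v, X)|$. Given two node sets $S \subseteq T$ and a node $v$, consider the following expression: $\sigma_X(S \cup \{v\}) - \sigma_X(S)$, which is the number of elements in $R(v, X)$ that are not already in $\bigcup_{u \in S} R(u, X)$. It is at least as large as the number of elements in $R(v, X)$ that are not in $\bigcup_{u \in T} R(u, X)$. Hence, $\sigma_X(S \cup \{v\}) - \sigma_X(S) \ge \sigma_X(T \cup \{v\}) - \sigma_X(T)$, that is, $\sigma_X$ is submodular. Moreover, the expected influence of $S$ at $t$ over all possible $X$, i.e., $\sigma_t(S)$, can be computed as
$$\sigma_t(S) = \sum_{X} \Pr[X] \cdot \sigma_X(S).$$
According to this equation, $\sigma_t(S)$ can be viewed as a non-negative linear combination of submodular functions, which is also submodular.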
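The diminishing-returns step of this argument can be checked by brute force on a small live-edge graph. The sketch below is only a sanity check under assumptions made here: one fixed live-edge outcome is given as an adjacency dictionary, and the names `reachable` and `is_submodular` are illustrative. It enumerates every pair of nested seed sets and every extra node, and verifies that the marginal gain with respect to the smaller set is never below the gain with respect to the larger one.

```python
from itertools import combinations


def reachable(live_edges, sources):
    """Nodes reachable from `sources` along live edges (sources included)."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        u = stack.pop()
        for v in live_edges.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_submodular(live_edges, nodes):
    """Exhaustively test gain(S, v) >= gain(T, v) for all S subset of T, v not in T."""
    subsets = [set(c) for r in range(len(nodes) + 1)
               for c in combinations(nodes, r)]
    for small in subsets:
        for large in subsets:
            if not small <= large:
                continue
            for v in nodes:
                if v in large:
                    continue
                gain_small = (len(reachable(live_edges, small | {v}))
                              - len(reachable(live_edges, small)))
                gain_large = (len(reachable(live_edges, large | {v}))
                              - len(reachable(live_edges, large)))
                if gain_small < gain_large:
                    return False
    return True
```

The enumeration is exponential in the number of nodes, so it is meant for toy graphs only; it illustrates the inequality and is not a substitute for the proof.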
The proteus-im problem defined in Definition 2 under the ic model is NP-hard.
Given a current network, suppose we are solving proteus-im over it at the current time. If the network remains static for a sufficiently long period until the influence propagation ends at the target time (i.e., the influence reaches the largest scope), then the target network is the same as the current network. Therefore, proteus-im on the target network is equivalent to classical im on the current network. That is, the problem of maximizing the expected influence in the target network degenerates to the problem of maximizing the expected influence in the current network. Therefore, proteus-im is at least as hard as im, which is NP-hard.
Since the proteus-im problem is NP-hard, in the sequel we present two approximate solutions. It is worth emphasizing that given the rich body of work on classical im techniques, our design principle behind these solutions is not to jettison all these efforts but to leverage on the benefits of these techniques wherever possible, while bringing in novel ideas to address the aforementioned limitations of classical im. Hence, our first solution is a greedy hill-climbing approach called proteus-genie. Our second solution, called proteus-seer, exploits Reverse Reachable (rr) set and is significantly more efficient than proteus-genie while preserving good result quality.
IV A Greedy Solution
In this section, we present a novel greedy algorithm called proteus-genie (Propagation Time-conscious GrEedy selectioN of Influential sEeds) that addresses the proteus-im problem. Observe that designing such an algorithm is challenging. On one hand, it is unrealistic to assume complete knowledge of the topology of the target network at the current time. On the other hand, without knowing the topology of the target network it is very difficult to compute the expected spread in it using existing cascade models (e.g., the ic model).
We tackle this challenge by predicting the expected topology of from by exploiting a popular network evolution model called the Forest Fire Model [23, 27]. Consequently, we utilize this predicted topology of to determine the expected spread in it using an existing cascade model. We begin by briefly introducing this model. Interestingly, as we shall see in Section VI, by leveraging the predicted topology of , our proposed algorithms can consistently produce superior quality seeds compared to classical im techniques. That is, we do not need to know the actual topology of at time to produce superior quality seeds!
IV-A Forest Fire Model (FFM)
The majority of social networks are evolutionary in nature and exhibit a series of properties and phenomena, including shrinking diameter, the densification power law, etc. Several network evolution models [14, 23, 27, 28] have been proposed in the literature to simulate the evolution of real-world online social networks. Among these models, we chose the Forest Fire Model (ffm), as it outperforms other models. Formally, this model is defined as follows.
[Forest Fire Model] Let the network initially consist of only the first node. Given an incoming node $v$ at some time step, the network is updated according to the following rules.
(1) Uniformly select an ambassador node $w$ from the existing nodes and establish a directed edge from $v$ to $w$.
(2) Sample two numbers from a pair of binomial distributions whose means are determined by the forward and backward burning probabilities, respectively. Afterwards, $v$ uniformly selects the sampled numbers of in-links and out-links incident to $w$, respectively. Let $w_1, w_2, \ldots$ be the other ends of the selected links. In particular, $p$ is a preset forward burning probability and $r$ is a preset backward burning ratio such that $rp$ is the backward burning probability.
(3) Establish directed edges from $v$ to the nodes reached through the selected in-links, and similarly to the nodes reached through the selected out-links. Then, apply step (2) recursively for each of $w_1, w_2, \ldots$ until there is no new link to be added. As this process continues, each node can be visited only once, so that no cyclic sub-structure arises.
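The three steps above can be sketched as a single node-arrival routine. Several details here are assumptions and are not taken from this paper: the number of links burned at each visited node is drawn from a geometric distribution with success probability `p` (forward) or `r * p` (backward), which is one commonly cited formulation of the Forest Fire Model, whereas the text above speaks of binomial draws; all new edges point from the arriving node to the burned nodes; and the names `geometric`, `add_edge`, and `forest_fire_add_node` are hypothetical.

```python
import random


def geometric(q, rng):
    """Count successes before the first failure; the mean is q / (1 - q)."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must lie in [0, 1)")
    count = 0
    while rng.random() < q:
        count += 1
    return count


def add_edge(out_links, in_links, u, v):
    """Record the directed edge u -> v in both adjacency maps."""
    out_links[u].add(v)
    in_links[v].add(u)


def forest_fire_add_node(out_links, in_links, new_node, p, r, rng):
    """Attach new_node to the graph in place via one forest-fire burn."""
    existing = sorted(out_links)  # nodes present before the arrival
    out_links[new_node] = set()
    in_links[new_node] = set()
    if not existing:
        return  # the very first node has nobody to link to
    # Step (1): pick an ambassador uniformly and link to it.
    ambassador = rng.choice(existing)
    add_edge(out_links, in_links, new_node, ambassador)
    visited = {new_node, ambassador}
    frontier = [ambassador]
    while frontier:
        w = frontier.pop()
        # Step (2): burn forward along out-links of w, then backward along in-links.
        for links, q in ((out_links[w], p), (in_links[w], r * p)):
            fresh = sorted(u for u in links if u not in visited)
            burned = rng.sample(fresh, min(geometric(q, rng), len(fresh)))
            # Step (3): link to every burned node and continue the burn from it.
            for u in burned:
                visited.add(u)  # each node is visited at most once
                add_edge(out_links, in_links, new_node, u)
                frontier.append(u)
```

Starting from two empty dictionaries and calling `forest_fire_add_node` once per arriving node grows a network incrementally; the same routine can be applied to an observed current network to generate one candidate instance of a future network.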
It has been shown that the network generated by ffm satisfies the majority of real-world network properties, including not only static ones (e.g., heavy-tailed in-degrees and out-degrees) but also dynamic ones (e.g., the densification power law and shrinking diameter). It has also been demonstrated that the evolution of many real-world networks can be well simulated and predicted using this model. Therefore, we utilize ffm to predict the evolution of a network up to the target time. Specifically, our proteus-genie algorithm integrates ffm with node selection during influence maximization to facilitate the discovery of superior quality seeds. We now elaborate on this algorithm in detail.
IV-B The proteus-genie Algorithm
The goal of the proteus-genie algorithm is to greedily select the nodes with the maximal marginal expected influence taking into account the evolution of the underlying network from time to by predicting its topology using ffm.
Intuitively, seeds selection in proteus-genie is as follows. Firstly, given the current network at time , it evaluates the marginal expected influence of all nodes that are predicted to be in at time , namely . Note that the topological structure of at target time is generated using ffm based on . The forward burning probability and backward burning ratio are selected by fitting the model using the network evolution historical logs before . Secondly, it selects the node with the largest expected influence as the first node and removes it from . Thirdly, it performs the previous two steps iteratively for rounds such that it selects seeds as . Observe that in previous steps, we generate one target network using ffm, which results in a deterministic network at time . However, the network evolution using ffm during to is a random process which cannot be accurately described using a single-round simulation. Therefore, the previous three steps are executed for rounds independently, resulting in different instances of , denoted by . Consequently, the seeds sets are generated after rounds. Finally, it aggregates the ranks of these seeds and selects the top- nodes with the highest overall ranks as the final seed set . We now formally describe the algorithm.
The formal procedure is outlined in Algorithm 1. Firstly, it simulates the evolution of the current network to a predicted target network using ffm (Definition 3) and then initializes a seed set instance as empty (Lines 3-4). Then, it iteratively selects seed nodes into this instance (Lines 5-12). For the selection of each seed node, we generate a graph by removing each edge in the predicted target network independently with probability $1 - p$, resulting in a spanning subgraph. In this manner, the remaining edges can be viewed as the live-edge set at the target time, from which we can compute the marginal influence of each candidate node. This process repeats for a preset number of simulation rounds and the marginal influences of each node are aggregated (Lines 6-9). Afterwards, the algorithm selects the node with the maximal accumulated marginal influence so far, inserts it into the seed set instance, and removes it from the candidates. Meanwhile, it also records the rank of each seed in the instance. The above steps are iteratively performed for the preset number of evolution rounds, until each seed set instance is filled with $k$ seeds (Lines 2-12).
So far, we have instances of seed node set, each of which consists of nodes as well as their ranks . Hence, for each of the nodes that appears in at least once, the algorithm aggregates its ranks (Lines 13-15). Finally, it selects the top- nodes as the final seed set (Lines 16-18).
Consider Fig. 2. Suppose , , and . Let the current network at time be as shown in the left-hand side of the figure. First, the proteus-genie algorithm utilizes ffm to randomly predict an instance of the target network at time . Then it randomly removes each edge with probability for times from . This results in three instances of influence. Accordingly, it finds a ranked seed set consisting of the top- node with the largest expected influence scope over these instances. Afterwards, these steps are repeated twice (i.e., ) by randomly predicting two other instances of the target network, resulting in and . Similarly, the algorithm selects another two seed sets, and . Finally, it assembles into a bag of nodes , from which the top- node with the maximal frequency is returned.
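The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1: it assumes the predicted target networks have already been generated (the ffm step is not shown), represents each network as a dictionary from a node to its out-neighbors, and keeps each edge as live with probability `p`. The Borda-style rank aggregation in `aggregate` is an assumption made for illustration, since the exact aggregation rule is not spelled out here; all function names are hypothetical.

```python
import random
from collections import defaultdict, deque


def live_edge_sample(graph, p, rng):
    """Keep each directed edge independently with probability p."""
    return {u: [v for v in nbrs if rng.random() < p] for u, nbrs in graph.items()}


def reach(live, sources):
    """Nodes reachable from `sources` in a live-edge graph (BFS)."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in live.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def greedy_seeds(graph, k, p, samples, rng):
    """Ranked list of k seeds chosen greedily on one predicted network."""
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    seeds = []
    for _ in range(k):
        gain = defaultdict(int)
        for _ in range(samples):
            live = live_edge_sample(graph, p, rng)
            covered = reach(live, seeds)
            for c in nodes - set(seeds):
                # Marginal gain of c: nodes it reaches that the seeds do not.
                gain[c] += len(reach(live, [c]) - covered)
        # Sorting first makes tie-breaking deterministic (smallest node wins).
        best = max(sorted(nodes - set(seeds)), key=lambda c: gain[c])
        seeds.append(best)
    return seeds


def aggregate(seed_lists, k):
    """Assumed Borda-style aggregation: earlier ranks earn higher scores."""
    score = defaultdict(int)
    for seeds in seed_lists:
        for rank, node in enumerate(seeds):
            score[node] += len(seeds) - rank
    return sorted(score, key=lambda n: (-score[n], n))[:k]


def genie_sketch(predicted_graphs, k, p, samples, seed=0):
    """Run the greedy selection on every predicted network, then aggregate."""
    rng = random.Random(seed)
    seed_lists = [greedy_seeds(g, k, p, samples, rng) for g in predicted_graphs]
    return aggregate(seed_lists, k)
```

For example, `genie_sketch([g1, g2, g3], k=1, p=0.1, samples=3)` mirrors the setting of one seed, three live-edge samples, and three predicted networks, where `g1`, `g2`, and `g3` are placeholders for predicted instances.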
The time complexity of the proteus-genie algorithm (Algorithm 1) is .
Let be the subgraph that joined during to and be the subgraph generated by ffm from to . If is identical to with probability , then each corresponding to generated by the proteus-genie algorithm (Algorithm 1) is guaranteed to achieve -approximation for proteus-im with probability .
Due to the newly joined edges and nodes, for each , its expected influence at , namely , can be separated into two parts: one from the original graph (i.e., ), and the other from the marginal increase in the expected influence caused by , denoted by . Let be the latter part; then the expected influence of in the predicted graph can be computed as: . As is identical to with probability , then with probability . Therefore, with probability .
Moreover, proteus-im degenerates to the classical hill-climbing algorithm if the graph is static, and that algorithm is guaranteed to be within of the optimal; hence the seeds selected by proteus-im in are within of the optimal. Putting these together, the seeds selected by proteus-im can achieve -approximation with probability .
Remark. Typically, network evolution may be slower than influence propagation. However, our framework does not demand any correlation between the time steps of the ffm and the influence propagation time. As long as the network is evolving and influence propagation takes time, our proposed model and algorithm fit well. In fact, when ffm is extremely slow (i.e., the network hardly evolves) and influence propagation is extremely fast (i.e., is negligible), the proteus-im problem is close to the classical im problem. In particular, in the unrealistic case when ffm is very slow (e.g., only one node is added at a time and we have enough time to grasp the topology of the network at any temporal state) and is negligible, MaxG  works well. In contrast, whenever influence propagation takes time, our solution fits well.
V A Reverse Reachable Set-based Solution
Observe that the time complexity of proteus-genie is highly influenced by and (Theorem 2). These values are large for real-world networks containing millions of nodes and hence the efficiency of the greedy algorithm can be adversely affected when dealing with such networks. In this section, we address this issue by proposing an algorithm called proteus-seer (Propagation Time-conscious SEed SElection using RR set), which leverages the notion of Reverse Reachable (rr) Set  in addition to ffm for seed selection. For the sake of completeness, we first briefly introduce the concept of rr set before discussing our algorithm to address the proteus-im problem.
V-A Reverse Reachable Set
Let be a node in and be a graph obtained by removing each edge in with probability . The reverse reachable (rr) set  for in , denoted as , is the set of nodes in that can reach . That is, for each node in the rr set, there is a directed path from to in . For example, consider Fig. 1(a). The rr set for node in contains all nodes in that can reach . That is, .
Let be the distribution of induced by the randomness in edge removals from . A random rr set  is an rr set generated on an instance of randomly sampled from for a node selected uniformly at random from .
Note that the notion of rr set is currently the most efficient and promising way to solve the influence maximization problem with guaranteed result quality, and has recently been deployed in [17, 26] to generate “near-optimal” solutions for the im problem. However, these techniques either assume the network is static or ignore the influence propagation time. More importantly, they cannot be trivially extended to handle the proteus-im problem.
V-B The proteus-seer Algorithm
The key idea of the algorithm is to compute the rr set by considering the evolution of the network due to random prediction using ffm. Since the target network is randomly predicted several times, we utilize a bag of nodes to assemble all instances of the rr sets computed from different random instances of the predicted network. Specifically, our algorithm comprises the following key steps.
First, we use the ffm to simulate the evolution of , from which we get . This process is iteratively repeated for times, such that we can get different instances of , denoted by . In particular, is computed as in .
Second, for each instance , uniformly sample a node from as , and generate an rr set for it, denoted as . Consequently, we have such sets, each corresponding to a sampled node. In the sequel, we denote these sets as
Finally, we greedily select from all rr sets in the node that appears in the largest number of rr sets and then remove these sets from . We iteratively select such nodes and then output them as the final seed set .
Algorithm 2 outlines the formal procedure. Similar to the proteus-genie algorithm, it simulates the evolution of based on ffm to generate an instance of target network (Line 4). Based on , it uniformly samples a node and generates a random rr set of this node with respect to , resulting in (Lines 5-9). The generation of each rr set is implemented as a randomized breadth-first search on . Given a node , it first creates an empty queue and then flips a coin for each incoming edge of . With probability , it retrieves the node from which starts and inserts it into the queue. Subsequently, the algorithm iteratively extracts the node at the front of the queue and examines each incoming edge of . If starts from an unvisited node , it adds into the queue with probability . This iterative process terminates when the queue becomes empty. Finally, the algorithm collects all nodes visited during this process (including ) and uses them to form .
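The randomized breadth-first search described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the graph is assumed to be given as a dictionary `in_neighbors` mapping each node to the sources of its incoming edges, and `p` is the probability that an edge is followed. The function name and argument names are hypothetical.

```python
import random
from collections import deque


def random_rr_set(in_neighbors, root, p, rng):
    """Nodes that can reach `root` when each edge is kept with probability p."""
    visited = {root}
    queue = deque([root])
    while queue:
        v = queue.popleft()  # extract the node at the front of the queue
        for u in in_neighbors.get(v, ()):
            # One coin flip per incoming edge (u, v); already visited sources
            # are skipped because the outcome could not change the result.
            if u not in visited and rng.random() < p:
                visited.add(u)
                queue.append(u)
    return visited
```

A random rr set is then obtained by drawing `root` uniformly at random from the nodes of a predicted network, e.g. `random_rr_set(in_neighbors, rng.choice(node_list), p, rng)` with `rng = random.Random()` and `node_list` a list of that network's nodes.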
The aforementioned steps are repeated (in parallel) for times resulting in (Line 10). Let be the set of nodes that appear in any of these (Line 11). For all the nodes in , the algorithm greedily selects the one, say , which appears in the largest number of rr sets in (Lines 13-16), indicating that can reach the maximal number of nodes in . Then it removes from the rr sets where appears (Lines 17-18). In particular, we denote if , and otherwise. This seed selection process is iteratively performed for rounds to identify the final set of seed nodes (Lines 12-18).
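The greedy selection over the collected rr sets amounts to a maximum-coverage heuristic, which can be sketched as follows. The sketch assumes the rr sets are already available as Python sets; the function name and the tie-breaking rule (smallest node identifier wins) are illustrative assumptions, not part of Algorithm 2.

```python
from collections import Counter


def select_seeds(rr_sets, k):
    """Greedily pick up to k nodes, each covering the most remaining rr sets."""
    remaining = [set(s) for s in rr_sets]
    seeds = []
    for _ in range(k):
        # Count, for every node, how many uncovered rr sets contain it.
        counts = Counter(node for s in remaining for node in s)
        if not counts:
            break  # every rr set is already covered
        best = min(counts, key=lambda n: (-counts[n], n))
        seeds.append(best)
        # Drop the rr sets that the chosen node covers.
        remaining = [s for s in remaining if best not in s]
    return seeds
```

Recomputing the counts in every round keeps the sketch short; an implementation aimed at large networks would more likely maintain an inverted index from nodes to rr sets and update the counts incrementally.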
Table II: Summary of datasets (columns: Network, nodes, edges, Degree of Change (DoC)).
The time complexity of the proteus-seer Algorithm is .
As reported in , the time complexity of computing is where is a quality factor that controls the result quality. The time complexity of evolution simulation based on ffm is . Generating an rr set requires a bfs, which is . As has been computed and fixed, generating all different rr sets requires . The time complexity of seed selection is . Moreover, is a predefined quality parameter (always set as ). Therefore, the overall time complexity of Algorithm 2 is .
In this section, we investigate the performance of proteus-genie and proteus-seer. All algorithms considered for our investigation are implemented in C++. We ran all experiments on 3.2GHz Quad-Core Intel i7 machines with 16 GB RAM, running Windows 7. Note that there is no existing im algorithm that addresses the proteus-im problem. Hence, we are confined to using state-of-the-art algorithms designed for the classical im problem as baseline methods.
Specifically, we investigate the following key issues. (1) Does the seed set selected at target time differ significantly from that selected at current time ? (2) Do our proposed algorithms designed for the proteus-im problem consistently produce superior quality seeds compared to state-of-the-art algorithms designed for the classical im problem? (3) Is the running time of the proteus-seer algorithm reasonable for large networks without significantly compromising the quality of influence spread? (4) What is the impact of various parameters (e.g., , , ) on the performance of the proposed algorithms?
VI-A Experimental Setup
Datasets. Recall from Section I that influence propagation may take several weeks to months and different networks may evolve at varying rates during this period. Hence, we choose real-world and synthetic datasets for our experiments to represent these varying degrees of change (DoC). Table II summarizes these datasets. We use two real-world datasets to generate different snapshots of a network representing different degrees of evolution. The first one is a high-energy physics (Hep) paper citation network collected through Arxiv (http://arxiv.com) during the period from January 1993 to April 2003 (124 months). It contains the historical logs for the appearance timestamp of each paper as well as its citation links (downloaded from http://www.cs.cornell.edu/projects/kddcup/datasets.html). Since each node is associated with a timestamp indicating when it joined the network, we can construct different instances of the social network at different time points. The networks Ph- and Ph- represent two temporal states of the citation graph. The second dataset, Patents (downloaded from http://www.nber.org/patents/), comprises information on almost 3 million U.S. patents granted between January 1963 and December 1999 and all citations made to these patents between 1975 and 1999 (over 16 million). A specific temporal state is extracted by selecting all citations (edges) that appear before a specific timestamp. In particular, Pa-, Pa-, Pa- and Pa- are selected as four representative temporal states of this citation network. Note that we can extract any temporal states (e.g., weekly, monthly, or yearly) from these two networks. We also generate synthetic datasets using the Forest Fire model (following the steps described in http://snap.stanford.edu/snap-1.8/download.html, with default and ) in order to simulate snapshots of a network with a small degree of change.
Specifically, we generate three temporal snapshots with slightly varying number of nodes and edges, denoted by Syn-, Syn-, and Syn-.
The last column in Table II specifies the degree of change in the network w.r.t. the number of nodes and edges compared to the preceding snapshot. In summary, the synthetic datasets represent networks with a small degree of evolution. The real-world datasets, on the other hand, represent networks with a moderate (Hep) or high (Patents) degree of change. It is worth emphasizing that different real-world networks may have different degrees of evolution during influence propagation. Hence, the seed set selection is impacted by the evolution characteristics of the underlying network as well as .
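The snapshot extraction described above (keeping every citation edge that appears before a given timestamp) can be sketched as follows. The input format, a list of `(source, target, timestamp)` triples, is an assumption for illustration. The helper `relative_growth` shows only one plausible way to quantify change between consecutive snapshots, as relative growth in node and edge counts; it should not be read as the exact DoC definition used in Table II.

```python
def snapshot(timed_edges, cutoff):
    """Edges (u, v) whose timestamp is strictly before `cutoff`."""
    return [(u, v) for (u, v, t) in timed_edges if t < cutoff]


def node_set(edges):
    """All endpoints that occur in a list of (u, v) edges."""
    return {x for edge in edges for x in edge}


def relative_growth(prev_edges, next_edges):
    """Assumed change measure: relative growth of node and edge counts.

    Both snapshots are expected to be non-empty.
    """
    prev_n, next_n = len(node_set(prev_edges)), len(node_set(next_edges))
    node_growth = (next_n - prev_n) / prev_n
    edge_growth = (len(next_edges) - len(prev_edges)) / len(prev_edges)
    return node_growth, edge_growth
```

Weekly, monthly, or yearly temporal states then differ only in the sequence of `cutoff` values passed to `snapshot`.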
Forest Fire Model (FFM) parameters. As discussed in Section IV-B, the forward burning probability and backward burning ratio are selected by fitting the model using the network evolution historical logs before . That is, we select the states of networks before Ph- and Pa-, and fit ffm accordingly. Specifically, we set for Ph-; for Pa-.
Algorithms. We run the following im algorithms under ic model (with ) for our experimental study:
“Greedy”: MixGreedyIC algorithm  to address the classical im problem, as it exhibits the best seed quality.
“irie”: irie algorithm proposed in .
“imm”: imm algorithm proposed in .
“MaxG”: MaxG algorithm (with ) , which is a dynamic im algorithm that requires full knowledge of network evolution. Specifically, it assumes that (a) the complete evolution logs of the network are known; (b) each time a new node arrives, there is sufficient time to update the seeds; and (c) the influence propagation time is negligible.
“pro-genie”: The greedy algorithm in Section IV for the proteus-im problem.
“pro-seer”: The rr set-based method proposed in Section V to address the proteus-im problem.
Unless specified otherwise, we set and for pro-genie, pro-seer, and Greedy, respectively.
VI-B Experimental Results
Seeds at current and target times. In Section I, we remarked that the seeds selected at current time can be significantly different from the seeds selected at target time due to the evolution of the underlying network. Hence, we first investigate whether this is indeed true, that is, whether the seeds selected by a state-of-the-art im algorithm at differ significantly from those selected using the same algorithm at time . To this end, we take a pair of current and target networks (, ) at time points and . We plot the ranked seed nodes in on the X-axis by running a classical im algorithm on it, which can be considered as the ground-truth seed set. For clarity, we only consider the top-10 most influential seed nodes (ranked by their expected influence) in that also exist in . Specifically, in our experiments these seeds are selected by running Greedy on . Then, we plot on the Y-axis the corresponding ranks of these seeds in by running the Greedy and proteus-genie algorithms on . Hence, in our plot, if a seed occupies the coordinate then it means that is ranked -th at time (i.e., it exhibits the -th maximal marginal expected influence in ) and ranked -th in at time . Consequently, the larger the deviation from , the worse is the quality of the selected seeds, as the seeds selected at differ significantly from the seeds needed to maximize influence at (recall that our goal is to identify seeds at that maximize influence spread at ).
Note that for networks with a high degree of change, it is intuitive to expect the seeds to be different in and . Hence, we use the synthetic datasets for this experiment as they exhibit a low degree of change (which can be considered the “worst” case scenario). Fig. 4 plots the ranks of the top-10 seeds at times and for different pairs of networks. For instance, in Fig. 4(a), and are network snapshots Syn- and Syn-, respectively. We have the following observations. First, the ranks of seeds selected by Greedy using deviate significantly from those on for all datasets. That is, the seeds selected by classical im algorithms at times and differ significantly. Consequently, seeds selected at may not be suitable for maximizing the influence at (further validated below). Second, the ranks of the seeds selected by proteus-genie at time are relatively closer to the top-10 seeds in for all datasets, emphasizing the need for reformulating the classical im problem as the proteus-im problem.
Effect of degree of change and . The above experiments demonstrate that the seeds at current and target times are different. We now study how the degree of change to the network impacts the influence and the seed set. Observe that the degree of change is correlated with . Intuitively, the longer the influence propagation time (), the greater the degree of change to the network. First, we investigate the influence spread quality by varying the duration between and (i.e., influence propagation time ). Since influence propagation may take weeks, we use the real-world networks to report the effect of target time by selecting several states of at different target times such that ranges from 0 to 15 weeks. The results are shown in Figures 5(a) and (b), where are Pa- and Pa-, respectively. Note that the procedure to extract a at a specific week is the same as the one to extract different snapshots in the Patents network (Section VI-A). For instance, in Fig. 5(a) we fix as Pa- and acquire each temporal state of the network at weeks after Pa- by selecting all citations (edges) that appear before the corresponding week. We compare the influence spread of the seeds selected from by different algorithms to that of the seeds selected from different states of using Greedy (run directly on ). Here denotes the seeds selected by different algorithms running on while denotes the ideal solution obtained by running Greedy in . Therefore, the Y-axis shows the ratio that compares the expected influence for different algorithms to that of running Greedy in . When (i.e., ), the problem degenerates to classical im. Consequently, seeds of all algorithms share almost the same quality. However, as increases, the quality of the seed sets selected by different algorithms at decreases. Clearly, in comparison to classical im techniques, our proteus-seer exhibits the highest influence spread quality for .
Next, we vary the network change rate by selecting several states of such that . Note that this simulates different degrees of change to the topology of the network at . We compare the influence spread of the seeds selected from by different algorithms to that of the seeds selected from different states of using Greedy (run directly over ). The results are shown in Figures 5(c) and (d). When (i.e., ), the problem degenerates to classical im. As the difference between and increases, the quality of the seed sets selected by different algorithms at decreases. Clearly, our proposed techniques exhibit the highest influence spread quality.
Influence spread for different . Next, we simulate the influence spread of selected seeds for networks with varying and investigate whether state-of-the-art im algorithms exhibit similar or different influence spread quality compared to our proposed algorithms for the proteus-im problem. Fig. 6 plots the influence spreads (with influence probability ) for different for networks exhibiting different DoC. In each of the figures, we select top- seeds in using Greedy, irie, imm, MaxG, proteus-genie, and proteus-seer and simulate the influence spread process in . The influence spread is measured by the number of eventually influenced nodes, averaged over 10,000 simulations. We compare the influence spread results with that of the seeds selected using Greedy in , which can be viewed as the ground-truth seed set. Note that the closer the influence spread (computed by a specific technique) is to that of this ground-truth seed set, the better is its influence spread quality. Specifically, are chosen to represent different degrees of change (DoC).
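The measurement described above, i.e., the number of eventually influenced nodes averaged over repeated simulations under the ic model, can be sketched with a simple Monte Carlo estimator. The graph representation (a dictionary from a node to its out-neighbors), the uniform influence probability `p`, and the function names are assumptions made for illustration.

```python
import random


def simulate_ic(graph, seeds, p, rng):
    """One independent cascade run; returns the number of activated nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        newly_active = []
        for u in frontier:
            for v in graph.get(u, ()):
                # A newly active node u gets a single chance to activate v.
                if v not in active and rng.random() < p:
                    active.add(v)
                    newly_active.append(v)
        frontier = newly_active
    return len(active)


def expected_spread(graph, seeds, p, runs=10_000, seed=0):
    """Average number of influenced nodes over `runs` simulations."""
    rng = random.Random(seed)
    total = sum(simulate_ic(graph, seeds, p, rng) for _ in range(runs))
    return total / runs
```

The quality ratio reported in the experiments would then correspond to `expected_spread(target_graph, candidate_seeds, p) / expected_spread(target_graph, ground_truth_seeds, p)`, where all three argument names are placeholders.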
Observe that proteus-genie and MaxG achieve the best influence spread quality, followed by our heuristic approach proteus-seer. Notably, MaxG iteratively updates the selected seeds whenever a new node arrives in the network. Interestingly, despite the impractical assumptions made by MaxG as mentioned in Section VI-A, it does not provide a distinguishable performance benefit compared to our algorithms. In other words, proteus-genie and proteus-seer demonstrate comparable seed set quality without assuming knowledge of the complete topology of the target network (unlike MaxG). In summary, the influence spread of our proposed approaches is within 83%-99% of the ideal solution. In contrast, the state-of-the-art classical im approaches only achieve 65%-89% of the ideal influence spread.
Running times. Fig. 7 reports the running times of different algorithms for different DoC. Specifically, we run Greedy, irie, imm, proteus-genie, and proteus-seer on for the three different datasets. (MaxG keeps running during the evolution of a network, in contrast to all other competitors; hence, for a fair comparison, its running time is not included.) Observe that although proteus-genie produces the most accurate results, it also consumes the longest time as it requires iterations of network evolution simulations. On the other hand, proteus-seer is significantly faster than proteus-genie as the former avoids the huge number of iterations caused by and . Our proteus-seer finishes within an hour on the largest network while providing near-optimal influence spread quality. Therefore, proteus-seer is suitable for time-sensitive tasks and gives a good balance between influence spread quality and running time. Note that although our techniques are slower than irie and imm, as reported earlier, these approaches have poorer influence spread quality. It is important to reemphasize that the seed set quality is paramount to companies as they would like to maximize the influence spread of their products.