One of the key problems in viral marketing in Online Social Networks (OSNs) is Influence Maximization (IM), which aims at selecting a set of users, called a seed set, in a social network so that the expected number of influenced nodes is maximized. Kempe et al. [5] first introduced the IM problem under two classical information diffusion models, namely Independent Cascade (IC) and Linear Threshold (LT), as a combinatorial optimization problem and designed a $(1-1/e)$-approximation algorithm. Due to its immense application potential, a vast amount of work has focused on IM in many respects: designing effective algorithms [1, 13, 22, 21], studying variants with marketing applications [20, 24, 3, 4, 16, 25], and applying it to rumor/misinformation blocking [26, 15, 18].
Recently, Borgs et al. [1] made a theoretical breakthrough by proposing the reverse influence sketch (RIS) algorithm, which is the foundation for later efficient algorithms. This algorithm captures the influences in a reverse manner and guarantees a $(1-1/e-\epsilon)$-approximation solution with probability at least $1-n^{-1}$, where $n$ is the number of nodes in the network. In subsequent work, Tang et al. [22] first presented a $(1-1/e-\epsilon)$-approximation algorithm that is scalable to billion-size networks, with the running time reduced to $O((k+\ell)(m+n)\ln n/\epsilon^{2})$. Tang et al. [21] later proposed the IMM algorithm, which further reduced the number of samples of the RIS process by using martingale analysis. Nguyen et al. [13, 11] proposed algorithms that further reduce the running time by up to orders of magnitude by modifying the original RIS framework.
In a more realistic scenario that takes into account both an arbitrary cost for selecting a node and an arbitrary benefit for influencing a node, Nguyen et al. [10] studied the Cost-aware Targeted Viral Marketing (CTVM) problem, a generalization of IM, which aims to select a seed set within a limited budget so that the expected total benefit over the influenced nodes (benefit function) is maximized. In that work, the benefit function is estimated through benefit samples produced by a sampling algorithm that generalizes RIS, and the authors proposed the BCT algorithm, a $(1-1/\sqrt{e}-\epsilon)$-approximation algorithm with high probability, together with a bound on the number of required samples under the IC model. In another direction, Li et al. [9] solved CTVM with an almost exact algorithm, TipTop, which can return a solution within a ratio of $(1-\epsilon)$ with high probability. The number of samples TipTop needs is bounded, but there is no bound on its time complexity, as this is an exact approach rather than an approximation algorithm. However, the authors have shown that TipTop can run on the Twitter dataset within four hours.
In this paper, we tackle CTVM via an approximation approach (not exact), with the goal of obtaining the same approximation ratio $(1-1/\sqrt{e}-\epsilon)$ as in [10] while significantly reducing the worst-case number of required samples. Our algorithm, namely Importance sample-based Viral Marketing (IVM), contains two innovative techniques: 1) We note that importance samples (in the space of all benefit samples) can be used to estimate the benefit function. This leads to a general result of using importance sketches to estimate the influence spread function for IM. 2) Based on that, we design a new strategy to check the approximation guarantee condition of candidate solutions: we develop lower- and upper-bound functions that check the approximation guarantee condition and provide adequate statistical evidence on the solution quality for termination. Our algorithm takes fewer total samples than BCT, the state-of-the-art method with the same approximation guarantee, in both theoretical analysis and practice. In summary, the contributions of this paper are as follows.
We first present the Importance Benefit Sample (IBS) and the Importance Benefit Sampling Algorithm (IBA), an algorithm to generate IBSs. We then show that the benefit function can be estimated through IBSs (Lemma 3).
We propose IVM, an efficient approximation algorithm which returns a $(1-1/\sqrt{e}-\epsilon)$-approximate solution with high probability and has a worst-case bound on the number of required samples under the IC model.
We conduct experiments on various real social networks. The experiments suggest that IVM is better than BCT, the current best method for CTVM with the same approximation guarantee, in terms of running time, number of required samples, and memory usage. It achieves up to a 153-times speed-up and requires up to 112 times fewer total samples than BCT.
Organization. The rest of the paper is organized as follows. In Section 2, we present the model and the problem definition. Section 3 presents an analysis of generating IBSs to estimate the benefit function. Our algorithm IVM, along with its theoretical analysis, is introduced in Section 4. Experimental results are shown in Section 5. Finally, Section 6 concludes the paper.
2 Model and Problem Definitions
In this section, we present the well-known Independent Cascade (IC) model and the CTVM problem. The frequently used notations are summarized in Table 1.
|# nodes and # of edges in .||seed set|
|The benefit of||An estimation of|
|An optimal solution||.|
In this model, a social network can be abstracted as a directed graph $G=(V,E)$ with a node set $V$ and a directed edge set $E$, where $|V|=n$ and $|E|=m$. Let $N_{in}(v)$ and $N_{out}(v)$ be the sets of in-neighbors and out-neighbors of $v$, respectively. Each edge $e=(u,v)\in E$ has a probability $p(u,v)\in(0,1)$ that represents the information transmission from $u$ to $v$. The diffusion process from a seed set $S$ to the rest of the network happens round by round as follows. At step 0, all nodes in $S$ are activated while the rest of the nodes are inactive. At step $t\ge 1$, a node $u$ activated in the previous step has a single chance to activate each currently inactive out-neighbor $v$ with success probability $p(u,v)$. Once a node becomes activated, it remains in that state in all subsequent steps. The influence propagation stops when no more nodes can be activated.
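The round-by-round process described above can be simulated directly. The following is a minimal sketch under the assumption that the graph is stored as adjacency lists with a dictionary of edge probabilities; the function names and the toy graph are hypothetical and not part of the paper.

```python
import random

def simulate_ic(out_neighbors, prob, seeds, rng):
    """Run one Independent Cascade diffusion and return the activated nodes."""
    active = set(seeds)
    frontier = list(active)              # nodes activated in the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in out_neighbors.get(u, []):
                # u has a single chance to activate each inactive out-neighbor
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier         # stops when no new node was activated
    return active

def estimate_spread(out_neighbors, prob, seeds, runs=1000, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        total += len(simulate_ic(out_neighbors, prob, seeds, rng))
    return total / runs

# Toy graph with extreme probabilities so the outcome is easy to check by hand.
toy_out = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
toy_prob = {("a", "b"): 1.0, ("a", "c"): 0.0, ("b", "d"): 1.0, ("c", "d"): 0.5}
toy_active = simulate_ic(toy_out, toy_prob, ["a"], random.Random(0))
```

On the toy graph, node `c` can never be activated from `a`, so every run activates the same three nodes. With realistic probabilities the estimate varies with `runs`, which is the cost that reverse sampling methods aim to avoid.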
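Kempe et al. [5] showed that the IC model is equivalent to reachability in a random graph $g$, called a live-edge or sample graph. We generate a sample graph $g$ with node set $V_g$ and edge set $E_g$ by: (1) setting $V_g \leftarrow V$, and (2) selecting each edge $e=(u,v)\in E$ into $E_g$ with probability $p(e)=p(u,v)$. The probability of generating $g$ from $G$ is
$$\Pr[g\sim G]=\prod_{e\in E_g}p(e)\cdot\prod_{e\in E\setminus E_g}(1-p(e)),$$
and the influence spread of $S$ is calculated by
$$\mathbb{I}(S)=\sum_{g\sim G}\Pr[g\sim G]\,|R(g,S)|,$$
where $R(g,S)$ denotes the set of nodes reachable from $S$ in $g$. In CTVM, each node $u\in V$ has a cost $c(u)$ if it is selected into the seed set and a benefit $b(u)$ if it is activated. The total benefit over all influenced nodes (benefit function) of seed set $S$ is defined as follows:
$$\mathbb{B}(S)=\sum_{g\sim G}\Pr[g\sim G]\sum_{u\in R(g,S)}b(u).$$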
The CTVM problem is formally defined as follows.
Definition 1 (CTVM)
Given a social network $G=(V,E)$ with a node set $V$ and a directed edge set $E$ under the IC model. Each node $u\in V$ has a selecting cost $c(u)$ and a benefit $b(u)$ if $u$ is active. Given a budget $B$, find a seed set $S\subseteq V$ with total cost $c(S)=\sum_{u\in S}c(u)\le B$ that maximizes $\mathbb{B}(S)$.
3 Importance Benefit Sampling
In this section, we first recap the Benefit Sampling Algorithm (BSA) to estimate the benefit function [10, 14]. We then introduce our novel Importance Benefit Sample (IBS) concept along with the algorithm to generate these samples.
Benefit Sampling Algorithm (BSA) [10] generates a benefit sample $R_j$ according to the following steps: (1) picking a node $u$ as a source node with probability $b(u)/\Gamma$, where $\Gamma=\sum_{v\in V}b(v)$, (2) generating a sample graph $g$ from $G$, and (3) returning $R_j$ as the set of nodes that can reach $u$ in $g$. Denote by $\mathcal{R}$ a collection of benefit samples generated by BSA and define the random variable $X_j(S)=\min\{|R_j\cap S|,1\}$. Nguyen et al. proved the following lemma to estimate the benefit function:
For any set of nodes $S\subseteq V$, we have:
$$\mathbb{B}(S)=\Gamma\cdot\mathbb{E}[X_j(S)].$$
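The three BSA steps and the resulting estimator can be sketched in a few lines. This is an illustration only, assuming the source node is drawn with probability proportional to its benefit and that the estimate is the total benefit times the fraction of samples hit by the seed set; the function names and toy data are hypothetical.

```python
import random

def benefit_sample(in_neighbors, prob, benefit, rng):
    """Draw one benefit sample: nodes that reach a benefit-weighted random source."""
    nodes = list(benefit)
    weights = [benefit[u] for u in nodes]
    source = rng.choices(nodes, weights=weights, k=1)[0]
    reached = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        # Edges are flipped lazily, which is equivalent to drawing the whole
        # sample graph first and then walking it backwards from the source.
        for v in in_neighbors.get(u, []):
            if v not in reached and rng.random() < prob[(v, u)]:
                reached.add(v)
                stack.append(v)
    return reached

def estimate_benefit(samples, seeds, total_benefit):
    """Total benefit times the fraction of samples that intersect the seed set."""
    seeds = set(seeds)
    hits = sum(1 for sample in samples if sample & seeds)
    return total_benefit * hits / len(samples)

# Toy chain a -> b -> c with certain edges; only c carries benefit.
bsa_in = {"b": ["a"], "c": ["b"]}
bsa_prob = {("a", "b"): 1.0, ("b", "c"): 1.0}
bsa_benefit = {"a": 0.0, "b": 0.0, "c": 2.0}
bsa_rng = random.Random(0)
bsa_samples = [benefit_sample(bsa_in, bsa_prob, bsa_benefit, bsa_rng) for _ in range(20)]
```

In the toy chain every sample is rooted at `c` and contains all three nodes, so seeding `a` is estimated to earn the full benefit of 2.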
Let be a set of all benefit samples and be a benefit sample with source node , the probability of generating is
We now describe the IBS and IBA, the algorithm that generates IBSs. The main idea of this method is based on the observation that benefit samples containing only one node contribute insignificantly to the benefit function. For a source node $u$, let $\Omega_u$ be the set of all benefit samples that have source node $u$. We divide $\Omega_u$ into two components: singular benefit samples, which contain only node $u$, and importance benefit samples, which contain at least two nodes. Let $N_{in}(u)=\{v_1,v_2,\ldots,v_l\}$ and denote by $E_0$ the event that none of the nodes in $N_{in}(u)$ is selected; we have $\Pr[E_0]=\prod_{v\in N_{in}(u)}(1-p(v,u))$. The probability of generating an IBS with source node $u$ is equal to $\gamma(u)=1-\Pr[E_0]$. Denote by $E_i$ the event that $v_i$ is the first selected node; we have:
$$\Pr[E_i]=p(v_i,u)\prod_{j=1}^{i-1}(1-p(v_j,u)).$$
Events $E_0,E_1,\ldots,E_l$ are disjoint and $\sum_{i=0}^{l}\Pr[E_i]=1$. The probability of generating an IBS that has source node $u$ with $v_i$ as the first selected node is
$$\Pr[E_i\mid\overline{E_0}]=\frac{\Pr[E_i]}{\gamma(u)}.$$
Denote as the probability spaces of all s, we have:
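The probability that $u$ is a source node of an IBS in the space of all benefit samples is $\frac{b(u)}{\Gamma}\gamma(u)$, where $\Gamma=\sum_{v\in V}b(v)$ and $\gamma(u)=1-\prod_{v\in N_{in}(u)}(1-p(v,u))$. By normalizing to fulfill a probability distribution over the sample space of IBSs, the probability that $u$ is a source node of an IBS is calculated as follows:
$$\Pr[\mathrm{src}=u]=\frac{b(u)\gamma(u)}{\Phi},\qquad \Phi=\sum_{v\in V}b(v)\gamma(v).$$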
For any , we have
Based on the above analysis, we propose IBA, an algorithm to generate an IBS, which is depicted in Algorithm 1. The algorithm first selects a source node $u$ with probability according to eq. (7) (line 1). It next picks the first incoming node to $u$ (line 2). The rest of the algorithm is similar to the Importance Influence Sampling Algorithm [12].
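The sampling steps just described can be sketched as follows. Since eq. (7) and Algorithm 1 are not reproduced here, the sketch rests on stated assumptions: the source is drawn with probability proportional to its benefit times the chance that at least one of its in-edges is live, and the first live in-neighbor is drawn conditioned on that event. All function names and the toy data are hypothetical.

```python
import random

def gamma(u, in_neighbors, prob):
    """Probability that at least one in-edge of u is live (assumed weight factor)."""
    none_live = 1.0
    for v in in_neighbors.get(u, []):
        none_live *= 1.0 - prob[(v, u)]
    return 1.0 - none_live

def first_live_index(probs, rng):
    """Index of the first live in-edge, conditioned on at least one being live."""
    none_live = 1.0
    for p in probs:
        none_live *= 1.0 - p
    x = rng.random() * (1.0 - none_live)
    acc, none_before, last_positive = 0.0, 1.0, 0
    for i, p in enumerate(probs):
        if p > 0.0:
            last_positive = i
        acc += none_before * p           # chance that edge i is the first live one
        if x < acc:
            return i
        none_before *= 1.0 - p
    return last_positive                 # guard against floating-point rounding

def importance_benefit_sample(in_neighbors, prob, benefit, rng):
    """Draw one sample that always contains the source plus at least one more node."""
    nodes = list(benefit)
    weights = [benefit[u] * gamma(u, in_neighbors, prob) for u in nodes]
    source = rng.choices(nodes, weights=weights, k=1)[0]
    nbrs = in_neighbors[source]
    i = first_live_index([prob[(v, source)] for v in nbrs], rng)
    reached = {source, nbrs[i]}
    stack = [nbrs[i]]
    for v in nbrs[i + 1:]:               # later in-edges of the source are still undecided
        if v not in reached and rng.random() < prob[(v, source)]:
            reached.add(v)
            stack.append(v)
    while stack:                         # ordinary reverse sampling from here on
        u = stack.pop()
        for v in in_neighbors.get(u, []):
            if v not in reached and rng.random() < prob[(v, u)]:
                reached.add(v)
                stack.append(v)
    return reached

# Toy graph: c has two in-neighbors; d has benefit but no in-edges.
iba_in = {"c": ["a", "b"]}
iba_prob = {("a", "c"): 0.5, ("b", "c"): 0.5}
iba_benefit = {"a": 0.0, "b": 0.0, "c": 1.0, "d": 5.0}
iba_rng = random.Random(1)
iba_samples = [importance_benefit_sample(iba_in, iba_prob, iba_benefit, iba_rng) for _ in range(200)]
```

Node `d` can only ever produce a single-node sample, so under these assumptions it is never chosen as a source, and every generated sample has at least two nodes.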
For any is generated by , we define random variables , and
We have , with .
For any set of nodes , we have:
4 Importance Sample-based Viral Marketing Algorithm
We present Importance Sample-based Viral Marketing (IVM), a $(1-1/\sqrt{e}-\epsilon)$-approximation algorithm for CTVM. IVM includes two components: generating IBSs to estimate the benefit function, and a new strategy that finds a candidate solution and checks its approximation guarantee condition by developing lower- and upper-bound functions.
Algorithm description. Our algorithm is depicted in Algorithm 2. At first glance, IVM maintains one stream of samples, similar to BCT. However, it changes how the solution quality is checked and limits the maximum number of required samples. It first calculates the maximum number of IBSs (line 1) from a lower bound of the optimal value, which significantly reduces the number of required samples while still bounding the total number of samples. IVM then generates an initial set of IBSs (line 2). The main phase consists of a bounded number of iterations (lines 4-14). In each iteration, the algorithm maintains the current set of IBSs and finds a candidate solution by using the Improved Greedy Algorithm (IGA) for the Budgeted Maximum Coverage (BMC) problem [6]. IGA finds a solution for a BMC instance in which the current IBSs are the set of samples, $V$ is the universal set, and $B$ is the budget. This algorithm returns a $(1-1/\sqrt{e})$-approximation solution (due to space limitations, we put the details of these subroutines in the Appendix). The main algorithm then calculates a lower bound of the benefit of the candidate solution and an upper bound of the optimal value (line 7). We establish these bounds in Lemma 6 and Lemma 7, respectively. The algorithm then checks the approximation guarantee condition (line 8). If this condition is true, it returns the candidate as a solution and terminates. If not, it doubles the number of samples (line 12) and moves on to the next iteration.
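The candidate-selection step can be illustrated with a greedy heuristic for budgeted maximum coverage. The sketch below is one common variant (cost-effectiveness greedy compared against the best single affordable node); it is an assumption that this resembles IGA, whose exact definition is in the Appendix and not reproduced here. The function name and toy data are hypothetical, and node costs are assumed to be positive.

```python
def greedy_budgeted_cover(samples, cost, budget):
    """Pick nodes within budget that hit many samples; returns (nodes, samples_hit)."""
    covers = {}                                   # node -> indices of samples containing it
    for j, sample in enumerate(samples):
        for u in sample:
            if u in cost:
                covers.setdefault(u, set()).add(j)

    chosen, covered, spent = [], set(), 0.0
    candidates = set(covers)
    while candidates:
        best, best_ratio = None, 0.0
        for u in sorted(candidates):              # sorted only to make ties deterministic
            if spent + cost[u] > budget:
                continue
            ratio = len(covers[u] - covered) / cost[u]   # marginal gain per unit cost
            if ratio > best_ratio:
                best, best_ratio = u, ratio
        if best is None:                          # nothing affordable adds coverage
            break
        chosen.append(best)
        covered |= covers[best]
        spent += cost[best]
        candidates.discard(best)

    # Compare with the best single affordable node, which the ratio rule can miss.
    singles = [u for u in sorted(covers) if cost[u] <= budget]
    if singles:
        top = max(singles, key=lambda u: len(covers[u]))
        if len(covers[top]) > len(covered):
            return [top], len(covers[top])
    return chosen, len(covered)

cover_samples = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
cover_result = greedy_budgeted_cover(cover_samples, {"a": 1, "b": 1, "c": 1}, 2)

fallback_samples = [{"x"}] + [{"y"} for _ in range(5)]
fallback_result = greedy_budgeted_cover(fallback_samples, {"x": 1, "y": 10}, 10)
```

The second toy instance shows why the single-node comparison matters: the ratio rule spends part of the budget on the cheap node `x` and can then no longer afford `y`, which alone hits five of the six samples.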
If be a form of martingale, , for , and
where $\mathrm{Var}[\cdot]$ denotes the variance of a random variable. Then, for any $\lambda>0$, we have:
Applying martingale theory [2], we have the following lemma:
For any , is the mean of , and an estimation of is , we have:
Assume that $S$ is a feasible solution of CTVM. Since we do not know the size of $S$, we can observe that the number of possible solutions is less than , where .
For , . If the number of samples , returns a ()-approximation solution with probability at least .
We can apply Theorem 4.1 to obtain the following Corollary:
At iterator ,
Due to space constraints, we omit some proofs and present them in our full version [17]. Using Lemma 5, we give lower-bound and upper-bound functions in Lemma 6 and Lemma 7. They help the main algorithm check the approximation condition of the candidate solution based on statistical evidence.
Lemma 6 (Lower-bound)
For any , a set of s and is an estimation of over by (14). Let and
Lemma 7 (Upper-bound)
For any , a set of s , is a solution return by for input data , and is an estimation of over by (14). Let
Theorem 4.2 (Main Theorem)
IVM returns a $(1-1/\sqrt{e}-\epsilon)$-approximation solution with probability at least $1-\delta$.
We consider following events. , and . According to Lemmas 6, 7, and Corollary 1, we have: and . Apply the union bound the probability that none of events at least . Under this assumption, we will show that returns a solution satisfying . If the algorithm stops with condition , the solution satisfies approximation guarantee due to Corollary 1. Otherwise, if stops at some iterator . At this iterator, the condition in line 8 is satisfied, i.e, This completes the proof. ∎
From Corollary 1, needs at most samples. We have . Combine with , we have .
In this section, we briefly conduct experiments to compare the performance of our algorithm IVM to other algorithms for CTVM in four aspects: solution quality, running time, number of required samples, and memory usage.
5.1 Experimental Settings
Datasets. We select a diverse set of 4 datasets: Gnutella, Epinions, Amazon, and DBLP. A description of the datasets is provided in Table 2.
Algorithms compared. We compare the IVM algorithm against the BCT algorithm [10], the state-of-the-art algorithm for CTVM with the same approximation ratio, and two baseline algorithms: Random and Degree.
Parameter setting. We follow previous works on CTVM and IM [10, 14, 21] to set up parameters. The transmission probability $p(u,v)$ is randomly selected from $\{0.001, 0.01, 0.1\}$ according to the Trivalency model. The cost of a node is proportional to its out-degree. In all the experiments, we choose a random fraction of all the nodes to be the target set and assign them benefit 1, and we use default settings for $\epsilon$ and $\delta$. The budget varies from 1 to 1000.
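The edge-probability and target-set setup can be reproduced with a few lines. This is a sketch under the usual reading of the Trivalency model, in which each edge draws its probability uniformly from three fixed values; the target fraction is left as a parameter because its value is not stated here, and the function names are hypothetical.

```python
import random

TRIVALENCY = (0.1, 0.01, 0.001)

def assign_trivalency(edges, seed=0):
    """Give every edge a probability drawn uniformly from the three Trivalency values."""
    rng = random.Random(seed)
    return {edge: rng.choice(TRIVALENCY) for edge in edges}

def pick_targets(nodes, fraction, seed=0):
    """Mark a random fraction of nodes as targets with benefit 1; others get benefit 0."""
    rng = random.Random(seed)
    ordered = sorted(nodes)
    count = round(fraction * len(ordered))
    targets = set(rng.sample(ordered, count))
    return {u: (1 if u in targets else 0) for u in ordered}

setup_edges = [(0, 1), (1, 2), (2, 0)]
setup_prob = assign_trivalency(setup_edges)
setup_benefit = pick_targets(range(10), 0.3)
```

Fixing the seed keeps the generated instance identical across runs, which makes comparisons between algorithms on the same network repeatable.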
5.2 Experiment results
Fig. 1 shows the benefit value provided by the algorithms. IVM outperforms the other algorithms and gives the best result on the Amazon network, where it provides up to 5.4 times the benefit of BCT. For the Gnutella, Epinions, and DBLP networks, IVM gives results similar to BCT. This is because these two algorithms have the same approximation ratio for CTVM.
|Algorithm||Total samples ()||Memory usage (M)|
The running time of the algorithms is shown in Table 3. The running time of our algorithm is significantly lower than that of BCT on all networks. IVM is up to 6.4, 7.1, 153, and 4.8 times faster than BCT on the Gnutella, Epinions, Amazon, and DBLP networks, respectively.
Table 4 displays the memory usage and the number of required samples of IVM and BCT for a fixed budget. The number of samples generated by IVM is up to 112 times smaller than that of BCT. However, the memory usage of IVM is only 1.5 to 4.6 times smaller than that of BCT, because the memory for storing the graph is counted in the memory usage of each algorithm. These results also confirm our theoretical result in Section 4 that IVM requires far fewer samples.
In this paper, we propose IVM, an efficient approximation algorithm for CTVM, which has an approximation ratio of $(1-1/\sqrt{e}-\epsilon)$ and requires significantly fewer samples than the state-of-the-art BCT. Experiments show that IVM is up to 153 times faster and requires up to 112 times fewer total samples than the BCT algorithm. For future work, we plan to apply this importance sampling concept to the exact approach TipTop to evaluate the potential benefits of importance sampling in terms of running time and number of required samples.
This work is partially supported by NSF CNS-1443905, NSF EFRI 1441231, and NSF CNS-1814614 grants.
- [1] (2014) Maximizing social influence in nearly optimal time. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, pp. 946–957.
- [2] (2006) Concentration inequalities and martingale inequalities: a survey. Internet Mathematics 3 (1), pp. 79–127.
- [3] (2012) Cheap, easy, and massively effective viral marketing in social networks: truth or fiction?. In 23rd ACM Conference on Hypertext and Social Media, HT ’12, pp. 165–174.
- [4] (2014) On the approximability of positive influence dominating set in social networks. J. Comb. Optim. 27 (3), pp. 487–503.
- [5] (2003) Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM International Conference on Knowledge Discovery and Data Mining SIGKDD, pp. 137–146.
- [6] (1999) The budgeted maximum coverage problem. Inf. Process. Lett. 70 (1), pp. 39–45.
- [7] (2007) The dynamics of viral marketing. ACM TWEB 1 (1), pp. 5.
- [8] (2007) Graph evolution: densification and shrinking diameters. TKDD 1 (1), pp. 2.
- [9] (2017) Why approximate when you can get the exact? optimal targeted viral marketing at scale. In 2017 IEEE Conference on Computer Communications, INFOCOM 2017, pp. 1–9.
- [10] (2016) Cost-aware targeted viral marketing in billion-scale networks. In 35th Annual IEEE International Conference on Computer Communications, INFOCOM, pp. 1–9.
- [11] (2018) Revisiting of ’revisiting the stop-and-stare algorithms for influence maximization’. In 7th International Conference on Computational Data and Social Networks, CSoNet 2018, pp. 273–285.
- [12] (2017) Importance sketching of influence dynamics in billion-scale networks. In 2017 IEEE International Conference on Data Mining, ICDM 2017, pp. 337–346.
- [13] (2016) Stop-and-stare: optimal sampling algorithms for viral marketing in billion-scale networks. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD, pp. 695–710.
- [14] (2017) A billion-scale approximation algorithm for maximizing benefit in viral marketing. IEEE/ACM Trans. Netw. 25 (4), pp. 2419–2429.
- [15] (2013) Analysis of misinformation containment in online social networks. Computer Networks 57 (10), pp. 2133–2146.
- [16] (2019) Competitive influence maximization within time and budget constraints in online social networks: an algorithmic approach. Applied Sciences 9 (11).
- [17] Importance sample-based approximation algorithm for cost-aware targeted viral marketing.
- [18] (2018) Maximizing misinformation restriction within time and budget constraints. J. Comb. Optim. 35 (4), pp. 1202–1240.
- [19] (2003) Trust management for the semantic web. In The Semantic Web - ISWC 2003, Second International Semantic Web Conference, pp. 351–368.
- [20] (2012) Interest-matching information propagation in multiple online social networks. In 21st ACM International Conference on Information and Knowledge Management, CIKM’12, pp. 1824–1828.
- [21] (2015) Influence maximization in near-linear time: A martingale approach. In Proceedings of the 2015 ACM International Conference on Management of Data (SIGMOD), pp. 1539–1554.
- [22] (2014) Influence maximization: near-optimal time complexity meets practical efficiency. In International Conference on Management of Data, SIGMOD 2014, pp. 75–86.
- [23] (2012) Defining and evaluating network communities based on ground-truth. In 12th IEEE International Conference on Data Mining, ICDM 2012, pp. 745–754.
- [24] (2016) Least cost influence maximization across multiple social networks. IEEE/ACM Trans. Netw. 24 (2), pp. 929–939.
- [25] (2016) Profit maximization for multiple products in online social networks. In 35th Annual IEEE International Conference on Computer Communications, INFOCOM 2016, San Francisco, CA, USA, April 10-14, 2016, pp. 1–9.
- [26] (2015) Limiting the spread of misinformation while effectively raising awareness in social networks. In Computational Social Networks - 4th International Conference, CSoNet 2015, Beijing, China, August 4-6, 2015, Proceedings, pp. 35–47.
Proof of Lemmas and Theorems
Proof (Proof of Lemma 2)