Importance Sample-based Approximation Algorithm for Cost-aware Targeted Viral Marketing

10/03/2019 ∙ by Canh V. Pham, et al. ∙ University of Florida

Cost-aware Targeted Viral Marketing (CTVM), a generalization of Influence Maximization (IM), has received considerable attention recently due to its commercial value. Previous approximation algorithms for this problem require a large number of samples to ensure an approximation guarantee. In this paper, we propose an efficient approximation algorithm that uses fewer samples but provides the same theoretical guarantees by generating and using importance samples. Experiments on real social networks show that our proposed method outperforms the state-of-the-art algorithm with the same approximation ratio in terms of the number of required samples and running time.


1 Introduction

One of the key problems in viral marketing in Online Social Networks (OSNs) is Influence Maximization (IM), which aims at selecting a set of users, called a seed set, in a social network so that the expected number of influenced nodes is maximized. Kempe et al. [5] first formulated IM as a combinatorial optimization problem under two classical information diffusion models, namely Independent Cascade (IC) and Linear Threshold (LT), and designed a $(1-1/e)$-approximation algorithm. Due to its immense application potential, a vast amount of work has studied IM from many angles: designing effective algorithms [1, 13, 22, 21], studying variants with marketing applications [20, 24, 3, 4, 16, 25], and applying it to rumor/misinformation blocking [26, 15, 18].

Borgs et al. [1] made a theoretical breakthrough by proposing the reverse influence sketch (RIS) algorithm, which is the foundation for later efficient algorithms. This algorithm captures influence in a reverse manner and guarantees a $(1-1/e-\epsilon)$-approximation solution with high probability (the bound depends on $n$, the number of nodes in the network). In subsequent work, Tang et al. [22] presented the first $(1-1/e-\epsilon)$-approximation algorithm that scales to billion-size networks, with running time reduced to near-linear in the network size. Tang et al. [21] later proposed the IMM algorithm, which further reduced the number of samples of the RIS process by using martingale analysis. Nguyen et al. [13, 11] proposed algorithms that further reduce the running time by up to orders of magnitude by modifying the original RIS framework.

In a more realistic scenario that takes into account both an arbitrary cost for selecting a node and an arbitrary benefit for influencing a node, Nguyen et al. [10] studied the Cost-aware Targeted Viral Marketing (CTVM) problem, a generalization of IM, which aims to select a seed set within a limited budget so that the expected total benefit over the influenced nodes (the benefit function) is maximized. In that work, the benefit function is estimated through a benefit sampling algorithm, a generalized version of RIS, and the authors proposed the BCT algorithm, a $(1-1/\sqrt{e}-\epsilon)$-approximation algorithm with high probability, together with a bound on the number of required samples under the IC model. In another direction, Li et al. [9] solved CTVM with an almost exact approach, TipTop, which returns a solution within a ratio of $(1-\epsilon)$ with high probability. TipTop has an upper bound on the number of samples but no bound on the time complexity, since it is an exact approach rather than an approximation algorithm. However, the authors have shown that TipTop can run on the Twitter dataset within four hours [9].

In this paper, we tackle CTVM via an approximation approach (not exact) with the goal of obtaining the same approximation ratio $(1-1/\sqrt{e}-\epsilon)$ as in [10] while significantly reducing the number of samples required in the worst case. Our algorithm, namely Importance sample-based Viral Marketing (IVM), contains two innovative techniques: 1) We note that importance samples (in the space of all benefit samples) can be used to estimate the benefit function. This generalizes the result of using importance sketches to estimate the influence spread function for IM [12]. 2) Based on that, we design a new strategy to check the approximation guarantee condition of candidate solutions: we develop lower and upper bound functions that check the approximation guarantee condition and provide adequate statistical evidence on solution quality for termination. Our algorithm requires fewer total samples than BCT, the state-of-the-art method with the same approximation guarantee, both in theoretical analysis and in practice. In summary, the contributions of this paper are as follows.

  • We first present the Importance Benefit Sample (IBS) and the Importance Benefit Sampling Algorithm (IBA), an algorithm to generate IBSs. We then show that the benefit function can be estimated through IBSs (Lemma 3).

  • We propose IVM, an efficient approximation algorithm which returns a $(1-1/\sqrt{e}-\epsilon)$-approximation solution with high probability and has a bounded number of required samples in the worst case under the IC model.

  • We conduct experiments on various real social networks. The experiments suggest that IVM performs better than BCT, the current best method for CTVM with the same approximation guarantee, in terms of running time, number of required samples, and memory usage. It achieves up to a 153-fold speed-up and requires up to 112 times fewer samples than BCT.

Organization. The rest of the paper is organized as follows. In Section 2, we present the model and the problem definition. Section 3 presents an analysis of generating IBSs to estimate the benefit function. Our IVM algorithm, along with its theoretical analysis, is introduced in Section 4. Experimental results are shown in Section 5. Finally, Section 6 concludes the paper.

2 Model and Problem Definitions

In this section, we present the well-known Independent Cascade (IC) model and the CTVM problem. Frequently used notation is summarized in Table 1.

Symbol Notation Symbol Notation

 

# nodes and # of edges in . seed set
The benefit of An estimation of
An optimal solution .
Table 1: Table of symbols

In this model, a social network can be abstracted as a directed graph $G=(V,E)$ with a node set $V$ and a directed edge set $E$, where $|V|=n$ and $|E|=m$. Let $N_{in}(v)$ and $N_{out}(v)$ be the sets of in-neighbors and out-neighbors of $v$, respectively. Each edge $e=(u,v)\in E$ has a probability $p(u,v)\in[0,1]$ that represents the information transmission from $u$ to $v$. The diffusion process from a seed set $S$ to the rest of the network happens round by round as follows. At step 0, all nodes in $S$ are activated while the rest of the nodes are inactive. At step $t\ge 1$, a node $u$ activated in the previous step has a single chance to activate each currently inactive out-neighbour $v$ with success probability $p(u,v)$. Once a node becomes activated, it remains in that state in all subsequent steps. The influence propagation stops when no more nodes can be activated.
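The round-by-round process above can be sketched in a few lines of Python. This is only an illustration: the adjacency layout (`out_edges` mapping a node to `(neighbour, probability)` pairs) and the function name are assumptions made here, not part of the paper.

```python
import random
from collections import deque


def simulate_ic(out_edges, seeds, rng):
    """Run one Independent Cascade diffusion and return the activated nodes.

    out_edges: dict node -> list of (out_neighbour, probability) pairs
               (hypothetical layout chosen for this sketch).
    seeds:     iterable of initially active nodes.
    rng:       a random.Random instance.
    """
    active = set(seeds)
    frontier = deque(active)
    while frontier:
        u = frontier.popleft()
        for v, p in out_edges.get(u, []):
            # u gets exactly one chance to activate each inactive out-neighbour.
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return active
```

Processing nodes from a queue rather than in explicit rounds gives the same distribution over final active sets, because each edge is still tried at most once.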

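Kempe et al. [5] showed that the IC model is equivalent to reachability in a random graph $g$, called a live-edge or sample graph. We generate a sample graph $g$ with node set $V_g$ and edge set $E_g$ by: (1) setting $V_g\leftarrow V$, and (2) selecting each edge $e=(u,v)\in E$ into $E_g$ with probability $p(e)=p(u,v)$. The probability of generating $g$ from $G$ is $\Pr[g\sim G]=\prod_{e\in E_g}p(e)\cdot\prod_{e\in E\setminus E_g}(1-p(e))$, and the influence spread of $S$ is calculated by:

$\sigma(S)=\sum_{g\sim G}\Pr[g\sim G]\,|R(g,S)|$ (1)

where $R(g,S)$ denotes the set of nodes reachable from $S$ in $g$. In CTVM, each node $u\in V$ has a cost $c(u)$ if it is selected into the seed set and a benefit $b(u)$ if it is activated. The total benefit over all influenced nodes (benefit function) of seed set $S$ is defined as follows:

$B(S)=\sum_{g\sim G}\Pr[g\sim G]\sum_{u\in R(g,S)}b(u)$ (2)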
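As a hedged illustration of the live-edge view, the sketch below estimates the benefit function by averaging, over sampled live-edge graphs, the total benefit of the nodes reachable from the seed set. The edge-list layout `(u, v, p)`, the `benefit` dictionary, and all function names are assumptions of this example; plain Monte Carlo is used here for clarity and is not the sampling scheme the paper proposes.

```python
import random


def sample_live_edge_graph(edges, rng):
    """Keep each edge (u, v, p) independently with probability p."""
    live = {}
    for u, v, p in edges:
        if rng.random() < p:
            live.setdefault(u, []).append(v)
    return live


def reachable(live, seeds):
    """Nodes reachable from the seeds in a live-edge graph (seeds included)."""
    seen = set(seeds)
    stack = list(seen)
    while stack:
        u = stack.pop()
        for v in live.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def estimate_benefit(edges, benefit, seeds, runs, rng):
    """Average total benefit of reachable nodes over `runs` sample graphs."""
    total = 0.0
    for _ in range(runs):
        g = sample_live_edge_graph(edges, rng)
        total += sum(benefit[u] for u in reachable(g, seeds))
    return total / runs
```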

The CTVM problem is formally defined as follows.

Definition 1 (CTVM)

Given a social network $G=(V,E)$ under the IC model, where each node $u\in V$ has a selecting cost $c(u)$ and a benefit $b(u)$ if $u$ is active, and given a budget $L$, find a seed set $S\subseteq V$ with total cost $c(S)=\sum_{u\in S}c(u)\le L$ that maximizes the benefit function $B(S)$.

3 Importance Benefit Sampling

In this section, we first recap the Benefit Sampling Algorithm (BSA) for estimating the benefit function [10, 14]. We then introduce our novel Importance Benefit Sample (IBS) concept along with the algorithm to generate these samples.

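The Benefit Sampling Algorithm (BSA) [10] generates a benefit sample $R_j$ according to the following steps: (1) picking a node $u$ as a source node with probability $b(u)/\Gamma$, where $\Gamma=\sum_{v\in V}b(v)$, (2) generating a sample graph $g$ from $G$, and (3) returning $R_j$ as the set of nodes that can reach $u$ in $g$. Denote by $\mathcal{R}$ a collection of benefit samples generated by BSA and define a random variable $X_j(S)=\min\{|R_j\cap S|,1\}$. Nguyen et al. [10] prove the following lemma to estimate the benefit function:

Lemma 1

For any set of nodes $S\subseteq V$, we have: $B(S)=\Gamma\cdot\mathbb{E}[X_j(S)]$

Let $\Omega$ be the set of all benefit samples and $R_j(u)$ be a benefit sample with source node $u$. The probability of generating $R_j(u)$ is

$\Pr[R_j(u)\sim\Omega]=\frac{b(u)}{\Gamma}\sum_{g:\,R(g,u)=R_j(u)}\Pr[g\sim G]$ (3)

where $R(g,u)$ here denotes the set of nodes that can reach $u$ in $g$.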
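The three BSA steps can be sketched as follows. This is a minimal illustration rather than the implementation from [10]: the reverse adjacency layout `in_edges` (node to list of `(in_neighbour, probability)` pairs) and the function names are assumptions, and edges are flipped lazily during the reverse traversal instead of materializing a whole sample graph.

```python
import random


def benefit_sample(in_edges, benefit, rng):
    """Draw one benefit sample: nodes that can reach a benefit-weighted source."""
    nodes = list(benefit)
    weights = [benefit[u] for u in nodes]
    # Step 1: pick the source with probability proportional to its benefit.
    source = rng.choices(nodes, weights=weights, k=1)[0]
    sample = {source}
    stack = [source]
    # Steps 2-3: reverse traversal, keeping each in-edge with its probability.
    while stack:
        v = stack.pop()
        for w, p in in_edges.get(v, []):
            if w not in sample and rng.random() < p:
                sample.add(w)
                stack.append(w)
    return sample


def estimate_benefit_bsa(samples, benefit, seeds):
    """Total benefit times the fraction of samples hit by the seed set."""
    total_benefit = sum(benefit.values())
    seeds = set(seeds)
    covered = sum(1 for r in samples if r & seeds)
    return total_benefit * covered / len(samples)
```

The estimator is the empirical counterpart of the statement that the benefit of a seed set equals the total benefit times the probability that a random benefit sample intersects the seed set.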

We now describe the IBS and IBA, the algorithm that generates IBSs. The main idea of this method is based on the observation that benefit samples containing only one node contribute insignificantly to the benefit function. For a source node $u$, let $\Omega(u)$ be the set of all benefit samples that have source node $u$. We divide $\Omega(u)$ into two components: $\Omega^{0}(u)$, the singular benefit samples which contain only node $u$, and $\Omega^{I}(u)$, the importance benefit samples which contain at least two nodes. Let $N_{in}(u)=\{v_1,v_2,\ldots,v_l\}$ and denote by $E_0$ the event that none of the nodes in $N_{in}(u)$ is selected; we have $\Pr[E_0]=\prod_{v\in N_{in}(u)}(1-p(v,u))$. The probability of generating an IBS with source node $u$ is equal to $\gamma(u)=1-\Pr[E_0]$. Denote by $E_i$ the event that $v_i$ is the first selected node; we have:

$\Pr[E_i]=p(v_i,u)\prod_{j=1}^{i-1}(1-p(v_j,u))$ (4)

Events $E_0,E_1,\ldots,E_l$ are disjoint and $\sum_{i=0}^{l}\Pr[E_i]=1$. The probability of generating an IBS that has source node $u$ with $v_i$ as the first selected node is

$\Pr[E_i\mid R_j(u)\in\Omega^{I}(u)]=\Pr[E_i]/\gamma(u)$ (5)

Denote as the probability spaces of all s, we have:

(6)

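The probability that $u$ is the source node of an IBS in $\Omega$ is $\frac{b(u)}{\Gamma}\gamma(u)$, where $\Gamma=\sum_{v\in V}b(v)$ and $\gamma(u)=1-\prod_{v\in N_{in}(u)}(1-p(v,u))$. Normalizing so that these values form a probability distribution over the sample space, the probability that $u$ is the source node of an IBS in $\Omega^{I}$ is calculated as follows:

$\Pr[\mathrm{src}(R_j)=u]=\frac{\gamma(u)\,b(u)}{\Phi}$, where $\Phi=\sum_{v\in V}\gamma(v)\,b(v)$ (7)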
Lemma 2

For any , we have

Based on the above analysis, we propose IBA, an algorithm to generate an IBS, which is depicted in Algorithm 1. The algorithm first selects a source node with probability given by eq. (7) (line 1). It next picks the first selected in-neighbour of the source node (line 2). The rest of the algorithm is similar to the Importance Influence Sampling Algorithm [12].

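Input: Graph $G=(V,E)$ under the IC model
Output: An Importance Benefit Sample (IBS) $R_j$
1 Pick a source node $u$ with probability given in eq. (7)
2 Select an in-neighbour $v_i$ of $u$ with probability given in eq. (5)
3 Initialize a queue $Q=\{v_i\}$ and $R_j=\{u,v_i\}$
4 for $t=i+1$ to $l$ do
5       With probability $p(v_t,u)$: $Q.\mathrm{push}(v_t)$ and $R_j\leftarrow R_j\cup\{v_t\}$
6 end for
7 while $Q$ is not empty do
8       $v\leftarrow Q.\mathrm{pop}()$
9       foreach $w\in N_{in}(v)\setminus R_j$ do
10             With probability $p(w,v)$: $Q.\mathrm{push}(w)$, $R_j\leftarrow R_j\cup\{w\}$
11       end foreach
12 end while
return $R_j$
Algorithm 1: IBA for the IC model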
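A possible Python rendering of Algorithm 1 is sketched below. It is an interpretation under stated assumptions, not the authors' code: the graph is given as a reverse adjacency dictionary `in_edges` (node to ordered list of `(in_neighbour, probability)` pairs, without self-loops or parallel edges), the source is drawn with weight `benefit * gamma`, and the helper names are hypothetical.

```python
import random
from collections import deque


def gamma(in_edges, u):
    """Probability that at least one in-edge of u is live."""
    none_live = 1.0
    for _, p in in_edges.get(u, []):
        none_live *= 1.0 - p
    return 1.0 - none_live


def importance_benefit_sample(in_edges, benefit, rng):
    """Draw one sample that always contains at least two nodes."""
    # Line 1: source drawn proportionally to benefit(u) * gamma(u).
    weights = {u: benefit[u] * gamma(in_edges, u) for u in benefit}
    nodes = [u for u in weights if weights[u] > 0]
    u = rng.choices(nodes, weights=[weights[x] for x in nodes], k=1)[0]
    nbrs = in_edges[u]
    # Line 2: index of the first live in-edge, Pr[E_i] up to normalization.
    first_weights, none_before = [], 1.0
    for _, p in nbrs:
        first_weights.append(none_before * p)
        none_before *= 1.0 - p
    i = rng.choices(range(len(nbrs)), weights=first_weights, k=1)[0]
    # Line 3: initialize the queue and the sample.
    sample = {u, nbrs[i][0]}
    queue = deque([nbrs[i][0]])
    # Lines 4-6: the remaining in-neighbours of u are flipped independently.
    for w, p in nbrs[i + 1:]:
        if w not in sample and rng.random() < p:
            sample.add(w)
            queue.append(w)
    # Lines 7-13: ordinary reverse traversal from the queued nodes.
    while queue:
        v = queue.popleft()
        for w, p in in_edges.get(v, []):
            if w not in sample and rng.random() < p:
                sample.add(w)
                queue.append(w)
    return sample
```

Nodes with no in-edges have `gamma` equal to zero and are therefore never chosen as a source, which is what guarantees that no singular sample is produced.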

For any is generated by , we define random variables , and

(8)

We have , with .

Lemma 3

For any set of nodes , we have:

(9)
Proof

Let and . From Lemma 1, we have

(10)

Since each contains only source node, if . In this case, we have , with . Put it into (10), we have:

(11)
(12)
(13)

This completes the proof.
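For readers who want the argument in one place, the following derivation sketches why importance samples suffice. The notation is assumed for this sketch and may differ from the authors' symbols: $\Gamma=\sum_{u}b(u)$, $\gamma(u)=1-\prod_{v\in N_{in}(u)}(1-p(v,u))$, $\Phi=\sum_{u}\gamma(u)b(u)$, and $Z_j(S)$ is the indicator that an IBS $R_j$ intersects $S$. It starts from the BSA identity of Lemma 1 and splits on whether the sample is singular.

```latex
% Sketch under assumed notation; R is a BSA sample, R^I an importance sample.
\begin{align*}
B(S) &= \Gamma \cdot \Pr[R \cap S \neq \emptyset] \\
     &= \sum_{u \in V} b(u)\Big[(1-\gamma(u))\,\mathbf{1}[u \in S]
        + \gamma(u)\,\Pr\big[R^{I} \cap S \neq \emptyset \mid \mathrm{src}(R^{I}) = u\big]\Big] \\
     &= \sum_{u \in S} (1-\gamma(u))\,b(u)
        + \Phi \sum_{u \in V} \frac{\gamma(u)\,b(u)}{\Phi}\,
          \Pr\big[R^{I} \cap S \neq \emptyset \mid \mathrm{src}(R^{I}) = u\big] \\
     &= \Phi \cdot \mathbb{E}[Z_j(S)] + \sum_{u \in S} (1-\gamma(u))\,b(u).
\end{align*}
```

The second line uses the fact that a singular sample with source $u$ intersects $S$ exactly when $u\in S$; the last line uses that an IBS has source $u$ with probability $\gamma(u)b(u)/\Phi$.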

Lemma 3 generalizes the result of Lemma 3 in [12], in which importance reverse reachable sets (sketches) are used to estimate the influence spread. Therefore, an estimation of the benefit function over a collection of IBSs is:

(14)
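A hedged sketch of such an estimator is given below. It assumes the form suggested by the discussion in this section, namely a normalizing constant times the fraction of importance samples hit by the seed set, plus a correction term for singular samples rooted at the seeds. The argument names and the precomputed `gamma` dictionary are assumptions of this example.

```python
def estimate_benefit_ibs(samples, benefit, gamma, seeds):
    """Estimate the benefit of `seeds` from importance benefit samples.

    samples: list of sets of nodes (importance samples).
    benefit: dict node -> benefit.
    gamma:   dict node -> probability that the node has a live in-edge.
    """
    seeds = set(seeds)
    # Normalizing constant of the importance-sample source distribution.
    phi = sum(gamma[u] * benefit[u] for u in benefit)
    covered = sum(1 for r in samples if r & seeds)
    # Contribution of the singular samples, which are never generated.
    singular_part = sum((1.0 - gamma[u]) * benefit[u] for u in seeds)
    return phi * covered / len(samples) + singular_part
```

On a chain 0 → 1 → 2 with probabilities 1.0 and 0.5 and unit benefits, every importance sample contains node 0, so the estimate for the seed set {0} is exactly 1.5 + 1.0 = 2.5, which matches the true expected benefit.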

4 Importance Sample-based Viral Marketing Algorithm

We present Importance Sample-based Viral Marketing (IVM), a $(1-1/\sqrt{e}-\epsilon)$-approximation algorithm for CTVM. IVM includes two components: generating IBSs to estimate the benefit function, and a new strategy that finds a candidate solution and checks its approximation guarantee condition using lower and upper bound functions.
Algorithm description. Our IVM algorithm is depicted in Algorithm 2. At first glance, IVM maintains one stream of samples, similar to BCT. However, it changes how solution quality is checked and limits the maximum number of required samples. It first calculates the maximum number of IBSs (line 1) using a lower bound of the optimal value, which significantly reduces the number of required samples while still bounding the total number of samples. IVM then generates an initial set of IBSs (line 2). The main phase consists of a bounded number of iterations (lines 4-14). In each iteration, the algorithm maintains a set of IBSs and finds a candidate solution by using the Improved Greedy Algorithm (IGA) for the Budgeted Maximum Coverage (BMC) problem [6]. IGA finds a solution for the BMC instance defined by the current set of samples, the universal set, and the budget. This algorithm returns a $(1-1/\sqrt{e})$-approximation solution [6] (due to space limitations, IGA and a second auxiliary algorithm are given in the Appendix). The main algorithm then calculates a lower bound of the benefit of the candidate solution and an upper bound of the optimal value (line 7); these bounds are established in Lemma 6 and Lemma 7. The algorithm then checks the approximation guarantee condition (line 8). If this condition is true, it returns the candidate solution and terminates. If not, it doubles the number of samples (line 12) and moves on to the next iteration.
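Since IGA itself is deferred to the appendix, the sketch below shows the modified greedy of Khuller et al. [6] for budgeted maximum coverage as one plausible reading of that step. It is an assumption-laden illustration: nodes play the role of sets, samples are the elements to cover, node identifiers are assumed sortable, costs are assumed strictly positive, and only sample coverage is maximized (any additive per-seed term in the estimator is ignored here).

```python
def improved_greedy(samples, cost, budget):
    """Pick nodes covering many samples within the budget.

    samples: list of sets of nodes.
    cost:    dict node -> positive cost.
    budget:  total cost allowed.
    """
    covers = {}
    for j, r in enumerate(samples):
        for v in r:
            covers.setdefault(v, set()).add(j)
    feasible = sorted(v for v in covers if cost[v] <= budget)
    chosen, covered, spent = [], set(), 0.0
    remaining = list(feasible)
    while remaining:
        # Best marginal coverage per unit cost among unprocessed nodes.
        best = max(remaining, key=lambda v: len(covers[v] - covered) / cost[v])
        remaining.remove(best)
        if spent + cost[best] <= budget and covers[best] - covered:
            chosen.append(best)
            covered |= covers[best]
            spent += cost[best]
    # Compare with the best single affordable node, as in [6].
    single = max(feasible, key=lambda v: len(covers[v]), default=None)
    if single is not None and len(covers[single]) > len(covered):
        return [single]
    return chosen
```

The final comparison against the best single node is what separates this variant from plain cost-effectiveness greedy, which can be arbitrarily bad when one expensive node covers almost everything.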

Input: Graph , budget , and
Output: seed
1 .
2 , ,
3 ,
4 repeat
5       Generate more s and add them into
6      
7       Calculate by Lemma 6 and calculate by Lemma 7.
8       if  then
9             return
10      else
11             , ,
12       end if
13      
14until ;
return ;
Algorithm 2: IVM algorithm
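The control flow of Algorithm 2 (generate, solve, check bounds, double) can be sketched independently of the exact bound formulas. In the sketch below, `generate`, `solve`, `lower_bound` and `upper_bound` are hypothetical callables supplied by the caller; they stand in for IBA, IGA, and the functions of Lemmas 6 and 7, whose precise forms are not reproduced here.

```python
def stop_and_check(generate, solve, lower_bound, upper_bound, ratio, n_start, n_max):
    """Doubling loop: stop once the candidate passes the ratio test.

    generate(k)            -> list of k fresh samples
    solve(samples)         -> candidate seed set
    lower_bound(c, s)      -> lower bound on the candidate's benefit
    upper_bound(c, s)      -> upper bound on the optimal benefit
    Returns (candidate, number_of_samples_used).
    """
    samples = list(generate(n_start))
    while True:
        candidate = solve(samples)
        # Approximation guarantee check on statistical evidence.
        if lower_bound(candidate, samples) >= ratio * upper_bound(candidate, samples):
            return candidate, len(samples)
        # Sample cap reached: return the last candidate.
        if len(samples) >= n_max:
            return candidate, len(samples)
        # Double the number of samples, without exceeding the cap.
        extra = min(len(samples), n_max - len(samples))
        samples.extend(generate(extra))
```

Keeping the bound functions as parameters makes the skeleton reusable with any pair of valid lower and upper bounds.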

Theoretical analysis. We observe that . Let randomly variable , where . For a sequence random variables we have . Hence, be a form of martingale [2]. We have following result from [2]

Lemma 4

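If $M_1,M_2,\ldots$ is a martingale such that $|M_1|\le a$, $|M_j-M_{j-1}|\le a$ for $j\in[2,i]$, and

$\mathrm{Var}[M_1]+\sum_{j=2}^{i}\mathrm{Var}[M_j\mid M_1,\ldots,M_{j-1}]\le b$ (15)

where $\mathrm{Var}[\cdot]$ denotes the variance of a random variable. Then, for any $\lambda>0$, we have:

$\Pr\big[M_i-\mathbb{E}[M_i]\ge\lambda\big]\le\exp\left(-\frac{\lambda^{2}}{\frac{2}{3}a\lambda+2b}\right)$ (16)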

Applying martingale theory [2], we have the following lemma:

Lemma 5

For any , is the mean of , and an estimation of is , we have:

(17)
(18)

Assume that is a feasible solution of . Since we do not known the size of , we can observe that the number of possible solutions is less than , where .

Theorem 4.1

For , . If the number of samples , returns a ()-approximation solution with probability at least .

We can apply Theorem 4.1 to obtain the following Corollary:

Corollary 1

At iterator ,

Due to space constraints, we omit some proofs and present them in our full version [17]. Using Lemma 5, we give lower-bound and upper-bound functions in Lemma 6 and Lemma 7. They help the main algorithm check the approximation condition of the candidate solution based on statistical evidence.

Lemma 6 (Lower-bound)

For any , a set of s and is an estimation of over by (14). Let and

we have

Lemma 7 (Upper-bound)

For any , a set of s , is a solution return by for input data , and is an estimation of over by (14). Let

(19)

we have

Theorem 4.2 (Main Theorem)

returns -approximation solution algorithm with probability at least .

Proof

We consider following events. , and . According to Lemmas 6, 7, and Corollary 1, we have: and . Apply the union bound the probability that none of events at least . Under this assumption, we will show that returns a solution satisfying . If the algorithm stops with condition , the solution satisfies approximation guarantee due to Corollary 1. Otherwise, if stops at some iterator . At this iterator, the condition in line 8 is satisfied, i.e, This completes the proof. ∎

From Corollary 1, needs at most samples. We have . Combine with , we have .

5 Experiment

In this section, we conduct experiments to compare the performance of our algorithm IVM with other algorithms for CTVM in four aspects: solution quality, running time, number of required samples, and memory usage.

5.1 Experimental Settings

Datasets. We select a diverse set of 4 datasets: Gnutella, Epinion, Amazon and DBLP. A description of the datasets is provided in Table 2.

Dataset #Node #Edge Type Avg. degree Source
Gnutella 6,301 20,777 directed 3.3 [8]
Epinion 75,879 508,837 directed 6.7 [19]
Amazon 262,111 1,234,877 directed 4.7 [7]
DBLP 317,080 1,049,866 undirected 3.21 [23]
Table 2: Dataset

Algorithms compared. We compare the IVM algorithm against the BCT algorithm [10], the state-of-the-art algorithm for CTVM with the same approximation ratio, and two baseline algorithms: Random and Degree.
Parameter setting. We follow previous works on CTVM and IM [10, 14, 21] to set up parameters. The transmission probability of each edge is randomly selected from $\{0.001, 0.01, 0.1\}$ according to the Trivalency model. The cost of a node is proportional to its out-degree [10]. In all the experiments, we choose a random subset of the nodes to be the target set and assign each target node benefit 1; the remaining parameters are fixed to a default setting. The budget varies from 1 to 1000.
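A setup of this kind can be sketched as follows. The concrete choices are assumptions made for illustration only: the trivalency values are the ones commonly used in the literature, and the `scale` factor for costs and the `fraction` of target nodes are hypothetical parameters, since the exact constants are not given here.

```python
import random

# Commonly used trivalency probabilities (assumed here).
TRIVALENCY = (0.001, 0.01, 0.1)


def assign_trivalency(edge_list, rng):
    """Attach a probability drawn uniformly from TRIVALENCY to each edge."""
    return [(u, v, rng.choice(TRIVALENCY)) for u, v in edge_list]


def degree_costs(edge_list, nodes, scale=1.0):
    """Cost proportional to out-degree; `scale` is a hypothetical constant."""
    out_deg = {u: 0 for u in nodes}
    for u, _ in edge_list:
        out_deg[u] += 1
    return {u: scale * out_deg[u] for u in nodes}


def target_benefits(nodes, fraction, rng):
    """Give benefit 1 to a random fraction of nodes and 0 to the rest."""
    k = round(fraction * len(nodes))
    targets = set(rng.sample(list(nodes), k))
    return {u: 1.0 if u in targets else 0.0 for u in nodes}
```

Note that nodes with zero out-degree receive cost 0 under this simple rule, so a cost-ratio greedy would need a small positive floor in practice.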

5.2 Experiment results


Figure 1: The benefit function achieved by algorithms

Fig. 1 shows the benefit value achieved by the algorithms. IVM outperforms the other algorithms and gives the best result on the Amazon network, where it provides up to 5.4 times more benefit than BCT. For the Gnutella, Epinion and DBLP networks, IVM gives results similar to BCT. This is because the two algorithms have the same approximation ratio for CTVM.

Network Algorithm Budget
100 200 300 400 500 600 700 800 900 1000
Gnutella IVM 0.01
Gnutella BCT 0.02 0.015 0.02 0.016 0.018 0.02 0.022 0.021 0.01 0.01
Epinion IVM 1.09 1.4 0.9 1 1.1 1 1 0.9 0.87 0.9
Epinion BCT 7.8 0.95 6.7 3.1 3.6 3.4 3.4 3.5 1.1 1
Amazon IVM 0.01 0.01 0.01 0.01 0.012 0.012 0.01 0.03 0.04 0.03
Amazon BCT 1.73 0.31 0.89 0.50 0.49 0.49 0.31 0.27 0.32 0.4
DBLP IVM 1.7 0.14 0.8 0.4 0.2 0.23 0.13 0.13 0.14 0.14
DBLP BCT 2.6 0.4 1.7 1.9 1.1 1.2 0.7 0.6 0.5 0.4
Table 3: Running time of IVM and BCT (sec.)
Algorithm Total samples () Memory usage (M)
Gnutella Epinion Amazon DBLP Gnutella Epinion Amazon DBLP
IVM 0.99 1.12 1.25 1.27 5.9 46 53 66
BCT 10 10 270 140 22 67 95 102
Table 4: Number of samples and total memory of IVM and BCT

The running time of the algorithms is shown in Table 3. The running time of our algorithm on all networks is significantly lower than that of BCT. IVM is up to 6.4, 7.1, 153 and 4.8 times faster than BCT on the Gnutella, Epinion, Amazon and DBLP networks, respectively.

Table 4 displays the memory usage and the number of required samples of IVM and BCT for a fixed budget. The number of samples generated by IVM is up to 112 times smaller than that of BCT. However, the memory usage of IVM is only 1.5 to 4.6 times smaller than that of BCT, because the memory for storing the graph is counted in the memory usage of each algorithm. These results also confirm our theoretical analysis in Section 4 that IVM requires far fewer samples.

6 Conclusion

In this paper, we propose IVM, an efficient approximation algorithm for CTVM, which has an approximation ratio of $(1-1/\sqrt{e}-\epsilon)$ and requires significantly fewer samples than the state-of-the-art BCT. Experiments show that IVM is up to 153 times faster and requires up to 112 times fewer total samples than the BCT algorithm. For future work, we plan to apply this importance sampling concept to the exact approach to evaluate its potential benefits in terms of running time and number of required samples.

Acknowledgements

This work is partially supported by NSF CNS-1443905, NSF EFRI 1441231, and NSF CNS-1814614 grants.

References

  • [1] C. Borgs, M. Brautbar, J. T. Chayes, and B. Lucier (2014) Maximizing social influence in nearly optimal time. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, pp. 946–957.
  • [2] F. R. K. Chung and L. Lu (2006) Concentration inequalities and martingale inequalities: a survey. Internet Mathematics 3 (1), pp. 79–127.
  • [3] T. N. Dinh, D. T. Nguyen, and M. T. Thai (2012) Cheap, easy, and massively effective viral marketing in social networks: truth or fiction?. In 23rd ACM Conference on Hypertext and Social Media, HT ’12, pp. 165–174.
  • [4] T. N. Dinh, Y. Shen, D. T. Nguyen, and M. T. Thai (2014) On the approximability of positive influence dominating set in social networks. J. Comb. Optim. 27 (3), pp. 487–503.
  • [5] D. Kempe, J. M. Kleinberg, and É. Tardos (2003) Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM International Conference on Knowledge Discovery and Data Mining SIGKDD, pp. 137–146.
  • [6] S. Khuller, A. Moss, and J. Naor (1999) The budgeted maximum coverage problem. Inf. Process. Lett. 70 (1), pp. 39–45.
  • [7] J. Leskovec, L. A. Adamic, and B. A. Huberman (2007) The dynamics of viral marketing. ACM TWEB 1 (1), pp. 5.
  • [8] J. Leskovec, J. M. Kleinberg, and C. Faloutsos (2007) Graph evolution: densification and shrinking diameters. TKDD 1 (1), pp. 2.
  • [9] X. Li, J. D. Smith, T. N. Dinh, and M. T. Thai (2017) Why approximate when you can get the exact? optimal targeted viral marketing at scale. In 2017 IEEE Conference on Computer Communications, INFOCOM 2017, pp. 1–9.
  • [10] H. T. Nguyen, T. N. Dinh, and M. T. Thai (2016) Cost-aware targeted viral marketing in billion-scale networks. In 35th Annual IEEE International Conference on Computer Communications, INFOCOM, pp. 1–9.
  • [11] H. T. Nguyen, T. N. Dinh, and M. T. Thai (2018) Revisiting of ’revisiting the stop-and-stare algorithms for influence maximization’. In 7th International Conference on Computational Data and Social Networks, CSoNet 2018, pp. 273–285.
  • [12] H. T. Nguyen, T. P. Nguyen, N. H. Phan, and T. N. Dinh (2017) Importance sketching of influence dynamics in billion-scale networks. In 2017 IEEE International Conference on Data Mining, ICDM 2017, pp. 337–346.
  • [13] H. T. Nguyen, M. T. Thai, and T. N. Dinh (2016) Stop-and-stare: optimal sampling algorithms for viral marketing in billion-scale networks. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD, pp. 695–710.
  • [14] H. T. Nguyen, M. T. Thai, and T. N. Dinh (2017) A billion-scale approximation algorithm for maximizing benefit in viral marketing. IEEE/ACM Trans. Netw. 25 (4), pp. 2419–2429.
  • [15] N. P. Nguyen, G. Yan, and M. T. Thai (2013) Analysis of misinformation containment in online social networks. Computer Networks 57 (10), pp. 2133–2146.
  • [16] C. V. Pham, H. V. Duong, H. X. Hoang, and M. T. Thai (2019) Competitive influence maximization within time and budget constraints in online social networks: an algorithmic approach. Applied Sciences 9 (11). ISSN 2076-3417.
  • [17] C. V. Pham, H. V. Duong, and M. T. Thai Importance sample-based approximation algorithm for cost-aware targeted viral marketing.
  • [18] C. V. Pham, M. T. Thai, H. V. Duong, B. Q. Bui, and H. X. Hoang (2018) Maximizing misinformation restriction within time and budget constraints. J. Comb. Optim. 35 (4), pp. 1202–1240.
  • [19] M. Richardson, R. Agrawal, and P. M. Domingos (2003) Trust management for the semantic web. In The Semantic Web - ISWC 2003, Second International Semantic Web Conference, pp. 351–368.
  • [20] Y. Shen, T. N. Dinh, H. Zhang, and M. T. Thai (2012) Interest-matching information propagation in multiple online social networks. In 21st ACM International Conference on Information and Knowledge Management, CIKM’12, pp. 1824–1828.
  • [21] Y. Tang, Y. Shi, and X. Xiao (2015) Influence maximization in near-linear time: A martingale approach. In Proceedings of the 2015 ACM International Conference on Management of Data (SIGMOD), pp. 1539–1554.
  • [22] Y. Tang, X. Xiao, and Y. Shi (2014) Influence maximization: near-optimal time complexity meets practical efficiency. In International Conference on Management of Data, SIGMOD 2014, pp. 75–86.
  • [23] J. Yang and J. Leskovec (2012) Defining and evaluating network communities based on ground-truth. In 12th IEEE International Conference on Data Mining, ICDM 2012, pp. 745–754.
  • [24] H. Zhang, D. T. Nguyen, H. Zhang, and M. T. Thai (2016) Least cost influence maximization across multiple social networks. IEEE/ACM Trans. Netw. 24 (2), pp. 929–939.
  • [25] H. Zhang, H. Zhang, A. Kuhnle, and M. T. Thai (2016) Profit maximization for multiple products in online social networks. In 35th Annual IEEE International Conference on Computer Communications, INFOCOM 2016, San Francisco, CA, USA, April 10-14, 2016, pp. 1–9.
  • [26] H. Zhang, H. Zhang, X. Li, and M. T. Thai (2015) Limiting the spread of misinformation while effectively raising awareness in social networks. In Computational Social Networks - 4th International Conference, CSoNet 2015, Beijing, China, August 4-6, 2015, Proceedings, pp. 35–47.

Appendix

Proof of Lemmas and Theorems

Proof (Proof of Lemma 2)

We have

(20)
(21)
(22)
(23)
(24)
Proof (Proof of Lemma 5)

Since , we have

Apply Cauchy’s inequality, we have . Therefore,

(25)

On the other hand

(26)

Combine (25) and (26), we have . Since , , and

(27)

Apply Lemma 4, we chose and