With the proliferation of Online Social Networks (OSNs) such as Facebook and LinkedIn, the paradigm of viral marketing through the “word-of-mouth” effect over OSNs has found numerous applications in modern commercial activities. For example, a company may provide free product samples to a few individuals (i.e., “seed” nodes) in an OSN, such that more people are attracted to buy the company’s products through the information cascade starting from the seed nodes.
Kempe et al. initiated the study of the NP-hard $k$-Seed-Selection (SS) problem in OSNs, where the goal is to select the $k$ most influential nodes in an OSN under contagion models such as the independent cascade model and the more general triggering model. Since then, extensive studies in this line have been devoted to designing efficient approximation algorithms for the SS problem and its variations [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. However, all the algorithms proposed in these studies can be classified as Influence Maximization (IM) algorithms, because they share the same goal of maximizing the influence spread (i.e., the expected number of influenced nodes in the network).
In many viral-marketing applications, one may seek the “best bang for the buck”, i.e., a set of seed nodes with the minimum total cost whose influence spread is no less than a predefined threshold. This problem is called the Min-Cost Seed Selection (MCSS) problem and has been investigated by prior work such as [12, 13, 14]. These studies indicate that the existing IM algorithms are not appropriate for the MCSS problem, as IM algorithms require knowledge of the total cost of the selected seed nodes in advance, while this knowledge is exactly what the MCSS problem seeks.
The existing MCSS algorithms [12, 13, 14], however, suffer from several major deficiencies. First, these algorithms only consider the uniform cost (UC) case where each node has an identical cost of 1, so none of their performance ratios holds under the general cost (GC) case where the nodes’ costs can be heterogeneous. Nevertheless, existing work indicates that the GC case is ubiquitous in reality [10, 14, 15]. Second, most of them only propose bi-criteria approximation algorithms, which cannot guarantee that the influence spread of the selected nodes is no less than the required threshold. Third, the theoretical bounds proposed by some existing work hold only for OSNs with special influence propagation probabilities, and not for general OSNs. Last but not least, the existing MCSS algorithms do not scale well to big networks with millions/billions of nodes/edges.
Although the MCSS problem looks like a classical Submodular Set Cover (SSC) problem, the conventional approximation algorithms for SSC cannot be directly used to find solutions to MCSS in polynomial time, as computing the influence spread of any seed set is essentially a #P-hard problem. One possible way to overcome this hurdle is to apply the existing network sampling methods proposed for the SS problem (e.g., [4, 6, 10]), but it is highly non-trivial to design an efficient sampling algorithm that yields a satisfying approximation solution to MCSS, due to the essential discrepancies between MCSS and SS. Indeed, the number of selected nodes in the SS problem is always pre-determined (i.e., $k$), while this number can be uncertain and highly correlated with the generated network samples in MCSS. This requires us to carefully build a quantitative relationship between the generated network samples and the stopping rule for selecting seed nodes, such that a feasible solution with small cost can be found in as short a time as possible. Unfortunately, the existing studies have not made substantial progress towards tackling the above challenges in MCSS, so they suffer from several deficiencies in theoretical performance ratio, practicability and efficiency, including the ones described in the last section.
In this paper, we propose several algorithms for the MCSS problem based on network sampling, and our algorithms advance the prior art by achieving better approximation ratios and lower time complexity. More specifically, our major contributions include:
Under the GC case, we propose the first polynomial-time bi-criteria approximation algorithms with provable approximation ratios for MCSS under a general contagion model (i.e., the triggering model). Given any OSN with $n$ nodes and $m$ edges, our algorithms achieve a provable approximation ratio and output a seed set whose influence spread meets the required threshold with high probability, within an expected time complexity that is polynomial in $n$, $m$, and the maximum influence spread of any single node in the network.
Under the UC case, our proposed algorithms have provable approximation ratios and output a seed set whose influence spread meets the required threshold with high probability. Compared to the existing algorithm that has the best-known approximation ratio under the same setting as ours, our algorithms run significantly faster, while our approximation ratio can be better than theirs under the same running time.
In contrast to some related work, the performance bounds of all our algorithms do not depend on particularities of the network data (e.g., the influence propagation probabilities of the network), so they are more general.
We test the empirical performance of our algorithms using real OSNs with up to 1.5 billion edges. The experimental results demonstrate that our algorithms significantly outperform the existing algorithms in both the running time and the total cost of the selected seed nodes.
The rest of our paper is organized as follows. We formally formulate our problem in Sec. II. We propose bi-criteria approximation algorithms and approximation algorithms for the MCSS problem in Sec. III and Sec. IV, respectively. The experimental results are presented in Sec. V. We review the related work in Sec. VI before concluding our paper in Sec. VII.
II-A Model and Problem Definition
We model an online social network as a directed graph $G=(V,E)$, where $V$ is the set of nodes and $E$ is the set of edges. Each node $v\in V$ has a cost $c(v)$, which denotes the cost of selecting $v$ as a seed node. For convenience, we define $c(S)=\sum_{v\in S}c(v)$ for any $S\subseteq V$.
When the nodes in a seed set are influenced, an influence propagation process starts in the network and hence more nodes can be activated. There are many influence propagation/contagion models, among which the Independent Cascade (IC) model and the Linear Threshold (LT) model are the most popular ones. However, it has been proved that both the IC model and the LT model are special cases of the triggering model, so we adopt the triggering model for generality.
In the triggering model, each node $v$ is associated with a triggering distribution $\mathcal{T}(v)$ over the subsets of $v$’s in-neighbors. Let $T(v)$ denote a sample taken from $\mathcal{T}(v)$ for any $v\in V$ ($T(v)$ is called a triggering set of $v$). The influence propagation with any seed set $S\subseteq V$ under the triggering model can be described as follows. At time $t=0$, the nodes in $S$ are all activated. Afterwards, any inactive node $v$ is activated at time $t+1$ iff there exists a node $u\in T(v)$ that was activated at time $t$. This propagation process terminates when no more nodes can be activated. Let $\sigma(S)$ denote the Influence Spread (IS) of $S$, i.e., the expected number of activated nodes under the triggering model. The problem studied in this paper can be formulated as follows:
Definition 1 (The MCSS problem).
Given an OSN $G=(V,E)$, a cost function $c$ on the nodes, and a threshold $\eta>0$, the Min-Cost Seed Selection (MCSS) problem aims to find a seed set $S\subseteq V$ whose influence spread $\sigma(S)$ is at least $\eta$ and whose total cost $c(S)$ is minimized.
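To make the propagation dynamics concrete, the following is a minimal Monte Carlo sketch of the triggering model (all function and variable names here are illustrative, not the paper’s notation): each run draws a fresh triggering set per node, activates a node as soon as its triggering set intersects the newly activated frontier, and the influence spread is estimated by averaging the number of activated nodes over many runs.

```python
import random

def propagate(seeds, trig):
    """One propagation under the triggering model, given a fixed
    triggering set trig[v] (a subset of v's in-neighbours) per node.
    Returns the set of activated nodes."""
    active = set(seeds)
    frontier = set(seeds)
    while frontier:
        # A node activates iff its triggering set meets the frontier.
        newly = {v for v, T in trig.items()
                 if v not in active and frontier & T}
        active |= newly
        frontier = newly
    return active

def estimate_spread(nodes, seeds, sample_trig, rounds=1000, rng=None):
    """Monte Carlo estimate of the influence spread sigma(S):
    average activated-node count over fresh triggering samples."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(rounds):
        trig = {v: sample_trig(v, rng) for v in nodes}
        total += len(propagate(seeds, trig))
    return total / rounds
```

With deterministic triggering sets this reduces to reachability along a chain, which makes the behaviour easy to verify by hand.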
The MCSS problem has been studied by prior work under different settings/assumptions, such as the UC setting and the GC setting explained in Sec. I. Besides, under the Exact Value (EV) setting, it is assumed that the exact value of the influence spread can be computed in polynomial time, while this assumption does not hold under the Noisy Value (NV) setting.
The existing MCSS algorithms can also be classified into: (1) APproximation (AP) algorithms: these algorithms regard any $S$ satisfying $\sigma(S)\geq\eta$ as a feasible solution; (2) Bi-criteria Approximation (BA) algorithms [13, 14]: these algorithms regard any $S$ satisfying $\sigma(S)\geq(1-\varepsilon)\eta$ as a feasible solution, where $\varepsilon$ is any given number in $(0,1)$.
II-B The Greedy Algorithm for Submodular Set Cover
It has been shown that $\sigma(\cdot)$ is a monotone and submodular function, i.e., for any $S\subseteq T\subseteq V$ and any $v\in V$, we have $\sigma(S)\leq\sigma(T)$ and $\sigma(S\cup\{v\})-\sigma(S)\geq\sigma(T\cup\{v\})-\sigma(T)$. Therefore, the MCSS problem is an instance of the Submodular Set Cover (SSC) problem [16, 17], which can be solved by a greedy algorithm under the EV setting. For clarity, we present a (generalized) version of the greedy algorithm for the SSC problem, shown in Algorithm 1:
Fact 1.
Let $S$ be the output of the greedy algorithm on any input instance. Then $S$ is a feasible solution whose cost is bounded relative to the optimal cost by the classical logarithmic-factor guarantee for greedy submodular set cover. This bound holds even if the objective is an arbitrary monotone and non-negative submodular function defined on the ground set $V$.
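The greedy routine behind Fact 1 can be sketched as follows, assuming (as in the EV setting) an exact oracle `f` for the objective; the rule of picking the node with the largest marginal gain per unit cost is the standard one for weighted submodular set cover, and all names here are illustrative:

```python
def greedy_ssc(nodes, cost, f, eta):
    """Greedy submodular set cover (EV setting): repeatedly add the
    node with the largest marginal gain per unit cost until f(S) >= eta.
    f must be monotone, non-negative and submodular on subsets of nodes."""
    S, fS = set(), f(set())
    while fS < eta:
        best, best_ratio = None, 0.0
        for v in nodes - S:
            gain = f(S | {v}) - fS
            if gain > 0 and gain / cost[v] > best_ratio:
                best, best_ratio = v, gain / cost[v]
        if best is None:          # eta unreachable: no node adds value
            break
        S.add(best)
        fS = f(S)
    return S
```

Note that each iteration queries the exact oracle `f` once per remaining node, which is precisely what becomes problematic when `f` is the #P-hard influence spread.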
Unfortunately, it has been shown that calculating the influence spread is #P-hard. Therefore, implementing the greedy algorithm under the EV setting is impractical. Indeed, existing work usually uses multiple Monte Carlo simulations to estimate the influence spread in Algorithm 1 [1, 2], so the estimate is noisy. However, such an approach has two drawbacks: (1) the time complexity is still high (though polynomial); (2) the theoretical performance bounds under the EV setting (such as Fact 1) no longer hold.
II-C RR-Set Sampling
Recently, Borgs et al. proposed an elegant network sampling method to estimate the value of $\sigma(S)$, whose key idea can be presented by the following equation:
$$\sigma(S) = n \cdot \Pr[S \cap R \neq \emptyset], \tag{1}$$
where $n$ denotes the number of nodes and $R$ is a random subset of $V$, called a Reverse Reachable (RR) set. Under the IC model, an RR-set can be generated by first sampling a node $v$ uniformly from $V$, then reversing the directions of the edges in $G$ and performing a randomized traversal from $v$, including each incoming edge independently with its propagation probability. According to equation (1), the value of $\sigma(S)$ can be estimated unbiasedly by any set $\mathcal{R}$ of RR-sets as follows:
$$\hat{\sigma}_{\mathcal{R}}(S) = \frac{n}{|\mathcal{R}|} \sum_{R \in \mathcal{R}} \mathbb{I}(S \cap R \neq \emptyset).$$
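A minimal sketch of RR-set generation and the resulting unbiased estimator under the IC model follows (the helper names and the edge-probability map `p[(u, v)]` are illustrative assumptions, not the paper’s notation):

```python
import random

def sample_rr_set(nodes, in_nbrs, p, rng):
    """Generate one Reverse Reachable (RR) set under the IC model:
    pick a uniformly random node, then do a reverse traversal in which
    each incoming edge (u, v) is crossed independently with prob. p[(u, v)]."""
    root = rng.choice(nodes)
    rr, queue = {root}, [root]
    while queue:
        v = queue.pop()
        for u in in_nbrs.get(v, ()):
            if u not in rr and rng.random() < p[(u, v)]:
                rr.add(u)
                queue.append(u)
    return rr

def estimate_spread(seeds, rr_sets, n):
    """Unbiased estimate of sigma(S): n times the fraction of
    RR-sets intersected by the seed set, as in equation (1)."""
    hit = sum(1 for rr in rr_sets if rr & set(seeds))
    return n * hit / len(rr_sets)
```

On a chain graph with all edge probabilities equal to 1, every RR-set contains the source node of the chain, so the estimator recovers its exact spread regardless of the sample.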
III Bi-Criteria Approximation Algorithms
In this section, we propose bi-criteria approximation (BA) algorithms for the MCSS problem under the GC+NV setting.
III-A The BCGC Algorithm
Our first bi-criteria approximation algorithm is called BCGC, shown in Algorithm 2. BCGC first generates a set of RR-sets (line 2) according to its input parameters, and then calls a greedy covering function to find a min-cost node set that covers sufficiently many of the generated RR-sets, which is returned as the solution to MCSS.
It can be verified that the number of RR-sets covered by a node set is a monotone and non-negative submodular function defined on the ground set $V$, so BCGC is essentially a (deterministic) greedy submodular set cover algorithm similar to Algorithm 1.
The key issue in BCGC is how to determine the number of RR-sets and the coverage threshold. Intuitively, they should be large enough that BCGC can output a feasible solution with a cost close to the optimum. On the other hand, we also want them to be small so that the time complexity of BCGC can be reduced. To see how BCGC achieves these goals, we first introduce the following functions:
For any parameters in the required ranges, define the threshold functions used by BCGC, where $e$ denotes the base of natural logarithms.
From Eqn. (5), the following can be seen, which is proved by the concentration bounds in probability theory:
Given any seed set and any set of RR-sets, if the stated conditions on the sample size and thresholds hold, then the corresponding upper and lower concentration bounds hold.
Let the sample be any set of RR-sets satisfying the stated size condition. We have
Moreover, when the sample size is large enough, Lemma 1 gives us:
When the parameters are set as above, BCGC returns a feasible set achieving the stated cost bound with high probability.
Consider an optimal solution to the following optimization problem:
As BCGC is essentially a deterministic greedy submodular cover algorithm, we can use Fact 1 to get
Therefore, if the above inequality holds, then we have the following, and hence
According to (7), the probability that the above inequality fails is bounded. Besides, the probability that BCGC returns an infeasible solution is also bounded due to (6). The theorem then follows by using the union bound. ∎
Note that the approximation ratio of BCGC nearly matches the approximation ratio shown in Fact 1, which is derived under the EV setting. For example, under a suitably small parameter setting, the approximation ratio of BCGC exceeds that of Fact 1 by only a small additive term.
BCGC spends most of its running time on generating RR-sets. According to the parameter settings of BCGC, the number of generated RR-sets is bounded (see the proof of Theorem 2). Therefore, we can get:
BCGC achieves the performance bound shown in Theorem 1 under the stated expected time complexity.
III-B A Trial-and-Error Algorithm
It can be seen that BCGC behaves in a “once-for-all” manner, i.e., it generates all the RR-sets in one batch, and then finds a solution using the generated RR-sets. In this section, we propose another “trial-and-error” algorithm for the MCSS problem, which runs in iterations and “lazily” generates RR-sets only when necessary.
where the parameter will be explained shortly. The algorithm then calls the greedy covering function to find an approximate solution satisfying
However, it is possible that the returned set is an infeasible solution. Therefore, the algorithm calls a checking function to judge whether the influence threshold is met. If the check passes, it implies that the returned set is a feasible solution w.h.p., so the algorithm terminates and returns it. Otherwise, the algorithm enters another iteration and adds more RR-sets into the sample. This “trial-and-error” process repeats until the number of generated RR-sets reaches a predefined threshold. As this threshold is set similarly to that in BCGC, the algorithm is guaranteed to find a feasible solution with high probability.
The role of the per-iteration failure parameter is roughly explained as follows. Intuitively, it bounds the total probability of the “bad events” happening in any single iteration. In the first iteration, it is set to an initial value, and it is decreased by a constant factor in every subsequent iteration; the probability of the bad events in the final phase is constrained similarly. Using the union bound, the total probability that the algorithm returns a “bad” solution conflicting with our performance bounds is upper-bounded by the sum of these per-iteration budgets.
By reasoning similar to that in Theorem 1, we can prove that the trial-and-error algorithm has the same approximation ratio as BCGC:
When the parameters are set as above, the trial-and-error algorithm returns a feasible set achieving the stated cost bound with high probability.
The design of the checking function: Next, we explain how the checking function is implemented. It maintains three threshold values. When the requested sample size is small, it simply generates the requested RR-sets and returns them. Otherwise, it keeps generating RR-sets until either the empirical coverage count or the number of generated RR-sets reaches its threshold. Note that the function is called with parameters guaranteeing that the total number of generated RR-sets never exceeds the overall budget.
Intuitively, if the influence spread of the candidate set is very large, then the empirical coverage has a high probability of reaching its threshold first, so the checking function is likely to pass. Conversely, if the influence spread is very small, then the sample-size threshold is likely to be reached first, and the check fails. By setting the threshold values based on the Chernoff bounds, we get the following theorem:
For any candidate seed set, if its influence spread is at least the required threshold, then the checking function passes with high probability; if its influence spread is sufficiently below the threshold, then the checking function fails with high probability.
Note that the checking function is called with parameters such that, by Theorem 4, the algorithm always returns a feasible solution with high probability. When the influence spread lies between the two thresholds, it is possible that the check fails, but this does not harm the correctness of the algorithm and only results in more iterations; moreover, the probability of this event can be very small.
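As an illustration of the Chernoff/Hoeffding-style reasoning behind such a checking function (a simplified stand-in, not the paper’s exact procedure), the following test compares an empirical hit rate against a threshold at a margin of `eps/2`, choosing the sample size so that each one-sided error occurs with probability at most `delta`:

```python
import math
import random

def check_coverage(sample_hit, theta, eps, delta, rng):
    """Simplified feasibility test: estimate p = Pr[a random RR-set is
    hit by the seed set] from m fresh samples and compare against the
    threshold theta at margin eps/2. By the one-sided Hoeffding bound,
    Pr[error] <= exp(-m * eps^2 / 2) <= delta for the chosen m, for both
    the case p >= theta and the case p <= theta - eps."""
    m = math.ceil(2 * math.log(1 / delta) / eps ** 2)
    hits = sum(sample_hit(rng) for _ in range(m))
    return hits / m >= theta - eps / 2
```

The two-threshold structure in the paper refines this idea by stopping the sampling early once either the hit count or the sample count crosses its precomputed bound.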
III-C Theoretical Comparisons of the BA Algorithms
We compare the theoretical performance of our BA algorithms with that of the state-of-the-art algorithms as follows.
III-C1 Comparison with Goyal et al.’s Results
To the best of our knowledge, the only prior algorithm with a provable performance bound for the MCSS problem under the GC+NV setting is the one proposed by Goyal et al., which is based on a greedy algorithm. We quote their result in Fact 2:
Fact 2.
Let $S$ be the output of their algorithm. Then $S$ is feasible and its cost is bounded as stated, where the relevant parameters satisfy:
However, Goyal et al. did not mention how to implement their estimation step. A crucial problem is that their proof requires exact influence-spread estimates; otherwise their proof of the approximation ratio in Fact 2 does not hold. However, as computing the influence spread is #P-hard and is estimated by Monte Carlo simulations, exactness would require infinitely many Monte Carlo simulations according to the Chernoff bounds. For this reason, we believe that the approximation ratio shown in Fact 2 is mainly of theoretical value and is not likely to be achieved in polynomial time.
III-C2 Comparison with Kuhnle et al.’s Results
Very recently, Kuhnle et al. have proposed some elegant bi-criteria approximation algorithms for the MCSS problem, but only under the UC setting. We quote their results below:
Fact 3.
Define the CEA assumption as follows: for any set whose influence spread is below the required threshold, there always exists a node whose marginal gain is lower-bounded as stated. Under the UC setting, the STAB algorithms can find a set satisfying the coverage requirement and 1) or 2) listed below with high probability; the STAB-C1 and STAB-C2 algorithms have different time complexities.
when the CEA assumption holds.
without the CEA assumption.
Note that Fact 3 depends on the CEA assumption, and there is a large gap between the two cost bounds without the CEA assumption. In contrast, the performance ratios of our algorithms do not require any special properties of the network, and we can guarantee feasibility w.h.p. for any instance. Most importantly, Fact 3 only holds under the UC setting, while our performance bounds shown in Theorems 1-3 hold under the GC setting.
IV Approximation Algorithms for Uniform Costs
In this section, we propose approximation (AP) algorithms for the MCSS problem under the UC+NV setting.
IV-A The AAUC Algorithm
We first propose an algorithm called AAUC. AAUC first generates a set of RR-sets, and then calls the greedy covering function to return a set whose coverage reaches a slightly inflated threshold, controlled by a tunable parameter. Although AAUC looks similar to BCGC, its key idea (i.e., how to set the parameters) is very different from that in BCGC, as explained as follows.
Recall that the covering function is a greedy algorithm. Suppose that it sequentially selects a sequence of nodes after being called by AAUC. As each node has a cost of 1 and the coverage function is submodular, we can use the properties of submodular functions to prove:
For any step of the greedy selection, the stated inequality holds.
Clearly, when the parameter is sufficiently small, the two quantities must be very close. Indeed, we can prove:
When , we have .
When the parameters are set as above, AAUC returns a set meeting the required influence threshold and the stated cost bound with high probability.
By reasoning very similar to that in Theorem 2, we can also bound the time complexity of AAUC as follows:
AAUC achieves the performance bounds shown in Theorem 5 under the stated expected time complexity.
IV-B An Adaptive Trial-and-Error Algorithm
It can be seen from Theorem 6 that the running time of AAUC is inversely proportional to a quantity that can be very small. To address this problem, we propose an adaptive trial-and-error algorithm, shown in Algorithm 6. The algorithm uses a dynamic parameter for generating RR-sets, and adaptively changes the value of this parameter until a satisfactory approximate solution is found.
More specifically, the algorithm first determines a threshold from its inputs, then starts with any parameter value larger than this threshold. Note that the performance bound of the algorithm (see Theorem 7) does not depend on the initial parameter value, whose setting only serves to reduce the number of generated RR-sets. In each iteration, the algorithm first generates a set of RR-sets in a way similar to that in AAUC, with the failure probability budget set similarly to that in the trial-and-error algorithm of Sec. III-B. After that, it calls the greedy covering function to find a candidate solution. If the current parameter is still too large, the algorithm decreases it and enters the next iteration. Otherwise, the parameter is small enough, so the algorithm calls the checking function to judge whether the candidate is feasible. If the check passes, the candidate is returned as the solution; otherwise the algorithm decreases the parameter value and enters the next iteration.
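The adaptive loop just described can be sketched as a generic skeleton (hypothetical callbacks; the accuracy parameter `eps` stands in for the dynamic parameter above, and the halving schedule is an illustrative choice):

```python
def adaptive_search(solve, accept, check, eps0, eps_min):
    """Skeleton of an adaptive trial-and-error loop: start with a coarse
    accuracy parameter eps, and keep halving it (regenerating samples and
    re-solving) until the greedy solution both satisfies the internal
    accuracy test `accept` and passes the feasibility check `check`."""
    eps = eps0
    while True:
        S = solve(eps)            # greedy on a sample sized for this eps
        if accept(S, eps):        # is eps small enough for this solution?
            if check(S):          # feasible w.h.p. -> done
                return S, eps
        if eps <= eps_min:        # stop refining below the floor
            return S, eps
        eps /= 2                  # refine the parameter and try again
```

Because the sample size grows as the parameter shrinks, the loop only pays for fine accuracy when the coarse rounds fail, which is the source of the adaptive speedup over the fixed-parameter AAUC.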