One of the promises of a highly connected world is that of an impartial spread of opinions driven by free and unbiased sources of information. As a consequence, any opinion could be equitably exposed to the wide public. On the contrary, the social network platforms that currently govern news diffusion, while offering many seemingly desirable features like searching, personalization, and recommendation, are reinforcing the centralization of information spreading and the creation of what are often termed echo chambers and filter bubbles [GDFMGM18]. Stated differently, algorithmic personalization of news diffusion is likely to create homogeneous polarized clusters where users get less exposure to conflicting viewpoints. A good illustration of this issue was given by Conover et al. [CRF11], who studied the Twitter network during the 2010 US congressional midterm elections. The authors demonstrated that the retweet network had a highly segregated partisan structure with extremely limited connectivity between left-wing and right-wing users. Consequently, instead of giving users a diverse perspective and balancing users' opinions by exposing them to challenging ideas, social media platforms are likely to make users more extreme by only exposing them to views that reinforce their pre-existing beliefs [CRF11, DVBZ16].
To address this issue from a computational perspective, Garimella et al. [GGPT17] introduced the problem of balancing information exposure in a social network. Following the influence maximization paradigm going back to the seminal work of Kempe et al. [KKT03, KKT15], their problem involves two opposing viewpoints or campaigns that propagate in a social network following the independent cascade model. Given initial seed sets for both campaigns, a centralized agent is then responsible for selecting a small number of additional seed users for each campaign in order to maximize the number of users that are reached by either both or none of the campaigns. The authors study this problem in two different settings, namely the heterogeneous and correlated settings. The heterogeneous setting corresponds to the general case in which there is no restriction on the probabilities with which the campaigns propagate. In contrast, in the correlated setting, the probability distributions for different campaigns are identical and completely correlated. After proving that the optimization problem of balancing information exposure is NP-hard, the authors designed efficient approximation algorithms with a constant approximation ratio for both settings.
We address the main open problem in [GGPT17], that is, we generalize their optimization problem to a setting with possibly more than two campaigns. More precisely, let and be fixed constants such that . In our general problem, there are opposing campaigns and the task is to maximize the number of nodes in the network that are reached by at least campaigns or remain oblivious to all of them. We term this problem the --Balance problem. Interestingly, our results differ markedly from those of Garimella et al. [GGPT17]. Indeed, while we show that any --Balance problem can be approximated within a constant factor in the correlated setting (Section 5), we obtain strong approximation hardness results in the heterogeneous setting. In particular, when , we show that under the Gap Exponential Time Hypothesis [Man17], there is no -approximation algorithm with for the --Balance problem where is the number of nodes. Moreover, when , we show that if a certain class of one-way functions exists [App13], there is no -approximation algorithm for the --Balance problem where is a constant which depends on (Section 3). We mitigate these hardness results by designing an algorithm with an approximation factor of for the case where and is an arbitrary constant (Section 4).
There is a large literature on influence maximization; we refer the interested reader to [BBCL14, KKT15] and references therein. Here we focus on the literature about multiple campaigns running simultaneously on the same network. Budak et al. [BAEA11] studied the problem of limiting as much as possible the spread of a “bad” campaign by starting the spread of another “good” campaign that blocks the first one. The two campaigns compete for the nodes that they reach: once a node becomes active in one campaign, it cannot change campaign. They prove that the objective function is monotone and submodular and hence obtain a constant approximation ratio. Similar concepts of competing cascades in which a node can only participate in one campaign have been studied in several works [AM11, BKS07, CNWVZ07, DGDM06, KOW08, LCL15, ML12]. Game-theoretic aspects like the existence of Nash equilibria have also been investigated in this case [AFPT10, GHK14, TAM12]. Borodin et al. [BBLO17] consider the problem of controlling the spread of multiple campaigns by a centralized authority. Each campaign has its own objective function to maximize, associated with its spread, and the aim of the central authority is to maximize the social welfare, defined as the sum of the selfish objective functions of the campaigns. They propose a truthful mechanism to achieve theoretical guarantees on the social welfare.
Two other works closely related to ours are those of Aslay et al. [AMGG18] and Matakos et al. [MG18]. The former work tackles an item-aware information propagation problem in which a centralized agent must recommend some articles to a small set of seed users such that the spread of these articles maximizes the expected diversity of exposure of the agents. The diversity of exposure is measured by a sum of agent-dependent functions that take into account user leanings. The authors show that the NP-hard problem they define amounts to optimizing a monotone and submodular function under a matroid constraint and design a constant-factor approximation algorithm. The latter paper models the problem of maximizing the diversity of exposure in a social network as a quadratic knapsack problem. Here, too, the problem amounts to recommending a set of articles to some users in order to maximize a diversity index that takes into account users’ leanings and the strength of their connections in the social network. The authors show that the resulting diversity maximization problem is inapproximable and design a polynomial-time algorithm without an approximation guarantee.
2.1 Independent Cascade model
We introduce the well-known Independent Cascade model. We mostly follow the terminology and notation from Kempe et al. [KKT15]. We are given a directed graph $G=(V,E)$, probabilities $p:E\to[0,1]$, and an initial node set $A\subseteq V$ called a set of seed nodes. Define $A_0:=A$. For $t\ge 0$, we call a node $v\in A_t$ active at time $t$. If a node $v$ is active at time $t$ but was not active at time $t-1$, i.e., $v\in A_t\setminus A_{t-1}$ (formally let $A_{-1}:=\emptyset$), it tries to activate each neighbor $w$, independently, with a probability of success equal to $p_{vw}$. In case of success, $w$ becomes active at step $t+1$, i.e., $w\in A_{t+1}$. If at some time $t^*\ge 0$ we have that $A_{t^*}=A_{t^*+1}$, we say that the process has quiesced and call $t^*$ the time of quiescence. For an initial set $A$, we denote with $\sigma(A)=\mathbb{E}[|A_{t^*}|]$ the expected number of nodes activated at the time of quiescence when running the process with seed nodes $A$. Kempe et al. showed that this process is equivalent to what is referred to as the Triggering Model, see [KKT15, Proof of Theorem 4.5]. For a node $v$, let $N_v$ denote all in-neighbors of $v$. Here, every node $v$ independently picks a triggering set $T_v$ according to a distribution over subsets of its in-neighbors, namely $T_v=T\subseteq N_v$ with probability $\prod_{u\in T}p_{uv}\prod_{u\in N_v\setminus T}(1-p_{uv})$. For a possible outcome $X=(T_v)_{v\in V}$ of triggering sets for the nodes $V$, let $\rho_X(A)$ be the set of nodes reachable from $A$ in the outcome $X$. Note that after sampling $X$, the quantity $|\rho_X(A)|$ is deterministic. According to Kempe et al. [KKT15], this model is equivalent to the Independent Cascade model and it holds that $\sigma(A)=\mathbb{E}_X[|\rho_X(A)|]$, where the expectation is over the outcome profile $X$. While it is not feasible to compute $|\rho_X(A)|$ for all outcome profiles $X$, it is possible to obtain a $(1\pm\varepsilon)$-approximation to $\sigma(A)$, with probability at least $1-\delta$, by sampling $\Omega\!\left(\frac{n^2}{\varepsilon^2}\ln\frac{1}{\delta}\right)$ possible outcomes $X$, where $n=|V|$, and computing the average over the corresponding values $|\rho_X(A)|$, see [KKT15, Proposition 4.1].
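The sampling view of the Independent Cascade model can be illustrated with a short Python sketch. It is not taken from the paper: the function names, the edge-list representation `(u, v, p)`, and the sample count are assumptions made for illustration. Each sample keeps every edge independently with its probability (a "live-edge" outcome), and the spread is estimated by averaging the number of nodes reachable from the seed set.

```python
import random
from collections import deque


def sample_live_edges(edges, rng):
    """Sample one outcome: keep each directed edge (u, v, p) with probability p."""
    live = {}
    for u, v, p in edges:
        if rng.random() < p:
            live.setdefault(u, []).append(v)
    return live


def reachable(live, seeds):
    """Return the set of nodes reachable from the seeds along live edges (BFS)."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in live.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def estimate_spread(edges, seeds, samples=2000, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        total += len(reachable(sample_live_edges(edges, rng), seeds))
    return total / samples


# Hypothetical toy instance: a path a -> b -> c -> d with mixed probabilities.
EDGES = [("a", "b", 1.0), ("b", "c", 0.5), ("c", "d", 0.0)]
```

On the toy path, seeding `a` always reaches `a` and `b`, reaches `c` with probability one half, and never reaches `d`, so the exact expected spread is 2.5 and the estimate should be close to that value.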
2.2 The --Balance problem
Inspired by the work of Garimella et al. [GGPT17], we consider several information spread processes, which we also call “campaigns”, unfolding in parallel, each following the Independent Cascade model described above. Formally, we are given a graph and probability functions , where each is a probability function as in the Independent Cascade model described above, i.e., . (Footnote 1: For , we use to denote the set .) For an index , let be a possible outcome sampled using probabilities . Then for a seed set , we denote with the set of nodes reachable from in outcome . For an arbitrary sequence of subsets of , we define
to be the number of nodes that are contained in none or in sufficiently many, i.e., in at least , of the sets in . Let be an outcome profile by letting be a possible outcome according to distribution . Then, for with , we denote with the set of reached nodes in the outcome from seed sets . For two sequences of sets , , and a set , we let be the element-wise union and be the element-wise intersection with the set .
For constant integers , we consider the following optimization problem:
--Balance Input: Graph , probabilities , seed sets , and .
Find: sets with , such that is maximum, where We refer to the objective function simply by , in case , , and are clear from the context. We assume as otherwise the problem becomes trivial by choosing for every . Moreover, we assume w.l.o.g. that and , since and are input parameters and and are constant numbers. Following Garimella et al. [GGPT17], we distinguish two settings. (1) The heterogeneous setting corresponds to the general case in which there is no restriction on . (2) In the correlated setting, the distributions are identical and completely correlated for all . That is, if an edge propagates a campaign to , it propagates all campaigns that reach to .
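To make the objective concrete, the following Python sketch estimates it by sampling. It is an illustration under assumptions, not the authors' implementation: the names `balanced_count` and `estimate_objective`, the parameter `nu` (the threshold on the number of campaigns), and the data layout are hypothetical. Each entry of `seeds_per_campaign` is assumed to already contain the initial seeds together with any additional seeds chosen for that campaign. In the correlated setting a single live-edge outcome is shared by all campaigns; in the heterogeneous setting each campaign samples its own outcome.

```python
import random


def reach(live, seeds):
    """Nodes reachable from the seeds along live edges (iterative DFS)."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in live.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def sample_live(edges, probs, rng):
    """Keep edge (u, v) with its campaign-specific probability."""
    live = {}
    for (u, v), p in zip(edges, probs):
        if rng.random() < p:
            live.setdefault(u, []).append(v)
    return live


def balanced_count(nodes, reached_sets, nu):
    """Count nodes contained in none or in at least nu of the given sets."""
    count = 0
    for v in nodes:
        hits = sum(1 for r in reached_sets if v in r)
        if hits == 0 or hits >= nu:
            count += 1
    return count


def estimate_objective(nodes, edges, probs_per_campaign, seeds_per_campaign,
                       nu, correlated=False, samples=1000, seed=0):
    """Average the balanced count over sampled outcome profiles."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        # Correlated setting: one shared outcome for all campaigns.
        shared = sample_live(edges, probs_per_campaign[0], rng) if correlated else None
        reached = []
        for probs, seeds in zip(probs_per_campaign, seeds_per_campaign):
            live = shared if correlated else sample_live(edges, probs, rng)
            reached.append(reach(live, seeds))
        total += balanced_count(nodes, reached, nu)
    return total / samples
```

For a single edge with probability one half and two campaigns seeded at its tail, the correlated estimate is exactly 2 (the head is always reached by both campaigns or by neither), while the heterogeneous value is 1.5 in expectation, which shows why the two settings can behave differently.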
Decomposing the Objective Function.
In all of our algorithms, we use the approach of decomposing the objective function into summands and approximating the summands separately. For an outcome profile , and seed sets , we define , for , to be the set of nodes that are reached by exactly campaigns from the seed sets . Formally, for any value ,
where denotes the set . We write , if the initial seed sets are clear from the context. In the above definition, by convention an empty union is the empty set, while an empty intersection is the whole universe, here . Accordingly, we define
Note that is the expected number of nodes that are reached by or at least campaigns, resulting from nodes that have been reached by exactly campaigns from . Now, the objective function decomposes as
using linearity of expectation and that sets are disjoint. Furthermore, we will denote by
Again, denotes the expected number of nodes that are reached by sufficiently many campaigns or none of them resulting from nodes that have previously been reached by at least campaigns. Clearly, . For convenience, in what follows, we will often refer to as a set of pairs in , where picking pair into corresponds to picking into set . We fix the following observations:
For , is optimal when . The achieved value is the expected size of : .
For , the function is monotone and submodular.
A First Structural Lemma.
When applying the standard greedy hill climbing algorithm to find a set of size $k$ maximizing a monotone submodular set function, the key property used in the analysis is the following: at any stage of the greedy algorithm, there exists an element whose addition yields an improvement of at least a $1/k$ fraction of the difference between the optimal and the current solution value; compare, for example, [Hoc97, Lemma 3.13]. Perhaps the most important structural lemma underlying our algorithms is a very similar result for the functions .
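The greedy property referred to above can be checked on a small instance. The sketch below is an illustration only: it uses maximum coverage as a stand-in monotone submodular function, and the names `coverage`, `greedy`, `brute_force_optimum`, and `FAMILY` are hypothetical. The function `greedy` records, for every step, the value before the step and the gain achieved, so that the gain can be compared with a 1/k fraction of the remaining gap to the optimum.

```python
from itertools import combinations


def coverage(chosen, family):
    """Monotone submodular example: number of elements covered by chosen sets."""
    covered = set()
    for i in chosen:
        covered |= family[i]
    return len(covered)


def greedy(family, k):
    """Standard greedy hill climbing: add the index with largest marginal gain."""
    chosen, history = [], []
    for _ in range(k):
        current = coverage(chosen, family)
        best, best_gain = None, 0
        for i in range(len(family)):
            if i in chosen:
                continue
            gain = coverage(chosen + [i], family) - current
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no element improves the value any further
            break
        chosen.append(best)
        history.append((current, best_gain))
    return chosen, history


def brute_force_optimum(family, k):
    """Exact optimum by enumeration; only feasible for tiny instances."""
    return max(coverage(c, family) for c in combinations(range(len(family)), k))


# Hypothetical toy instance.
FAMILY = [{1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}, {5, 6, 7}]
```

On `FAMILY` with k = 2, every recorded gain is at least (OPT - current) / k, which is the inequality that drives the usual analysis of the greedy algorithm.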
Let and with and define . Then, satisfies where is an optimal solution of size to maximizing .
Let be an outcome profile and let be an arbitrary node in . Let us denote by the indicator function that is one if is reached by at least campaigns in outcome profile from seed sets and zero otherwise. We note that . Now, define , i.e., are the sets of nodes in of size . We now argue that the following inequality holds for and :
If the left hand side is not positive, the inequality holds, since the right hand side cannot be negative by monotonicity. Hence, assume that the left hand side is positive. In that case it holds that , but , i.e., in outcome profile , is reached by at least campaigns from seed sets but not from seed sets . For such , there must be a set such that adding to results in being reached by campaigns (recall that and thus is already reached by at least campaigns). Thus, there exists a set in that contributes a value of 1 on the right hand side and we may conclude that (1) holds. Now, using linearity of expectation and (1), we obtain
Using linearity of expectation again, we obtain that the right hand side above is equal to . Then, the statement follows by the maximality of and the fact that . ∎
The Correlated Case.
For the correlated setting, where probability functions are identical for all campaigns and the cascade processes are completely correlated, we introduce an additional function called . First note that in the correlated setting, the outcome profile in the definition of satisfies . In order to define , we introduce an additional fictitious campaign, call it campaign , that spreads with the same probability as the other campaigns. We extend the outcome with to contain also an identical copy and define by
Observe that measures the expected number of nodes that are either (1) reached by more than campaigns from or (2) are reached by at least one campaign from and are reached by the fictitious campaign from . Note that nodes from (1) are already reached by sufficiently many campaigns while nodes from (2) have been reached by some campaign from and, as witnessed by , can be reached from the nodes in . Note that is monotone and submodular in which follows directly from having these properties.
Approximating and .
As mentioned above, already in the standard independent cascade process, it is not feasible to evaluate the function exactly. However, can be approximated to within a factor of by sampling a polynomial number of times. A very similar approach works for approximating the functions and for . That is, there is an algorithm approx that, for , sets and , and parameters returns a -approximation of with probability . We prove this fact in Appendix A in Lemma 14. The proof relies on a Chernoff bound and is very similar to the original proof of Proposition 4.1 in [KKT15] for the -function.
All of our algorithms are of a greedy flavor, that is, we greedily choose sets in order to build the output set . We investigate the impact of the approximation on this approach in the following lemma. To this end, let be a function from and, for some , let be a -approximation of with , where depends on , namely for and for . We denote with the universe over which is defined, i.e., for , while for .
Let and be as above for some . Let , with , and let denote a set maximizing of size . Then, either
where , and .
We defer the proof to Appendix A. In summary: either already yields a -approximation of the optimum of or a set of size maximizing an approximation of can lead to a progress of at least an -fraction of the maximum progress possible.
Maximizing and .
Here, we fix the result that the standard greedy hill climbing algorithm, we refer to it as Greedy, can be applied in order to approximate both to within a factor of for any with probability at least for any . This is based on the fact that these functions are submodular and monotone set functions. See Appendix A for a pseudo-code implementation and a proof of the submodularity property. Since we can only evaluate and approximately, we obtain the additive -term.
Let and let and . With probability at least , Greedy returns satisfying where is an optimal solution of size to maximizing .
3 Hardness of Approximation for the Heterogeneous Case
We now let be a constant. In this section, we show that in the heterogeneous setting for , the --Balance problem is as hard to approximate as the Densest--Sub--hypergraph problem [CDK18]. Notably, this result has the following consequences: if there is no -approximation algorithm with for --Balance under the Gap Exponential Time Hypothesis (Gap-ETH). For general , we get that there is no -approximation algorithm for a given constant which depends on under the assumption that a particular class of one way functions exists [App13]. We recall the definition of the Densest--Sub--hypergraph problem.
Densest--Sub--hypergraph Input: -Regular Hypergraph , integer .
Find: set with , s.t. is maximum, where
A -regular hypergraph is a hypergraph in which all hyperedges are composed of exactly vertices, where is a constant. When , Densest--Sub--hypergraph is known as the Densest--Subgraph problem. For the hardness of approximation proof, we consider the following transform of an instance of the Densest--Sub--hypergraph problem into an instance of the --Balance problem.
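As a concrete reading of the source problem of the reduction, the following Python sketch evaluates the objective of a vertex subset, namely the number of hyperedges fully contained in it, and solves tiny instances by enumeration. The names `induced_edges` and `densest_k_brute_force` are hypothetical, and the sketch assumes that exactly k vertices are selected, which loses nothing because the objective is monotone in the subset.

```python
from itertools import combinations


def induced_edges(hyperedges, subset):
    """Number of hyperedges all of whose vertices lie in the subset."""
    s = set(subset)
    return sum(1 for e in hyperedges if set(e) <= s)


def densest_k_brute_force(vertices, hyperedges, k):
    """Try every k-subset of vertices; exponential, for illustration only."""
    best = max(combinations(vertices, k),
               key=lambda s: induced_edges(hyperedges, s))
    return set(best), induced_edges(hyperedges, best)


# Hypothetical toy instance in which every hyperedge has exactly 3 vertices.
VERTICES = [1, 2, 3, 4, 5]
HYPEREDGES = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4), (3, 4, 5)]
```

With k = 4 the subset {1, 2, 3, 4} contains four of the five hyperedges, and no other 4-subset contains more than two.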
Define , where , i.e., for each node , we get a node in . Moreover, let , and be the set of permutations of ; we then define as i.e., for each edge , we create nodes, where and . That is, each set of campaigns in , induces nodes for each in .
The initial seed sets are defined as , .
The budget is the same as in the Densest--Sub--hypergraph problem, i.e., .
Note that each node in is already covered by campaigns and that the instance generated is deterministic, in the sense that probability values are either 0 or 1.
Let us now fix a --Balance instance resulting from the transform as image of a Densest--Sub--hypergraph instance . Clearly, is of cardinality and is of cardinality . Let us denote by the set of feasible solutions for . For each , it holds that the objective function can be decomposed as where
for being the only possible (deterministic) outcome profile. Now, let , , and denote optimal solutions to the problem of maximizing , , and , respectively, over . The following lemma, whose proof can be found in Appendix B, collects three statements. The first statement says that an optimal solution to also maximizes . The second statement says that there exists a feasible solution to which achieves at least a multiple of of the objective value in Densest--Sub--hypergraph with . In the third statement, we observe that from a feasible solution to , we can construct a feasible solution to while losing only a factor of in objective value.
An optimal solution to also maximizes , i.e., .
It holds that , where is the optimal value of Densest--Sub--hypergraph in and .
Given , we can, in polynomial time, build a feasible solution of such that .
We are now ready to show the following relations between the complexity of the two problems. Note that the assumption that is non-increasing is w.l.o.g.
Let , , and , then we have the following two cases:
Case : Let with being non-increasing, and and .
Case : Let where is a constant which depends on , and , with .
In both cases the following statement holds: If there is an -approximate algorithm for the deterministic --Balance problem, then there is a -approximate algorithm for Densest--Sub--hypergraph. Here and denote the number of vertices in the --Balance and the Densest--Sub--hypergraph problems, respectively and .
Let be an instance of the Densest--Sub--hypergraph problem and let be the instance of the --Balance problem obtained by the transform . For brevity, let and . Moreover, let be an -approximate solution to , that is . We show how to construct a -approximate solution to .
Using Lemma 4, (3), we obtain a feasible solution to with . We proceed by lower-bounding . We can w.l.o.g. assume that and that . Indeed, if then and we can build in polynomial-time a better solution by identifying one edge and propagating campaign in . This further implies that as . We obtain
Case : Since is non-increasing, we get where we used and (as ). This completes this case.
Case : In this case where we used and . This completes this case.∎
To sum up, our reduction shows that: (1) as Densest--Sub--hypergraph cannot be approximated within for some constant which depends on , if a particular class of one way functions exists [App13], we have shown that the same hardness result holds for any --Balance problem with ; (2) moreover as Densest--Subgraph cannot be approximated within , if the Gap-ETH holds [Man17], we have shown that the same hardness result holds for any --Balance problem with .
Other approximation hardness results exist for Densest--Subgraph. We review them here, highlighting the hardness results that our reduction implies in each case.
Densest--Subgraph cannot be approximated within any constant, if the Unique Games with Small Set Expansion (UGSSE) conjecture holds [RS10]. Therefore, under the UGSSE conjecture, the reduction given above shows that any --Balance problem with cannot be approximated within any constant.
Densest--Subgraph cannot be approximated within , for some constant if the exponential time hypothesis holds [Man17]. Under the same conjecture, our reduction implies the same hardness result for any --Balance problem with .
4 Approximation Algorithm for the Heterogeneous Case
Our approach for maximizing decomposes it as and works on each summand separately. In the following two subsections, we give two different algorithms for maximizing . At the end of the section, we show how to combine them.
Greedily Picking Tuples.
In this paragraph, we present GreedyTuple that, for given , computes a solution to maximizing . For the algorithm is identical to the standard greedy hill climbing algorithm. For the general case of , we will show the following theorem. The algorithm is inspired by a greedy algorithm, called Greedy1, due to [DOS18] for solving the so-called maximum coverage with pairs problem.
Let , , and . If , with probability at least , the algorithm GreedyTuple returns a solution satisfying where is an optimal solution to .
We let denote the set at the end of iteration of the algorithm. The main idea underlying the analysis of GreedyTuple is very much related to the analysis of the standard greedy algorithm. That is (ignoring the approximation issue), every step of the algorithm incurs a factor of . For , this coincides with the standard case.
Let , , and . With probability at least , after each iteration of Algorithm 1, it either holds that
Proof of Theorem 6.
Let denote the set returned by the algorithm. Clearly, , where denotes the number of iterations of the while loop in the algorithm. By assumption and thus . Using Lemma 7 for yields that either or
For the former case, note that is greater than the approximation factor required by the theorem. For the latter case note that, as for any real , we have
where the last inequality uses that and for any . This completes the proof. ∎
Being Iteratively Greedy.
Recall that, at the beginning of this section, we have defined . We now extend this notation by letting
where and ; we will mainly be working with the case . The function measures the expected number of nodes that are reached by at least campaigns from within the set of nodes that have originally been reached by at least campaigns from . Our goal now is to maximize through the following iterative scheme: for from to , we find sets of size maximizing , where . That is, in the iteration, we maximize the number of nodes reached by campaigns that have previously been reached by at least campaigns. The approach is motivated by the observation that, for any and initial sets , the function is monotone and submodular in , compare with Section 2.2 where we used this fact for . Using Lemma 3 applied to with we get that the standard greedy algorithm can be used in order to obtain a -approximate solution when maximizing . Note that our algorithm, called GreedyIter is inspired by a similar greedy algorithm called Greedy2 from [DOS18] that is used there for the maximum coverage with pairs problem. We will prove the following theorem in this section.
Let and . With probability , GreedyIter returns satisfying