The problem of selecting influential users in a network has been extensively studied in the past decade (Wang et al., 2015; Sun and Ng, 2012; Trusov et al., 2010). These works mainly focus on maximizing diffusion in a given network. If we assume that users of the network are willing to be selected by the mechanism, it is desirable that the selection mechanism not only have good diffusion performance, but also be incentive compatible; namely, that a user cannot affect his probability of being selected by strategically erasing (or forming) links. For instance, suppose that Twitter would like to start a new epidemic trend. It would like to find a very influential user and give him some benefit in order to kick off its campaign. If the users are aware of these intentions, they might remove or add outgoing links in order to get chosen. A similar problem is to select an academic paper in a particular field and award it as ‘the most influential’ in the field, measuring influence according to citations from other papers and taking into account both direct and propagated influence.
The following simple example demonstrates the tension between diffusion maximization and incentive compatibility.
A directed edge from $u$ to $v$ indicates that $u$ follows (or cites) $v$. In any plausible diffusion model, the two networks have different most influential vertices. Suppose the mechanism simply selects the most influential vertex. Then a user, knowing the structure of the network, might have an incentive not to follow another user (although she is interested in that user’s content) in order to be selected by the mechanism.
As a first step towards understanding incentive compatible diffusion mechanisms, our main focus in this paper is the problem of selecting a single user when the graph is acyclic. Note that in environments where users arrive in chronological order (e.g., scientific papers or scale-free networks; see, for example, Dorogovtsev et al., 2000; Dorogovtsev and Mendes, 2002), acyclic graphs are indeed a good proxy for the actual environment.
One of the simplest ideas for developing mechanisms (hopefully incentive compatible) that select vertices with (hopefully) high diffusion is the following: choose a node $u$ at random and ask him to name a single user $v$ that he follows. Informally speaking, by arguments in the spirit of the friendship paradox (Feld, 1991; Lattanzi and Singer, 2015), we expect $v$ to have better diffusion than $u$. We can then ask $v$ who his friend is, and so on. Thus, proceeding along a path increases (in expectation) the influence of the observed user. Two obstacles arise with this idea:
When do we stop the process?
What if a node reports that he does not follow anyone? (Note that we cannot simply stop and select him, as this would incentivize every node to report that he follows no one.)
The Two Path mechanism that we suggest is based on this simple idea, but instead of tracking a single path, it tracks two. Regarding the first obstacle, we now have a natural candidate for the selected user: the first intersection of the two paths. Regarding the second obstacle, if both paths end without intersecting, we simply re-execute the process with the modification that all previously tracked users cannot be selected (this is needed for incentive compatibility). Note that the mechanism, as informally presented, is very simple to implement in practical settings, and it requires only partial knowledge of the network (i.e., full revelation of information by all users is not needed).
Our results are as follows. In Proposition 3.1 we show that the Two Path mechanism is indeed incentive compatible on DAGs. In Theorem 3.3 we show that the Two Path mechanism performs very well on trees, achieving an approximation ratio of 1.5. In Theorem 3.5 we show that for forests which are balanced (see Definition 3.6), the Two Path mechanism achieves a constant approximation ratio (the balancedness condition is necessary, as demonstrated in Example 3.8). However, on DAGs (even balanced ones), the Two Path mechanism might not perform so well, as demonstrated in Example 4.1. In Section 4 we develop a mechanism based on an analysis of the distribution induced by Two Path, and achieve a constant approximation ratio for DAGs that are both balanced and monotone (i.e., DAGs in which a user is always more influential than his followers; see Theorem 4.4). In Section 5 we extend the Two Path mechanism to an incentive-compatible mechanism for general networks. Test results of this extension and of the Two Path mechanism on simulated scale-free graphs are presented in Section 6.
1.1. Related work
Our work is a merger of two branches of literature: the study of incentive-compatible (IC) selection mechanisms and the study of diffusion models in networks. In this paper we offer, for the first time, IC mechanisms which aim to maximize the overall diffusion.
1.1.1. Incentive-compatible mechanisms
Incentive compatibility has been studied in many different contexts. The most relevant to ours are the following.
Alon et al. (2011) and Fischer and Klimm (2014) studied incentive compatible mechanisms in networks. The goal there is to select a vertex with a good approximation to the maximal in-degree. The in-degree can be viewed as a ‘one-step diffusion’. In our work, on the other hand, we focus on a more complex diffusion process.
Holzman and Moulin (2013) introduced an axiomatic approach to the problem of selecting a prize-winning paper based on peer reviews. In their model, each agent reports one other nominee for a prize and the mechanism selects one winner based on these votes. They presented a set of desirable features a selection mechanism should possess, and asked which subsets of these features may or may not be satisfied together with incentive compatibility. Their paper focuses on self-selection of a winner, which again can be viewed as ‘one-step diffusion’.
More recent works with the same theme can be found in Kurokawa et al. (2015), Aziz et al. (2016) and Tamura (2016).
1.1.2. Diffusion in networks
The diffusion of information in networks (and more specifically, social networks) has been extensively studied ever since the seminal paper of Kempe, Kleinberg and Tardos (2003). This literature focuses on maximization of diffusion in a given network. In our work we assume that the designer is unaware of the network structure and elicits this information from the users. In (Kempe et al., 2003), the authors presented two diffusion models, linear threshold and independent cascade, and offered an algorithm to select a $k$-subset of the vertices. Their purpose was to show an algorithmically efficient (i.e., polynomial-time) mechanism with a finite approximation ratio, but not necessarily an IC one.
Our influence model relates the influence of a vertex $u$ over a vertex $v$ to the probability of reaching $u$ in a random walk starting at $v$. Using random walks to measure popularity or influence is not new either: Gualdi, Medo and Zhang (2011) used the same influence model to rank the influence of academic papers (in their model the graph is acyclic); the popular search engine Google uses random walks in its ranking algorithm, PageRank (Brin and Page, 1998); and Andersen et al. (2008) used random walks in their suggestion of a trust-based recommendation system.
Further reading on networks, their structures and dynamics can be found in the monograph of Easley and Kleinberg (2010) and in (Bornholdt and Schuster, 2003).
2. The model
2.1. Network and Diffusion
A network is a directed graph $G=(V,E)$, where a vertex is a Twitter user (academic paper) and an edge from $u$ to $v$ means that $u$ follows (quotes) $v$. We denote the in-neighbours (followers) and the out-neighbours (followees) of a vertex $v$ by $N^{in}(v)$ and $N^{out}(v)$, respectively, and its out-degree by $d(v)=|N^{out}(v)|$. Our notion of diffusion is defined as follows. We denote by $P(v,u)$ the family of all simple paths (no vertex repetition) from $v$ to $u$. The influence of $u$ on $v$ is defined to be
$$ I(u,v) = \sum_{p \in P(v,u)} \; \prod_{w \in p,\ w \neq u} \frac{1}{d(w)}, $$
with $I(u,u)=1$, and the total influence of $u$ is
$$ I(u) = \sum_{v \in V} I(u,v). $$
As we explain below, this diffusion model neatly relates the influence of a user to the probability of reaching it in a random walk, and is closely related to other well studied models of diffusion.
The rationale behind our notion of diffusion is as follows.
If $v$ follows only $u$, then he is a groupie of $u$ and there is a high probability that he will be affected by any trend introduced by $u$. If, on the other hand, $v$ follows $u$ and 99 other users, then $u$’s influence over $v$ is much lower. Moreover, if $u$ influences $v$ and $v$ in turn influences $w$, then $u$ has some indirect influence over $w$. Concretely, we assume that each user divides his attention uniformly among those he follows. A message from a user diffuses backwards along each path with probability equal to the product of the ‘attention’ weights on the edges.
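As an illustration, the influence measure just described can be computed directly on an acyclic network by following out-edges and multiplying attention weights. The following is a sketch under our own conventions (a graph is a dict mapping each vertex to the list of users it follows; the function names are ours):

```python
def influence_on(graph, u, v):
    """Probability that a message posted by u reaches v: the sum, over
    paths from v to u, of the product of the attention weights
    1/out-degree along the path. Assumes the graph is acyclic."""
    if v == u:
        return 1.0
    if not graph[v]:
        return 0.0
    share = 1.0 / len(graph[v])  # v splits attention uniformly
    return share * sum(influence_on(graph, u, w) for w in graph[v])

def total_influence(graph, u):
    """Total influence of u: its influence summed over all vertices."""
    return sum(influence_on(graph, u, v) for v in graph)
```

For instance, if b follows a, and c follows both a and b, then the total influence of a is 3 (c is affected with probability 1/2 directly plus 1/2 through b), while that of b is 1.5.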
To gain some intuition about the notion of diffusion, we demonstrate an example of a network and calculate the influence of each vertex.
Example 2.1 ().
Consider the following Twitter network with six nodes.
The edge weights in blue denote the ‘attention fraction’ of each link. For example, node follows and , thus each link has a weight of 1/2. Suppose that node posts a tweet with a recommendation for a new product. Our model assumes that with probability 1/3 node will be affected. If is affected, then with probability 1 is affected. If is affected, then with probability 1/2 is affected. Node will be affected either directly from or from through , thus his probability of being affected is 1/3. We get that starting with , the expected diffusion is (1 for , for and for ). Similarly, the influence of :
The influence of the rest of the nodes:
We note here the relation of our influence measurement to other popular measures.
A random path is generated by selecting a vertex uniformly at random and performing a random walk: each time selecting an out-edge uniformly at random, and stopping when we reach a previously visited vertex or a vertex with no out-edges. We can equivalently define the influence $I(u)$ to be the probability of visiting $u$ in a random path, multiplied by $n$.
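This equivalence suggests a simple Monte Carlo estimate of the influence; a sketch under the same assumed dict representation (helper names are ours):

```python
import random

def random_path(graph):
    """Start at a uniformly random vertex, repeatedly follow a
    uniformly random out-edge, and stop at a vertex with no out-edges
    or upon returning to an already visited vertex."""
    v = random.choice(list(graph))
    path, seen = [v], {v}
    while graph[v]:
        v = random.choice(graph[v])
        if v in seen:
            break
        path.append(v)
        seen.add(v)
    return path

def mc_influence(graph, u, trials=50000):
    """Estimate the influence of u as n times the probability that a
    random path visits u."""
    hits = sum(u in random_path(graph) for _ in range(trials))
    return len(graph) * hits / trials
```

On the two-vertex graph where b follows a, every random path visits a, so the estimate for a is exactly 2, matching the definition.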
The progeny of a vertex $u$ is the number of vertices which have a path to $u$ (excluding $u$ itself). If the graph is a forest, then $I(u)$ is the progeny of $u$ plus one. Moreover, for any graph, consider the random graph generated by picking for each vertex, uniformly at random, one of its out-edges, and removing the rest of its edges. Then $I(u)$ is the expected progeny of $u$ in this random graph, plus one.
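The second equivalence can also be checked numerically; the sketch below (our own helper names, acyclic input assumed) averages the progeny of a vertex over random subgraphs that keep one out-edge per vertex:

```python
import random

def influence_via_progeny(graph, u, trials=20000):
    """Estimate the influence of u as 1 plus the expected progeny of u
    in a random subgraph that keeps, for each vertex, one uniformly
    chosen out-edge. Assumes the graph is acyclic."""
    hits = 0
    for _ in range(trials):
        pick = {v: random.choice(outs) if outs else None
                for v, outs in graph.items()}  # one out-edge per vertex
        for v in graph:
            w = v
            while w is not None and w != u:  # unique forward walk from v
                w = pick[w]
            if w == u and v != u:
                hits += 1  # v belongs to the progeny of u
    return 1 + hits / trials
```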
Google’s PageRank (Brin and Page, 1998) is an algorithm which takes as input a digraph and a damping factor, and outputs a probability distribution on the vertices; this distribution represents the likelihood that an infinite random walk arrives at any particular vertex. At each step, with probability equal to the damping factor the random walk continues along an out-edge, and with the complementary probability it jumps to a uniformly random vertex. Denote the PageRank value of a vertex with damping factor 1 by $PR(\cdot)$. Then it can be expressed as (PRW, 2017)
If $G$ is a directed acyclic graph (DAG), then we can relate the influence of a vertex to the influence of its neighbours,
The similarity of these two equations is visible. For example, we can deduce that they induce the same ranking. For any define . Clearly induces the same ranking as ; but
and as , this definition converges to that of .
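For concreteness, here is a standard power-iteration sketch of PageRank with a damping factor (the function name and the sink-handling convention of redistributing mass uniformly are our own assumptions):

```python
def pagerank(graph, d=0.85, iters=200):
    """Power iteration for PageRank: with probability d the walk
    follows a uniformly random out-edge, and with probability 1-d it
    jumps to a uniformly random vertex."""
    n = len(graph)
    pr = {v: 1.0 / n for v in graph}
    for _ in range(iters):
        nxt = {v: (1 - d) / n for v in graph}
        for v, outs in graph.items():
            targets = outs if outs else list(graph)  # sink: spread uniformly
            for w in targets:
                nxt[w] += d * pr[v] / len(targets)
        pr = nxt
    return pr
```

On a symmetric two-vertex cycle, the stationary distribution is uniform, as expected.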
Finally, consider the Independent Cascade model of diffusion (Goldenberg et al., 2001a,b; Kempe et al., 2003). Independent Cascade is defined for weighted digraphs with weights in the interval $[0,1]$: take a random graph by keeping each edge independently with probability equal to its weight; the Independent-Cascade diffusion measure of a vertex is then its expected progeny in this random graph, plus 1. To see the difference from our model, consider the graph with two vertices $u$ and $v$, and two parallel edges from $v$ to $u$, each with weight 1/2. Our model gives $u$ an influence value of 2 (one for itself and one for $v$), while the Independent-Cascade score of $u$ is 7/4 (since there is only a 3/4 chance that the message will reach $v$).
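The Independent-Cascade score in this two-vertex comparison can be checked with a short simulation (a sketch; we assume each of the two parallel edges has weight 1/2, so the score of the source is 1 plus the probability that at least one edge is realised):

```python
import random

def ic_score_parallel_edges(w=0.5, trials=60000, seed=0):
    """Independent-Cascade score of the source in a two-vertex graph
    with two parallel edges of weight w from the other vertex: 1 for
    the source plus the chance that at least one edge survives."""
    rng = random.Random(seed)
    hits = sum((rng.random() < w) or (rng.random() < w)
               for _ in range(trials))
    return 1 + hits / trials
```

The simulation gives a value near 7/4, whereas our influence measure assigns the source a value of 2.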
2.2. Incentive Compatibility
Next, we define what a selection mechanism is and the properties we will be interested in.
Definition 2.2 ().
A selection mechanism is a function which assigns to every graph a probability distribution over the vertices together with an empty-set value.
The empty-set value means that the mechanism has not selected any vertex. We denote the probability that the mechanism picks a given vertex when the input is $G$, and the expected influence of the selected vertex (we take the influence of the empty-set value to be 0 in the calculation of the expectation). When it is clear from the context which graph $G$ is meant, we sometimes omit it. Let $OPT(G)$ be the maximal influence in $G$; the vertices attaining it are the optimal vertices.
Assume that $I(u)$ is also the payoff function of $u$. We would like our mechanism to be such that for any user, reporting his true out-edges is a best action. In addition, we would like our mechanism to guarantee a bounded ratio between the maximal influence and the expected influence of the selected vertex. Our main mechanism is intended for the subfamily of directed acyclic graphs (DAGs); that is, only when the reported graph is in this subfamily do we require the mechanism to be IC and to have a bounded ratio.
Definition 2.3 ().
A selection mechanism for the family of graphs is:
incentive-compatible (IC), if no user can increase his probability of being selected by misreporting: for every graph in the family and every user $u$, $u$'s probability of being selected under a true report is at least his probability under any graph obtained from the true one by adding and removing outgoing edges of $u$ (note that we do not require the modified graph to remain in the family);
efficient with approximation ratio $\alpha$, if for every graph in the family, the maximal influence is at most $\alpha$ times the expected influence of the selected vertex.
We consider the following four nested families.
Definition 2.4 ().
Let $G$ be a directed graph.
$G$ is a tree if there is a unique vertex, called the root, with no out-edges, and the rest of the vertices have precisely one out-edge.
$G$ is a forest if it is a disjoint union of trees.
$G$ is monotone if for every edge from $u$ to $v$, $I(u) \leq I(v)$ (observe that a forest is always monotone, and that a monotone graph is acyclic).
$G$ is a directed acyclic graph (DAG) if it contains no cycles.
3. The Two Path Mechanism
We now present the algorithm of the Two Path mechanism. A random path is a random walk which starts at a random vertex and ends when we reach a vertex with no out-edges or when we return to a previously visited vertex. The idea of the Two Path mechanism is to start two independent random paths from two randomly chosen vertices. If they intersect at an ‘unmarked’ vertex, it is selected; if they intersect at a ‘marked’ vertex, the mechanism returns ‘null’; if they do not intersect, all the vertices on these paths are marked, and the mechanism repeats. After presenting the algorithm we prove that it is IC in the family of DAGs; we then analyse its approximation ratio in the family of trees and in the family of forests.
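A minimal sketch of the mechanism as just described (the graph is assumed to be a dict mapping each vertex to its reported followees; we take ‘first intersection’ to mean the earliest common vertex along the first path; names are ours):

```python
import random

def random_path(graph):
    """Uniform random start, then uniform random out-edges; stop at a
    vertex with no out-edges or on revisiting a vertex."""
    v = random.choice(list(graph))
    path, seen = [v], {v}
    while graph[v]:
        v = random.choice(graph[v])
        if v in seen:
            break
        path.append(v)
        seen.add(v)
    return path

def two_path(graph, max_stages=10000):
    """Draw two independent random paths; select their first
    intersection if it is unmarked, return None ('null') if it is
    marked, and otherwise mark both paths and repeat."""
    marked = set()
    for _ in range(max_stages):
        p1, p2 = random_path(graph), random_path(graph)
        common = set(p1) & set(p2)
        if common:
            first = next(v for v in p1 if v in common)
            return first if first not in marked else None
        marked.update(p1, p2)
    return None
```

On a tree the two paths always meet, so a vertex is selected in the first stage.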
Proposition 3.1 ().
The Two Path mechanism is IC in the family of DAGs.
Proof. Notice that a vertex can be selected only in the first stage in which it is queried for its out-edges (afterwards it is either selected or marked). It is enough, then, to show that at the first time a vertex is reached by one of the random paths, reporting its true edges is a best action. Suppose one of the paths reaches a vertex $u$ and we query $u$ for its out-edges for the first time. Vertex $u$ can be selected only if the other path reaches it before reaching any other vertex of the first path. Since the graph has no cycles, a true report by $u$ leads to vertices which the other path can reach only after it has already ‘missed’ $u$. Hence reporting its true edges cannot hurt $u$’s chances of being selected; surely, reporting additional edges cannot help it either. If the other path reaches $u$ while it is unmarked, then $u$ is selected regardless of the edges it reports; otherwise it will never be selected, again regardless of its report. Hence the mechanism is incentive compatible. ∎
The Two Path mechanism is generally not IC when the graph contains cycles, as the next example demonstrates.
Example 3.2 ().
Take the graph with two vertices, $u$ and $v$, and the two edges $(u,v)$ and $(v,u)$. Since every random path includes both vertices, the winner is determined by the starting vertex of the first path, so each vertex is selected with probability 1/2. However, if $u$ removes the edge $(u,v)$, then $u$ is selected whenever either of the two paths starts from $u$. Thus $u$ has an incentive to remove its edge to $v$.
3.1. Two Path on the family of trees
When the graph is a tree, every two paths intersect, and the mechanism is guaranteed to return a vertex in the first stage. Suppose the tree is a path with $n$ vertices. The mechanism then returns whichever of the two random starting vertices is further along the path. Thus, the expected influence of the selected vertex in this case is that of the maximum of two independent uniform draws, roughly $2n/3$. In the next theorem we show that this is the worst scenario when the graph is a tree. Hence, we find that the exact approximation ratio on trees is 1.5. This theorem is fundamental for the proof of the bound on the approximation ratio on forests (Theorem 3.5).
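The path case can be verified exactly with a short computation (a sketch; we use the fact that on a directed path the vertex at position k, counted from the start, has influence k, and the mechanism returns the further of two uniformly random starting positions):

```python
def expected_two_path_on_path(n):
    """Expected influence of the vertex selected on a directed path
    with n vertices: position k is the max of two uniform draws with
    probability (2k - 1)/n^2, and has influence k."""
    return sum(k * (2 * k - 1) for k in range(1, n + 1)) / n ** 2

def approx_ratio_on_path(n):
    """Ratio between the maximal influence n and the expected result."""
    return n / expected_two_path_on_path(n)
```

As n grows, the ratio tends to 1.5.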
Theorem 3.3 ().
In the family of all trees, .
Proof. For a vertex $v$ we denote by $T_v$ the subtree which is under $v$ (including $v$). Since all the vertices, except the root, have one out-edge, the influence of $v$ is $|T_v|$. Vertex $v$ is selected if and only if both random starting vertices lie in $T_v$ and they are not in the same proper subtree of $T_v$. We get:
We define the function from the family of all trees of order to by:
Lemma 3.4 ().
Let be the path with vertices. Then,
We first show that the lemma completes the proof. Indeed, if the sum is maximized when the tree is a path, then the expected influence of the mechanism is minimized in that case. When the tree is a path, the mechanism always returns whichever of the two starting vertices is further along the path. The expected value of the mechanism in this case is, therefore, the expected influence of the maximum of two independent uniform starting positions. We get
In fact, , hence we have found the exact value of in trees. ∎
It remains to prove the lemma.
Proof of Lemma 3.4.
Let be any tree of order which maximizes . Assume for contradiction that is not a path. Then there is a vertex such that there are two paths: , such that are leaves and all have in-degree one. Let be the tree in which we remove the edge and add the edge . It is enough to show that . Notice that the only vertices which have a different contribution to the sums in and are (since is different) and . Hence,
3.2. Two Path on the family of forests
Let be a forest. Denote by the set of roots of . We denote . Suppose the number of trees is ‘large’. Then in a single stage, there is a high probability that the two random paths will be in different trees, and those two paths will be marked. We claim, however, that if the distribution of the orders of the trees is reasonably concentrated near the average, there is a positive probability that the first time the two paths intersect, they intersect in a tree in which no vertex is marked. This will imply, together with Theorem 3.3, a bound on the approximation ratio. To be precise, define
the average influence of the sinks. We will prove the following.
Theorem 3.5 ().
For any forest ,
We define the following measure for the balance of a forest.
Definition 3.6 ().
For , a forest is -balanced if,
This definition formally captures the idea of ‘reasonable distribution’ of the trees’ orders mentioned above. Thus Theorem 3.5 implies the following bound on the approximation ratio.
Corollary 3.7 ().
In the family of -balanced forests,
Note that our purpose here is to show that the approximation ratio can be bounded using a natural parameter of the graph. Although our theoretical bound might be quite high, our simulations in Section 6 demonstrate that the actual approximation ratio on natural classes of networks is usually low.
Before turning to the proof, we show in the next example that when the forest is not balanced, the approximation ratio of the Two Path mechanism cannot be bounded.
Example 3.8 ().
Consider a forest made of one star of order and isolated vertices.
The centre of the star will be selected only if both paths hit the star for the first time in the same stage. With high probability, this event will not happen (in each of the first rounds, the probability that the centre is selected is much smaller than the probability that it is marked; hence the probability that the centre is ever selected is small, while the probability that it is marked is close to 1). Hence, the result will be either ‘null’ or a vertex with influence one.
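This effect can be reproduced numerically with a small simulation (a sketch reusing a minimal implementation of the Two Path mechanism; the sizes below are our own illustrative choices):

```python
import random

def random_path(graph):
    """Uniform random start, then uniform random out-edges."""
    v = random.choice(list(graph))
    path, seen = [v], {v}
    while graph[v]:
        v = random.choice(graph[v])
        if v in seen:
            break
        path.append(v)
        seen.add(v)
    return path

def two_path(graph, max_stages=10000):
    """Select the first intersection if unmarked; None if marked."""
    marked = set()
    for _ in range(max_stages):
        p1, p2 = random_path(graph), random_path(graph)
        common = set(p1) & set(p2)
        if common:
            first = next(v for v in p1 if v in common)
            return first if first not in marked else None
        marked.update(p1, p2)
    return None

# a star with 20 leaves pointing to its centre, plus 300 isolated vertices
star_forest = {'centre': []}
star_forest.update({f'leaf{i}': ['centre'] for i in range(20)})
star_forest.update({f'iso{i}': [] for i in range(300)})

# the centre is rarely selected, although its influence (21) is maximal
wins = sum(two_path(star_forest) == 'centre' for _ in range(100))
```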
Proof of Theorem 3.5.
For any , we denote by the tree of . We define to be the event that the mechanism has picked a vertex from without marking before. That is, is the event that the first time the two paths meet, they meet in , and none of the paths in previous stages was in . From Theorem 3.3 we know that
Denote the probability that the two paths intersect in a single stage by . Let be the probability that in a single stage, the two paths did not intersect and none of them was in . Then,
Now, since the events are pairwise disjoint, we may bound
In order to bound the last sum, we partition it into two parts:
and bound this sum for each part separately.
where the last inequality is due to the convexity of . Let be the solution of . Since , either or . In the former case we bound
and in the latter case we bound
Since , we can use convexity to bound ,
Hence, we get that in any case,
4. Mechanism for monotone DAGs
Recall that we defined a monotone graph to be a graph in which every user is more influential than his followers. Clearly, a monotone graph must be acyclic. Monotonicity is a natural property in domains where the statement ”my friend is more influential than I am” holds for every vertex.
Let be a DAG. Denote by the set of sinks of , i.e. the set of vertices with no out-edges. We also denote . In a forest, the set of sinks is the set of roots and, as we proved in Theorem 3.5, the Two Path mechanism achieves a constant approximation for balanced forests. The next example shows that this claim does not extend to all monotone DAGs. Later in this section we present a mechanism, somewhat related to the Two Path mechanism, which generalizes Theorem 3.5 to monotone DAGs.
Example 4.1 ().
Consider a matrix of vertices and another vertex, . Suppose each vertex in row has an edge to every vertex in row , and the vertices of row all have a single edge to vertex .
The vertices of row all have influence , and vertex has influence . Thus, this graph is monotone. Since is the only sink, . However, . Indeed, with high probability, both will be somewhere in the first rows of the matrix. Since, for each of the top rows the random paths have an independent probability of to intersect, we get that with high probability the two paths will intersect before reaching . We therefore get that for this monotone DAG, the Two Path mechanism does not have a bounded approximation ratio.
The mechanism which we are about to suggest for monotone DAGs will not be described as an algorithmic procedure, but rather as an explicit distribution formula. We obtain this formula by first finding an explicit expression for the distribution induced by the Two Path mechanism, and then generalizing it in a natural way to monotone DAGs. We start by finding the distribution when the graph is a tree. In this case, the probability that two independent random paths intersect at a given vertex is precisely . However, this is not the selection probability, because a vertex is selected only if it is the first intersection of the two paths. Define the recursive function:
where is the progeny set of , i.e. all vertices which have a path to (not including itself). We can then write
and we have found an explicit expression for the distribution induced by . We remark that it is not hard to prove, using simple induction, that
Now suppose that is a forest. Observe the following:
The probability that is selected in the first stage is .
In every subsequent stage, will have probability to be selected, provided the two paths did not intersect in previous stages and was not marked.
Let denote the probability that in a single stage the two paths intersect. Let be the graph we get from after removing all the out-edges of . Then is the probability that in a single stage the two paths intersected but none of them went through , unless is the intersection vertex (this is because is a root vertex in ). Thus, if we denote by the probability that in a single stage the two paths did not intersect and none of them went through , then
We conclude that
We will now use this last expression as a baseline for our ‘analytic’ Two Path mechanism, denoted .
Let be a DAG. In a forest, the influence of one vertex over another is either 1 or 0. This is no longer the case for DAGs. To account for this difference, we alter the function we defined in (3):
Now, our mechanism for monotone DAGs, denoted , is defined by the following distribution:
when is a monotone DAG; if is not a monotone DAG, the mechanism returns .
Proposition 4.2 ().
Mechanism is well-defined and incentive compatible in the family of monotone DAGs.
We will need the following lemma.
Lemma 4.3 ().
For any monotone DAG and for every ,
If is a sink then . Assume then that is not a sink. On the one hand, the influence of some of the sinks of is lower in , but on the other hand is an extra sink which was not in . Thus,
Now, , and by monotonicity . Hence, we have,
and the upper bound follows.
For the lower bound it is enough to observe that by Cauchy-Schwarz,
Proof of Proposition 4.2.
Since it is clear from (6) that does not depend on the out-edges of , the mechanism is IC.
To prove that it is well-defined, we need to show that for every monotone DAG, all the probabilities are non-negative and their sum is at most 1 (if the sum is strictly less than 1, the remaining probability goes to the empty-set outcome). For the former, we use the monotonicity assumption, which means that for all . Hence,
For the latter, we use the lemma.
We are ready to prove the parallel of Theorem 3.5.
Theorem 4.4 ().
Let be a monotone DAG with sinks, . Then,
Define, for every ,
It is enough then, to bound . Let be the progeny of . We will prove by induction on that
When , and by Lemma 4.3,
In the general case, we write