Epidemic models represent spreading phenomena on an underlying graph Newman2014a. Such phenomena include diseases spreading among a population, security breaches in networks (malware attacks on computer/mobile networks), chains of activation in various biological networks (activation of synapses, variations in the levels of gene expression), circulation of information/influence (rumors, real or fake news, viral videos, advertisement campaigns) and so on.
Most settings assume the underlying graph is known (e.g., the gene regulatory network), and focus on modeling epidemics DelVicario2016 ; Wu2018 ; Gomez-Rodriguez2013 ; Cheng2014 ; Zhao2015 , detecting them Arias-castro2011 ; Arias-castro ; Milling2015 ; Milling2012 ; Meirom2014 ; Leskovec2007 ; Khim2017 , finding their source Shah2010 ; shah2012rumor ; shah2010detecting ; spencer2015impossibility ; wang2014rumor , obfuscating the source fanti2016rumor ; Fanti2014 ; Fanti2017 , or controlling their spread Drakopoulos2014 ; Drakopoulos2015 ; Hoffmann2018 ; Farajtabar2017 .
The inverse problem, learning the graph from times of infection during multiple epidemics, has also been extensively studied. The first theoretical guarantees were established by Netrapalli2012 for discrete-time models. Abrahao2013 tackled the problem for continuous-time models with exponential distributions. Khim2018 solved the problem for correlated cascades. It has been shown in hoffmann2019 that it is possible to robustly learn the graph from noisy epidemic cascades, even in the presence of arbitrary noise.
However, this line of research always assumes that the epidemic cascades are all of the same kind, and spread on one unique graph which entirely captures the dynamics of the spread. In reality, our observations of cascades are far more granular: different kinds of epidemics spread on the same nodes but through different mechanisms, i.e., different spreading graphs. The epidemic cascades we observe will most often be a mixture of different kinds of epidemics. Without knowledge of the “label” of the epidemic, how can we recover the individual spreading graphs? For a concrete example, consider the ubiquitous Twitter graph. Individuals usually have multiple interests, and will share tweets differently according to the underlying topics of the tweets. For instance, if a user is extremely interested in football and moderately interested in politics, she may retweet posts on football frequently, whereas her retweets on politics will be rare. Interesting settings are those where the epidemic label (in this simple case, “football” and “politics”) is not observable. While “football” and “politics” may be easy to distinguish via basic NLP, the majority of settings will not enjoy this property (e.g., she retweets football posts relating to certain teams, outcomes or special plays). In fact, the focus on recovering the spreading graph stems precisely from the desire to study very poorly understood epidemics where we do not understand spreading mechanisms, symptoms, etc. Examples outside the Twitter realm (e.g., HIV when it first appeared cohen2006making ) abound.
In such cases, applying existing techniques for estimating the spreading graph would recover the union of graphs in the mixture. For Twitter and other social networks, this is essentially already available. More problematic, this union is typically not informative enough to predict the spread of tweets, and may even be misleading.
We address precisely this problem. We consider a mixture of epidemics that spread on two unknown weighted graphs when, for each cascade, the kind of epidemic (and hence the spreading graph) remains hidden. We aim to accurately recover the weights of both the graphs from such cascades.
To the best of our knowledge, this is the first paper to study the inverse problem of learning mixtures of weighted graphs from epidemic cascades. We address the following questions:
Identifiability: We prove the problem is unidentifiable when one of the graphs of the mixture has a connected component with fewer than three edges.
Recovery: Under the assumption that the underlying graphs are connected, have at least three edges, and the edge weights of the mixtures are separated, we prove the problem is solvable and give an efficient algorithm to recover the weights of a mixture of connected graphs with priors on the same set of vertices.
Sample Complexity: We prove an information-theoretic lower bound on the sample complexity of the problem and show that our algorithm matches the lower bound up to log factors.
2.1 Model for Sample Generation
We consider an instance of the independent cascade model Goldberg2001 ; Kempe2003 . We observe independent epidemics spreading on a mixture of two graphs. In this section, we specify the dynamics of the spreading process, the observation model, and the learning task.
Mixture model: We consider two graphs and on the same set of vertices . The edges in the two graphs are weighted. For each edge , we have the weight and, similarly, for each we have the weight .
We assume the following from hereon, unless stated otherwise:
1. Both the graphs are undirected, , and .
2. The minimum weight of an edge is positive, ,
3. For an edge, , its weights are well-separated, .
We observe independent and identically distributed epidemic cascades, which come from the following generative model.
Component Selection: For the -th cascade the i.i.d. Bernoulli random variable decides the component of the mixture, i.e. its source graph is if for . We first consider the setting where are unbiased Bernoulli random variables. We then consider the extension of this model to a mixture of two graphs with unknown prior : for each epidemic , the label is given by an independent random variable , such that .
Epidemic Spreading: Once the component of the mixture is fixed, the epidemic spreads in discrete time on graph according to a regular one-step Susceptible Infected Removed (SIR) process Netrapalli2012 ; hoffmann2019 . At , epidemic starts on a source, chosen uniformly at random among the nodes of . The source is in the Infected state, while all the other nodes are in the Susceptible state. Let (resp. ) be the set of nodes in the Infected (resp. Removed) state at time . At each time step , all nodes in the Infected state try to infect their neighbors in the Susceptible state, before transitioning to the Removed state during this same time step (i.e., ). Once a node is in the Removed state, the spread of the epidemic proceeds as if this node were no longer on the graph. If is in the Infected state at time , and is in the Susceptible state at the same time ( ), then infects with probability if , and if . Note that multiple nodes in the Infected state can infect the same node in the Susceptible state.
The process ends at the first time step such that all nodes are in the Susceptible or Removed state (i.e., no node is in the Infected state). One realization of such a process from randomly picking the component of the mixture and the source at to the end of the process is called a cascade.
Observation: For each cascade we do not have the knowledge of the underlying component , and we treat this as a missing label. For each cascade, we have access to the complete list of infections: we know which node infected which node at which time (one node can have been infected by multiple nodes). Such a list is called a sample.
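The generative model above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `simulate_cascade`, the dictionary encoding of edge weights, and the convention that the source is infected at time 0 and its neighbors at time 1 are all choices made here for illustration.

```python
import random

def simulate_cascade(nodes, weights_1, weights_2, prior=0.5, rng=random):
    """Simulate one cascade of the one-step SIR process on a two-graph mixture.

    weights_k maps frozenset({u, v}) -> infection probability on graph k
    (a missing key means no edge). Returns (label, source, infections), where
    infections is a list of (infector, infected, time) triples.
    """
    # Component selection: the hidden label picks the spreading graph.
    label = 1 if rng.random() < prior else 2
    weights = weights_1 if label == 1 else weights_2

    # The source is uniform over the nodes and starts Infected at time 0.
    source = rng.choice(sorted(nodes))
    susceptible = set(nodes) - {source}
    infected = {source}
    infections = []
    t = 0
    while infected:
        newly_infected = set()
        for u in sorted(infected):
            for v in sorted(susceptible):
                w = weights.get(frozenset((u, v)), 0.0)
                if rng.random() < w:
                    # Several infected nodes may infect the same v in one step.
                    infections.append((u, v, t + 1))
                    newly_infected.add(v)
        # Infected nodes are removed after one step; new infections take over.
        susceptible -= newly_infected
        infected = newly_infected
        t += 1
    return label, source, infections
```

A sample in the sense of the observation model is the pair `(source, infections)`; the returned `label` is exactly the quantity that is hidden from the learner.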
2.2 Learning Objective:
Our goal is to learn all the weights of all the edges of both graphs of the mixture, up to precision . Specifically, we want to provide and for all vertex pairs such that .
3 Main Results
In this section we present our main results on the impossibility and recoverability of edge weights in a mixture of components. The proofs of Theorem 1 and Theorem 3 are deferred to the Appendix, whereas we present a proof sketch of Theorem 2 for the unbiased mixture case along with an algorithm.
1. Impossibility Result Under Infinite Samples
The graph is connected and has at least three edges: .
The above condition is a necessary condition for recovery.
If Condition 1 is violated, then it is impossible to recover the edge weights corresponding to each graph (even with infinite samples and for unbiased mixtures).
2. Recoverability Result with Finite Samples
The mixtures in the graph are well-separated, that is, and the bias of the mixture is known.
The above condition along with the previous one turn out to be sufficient for recovering the mixtures.
Condition 1 is necessary for identifiability. Suppose has two (or more) connected components and . We write . Since all epidemics have a unique source, no single cascade involves edges in both and . Let . We notice both and yield identical cascade distributions. Therefore, if the union of the graphs is not connected, the solution is not unique.
Furthermore, suppose has exactly two nodes and , with . We notice that the cascades on these two nodes have the same distribution independently of the value of . Therefore, it is impossible to disentangle the mixture. For three nodes, the situation is slightly more complex: if , we cannot recover the weights of the mixture (see counterexamples in Appendix A.1). If , we can recover the edges as long as they form a triangle. This special case is treated in Appendix B.4. For connected graphs of four nodes or more, we always have at least three edges.
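The two-node case can be checked by direct computation. The sketch below assumes an unbiased mixture and a single edge with weight `p` in one graph and `q` in the other; the function name and the numerical values are illustrative choices, not taken from the text. Once the hidden label is marginalized out, the cascade distribution depends on the weights only through their average, so two different weight pairs with the same sum are indistinguishable.

```python
from fractions import Fraction

def two_node_cascade_distribution(p, q):
    """Exact cascade distribution on a single edge {a, b}, unbiased mixture.

    p and q are the edge weights in the two graphs. Outcomes are
    (source, whether the source infected the other node).
    """
    half = Fraction(1, 2)
    infect = half * p + half * q  # hidden label marginalized out
    dist = {}
    for source in ("a", "b"):
        dist[(source, True)] = half * infect
        dist[(source, False)] = half * (1 - infect)
    return dist

# Two different mixtures with the same average weight 2/5.
d1 = two_node_cascade_distribution(Fraction(1, 5), Fraction(3, 5))
d2 = two_node_cascade_distribution(Fraction(3, 10), Fraction(1, 2))
print(d1 == d2)
```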
Conditions 1 and 2 are both needed for our algorithm to work. However, we note that if the graph we obtain by removing all non well-separated edges is still connected, we can detect and learn all the edges of the graph (see Appendix A.2). This in particular makes our algorithm resistant to the presence of bots in the network that would retweet everything indifferently. Furthermore, we believe that when is unknown, there are instances with four nodes which are impossible to learn. We leave recoverability open for unknown .
3. Lower Bound on Sample Complexity
The mixtures in the graph each exhibit correlation decay, and for some .
4 Learning graphs for unbiased mixture priors
In this section, we provide our main algorithm that recovers the edge weights on the graph under the conditions presented in Theorem 1. We focus on the setting of unbiased mixture priors for ease of exposition. We refer the reader to Supplementary D for the results on the biased mixture case.
Outline of Main Algorithm 1
We first notice that we can learn the degree of each node with respect to by looking at a simple estimator (Section 4.1).
We then notice that if a node is of degree two, or at least three, we can design estimators to learn all the edges of its neighborhood from the cascades. Although the estimators are different in these two cases, they both allow us to learn those edges independently of the rest of the graph (Section 4.2).
4.1 Learning the Edges in
We recall that is the set containing the unique source of the epidemic for cascade .
Let and be two distinct nodes of V. We define:
We compute the limit of this estimator:
If and are two distinct nodes of such that , then:
Using the law of large numbers and Slutsky’s Lemma, it is immediate to see that. Then:
This estimator can therefore be used to learn whether or not there exists an edge in the mixture.
Let and be two distinct nodes of . Then:
If there exists an edge between and in , then .
If there exists no edge between and , then .
We can write an algorithm LearnEdges, which takes as inputs all the for all pairs , and returns all the edges of (See Appendix, Algorithm B).
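The estimator's formula is not reproduced in this text, so the following is only a sketch under an assumption: that the quantity of interest is, for each ordered pair of nodes, the fraction of cascades with source `u` in which `u` directly infected `v`. Under the model above this fraction converges to zero when there is no edge and to a positive value otherwise. The function names, the sample encoding `(source, infections)`, and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict

def estimate_direct_infection_rates(samples):
    """For each ordered pair (u, v), the fraction of cascades with source u
    in which u directly infected v.

    samples is a list of (source, infections), infections being a list of
    (infector, infected, time) triples.
    """
    source_counts = defaultdict(int)
    hit_counts = defaultdict(int)
    for source, infections in samples:
        source_counts[source] += 1
        direct = {v for (u, v, _t) in infections if u == source}
        for v in direct:
            hit_counts[(source, v)] += 1
    return {
        (u, v): hit_counts[(u, v)] / source_counts[u]
        for (u, v) in hit_counts
    }

def learn_edges(samples, threshold):
    """Declare an undirected edge when either direction's rate exceeds threshold."""
    rates = estimate_direct_infection_rates(samples)
    return {frozenset(pair) for pair, rate in rates.items() if rate > threshold}
```

In practice the threshold would have to be chosen from the minimum edge weight and the number of samples; the value is left as a free parameter here.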
4.2 Learning Star and Line Vertices: Base Estimators
In this section, we show how we recover the weights of both mixture components in a node-by-node fashion. We start from the easy case, when the node we consider has degree at least three.
4.2.1 Star vertex
A star vertex is a vertex of degree at least three in . Let and be two distinct neighbors of . For a star vertex, we define:
In the limit of infinite samples, for star vertex and , and as in Figure 0(a):
One crucial point to notice is that if is the source, no other node could have infected or before infected them. Therefore, we know for a fact that both and are susceptible. If the source had been any other node, the probability that (or ) was not removed would have depended on the (unknown) weights of both mixtures, and we could not have obtained the simplification desired. Once this has been noticed, the proof for is almost identical to the proof in Claim 1. ∎
We notice that for , . Therefore,
Since , we have or , which gives the required result. ∎
We can write an algorithm LearnStar, which takes as input a star vertex , the set of edges of , and all the and for all distinct neighbors of , and returns all the weights of the edges connected to in both mixtures (See Appendix, Algorithm C).
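To make the star-vertex argument concrete, here is a population-level sketch. The estimator definitions are not reproduced in this text, so the code assumes two kinds of moments, conditioned on the star vertex being the source: `x[j]`, the probability that the source infects neighbor `j`, and `y[(j, k)]`, the probability that it infects both `j` and `k`. For an unbiased mixture these equal `(p[j] + q[j]) / 2` and `(p[j] * p[k] + q[j] * q[k]) / 2`, which follows directly from the model; all names (`star_moments`, `recover_star_weights`, `p`, `q`, `s`) are illustrative.

```python
import math

def star_moments(p, q):
    """Population moments at a star vertex for an unbiased mixture.

    p[j], q[j] are the weights of edge (center, j) in the two graphs.
    x[j]: probability the source (the center) infects j.
    y[(j, k)]: probability the source infects both j and k.
    """
    x = {j: (p[j] + q[j]) / 2 for j in p}
    y = {(j, k): (p[j] * p[k] + q[j] * q[k]) / 2
         for j in p for k in p if j != k}
    return x, y

def recover_star_weights(x, y):
    """Recover (p[j], q[j]) for every neighbor j, up to a global label swap."""
    neighbors = sorted(x)
    # y - x * x equals s[j] * s[k], where s[j] = (p[j] - q[j]) / 2.
    prod = {(j, k): y[(j, k)] - x[j] * x[k] for (j, k) in y}
    j0, k0, l0 = neighbors[:3]
    # |s[j0]| from three pairwise products; requires prod[(k0, l0)] != 0.
    s = {j0: math.sqrt(prod[(j0, k0)] * prod[(j0, l0)] / prod[(k0, l0)])}
    # Fixing the sign of s[j0] fixes the sign of every other edge.
    for j in neighbors:
        if j != j0:
            s[j] = prod[(j0, j)] / s[j0]
    return {j: (x[j] + s[j], x[j] - s[j]) for j in neighbors}
```

The square root is where the sign ambiguity enters: choosing the positive root amounts to deciding which mixture component is called the first one. With finite samples, `x` and `y` would be replaced by empirical frequencies.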
4.2.2 Line vertex
Let and be four distinct nodes of V, such that and belong in . Suppose also that has degree exactly two in . We call such a node a line vertex (see Figure 0(b)). For a line vertex, we define:
In the limit of infinite samples, for line vertex and , and as in Figure 0(b):
The result for was proven in Claim 2.
However, the proof for and is different. The proof relies heavily on the fact that is of degree 2, which implies in particular that there are no edges between and (or in other words, ).
The proof is almost identical for . ∎
In this case, there is no edge between and , which implies that . Hence, we cannot use a variation of the equation above for finding the edges of a star structure without dividing by zero. Therefore, we need to make use of the estimator . We notice a remarkable simplification:
4.3 Resolving Sign Ambiguity across Base Estimators
The following lemma handles the sign ambiguity () introduced above.
From previous analysis, we have . Therefore:
Thus fixing sign of one edge gives us the signs of all the other edges adjacent to a star vertex. A similar relationship can be established among the edges of a line vertex, using .
4.4 Main Algorithm
We can now turn to the main algorithm (see Algorithm 1). This algorithm first uses the algorithm LearnEdges (Corollary 2) to learn all the edges of . It then calls Learn2Nodes as an initialization, which learns all weights of the edges connected to two initial nodes (see Appendix, Algorithm E). The set , which only contains nodes for which the weights of all incident edges have been learned, is initialized with these two nodes.
Then, at each iteration of our algorithm, we pick one node connected to . If this node is a star vertex, we learn all the weights of its neighborhood using LearnStar (Corollary 3). If this node is a line vertex, we learn all the weights of its neighborhood using LearnLine (Corollary 4). If this node has degree one, we have already learned all the weights of the edges connected to it, so we proceed without doing anything. Finally, since we have learned all the weights of the edges connected to this new node, we can add it to . We keep growing until .
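The control flow of this procedure can be sketched as follows. This is an illustrative skeleton only: `learn_star` and `learn_line` are placeholder callbacks standing in for LearnStar and LearnLine, the initialization is assumed to have been done already for `initial_pair`, and the choice of the smallest frontier node is an arbitrary tie-breaking rule introduced here.

```python
def learn_mixture(nodes, edges, initial_pair, learn_star, learn_line):
    """Grow the set of learned nodes one node at a time.

    edges: set of frozenset({u, v}) for the union graph (e.g. from LearnEdges).
    initial_pair: the two nodes whose incident weights are assumed to be
        already learned by the initialization step.
    learn_star, learn_line: callbacks taking a node; placeholders for the
        LearnStar and LearnLine routines.
    Returns the order in which nodes were processed, with the routine used.
    """
    neighbors = {u: set() for u in nodes}
    for e in edges:
        u, v = tuple(e)
        neighbors[u].add(v)
        neighbors[v].add(u)

    learned = set(initial_pair)
    trace = []
    while learned != set(nodes):
        # Pick any unlearned node adjacent to the learned set.
        frontier = sorted(u for u in nodes
                          if u not in learned and neighbors[u] & learned)
        if not frontier:
            raise ValueError("union graph is not connected")
        u = frontier[0]
        degree = len(neighbors[u])
        if degree >= 3:
            learn_star(u)
            trace.append((u, "star"))
        elif degree == 2:
            learn_line(u)
            trace.append((u, "line"))
        else:
            # Degree one: its only edge touches an already learned node.
            trace.append((u, "leaf"))
        learned.add(u)
    return trace
```

Each iteration adds exactly one node to the learned set, so the loop runs at most once per node that is not in the initial pair.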
4.4.1 Correctness of Algorithm 1
To prove the correctness of the main algorithm, we show the following invariant:
At any point in the algorithm, the entire neighborhood of any node of has been learned and recorded in :
We prove the above by induction on the iteration of the while loop. Note that by Lemma I, after calling Learn2Nodes, contains all edges adjacent to the two vertices in . Hence the base case is true. Let us assume that after iterations of the loop, the induction hypothesis holds. If:
: By Corollary 3, we recover all edges adjacent to the star vertex by using LearnStar. Sign consistency is ensured using edge since .
: There exists such that since and is connected. Since , there exists such that . Now if then and we are done. If then is a line vertex for . Corollary 4 guarantees recovery of all edges on the line by using LearnLine. Sign consistency is ensured through edge .
: Since , we have , so we are done.
Thus by induction, after every iteration of the while loop, the invariant is maintained. ∎
Since at every iteration, the size of increases by 1, after at most iterations, we have . Using the above lemma we have that contains all the edges of the graph. ∎
4.4.2 Finite Sample Complexity
In this section, we investigate what happens if we use a finite sample estimate of the limit of the estimators defined above (see Appendix B.3 for detailed calculations):
-  Bruno Abrahao, Flavio Chierichetti, Robert Kleinberg, and Alessandro Panconesi. Trace complexity of network inference. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’13, page 491, 2013.
-  Ery Arias-castro, Emmanuel J Candès, and Arnaud Durand. Detection of an anomalous cluster in a network. The Annals of Statistics, 39(1):278–304, 2011.
-  Ery Arias-castro and S T Nov. Detecting a Path of Correlations in a Network. pages 1–12.
-  Justin Cheng, Lada A. Adamic, P. Alex Dow, Jon Kleinberg, and Jure Leskovec. Can Cascades be Predicted? In Proceedings of the 23rd international conference on World wide web (WWW’ 14), 2014.
-  Jon Cohen. Making headway under hellacious circumstances, 2006.
-  Michela Del Vicario, Alessandro Bessi, Fabiana Zollo, Fabio Petroni, Antonio Scala, Guido Caldarelli, H. Eugene Stanley, and Walter Quattrociocchi. The spreading of misinformation online. Proceedings of the National Academy of Sciences, page 201517441, 2016.
-  Kimon Drakopoulos, Asuman Ozdaglar, and John N. Tsitsiklis. An efficient curing policy for epidemics on graphs. arXiv preprint arXiv:1407.2241, (December):1–10, 2014.
-  Kimon Drakopoulos, Asuman Ozdaglar, and John N. Tsitsiklis. A lower bound on the performance of dynamic curing policies for epidemics on graphs. (978):3560–3567, 2015.
-  Giulia Fanti, Peter Kairouz, Sewoong Oh, Kannan Ramchandran, and Pramod Viswanath. Rumor source obfuscation on irregular trees. In Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science (SIGMETRICS’ 16 ), pages 153–164. ACM, 2016.
-  Giulia Fanti, Peter Kairouz, Sewoong Oh, Kannan Ramchandran, and Pramod Viswanath. Hiding the Rumor Source. IEEE Transactions on Information Theory, 63(10):6679–6713, 2017.
-  Giulia Fanti, Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Spy vs. Spy: Rumor Source Obfuscation. Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS’ 14), pages 271–284, 2015.
-  Mehrdad Farajtabar, Jiachen Yang, Xiaojing Ye, Huan Xu, Rakshit Trivedi, Elias Khalil, Shuang Li, Le Song, and Hongyuan Zha. Fake News Mitigation via Point Process Based Intervention. Proceedings of the 34th International Conference on Machine Learning (ICML’ 17), 2017.
-  Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A Constant Time Collaborative Filtering Algorithm. Information Retrieval, 4(2):133–151, 2001.
-  Manuel Gomez-Rodriguez, Jure Leskovec, and Bernhard Schölkopf. Structure and Dynamics of Information Pathways in Online Media. In 6th International Conference on Web Search and Data Mining (WSDM 2013), 2013.
-  Jessica Hoffmann and Constantine Caramanis. The Cost of Uncertainty in Curing Epidemics. Proceedings of the ACM on Measurement and Analysis of Computing Systems (SIGMETRICS’ 18), 2(2):11–13, 2018.
-  Jessica Hoffmann and Constantine Caramanis. Learning graphs from noisy epidemic cascades. arXiv preprint arXiv:1903.02650, 2019.
-  David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’03, 2003.
-  Justin Khim and Po-Ling Loh. Permutation Tests for Infection Graphs. pages 1–28, 2017.
-  Justin Khim and Po-Ling Loh. A theory of maximum likelihood for weighted infection graphs. pages 1–47, 2018.
-  Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. Cost-effective Outbreak Detection in Networks. Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’07), page 420, 2007.
-  Eli A. Meirom, Chris Milling, Constantine Caramanis, Shie Mannor, Ariel Orda, and Sanjay Shakkottai. Localized epidemic detection in networks with overwhelming noise. pages 1–27, 2014.
-  Chris Milling, Constantine Caramanis, Shie Mannor, and Sanjay Shakkottai. Network Forensics: Random Infection vs Spreading Epidemic. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems (SIGMETRICS’ 12), 2012.
-  Chris Milling, Constantine Caramanis, Shie Mannor, and Sanjay Shakkottai. Local detection of infections in heterogeneous networks. Proceedings - IEEE INFOCOM, 26:1517–1525, 2015.
-  Praneeth Netrapalli and Sujay Sanghavi. Learning the Graph of Epidemic Cascades. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems (SIGMETRICS’ 12), pages 211–222, 2012.
-  M. E. J. Newman. Networks: An Introduction, volume 23. 2014.
-  Devavrat Shah and Tauhid Zaman. Detecting sources of computer viruses in networks: theory and experiment. In ACM SIGMETRICS Performance Evaluation Review, volume 38, pages 203–214. ACM, 2010.
-  Devavrat Shah and Tauhid Zaman. Rumors in a Network: Who’s the Culprit? IEEE Transactions on information theory, 57(8):1–43, 2010.
-  Devavrat Shah and Tauhid Zaman. Rumor centrality: a universal source detector. In ACM SIGMETRICS Performance Evaluation Review, volume 40, pages 199–210. ACM, 2012.
-  Sam Spencer and R Srikant. On the impossibility of localizing multiple rumor sources in a line graph. ACM SIGMETRICS Performance Evaluation Review, 43(2):66–68, 2015.
-  Zhaoxu Wang, Wenxiang Dong, Wenyi Zhang, and Chee Wei Tan. Rumor source detection with multiple observations: Fundamental limits and algorithms. In ACM SIGMETRICS Performance Evaluation Review, volume 42, pages 1–13. ACM, 2014.
-  Liang Wu and Huan Liu. Tracing Fake-News Footprints: Characterizing Social Media Messages by How They Propagate. In (WSDM 2018) The 11th ACM International Conference on Web Search and Data Mining, 2018.
-  Qingyuan Zhao, Murat A. Erdogdu, Hera Y. He, Anand Rajaraman, and Jure Leskovec. SEISMIC: A Self-Exciting Point Process Model for Predicting Tweet Popularity. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15), 2015.
Appendix A Preliminaries
Let be the union of the graphs from both mixtures. In this subsection, we prove it is impossible to learn the weights of and if has fewer than three edges:
For a graph on two nodes, we have already seen that the cascade distributions are identical if , for any value of , which proves the problem is not solvable.
When we have three nodes and two edges, we can without loss of generality assume that node 1 is connected to node 2 and node 3. Then, if:
The cascade distribution is identical for any value of . By simple calculations, we can show the following,
Fraction of cascades with only node 1 infected: .
Fraction of cascades with only node 2 infected: .
Fraction of cascades with only node 3 infected: .
Fraction of cascades where 3 infected 1, but 1 did not infect 2: .
Fraction of cascades where 3 infected 1, 1 infected 2: .
Fraction of cascades where 1 infected 3, but 1 did not infect 2: .
Fraction of cascades where 1 infected 2, but 1 did not infect 3: .
Fraction of cascades where 1 infected 3 and 2: .
Fraction of cascades where 2 infected 1, but 1 did not infect 3: .
Fraction of cascades where 2 infected 1, then 1 infected 3: .
Since the distribution of cascades is the same for any value of , the problem is not solvable.
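The three-node case can also be verified numerically. The sketch below enumerates the exact distribution over the ten cascade types listed above for the path 2 - 1 - 3 under an unbiased mixture, and compares two different weight settings. The specific condition on the weights is not reproduced in this text; the two settings used here are illustrative choices that share the same average weight on each edge and the same product of half-differences, which is enough to make every cascade probability coincide.

```python
from fractions import Fraction
from itertools import product

def path_cascade_distribution(p2, q2, p3, q3):
    """Exact cascade distribution on the path 2 - 1 - 3, unbiased mixture.

    (p2, q2) are the weights of edge (1, 2) in the two graphs and (p3, q3)
    those of edge (1, 3). An outcome is (source, frozenset of infection events).
    """
    dist = {}
    for a, b in ((p2, p3), (q2, q3)):        # the two mixture components
        weight = Fraction(1, 6)               # 1/2 label prior * 1/3 source prior
        # Source 1 tries both neighbors independently.
        for hit2, hit3 in product((True, False), repeat=2):
            pr = (a if hit2 else 1 - a) * (b if hit3 else 1 - b)
            events = frozenset(
                e for e, hit in (((1, 2), hit2), ((1, 3), hit3)) if hit)
            dist[(1, events)] = dist.get((1, events), 0) + weight * pr
        # A leaf source must go through node 1 to reach the other leaf.
        for leaf, other, first, second in ((2, 3, a, b), (3, 2, b, a)):
            outcomes = (
                (frozenset(), 1 - first),
                (frozenset({(leaf, 1)}), first * (1 - second)),
                (frozenset({(leaf, 1), (1, other)}), first * second),
            )
            for events, pr in outcomes:
                key = (leaf, events)
                dist[key] = dist.get(key, 0) + weight * pr
    return dist

# Same averages (1/2, 1/2) and same product of half-differences (1/40).
d1 = path_cascade_distribution(Fraction(3, 4), Fraction(1, 4),
                               Fraction(3, 5), Fraction(2, 5))
d2 = path_cascade_distribution(Fraction(5, 8), Fraction(3, 8),
                               Fraction(7, 10), Fraction(3, 10))
print(d1 == d2)
```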
A.2 Mixtures which are not well-separated
In this section, we show how to detect and deduce the weights of edges which have the same weight across both components of the mixture. We assume both and follow Conditions 1 and 2 if we remove all non-distinct edges, and in particular remain connected.
Suppose there exists an edge in the graph, such that . Then in particular, there exists another edge connecting to the rest of the graph through node , such that . Then:
Suppose and follow assumption 2 after removing all non-distinct edges. We can detect and learn the weights of non-distinct edges the following way:
If , and , then .
Since is connected on three nodes or more even when removing edge , we know there exists a node such that is connected to either or . Therefore, either or . In both these cases, we deduce . This in turn allows us to detect that . Once this edge is detected, it is very easy to deduce its weight, since by definition. ∎
Appendix B Proofs for unbiased mixtures
B.1 Estimators - proofs
In this case, there is no edge between and , which implies that . Hence, we cannot use a variation of the equation above for finding the edges of a star structure without dividing by zero. Therefore, we need to use . Let . We notice a remarkable simplification: