Learning Graphs from Noisy Epidemic Cascades

by Jessica Hoffmann, et al.

We consider the problem of learning the weighted edges of a graph by observing the noisy times of infection for multiple epidemic cascades on this graph. Past work has considered this problem when the cascade information, i.e., infection times, is known exactly. Though the noisy setting is well motivated by many epidemic processes (e.g., most human epidemics), to the best of our knowledge, very little is known about when it is solvable. Previous work on the no-noise setting critically uses the ordering information. If noise can reverse this -- a node's reported (noisy) infection time comes after the reported infection time of some node it infected -- then we are unable to see how previous results can be extended. We therefore tackle two versions of the noisy setting: the limited-noise setting, where we know noisy times of infections, and the extreme-noise setting, in which we only know whether or not a node was infected. We provide a polynomial time algorithm for recovering the structure of bidirectional trees in the extreme-noise setting, and show our algorithm matches lower bounds established in the no-noise setting, and hence is optimal. We extend our results to general degree-bounded graphs, where again we show that our (poly-time) algorithm can recover the structure of the graph with optimal sample complexity. We also provide the first efficient algorithm to learn the weights of the bidirectional tree in the limited-noise setting. Finally, we give a polynomial time algorithm for learning the weights of general bounded-degree graphs in the limited-noise setting. This algorithm extends to general graphs (at the price of exponential running time), proving the problem is solvable in the general case. All our algorithms work for any noise distribution, without any restriction on the variance.




1 Introduction

Epidemic models accurately represent (among other processes) the spread of diseases, information (rumors, viral videos, news stories, etc.), the spread of malevolent agents in a network (computer viruses, malicious apps, etc.), or even biological processes (pathways in cell signaling networks, chains of activation in the gene regulatory network, etc.).

We focus on epidemics that spread on an underlying graph [30], as opposed to the fully mixed models introduced in the early literature [4].

Most settings assume we know the underlying graph and aim to study properties of the spread. Much work has been done in detection [2, 3, 27, 26, 25, 24, 21], where the goal is to decide whether or not there is indeed an infection. This problem is of importance in deciding whether or not a computer network is under attack, for instance, or whether a product gets sold through word-of-mouth or thanks to the advertisement campaign (or both [28]). More specifically, the problem of source detection [32, 33, 31, 34, 35] or obfuscation [11, 13, 12] has been extensively studied. On the other side of the spectrum, both experimental and theoretical work has tackled the problem of modeling [7, 36, 17], predicting the growth [6, 38], and controlling the spread of epidemics [9, 10, 18, 14].

In this work, we take the opposite approach: assuming we know some properties of the spread, can we recover the underlying graph? The early works on this subject proposed a few heuristics and demonstrated their effectiveness experimentally [16, 19]. Netrapalli et al. [29] established the first theoretical guarantees for this problem for discrete-time infections. They proved one can recover the edges of any graph with correlation decay, with access to the times of infection for multiple cascades spreading on the graph. They introduced a likelihood, proved it decouples into convex subproblems, and demonstrated that the edges of the graph can therefore be obtained efficiently. They also proved a sample complexity lower bound and showed their method is within a log factor of it. Abrahao et al. [1] also introduced a method of solving this problem, this time for a more realistic, continuous-time infection model, through learning only the first edge of each cascade. Zarezade et al. [37] proposed a first experimental attempt to tackle the case of correlated cascades using Hawkes processes. Khim et al. [22] extended the theoretical results to the case where the cascades spreading on the graph are not independent, which required completely new machinery involving martingales and weighted Pólya urns.

All the results above assume we have perfect knowledge of the properties of the spread we use to reconstruct the graph. For most of the literature, those are the times of infection for all nodes for each cascade. This assumption may hold for online epidemics, as information is usually dated (for instance, posts or retweets on social networks have time stamps). For human networks, however, this assumption is often unrealistic: official diagnosis (and hence recording by any tracking entity such as the CDC) may come days, weeks, or in important examples such as HIV, years after the actual moment of infection. Moreover, this can be highly variable from person to person, hence the infector is often diagnosed after the infectee. Similar issues arise with biological networks: we only know the expression of a gene when a measure is taken, which can happen after a typically arbitrary delay.

We therefore develop a method for learning the graph of epidemics with noisy times of infection. We demonstrate that past approaches are unusable, due to the fact that even small levels of noise are typically enough to cause order-of-diagnosis to differ from order-of-infection. We develop new techniques to deal with this setting, and prove our algorithm can reliably learn the edges of a tree in the limited-noise setting, for any noise distribution. We also show we can learn the structure of any bounded degree graph from a very weak observation model, in a sample-optimal fashion. We finally provide an algorithm which learns the weights of any bounded-degree graph in the limited-noise setting.

Graph , set of nodes, set of edges.
Number of nodes in the graph.
Random variable for the actual time of infection of node during cascade .
Noise of node during cascade .
Random variable for the noisy time of infection of node during cascade .
Random boolean variable. ( node was infected during cascade ).
Weight of edge , corresponding to the probability that infects .
Table 1: Notations

1.1 Model

(a) At t=0, node 1 is the source, in the infected state. It can possibly infect nodes 2, 3 and 4, all in the susceptible state.

(b) At t=1, nodes 3 and 4 are infected. Node 3 can infect node 5, in the susceptible state. Node 4 can infect node 2, in the susceptible state, but not node 1, since node 1 is now removed.

(c) At t=2, node 2 is infected. All its neighbors are in the removed state, so no new node can be infected.

(d) At t=3, the cascade stops, even though node 5 remains in the susceptible state.
Figure 1: A complete cascade.

We observe epidemics spreading on a graph, and aim to reconstruct the graph structure from noisy estimates of the times of infection. In this section, we specify the exact propagation model, the noisy observation model, and the two learning tasks this work tackles.

Propagation model: We consider a particular variant of the independent cascade model, close to the one-step model introduced by [15] and further studied by [20]. The epidemic spreads in discrete time on a weighted directed graph , where parents can infect their children with probability equal to the weight of the edge between them, but children cannot infect their parents. We allow bidirectional edges: it is possible that both and , possibly with different weights. For each edge , the corresponding weight is such that .

This process is an instance of a Susceptible-Infected-Removed (SIR) process. Each node starts out in the susceptible state. As in [1], each cascade starts at a positive time on a unique source node, picked uniformly among the nodes of the graph. (Footnote 1: Most of the literature considers the initial time of infection to be 0. This is because when we have access to the exact times of infection, we can make this assumption without loss of generality. In our case, it would imply we know exactly when an outbreak started, which is usually not the case.) Once the source becomes infected, it is removed from the graph, and if it has children, each is infected at the next time step independently according to a probability specified by the weight of the edge shared with the source.

The process ends when there are no newly infected nodes (either because no infection happened during the previous time step, in which case some nodes may never be infected, or because all the nodes of the graph are removed). One realization of this process from start to finish is called a cascade. If two nodes are infected during the same cascade, we say that they are co-infected for this cascade. This process is illustrated in Figure 1.
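The propagation model above can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions: the names `simulate_cascade`, `weights` and `start_time` are hypothetical, nodes are labeled 0 to n_nodes - 1, and `weights` maps each directed edge (parent, child) to its infection probability.

```python
import random

INF = float("inf")

def simulate_cascade(n_nodes, weights, rng, start_time=1):
    """Run one cascade of the one-step SIR process (illustrative sketch).

    weights: dict mapping a directed edge (parent, child) to a probability.
    Returns a dict node -> infection time; never-infected nodes map to INF.
    """
    times = {v: INF for v in range(n_nodes)}
    source = rng.randrange(n_nodes)      # source picked uniformly at random
    times[source] = start_time
    frontier = [source]                  # nodes infected at the current step
    t = start_time
    while frontier:                      # stop when no node is newly infected
        newly_infected = []
        for parent in frontier:
            for (u, v), p in weights.items():
                # Only susceptible children can be infected, each
                # independently with the weight of the edge.
                if u == parent and times[v] == INF and rng.random() < p:
                    times[v] = t + 1
                    newly_infected.append(v)
        frontier = newly_infected        # the old frontier is now removed
        t += 1
    return times

# Bidirectional path 0 - 1 - 2 with hypothetical weights.
weights = {(0, 1): 0.6, (1, 0): 0.4, (1, 2): 0.7, (2, 1): 0.3}
print(simulate_cascade(3, weights, random.Random(0)))
```

A node is infectious for exactly one step, which is what makes the process end after finitely many steps even when some nodes stay susceptible.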

Observation model: Let be a random variable corresponding to the time of infection of node during cascade , and let be its realization (if stays in the susceptible state during cascade , we have ). We introduce three observation models.

In the no-noise setting, we have access to the exact times of infection .

In the limited-noise setting, we never get to observe the exact times of infection , but only a noisy version (with realization ), where all the are i.i.d., and represent the noise added to the . We assume follows a known distribution . The only restriction we put on is that it cannot have infinite value (i.e., , and we know for a fact when nodes have been infected or not).

In the extreme-noise setting, we take the previous setting to the extreme, and we assume that instead of having access to the noisy times of infection , we only have access to the infection status of the nodes . We know if was infected during cascade , and otherwise. Note that , so we can always deduce the infection status from the noisy times of infection. However, we cannot guess the noisy times of infection from the infection status: the (noisy) times of infections contain strictly more information than the infection status.

For these three settings, we call a sample the vector of all observations for the cascade . In the no-noise setting, this is the extended-integer vector . In the limited-noise setting, this is the extended-integer vector . In the extreme-noise setting, this is the boolean vector corresponding to the realization of . We also use the notation (respectively ) for the matrix representing the random variable (respectively the realizations) of all the samples.
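To make the three kinds of samples concrete, here is a small Python sketch that derives them from the exact infection times of one cascade. The function name `observe` and the noise sampler `uniform_delay` are hypothetical; the sampler only stands in for whatever known noise distribution applies.

```python
import random

INF = float("inf")

def observe(times, noise_sampler, rng):
    """Build the no-noise, limited-noise and extreme-noise samples.

    times: dict node -> exact infection time (INF if never infected).
    noise_sampler: function rng -> finite noise value (an assumption).
    """
    no_noise = dict(times)
    # Finite noise is added to infected nodes only, so a never-infected
    # node stays at INF and its infection status remains observable.
    limited = {v: (t + noise_sampler(rng) if t != INF else INF)
               for v, t in times.items()}
    # Extreme noise keeps only whether each node was infected.
    extreme = {v: t != INF for v, t in times.items()}
    return no_noise, limited, extreme

def uniform_delay(rng):
    """Hypothetical noise: a delay drawn uniformly from 0..5."""
    return rng.randint(0, 5)

exact = {0: 1, 1: 2, 2: INF}
_, noisy, status = observe(exact, uniform_delay, random.Random(0))
print(noisy, status)
```

The infection status can always be recomputed from the noisy times, but not the other way around, which matches the remark that the noisy times carry strictly more information.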

Learning tasks: We focus on two different learning tasks. When we learn the structure of a graph, it means that for any two nodes and , we can decide whether or not there exists an edge between these two nodes (whatever its direction). When we learn the weights of the graph, it means that for every two nodes and , we learn the value of both and up to precision . (Footnote 2: When , we have .)

1.2 Why is it a hard problem?

1.2.1 Counting approaches

Most approaches in the no-noise setting relate to counting. In our setting, for instance, a natural (and consistent) estimator for is to count how often an infection occurred along an edge, and divide it by how often such an infection could have happened:
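A minimal Python sketch of one such counting estimator, for the no-noise setting only. The exact form used by the authors is not shown here, so the choice of numerator and denominator below is an assumption: an "attempt" is a cascade where the first node was infected while the second was still susceptible, and a "success" is a cascade where the second node was infected exactly one step later. The name `counting_estimate` is hypothetical.

```python
INF = float("inf")

def counting_estimate(cascades, i, j):
    """Estimate the weight of the directed edge (i, j) from exact times.

    cascades: list of dicts node -> exact infection time (INF if never).
    Returns successes / attempts, or NaN when the edge was never tried.
    """
    successes = 0   # j infected exactly one step after i
    attempts = 0    # i infected while j was still susceptible
    for times in cascades:
        if times[i] == INF or times[j] <= times[i]:
            continue            # i could not have infected j here
        attempts += 1
        if times[j] == times[i] + 1:
            successes += 1
    return successes / attempts if attempts else float("nan")

cascades = [{0: 1, 1: 2}, {0: 1, 1: INF}, {0: 2, 1: 1}, {0: 1, 1: 2}]
print(counting_estimate(cascades, 0, 1))
```

On a graph where a node can have several infected parents at the same step, this sketch would credit each of them, so it should be read as an illustration of the counting idea rather than a general-purpose estimator.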


Figure 2: Possible scenarios which could have led to , and . In the no-noise setting, this implies , and there is only one possible infection pattern.

Panels of Figure 3. For each scenario, the table gives the true infection time T and the noise n of nodes i, j and k. Pattern is the name of the original panel image. In every panel, T + n equals 2, 3 and 4 for i, j and k respectively.

Panel  Pattern  i (T, n)  j (T, n)  k (T, n)
(a)    ijk      (0, 2)    (1, 2)    (2, 2)
(b)    i-jk     (0, 2)    (1, 2)    (1, 3)
(c)    kij      (1, 1)    (2, 1)    (0, 4)
(d)    jik      (1, 1)    (0, 3)    (2, 2)
(e)    j-ik     (1, 1)    (0, 3)    (1, 3)
(f)    kji      (2, 0)    (1, 2)    (0, 4)
(g)    ikj      (0, 2)    (2, 1)    (1, 3)
(h)    k-ij     (1, 1)    (1, 2)    (0, 4)
(i)    jki      (2, 0)    (0, 3)    (1, 3)

Figure 3: Possible scenarios which could have led to , and . We have . In the limited-noise setting, there are nine possible infection patterns (many more scenarios with the same infection pattern, but different noise values, are not shown).

In the no-noise setting, could only have been infected by a node signaling exactly one time step before . However, in the limited-noise setting, signaling its infection one time step after could stem from a variety of scenarios:

  • could have indeed infected : cases a), b) and c) of Figure 3 above.

  • could have infected , but the noise flipped the order of signaling: cases d), e) and f).

  • No infection happened between and , and the probability of infection depends mainly on another node : cases g), h) and i). This could happen for any other node in the graph.

The natural estimator introduced earlier is therefore not consistent anymore; instead, it tends to a quantity which depends on , but also , and as well, for all the other nodes in the graph. By counting the number of times became infected one time step after , we are not counting the number of infections along the edge anymore, but instead a mixture of all the scenarios described above, which not only include the cases where infected , but also events in which the cascade spread through another node , and the edge was irrelevant to the process. Using this estimator, or any obvious (to us) extension of it, would not only imply learning the wrong weights for the edges, but also learning edges when there are no edges. Our first contribution is therefore to design a new set of estimators, from which we can deduce the value of (Sections 2.3 and 3.2).

Adding noise to the times of infection not only reverses the cascade chronology, it also exponentially increases the number of possible infection patterns that could have happened. Bounding the realm of possibilities is therefore our second step towards solving the problem (Section 2.1).

1.2.2 Max-likelihood approaches

Another common approach is to use likelihood-based methods. For instance, in [29], the authors develop a max-likelihood-based approach to learn the edges of the graph. They prove the log-likelihood has desirable properties: it decouples into only one local problem for each node, and this local problem is convex (up to the change of variable ):

In our setting, the log-likelihood has none of these properties. It is not convex, and it is unclear any method other than brute force could find its maximum. Moreover, it does not decouple anymore, and even computing the log-likelihood itself takes exponential time.

When dealing with hidden variables, a common technique would be to use the Expectation-Maximization algorithm [8]. However, in our setting, the number of hidden states is , which can be as large as . This prohibits any realistic use of the Expectation-Maximization algorithm for networks with more than twelve nodes. Moreover, except for the recent contributions [23], very little is known about the theoretical convergence of the Expectation-Maximization algorithm.

1.3 Contributions

The contributions of this article are multiple:

  • To the best of our knowledge, we are the first to tackle the problem of learning the edges of a graph from noisy times of infection, a simple but natural extension of a well-known problem.

  • We provide the first efficient algorithm for learning the structure and weights of a bidirectional tree in this setting. We also establish a tree-specific lower bound which shows that our algorithm is sample-optimal for learning the structure of the tree (Section 2).

  • We prove it is possible to learn the structure of any bounded-degree graph in the extreme setting for which we only have access to the infection status (i.e., whether or not a node was infected). Moreover, we can do so with optimal sample complexity, according to the bound established in [29].

  • We provide polynomial-time algorithms for learning the weights of bounded-degree graphs.

  • Finally, we extend the results from bounded-degree graphs to general graphs. This proves the problem is solvable under any noise distribution, although the exponential sample complexity and running time prohibit any use of this algorithm in practice (Section 3.2).

2 Learning bidirectional trees

The bidirectional tree is the simplest example which illustrates some of the difficulties of the noisy setting. Indeed, for a directed tree, the true sequence of infections can be reconstructed, and we can use techniques from the no-noise setting. For a bidirectional tree, those techniques cannot be extended. However, the uniqueness of paths in the bidirectional tree still makes this problem considerably easier than the general setting. We therefore start by presenting a solution for the bidirectional tree. The key ideas here generalize to the neighborhood-based decomposition we introduce below, which forms our key conceptual approach for the general problem.

This section contains three contributions. First, we show how to learn the structure of a tree using only the infection status, i.e., what we call the extreme-noise setting (Section 2.1). For each cascade, we only know which nodes were infected. We show this contains enough information to learn the structure of a bidirectional tree. Second, we establish a lower bound for the no-noise setting, and show our algorithm has the same dependency on the number of nodes as this lower bound. In other words, for the task of learning the structure of any tree, an optimal algorithm in the no-noise setting would need as many cascades as our algorithm needs in the extreme-noise setting (up to constants).

Finally, we show how we can leverage this learned structure to learn the weights of the tree, this time when we have noisy access to the times of infection, i.e., the limited-noise setting (Section 2.3). We provide sample complexity for this task.

2.1 Tree structure

As illustrated in Section 1.2, the number of edges that could exist is much higher in the limited-noise setting than the number of actual edges in the tree. Our first key contribution is therefore to introduce a new estimator, , which keeps track of the fraction of cascades for which and were both infected. This estimator can therefore be computed only with the infection status in the extreme-noise setting. Using this estimator, we show that in the specific case where the graph is a tree, we learn the structure of this tree, whether or not both .

Our algorithm for learning the edges of the tree relies on one central observation: achieves a kind of local maximum if there is an edge between and (Lemma 1). This observation relies heavily on the fact that there is uniqueness of paths on a tree. Let us now dive into the proof.

Definition 1.

Let be the fraction of cascades in which both and became infected. We have:
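The estimator in Definition 1 only needs the infection status of each node, so it can be computed directly in the extreme-noise setting. A minimal Python sketch follows, assuming each cascade is given as a list of booleans (one per node); the names `co_infection_fractions` and `statuses` are hypothetical.

```python
from itertools import combinations

def co_infection_fractions(statuses):
    """Fraction of cascades in which both nodes of a pair were infected.

    statuses: list of cascades, each a list of booleans (one per node).
    Returns a dict {(i, j): fraction} for all pairs with i < j.
    """
    n_cascades = len(statuses)
    n_nodes = len(statuses[0])
    fractions = {}
    for i, j in combinations(range(n_nodes), 2):
        both = sum(1 for s in statuses if s[i] and s[j])
        fractions[(i, j)] = both / n_cascades
    return fractions

statuses = [
    [True, True, False],
    [True, True, True],
    [False, True, True],
    [True, False, False],
]
print(co_infection_fractions(statuses))
```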

We now show that the limit of the estimator satisfies a local maximum property on the edges of the tree:

Lemma 1.

If and are not neighbors, let be the path between them, with , , and . Then:


We consider the case in which both and have been infected. There is a unique source of infection and a unique path between and . Therefore, all the nodes on the path from to must have been infected as well. In particular, both and were infected. This shows , so , and therefore .

What’s more, every time and became infected, at least one more infection along an edge must have occurred in order for to become infected as well. This occurred with probability at most . Therefore, . We conclude .

This simple lemma allows us to design Algorithm 1. Indeed, suppose we have access to all the limits . By ordering them in decreasing order, we can deduce the structure of the tree by greedily adding every edge unless it forms a cycle. (Footnote 3: This algorithm is very similar in spirit to Kruskal’s algorithm for finding the maximum spanning tree.)

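1: procedure LearnTree(h)    ▷ h: limit of the estimator
2:     collect the values of h for all pairs of nodes
3:     sort the pairs by decreasing order of h
4:     start with an empty set of kept pairs
5:     for each pair, in sorted order do
6:         if adding the pair to the kept set does not create a cycle then
7:             add the pair to the kept set
8:     return the kept set
Algorithm 1 Learn the undirected edges of the tree.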
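Algorithm 1 can be sketched in Python with a union-find structure for the cycle check, as in Kruskal's algorithm. This is an illustrative implementation, not the authors' code: the name `learn_tree` and the input format (a dict from node pairs to co-infection values) are assumptions.

```python
def learn_tree(h):
    """Greedy, Kruskal-like recovery of the undirected edges of a tree.

    h: dict mapping an unordered pair (i, j) to its co-infection value.
    Returns the set of pairs kept by the greedy procedure.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = set()
    # Examine pairs from the most to the least frequently co-infected.
    for (i, j) in sorted(h, key=h.get, reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:          # adding (i, j) creates no cycle
            parent[root_i] = root_j
            kept.add((i, j))
    return kept

# Hypothetical values for the path 0 - 1 - 2 - 3: every non-edge has a
# smaller value than each edge on the path between its endpoints.
h = {(0, 1): 0.30, (1, 2): 0.50, (2, 3): 0.10,
     (0, 2): 0.25, (1, 3): 0.08, (0, 3): 0.05}
print(sorted(learn_tree(h)))
```

In this example the non-edge (0, 2) is examined before the true edge (2, 3), and is rejected because both edges on its path were already kept, which is the mechanism used in the proof of Lemma 2.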

We show that if we have access to the limits of the estimators , the algorithm above correctly finds the structure of the tree.

Lemma 2.

Algorithm 1 correctly finds all the pairs such that there exists at least one directed edge between and .


We show that in the for-loop at line 5, we add an edge to if and only if this edge was a real edge in the original tree. We prove it by induction on the elements of the sorted list .

When no element has been selected, the proposition is trivially true.
Suppose now that elements of have been examined so far. Let be the element. Two cases arise:

  1. and are not neighbors. Let be the path between them, with and . In this case, using Lemma 1, . In other words, all the pairs have already been considered by the algorithm. By induction, we have kept all of them in . Therefore, adding the pair would form a cycle. This pair is not kept in , which is what we wanted since it is not an edge in the original tree.

  2. and are neighbors. Suppose that adding this pair forms a cycle. Then there is a sequence of nodes such that were all bigger than , and the pairs were kept by the algorithm for all . However, by uniqueness of paths in a tree, there exists a pair such that the path connecting and in the original tree goes through . Using Lemma 1, this means , which is a contradiction. Therefore, adding this pair in does not form a cycle. This pair is kept in .

Therefore, this algorithm keeps all the edges, and only the edges of the tree, so it recovers the tree structure.

We next quantify how many cascades are needed for Algorithm 1 to be correct if we replace the by their estimates . We note that we do not require to be close to their limit, but only need the order of the to be the same as the order of the . We identify events which guarantee that the order is the same (Corollary 1):

Definition 2.

Let be the set of triplets of nodes such that at least one directed edge exists between the first and the second node, as well as between the second and the third node.

Proposition 1.


then for all paths in the tree, with , we have:


For , we have by hypothesis . Now, we recall that is the number of cascades for which both and were infected. By uniqueness of paths in the tree, every time both and were infected, both and must have been infected as well. This shows that . Notice that this is a deterministic property, not an asymptotic property. Therefore, .

For , we follow an identical reasoning, but with .

Corollary 1.


then the correctness of Algorithm 1 is preserved when given as input instead of h.

In other words, Algorithm 1 outputs a correct set of undirected edges with finite samples.


According to Proposition 1, for all paths in the tree, with , we have that . As shown in the proof of Lemma 2, this is the only property of the input needed in order to yield the correct output.

Proposition 2.

With cascades, with probability at least , we have:


Let us consider one triplet in . We recall that is the number of cascades for which both and were infected. Since the only path from to is through , we always have that . We notice that to obtain , we only need one cascade for which both and got infected, but not . We lower bound the probability of this cascade happening. For each cascade , we have:

The probability that this event never occurs during the cascades is upper bounded by:

Now, there are edges in a tree, therefore . By union bound:

Notice that contains both and . We have therefore proven that with probability at least , when considering cascades, we have

Putting together Proposition 2 and Corollary 1, we obtain our first theorem for learning the undirected edges with finite samples:

Theorem 1.

With cascades, with probability at least , we can learn the structure of any bidirectional tree in the extreme-noise setting, i.e., when we only have access to the infection status of the nodes.

2.2 Lower bound

In this section, we prove a lower bound for trees in the no-noise setting. With very minor adjustments, we adapt the lower bound of [29]. Since for a general tree, the max degree can be up to , we design a lower bound which is independent from the max-degree. Let be a tree drawn uniformly from , the set of all possible trees on nodes, and be the reconstructed graph from the times of infection.

therefore forms a Markov chain. We have:

(data processing inequality)
(independent cascades)
(Fano’s inequality)

Since is drawn uniformly from , . There are trees on nodes, according to Cayley’s formula [5], so .
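For reference, the counting step can be written out as follows. The notation is assumed rather than taken from the original formulas: N denotes the number of nodes, \mathcal{T}_N the set of labeled trees on N nodes, G the uniformly drawn tree and \hat{G} its reconstruction. The second line is the standard textbook form of Fano's inequality for a uniform prior, not necessarily the exact form used in the derivation above.

```latex
% Cayley's formula and the entropy of a uniformly drawn tree
% (N and \mathcal{T}_N are assumed notation).
|\mathcal{T}_N| = N^{N-2}
\quad\Longrightarrow\quad
H(G) = \log |\mathcal{T}_N| = (N-2)\log N .

% Standard form of Fano's inequality for G uniform on \mathcal{T}_N.
\mathbb{P}\bigl(\hat{G} \neq G\bigr)
\;\geq\; 1 - \frac{I(G;\hat{G}) + \log 2}{\log |\mathcal{T}_N|} .
```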

In conclusion:

Using the same kind of techniques as in [29], we can assume . Therefore:

Theorem 2.

In the no-noise setting, we need cascades to learn the tree structure.

In our extreme-noise setting, when we have only access to the infection status of the nodes, we can learn the tree structure with the same sample complexity as the no-noise setting!

2.3 Tree weights

In this section, we assume we are in the limited-noise setting, and we have access to the times of infection. We also assume we have already learned the structure of the tree.

Once we have reduced the set of possible edges by learning the structure of the bidirectional tree, learning the weights of the edges is still non-trivial. Indeed, from and , it is still impossible to know whether this sample is useful for estimating (case when infected ), or whether we should use this sample for estimating instead (case when infected ). What is more, we only get one sample per node and per cascade, so it is impossible to know what really happened during that cascade. However, knowing the distribution of the noise, it is possible to compute the probability that the noise maintained the order of infections. Using this information and the reduced set of known undirected edges, we can compute two sets of estimators, from which it is possible to infer the weights of all edges in the tree.
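As an illustration of the last point, the probability that noise keeps two reported times in their true order can be computed directly from the noise distribution. The sketch below assumes a discrete noise distribution given as a probability mass function, and a true gap of one time step between infector and infectee; the name `prob_order_preserved` is hypothetical, and this is only the probability computation mentioned above, not the estimators defined in this section.

```python
def prob_order_preserved(noise_pmf, gap=1):
    """P(n_a - n_b < gap) for i.i.d. noises n_a, n_b with the given pmf.

    If the infector is infected `gap` steps before the infectee, this is
    the probability that its reported time is still strictly earlier.
    noise_pmf: dict mapping each possible noise value to its probability.
    """
    return sum(p_a * p_b
               for a, p_a in noise_pmf.items()
               for b, p_b in noise_pmf.items()
               if a - b < gap)

# Hypothetical noise: a delay of 0 or 1 step with equal probability.
print(prob_order_preserved({0: 0.5, 1: 0.5}))
```

With this two-valued noise the order is lost only when the infector is delayed by one step and the infectee is not, which already happens a quarter of the time.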

We introduce these two sets of estimators, or, in other words, two estimators for each directed edge. These estimators tend to multivariate polynomials of the weights of most edges of the tree. Thus in general these polynomials have exponentially many terms; however, when and are neighbors, it is possible to express them concisely using a quantity , which we define formally below. This succinct representation is the key idea we exploit to solve the resulting system of equations.

Once we know the structure of the bidirectional tree, we can consider the four estimators for each undirected edge (two estimators for each directed edge). They form a system of four equations and four unknowns, which we solve to obtain the weights of the edges.

Definition 3.

is the probability that became infected before any node on the path from to , including , became infected.

We now introduce the estimators:

Definition 4.

We introduce 2 sets of estimators:

By the law of large numbers, as the number of cascades scales,

tends to and to , where

We now compute the exact values of these two quantities. Let us assume the (unique) path between and has length . We call the set of nodes on the path from to , with and . We then have:

Lemma 3.

Recall and are the expectation of the estimators defined above. We have:

What’s more, when and are neighbors (which implies ), the expressions simplify to:


This expression is involved in general. However, if and are neighbors, then there are no nodes on the path between and , other than and themselves. What’s more, , and . Therefore:

Let us now focus on :

As before, this expression is complex in general, but simplifies if and are neighbors, in which case:

Using the simplified expression only for when and are neighbors, we obtain:

Proposition 3.

If we know is an edge in the original tree, then the probability of infection along this edge is given by:


According to Lemma 3, we have four second-order equations with four unknowns: , , and . We solve this system and obtain the desired result. See Appendix A for details.

Combining all the pieces, we obtain our first theorem for infinite samples:

Theorem 3.

It is possible to learn the weights of a bidirectional tree in the limited-noise setting.

Now that we have proven the problem is solvable, we establish the number of samples needed to learn the weights with the method above.

Lemma 4.

With samples, with probability at least , we have:


Using Hoeffding’s inequality:

Choosing , we have that with probability at least , all the following hold:

Hence, with probability at least , we have (see Appendix A for details):

We use the results from Lemma 3 to bound the denominator by . In the end, we obtain:

We choose . Therefore:
With samples, with probability at least , we have .
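To give a feel for the kind of sample size Hoeffding's inequality produces, the sketch below computes the number of i.i.d. samples bounded in [0, 1] needed for an empirical average to be within epsilon of its mean with probability at least 1 - delta, using the standard two-sided bound 2 exp(-2 M epsilon^2). The constants in Lemma 4 are not reproduced here; the name `hoeffding_samples` and the parameters `epsilon` and `delta` are assumptions.

```python
import math

def hoeffding_samples(epsilon, delta):
    """Smallest M such that 2 * exp(-2 * M * epsilon**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

print(hoeffding_samples(0.1, 0.05))
```

The count grows as 1 / epsilon^2 but only logarithmically in 1 / delta, which is why a union bound over all edges adds only a logarithmic factor.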

By a union bound on all the weights of the tree, knowing there are at most directed edges in a directed tree, we obtain the following sample complexity:

Theorem 4.

With cascades, with probability , we can learn all the weights of the edges of a bidirectional tree in the limited-noise setting, i.e., when we only have access to the noisy times of infection.

3 Bounded-degree graphs

In the previous section, the algorithm presented relies heavily on the uniqueness of paths. This property implies that we can deduce the edges from the nodes which are co-infected the most often. However, this is not true for a general bounded-degree graph. In Figure 4, we can see that the two nodes and would be co-infected frequently despite not sharing an edge. This makes the task of finding the structure much more challenging than for the bidirectional tree.

In this section, we show how the main ideas for learning the structure of the bidirectional tree can be extended to learn the structure of general bounded-degree graphs, in the extreme-noise setting, with optimal sample complexity. The framework for learning the weights of the edges in the limited-noise setting is, to the best of our knowledge, not extendable to general bounded-degree graphs; we therefore develop a new algorithm to learn the weights for general bounded-degree graphs.

[width = 0.5]lotsEdges-clean.png

Figure 4: Two nodes can be co-infected frequently without sharing an edge.

3.1 Bounded-degree structure

In the previous section, we introduced the estimator , which records the fraction of cascades for which both and are infected. From a local maximum property of this estimator, we deduced the structure of the tree, in a sample efficient fashion. Indeed, if there exists a path between and , and the first edge on this path is , then if and are infected, must have been infected as well.

We want to build on this idea for a bounded-degree graph of maximum degree . However, for such a graph, there may be multiple paths leading from to , and we cannot guarantee a single node will be infected each time. However, if is a node, is its neighborhood, and is another node of the graph, we can guarantee that if both and are infected, there exists a node in