Information, behavior, and diseases often spread over the edges of an underlying network [2, 3, 10, 18]. On Twitter or Facebook, users may share information with their friends and followers [20, 30]; in communities or countries, individuals infected with HIV or Ebola may spread the disease to other people and other regions by direct physical contact [16, 24, 33, 36]. As more data become available and researchers are able to trace the connections between individuals and recover information about specific transmission events, several novel statistical questions have emerged concerning estimation of unknown parameters governing the stochastic spread. Some scientific questions of interest include the following: Which individuals are at the greatest risk of being infected by HIV in a relationship network? What treatments are effective at slowing the spread of disease from one region or person to another [13, 17]? Applying the same probabilistic models to an entirely different setting may allow researchers in online marketing and advertising to devise an optimal advertising budget for product proliferation in online social networks [12, 21, 39]. Although a sizable amount of literature exists for answering these questions based on a particular network and stochastic spreading model [9, 38, 41], relatively little existing work addresses the problem of estimating a particular stochastic model based on infection transmission data.
In the statistics literature, most research in network science has focused on theory and applications of graphical model estimation [27, 44]. However, the equally important goal of performing further estimation or inference procedures based on a known (or estimated) graph structure has been largely unaddressed, with an exception being graph partitioning approaches such as stochastic block modeling [1, 45, 46]. Another novel aspect of our work is that we do not assume our data are collected in an independent, identically distributed manner, as might be the case if one were observing multiple diseases spread on the same graph. In contrast, our inference procedures are based on information collected about a single epidemic outbreak, which is more realistic in various applications. Indeed, although different diseases may spread over the same network, the spreading behavior of each disease may be quite different, or the disease may exhibit different characteristics during successive spreads; hence, the goal is simply to model the propagation of the disease under consideration. Notably, although epidemiology has traditionally been an active application area in statistics [19, 26, 37], relatively few methods exist in previous literature that take into account network structure within a population of interest.
Outside of statistics, several attempts have been made to estimate stochastic models for the spread of information and disease, but they are far from comprehensive. In one line of work, each vertex may only be infected once, and time information providing the order in which individuals were infected is unavailable. This model, applicable to settings such as HIV, is considered in , as well as in our own previous work . Another variation relies on observing multiple independent, identically distributed cascades, where a cascade is an infection in which each vertex may be infected only once [15, 34]. While the notion of cascades was likely inspired by applications in social networks, obtaining accurate data from actual cascades is often much more complicated . Finally, Bayesian approaches that leverage virus DNA sequences have been used in the epidemiology literature [14, 29]. However, these techniques require manually segmenting time into epochs. Furthermore, none of the methods described above for parameter estimation come with statistical guarantees concerning inference.
In this paper, we consider a parametrized model of transmission inspired by percolation and gravity models. Our goal is to perform statistical inference for the unknown parameter vector that determines the edge weights which govern the spread of the infection. Furthermore, we associate each edge of the graph to a vector of known covariates. This allows us to answer questions such as the following: In the week leading up to a sports match, writers and fans post on Twitter. Are friend-follower relationships important for the order in which posts occur? Do previous interactions predict future interactions? As a second example, suppose we wish to reconstruct the spread of Ebola in West Africa. Does the reconstruction reveal which factors, such as proximity or language, best correlate with the spread of the disease? This paper provides methods to analyze such datasets. Our analysis crucially leverages theoretical properties of Pólya urn processes: Specifically, the counts and weights of balls of a given color in the urn are separated, and the weights are to be inferred based on observing ball counts. To the best of our knowledge, this decoupling between counts and weights has not been previously studied in urn theory and provides a modeling tool of independent interest. We then allow for two scenarios: one in which we obtain the order of vertex infections, and another in which we observe an unordered set containing infection information.
Our main contributions are to analyze the maximum likelihood estimator and study its asymptotic properties. This allows us to construct confidence intervals and test the importance of covariates. Furthermore, we show that maximum likelihood estimation amounts to optimizing a convex objective function, which is therefore computationally tractable. The case where order information about successive transmissions is unavailable is far more difficult, since the maximum likelihood estimator becomes intractable to compute. In this scenario, we derive two equations that lead to algorithms for computing parameter estimates. Additionally, we provide a class of examples for which the two estimators coincide. Finally, we explore the empirical performance of our algorithms using synthetic spreading data and historical data obtained from the Ebola spread.
The paper is organized as follows: In Section 2, we define our infection model. In Section 3, we introduce and state the properties of the maximum likelihood estimate. Section 4 offers two extensions of the maximum likelihood estimation approach. Section 5 contains estimators for the case of unordered infection data. In Sections 6 and 7, we apply our methods to synthetic and real data, respectively. Finally, we conclude with a discussion of further research directions in Section 8. Due to space constraints, we provide proofs of all our theoretical results and tables of numerical results in the Appendix.
2 Background and problem setup
We begin by defining some basic notation. For a vector , we write to denote the diagonal matrix with diagonal entries equal to . For two vectors and , we use to denote entrywise multiplication and to denote entrywise division.
2.2 Stochastic spreading model
Let denote a directed graph, with vertex set and edge set , where we allow for self-loops. Let . We also often identify the edges of with the enumeration . Each edge is assigned a covariate vector and a nonnegative weight , possibly equal to a function of . We require that our graphs be strongly connected, meaning that between any two vertices, there is a path across edges of nonzero weight in each direction. We collect the covariate vectors as rows in a covariate matrix . Let denote the set of out-neighbors of .
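The strong-connectivity requirement can be checked directly. Below is a minimal sketch, assuming the graph is stored as an edge list and a weight dictionary (hypothetical containers, not notation from the paper): it runs one depth-first search on the positive-weight subgraph and one on its reverse, from an arbitrary start vertex.

```python
def strongly_connected(vertices, edges, weights):
    """Check strong connectivity over edges of nonzero weight.

    The graph is strongly connected iff, from any single vertex, every
    vertex is reachable in the forward graph and in the reverse graph.
    """
    pos = [(u, v) for (u, v) in edges if weights[(u, v)] > 0]
    fwd, rev = {}, {}
    for u, v in pos:
        fwd.setdefault(u, []).append(v)
        rev.setdefault(v, []).append(u)

    def reachable(adj, s):
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    s = vertices[0]
    return reachable(fwd, s) == set(vertices) == reachable(rev, s)
```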
Next, we define the infection process spreading over the edges of . At time , the infection originates from a single randomly chosen vertex . At each subsequent time , an infected vertex is chosen among the set of previously infected vertices, and infects another vertex along the edge . We denote the -field of random infection events up to time by , and we denote the vector of ordered infection data by . In the case of unordered infection data, we denote the set of unordered infection spreads by . We will also briefly discuss the case where only the ordered set of infected nodes is observed, without information about which nodes were the sources of transmission.
Importantly, a given node may be infected any number of times. Drawing an analogy to the scenario of epidemic spreading, we may treat each node as a different community, in which the infection count of a certain community may increase over time as more individuals become infected with the same disease. Transmissions between individuals in the same community correspond to self-loops in the edge infection data.
We now define the specific measure for our infection process that will be studied in this paper. Let denote the number of times vertex has been infected prior to time , and set for . We define the infection probabilities by
In other words, the first infected vertex is chosen uniformly at random. Each subsequent vertex is chosen to be infected with probability proportional to the weights entering
from all previously infected vertices. This is equivalent to a continuous-time infection process where each new infection instantiates an independent exponential random variable along all outgoing edges with parameter proportional to the corresponding edge weight, corresponding to the waiting time until its neighbor becomes infected.
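For intuition, the discrete-time dynamics can be simulated directly. The sketch below uses hypothetical containers for the graph and weights; each transmission is drawn with probability proportional to the source vertex's accumulated infection count times the edge weight, consistent with the proportional-selection rule described above.

```python
import random
from collections import defaultdict

def simulate_spread(edges, weights, T, seed=0):
    """Simulate T steps of the reinforced spreading process.

    `edges` is a list of directed edges (u, v); `weights` maps each
    edge to a positive weight.  At each step, an edge (u, v) whose
    source u is already infected is drawn with probability
    proportional to N_u * w_{(u,v)}, where N_u is the number of
    infections u has accumulated so far.
    """
    rng = random.Random(seed)
    vertices = sorted({u for u, _ in edges} | {v for _, v in edges})
    counts = defaultdict(int)
    counts[rng.choice(vertices)] = 1  # initial infection, uniform over vertices
    history = []
    for _ in range(T):
        avail = [(u, v) for (u, v) in edges if counts[u] > 0]
        probs = [counts[u] * weights[(u, v)] for (u, v) in avail]
        (u, v), = rng.choices(avail, weights=probs, k=1)
        counts[v] += 1
        history.append((u, v))
    return history, dict(counts)
```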
We will focus on the specific case where the weight function takes the form
where is a parameter vector in . This formulation resembles the exponential form of infections given by  and is useful for interpretability of covariates affecting a spreading process. Furthermore, as explained in Remark 9 below, the case of more general weight functions may be recast in terms of the exponential weight formulation (2) when the weights all lie in the interval . For more details, see Section 4.1.
Finally, we define the notation we will use for conditional and limiting measures. We write to denote the measure defined by equations (1) and (2), where is the underlying parameter. We will generally use to denote the true parameter. We define as shorthand for the conditional measure , meaning that the identity of the next infected edge is drawn conditional on the infection history . We will often use the vector to denote the vector of edge covariates corresponding to the edge drawn according to . In detail, the probability space of the entire infection process is given by , so is a random measure; since random variables are functions of their sample space, is defined on .
We also define the limiting measure . A priori, it is not clear that such a limit must exist, but we will subsequently derive the existence of the limit as a consequence of our Pólya urn theory. We will also refer to the limiting measure, which turns out to be non-random, as , when considered as a vector.
2.3 Generalized Pólya urns
In this section, we briefly review the theory of generalized Pólya urns, which is an important component of our theoretical analysis . In a generalized Pólya urn, we have an urn containing balls of different colors. In addition, each ball is labeled with a positive real-valued weight. The evolution of the urn is governed by a replacement matrix . At each time , a ball is drawn from the urn with probability proportional to its weight. If a ball of color is drawn, it is returned to the urn along with one ball of color and weight , for each such that .
To relate this urn model to the spreading process described by equation (1), consider a generalized Pólya urn defined by the edges of the infection graph. The colors of the balls in the urn correspond to the edges. We now describe the replacement matrix : The rows and columns of are indexed by the edges of the graph. For each pair of edges , where and , we define to be nonzero exactly when , in which case . It is not hard to see that successive infections on the graph follow the same probabilistic mechanism as the urn evolution, where corresponds to the number of balls for edge present in the urn at time . Thus, the goal is to infer the weight matrix of the Pólya urn based on ball counts. Note that if the first infected vertex is , we can consider the urn as being initialized with one ball of weight for each edge , where is in .
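To make this correspondence concrete, the replacement matrix can be assembled explicitly from the graph. The sketch below assumes hypothetical edge-list and weight-dictionary inputs; the entry for a pair of edges (a, b), (u, v) is nonzero exactly when b = u, in which case it equals the weight of (u, v).

```python
import numpy as np

def replacement_matrix(edges, weights):
    """Build the urn replacement matrix R indexed by edges.

    Drawing a ball of color (a, b) adds, for each out-edge (b, v) of
    the newly infected vertex b, one ball of color (b, v) carrying
    weight w_{(b, v)}.
    """
    m = len(edges)
    idx = {e: i for i, e in enumerate(edges)}
    R = np.zeros((m, m))
    for (a, b) in edges:
        for (u, v) in edges:
            if b == u:
                R[idx[(a, b)], idx[(u, v)]] = weights[(u, v)]
    return R
```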
2.4 Relation to gravity models
Finally, we comment on the relationship between our proposed model and the problem setup adopted in the literature on gravity models. The term “gravity model” is used broadly to describe a model for a quantity that depends proportionally on certain quantities and inversely on certain others. The terminology is inspired by Newton’s law of universal gravitation, in which the gravitational force between two point masses is proportional to the product of the two masses and the inverse of the square of the distance between them.
In social science, a seminal work on gravity models observed that the number of graduates of a university depends proportionally on the population of the region and inversely on the distance from the university . The book by  considers gravity models in great detail. More recent works such as  have developed gravity models for susceptible, infected, recovered (SIR) models of network infection. The underlying random variables are generally modeled as binomial, and the estimation procedure consists of fitting parameters that control the spread of disease from one region to another.
Similar to our setup, any subsequent infection occurs over the edge with probability proportional to some . In addition, the number of infections over under equation (3) over a pre-specified period of time, such as a day, is a Poisson random variable with mean , as is the case for the standard gravity model of .
However, a key difference between the model adopted by this approach and the setting of our paper is the reinforcing aspect of our process. In our model, the function increases monotonically in , meaning that the amount of infection at each node accumulates over time. In contrast, the gravity model of  does not contain this feature, and is better suited for applications such as modeling road traffic between two cities, which might reasonably be constant when adjusted for seasonality. Similarly, the spread of influenza considered in  might fit well into this gravity model framework, partly because influenza is quite ubiquitous and the infection process is somewhat stationary.
However, cases may exist where these assumptions are unreasonable. In particular, the historical spread of Ebola was observed to pass through various phases where, at first, it was not widespread enough to travel over most edges . In order to facilitate the methods described above, the Ebola spread was manually split into three phases, and separate parameters were fit for each phase. However, the information on where to partition the infection data was determined by first inspecting the data, leading to unaddressed questions involving complicated dependencies and post-selection inference. The reinforcement aspect built into the Pólya urn formulation obviates the need to manually segment data. Another important distinction between our work and the literature on gravity models is that we provide rigorous statistical theory for the estimation algorithms we propose. Lastly, note that we consider a discrete-time model of infection spread rather than a continuous-time model, for convenience; hence, one downside of our model is that we are not able to infer how the rate of spreading might evolve in continuous time.
3 Maximum likelihood estimation
3.1 Log-likelihood expression
We now investigate the problem of maximum likelihood estimation for the model defined above. Let denote the likelihood of , computed with respect to the parameter vector . The log-likelihood for the exponential parametrization is then given by
Importantly, note that the objective function (4) is concave in , since it is a difference of linear terms and log-sum-exp functions . Hence, we may compute the maximum likelihood estimator efficiently via convex programming. More details are contained in Section 6.
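As a minimal numerical sketch, the estimator can be computed by minimizing the negative log-likelihood implied by the proportional-selection rule of equations (1) and (2); the array inputs below are hypothetical, and we use scipy's BFGS routine for illustration rather than the CVXPY-based solver of our experiments.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, X, chosen, counts_before):
    """Negative log-likelihood of ordered edge data, exponential weights.

    X: (m, p) covariate matrix, one row per edge.
    chosen: length-T sequence of drawn edge indices.
    counts_before[t, e]: infection count of edge e's source vertex just
    before step t (zero when the edge is unavailable).
    """
    logw = X @ theta  # log edge weights under the exponential parametrization
    ll = 0.0
    for t, e in enumerate(chosen):
        avail = counts_before[t] > 0
        # log of the normalizing sum over currently available edges
        logZ = np.log(np.sum(counts_before[t, avail] * np.exp(logw[avail])))
        ll += logw[e] - logZ
    return -ll

def mle(X, chosen, counts_before):
    p = X.shape[1]
    res = minimize(neg_log_lik, np.zeros(p),
                   args=(X, chosen, counts_before), method="BFGS")
    return res.x
```

Since the objective is convex, any local minimizer found this way is a global maximizer of the likelihood.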
In the succeeding sections, we will consider the existence, computability, and statistical properties of the maximum likelihood estimator. Note that we denote the true parameter by and the maximum likelihood estimator by .
We first establish conditions for the existence of the maximum likelihood estimator. We make the following definition:
We say that the data satisfy the suboptimal sampling condition if for each nonzero , either
there exists a time and an edge such that and ; or
for all and such that , we have for some constant , independent of and .
Note that the condition simply means the source vertex of has accumulated at least one infection prior to time . Furthermore, the condition is equivalent to the condition ; i.e., under parameter vector , the edge has heavier weight than .
We then have the following theorem, which states that the suboptimal sampling condition is both necessary and sufficient to guarantee the existence of a maximizer:
The suboptimal sampling condition is satisfied if and only if the likelihood function attains a maximum.
The proof can be found in Appendix C. The suboptimal sampling condition is inspired by a relatively simple idea: In order to compare the propensity of one edge to spread infection versus another, we need instances in which infections are possible across both edges. Furthermore, we need instances of both spreads happening, or else our estimator would attempt to assign a probability of to spreading across a particular edge.
Currently, we do not have a simple method to check whether the suboptimal sampling condition holds by hand, but it is possible to check the condition by determining whether an optimization problem with an arbitrary objective and linear inequality constraints is feasible.
We now return to the possible non-existence of the maximum likelihood estimator. To make an instance of non-existence more precise, we offer an easy corollary to Theorem 2:
Suppose there is a such that
for all and such that . Then the maximum likelihood estimator does not exist.
As a basic example in one dimension, if is always the largest of the ’s with ; i.e., all spreads happen on the edges with the heaviest weights, the condition (5) is satisfied with equal to any vector with strictly positive entries. Corollary 4 also illustrates the meaning of the term “suboptimal sampling condition,” since it gives a case where only the heaviest-weight vertices receive a transmission event on any time step.
Our goal is to show that the log-likelihood is strictly concave. We start by establishing a helpful representation for the Hessian of the log-likelihood:
Let be sampled from the ’s with probability proportional to . Let denote the covariance matrix of under . The Hessian of the log-likelihood is given by
In particular, if for some , the log-likelihood is strictly concave.
The proof proceeds via a straightforward calculation and is provided in Appendix C.2. The next proposition provides a sufficient condition for having full-rank covariance matrices.
Suppose for all , and suppose . If there is a vector satisfying
then . Otherwise, we have .
The proof is somewhat involved computationally and is deferred to Appendix C.2. In general, we expect that , unless is degenerate and has a solution to .
3.4 Asymptotic theory
We now turn to our main statistical result, which is a theorem establishing asymptotic properties of the maximum likelihood estimator.
Suppose that and for any . If is strongly connected and is -dimensional, then the maximum likelihood estimator exists for sufficiently large . Furthermore, it is consistent, and we have the convergence in distribution
where the coordinate of is
The proof of Theorem 7 is contained in Appendix A.2. Although the general approach of deriving asymptotic normality resembles the techniques used in the standard theory of -estimation in a multivariate setting, several technical challenges arise due to the fact that our objective function is not simply a sum of independent, identically distributed terms.
Recall from Corollary 4 that the maximum likelihood estimator may not exist. However, as tends to infinity, Theorem 7 guarantees that the maximum likelihood estimator exists, meaning that the suboptimal sampling condition mentioned in Theorem 2 must eventually be satisfied for sufficiently large . This agrees with the intuition that as increases, the probability of spreads happening exclusively on edges of largest weight decays to 0.
4.1 General weight functions
We now briefly discuss extensions of the maximum likelihood theory we have derived for exponential weights to slightly more general frameworks. We begin by considering the maximum likelihood estimator for non-exponentially parametrized weight functions. We propose a two-step algorithm, described in Algorithm 1, to compute a maximum likelihood estimator:
A simple argument shows that Algorithm 1 computes a maximum likelihood estimator in the case of general weights, if it exists: If is not in and , the value of the likelihood is not decreased by setting . Now consider in . Clearly, the weight must be positive in order for the likelihood to be nonzero. As a result, we can then define . It is easy to see that maximizing the reparametrized likelihood then maximizes the original likelihood.
, so the maximum likelihood estimator is not necessarily unique. This should not be surprising, however, since we already know the weights are only unique up to scaling. To amend this problem, suppose we fix the value of the last edge weight to remove one degree of freedom. Maximizing the likelihood then amounts to solving the maximum likelihood estimation problem with
By Proposition 6, we then have , guaranteeing that the maximum likelihood estimator for the new likelihood function is unique.
We also remark that although the method for general weight estimation is mathematically rigorous and leads to accurate methods for simulating future spread of the disease, it may be more difficult to interpret the meaning of the estimated coefficients. This is unlike the case of the exponential weights parametrization, in which we can interpret the relative sizes of components in the parameter vector as providing the relative importance of various edge covariates.
Finally, we return to the setting of a -dimensional exponential parametrization. Suppose we have obtained weight estimates using Algorithm 1. To obtain an exponential parametrization, we simply perform the following projection, where the logarithm of is taken componentwise:
Recall that is only unique up to scaling, so we may wonder about the consequences of this scaling on the projection. However, the arbitrary scale factor is accounted for by adding an intercept term to when taking the projection. Indeed, note that scaling results in adding a constant to the coordinates of , which is eliminated by the intercept term.
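The projection described above is an ordinary least-squares regression of the componentwise log-weights on the covariates with an intercept. A minimal sketch, with hypothetical array inputs:

```python
import numpy as np

def project_to_exponential(w_hat, X):
    """Project estimated weights onto the exponential parametrization.

    Solves min over (a, theta) of || log(w_hat) - a*1 - X theta ||_2:
    the intercept a absorbs the arbitrary scaling of w_hat, since
    scaling w_hat adds a constant to its componentwise logarithm.
    """
    m, p = X.shape
    A = np.hstack([np.ones((m, 1)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, np.log(w_hat), rcond=None)
    return coef[0], coef[1:]             # intercept, theta
```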
Naturally, accurate estimates of translate into accurate estimates of . To rigorize this notion, we state the following result, which follows immediately from the continuity and measurability of :
Let be the true parameter of dimension , and define . If is consistent for , then is consistent for .
As a result, we have a suitable way of recovering the effects of covariates, provided we have a consistent estimator for general weights.
4.2 Unknown sources
We now consider an extension to the case where the set of infecting vertices is unknown. The probability of an infection set is then
The process described in equation (8) also parallels the evolution of a generalized Pólya urn model, where the balls correspond to vertices rather than edges. Here, the replacement matrix is , and each entry is simply equal to the weight function .
However, we encounter one complication concerning the maximum likelihood estimator: the log-likelihood is no longer convex in general. We mention a special case in which the theory described in the previous sections may be applied directly. Suppose the weight function is known to be constant in . If we consider the exponential parametrization for some vectors , we obtain the log-likelihood
5 Estimation without order information
5.1 Problem setting
Finally, we examine the substantially more difficult setting where edge transmissions are unordered: Instead of receiving information , we observe , the unordered set of edges across which infection has spread.
We first note that the infection could have occurred over a variety of different paths, i.e., possible orderings of elements of that could produce an infection vector . Since the computation of the likelihood involves summing over all possible infection paths, computing the likelihood will often be intractable for large graphs. For instance, for the complete graph, this leads to summands. In addition, the log-likelihood takes a more complicated form and may not be concave in general; more precisely, the log-likelihood results in the logarithm being applied to an additional sum, leading to a composition of a convex and a concave function. Since this is in general not concave, we do not automatically obtain an efficient method for computing the maximum likelihood estimator even if we ignore the intractability of the sums. Thus, we need an alternative solution. We consider the case of general weights, for which we offer two approaches.
5.2 Using a limiting distribution
The first approach is to derive the unknown weight vector as the limit of a quantity computed using finite samples. Since may be scaled arbitrarily, we set . For an edge , we write to denote the number of transmissions across edge prior to time , leading to the vector . Denoting , we have the following theorem:
We have the almost-sure convergence
This motivates the following algorithm. Note that the algorithm uses the vectors and , which contain the cumulative infection information over all time steps.
Unfortunately, a method for obtaining confidence intervals or performing other statistical inference procedures based on the estimates obtained by Algorithm 2 appears to be more complicated than in the case of the maximum likelihood estimator.
5.3 Using a fixed-point equation
The second method leverages the observation that the limiting distribution
is the leading left eigenvector of the weight matrix
, when normalized to be a probability distribution (cf. Lemma 15 in Appendix A.1). Since the weight matrix only depends on and the topology of , which we assume is fixed, we can denote the desired left eigenvector by . Furthermore, by Lemma 16 in Appendix A.1, the true weights must then satisfy
where . One way to view equation (10) is as a fixed-point equation, which we may attempt to solve in an iterative manner: Given some , we can compute the weight matrix and its eigenvector . By using (or a finite-sample estimate of this quantity), we can compute a weight vector from equation (10). An algorithm based on this idea is provided in Algorithm 3, where we use to denote the initial guess for the distribution . Again, the algorithm uses the vector , which contains the cumulative infection information over all time steps.
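The iteration can be sketched generically. Since the precise map in equation (10) depends on the graph, the sketch below takes the weight-update and matrix-assembly steps as user-supplied callables (hypothetical stand-ins, not functions from the paper); the inner step computes the leading left eigenvector by power iteration.

```python
import numpy as np

def leading_left_eigenvector(W, iters=500):
    """Leading left eigenvector of a nonnegative matrix W, computed by
    power iteration and normalized to a probability distribution."""
    pi = np.full(W.shape[0], 1.0 / W.shape[0])
    for _ in range(iters):
        pi = pi @ W
        pi /= pi.sum()
    return pi

def fixed_point_iteration(pi0, weights_from_pi, matrix_from_weights, iters=50):
    """Iterate pi -> leading left eigenvector of the weight matrix
    induced by the current weights, in the spirit of equation (10)."""
    pi = pi0
    for _ in range(iters):
        w = weights_from_pi(pi)                 # recover weights from pi
        pi = leading_left_eigenvector(matrix_from_weights(w))
    return pi
```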
Some natural questions are whether equation (10) always has a fixed point with good statistical properties, and whether Algorithm 3 converges to such a fixed point. Unfortunately, such an analysis appears to be beyond the scope of this paper. A simulation study is provided in the experiments of Section 6.
A natural initialization for Algorithm 3 is the empirical proportion of edge transmission counts . In fact, we can show that this initialization gives rise to a fixed point in a specific case:
Let be the directed cyclic graph on vertices, and consider the associated urn scheme. The empirical distribution
leads to a fixed point of the equation
where the weights are given by
The key fact underlying the analysis of Theorem 12 is that counts the number of times that infection has spread across edge , since subtracting merely corresponds to ignoring the ball (edge) used to initialize the process. Thus, we have . In other graphs besides the cyclic graph, such a simple relation may not hold.
Theorem 12 provides one example where our separate intuitions for using the empirical distribution of infection spreads and for finding a solution to our fixed point equation nearly coincide.
To test our methods, we conducted simulations using a variety of parameter settings. We considered network topologies corresponding to a directed cycle graph of various sizes, both with and without self-loops on the vertices. We took to be equal to the -dimensional vector of ones in the case of no self-loops, and the -dimensional vector of ones in the case of self-loops, for different values of . We also varied the infection count . For each set of parameters, we simulated several infections using edge weights drawn independently and identically from a normal distribution, with the exception of self-loop parameters. We then calculated the empirical distribution estimator, fixed point estimator, maximum likelihood estimator, and general weights estimator for each simulated infection. In many cases, the fixed point estimator could not be calculated because of the nonexistence of the required Perron eigenvector, which is only guaranteed to emerge for sufficiently large; thus, most of the comparisons we report are between the other three estimators.
The results of the simulations, together with full details of the implementations, are reported in Appendix H. Tables 1-4 provide results for the cycle graph without self-loops; results for the cycle graph with self-loops are provided in Tables 6-9. Table 1 reports the accuracy of our estimates, measured according to the root mean squared error between the estimate and the true . As discussed in Section 4.1, for the methods other than maximum likelihood, we compute the error after projecting our estimates of the edge weights back into . In general, the maximum likelihood estimator performs the best, which is unsurprising.
Next, we compare the performance of the asymptotic confidence intervals given by the maximum likelihood estimator. We approximate the asymptotic information by the empirical average
which is known to converge almost surely to . However, in some cases—particularly those of high and or low —the matrix is nearly singular, so inverting the matrix to compute confidence intervals results in numerical errors. In order to gauge the presence of numerical errors, we compute the proportion of runs in which negative diagonal entries are present in the calculated inverse of , which is supposed to be positive semidefinite. We also provide the proportion of confidence intervals that cover the true . Finally, we compute the average length of confidence intervals and (for comparison) the average length of a centered interval that would be necessary to cover in of the simulations.
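The resulting intervals take the standard Wald form. A minimal sketch, assuming the plug-in information estimate of equation (11) has already been formed (the inputs are hypothetical):

```python
import numpy as np
from scipy import stats

def wald_intervals(theta_hat, info_hat, T, level=0.95):
    """Asymptotic confidence intervals theta_j +/- z * sqrt([I^{-1}]_{jj} / T).

    A negative diagonal entry in the computed inverse of info_hat
    signals the numerical trouble discussed in the text; here such
    entries are clipped at zero before taking square roots.
    """
    inv = np.linalg.inv(info_hat)
    se = np.sqrt(np.maximum(np.diag(inv), 0.0) / T)
    z = stats.norm.ppf(0.5 + level / 2)
    return np.column_stack([theta_hat - z * se, theta_hat + z * se])
```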
As seen in Tables 2 and 3, the empirical coverage of our computed confidence intervals is noticeably smaller than the target level. This is likely due to the instability of ball count proportions in a Pólya urn after relatively few draws. Indeed, the theory for our asymptotic intervals relies on almost sure convergence of the Pólya urn process to a particular limiting distribution, but for moderate values of , the contents of the urn may be quite far from the asymptotic limit. More concretely, suppose the limit were only approximately realized for values . Then the estimator (11) might be more accurately computed by taking an average of the latter terms, and the scale factor of in the formula for asymptotic normality in Theorem 7 should analogously be replaced by . Of course, this observation does not contradict our theory, since converges to as tends to infinity. In practice, one might want to consider some “burn-in” when computing the confidence intervals. Alternatively, other estimates of the inverse information matrix that avoid the numerical accuracy issues arising from the expression in equation (11) might provide more accurate confidence intervals.
The final item of interest is computation time, provided in Table 4. In general, the maximum likelihood estimator takes the longest time to compute; this is particularly noticeable for large values of , , and . In contrast to other methods, the computation time for the empirical distribution estimator and the general weights estimator does not appear to depend too much on the dimension . All simulations and calculations were conducted in Python; the maximum likelihood estimator and general weights computation used the SCS solver of CVXPY.
Finally, we ran experiments to compare the performance of Algorithms 2 and 3 in the absence of infection order information. The results are provided in Table 5. The simulations were conducted on a cycle graph of size  with , for different values of . As seen in the simulation results, the fixed point estimator performs similarly to the empirical distribution estimator. The fixed point estimator was calculated with  iterations, which is likely sufficient here, since one can show that a unique fixed point exists in this simple case and that the iterative algorithm converges linearly to it.
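As a schematic of the linear convergence claim, the following sketch iterates a toy contraction mapping (the map and starting point are illustrative assumptions, not the fixed point equation of Algorithm 3):

```python
def fixed_point_iterate(f, x0, n_iter=50):
    """Iterate x_{k+1} = f(x_k). If f is a contraction with factor q < 1,
    the error shrinks by a factor q per step (linear convergence) and the
    iterates approach the unique fixed point."""
    x = x0
    for _ in range(n_iter):
        x = f(x)
    return x

# Toy contraction (illustrative only): f(x) = 0.5*x + 1 has the unique
# fixed point x* = 2, and the error halves on every iteration.
x_star = fixed_point_iterate(lambda x: 0.5 * x + 1.0, 10.0)
```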
7 Application to Ebola Spread
In this section, we describe an application of our methods to data collected from the spread of Ebola in Guinea, Sierra Leone, and Liberia from 2013 to 2015. The official count for the epidemic included 28,616 suspected or confirmed cases, 15,227 confirmed cases, and 11,310 total deaths . The dataset we analyzed contains genome sequences from 3,219 infected individuals, which allowed researchers to infer transmissions between cities and states in the region. Specifically, epidemiologists applied a Bayesian phylogeographic model to infer relationships between sequences and determine parameters related to the spread of infection, concluding that there was strong evidence for the importance of five variables .
Due to the uncertainty in inferring phylogenetic trees, a number of possible transmission patterns were produced by the model. In our work, we considered one particular realization, with the goal of checking whether the transmission structure provides evidence for the importance of the same predictors according to the model. The transmission pattern is shown in Figure 1. It consists of 319 inter-regional transmissions between 46 of the 63 regions. Of the 319 transmissions, 166 occurred between unique source-destination pairs.
We applied our model to the inter-regional transmission data, using a complete graph topology for connections between regions. Computing the maximum likelihood estimator took 5,715 minutes. The estimates, standard errors, and -statistics are provided in Table 12. One difficulty in interpreting the results is due to the small estimated standard errors. This is consistent with our simulations in Section 6, where the proper coverage is often not attained. As a result, we speculate that the standard errors should all be increased by a large multiplicative factor to improve the accuracy of inference.
The largest absolute -statistics correspond to the great circle distance between two regions; the destination population; and the source population. In addition, the presence of an international border and the various international boundaries had large absolute -statistics. This is consistent with the analysis of . However, there are a few notable differences: First, our model found the “source t.t. 100k” variable and “source prec.” variables to be relatively important. Second, one of the least significant variables for our model is the source temperature seasonality, which was regarded as relatively important for . Our hypothesis is that none of these three variables are particularly significant.
This work leaves a number of directions for future inquiry. The most important practical question is how to improve the finite-sample accuracy of our confidence intervals. As demonstrated in our experiments, the standard errors should be larger to ensure better coverage. As mentioned earlier, one could consider some burn-in when computing confidence intervals; however, this may only be feasible if the number of infection events is reasonably large. A better solution would be to bound the nonasymptotic bias in the urn process more explicitly.
In the case when the identities of infecting vertices are unknown, we have only studied the situation where the weight functions are constant in , which corresponds to the scenario where all neighbors affect the odds of subsequent infections at  in the same way. This is a restrictive assumption, since it precludes the estimation of parameters based on observable edge covariates. Another setting of interest involves incomplete source information: In online social networks, it is sometimes the case that certain transmission edges are known explicitly through direct interactions, while others are not. Hence, a model and analysis that can accommodate a mix of infecting-infected pairs and standalone infections could be useful.
Finally, to connect with preexisting literature, it might be of interest to introduce the Pólya urn model, or a continuous-time variant, into a Bayesian setting. The goal would be to integrate the model into phylogenetic models as in  or , rather than performing a two-step process as employed in our experiments, where the first step consists of segmenting the infection process. This would allow us to conduct valid statistical inference without first conditioning on the correctness of the pre-estimated network.
The authors thank Varun Jog for helpful discussions while preparing this paper.
Appendix A Proof sketches
A.1 Lemmas on Pólya urns
The analysis of this section applies to irreducible Pólya urns, meaning that the matrix is irreducible. Equivalently, the matrix has strictly positive entries for . Our first task is to prove that the urns arising in our problem setting are indeed irreducible.
If is strongly connected, the generalized Pólya urn defined by the replacement matrix is irreducible.
Next, we have the following lemma:
Suppose the urn is irreducible. The distribution converges almost surely to the unique strictly positive distribution given by the leading left eigenvector of .
Note that when we say that the distribution converges to a deterministic vector, we identify the vector with the non-random measure over the colors, where the  entry of the vector gives the probability that the next ball drawn is of color .
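As a concrete numerical check of the lemma, one can compute the leading left eigenvector of a small replacement matrix directly; the normalized eigenvector is the claimed limiting composition. The matrix below is a hypothetical example chosen for illustration:

```python
import numpy as np

def limiting_composition(A):
    """Normalized leading left eigenvector of the replacement matrix A.
    For an irreducible urn, the lemma says the color proportions converge
    almost surely to this distribution (Perron-Frobenius guarantees it is
    strictly positive)."""
    eigvals, eigvecs = np.linalg.eig(A.T)  # left eigenvectors of A
    k = int(np.argmax(eigvals.real))
    v = np.abs(eigvecs[:, k].real)
    return v / v.sum()

# Hypothetical irreducible replacement matrix (illustration only):
A = np.array([[1.0, 2.0], [3.0, 1.0]])
pi = limiting_composition(A)  # strictly positive and sums to one
```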
We also have the following result:
There exists a deterministic vector with components such that the distribution satisfies the equation
A.2 Asymptotic theory
We borrow the general proof strategy from [28, Chapter 6, Theorem 5.1] and adapt it to our setting.
To prove consistency, we wish to show that there is a sequence of maximizers of the log-likelihood that converges to the true . We want to show that  for all  in the spherical shell of radius  around . Then, by the continuity of the log-likelihood, the maximizer must lie within the interior of the ball of radius  centered at . This is summarized in the following lemma, proved in Appendix E.1:
There exists such that for all , we have
If is in , then by the continuity of the log-likelihood, there is a in the ball of radius centered at that is a local maximum. By the concavity of the log-likelihood, this is also the unique global maximum . Hence,
By Lemma 17, the limit of the right-hand side as is equal to 0. Thus, we clearly have
as well. Hence, we have the convergence in probability, implying consistency.
We now turn to the proof of asymptotic normality. We first expand the coordinatewise derivatives of the log-likelihood about to obtain
for some on the line segment between and , where we have used the shorthand notation to denote the coordinate of the gradient of , and similarly for the higher-order derivatives.
Substituting sets the left hand side to . Rearranging, we then obtain
We will apply Lemma 27 with
We have the following lemma, proved in Appendix E.2:
We have the convergences
in distribution and in probability, respectively.
Putting this all together means that
This completes the proof of asymptotic normality.
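In generic one-dimensional notation (a schematic only; here $\ell_n$, $\theta^*$, and $\hat\theta_n$ are stand-ins for the quantities above, not the paper's exact symbols), the argument just completed follows the classical pattern:

```latex
% Schematic of the classical asymptotic-normality argument.
% Expanding the score about the truth and using the first-order condition:
0 = \ell_n'(\hat\theta_n)
  = \ell_n'(\theta^*) + \ell_n''(\bar\theta_n)\,(\hat\theta_n - \theta^*),
% for some \bar\theta_n between \hat\theta_n and \theta^*; rearranging,
\sqrt{n}\,(\hat\theta_n - \theta^*)
  = \Bigl(-\tfrac{1}{n}\,\ell_n''(\bar\theta_n)\Bigr)^{-1}
    \cdot \tfrac{1}{\sqrt{n}}\,\ell_n'(\theta^*).
% If n^{-1/2}\,\ell_n'(\theta^*) \Rightarrow N(0, I(\theta^*)) and
% -n^{-1}\,\ell_n''(\bar\theta_n) \to I(\theta^*) in probability, Slutsky's
% theorem yields \sqrt{n}(\hat\theta_n - \theta^*) \Rightarrow N(0, I(\theta^*)^{-1}).
```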
Appendix B Proofs for Section A.1
B.1 Proof of Lemma 14
Consider two edges  and . Since  is strongly connected, there exists a path of some length  from  to . Hence, the  term in the expansion of  has a strictly positive  entry. Since all entries of  are nonnegative, the  entry of  must be strictly positive as well.
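The path-based argument above suggests a simple numerical irreducibility check: a nonnegative matrix is irreducible exactly when a suitable sum of its powers, equivalently the matrix power of the identity plus the matrix, has all entries strictly positive. A sketch (the example matrices are hypothetical):

```python
import numpy as np

def is_irreducible(A):
    """A nonnegative n x n matrix is irreducible iff (I + A)^(n-1) has all
    entries strictly positive, i.e. every index reaches every other index
    along some path, mirroring the strong-connectivity argument above."""
    n = A.shape[0]
    M = np.linalg.matrix_power(np.eye(n) + A, n - 1)
    return bool((M > 0).all())

# Directed 3-cycle: strongly connected, hence irreducible.
cycle = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])
# Upper-triangular: no path from the second index back to the first.
reducible = np.array([[1.0, 1.0],
                      [0.0, 1.0]])
ok_cycle = is_irreducible(cycle)          # True
ok_reducible = is_irreducible(reducible)  # False
```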
B.2 Proof of Lemma 15
Let be a continuous-time branching process, where in the time interval , each particle of type branches independently with probability . When a particle of type branches, it is replaced by one particle of type and one particle of each type for which . Let be the leading left eigenvector of , normalized to be a probability distribution, and let
be the associated eigenvalue. Note that  is unique by the Perron-Frobenius theorem. Define . By Theorem 1 of , we have almost sure convergence of  to , where  is a nonnegative random variable. For completeness, we provide a statement of this theorem in Appendix G.
By normalizing, we obtain the limiting distribution
so it suffices to show that we can recover our urn process from the branching process via stopping times. For this, we follow Theorem 1 of  and the ensuing discussion.
Let be the split time of the process . Then, forms an increasing sequence of stopping times. At , there are particles of type . Thus, the probability that the next split is of a particle of type is
since all particles of a given type have independent exponential lifetimes with mean . The resulting change in  is that one particle of type  is added for each  such that .
We now consider the analogous urn process , which has replacement weight matrix . Unlike in standard urn theory, we distinguish between the number of balls and their weights; when we draw a ball of color , we replace it with a ball of weight , for each for which . Again, at the time, we have balls of type in the urn, and on the next draw, the probability we draw a ball of type is
Note that the ensuing replacement is then the same as for .
Thus, the transition probabilities of and are the same. Additionally, since both processes are Markov, they are equivalent. Since as , we have
By taking limits, we see that , as required.
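The stopping-time embedding can be mimicked in simulation (with a hypothetical replacement matrix; this sketches the equivalence rather than reproducing the paper's exact process): at each split time, the splitting type is drawn with probability proportional to the current counts, which is exactly one draw of the urn.

```python
import random

def embedded_urn_step(counts, R, rng):
    """One split of the continuous-time branching process, observed at its
    split time: because all particles carry independent Exp(1) clocks, the
    splitting type is drawn with probability proportional to the current
    counts, and then row R[i] of the replacement matrix is added -- which
    is exactly one draw of the generalized Polya urn."""
    total = sum(counts)
    u = rng.random() * total
    acc = 0.0
    chosen = len(counts) - 1
    for i, c in enumerate(counts):
        acc += c
        if u < acc:
            chosen = i
            break
    return [c + r for c, r in zip(counts, R[chosen])]

# Hypothetical two-type replacement matrix (illustration only):
R = [[1, 1], [1, 2]]
rng = random.Random(1)
counts = [1, 1]
for _ in range(1000):
    counts = embedded_urn_step(counts, R, rng)
```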
B.3 Proof of Lemma 16
We first show that the limiting vector exists. We define
By Lemma 15, the distribution is strictly positive, almost surely. Hence, for sufficiently large, is also strictly positive. Taking limits, we therefore obtain
where is deterministic. In particular, we have
Accordingly, we define the vector to obtain the desired result.
Appendix C Proofs of existence and uniqueness lemmas
C.1 Proof of Theorem 2
In this section, we consider basic proofs of existence and computation for the maximum likelihood estimator.
We will use both the likelihood and the log-likelihood as is convenient. We begin with a useful lemma:
Given , the log-likelihood has a maximum if and only if for all , there is no such that is strictly increasing.
The forward direction is simple, so we focus on the reverse direction. First note that since the log-likelihood is concave and continuously differentiable, every  is also concave and continuously differentiable. As a result, the function  is either constant, strictly decreasing, increasing to a maximum and then decreasing, or strictly increasing. Let . Since  is not strictly increasing, the function  is defined for all . Furthermore, the vector  defined as  must exist, and the maximum of  must then agree with the maximum of . ∎
Returning to the proof of the theorem, suppose and exist such that and . The term in the expansion of is bounded above by
Furthermore, since , it is easy to see that