Modern epidemiological studies seek to understand disease dynamics, evaluate intervention strategies and differentiate between population-level and individual-level effects. A traditional approach to modeling infectious disease relies on mechanistic compartmental models, where only a summary of the disease statuses of individuals in the population plays a role in understanding the disease dynamics. Examples of such mechanistic compartmental models abound in the epidemiology and mathematical biology literature, e.g. the susceptible-infectious-recovered (SIR) model (Kermack and McKendrick, 1927). The majority of these simplify disease transmission to a population-level event; as such, these models are posed to answer population-level questions about disease outbreaks (e.g., “will the outbreak end?”), but cannot resolve those at the individual level (e.g., “what is my risk of infection?”). This is exemplified by the “random mixing” assumption that underpins many of these models, under which any infectious individual can transmit the disease to any other susceptible individual with equal probability. However, it is clear that the contact network of individuals plays an integral role in disease transmission and that interventions on individual behavior can change the overall dynamics of an outbreak (Eames and Keeling, 2003; Kiss et al., 2006; Lunz et al., 2021).
The literature on epidemics diffusing through networks was greatly bolstered during the SARS-CoV-2 pandemic, and a myriad of mechanistic models attempting to capture both the network and disease characteristics have been proposed (Ferguson et al., 2020; Cencetti et al., 2020; Nielsen et al., 2021; Small and Cavanagh, 2020; Skums et al., 2020; Lee et al., 2020; Soriano-Arandes et al., 2021). While a number of these make use of mobility data collected through mobile devices, they still largely operate at the population level, with disease dynamics considered at the county or zip-code level. At the same time, high-resolution data collected at the individual level are becoming available, requiring new model development. For example, in Figure 1 we plot the transmission and interaction dynamics of individuals on a college campus, a snippet from the eX-FLU data we analyze in detail in Section 6.2.
A number of approaches to individual-level modeling have been introduced in the literature. Several papers have introduced a graph-coupled hidden Markov model to account for changes in the infectiousness states of individuals (Dong et al., 2012; Fan et al., 2015, 2016), and there have been developments in agent-based disease transmission models that consider covariates associated with infections (Touloupou et al., 2020; Ju et al., 2021). Such individual-level inference is very computationally challenging and becomes intractable for even moderately sized datasets with few predictors of interest. Bu et al. (2020) introduced an individual-level framework for stochastic epidemic processes, where infections and individual-to-individual interactions are nearly completely observed. Specifically, the proposed approach introduced an exact sampler for unobserved recovery times that relies on fully observing the infection times, building on earlier individual-level stochastic models that do not model the network (Auranen et al., 2000; Cauchemez et al., 2006; Hoti et al., 2009; Britton, 2010).
We note that many of these prior approaches assume that the epidemic process is fully and exactly observed, so inference may proceed via the likelihood of the complete data. However, available epidemic data typically provide only a partial view of exposure, infection and recovery times: the exposure time is often “latent” because of the incubation or latency period, and infection and recovery times are often not fully recorded due to the limited resolution of data collection or loss to follow-up. Even in considerably rich and high-throughput modern datasets such as the eX-FLU study that we analyze in this paper (Aiello et al., 2016), the true exposure times are not observed even though infections can be inferred via daily symptom reporting, and the exact recovery times are not available because only weekly updates were obtained for influenza recoveries. Recent work carries out inference from partial epidemic data using the marginal likelihood under simpler SIR or SIS models (Ho et al., 2018a, b; Ju et al., 2021), but these techniques are computationally intensive and difficult to extend to account for the factors (e.g., the contact network structure) we seek to model. Many infectious diseases such as influenza and COVID-19 have a substantial incubation or latency period; our contributions both incorporate a latency period and propose an inference procedure that accounts for the unknown infection and recovery times through latent variables.
Lastly, contemporary studies collect information on covariates such as hygiene habits, vaccination statuses, and preventive and disease-control measures, which go beyond the simple infection states and contact tracing information that were previously available. The model and procedure of Bu et al. (2020) consider the binary contact status between individuals for modeling infection rates, but more flexibility is needed to accommodate heterogeneity in epidemic rate parameters as a function of multiple covariates. As a concrete example, consider the role of hygiene habits in disease spread. Previous studies have shown that good hygiene habits such as frequent hand-washing can help reduce the transmission of infectious diseases (Aiello et al., 2010; Hübner et al., 2013; Hovi et al., 2017; Thompson and Rew, 2015). These studies are mostly randomized trials without closed and interacting populations and have thus been analyzed using simple two-sample t-tests or randomization tests (Stedman-Smith et al., 2015a; Arbogast et al., 2016; Stedman-Smith et al., 2015b; Savolainen-Kopra et al., 2012). While this allows us to get a sense of whether hygiene is important overall, it does not quantify the effect of individual hygiene behavior on disease transmission within a joint inference procedure.
In our motivating dataset, the eX-FLU study of influenza-like illnesses on a college campus (Aiello et al., 2016), raw counts of hand-washers and infection cases suggest that 37% of those who did not optimally wash their hands experienced flu-like illnesses, while only 24% of those who did became sick during the study. (In the eX-FLU study, “optimal hand-washers” were identified through survey questions on the frequency and duration of hand-washing.) However, a Fisher’s exact test does not detect this as a significant difference, which is unsurprising as many of the observations are dependent through a network. Inspecting the local network of an infected individual and (manually) tracing the disease transmission over a few weeks (see an illustration in Figure 1) suggests that optimal hand-washers are indeed less likely to contract the disease, even after contact with infectious individuals. This exploration illustrates how individual-level information affects the complex dependencies between the epidemic process and the contact network in a way that goes undetected in population-level summaries, motivating a framework that directly incorporates individual covariates into the transmission mechanism.
The model proposed in Section 2 extends the literature on continuous-time Markov processes to accommodate a contact network, a latency period, more missingness in epidemic observations, and individual-level covariates for heterogeneous disease transmission dynamics. We choose to build on the stochastic SEIR model, a variation on the widely used SIR model that explicitly considers the latency period. This flexibility comes at a cost: exact inference becomes intractable because of the complex dependencies between the covariates, disease process and missingness mechanism. To address this, we derive a stochastic expectation-maximization (stEM) algorithm that exploits the complete data likelihood while augmenting the data through exact conditional sampling (Nielsen and others, 2000). This approach is made computationally tractable by three key realizations: many of the required computations can be cast as offset Poisson regressions which can be solved efficiently, a rejection sampler for missing exposure times can be deployed naively in parallel, and an exact sampler for missing recovery times can be adapted from Bu et al. (2020).
The remainder of this paper is organized as follows: In the next section, we introduce our model framework. Sections 3 and 4 discuss our proposed inference procedures for complete data and partial data. We evaluate the performance of the proposed inference methods through simulation experiments in Section 5, and finally apply our model to analyzing the eX-FLU dataset in Section 6.
2 Model framework
We adopt a stochastic compartmental model for epidemics, where all members of the target population are divided into non-overlapping subsets according to their disease statuses, and the mechanism of disease spread is described by transitions between disease statuses for each individual. We base our epidemic model on the SEIR model with four disease statuses: S (susceptible), E (exposed), I (infectious), and R (recovered or removed). An individual may get exposed (and thus become an E person) upon contact with an I individual, and an infectious (I) person will eventually recover and transition to the R status. In this model, the E status represents a latency period and does not entail any transmissibility, and a recovered person enjoys immunity to the disease and therefore no longer contributes to the contagion process.
These disease spread dynamics evolve as a continuous-time Markov chain (CTMC) defined through exponentially distributed waiting times between consecutive events (Guttorp, 2018). This implies that the disease process progresses as a series of competing Poisson processes at the individual level. For example, suppose β is the rate of exposure between an infectious person and a susceptible person who are in contact at time t. Then the probability of the susceptible person getting exposed (thus becoming an E person) by time t + h for small h is βh + o(h).
Given that the contact structure of the population is itself subject to change over time, we extend the CTMC model to the dynamics of the contact network. Any pair of individuals i and j either share an undirected contact link (“connected”) or they do not (“disconnected”); the contact network can thus be represented by a binary symmetric matrix called the adjacency matrix. Its dynamics are described at the pairwise level, where each entry evolves as a CTMC that takes values in {0, 1}. For example, if individuals i and j are disconnected at time t, then with link activation rate α the probability of them engaging in contact by time t + h for small h is αh + o(h).
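As a concrete illustration of this competing-clocks construction (not part of the paper's formal development), the next event in such a CTMC can be simulated by drawing an exponential waiting time at the total rate and then selecting which event fired with probability proportional to its rate. The function name and event labels below are our own:

```python
import random

def next_event(rates, rng=random):
    """Competing exponential clocks: given {event_name: rate}, draw the
    waiting time to the next event (Exponential with the total rate) and
    pick which event fires with probability rate / total."""
    total = sum(rates.values())
    if total <= 0:
        return float("inf"), None  # no active clocks
    wait = rng.expovariate(total)
    u = rng.uniform(0.0, total)
    acc = 0.0
    for name, rate in rates.items():
        acc += rate
        if u <= acc:
            return wait, name
    return wait, name  # guard against floating-point round-off
```

This is the same mechanism driving exposure, manifestation, recovery, and link events in the joint process, e.g. `next_event({"exposure": 0.2, "link_on": 0.05})`.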
We will consider heterogeneous exposure rates, allowing individual characteristics and network information to play a substantive role in transmission probabilities. Moreover, we consider two types of infectious individuals — these can be thought of as symptomatic and asymptomatic cases — who exert different transmission forces on the population. This is summarized in Figure 2.
Model Specification. Our goal is to construct an individualized framework that can capture the interplay between the epidemic process and the evolution of the contact network. Disease transmission relies on contact between individuals, while the change in contacts, in turn, depends on individual disease statuses. To do so, we model the joint evolution of the epidemic and network processes as a combination of continuous-time Markov chains for individuals (or pairs of individuals) in the population. At any time point, conditioned on the current state of the process, five types of events may occur for an individual or a pair of individuals: exposure (an S person becomes exposed by an I person), manifestation (an E person becomes infectious after a latency period), recovery (an I person recovers and becomes an R person), link activation (a previously disconnected pair get connected in the network), and link termination (a previously connected pair break off their contact).
We accommodate different types of individual heterogeneity in the disease transmission dynamics: (1) people may exhibit different levels of susceptibility that can be explained by individual characteristics such as health conditions, hygiene habits, and behavioral choices; (2) those who are infectious might not be equally contagious to the susceptible population (e.g., symptomatic and asymptomatic cases for COVID-19); (3) contact rates in the network may vary in time to reflect phases of social intervention and/or behavioral changes in response to an epidemic (e.g., a pre-lockdown and a post-lockdown phase); (4) contact rates in the network may vary between pairs of individuals based on their healthy (denoted by H, the collection of the S, E, and R statuses) versus infectious (denoted by I) status.
Combining the above descriptions, we design the model framework as follows. For any members i and j of the target population, given the current system state at time t, one of the following five events may occur after an exponentially distributed waiting time with the associated rate (see Figure 2 for a summary of the epidemic process):
Exposure. If i is infectious, j is susceptible, and they are in contact at time t, then i exposes j with an instantaneous rate that can be decomposed as
where the first factor is the baseline exposure rate, the second represents i’s contagiousness level at time t (details in the next item), and the regression coefficients on j’s individual characteristics account for additional heterogeneous effects of susceptibility. (For example, the characteristics can encode whether person j has been vaccinated, washes their hands frequently and/or wears a mask during the flu season, all of which may reduce j’s exposure risk.)
Manifestation. If i (or j) is exposed, then he or she becomes infectious with the manifestation rate. For additional generality we assume that any infective (status I person) gets assigned to one of two categories with different levels of contagiousness: I_s (“symptomatic”) or I_a (“asymptomatic” or “less symptomatic”). Individuals are assigned to the first category with a fixed probability and to the second with the complementary probability. (Note that it is straightforward to introduce more sub-types of infectives or include continuous and even time-varying explanatory variables for the contagiousness function, if more intricate modeling of heterogeneous transmissibility is necessary.) With the two-type infective setup, the contagiousness term in Eq. (3) can be written as
which means that a symptomatic-type infective is on average a constant factor times more infectious than an asymptomatic-type individual.
Recovery. If i (or j) is infectious, then he or she recovers with the recovery rate.
Link activation. If i and j are not in contact, then they get into contact at a rate given by
where the superscripts denote the healthy (H) or infectious (I) statuses of persons i and j at time t, and the rate parameter stands for the activation rate of the corresponding link type in the current phase (phase 1 or phase 2).
Link termination. If i and j are in contact, then they break off the contact at a rate given by
where the rate parameter stands for the termination rate of the corresponding link type in the current phase.
Here phases 1 and 2 represent two time phases of different social behaviors that partition the entire observation time window (e.g., they can be intermittent lockdown and no-lockdown phases).
The link rates depend on the individual healthy or infectious statuses since we wish to characterize social adaptation behavior in response to an epidemic; for instance, a healthy–infectious (H–I) pair that are in contact might be more likely to disconnect from each other than a pair of healthy individuals, in order to avoid disease transmission. Further, since we assume the contact network is symmetric and the link rates depend only on healthy or infectious statuses, the activation and termination rates for an H–I pair equal those for an I–H pair.
Note that we could also include individual-level covariates in the link activation and termination rates to allow for more heterogeneity in the network dynamics, but we choose to focus on individual variability in the epidemic process in the main text, and relegate details of heterogeneous network link rates to Section S1 of the Supplementary Material (Bu et al., 2021).
3 Inference
In this section, we demonstrate how to learn the model parameters. Though our focus will be on inference in the partially observed regime, we begin by describing key inferential quantities in the complete data setting, as these quantities will play a role in our stochastic EM algorithm.
3.1 Inference with complete data
A complete dataset refers to the fully observed event sequence, between time 0 and a maximum observation time, of one realization from the generative model. In particular, if we were to continuously observe an epidemic as it progresses under this model, we would have access to (1) the exact times of exposure, manifestation, recovery, and link activation and termination events; (2) the identities of the individuals involved in each event; (3) the symptomatic or asymptomatic subtype allocation for each infectious individual at the time of their manifestation event; and (4) the contact network structure as well as the initial disease statuses of all individuals at time 0.
Given the complete data (or equivalently, sufficient statistics summarizing the data) and all individual characteristics, we can write down the complete data likelihood with respect to the full set of model parameters, collected into vectors, with the link rates indexed by the set of all pair types. The likelihood takes the form
Table 1 summarizes the notation used in the expressions above.
Since the generative model is a CTMC composed of individual-level Poisson processes, the above likelihood can be decomposed into epidemic-related components (1st and 3rd lines above) and network-related components (2nd and 4th lines), where each component is simply the product of the exponential rates for inter-event times. Some of those inter-event rates can be considered at the population level, and so only require bookkeeping of aggregated counts (e.g., the total number of exposed cases, the number of infectious people at time t, and the number of link activation events for each pair type in each phase). Other inter-event rates, however, depend on individual-level information, and so require tracking terms such as the time-varying neighborhood structure and the exposure and manifestation times for each individual. Note that, for consistency, we set individual i’s exposure time and manifestation time to a fixed sentinel value if i has never been exposed or manifested.
| total population size (assumed fixed) |
| total numbers of exposed, infectious and recovered cases |
| total numbers of symptomatic and asymptomatic infectious neighbors of individual i at time t |
| total numbers of individuals of each disease status in the population at time t |
| exposure time and manifestation time for individual i (set to a sentinel value if never exposed/manifested) |
| total numbers of link activation & termination events among pairs of each type in each phase |
| numbers of connected & disconnected pairs of each type at time t |
Despite a lengthy likelihood expression, parameter estimation is straightforward when complete data are available. We can obtain closed-form maximum likelihood estimates (MLEs) for most of the parameters, and find the remaining MLEs for parameters and through a simple numerical procedure. This suggests that likelihood-based inference given completely observed data is easily implementable through a few lines of code, and can be modularized toward inference when some data are missing (discussed in the next section).
The MLEs of all parameters can be obtained by taking the partial derivatives of the log-likelihood as follows and setting them all to zero:
The MLEs for most of the rate parameters have closed-form expressions, while the remaining MLEs can be solved from equations (8)-(10) numerically. An efficient iterative procedure is detailed in Section S2 of the Supplementary Material (Bu et al., 2021); in particular, the steps for solving the MLE of the susceptibility regression coefficients can be largely simplified based on the observation that doing so is equivalent to solving for the linear coefficients of a Poisson regression model with individual offsets.
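To illustrate the Poisson-regression reduction mentioned above, the sketch below fits a Poisson regression with a known offset by Newton–Raphson on simulated data. This is a generic sketch in our own notation (design matrix `X`, offset vector, coefficients `b`), not the paper's implementation:

```python
import numpy as np

def poisson_offset_mle(X, y, offset, n_iter=25):
    """Newton-Raphson for Poisson regression with a known offset:
    y_i ~ Poisson(exp(offset_i + x_i' b)).  Returns the MLE of b."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(offset + X @ b)      # fitted means under current b
        grad = X.T @ (y - mu)            # score vector
        hess = X.T @ (X * mu[:, None])   # Fisher information
        b = b + np.linalg.solve(hess, grad)
    return b
```

In the paper's setting the offsets would absorb the individual-specific exposure terms that multiply the regression part of the rate.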
3.2 Inference with partial epidemic observations
As demonstrated in the previous subsection, parameter estimation is relatively straightforward when the full event sequence is observed. However, as discussed in the introduction, real-world epidemic data rarely include measurements of every epidemic event. In the case of the eX-FLU study, true exposure times are not available even though the data contain daily symptom reports, because there is typically an incubation period for people who contract the flu. Similarly, exact recovery times are not available in the eX-FLU study, with recoveries discernible only at a weekly resolution from epidemic surveys. Therefore, we need to consider inference with partially observed epidemic data, in particular with exposure times and recovery times unknown.
To this end, we derive a method based on the stochastic expectation-maximization (stEM) algorithm (Celeux, 1985). Expectation-maximization (EM) offers an approach to efficiently carry out maximum likelihood estimation for continuous-time Markov chain models in missing data settings (Doss et al., 2013; Xu et al., 2015; Guttorp, 2018). Imputing the missing data in the E-step requires access to the conditional expectation, and stEM is a variant that approximates the conditional expectation using augmented data obtained via conditional simulation. To be more precise, a general outline of the stochastic EM algorithm is as follows:
For t = 1, 2, …, do
(E-step) draw one sample of the missing data from its conditional distribution given the observed data and the current parameter estimates, and augment the observed data with it;
(M-step) maximize the resulting complete-data log-likelihood with respect to the parameters to update the estimates.
There are two advantages of this approach. First, in the E-step, integrating to obtain an expected log-likelihood (as in the traditional EM algorithm) is replaced by sampling, which avoids the often intractable marginalization step in the case of complex models (Renshaw, 2015; Xu and Minin, 2015; Stutz et al., 2021). Second, the M-step simply requires solving for the MLEs given a version of the complete data, which is often straightforward, and has been discussed in the previous subsection for the present setting.
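The two-step loop can be illustrated on a toy missing-data problem: estimating the rate of an Exponential sample when some observations are right-censored (a crude stand-in for unobserved recovery times). The model and all names are our own simplification, not the paper's procedure:

```python
import random

def stem_censored_exponential(times, censored, n_iter=200, seed=0):
    """Stochastic EM for the rate of an Exp(rate) sample with right-censoring.
    E-step: impute each censored value by a draw from its conditional law,
    censor_time + Exp(rate), using memorylessness.
    M-step: complete-data MLE, rate = n / sum(imputed sample)."""
    rng = random.Random(seed)
    rate = 1.0
    n = len(times)
    for _ in range(n_iter):
        total = 0.0
        for t, c in zip(times, censored):
            if c:
                t = t + rng.expovariate(rate)  # impute beyond the censoring point
            total += t
        rate = n / total
    return rate
```

The iterates fluctuate around the observed-data MLE rather than converging to a point; the averaging strategies discussed later in the paper reduce this simulation noise.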
These advantages come at the cost of a potential challenge: we have to conditionally sample the missing data given our observed data and current parameter estimates. In our framework, this is equivalent to sampling event times of a continuous-time Markov chain conditioned on endpoints, a notably difficult problem (Hobolth and Stone, 2009; Rao and Teh, 2013). The following section focuses on the conditional sampling (or “imputation”) of the missing data, i.e., the missing exposure and recovery times.
4 Stochastic EM approach
We collect all missing exposure times and all missing recovery times. We assume that all manifestation times are observed (this assumption is reasonable as manifestation times can be collected via daily symptom monitoring or frequent screening or testing), and that there is no missingness in the contact network events given high-resolution contact tracing. Thus, our inference procedure with partial epidemic observations can be outlined as follows.
For t = 1, 2, …, do
Step 1: sample the missing exposure times from their joint conditional distribution given the observed data and the current parameter estimates;
Step 2: sample the missing recovery times from their joint conditional distribution given the observed data, the exposure times sampled in Step 1, and the current parameter estimates;
Step 3: combine the sampled event times in Steps 1 and 2 with the observed data to form an augmented dataset, then solve for the MLEs with the augmented dataset to get updated parameter estimates.
Since Step 3 is already addressed in Section 3.1, we now address Steps 1 and 2 separately.
4.1 Step 1: conditional sampling of missing exposure times
Inspecting the complete data likelihood in Eq. (7) reveals that, given all the other event times, person i’s exposure time is independent of other individuals’ exposure times, and thus the joint conditional density of the missing exposure times can be factorized into individual density components. It therefore suffices to derive the conditional density of i’s exposure time, which is assumed to lie within a “plausible” latency interval ending at i’s manifestation time. Note it is always valid to choose this interval to start at time 0, meaning that i’s exposure may occur at any time after time 0 and before i’s manifestation time. In practice, one may alternatively specify a shorter plausible interval based on prior knowledge about latency duration, which can improve computational efficiency. (For example, if we believe that latency should be longer than 2 days but shorter than 2 weeks, then we may set the interval to span from 2 weeks to 2 days before the manifestation time.) Then, i’s instantaneous risk of exposure during this interval can be written as
This exposure hazard is a step function with change points occurring when (1) i activates a link with an infectious (symptomatic or asymptomatic) person, (2) i deactivates a link with such a person, (3) one of i’s contacts becomes infectious, or (4) one of i’s contacts ceases to be infectious. For person i, denote this set of change points accordingly (here we suppress the notation indicating that the set is associated with individual i, to avoid double subscripts while focusing on one individual during exposition). This set defines a partition of the latency interval into sub-intervals, on each of which the risk function is constant.
Thus, the conditional density of i’s exposure time can be expressed as
where the normalizing constant can be explicitly evaluated since the risk function is a step function.
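For a piecewise-constant risk function, this normalizing constant has a simple closed form: integrating rate × exp(−cumulative hazard) over the whole interval gives one minus the exponential of the negative total cumulative hazard. A small sketch (our own notation: `breaks` gives the partition, `rates` the step values) verifies this against a brute-force Riemann sum:

```python
import math

def cumhaz(t, breaks, rates):
    """Cumulative hazard at time t for the step function taking value
    rates[k] on [breaks[k], breaks[k+1])."""
    acc = 0.0
    for r, b0, b1 in zip(rates, breaks, breaks[1:]):
        if t <= b0:
            break
        acc += r * (min(t, b1) - b0)
    return acc

def step_hazard_normalizer(breaks, rates):
    """Closed form of the integral of rate(t)*exp(-cumhaz(t)) over
    [breaks[0], breaks[-1]]: 1 - exp(-total cumulative hazard)."""
    return 1.0 - math.exp(-cumhaz(breaks[-1], breaks, rates))

def step_hazard_normalizer_numeric(breaks, rates, m=20000):
    """Midpoint-rule check of the same integral."""
    t0, t1 = breaks[0], breaks[-1]
    h = (t1 - t0) / m
    acc = 0.0
    for i in range(m):
        t = t0 + (i + 0.5) * h
        k = 0
        while k + 1 < len(rates) and t >= breaks[k + 1]:
            k += 1
        acc += rates[k] * math.exp(-cumhaz(t, breaks, rates)) * h
    return acc
```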
We derive a rejection sampler for the missing exposure time from this conditional density, with the “plausible interval” taken to be the full window from time 0 to the manifestation time. Consider the following proposal density
which is the density function of a truncated inhomogeneous Exponential distribution with a step-function rate. A rejection sampler then runs in two steps:
Sample a proposal from the inhomogeneous Exponential distribution truncated to the plausible interval:
first, sample a sub-interval (recall that the risk function is constant on each sub-interval) with probability proportional to the probability mass the proposal assigns to it, where len(·) denotes the length of a sub-interval; then, within the chosen sub-interval, sample the proposal from the corresponding truncated Exponential distribution.
Compute the acceptance probability for the proposal (up to a constant factor), draw a uniform random number on [0, 1], and accept the proposal as a sample of the exposure time if the uniform draw falls below the acceptance probability; otherwise go back to Step 1 and repeat.
A full derivation of the above (importantly, showing that the acceptance probability is bounded by 1) is provided in Section S3 of the Supplementary Material (Bu et al., 2021). Note that it is easy to generalize to other choices of the plausible interval bounds, which simply entails changing the limits of integration in the above derivations.
Step 1 of the stEM procedure can be carried out by running the rejection sampler above, and we may further speed up computation by running the sampler for each person in parallel. In the simulations in Section 5.2, the rejection sampler accepts approximately 45% of proposals.
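The two sub-steps of the proposal draw amount to exact sampling from a truncated piecewise-exponential distribution: pick a sub-interval by its probability mass, then invert the truncated Exponential CDF within it. The sketch below (our own names; it assumes strictly positive step rates) implements that proposal step:

```python
import math
import random

def sample_trunc_piecewise_exp(breaks, rates, rng=random):
    """Draw from the density proportional to lam(t)*exp(-CumHaz(t)) on
    [breaks[0], breaks[-1]], where lam is the step function taking value
    rates[k] > 0 on [breaks[k], breaks[k+1]).
    Step 1: choose a sub-interval with probability equal to its share of the
    total mass.  Step 2: invert the truncated Exponential CDF within it."""
    # mass of sub-interval k: exp(-CumHaz(b_k)) - exp(-CumHaz(b_{k+1}))
    cum = [0.0]
    for r, b0, b1 in zip(rates, breaks, breaks[1:]):
        cum.append(cum[-1] + r * (b1 - b0))
    masses = [math.exp(-a) - math.exp(-b) for a, b in zip(cum, cum[1:])]
    u = rng.uniform(0.0, sum(masses))
    k, acc = 0, masses[0]
    while u > acc and k + 1 < len(masses):
        k += 1
        acc += masses[k]
    # inverse-CDF draw of Exp(rates[k]) truncated to [breaks[k], breaks[k+1]]
    lo, hi, lam = breaks[k], breaks[k + 1], rates[k]
    v = rng.random()
    return lo - math.log(1.0 - v * (1.0 - math.exp(-lam * (hi - lo)))) / lam
```

In the paper's sampler this draw serves as the proposal, followed by the accept/reject comparison against the target exposure density.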
4.2 Step 2: conditional sampling of missing recovery times
Every infectious individual recovers at a constant rate independently of other members in the population, but when conditionally sampling missing recovery times, we have to make sure that the imputed time points are compatible with the observed data and the sampled exposure times (Cauchemez and Ferguson, 2008; Fintzi et al., 2017). The conditional samples of missing recovery times should satisfy two conditions: first, an individual cannot recover before any time at which he or she is known to still be infectious; and more importantly, if another individual gets exposed during contact with individual i, then i’s recovery time cannot be imputed so early that the exposed individual is left with no possible infection source.
Sampling missing recovery times, therefore, amounts to conditionally sampling event times with endpoints restricted by count and contact data. This challenging task was addressed by the DARCI algorithm developed in Bu et al. (2020) (Proposition 4.2) for a simpler epidemic model with only one type of infectives. Here, we cannot directly use DARCI because the different transmissibility levels of symptomatic and asymptomatic individuals must be taken into account when we consider the possible ranges of recovery times to ensure the existence of viable infection sources for those exposed. Instead, we adapt the DARCI algorithm to accommodate our two types of infectives. Below we discuss the details of the modified procedure.
Utilizing the Markov property, this algorithm first segments the observation window into contiguous, non-overlapping time intervals. On each time interval, the disease statuses of all people are known at both endpoints, and thus we know the set of people who should recover during the interval. (For example, in the eX-FLU study, weekly epidemic surveys provide information on whether someone felt sick during each week.) Further, conditioned on the sampled exposure times, we also know the exposure cases and their exposure times during the interval. We sample the missing recovery times in the following steps:
Initialize a vector of “feasible lower bounds” LB with one entry per individual who must recover during the interval, each set to the start of the interval; for any such individual who was exposed within the interval, further set the lower bound to his or her exposure time (an individual cannot recover before being exposed);
Arrange the exposed individuals in increasing order of their exposure times, and for each of them (chosen in this order), examine his or her “potential infectious neighborhood”, defined as the intersection of the individual’s contact neighborhood at the exposure time with the set of individuals known to be infectious at that time.
If all potential infection sources belong to the set of individuals who must recover during the interval, then select one of them with probability proportional to his or her transmission force (accounting for the symptomatic or asymptomatic type), and update that individual’s feasible lower bound LB to the exposure time under consideration.
Draw the missing recovery times from Exponential distributions with the recovery rate, each truncated to the window between the individual’s feasible lower bound and the end of the interval.
Despite the dense notation, the intuition of this sampling algorithm is straightforward: if person i is the only possible infection source of person j, then i should not recover until j gets exposed, so that the resulting augmented data are consistent with the observed data.
We note that the conditional sampling of recovery times is parallelizable as well, since operations on each interval are independent and thus can be run in parallel.
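The final draw of the procedure reduces to inverse-CDF sampling of truncated Exponentials. A minimal sketch with hypothetical inputs (per-individual lower bounds LB, a common interval endpoint, recovery rate `gamma`):

```python
import math
import random

def draw_recovery_times(lower_bounds, interval_end, gamma, rng=random):
    """For each individual j who must recover inside (LB[j], interval_end],
    draw a recovery time from Exp(gamma) truncated to that window via the
    inverse-CDF formula for a truncated Exponential."""
    out = []
    for lb in lower_bounds:
        width = interval_end - lb
        u = rng.random()
        out.append(lb - math.log(1.0 - u * (1.0 - math.exp(-gamma * width))) / gamma)
    return out
```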
4.3 Uncertainty quantification and improving efficiency via averaging
While the algorithm proposed above provides a way to estimate the model parameters, we may further quantify uncertainty in our estimates by leveraging expressions for their asymptotic variances established in Nielsen and others (2000).
Consider the parameter estimates produced by the stochastic EM algorithm, and the true parameter values. Then the asymptotic variance matrix of the estimates takes the form in Eq. (21), where the three quantities appearing in it are the information matrix of the observed data evaluated at the true parameter values, a matrix representing the fraction of missing information due to partial observations, and the identity matrix of appropriate dimension. We may decompose Eq. (21) into two parts: the first term is the asymptotic variance of the observed-data MLE, while the second term accounts for the additional variance arising from the stochastic simulations (i.e., the conditional sampling). It can be shown that the latter contribution increases the variance by no more than 50% over the observed-data MLE (Proposition 4 in Nielsen and others (2000)).
Moreover, the first term in the asymptotic variance formula can be regarded as a constant given a model and a fixed sample size, but the second term can be reduced by averaging either across iterations within one run or across multiple runs of the algorithm. That is, to improve efficiency, we may adopt two strategies:
Take the average of the parameter values over the last several iterations of a single run as the final estimates; the asymptotic variance of this averaged estimator is then (see Proposition 5 in Nielsen and others (2000))
Take independent runs of the stochastic EM algorithm and average the estimates produced by the runs; the asymptotic variance of this averaged estimator is then (see Section 4.2.2 in Nielsen and others (2000))
For instance, if we run the stochastic EM procedure several times in parallel and take the average of the estimates produced by those independent runs, then the asymptotic variance of the averaged estimates is bounded above by the corresponding expression. We can further plug in the parameter estimates to obtain a variance estimate. In the simulations in Section 5.2 these approximations to the variance are conservative, but not for all parameters: while the Wald-type intervals for most parameters cover the truth nearly 100% of the time, two of the parameters in our simulations are covered 93% and 81% of the time, respectively.
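A toy numerical check of the averaging strategy (entirely schematic: one stEM fit is modeled as the observed-data MLE plus independent Monte Carlo noise, with the data-driven variance component held fixed) shows the simulation-noise component of the variance shrinking roughly like 1/K when K independent runs are averaged:

```python
import random
import statistics

def stem_run(rng, mle=1.0, mc_sd=0.1):
    """Toy stand-in for one stochastic-EM fit on a fixed dataset: the
    observed-data MLE plus Gaussian Monte Carlo noise from the conditional
    sampling.  (The data-driven variance component is held at zero here.)"""
    return mle + rng.gauss(0.0, mc_sd)

def averaged_estimate(K, rng):
    """Average of K independent runs; only the Monte Carlo component of the
    variance shrinks, at rate 1/K."""
    return sum(stem_run(rng) for _ in range(K)) / K
```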
Because the marginal likelihood of the observed data is not available, the observed data information matrix is not immediately obtainable. However, we can nonetheless estimate using the Louis identity (Louis, 1982)
where is the (marginal) log-likelihood of the observed data, and is the log-likelihood of the complete data. Both terms on the right-hand side involve only the complete data likelihood and can be estimated via Monte Carlo approximation using the augmented data samples generated across the iterations of our inference procedure (Diebolt and Ip, 1995). That is, these samples are already available from the estimation process, so evaluating Eq. (24) does not require additional sampling.
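For reference, in generic notation (the symbols below are placeholders, since the display of Eq. (24) is not reproduced here), the Louis identity expresses the observed information as a conditional expectation minus a conditional variance of complete data quantities:

```latex
I_{\mathrm{obs}}(\theta)
  = \mathbb{E}\!\left[-\nabla^2_{\theta}\,\ell_c(\theta; X) \,\middle|\, Y\right]
  - \operatorname{Var}\!\left[\nabla_{\theta}\,\ell_c(\theta; X) \,\middle|\, Y\right],
```

where $Y$ denotes the observed data, $X$ the complete data, and $\ell_c$ the complete data log-likelihood; both conditional moments can be replaced by Monte Carlo averages over the augmented samples.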
5 Simulation experiments
We now validate the proposed inference methods via simulation experiments. Synthetic data are simulated using the Gillespie algorithm (Gillespie, 1977) according to the generative process described in Section 2. A generated complete dataset includes the initial contact network , the initial exposed/infectious individual(s) , and the full event sequence , where each event consists of the event time, the identities of the individuals involved, and the event type (exposure, manifestation, recovery, link activation and termination). In addition, we assume that for each person in the target population we observe a vector of covariates, which in the simulation experiments are randomly sampled binary and standard normal variables.
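A minimal sketch of one Gillespie step for competing exponential events follows (event labels and rate values are illustrative; the actual simulator also updates individual disease statuses and the contact network after each event):

```python
import numpy as np

rng = np.random.default_rng(1)

def gillespie_step(rates, t):
    """One Gillespie step: given the current rates of all competing
    exponential events, sample the waiting time and which event fires."""
    total = sum(rates.values())
    wait = rng.exponential(1.0 / total)          # time to the next event
    labels = list(rates)
    probs = np.array([rates[k] for k in labels]) / total
    event = labels[rng.choice(len(labels), p=probs)]  # which event fires
    return event, t + wait

# Toy usage: one configuration's competing epidemic and network events.
rates = {"exposure": 0.4, "recovery": 0.1, "link_on": 1.0, "link_off": 0.8}
event, t_next = gillespie_step(rates, t=0.0)
```

Iterating this step, with the rate dictionary recomputed after every event, reproduces the exact event-time simulation used to generate the synthetic datasets.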
The full set of parameters to estimate is . For simplicity, throughout this section, we use the following ground-truth setting for the link rates:
That is, we assume that in the second social behavior stage , (link activation for pair) is reduced and (link termination for pair) is increased, mimicking a “quarantine” or “lockdown” phase. The initial network is a random Erdős–Rényi graph with edge density . Parameters for the epidemic process are chosen as follows:
These values are chosen to ensure a low probability of epidemic extinction at the beginning and so that a fair proportion (at least 50%) of the population has been exposed by the end of the outbreak.
5.1 Inference with complete data
We first validate our iterative inference procedure for complete data, as derived in Section 3.1. Table 2 presents the mean absolute errors (MAEs), variances and mean square errors (MSEs) for all model parameters across 40 simulations on -sized populations. Here we present results for (true value ) instead of , since is the quantity we directly solve for in inference. In the simulations presented below, the regression coefficient is set to .
We can see from Table 2 that when the complete event sequence is available, the derived inference method can estimate model parameters quite accurately. The most challenging parameters to estimate are (shown in terms of in the table) and , largely because we have to resort to numerical optimization procedures to solve for their MLEs.
5.2 Inference with partial observations
Our inference method can accommodate two types of missingness in epidemic observations: (1) exposure times and (2) recovery times. Between these two types of missingness, the former is more difficult to handle as it requires the conditional sampling step derived in Section 4.1.
First, we test the central component of our inference procedure. To do so, we hold out all the exposure times (i.e., every for each person who was ever infected) from each simulated dataset and treat those time points as unobserved, while considering all other information as observed. We then run the proposed stochastic EM algorithm (without the “sampling recovery times” step) on each partial dataset. Columns 2-4 in Table 3 (under “Missing expo. times”) summarize the estimation results for some of the model parameters across 40 independent simulations. As in Table 2, we report the MAE, variance and MSE of the estimates.
Next, on the same simulated data, we additionally hold out all the recovery times (i.e., for each person who ever became infectious) and treat both and for each infected person as unobserved. The full stochastic EM inference algorithm is now applied to each partial dataset, and the results are summarized in columns 5-7 of Table 3 (under “Missing both”).
As one might expect, there is some loss of accuracy compared to inference based on complete data, in particular for , and . This decrease in performance is largely due to variability in the sequential samples of individual local neighborhoods: when the exposure time is unknown and must be conditionally sampled, the local neighborhood structure for each person (i.e., and ) is also unknown and fluctuates throughout the rest of the sampler. This variability strongly affects the numerical optimization procedure for these three parameters, as they either directly depend on and or involve the function
which changes whenever gets updated. Moreover, when the exposure times are unknown, the accumulated infection force exerted on each susceptible person is also unavailable, which makes solving for and more challenging (see Section S2 of the Supplementary Material (Bu et al., 2021) for details on the numerical optimization).
We also note that when recovery times are missing in addition to exposure times, the parameter estimates tend to be more variable, as expected, since greater missingness induces greater uncertainty. Much of the additional uncertainty is reflected in the estimation of exposure- and latency-related parameters (e.g., , and ): the conditional samplers for exposure times and recovery times are co-dependent, and the estimation of these parameters is the most sensitive to the additional loss of information.
| Parameter | Missing expo. times | Missing both |
We further investigate the estimation of these more challenging parameters and find that accuracy improves as more data become available. We demonstrate this by conducting simulation experiments for different population sizes ( and ). Since the number of epidemic events grows approximately linearly in , more events are observed in a larger population. For each simulated dataset, we hold out both the exposure and recovery times and run the inference algorithm, but in Step 3 (solving for the MLEs) we fix all other parameters at their true values and estimate only and . Moreover, we run a separate set of experiments in which we estimate only (i.e., fixing at the truth as well). In Figure 3 we present the MSEs of the estimates of and when fixing the other parameters (shown as “b_S” and “exp(eta)” in red circles and green triangles), and of alone while fixing all other parameters (shown as “exp(eta) only” in blue squares). With increased population size, and hence more observed events, the error in estimating these parameters decreases.
6 Case study: flu season on a university campus
We now return to the study of transmission of influenza-like illnesses among students on a university campus, where high-resolution contact tracing was conducted to track physical proximity between study subjects.
This dataset was collected in a 10-week epidemiological study, eX-FLU (Aiello et al., 2016), in which inter-personal physical contacts of study participants were surveyed to investigate the effect of social intervention on the transmission of respiratory infections. A total of 590 university students enrolled in the study and were asked to respond to weekly surveys on influenza-like illness symptoms and social interactions; they also completed a comprehensive entry survey covering demographic information, lifestyle, immunization history, health-related habits, and tendencies toward behavioral change during a flu season or a hypothetical pandemic. Among the study population, 103 individuals were further recruited into a sub-study in which each subject was provided a smartphone equipped with an application, iEpi. The application pairs smartphones with other nearby study devices via Bluetooth and can thus record individual-level contacts (i.e., physical proximity) at five-minute intervals.
The iEpi sub-study took place from January 28, 2013 to April 15, 2013 (that is, from week 2 until after week 10 of the main study). Between weeks 6 and 7 there was a one-week spring break (March 1 to March 7), during which epidemic data collection was paused and the volume of recorded contacts dropped considerably. In our case study, we use data on the sub-study population from January 28 to April 4 (week 2 to week 10), and treat the two periods before and after the spring break as two different social behavior phases. That is, we regard weeks 2-6 as and weeks 7-10 as in our analysis.
Furthermore, we consider two types of “infectious” (status ) members of the study population: (1) multi-symptomatic and (2) uni-symptomatic. To keep notation consistent, we label the former by and the latter by . They are defined as follows.
multi-symptomatic (): a case with a cough AND one of these three symptoms: fever or feverishness, chills, or body aches (this is the definition of “influenza-like illness”).
uni-symptomatic (): a case with a cough, an important symptom of influenza.
For each infection case, we set the reported symptom onset time as the manifestation time (denoted by in previous sections), and treat the exposure time () and recovery time () as unobserved. Since (as dictated by the design of the assumed epidemic mechanism), we set the plausible latency interval as . From the weekly surveys (which asked each participant whether they had felt sick in the past week), we know that each missing recovery time must lie within a 7-day interval whose lower and upper bounds are the start and end of a week. Moreover, we assume that all contact network events are fully observed, since the high-resolution contact tracing provides the timepoints of initiation and termination of all individual-level contacts. (These timepoints are obtained by processing the Bluetooth signals that indicate close proximity of the participants’ smartphones; technical details on processing the Bluetooth signals are provided in Supplement S6.1 of Bu et al. (2020).) Hence the inference procedure proposed in Section 4 is applicable to this dataset.
In the rest of this section, we first address a realistic concern of possible external infection sources for the sub-study population, and then present details and discuss results of our data analysis.
6.1 Inference with external infection sources
Since the 103 individuals in the dataset are sub-sampled from the 590 study participants, who are in turn sampled from the entire university campus population, we must treat the data as observed in an open population rather than a closed one, and the model requires slight modification. Specifically, individuals in our target population may be infected by people outside of it; we call such people “external infectors”.
For simplicity, we represent the joint force of all external infectors by a single infector outside the population with a constant level of transmissibility over time; this external force of infection is exerted uniformly on all members of the target population.
For each susceptible individual , let the rate of disease onset (i.e., manifestation) due to external infectors be , and let this onset rate depend on individual characteristics , similar to our treatment of the internal exposure rate :
where denotes the population average external onset rate, and coefficients represent the effects of individual characteristics on subject ’s deviation of susceptibility from the average level.
Here is the rate of moving from status directly to either or , rather than from to , which is why we call it the “external onset rate” instead of the “external exposure/infection rate”. We do not introduce both an exposure rate (like ) and a manifestation rate (like ) for external infection cases because of identifiability concerns: since all susceptible people are exposed to the same external infector with time-invariant transmissibility, the exposure rate and the manifestation rate would not be jointly identifiable when exposure times are unobserved. To ensure identifiability, we therefore include only one rate, and this “onset rate” can be interpreted as the rate at which any susceptible individual develops contagiousness due to external infection forces.
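A minimal sketch of this log-linear specification, assuming a generic covariate vector and hypothetical coefficient values (the names and numbers below are illustrative, not estimates):

```python
import numpy as np

def external_onset_rate(x, baseline, gamma):
    # baseline: population-average external onset rate;
    # gamma: covariate coefficients acting multiplicatively on the log scale
    return baseline * np.exp(x @ gamma)

# Hypothetical covariates and coefficients for one susceptible individual:
x_i = np.array([1.0, 0.0, 0.5])
gamma = np.array([-0.8, -0.1, 0.25])
rate_i = external_onset_rate(x_i, baseline=0.05, gamma=gamma)
```

The same multiplicative form mirrors our treatment of the internal exposure rate, so individual deviations from the population average enter both rates in a consistent way.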
Now the set of parameters is extended to , and we can write down a complete data likelihood by slightly modifying Eq. (7), in which the terms involving the new parameters and are separate from the other terms. This means that introducing external cases does not affect estimation of the other parameters at all, and that we can still use the partial data inference procedure detailed in Section 4 to analyze the real data. We provide details on complete data inference with external cases in Section S4.1 of the Supplementary Material (Bu et al., 2021).
6.2 Data analysis
Before applying our model framework to the data, we first discuss how we identify internal and external infection cases and describe the individual characteristics used in the analysis. We adopt the following labeling criteria: if an infected person had any infectious contact (within the 103-person population) up to 2 weeks prior to symptom onset, then we label the case “internal”; otherwise we label it “external”. This procedure yields 18 internal cases and 16 external cases. Moreover, among all 34 cases, 13 are multi-symptomatic () and 21 are uni-symptomatic (). A summary of the breakdown of all infection cases is provided in Table 4.
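The labeling rule can be sketched as follows (the data structures and the two-week-window implementation are illustrative assumptions, not the exact code used in the analysis):

```python
from datetime import datetime, timedelta

def label_case(onset_time, contacts, infectious_periods, window_days=14):
    # contacts: list of (partner_id, contact_time) records for this person;
    # infectious_periods: dict partner_id -> list of (start, end) intervals.
    # A case is "internal" if any contact within the window before symptom
    # onset was with a partner who was infectious at contact time.
    lo = onset_time - timedelta(days=window_days)
    for partner, t in contacts:
        if lo <= t <= onset_time:
            for start, end in infectious_periods.get(partner, []):
                if start <= t <= end:
                    return "internal"
    return "external"

# Illustrative usage with made-up records:
onset = datetime(2013, 3, 10)
contacts = [("a", datetime(2013, 3, 5))]
periods = {"a": [(datetime(2013, 3, 1), datetime(2013, 3, 8))]}
label = label_case(onset, contacts, periods)
```

Here the contact on March 5 falls within two weeks of onset and inside partner “a”’s infectious interval, so the case is labeled internal; with no qualifying contacts the label defaults to external.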
We consider the following four individual-level characteristics (all collected from the entry survey) that have previously been linked to disease transmission risk. (The original survey questions used to compute the derived covariates “change_behavior” and “prevention” are included in Sections S4.2 and S4.3 of the Supplementary Material (Bu et al., 2021).)
flushot: a binary indicator of whether the study subject received a flu shot this year.
wash_opt: a binary indicator of whether the study subject’s hand-washing habits are considered “optimal”, derived from survey questions about how long and how frequently one usually washes their hands.
change_behavior: a derived score measuring how willingly the study subject would change their lifestyle during a hypothetical pandemic, calculated by standardizing the sum of the 0/1 scores of 13 Yes/No questions (Yes = 1, No = 0) about voluntary behavioral changes that essentially amount to reduced social activity or isolation under a lockdown; a higher score represents greater willingness to change one’s lifestyle in response to a pandemic.
prevention: a derived score measuring one’s belief in the effectiveness of different preventative practices in reducing the risk of catching the flu, calculated by standardizing the sum of response scores (ranging from 1 to 5; 1 = strongly disagree, 5 = strongly agree) to 6 questions regarding potential preventative measures against flu transmission; a higher score represents stronger belief in the effectiveness of preventative practices.
We perform 20 independent runs of the stochastic EM inference procedure on the dataset, each with a different random initialization and 60 burn-in steps. For each run, we average the last 20 iterations (after burn-in), and then average over the 20 runs to produce the final parameter estimates. Asymptotic standard errors are obtained using the method described in Section 4.3; here we obtain conservative standard error estimates by setting and upper-bounding the asymptotic variance matrix by , where are the final parameter estimates produced by averaging.
Tables 5 and 6 present estimates of select parameters of interest; here we take one day as the unit of time. From the epidemic parameter estimates, we can see that the baseline exposure rate for this population is quite high, indicating fast disease exposure upon contact (on average, it takes approximately 0.22 days for a contact to lead to infection if the susceptible individual is not vaccinated and does not wash their hands properly); the latency period lasts slightly less than 5 days on average, while recovery from symptoms and contagiousness takes about 6 days on average. The total external infection force experienced by the entire -person population is on the scale of , indicating that on average there would be one disease onset due to external sources every other day if nobody in the study population had a flu shot or washed their hands optimally. The effects of individual-level covariates are estimated with relatively large standard errors (shown in parentheses), potentially due to the small sample size (in particular, the limited number of infection cases). Nevertheless, the effect of hand-washing (“wash_opt”) appears considerable: there is a -fold reduction ( ) in exposure risk for optimal compared with suboptimal hand-washing, and this effect appears significant in that the 95% Wald-type confidence interval constructed with the conservative standard error estimate is , which does not cover zero.
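In this log-linear model, the fold change implied by a binary covariate follows directly from its coefficient; a generic sketch (the numeric value below is illustrative, not an estimate from our tables):

```python
import math

def fold_reduction(beta):
    # In a log-linear rate model, a binary covariate with coefficient beta
    # multiplies the rate by exp(beta); a negative beta therefore yields
    # an exp(-beta)-fold reduction in the rate.
    return math.exp(-beta)

# e.g., a coefficient of -ln(10) would correspond to a 10-fold reduction
ratio = fold_reduction(-math.log(10))
```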
In Table 7 we report estimates of several parameters related to the contact network process. Here we emphasize the difference between the change rates of (healthy-healthy) links and (healthy-ill) links, as well as the difference between the two phases ( before spring break and after). The link deletion rates for links are higher than those for links in both phases, suggesting that contact between a healthy person and an infectious person is on average shorter than contact between two healthy people, likely because students cut meetings short with peers who seemed sick in order to avoid infection during flu season. Moreover, comparing the rates for phase and phase , the level of network activity is much higher (both in establishing and in breaking contact) in (weeks 2 to 6, before spring break) than in (weeks 7 to 10, after spring break). Such findings are enabled by our model design, which allows for different levels of network activity across time phases.
| ( v.s. infectiousness) | 0.0622 | 0.0526 |
| (proportion of ) | 0.382 | 0.0854 |
| Coefficient | flushot | wash_opt | change_behavior | prevention |
| (internal exposure) | -0.105 (0.671) | -2.42 (0.817) | -0.201 (0.326) | -0.0541 (0.273) |
| (external onset) | -0.805 (0.597) | -0.139 (0.471) | 0.257 (0.263) | -0.0362 (0.273) |
Estimates of epidemic coefficients on individual characteristics, with conservative asymptotic standard errors in parentheses.
Through our data analysis, we found that proper hand-washing is significantly associated with reduced risk of flu infection, and that there is a considerable external force of infection on the study population. Moreover, study participants exhibit contact behavior adaptive to flu transmission, in that contact between healthy and infectious individuals is both less frequent and shorter-lived than contact between healthy individuals. Our model also identifies differential inter-personal contact patterns in the two observational periods before and after the school break. These findings are consistent with intuition and are now quantified statistically within a joint inferential framework.
7 Discussion
In this paper, we present a continuous-time Markov chain model for infectious diseases that accounts for individual-level heterogeneity and a dynamic contact structure. Our proposed model can capture the interplay between the epidemic process and network changes, and, more importantly, can describe heterogeneous susceptibility and transmissibility through individual covariates. To accommodate unobserved exposure times and recovery times in real epidemic data, we develop a data-augmented inference procedure based on the stochastic EM algorithm so that we can make use of the complete data likelihood. We also design an efficient method to conditionally sample missing exposure times that are compatible with the observed data and respect the dynamic contact network structure. Experiments show that the developed inference procedure performs well on partial data and is able to uncover notable phenomena from modern epidemic data with high-resolution contact tracing.
It is important to note that the proposed modeling framework is flexible beyond our choice of underlying compartments. That is, our approach can easily be adapted to incorporate notions of reinfection (by allowing individuals to reenter the susceptible population) or to distinguish between more than two types of infections. In pursuing such generalizations, introducing additional parameters requires careful consideration of the uncertainty quantification from the stochastic EM algorithm. Although the estimated confidence intervals in our setting achieve close-to-nominal empirical coverage, they rely on variance approximation formulas, and it is crucial to conduct similar validations in more complex models.
Finally, our analysis of the iEpi data confirms the importance of hand-washing in reducing the spread of influenza-like illness. Unlike previous claims in this area, we are able to measure the actual effect on the transmission rate of a disease in an active population with dynamically changing contact patterns. We hope this development encourages greater collection of observational, high-frequency, individual-level data in this area, to build better understanding of other pharmaceutical and non-pharmaceutical interventions. For example, future studies could estimate the effectiveness of vaccination in preventing transmission under different social interaction rates and population densities, and validate claims about the efficacy of mask-wearing and active social distancing. Importantly, such data can be collected discreetly in closed populations and provide invaluable insight into the deployment of public health interventions (Motta et al., 2021).
Supplementary material including derivations, proofs and additional details of inference and data analysis. (Document attached at the end.)
Code and example synthetic data. R and Python code for inference and simulations; since the real data are proprietary, only example synthetic data are provided. GitHub repository: github.com/fanbu1995/EpiNetHetero.
- Mask use, hand hygiene, and seasonal influenza-like illness among young adults: a randomized intervention trial. The Journal of Infectious Diseases 201 (4), pp. 491–498. Cited by: §1.
- Design and methods of a social network isolation study for reducing respiratory infection transmission: the eX-FLU cluster randomized trial. Epidemics 15, pp. 38–55. Cited by: §1, §1, §6.
- Impact of a comprehensive workplace hand hygiene program on employer health care insurance claims and costs, absenteeism, and employee perceptions and practices. Journal of Occupational and Environmental Medicine 58 (6), pp. e231. Cited by: §1.
- Transmission of pneumococcal carriage in families: a latent Markov process model for binary longitudinal data. Journal of the American Statistical Association 95 (452), pp. 1044–1053. Cited by: §1.
- Stochastic epidemic models: a survey. Mathematical Biosciences 225 (1), pp. 24–35. Cited by: §1.
- Likelihood-based inference for partially observed epidemics on dynamic networks. Journal of the American Statistical Association, pp. 1–17. Cited by: §1, §1, §1, §4.2, footnote ‡‡.
- Supplement to “Likelihood-based inference for partially observed stochastic epidemics with individual heterogeneity”. Cited by: §2, §3.1, §4.1, §5.2, §6.1, footnote **.
- Likelihood-based estimation of continuous-time epidemic models from time-series data: application to measles transmission in London. Journal of the Royal Society Interface 5 (25), pp. 885–897. Cited by: §4.2.
- S. pneumoniae transmission according to inclusion in conjugate vaccines: bayesian analysis of a longitudinal follow-up in schools. BMC Infectious Diseases 6 (1), pp. 14. Cited by: §1.
- The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly 2, pp. 73–82. Cited by: §3.2.
- Using real-world contact networks to quantify the effectiveness of digital contact tracing and isolation strategies for the COVID-19 pandemic. medRxiv. Cited by: §1.
- A stochastic EM algorithm for approximating the maximum likelihood estimate. Technical report, Sandia National Labs., Livermore, CA (United States). Cited by: §4.3.
- Graph-coupled HMMs for modeling the spread of infection. arXiv preprint arXiv:1210.4864. Cited by: §1.
- Fitting birth-death processes to panel data with applications to bacterial DNA fingerprinting. The Annals of Applied Statistics 7 (4), pp. 2315. Cited by: §3.2.
- Contact tracing and disease control. Proceedings of the Royal Society of London. Series B: Biological Sciences 270 (1533), pp. 2565–2571. Cited by: §1.
- Hierarchical graph-coupled HMMs for heterogeneous personalized health data. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp. 239–248. Cited by: §1.
A unifying variational inference framework for hierarchical graph-coupled HMM with an application to influenza infection.
Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §1.
- Report 9: impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand. Cited by: §1.
- Efficient data augmentation for fitting stochastic epidemic models to prevalence data. Journal of Computational and Graphical Statistics 26 (4), pp. 918–929. Cited by: §4.2.
- Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry 81 (25), pp. 2340–2361. Cited by: §5.
- Stochastic modeling of scientific data. CRC Press. Cited by: §2, §3.2.
- Direct likelihood-based inference for discretely observed stochastic compartmental models of infectious disease. The Annals of Applied Statistics 12 (3), pp. 1993–2021. Cited by: §1.
- Birth/birth-death processes and their computable transition probabilities with biological applications. Journal of Mathematical Biology 76 (4), pp. 911–944. Cited by: §1.
- Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution. The Annals of Applied Statistics 3 (3), pp. 1204. Cited by: §3.2.
- Outbreaks of Streptococcus pneumoniae carriage in day care cohorts in Finland–implications for elimination of transmission. BMC Infectious Diseases 9 (1), pp. 102. Cited by: §1.
- Intensified hand-hygiene campaign including soap-and-water wash may prevent acute infections in office workers, as shown by a recognized-exposure-adjusted analysis of a randomized trial. BMC Infectious Diseases 17 (1), pp. 1–9. Cited by: §1.
- Impact of health campaign on hand hygiene with alcohol-based hand rubs in a non-clinical setting. Journal of Hospital Infection 83, pp. S23–S28. Cited by: §1.
- Sequential Monte Carlo algorithms for agent-based models of disease transmission. arXiv preprint arXiv:2101.12156. Cited by: §1, §1.
- A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society of London, Series A 115 (772), pp. 700–721. Cited by: §1.
- Infectious disease control using contact tracing in random and scale-free networks. Journal of The Royal Society Interface 3 (6), pp. 55–62. Cited by: §1.
- The engines of SARS-CoV-2 spread. Science 370 (6515), pp. 406–407. Cited by: §1.
- Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 44 (2), pp. 226–233. Cited by: §4.3.
- To quarantine, or not to quarantine: a theoretical framework for disease control via contact tracing. Epidemics 34, pp. 100428. Cited by: §1.
- Assessment of simulated surveillance testing and quarantine in a SARS-CoV-2–vaccinated population of students on a university campus. In JAMA Health Forum, Vol. 2, pp. e213035–e213035. Cited by: §7.
- COVID-19 superspreading suggests mitigation by social network modulation. Physical Review Letters 126 (11), pp. 118301. Cited by: §1.
- The stochastic EM algorithm: estimation and asymptotic results. Bernoulli 6 (3), pp. 457–489. Cited by: §1, item 1, item 2, §4.3, §4.3.
- Fast MCMC sampling for Markov jump processes and extensions. Journal of Machine Learning Research 14 (11). Cited by: §3.2.
- Stochastic population processes: analysis, approximations, simulations. OUP Oxford. Cited by: §3.2.
- Hand washing with soap and water together with behavioural recommendations prevents infections in common work environment: an open cluster-randomized trial. Trials 13 (1), pp. 1–11. Cited by: §1.
- Global transmission network of SARS-CoV-2: from outbreak to pandemic. medRxiv. Cited by: §1.
- Modelling strong control measures for epidemic propagation with networks—a COVID-19 case study. IEEE Access 8, pp. 109719–109731. Cited by: §1.
- Household SARS-CoV-2 transmission and children: a network prospective study. Clinical Infectious Diseases: An Official Publication of the Infectious Diseases Society of America. Cited by: §1.
- Outcomes of a pilot hand hygiene randomized cluster trial to reduce communicable infections among US office-based employees. Journal of Occupational and Environmental Medicine 57 (4), pp. 374. Cited by: §1.
- Hand hygiene performance and beliefs among public university employees. Journal of Health Psychology 20 (10), pp. 1263–1274. Cited by: §1.
- Computational tools for assessing gene therapy under branching process models of mutation. arXiv preprint arXiv:2111.08183. Cited by: §3.2.
- The healthy workplace project: results of a hygiene-based approach to employee wellness. American Journal of Health Promotion 29 (5), pp. 339–341. Cited by: §1.
- . Journal of Computational and Graphical Statistics 29 (2), pp. 238–249. Cited by: §1.
- Likelihood-based inference for discretely observed birth–death-shift processes, with applications to evolution of mobile genetic elements. Biometrics 71 (4), pp. 1009–1021. Cited by: §3.2.
- Efficient transition probability computation for continuous-time branching processes via compressed sensing. In Uncertainty in artificial intelligence: proceedings of the… conference. Conference on Uncertainty in Artificial Intelligence, Vol. 2015, pp. 952. Cited by: §3.2.
S1 Additional details of the model framework
As a supplement to the model framework introduced in Section 2, we may consider individual-level covariates in the change rates of links in the contact network. Specifically, given the status of the process at time , , if and are not in contact, then they initiate contact with rate , where
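A hedged sketch of such a covariate-dependent activation rate (the functional form and all names below are our illustrative assumptions, since the display is not reproduced here):

```python
import numpy as np

def link_activation_rate(x_i, x_j, base_rate, phi):
    # base_rate: phase-specific baseline activation rate for a pair not in
    # contact; phi: coefficients on the pair's combined covariates, acting
    # multiplicatively on the log scale.
    return base_rate * np.exp((x_i + x_j) @ phi)

# Illustrative pair of individuals with two covariates each:
pair_rate = link_activation_rate(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                                 base_rate=0.5, phi=np.array([0.2, -0.1]))
```

This mirrors the log-linear treatment of the epidemic rates in the main text, so that individual characteristics can modulate network dynamics in the same multiplicative fashion.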