Individual-level Modeling of COVID-19 Epidemic Risk

06/28/2020 ∙ by Andres Colubri, et al. ∙ Broad Institute

The ongoing COVID-19 pandemic calls for a multi-faceted public health response comprising complementary interventions to control the spread of the disease while vaccines and therapies are developed. Many of these interventions need to be informed by epidemic risk predictions given available data, including symptoms, contact patterns, and environmental factors. Here we propose a novel probabilistic formalism based on Individual-Level Models (ILMs) that offers rigorous formulas for the probability of infection of individuals, which can be parameterised via Maximum Likelihood Estimation (MLE) applied to compartmental models defined at the population level. We describe an approach where individual data collected in real time is integrated with overall case counts to update a predictor of the susceptibility to infection as a function of individual risk factors.


1 Introduction

The COVID-19 pandemic [Velavan2020, Fauci2020] has emerged as the most serious health crisis that humanity has faced since the 1918 influenza pandemic [Viboud2018]. Its causal pathogen, SARS-CoV-2 [Lu2020], is a coronavirus new to the human population with unique molecular [Wrapp2020], physiopathological [Zou2020], and epidemiological [Liu2020] features. This has resulted in the exponential spread of COVID-19 around the world, with over 11 million confirmed cases and 500,000 deaths worldwide at the time of this writing, as reported by the COVID-19 interactive web dashboard from Johns Hopkins University (https://coronavirus.jhu.edu/map.html) [Dong2020].

As part of the public interventions that aim to reduce the transmission of COVID-19, precautionary self-isolation of the general population and quarantining of suspected and confirmed mild cases is a strategy that can substantially reduce the effective reproductive number of the disease [Wang2020]. This has the important consequence of "flattening the epidemic curve" until herd immunity is achieved, either by infection or vaccination, and therefore avoids overwhelming the health care system [Kissler2020]. Active monitoring of contacts via traditional contact tracing by health care workers [eames2003], potentially complemented and expanded with proximity-sensing mobile tracking apps [Ferretti2020], could further help mitigate transmission [Peak2020].

In all these interventions, the spatio-temporal modeling of the disease is critical to understand risk factors associated with transmission and, through this, adjust the magnitude and timing of the interventions in order to maximize their chance of success. In particular, predicting the risk of individuals contracting the disease over space and time can help to identify subpopulations at increased risk and to inform interventions such as quarantine. Most importantly, being able to promptly identify who, in a system, is at risk of infection during an outbreak is key to the efficient control of the epidemic. However, developing such models is challenging in a situation like the current pandemic, due to the uncertainty in the epidemiological parameters of a novel pathogen and also to the urgency with which interventions and tools are needed.

In this paper, we adopt an individual-level model (ILM) framework that enables us to express the probability of a susceptible individual being infected as a function of their interactions with the surrounding infectious population, while also allowing us to incorporate the effect of individually-varying risk factors (e.g., age, pre-existing conditions). ILMs are intuitive and flexible because they are expressed in terms of individual interactions [Gibson1997, Keeling2001, Neal2004], but they are also computationally costly to parameterise, especially in the case of epidemics in large populations. More recent work has shown how to simplify the likelihood calculations to make ILMs more readily applicable in real-life scenarios [Deardon2010]. The addition of geographical covariates can result in flexible infectious disease models that can be used for formulating etiological hypotheses and identifying geographical regions of unusually high risk in order to formulate preventive action [Mahsin2020]. Therefore, ILMs can provide a better understanding of the spatio-temporal dynamics of disease spread and of the impact of policies and interventions for controlling epidemic outbreaks.

We first apply the formalism for ILMs to derive an expression for the marginal probability of individual risk of infection as a function of parameters with straightforward epidemiological interpretation and initial estimation. We incorporate symptom and other individual-level data to update the risk of infection based on this new information. We then construct a population-level compartmental SEIR epidemic model, where the rate of infection can be estimated from the individual-level probabilities given a random sample of individuals from the population. This allows us to express the population-level parameters as a function of the individual-level parameters, and to use partially observed data (overall case counts, and individual risk factors and contacts) to apply Maximum Likelihood Estimation (MLE) within a Partially Observed Markov Process (POMP) framework. The POMP framework enables us to solve a computationally more tractable MLE problem thanks to iterated filtering, an efficient computational method based on a sequence of filtering operations that has been shown to converge to a maximum likelihood parameter estimate. As a result of this approach, we arrive at estimates of individual-level parameters that can be used to predict risk of infection.

The structure of this article is as follows. In Section 2, the formula for individual epidemic risk is derived and applied to evaluate the conditional probability of infection given additional individual-level data, as well as the overall likelihood function. In Section 3, the iterated filtering MLE is presented to fit the model parameters. Section 4 applies the iterated filtering MLE algorithm to simulated data generated with an agent-based model (ABM). We conclude the article with a discussion in Section 5.

2 Model Formalism

2.1 Individual-level Models

In this section, we present the general form of the epidemic ILM based on [Deardon2010], where heterogeneity in infectious disease transmission is allowed at the individual level. We denote the sets of individuals who are susceptible, infectious, or removed at time $t$ as $S(t)$, $I(t)$, or $R(t)$, respectively. Note that, for a given $t$, $S(t)$, $I(t)$, and $R(t)$ are mutually exclusive. Here, we assume time is discretized so that time point $t$, for $t = 1, 2, \dots$, represents the continuous time interval $[t, t+1)$.

Let $P(i,t)$ be the probability of a susceptible individual $i$ being infected at time $t$. Then a general form of the ILM, without geographical dependency, is given as:

$$P(i,t) = 1 - \exp\Big[-\Omega_S(i) \sum_{j \in I(i,t)} \Omega_T(j)\,\kappa(i,j) - \varepsilon(i,t)\Big] \qquad (1)$$

where $I(i,t)$ is the set of infectious individuals that interacted with $i$ in the time interval $[t, t+1)$. The functions $\Omega_S(i)$ and $\Omega_T(j)$ represent risk factors associated with susceptible individual $i$ contracting, and infectious individual $j$ passing on, the disease, respectively. Risk factors that involve both the infected and susceptible individuals, such as spatial separation or contact networks, are incorporated through the (time-dependent) infection kernel, $\kappa(i,j)$. Finally, the sparks term, $\varepsilon(i,t)$, represents infections that are not well explained by the $\Omega_S(i)$, $\Omega_T(j)$, and $\kappa(i,j)$ terms (e.g., infections originating from outside the study population). For example, $\varepsilon(i,t)$ could be used to represent purely random infections that occur with equal probability throughout the susceptible population at any given time.

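This formula is the result of assuming a Poisson infection process in each time interval $[t, t+1)$. We count the number of transmission events between susceptible $i$ and infected $j$, $N_{ij}$, which follows a Poisson distribution, $N_{ij} \sim \mathrm{Poisson}(\mu_{ij})$, with $\mu_{ij}$ the rate of transmission. Non-infection from $j$ corresponds to $N_{ij} = 0$, so then $P(N_{ij} = 0) = e^{-\mu_{ij}}$. The rate of transmission from $j$ to $i$ is modeled as the product form $\mu_{ij} = \Omega_S(i)\,\Omega_T(j)\,\kappa(i,j)$. Non-infection of $i$ from all infected individuals follows from independence between these Poisson processes, therefore (ignoring the sparks term):

$$P(i,t) = 1 - \prod_{j \in I(i,t)} e^{-\mu_{ij}} = 1 - \exp\Big[-\Omega_S(i) \sum_{j \in I(i,t)} \Omega_T(j)\,\kappa(i,j)\Big]$$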

Formally, ILMs can be extended to incorporate the effect of spatially varying risk factors on the transmission of infectious disease. The resulting geographically-dependent ILMs (GD-ILMs) have the general form [Mahsin2020]:

$$P(i,k,t) = 1 - \exp\Big[-\Omega_S(i,k) \sum_{j \in I(i,t)} \Omega_T(j,k)\,\kappa(i,j) - \varepsilon(i,k,t)\Big] \qquad (2)$$

where $k$ represents the area index, which varies from $1$ to $K$. Here, $\Omega_S(i,k)$ is a susceptibility function of potential risk factors associated with susceptible individual $i$ in area $k$ contracting the disease; $\Omega_T(j,k)$ is a transmissibility function of potential risk factors associated with infectious individual $j$ in area $k$ passing on the disease; $\kappa(i,j)$ is the infection kernel that represents risk factors associated with both susceptible and infectious individuals at time $t$ (assumed to be independent of area $k$); and the sparks term, $\varepsilon(i,k,t)$, represents "random" infections that are not otherwise explained by the model. However, in the context of this manuscript we will only consider the simple ILM without explicit geographical dependencies.

2.2 Model Covariates

The aforementioned susceptibility and transmissibility functions, $\Omega_S(i)$ and $\Omega_T(j)$, respectively, can be used to model individual-level covariates. For example, we may wish to identify vulnerable age groups and genders, or to estimate the effect of vaccination, through the susceptibility function. We propose a general susceptibility function as follows:

$$\Omega_S(i) = \alpha_0 + \alpha_1 X_1(i) + \dots + \alpha_n X_n(i) \qquad (3)$$

where $\alpha_0$ is a constant susceptibility parameter, and $X_1(i), \dots, X_n(i)$ are covariates that represent various susceptibility factors to be included in the model (e.g., age, pre-existing conditions). Thus, $\alpha_k$ is the parameter for the $k$-th individual-level covariate.

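The transmissibility function has a similar general form:

$$\Omega_T(j) = \phi_0 + \phi_1 Y_1(j) + \dots + \phi_m Y_m(j) \qquad (4)$$

where $\phi_0$ is a constant transmissibility parameter, and $Y_1(j), \dots, Y_m(j)$ are the transmissibility covariates with corresponding coefficients $\phi_1, \dots, \phi_m$.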

The infection kernel $\kappa(i,j)$ is a function of the varying distance between $i$ and $j$ over the course of the time interval $[t, t+1)$:

(5)

The next sections will provide specific details on these factors and concrete assumptions motivated by what is known about COVID-19.

2.3 Model Simplifications

The general forms of the $\Omega_S(i)$ and $\Omega_T(j)$ functions described above allow us to incorporate an arbitrary number of covariates into the model. Here we propose a very simple initial model. The susceptibility function will depend only on the "immunity status" of the susceptible individual. We define the variable $A(i)$ to be 1 if individual $i$ is over 65 or is immunosuppressed due to some pre-existing condition, and 0 otherwise. Therefore:

$$\Omega_S(i) = \alpha_0 + \alpha_1 A(i) \qquad (6)$$

For the transmissibility function, an important factor determining the potential for an infected individual to pass along the virus seems to be the presence or absence of symptoms. So in this case the binary covariate $B(j)$ takes the value 0 if the infected individual $j$ is asymptomatic or pre-symptomatic, and 1 if symptomatic. We thus arrive at the following simplified form:

$$\Omega_T(j) = \phi_0 + \phi_1 B(j) \qquad (7)$$

As for the infection kernel, for the time being $\kappa(i,j)$ is just 1 when $j$ is in $C(i,t)$, the contact set of $i$, which includes all the infectious individuals with whom $i$ was closer than 2 meters for at least 15 minutes in $[t, t+1)$, and 0 otherwise:

$$\kappa(i,j) = \begin{cases} 1 & \text{if } j \in C(i,t) \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

Finally, we will adopt a zero sparks term. This assumption is reasonable if transmission is mainly due to interactions between individuals, and not through the environment (e.g., contaminated surfaces or fomites). There is some anecdotal evidence that this might be the case in COVID-19 [Ferretti2020], but for the time being we just assume $\varepsilon(i,t) = 0$ to keep the models simple.

With these modeling decisions we reach the following expression for the individual probability of infection:

$$P(i,t) = 1 - \exp\Big[-\big(\alpha_0 + \alpha_1 A(i)\big) \sum_{j \in C(i,t)} \big(\phi_0 + \phi_1 B(j)\big)\Big] \qquad (9)$$
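As an illustration of the simplified model in this section, the following minimal Python sketch evaluates the individual probability of infection for one susceptible individual in one time step. It assumes linear susceptibility and transmissibility functions with a single binary covariate each, a kernel equal to 1 for every member of the contact set, and a zero sparks term. The function name, argument names, and numerical values are hypothetical and chosen only for illustration.

```python
import math

def infection_probability(at_risk, contacts_symptomatic, a0, a1, t0, t1):
    """Probability that a susceptible individual is infected in one time step.

    at_risk: 1 if the individual is over 65 or immunosuppressed, else 0.
    contacts_symptomatic: one 0/1 flag per infectious contact (1 = symptomatic).
    a0, a1: hypothetical susceptibility constant and covariate coefficient.
    t0, t1: hypothetical transmissibility constant and covariate coefficient.
    """
    susceptibility = a0 + a1 * at_risk
    # The kernel is 1 for every contact in the contact set, so the sum
    # simply runs over the listed contacts.
    total_transmissibility = sum(t0 + t1 * s for s in contacts_symptomatic)
    # Zero sparks term: with no infectious contacts the probability is zero.
    return 1.0 - math.exp(-susceptibility * total_transmissibility)

# Illustrative (made-up) coefficients.
p_none = infection_probability(1, [], 0.2, 0.4, 0.5, 1.0)
p_low = infection_probability(0, [0], 0.2, 0.4, 0.5, 1.0)
p_high = infection_probability(1, [1, 1], 0.2, 0.4, 0.5, 1.0)
```

With these made-up values, an at-risk individual with two symptomatic contacts has a much higher probability than a low-risk individual with a single asymptomatic contact, and the probability is exactly zero without contacts.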

2.4 Conditioning on Additional Data

Formula (9) gives an expression for the marginal probability of infection of individual $i$ in time interval $[t, t+1)$. This probability depends on a number of individual-level and area-level covariates. However, additional data from the individual can be used to arrive at an updated risk of infection. In particular, we are interested in the probability of infection over the course of the past $T$ days given this new data, $P(I_T \mid D)$, with $T$ defining an appropriate retrospective window of possible infection. Given that the incubation period of COVID-19 can be as long as two weeks, $T = 14$ should be a suitable choice to inform quarantining/testing measures. Applying Bayes' Theorem to this probability, we can formally write:

$$P(I_T \mid D) = \frac{P(D \mid I_T)\,P(I_T)}{P(D)}$$

The probability of an infection over the course of the past $T$ days can be expressed as a function of the per-day probabilities of infection:

$$P(I_T) = \sum_{k=1}^{T} P\big(I^{(k)}\big)$$

where $I^{(k)}$ denotes infection exactly $k$ days ago, since these events (infection 1 day ago, 2 days ago, and so on) are mutually exclusive. Furthermore, infection $k$ days ago implies that infection did not happen until exactly that day, and so:

$$P\big(I^{(k)}\big) = P(t-k) \prod_{l=k+1}^{T} \big[1 - P(t-l)\big]$$

where $P(t)$ is precisely (9), with indices for individual $i$ omitted for clarity.
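The accumulation of per-day probabilities over a retrospective window can be sketched as follows. This is a minimal illustration under the assumption that the first infection on a given day requires no infection on any earlier day of the window. The function name and the daily values are hypothetical. Under that assumption the sum collapses to one minus the probability of escaping infection on every day, which the sketch uses as a cross-check.

```python
import math

def window_infection_probability(daily_probs):
    """Probability of at least one infection over a window of days.

    daily_probs[0] is the oldest day in the window and daily_probs[-1]
    the most recent one; each entry is a per-day infection probability.
    """
    total = 0.0
    not_infected_yet = 1.0
    for p in daily_probs:
        # First infection on this day: escaped every earlier day, infected now.
        total += not_infected_yet * p
        not_infected_yet *= 1.0 - p
    return total

# Illustrative (made-up) per-day probabilities for a 4-day window.
daily = [0.01, 0.05, 0.0, 0.10]
p_window = window_infection_probability(daily)

# Cross-check: one minus the probability of escaping infection every day.
p_closed_form = 1.0 - math.prod(1.0 - p for p in daily)
```

The agreement between `p_window` and `p_closed_form` reflects the fact that the first-infection events on different days are mutually exclusive and jointly cover "infected at some point in the window".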

More concretely, if $D$ comprises symptom data self-reported by individual $i$, $D = S_i$, we can then write the risk of infection over the past $T$ days for individual $i$ at time $t$ given symptom data $S_i$:

$$P(I_T \mid S_i) = c(S_i)\,P(I_T) \qquad (10)$$

with the coefficient $c(S_i)$ defined as:

$$c(S_i) = \frac{P(S_i \mid I_T)}{P(S_i)} \qquad (11)$$

This coefficient can be thought of as the ratio of the probability of the observed symptoms, given that the individual was recently infected, to the prevalence of those symptoms among the general population, which in general should be greater than 1. For instance, if the observed symptoms are cough and fever, then $c(S_i)$ is a measure of how much more prevalent those symptoms are among people infected with COVID-19, and could be estimated from currently available data. In fact, a recent study [Menni2020] looked at the predictive power of symptoms self-reported in the US and UK with the COVID Symptom Study mobile app [COVIDSymptomStudy]. This study presents a logistic regression predictor of infection given a number of symptom predictors:

$$P(I \mid S_i) = \frac{1}{1 + e^{-x}} \qquad (12)$$

where $x$ is the linear combination of predictors fitted in that study. We make the assumption that, for a sufficiently long period of time $T$, the conditional probability $P(S_i \mid I_T)$ is simply $P(S_i \mid I)$, where $I$ represents infection at some moment in the past. With this assumption in place, we can connect the COVID Symptom Study prediction model with our probabilistic formalism by means of:

$$c(S_i) = \frac{P(S_i \mid I)}{P(S_i)} = \frac{P(I \mid S_i)}{P(I)} \qquad (13)$$

where $P(I)$ would represent the overall prevalence of COVID-19 infection. This allows us to write the following final formula for the risk score:

$$P(I_T \mid S_i) = \frac{P(I \mid S_i)}{P(I)} \sum_{k=1}^{T} P(t-k) \prod_{l=k+1}^{T} \big[1 - P(t-l)\big] \qquad (14)$$

Given knowledge of the individual’s symptoms, demographic and medical covariates, and their recent contact history, it would be then possible to calculate this risk of infection.

3 Maximum likelihood in POMPs

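Following [Deardon2010], given the $S_t$, $E_t$, $I_t$, and $R_t$ counts of susceptible, exposed, infected, and recovered individuals, respectively, at time $t$, we can write the likelihood function as a function of the parameter vector $\theta$ in terms of the individual probabilities of infection as follows:

$$L(\theta) = \prod_{t=1}^{t_{\max}-1} f_t(S, E, I, R \mid \theta) \qquad (15)$$

where $S = \{S_t\}$, $E = \{E_t\}$, $I = \{I_t\}$, $R = \{R_t\}$, and $f_t$ is the joint probability of all new infections occurring in time interval [t, t+1):

$$f_t(S, E, I, R \mid \theta) = \prod_{i \in S(t) \setminus S(t+1)} P(i,t) \prod_{i \in S(t+1)} \big[1 - P(i,t)\big] \qquad (16)$$

with $S(t)$ the set of susceptible individuals at time $t$. MLE via Metropolis-Hastings MCMC requires recalculating $L(\theta)$ a very large number of times, with varying $\theta$, in order to maximize the likelihood. In each recalculation of the likelihood, the products over all the individuals have to be evaluated, which can become prohibitive even for relatively small populations.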

An alternative MLE approach integrates compartmental models with Partially Observed Markov Process (POMP) models [king2016]. Compartmental models simplify the mathematical modeling of infectious disease; however, they assume access to fully observed disease data. In reality, not all COVID-19 cases are reported, and there are several reports of infectious asymptomatic/pre-symptomatic carriers [Aguirre-Duarte2020], with some studies [Nishiura2020] suggesting that at least 30% of cases are asymptomatic. POMP models allow us to address such limitations by combining the simplicity of compartmental models with a probabilistic framework for the unobserved data.

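POMP models represent data $y^*_1, \dots, y^*_N$ collected at times $t_1 < \dots < t_N$ as noisy, incomplete observations of an unobserved Markov process $\{X_n\}$. Disease transmission, represented by compartmental models, is a Markov process because the number of infectious people at time $t$ is solely determined by the number of infectious people at time $t-1$. A POMP model is characterized by the transition density and measurement density of its stochastic processes. The one-step transition density is represented by $f_{X_n|X_{n-1}}(x_n \mid x_{n-1}; \theta)$, since $X_n$ is Markovian and only relies on the previous state. Meanwhile, the measurement density depends only on the state of the Markov process at that time and so is represented by $f_{Y_n|X_n}(y_n \mid x_n; \theta)$, where $Y_n$ is a random variable modeling the observation at time $t_n$. Hence, the entire joint density for a POMP model, including the initial density $f_{X_0}(x_0; \theta)$, is:

$$f_{X_{0:N}, Y_{1:N}}(x_{0:N}, y_{1:N}; \theta) = f_{X_0}(x_0; \theta) \prod_{n=1}^{N} f_{X_n|X_{n-1}}(x_n \mid x_{n-1}; \theta)\, f_{Y_n|X_n}(y_n \mid x_n; \theta)$$

and the marginal density for the sequence of measurements, $Y_{1:N}$, evaluated at the data, $y^*_{1:N}$, is

$$f_{Y_{1:N}}(y^*_{1:N}; \theta) = \int f_{X_{0:N}, Y_{1:N}}(x_{0:N}, y^*_{1:N}; \theta)\, dx_{0:N}$$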

Here the state variable $X_n$ is the vector of compartment counts $(S_t, E_t, I_t, R_t)$ described before. Our novel approach here will be to relate the population-level parameters in an SEIR model for COVID-19 [Wang2020] with average estimates calculated over a suitable sample of individuals, which will be expressed in terms of the individual-level probabilities defined by equation (9) and, ultimately, as a function of the individual-level parameters. In this way, our method can be seen as a form of hierarchical maximum likelihood where estimation of the individual-level parameters is performed simultaneously with that of the population-level parameters [Rouder2005], which has the advantage of reducing variability in the recovered parameters [Farrell2008].

3.1 SEIR model setup

We constructed the two components of a POMP model: the unobserved process model and the measurement model. The process model, defined as a SEIR model, provides the change in true incidence of COVID at every time point, while the measurement model incorporates the fact that not all cases are observed or reported.

The underlying dynamics of COVID can be captured by a stochastic SEIR model. Most of the assumptions of a basic SEIR model still hold in a stochastic version. However, we add parameters that induce random fluctuations into the population and change the compartments' rates of transfer in response to interventions. We do this by using probabilistic densities for the transition of state variables. Moreover, although disease dynamics are technically a continuous-time Markov process, this is computationally complex and inefficient to model, and so we make discretized approximations by updating the state variables after a time step, $\delta$. The system of discretized equations is shown below, where $\Delta N_{SE}$ is the number of susceptible individuals who become exposed to COVID, $\Delta N_{EI}$ is the number of newly infectious cases, and $\Delta N_{IR}$ is the number of cases that are removed from the population, all during the time step $\delta$:

$$\begin{aligned} S_{t+\delta} &= S_t - \Delta N_{SE} \\ E_{t+\delta} &= E_t + \Delta N_{SE} - \Delta N_{EI} \\ I_{t+\delta} &= I_t + \Delta N_{EI} - \Delta N_{IR} \\ R_{t+\delta} &= R_t + \Delta N_{IR} \end{aligned} \qquad (17)$$

This system describes how the sizes of the four compartments (susceptible, exposed, infectious, and removed) change between $t$ and $t + \delta$. The model further assumes that the population size $N$ remains constant at every time point. We added inherent randomness to our model by setting $\Delta N_{SE}$, $\Delta N_{EI}$, and $\Delta N_{IR}$ as binomials. If we assume that the length of time an individual spends in a compartment is exponentially distributed with some compartment-specific rate $r$, then the probability of remaining in that compartment for an additional day is $e^{-r}$ and the probability of leaving that compartment is $1 - e^{-r}$:

$$\begin{aligned} \Delta N_{SE} &\sim \mathrm{Binomial}\big(S_t,\ 1 - e^{-\lambda \delta}\big) \\ \Delta N_{EI} &\sim \mathrm{Binomial}\big(E_t,\ 1 - e^{-\sigma \delta}\big) \\ \Delta N_{IR} &\sim \mathrm{Binomial}\big(I_t,\ 1 - e^{-\gamma \delta}\big) \end{aligned} \qquad (18)$$

The force of infection, $\lambda$, is the transition rate between the susceptible and exposed classes and can be written as $\lambda = \beta I_t / N$, where $\beta$ represents the transmission rate of the disease. Furthermore, $\sigma$ is the transition rate between the exposed and infectious classes, and $\gamma$ is the transition rate between the infectious and removed compartments. $1/\sigma$ represents the mean length of time a person stays in the latent stage and $1/\gamma$ represents the mean length of time a person is infectious before being removed from the population (either because of intervention efforts or natural recovery). We assume these two parameters to be constant over the course of the epidemic.

The transmission rate $\beta$ can be estimated from sample averages calculated over individuals. If we recall the ILM formalism from the previous section, we can write the probability of infection of susceptible $i$ by infected contact $j$ as follows:

$$p_{ij} = 1 - \exp\big[-\Omega_S(i)\,\Omega_T(j)\big] \qquad (19)$$

Therefore, the transmission rate for infected individual $j$ is the sum of these probabilities over all of their contacts:

$$\beta_j = \sum_{i \in C(j,t)} p_{ij} \qquad (20)$$

If we are considering $n$ infected individuals from a random sample $Q$, we can then estimate the transmission rate as:

$$\beta \approx \frac{1}{n} \sum_{j \in Q} \beta_j \qquad (21)$$

Given a fixed sample $Q$, we can consider $\beta = \beta(\alpha_0, \alpha_1, \phi_0, \phi_1)$, that is, a function solely of the individual-level susceptibility and transmissibility coefficients.
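The sample-average estimate of the transmission rate can be sketched in a few lines of Python. This is only one reading of the construction described above: it assumes that the per-individual rate is attached to each infectious individual and sums pairwise infection probabilities over that individual's susceptible contacts, with the kernel equal to 1 for every listed contact. All function names, the data layout, and the numbers are hypothetical.

```python
import math

def pairwise_probability(omega_s, omega_t):
    """Probability that one susceptible contact is infected by one infectious individual."""
    return 1.0 - math.exp(-omega_s * omega_t)

def individual_rate(omega_t, contact_omega_s):
    """Sum of pairwise infection probabilities over the contacts of one infectious individual."""
    return sum(pairwise_probability(w, omega_t) for w in contact_omega_s)

def sample_transmission_rate(sample):
    """Average the per-individual rates over a sample.

    sample: list of (omega_t, [omega_s of each susceptible contact]) pairs,
    one pair per sampled infectious individual.
    """
    if not sample:
        raise ValueError("sample must contain at least one individual")
    return sum(individual_rate(t, c) for t, c in sample) / len(sample)

# Illustrative (made-up) sample of three infectious individuals.
sample = [
    (1.5, [0.2, 0.6]),   # symptomatic, two contacts
    (0.5, [0.2]),        # asymptomatic, one contact
    (0.5, []),           # asymptomatic, no contacts
]
beta_hat = sample_transmission_rate(sample)
```

Because the susceptibility and transmissibility values are themselves functions of the individual-level coefficients, `beta_hat` changes whenever those coefficients change, which is what lets the population-level fit inform the individual-level parameters.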

Although it is impossible to directly record the number of people who are susceptible, exposed, infectious, and removed, the publicly available data tell us the number of observed cases per day. The mean number of observed cases per day is the true number of cases multiplied by the reporting rate ($\rho$). This can be modeled with a binomial distribution with parameters $\Delta N_{EI}$ and $\rho$:

$$Y_n \sim \mathrm{Binomial}\big(\Delta N_{EI},\ \rho\big) \qquad (22)$$

The process and measurement models define our final POMP model. For each time point, the process model generates the number of new cases based on binomially distributed counts. The measurement model then estimates the observed number of cases based on the true number of cases and the reporting rate.
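The process and measurement models can be illustrated with a short simulation. The sketch below is a minimal stand-alone Python version, not the implementation used in this study: it assumes binomial transitions with exponential waiting times, a force of infection proportional to the infectious fraction, and binomial under-reporting of newly infectious cases. Function names, parameter names, and parameter values are hypothetical, and the binomial sampler is a simple sum of Bernoulli trials to avoid external dependencies.

```python
import math
import random

def binomial_draw(n, p, rng):
    """Number of successes in n independent trials with success probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)

def seir_step(state, beta, sigma, gamma, dt, rng):
    """Advance the (S, E, I, R) counts by one time step of length dt."""
    s, e, i, r = state
    n = s + e + i + r
    force_of_infection = beta * i / n
    new_exposed = binomial_draw(s, 1.0 - math.exp(-force_of_infection * dt), rng)
    new_infectious = binomial_draw(e, 1.0 - math.exp(-sigma * dt), rng)
    new_removed = binomial_draw(i, 1.0 - math.exp(-gamma * dt), rng)
    next_state = (s - new_exposed,
                  e + new_exposed - new_infectious,
                  i + new_infectious - new_removed,
                  r + new_removed)
    return next_state, new_infectious

def simulate(initial_state, beta, sigma, gamma, rho, steps, dt=1.0, seed=0):
    """Simulate the hidden process and the reported case counts."""
    rng = random.Random(seed)
    state = initial_state
    states, true_cases, observed_cases = [state], [], []
    for _ in range(steps):
        state, new_infectious = seir_step(state, beta, sigma, gamma, dt, rng)
        states.append(state)
        true_cases.append(new_infectious)
        # Measurement model: each true case is reported with probability rho.
        observed_cases.append(binomial_draw(new_infectious, rho, rng))
    return states, true_cases, observed_cases

# Illustrative (made-up) parameter values.
states, true_cases, observed_cases = simulate(
    (1990, 0, 10, 0), beta=0.6, sigma=0.2, gamma=0.1, rho=0.5, steps=60)
```

Each draw is taken from the compartment size at the start of the step, so no compartment can become negative and the total population stays constant, in line with the constant-population assumption.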

3.2 Iterated filtering method

The likelihood function for the POMP models is the density function evaluated with data at a candidate set of parameter values. It is computationally simpler to work with the log likelihood, $\ell(\theta) = \log L(\theta)$, so that we can deal with sums instead of products. We used a simulation-based approach to avoid solving the density function analytically, in which we simulated the latent process $X_{0:N}$, which implicitly defines the density function. Likelihood evaluation via Sequential Monte Carlo (SMC) techniques is one standard method to obtain the log likelihood for POMP models, because it simulates sample paths rather than requiring explicit transition probabilities. Exploiting the Markov property of the process, it is possible to use these paths to sample the parameter space much more efficiently than with regular MCMC, thanks to the iterated filtering method.

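We factorized the likelihood as the product of conditional likelihoods:

$$L(\theta) = f_{Y_{1:N}}(y^*_{1:N}; \theta) = \prod_{n=1}^{N} f_{Y_n|Y_{1:n-1}}(y^*_n \mid y^*_{1:n-1}; \theta) \qquad (23)$$

where $f_{Y_1|Y_{1:0}} = f_{Y_1}$ and there are $N$ time points. The structure of a POMP model then implies the representation of $f_{Y_n|Y_{1:n-1}}$ as

$$f_{Y_n|Y_{1:n-1}}(y^*_n \mid y^*_{1:n-1}; \theta) = \int f_{Y_n|X_n}(y^*_n \mid x_n; \theta)\, f_{X_n|Y_{1:n-1}}(x_n \mid y^*_{1:n-1}; \theta)\, dx_n \qquad (24)$$

so that the final expression for the likelihood is:

$$L(\theta) = \prod_{n=1}^{N} \int f_{Y_n|X_n}(y^*_n \mid x_n; \theta)\, f_{X_n|Y_{1:n-1}}(x_n \mid y^*_{1:n-1}; \theta)\, dx_n \qquad (25)$$

In this last equation, although $f_{Y_n|X_n}(y^*_n \mid x_n; \theta)$ is simple to calculate (using the observation process), $f_{X_n|Y_{1:n-1}}(x_n \mid y^*_{1:n-1}; \theta)$ is more difficult to evaluate. We can use the Markov property to determine an expression for this probability, known as the prediction formula:

$$f_{X_n|Y_{1:n-1}}(x_n \mid y^*_{1:n-1}; \theta) = \int f_{X_n|X_{n-1}}(x_n \mid x_{n-1}; \theta)\, f_{X_{n-1}|Y_{1:n-1}}(x_{n-1} \mid y^*_{1:n-1}; \theta)\, dx_{n-1} \qquad (26)$$

We can then use Bayes' Theorem to determine an expression for $f_{X_n|Y_{1:n}}(x_n \mid y^*_{1:n}; \theta)$, known as the filtering formula:

$$f_{X_n|Y_{1:n}}(x_n \mid y^*_{1:n}; \theta) = \frac{f_{Y_n|X_n}(y^*_n \mid x_n; \theta)\, f_{X_n|Y_{1:n-1}}(x_n \mid y^*_{1:n-1}; \theta)}{\int f_{Y_n|X_n}(y^*_n \mid u_n; \theta)\, f_{X_n|Y_{1:n-1}}(u_n \mid y^*_{1:n-1}; \theta)\, du_n} \qquad (27)$$

The prediction and filtering formulas give us a recursion. Specifically, the prediction formula calculates the prediction distribution at time $t_n$, $f_{X_n|Y_{1:n-1}}$, by using the filtering distribution at time $t_{n-1}$, $f_{X_{n-1}|Y_{1:n-1}}$. Meanwhile, the filtering formula gives us the filtering distribution at time $t_n$ using the prediction distribution at time $t_n$.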

In SMC, we use Monte Carlo techniques to sequentially estimate the integrals in the prediction and filtering recursions, which in turn allows us to estimate $L(\theta)$. This is done by generating a swarm of particles that are propagated forward based on the process model and then filtered and altered to fit the next data point more closely. Because of this, SMC is commonly known as the particle filter.
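A basic (bootstrap) particle filter for estimating the log likelihood can be sketched as follows. This is a simplified, self-contained Python illustration on a toy SIR process with binomial under-reporting, not the software used in this study. The model, function names, and parameter values are hypothetical. Each time step alternates a prediction move (propagating particles through the process model) with a filtering move (weighting by the measurement density and resampling).

```python
import math
import random

def log_binomial_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of k successes, for 0 < p < 1."""
    if k < 0 or k > n:
        return -math.inf
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)

def binomial_draw(n, p, rng):
    return sum(1 for _ in range(n) if rng.random() < p)

def make_step(beta, gamma, population):
    """Toy SIR process model; the state is (susceptible, infectious, new_cases)."""
    def step(state, rng):
        s, i, _ = state
        new_cases = binomial_draw(s, 1.0 - math.exp(-beta * i / population), rng)
        removed = binomial_draw(i, 1.0 - math.exp(-gamma), rng)
        return (s - new_cases, i + new_cases - removed, new_cases)
    return step

def make_log_measure(rho):
    """Measurement model: observed cases ~ Binomial(new_cases, rho)."""
    def log_measure(y, state):
        return log_binomial_pmf(y, state[2], rho)
    return log_measure

def particle_filter_loglik(data, init_state, step, log_measure, n_particles, rng):
    particles = [init_state] * n_particles
    loglik = 0.0
    for y in data:
        # Prediction: propagate every particle through the process model.
        particles = [step(x, rng) for x in particles]
        # Weight each particle by the measurement density of the observation.
        log_w = [log_measure(y, x) for x in particles]
        max_lw = max(log_w)
        if max_lw == -math.inf:
            return -math.inf  # no particle is compatible with this observation
        weights = [math.exp(lw - max_lw) for lw in log_w]
        # Conditional log likelihood estimate: log of the mean weight.
        loglik += max_lw + math.log(sum(weights) / n_particles)
        # Filtering: resample particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return loglik

# Simulate a short synthetic data set, then evaluate two candidate values of beta.
rng = random.Random(1)
population, init_state = 200, (195, 5, 0)
true_step = make_step(beta=0.8, gamma=0.2, population=population)
log_measure = make_log_measure(rho=0.5)
state, data = init_state, []
for _ in range(15):
    state = true_step(state, rng)
    data.append(binomial_draw(state[2], 0.5, rng))

ll_true = particle_filter_loglik(data, init_state, true_step, log_measure, 200, rng)
ll_other = particle_filter_loglik(
    data, init_state, make_step(0.05, 0.2, population), log_measure, 200, rng)
```

One would generally expect the estimate at the generating value of the transmission rate (`ll_true`) to exceed the estimate at a distant value (`ll_other`), although any single Monte Carlo estimate is noisy and can degenerate to minus infinity when no particle is compatible with an observation.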

Iterated filtering [Ionides2006] allows us to obtain MLEs of parameters in partially observed dynamical systems, such as POMPs, more efficiently. It works by defining a set of initial values for the parameter vector and a fixed number of iterations, $M$. For every iteration, we apply a basic particle filter (the prediction and filtering recursions above) to the model and add stochastic perturbations to the parameters so that they take a random walk through time. At the end of the time series, we use the final value of the parameters as the starting point for the next iteration, while reducing ("cooling") the random walk variance. After completing the $M$ iterations, we obtain the Monte Carlo maximum likelihood estimate, $\hat{\theta}$, and its corresponding log likelihood. In contrast, Monte Carlo likelihood by direct simulation scales poorly with dimension. It requires a Monte Carlo effort that scales exponentially with the length of the time series, and so is infeasible on anything but a short data set.

4 Simulation study

4.1 Parameter estimation

Millions of individuals worldwide have self-reported symptoms associated with COVID-19 infection through numerous websites and apps specifically designed for that purpose [Menni2020]. Meanwhile, anonymized mobility data generated by cellphones has been aggregated from several sources and made available for research [covid19mobility]. However, this currently available data is not enough to evaluate the ILMs described above. These models can potentially provide infection risk predictions by aggregating several sources of health and epidemiological data from individuals, including symptoms, demographics, and contact information. Approaches incorporating this kind of data, collected through contact tracing and symptom reporting apps, have been proposed recently by several groups [sphinxteam, Alsdurf2020], and have led to consideration of the privacy risks presented by this data and of possible ways to mitigate those risks. Here, we are focusing primarily on the parameter estimation problem, assuming that it is possible to acquire the data securely, but will make some observations regarding privacy in the conclusions.

Since the detailed data needed to calculate our ILMs is not currently available, we have started by conducting a purely computational study where the individual-level data is generated by means of agent-based models (ABMs). These models allow us to simulate behaviours of individuals in a large population, and obtain data that mirrors what we could collect with contact tracing and symptom reporting apps in real life. One advantage of using ABMs at this stage is that we can define the ground truth of the ILM by specifying the coefficients in the susceptibility and transmissibility functions, which allows us to evaluate the accuracy of our parameter estimation methods.

For the purpose of running the ABM simulations, we used the COMOKIT COVID-19 SEIR model [Drogoul2020] implemented in the GAMA software [Taillandier2019], a general ABM simulator allowing for a wide range of options through a custom modeling language and supporting GIS layers to represent specific geographical areas in detail. In particular, we ran simulations using a scenario of a COVID-19 outbreak in Vietnam without any containment strategies, with a population of nearly 10,000 individuals. We adapted the SEIR model provided in COMOKIT to incorporate the individual probabilities of infection as defined by Formula (9), with a number of different selections of parameters. We arbitrarily defined three sets of ground truth parameters (sets 1, 2, and 3). Given the symmetric form of (9), we expanded the product of the transmissibility and susceptibility functions to arrive at the following reparametrization:

(28)

The relationship between the original and new parameters is given by:

(29)

from which we can express the ratios and as a function of the new parameters , , , and :

(30)

We assume the mean latent and infectious times to be known, and we can estimate them from the GAMA data. In the case of the COMOKIT simulations, we have  and , both in units of 1/day. Therefore, the only parameters in the SEIR model at the population level are , , , and , which we estimate by applying iterated filtering with POMP. An MLE run corresponding to the underlying parameters is shown in figure 1.

Figure 1: Result of MLE with POMP on synthetic data generated with GAMA: (A) Curve of new infected cases over time. (B) Scatterplot matrix showing all the initial points (gray) and final (red) points in parameter space from the MLE runs. (C) GAMA curve (blue) and 9 simulated curves (red) generated with the MLE parameters. (D) Range covered between the 5 and 95 percentiles from 100 simulated curves using the MLE parameters.

An issue we encountered with the first round of MLE runs is that, as the ABM simulation progresses, the compartment of susceptible individuals gets depleted as more become infected, and so the estimator becomes increasingly biased. In order to account for this problem, we fit the GAMA data only for the initial stages of the simulated epidemic, when the number of new infectious cases is still increasing due to the large percentage of susceptible individuals. The range of the data with enough susceptibles is shown in figure 2.

Figure 2: Range of synthetic epidemic data used to fit the parameters. (A) GAMA curve (blue) and 9 simulated curves (red) generated with the resulting MLE parameters. (B) Range covered between the 5 and 95 percentiles from 100 simulated curves using the MLE parameters.

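The ground truth values of the four reparametrized coefficients, together with the mean and standard deviation of the estimates over the 10 highest likelihoods, are listed in table 1. Several of the true values fall within one standard deviation of the mean MLE, as seen in figure 3; the clearest exception is the first parameter in set 3, whose true value (1.0) lies about five standard deviations from the mean MLE (0.381 ± 0.120).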

Set 1
True value MLE mean MLE sdev
0.04 0.063 0.036
0.40 0.839 0.411
0.40 0.603 0.346
4.00 5.452 1.825
Set 2
True value MLE mean MLE sdev
0.25 0.629 0.216
0.40 0.347 0.384
0.75 1.690 0.538
1.20 1.438 0.667
Set 3
True value MLE mean MLE sdev
1.0 0.381 0.120
1.0 1.409 0.322
2.0 2.088 0.901
2.0 2.113 1.218
Table 1: Maximum likelihood estimates for all the parameter sets
Figure 3: Plots representing the standard deviation of the top 10 MLEs for each parameter as an error bar. The red circle indicates the true value of the parameter. All values are scaled to the mean MLE for each parameter.

In order to recover the original individual-level parameters that are needed in the susceptibility and transmissibility functions, we can use the ratios in equation 30. An initial calculation that simply took the top estimates of the reparametrized coefficients to calculate the mean and standard deviation of each ratio gave values with high errors when compared with the true ratios, which seems to be caused by fluctuations in the individual estimates. A better approximation to the ratios is given by this heuristic formula:

(31)

where the mean of each parameter is taken over the 10 top MLEs. This formula smooths out the fluctuations in the individual parameters, and gives a better result, shown in table 2.

Set 1
True ratio MLE approx.
0.10 0.13
0.10 0.09
Set 2
True ratio MLE approx.
0.33 0.31
0.62 1.50
Set 3
True ratio MLE approx.
0.5 0.42
1.0 0.63
Table 2: Approximation to the original parameter ratios using MLEs
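The relationship between Tables 1 and 2 can be checked numerically. The sketch below is an inferred reading, not a transcription of equation (31): it assumes that each ratio can be estimated in two ways from the four mean MLEs of a set (taken in the row order of Table 1) and that the heuristic averages the two estimates. Under that assumption the values in Table 2 are reproduced to within rounding. The positional names `m1` to `m4` and the function name are hypothetical.

```python
# Mean MLEs from Table 1, in the row order given there (rows 1 to 4 of each set).
table1_means = {
    "Set 1": (0.063, 0.839, 0.603, 5.452),
    "Set 2": (0.629, 0.347, 1.690, 1.438),
    "Set 3": (0.381, 1.409, 2.088, 2.113),
}

# Approximate ratios reported in Table 2 (first row, second row of each set).
table2_reported = {
    "Set 1": (0.13, 0.09),
    "Set 2": (0.31, 1.50),
    "Set 3": (0.42, 0.63),
}

def averaged_ratios(m1, m2, m3, m4):
    """Average the two available estimates of each ratio (assumed heuristic)."""
    first = 0.5 * (m1 / m3 + m2 / m4)
    second = 0.5 * (m1 / m2 + m3 / m4)
    return first, second

computed = {name: averaged_ratios(*means) for name, means in table1_means.items()}

# Largest absolute gap between the computed and the reported ratios.
max_gap = max(
    abs(c - r)
    for name in table1_means
    for c, r in zip(computed[name], table2_reported[name])
)
```

Applied to the ground truth values in Table 1, the same two ratios give exactly the "True ratio" column of Table 2 (for example 0.04/0.40 and 0.40/4.00 both equal 0.10 in set 1), which supports this reading of the row order.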

4.2 Risk calculation

Once the parameters of the model have been determined through MLE using POMP, in particular the individual-level coefficients, we can compute individual risks of infection using equation (14). We use the ABM in GAMA with the three parameter sets and a random assignment of symptoms for susceptible and infected individuals using the symptom prevalence for the US listed in [Menni2020]. Instead of using the ground truth individual-level parameters, we generated random perturbations of two of them, and then obtained the other two with the ratio estimates in (31).

As a first sanity check, we calculated the mean risk for susceptible and infected individuals over entire simulation runs with each parameter set, and we obtained the results shown in table 3. The difference between the risks for susceptible and infected individuals is very small for parameter set 3, because this set yields higher probabilities of infection across all individuals, resulting in more uniform risk score values. We then made predictions of infected status for all agents in each simulation based on their risk, using the mean infected risk minus the standard deviation, as listed in table 3, as the threshold to predict infection.

        Susceptible    Infected
Set 1   0.24 ± 0.25    0.52 ± 0.34
Set 2   0.28 ± 0.24    0.45 ± 0.32
Set 3   0.55 ± 0.31    0.63 ± 0.31
Table 3: Mean and standard deviation (mean ± SD) of individual risks for susceptible and infected individuals
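
The thresholding rule described above can be sketched in a few lines of Python. The means and standard deviations are the infected-individual values from table 3; the function names are hypothetical, and the use of a strict "greater than" comparison is an assumption, since the text does not say how ties are handled.

```python
# Mean and standard deviation of the risk of infected individuals (table 3)
INFECTED_RISK = {
    "Set 1": (0.52, 0.34),
    "Set 2": (0.45, 0.32),
    "Set 3": (0.63, 0.31),
}

def infection_threshold(mean_risk, sd_risk):
    """Threshold = mean infected risk minus one standard deviation."""
    return mean_risk - sd_risk

def predict_infected(risks, threshold):
    """Flag an agent as infected when its risk exceeds the threshold
    (strict inequality is an assumption of this sketch)."""
    return [r > threshold for r in risks]

thresholds = {
    name: round(infection_threshold(m, s), 2)
    for name, (m, s) in INFECTED_RISK.items()
}
print(thresholds)

# Hypothetical risk scores for three agents, classified with the set 1 threshold
example = predict_infected([0.10, 0.25, 0.60], thresholds["Set 1"])
print(example)
```

Under this rule the thresholds come out at 0.18, 0.13, and 0.32, each below the mean risk of susceptible individuals in the same set (0.24, 0.28, and 0.55), so a sizeable share of susceptible agents would be expected to be flagged as well.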

The area under the receiver operating characteristic curve (AUC), the 95% confidence interval (CI) for the AUC values, sensitivity, specificity, and overall accuracy of the infection prediction for each parameter set are listed in table 4. The corresponding receiver operating characteristic (ROC) curves are plotted in figure 4. We can see how the predictions fare for each parameter set: set 1 has the highest AUC and the best balance of sensitivity and specificity. Even though sets 2 and 3 exhibit higher sensitivity, their specificity is fairly low, meaning that for those parameter sets the risk predictor produces many false positives. In terms of using the risk to decide when to quarantine agents, a high false positive rate (agents that are not infected get needlessly quarantined) is arguably better than a high false negative rate (infected agents are not quarantined and go on to transmit the pathogen).

        AUC (95% CI)        Accuracy   Sensitivity   Specificity
Set 1   0.74 (0.73, 0.74)   0.65       0.78          0.53
Set 2   0.65 (0.64, 0.66)   0.62       0.83          0.29
Set 3   0.57 (0.56, 0.59)   0.76       0.81          0.28
Table 4: Performance measures for the infection predictor based on equation (14), calculated on the three parameter sets considered in the paper
Figure 4: Receiver Operating Characteristic (ROC) Curve for the infection predictor based on equation (14), generated for the 3 parameter sets considered in the paper
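
For readers unfamiliar with these measures, the following Python sketch shows how sensitivity, specificity, accuracy, and AUC can be computed from true infection status and risk scores. It is a minimal illustration on made-up data, not the routine used in the paper's R scripts; the AUC is computed here with the rank-based (Mann-Whitney) definition, and all names are hypothetical.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return {
        "sensitivity": tp / (tp + fn),   # infected agents correctly flagged
        "specificity": tn / (tn + fp),   # non-infected agents correctly cleared
        "accuracy": (tp + tn) / len(y_true),
    }

def auc(y_true, scores):
    """Probability that a random infected agent has a higher risk score
    than a random non-infected one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up example: 3 infected (1) and 3 non-infected (0) agents
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.2, 0.7, 0.3, 0.1]
y_pred = [s > 0.25 for s in scores]      # hypothetical threshold

metrics = confusion_metrics(y_true, y_pred)
area = auc(y_true, scores)
print(metrics, round(area, 3))
```

A low threshold, as in this toy example, trades specificity for sensitivity, which is the same pattern reported for sets 2 and 3 in table 4.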

Finally, we ran ABM simulations in which the risk values were used to quarantine, for 14 days, those individuals with a risk higher than a given threshold. We simulated two scenarios. In the first, there was a delay of 4 days between the risk calculation and its use to determine quarantine, in order to model the fact that the infectious status of contacts is not determined instantaneously but with a lag caused by the symptom onset time (and also by the delay in obtaining test results). In the second, the risk was updated instantly with the information of the infected contacts; this represents an unrealistic situation where infectious status is known upon interaction, but it gives an upper bound for the performance of the intervention. The results from these simulations are shown in figure 5. The epidemic curves of new cases for the three scenarios (no quarantine, delayed quarantine, and instant quarantine) suggest that risk-based quarantine could help lower the peak of new cases and spread them over a longer period of time (i.e., "flattening the curve"). The effect is most pronounced when infection status is available instantaneously, which is impossible in practice but provides a "maximum curve flattening".

Figure 5: Plot of new cases for the three parameter sets (A: set 1, B: set 2, C: set 3), under three intervention scenarios: no quarantine (blue curve), risk-based quarantine with a 4-day delay (orange curve), and risk-based quarantine with instant availability of the infectious status of contacts.
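
The quarantine rule in the two scenarios can be sketched for a single agent as follows. This is a simplified Python illustration under stated assumptions, not the GAMA implementation: the function name, the per-day risk list, and the choice to re-evaluate the risk only after a quarantine period ends are all hypothetical.

```python
def quarantine_days(daily_risk, threshold, delay=4, duration=14):
    """Return a per-day list of booleans: True when the agent is quarantined.

    daily_risk[t] is the risk computed on day t; it only becomes
    actionable on day t + delay (delay=0 models the instant scenario).
    """
    n_days = len(daily_risk)
    quarantined = [False] * n_days
    release_day = 0  # first day on which the agent is free again
    for day in range(n_days):
        if day < release_day:
            quarantined[day] = True
            continue
        known = day - delay  # most recent risk value available on this day
        if known >= 0 and daily_risk[known] > threshold:
            release_day = day + duration
            quarantined[day] = True
    return quarantined

# Hypothetical 30-day risk trajectory with a single high-risk day (day 2)
risk = [0.1, 0.1, 0.9] + [0.1] * 27

instant = quarantine_days(risk, threshold=0.5, delay=0)
delayed = quarantine_days(risk, threshold=0.5, delay=4)
print(instant.index(True), delayed.index(True))
```

With the delay, the same agent enters quarantine four days later and therefore has four additional days in which it can transmit, which is the mechanism behind the gap between the delayed and instant curves.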

5 Conclusions

We constructed a statistical inference framework that allows individual-level epidemic parameters to be obtained by applying MLE to population-level case data. We tested this framework using an agent-based model to generate epidemic data resolved at the individual level. As part of this framework, we defined an individual-level epidemic risk model that depends on data such as demographics, medical condition, self-reported symptoms, and contact tracing information. These models could be trained on aggregate data provided by consenting users of a mobile app, for example, and then evaluated locally by the rest of the users. These models could also incorporate additional data, such as spatial random effects. The initial simulation experiments are promising and suggest that it is possible to: (1) obtain good estimates for the individual-level parameters by applying MLE to the population-level data, (2) predict who is infected and who is not using the individual-level risks, and (3) carry out interventions based on the individual-level risks, such as quarantine, that could help lower the peak of the epidemic, i.e., "flattening the curve". However, this work is in its initial stages and has several limitations. First, the individual-level models we have considered so far are very simple, including only two somewhat artificial covariates (immune and symptomatic levels). Second, the data used to train the models was obtained entirely from simulated experiments, so the approach still needs to be extended to, and validated on, real data. Third, our risk calculation requires knowledge of confirmed cases in order to determine exposure events, which might not be readily available or accessible.
Other approaches [sphinxteam, Alsdurf2020] are based on estimating the probabilities of all possible states an individual can be in (susceptible, infected, recovered) from the available information (symptoms, tests, contacts, etc.) and then sharing this information across individuals through a mobile app in order to update the probabilities as new information is obtained. Our approach has the advantage of being simpler, but it could also incorporate some of these ideas to lift the requirement that exact infectious states be known before the risk score is calculated. We envision the computational framework presented in this work as the basis for a system that could be used to estimate risks of infection for diseases other than COVID-19.

6 Acknowledgments

The authors would like to thank Aaron King for discussions about the model and revision of the POMP code, Hayden Metsky for suggestions regarding MLE, Brandon Westover for comments about method evaluation, and members of the Machine Learning and Optimization Lab at EPFL for support and feedback.

7 Availability of code and data

All the GAMA parameters and R scripts are available under the MIT license at this repository: https://github.com/broadinstitute/ILM-COVID19-risk, together with the simulated data used in the analysis.