CRISP: A Probabilistic Model for Individual-Level COVID-19 Infection Risk Estimation Based on Contact Data

06/09/2020 ∙ by Ralf Herbrich, et al. ∙ Amazon Zalando 0

We present CRISP (COVID-19 Risk Score Prediction), a probabilistic graphical model for COVID-19 infection spread through a population based on the SEIR model where we assume access to (1) mutual contacts between pairs of individuals across time across various channels (e.g., Bluetooth contact traces), as well as (2) test outcomes at given times for infection, exposure and immunity tests. Our micro-level model keeps track of the infection state for each individual at every point in time, ranging from susceptible, exposed, infectious to recovered. We develop a Monte Carlo EM algorithm to infer contact-channel specific infection transmission probabilities. Our algorithm uses Gibbs sampling to draw samples of the latent infection status of each individual over the entire time period of analysis, given the latent infection status of all contacts and test outcome data. Experimental results with simulated data demonstrate our CRISP model can be parametrized by the reproduction factor R_0 and exhibits population-level infectiousness and recovery time series similar to those of the classical SEIR model. However, due to the individual contact data, this model allows fine grained control and inference for a wide range of COVID-19 mitigation and suppression policy measures. Moreover, the algorithm is able to support efficient testing in a test-trace-isolate approach to contain COVID-19 infection spread. To the best of our knowledge, this is the first model with efficient inference for COVID-19 infection spread based on individual-level contact data; most epidemic models are macro-level models that reason over entire populations. The implementation of CRISP is available in Python and C++ at



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


A probabilistic graphical model for COVID-19 infection spread through a population based on mutual contacts between pairs of individuals across time as well as test outcomes The C++/Python implementation enables full inference at the scale of millions of contacts between thousands of individuals.

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The COVID-19 pandemic has spread rapidly around the world, with the number of infections and deaths steadily growing. Most governments around the world have been completely unprepared to deal with the COVID-19 outbreak, which UN Secretary-General Antonio Guterres has referred to as humanity’s worst crisis since World War II. While governments around the world had plans in place in the event of a pandemic, the peculiarities of COVID-19 (e.g., delayed onset of symptoms, asymptomatic transmission) have challenged these preparations. Governments have reacted by implementing measures such as nationwide lock-downs, that require people to stay inside their homes, enforcing social distancing and therefore breaking the COVID-19 infection chain. However, a blunt mechanism such as a lock-down (over an extended period) can cause severe damage to the economy, and so, there is a need to find alternative measures to slow down or stop the spread without incremental effects in other areas of society. These alternatives have to be built in a solid foundation such as widespread testing and the isolation of infected (or potentially infected) individuals via contact-tracing.

Contact tracing technologies TraceTogether ; AarogyaSetu have shown promise in tracking the spread of the disease across the population. These mobile apps capture social contact information between users such as contact duration, distance, etc. using Bluetooth signals on devices. The fine-grained contact data of individuals collected by the apps can enable:

  • Individual risk score prediction. The contact data, combined with information about COVID-19 positive test cases, can be used to predict the likelihood of infection for each individual. The individual risk scores can be leveraged by governments and organizations to prioritize testing as well as to identify individuals that need to enter isolation/quarantine.

  • Hotspot detection. Tracing technologies can help authorities identify areas with a high density of contacts and/or individuals with high infection risk. This can allow policymakers to make more effective decisions, for example, by imposing highly restrictive measures such as lock-downs, shelter-at-home, or school closures only in COVID-19 hotspots while allowing activities to remain closer to normal in unaffected areas.

  • Insights about infection spread. Contact tracing can provide insights into the relative importance of different modalities of disease transmission (e.g., through intermediate surfaces vs individual contact), risk of infection transmission based on contact characteristics such as duration and distance, most likely locations (e.g., schools, work, malls) for the spread of disease, and "super spreaders" who come in close proximity with a large number of individuals and so must be frequently tested for infection.

To achieve the above-mentioned benefits, we need to devise new models and inference algorithms for analyzing contact tracing data. This is because existing epidemics models Chakrabarti2008 ; Mieghem2009 ; Mieghem2014 ; Ferguson2005 ; Ferguson2006 focus on estimating population-level statistics such as percentage of the population infected, number of days for the epidemic to peak, etc. as opposed to the infection state of each individual in the population. Other models Myers2010 ; Warriyar2020 that use ML-based inference techniques assume complete knowledge of the infection state of each individual at each time instant. However, in the COVID-19 scenario, (1) the infection status of individuals is not known until they are tested, and (2) infectious time of individuals are unknown since individuals may infect others while asymptomatic. Finally, governments are using contact tracing data TraceTogether ; AarogyaSetu to identify and test individuals who have come in direct contact with COVID-19 positive test cases. However, the fact that asymptomatic individuals may have infected a large number of people prior to displaying symptoms and being tested, delays the detection of these newly infected individuals by only using contact tracing apps.

Our main contributions can be summarized as follows:

  • We propose CRISP (COVID-19 RIsk Score Prediction), a probabilistic graphical model for COVID-19 infection spread through diverse contacts channels between individuals. Our model uses latent variables to represent the epidemiological states of individuals based on the SEIR model May1991 at different points in time, and captures both the transitions between states as well as test outcomes.

  • We develop a Monte Carlo EM algorithm to infer infection transmission probabilities across a range of contact channels. Our algorithm uses block-Gibbs sampling to draw samples of the latent infection status of each individual over the entire time period, given data about contacts and test results.

  • We provide implementation details to accelerate both the block-Gibbs sampling and the forward sampling algorithm. A Python and C++ implementation of CRISP is available at

  • We conduct experiments with simulated data which demonstrate that our CRISP model can be parametrized by the reproduction factor and exhibits population-level infectiousness and recovery time series similar to those of the classical SEIR model. However, due to the individual contact data, this model allows fine grained control and inference for a wide range of COVID-19 mitigation and suppression policy measures. Furthermore, we show that a testing-and-quarantining policy based on infection risk scores computed by the CRISP algorithm is able to mitigate COVID-19 infection spread while quarantining fewer individuals compared to other policies based on contact-tracing and symptom-based testing.

To the best of our knowledge, this is the first comprehensive model for COVID-19 infection spread that (1) captures the infection states of individuals and transitions between them using the SEIR model, and (2) leverages contact tracing and test outcome data to infer model parameters such as contact-channel specific infection rates using scalable and computationally efficient inference algorithms.

2 Related Work

We classify related work into four broad categories: epidemic models, Machine Learning (ML) based inference of model parameters, influence maximization in social networks, and contact tracing apps.

2.1 Epidemic Models

In recent years, there has been research on modeling individual dynamics of epidemics Chakrabarti2008 ; Mieghem2009 ; Mieghem2014 . However, this work typically resorts to mean-field theory to model virus spread over a network, and thus does not characterize the dynamic infectious state of each individual over time.

Ferguson et al. Ferguson2005 ; Ferguson2006 use a compartmental transmission model to simulate the spread of influenza across a population, and analyze interventions such as antiviral prophylaxis and social distancing to halt a pandemic. The authors use a stochastic model of individuals co-located in households that are randomly distributed across a geographical region, and infection risk from 3 sources – household, place and random contacts in the community. The infection transmission rates for the 3 sources and recovery rates are based on analysis of historical data. In contrast, we leverage real individual contact tracing data and outcomes of tests on individuals to infer the infection transmission rate for each contact and the likelihood of infection for each individual.

Lorch et al. Lorch2020 propose a spatiotemporal epidemic model that uses marked temporal processes to represent the epidemiological condition of each individual (based on a variation of the SEIR compartment models), individual mobility patterns, test outcomes, and testing and contact tracing strategies. The authors design an efficient sampling algorithm for the model using Monte Carlo roll-outs that is able to predict the spread of COVID-19 under different testing & tracing strategies, social distancing measures, and business restrictions, given contact histories of individuals. They use Bayesian optimization techniques to infer model parameters (e.g. infection transmission rate) that minimize the difference between the real positive COVID-19 cases and those in the Monte-Carlo simulations. In addition, they demonstrate the efficacy of their model using real COVID-19 data and mobility patterns of Tübingen, Germany. Our Monte Carlo EM inference algorithm for model parameters is computationally much more efficient than the Bayesian optimization techniques employed in Lorch2020 .

2.2 Machine Learning-based Inference

In Myers2010 , the authors consider the problem of inferring latent social networks based on network diffusion or disease propagation data. Given the times when nodes become infected, but not who infected them, the authors identify the optimal network that best explains the observed data. The authors present a maximum likelihood approach based on convex optimization with a -like penalty term (that encourages sparsity) to estimate the conditional probability of infection transmission between every node pair. A key difference from our work is that Myers2010 assumes complete knowledge of infected nodes and infection times. In contrast, in the COVID-19 scenario, (1) the infection status of nodes is not known until they are tested, and (2) infection times of nodes are unknown since nodes may not show symptoms even though they are infected (and infecting others).

Warriyar et al. Warriyar2020 introduce a novel R statistical software package EpiILM for simulating infectious disease spread, and carrying out Bayesian MCMC-based statistical inference for spatial and/or (contact) network-based models in the Deardon et al. Deardon2010

individual-level modelling framework. In individual-level models (ILMs), the epidemiological state of each individual (e.g., susceptible or infected) is assumed to be perfectly known at each time instant, which makes it relatively straightforward to estimate model parameters such as infection transmission probabilities (as a function of covariates) using maximum likelihood estimation or Bayesian inference using Metropolis-Hastings MCMC. However, in the COVID-19 scenario, epidemiological states of individuals are hidden until they are tested, and this complicates Bayesian inference in our probabilistic model setting.

2.3 Influence Maximization in Social Networks

The Influence Maximization problem aims to select users in a social network that maximize influence spread, and was first modeled as an algorithmic problem by Kempe et al. Kempe2003 . Yuchen2018 presents a comprehensive survey of different diffusion models that capture the information diffusion process and approximation algorithms to maximize influence. The papers assume that diffusion model parameters such as influence probabilities are given and focus on selecting users to maximize influence spread. In contrast, our paper focuses on the problem of estimating model parameters related to infection transmission probabilities for each contact, given social contact information between users and COVID-19 test results for users.

Mathioudakis2011 addresses the problem of finding the "backbone" of an influence network. It employs network sparsification to preserve only the links that play an important role in the propagation of information. Goyal2010 considers the problem of estimating influence probabilities between users in a social graph. Given a social graph and a log of actions by users, the Maximum Likelihood Estimator (MLE) of influence probability of node on node is simply the fraction of actions performed by that are also performed by . Unlike Goyal2010 , in our setting, the infection status and times of nodes are latent, and need to be inferred by our algorithms.

2.4 Contact Tracing Apps

To combat the spread of COVID-19, governments have launched contact tracing appsTraceTogether ; AarogyaSetu that use Bluetooth signals on mobile phones to track contacts between users. Users who have come in direct contact with COVID-19 positive test cases are considered to be at high risk of infection, and subject to tests and quarantine actions. However, a key problem with this approach is that COVID-19 infected users are typically tested only after they show symptoms, and typically, infected users show symptoms 5-6 days post infection. These asymptomatic users may have infected a large number of users over multiple hops prior to displaying symptoms and being tested. This delays detection of infected users using contact tracing apps, and limits their effectiveness to proactively test and isolate infected users to contain the spread of COVID-19. In contrast, our probabilistic modeling algorithm CRISP predicts the likelihood of a user getting infected with COVID-19 through a chain of social contacts involving asymptomatic users, and identifies infected users early, even though they may be multiple hops from a user who has tested positive for COVID-19 and even before they begin to show symptoms. Our inference algorithm also learns infection transmission probabilities for each contact channel.

3 CRISP Infection Spread Model

Figure 1: Graphical model of the CRISP contact infection spread model for 3 people over 4 time steps where individual meets both individual and at time and one test outcome of individual at time . Note that this model has no cycles as we assume the infection status only depends on variables before time step , . However, due to the "memory" that the state and have, we require edges into the entire past of an infection trace.

Our CRISP model is an SEIR model at the level of every individual (see May1991 for an introduction). Note that we consider discrete time steps implicitly assumed to be at the level of single days. We assume that we are given the following two datasets for a given set of individuals:

  • of quadruples of a pair of two individuals who have met at time with specific features

    . Here we assume that the feature vector

    describes the overall contact between and via the number of mutual contacts over channel (e.g., Bluetooth encounters, queuing together, sharing public transportation). We assume to be symmetric so that .

  • of triplets of individual taking a test at time with the test outcome where indicates a negative test outcome.

We model the discrete time steps of infection status for each individual. Our model has many latent variables that represent the four stages of infection111Note that we are assuming that recovered individuals are immune until time step .:

  • : individual has not been infected and is susceptible,

  • : individual is infected but not contagious,

  • : individual is infected and is contagious,

  • : individual has recovered and is not contagious.

Let us use the notation to denote the set of latent states of individual up to and including time and . In addition to the latent infection status of all individuals at each time step, we also model variables for the test outcomes in , that is, . Then, our graphical model between the variables has the following edges:

  1. and . All edges between the latent infection states of a single individual . The edges will be used to describe the probability of and describe the full time series of being susceptible, exposed, infectious and then recovered.

  2. . All edges between two individuals and who had a contact at time .

  3. . All edges between a test outcome at time and the corresponding infection status of individual . These edges will be used to describe the probabilities of a test outcome given the infection status of at that same time.

The full edge set is the union of these three edge types: . Figure 1 shows an example graphical model with these three edge components.

In order to define the joint probability distribution, note that all edges

and are pointing forward in time. Thus, all are conditionally independent of each other given all the past states . Also, all edges have the property that a test outcome of individual at time only depends on the infection status of at , . The joint probability distribution is given by


where . Since we are using an SEIR model, the only non-zero probabilities are the transitions , , , , , , .


Infection Model

In order to define , we assume that an infection occurs from exogenous influences with a fixed probability or with probability of for every instance of a contact through the contact channel if the contact was already in the state . Thus, the probability that no infection occurred at time equals


Infection Status Model

In order to define and , let us assume we have a point density function and for the probability that the exposure () lasts for time steps (and similarly for the duration of the infectiousness). Examples of functions and

are the probability mass functions of the binomial, negative-binominal or geometric distributions. However, for the case of COVID-19, we will use discrete probabilities established from analysis of the population in

Backer2020 and Woelfel2020 . Moreover, let


be the conditional probability (according to ) that the duration is exactly time steps given that the duration is at least time steps. Then,


Note that the first argument to both and is the number of and states up to and including time in the state sequence .

Test Outcome Model

Finally, we need to define the probability of a test outcome given the infection status of individual at time . Since there are two types of mistakes of a test, we use


We assume both and . It is easy to implement more sophisticated test accuracy models here, in particular to distinguish between different infection states. Also, we can easily model and which are dependent on how many days an individual has been in state ; this change would not affect the block-Gibbs sampling scheme in Section 4 in an adverse way.

Prior Model

In order to complete the description of the full probabilistic model, we have to specify . For simplicity, we assume these probabilities to be a delta-peak at state , that is, for all .

4 Inference in the CRISP Model

For inference in the aforementioned model we are interested in computing the infection risk score of every individual at every time step given the test outcomes available as well as the hyper-parameters which cannot be set by knowledge of the diseases: the parameters represent the probabilities of COVID-19 infection transmission through the contact channel and captures the probability that an infection occurs at any time-step from exogenous influences.

In order to estimate , we will maximize the log-likelihood of the data , that is


where the second decomposition explicitly contains the posterior . However, this posterior is not analytically tractable and therefore we will approximate it by performing block-Gibbs sampling of an infection trace of individual keeping all other infection traces

fixed. This requires a computationally efficient procedure to sample from the conditional probability distribution

which we describe in the following subsection.

4.1 Infection Risk Score Inference

Since we assume that the total number of days of the model, , is not large222As of today, the COVID-19 pandemic is active for 90 days., we will enumerate all possible sequences of infection traces and compute the un-normalized probability of for all terms that depend on elements of the trace in order to re-normalize and draw from this distribution. Also, as our model is an SEIR model, we know that each sample can be uniquely represented by a triple of time steps with being time steps individual is in state , being time steps in state , being time steps in state and the remaining being time steps in state .

There are three groups of factors that (might) involve in the (un-normalized) conditional probability distribution :


The first set of factors, , captures the temporal evolution of the infection state changes of directly and can be reduced to three factors based on and all the contacts that could have infected individual . The second set of factors, , captures the factors where the infectiousness of might impact other individuals . Finally, the third set of factors, , captures the outcome of tests on individual .


In order to derive a compact representation of , we assume that it can be written in terms of


Since the infectious status, of other individuals that had contact with only affect in the susceptible state, we can derive and from (4) by collecting the terms (see (5)):


where . Note that is the density function of the geometric distribution. Similarly, given (4) and (7) we can derive as


A similar derivation shows that which proves that the computational complexity of computing has been reduced to one factor for each contact during the states of and three additional factors corresponding to the compact representation for . The number of factors do not directly scale up with .


In order to derive a compact representation of , we note that only the cases of potentially contain the value of for (see the value range of the function and in (4)). In fact, looking at (5) it becomes evident that it requires . Thus, is defined by


where and are the individuals that met at time who were susceptible and have either stayed susceptible or got exposed, respectively:

4.2 Efficient Block-Gibbs Sampling

In this subsection, we describe how we can speed up the block-Gibbs sampling step for drawing a sample infection trace for individual .

Constant terms

A key observation is that the term in (13)—corresponding to test outcomes for individual —is a constant for each infection trace irrespective of the values of other infection traces . Thus, can be pre-computed at the start of the block-Gibbs sampling algorithm for individual and then reused every time we evaluate the likelihood of an infection trace . Similarly, the terms , and in in (14) are constant for each infection trace irrespective of the infection traces of other individuals. Hence, these terms can also be pre-computed at the start of the block-Gibbs sampling algorithm for individual .

Contacts into

The remaining term in (see (14)) captures the contribution due to contacts into individual from other (infectious) individuals who are in state prior to herself getting infected at . As a result, depends on the values of other infection traces and needs to be recomputed during each block-Gibbs sampling step to draw sample . Let us define


where (see also (17)). Then, we have


Thus, at the start of each block-Gibbs sampling step, we pre-compute (22) and (23) for each time step , and then use (24) to compute for a particular infection trace . Note that this involves only multiplications (or additions in the log-domain).

Contacts out from

We next turn our attention to computing for each infection trace that captures the contribution due to contacts out from . We introduce two states and which are identical to except for the value of infection state which is is and one of in . Now, let


be the inner terms in (21). Note that for all contacts , the terms and differ only in the factor that is in but not in since is infectious at this time in but not in . Also, note that values of for do not affect . We can then obtain for each infection trace value as


where is the product of over all time steps and can be ignored due to normalization of the sampling distribution. Again, the ratio can be pre-computed for all time steps at the start of the block-Gibbs sampling step for , and then used to compute for each infection trace as in (26).

Putting it all together

Note that the quantities , and only depend on the infection status of individual at time because the infection traces of all other individuals are fixed when we are drawing a block-Gibbs sample for . Thus, the (un-normalized) conditional probability for each value is obtained by taking the product of , and , which in turn are computed efficiently as described above from pre-computed values of , , and at the start of the algorithm, and , and for all time steps at the start of the block-Gibbs sampling step.

Additional implementation optimizations

We use two additional ideas to accelerate the implementation of the block-Gibbs sampling algorithm:

  • We never materialize the infection trace because it is uniquely described by the triple ; each value can be computed by no more than three comparisons of with , and . Thus, the whole state of the latent variable model is represented by integers.

  • We carry out all computations of probabilities in the log-domain so all functions become sums and products instead of products and powers.

Block Glibbs Sampling Algorithm

Algorithm 1 is block-Gibbs sampling algorithm for sampling from our CRISP model. It cycles through (random) individuals , sampling the vector of latent variables from the conditional distribution until convergence. We can use the samples drawn by this algorithm to compute the infection risk score for an individual at time by taking the fraction of samples in which the latent infection state .

/* Initialization */
Initialize each /* Precomputations independent of contact data */
1 forall  such that  do
2       Pre-compute , and Construct the sequence with states , states , states and states Pre-compute according to (13)
       Pick a random index /* Precomputations dependent on contact data */
4       forall time steps  do
5             Pre-compute using (22) and using (23) Pre-compute ratio using (25)
6      forall  such that  do
             /* Infection trace specific computations */
7             Construct the sequence with states , states , states and states Compute using (24) Compute using (26) Set
      /* Block-Gibbs sampling step */
8       Sample with probability Set with (,,,) states corresponding to return
until convergence
Algorithm 1 Block-Gibbs sampling algorithm for CRISP model

4.3 Hyperparameter Inference

In order to estimate the hyper-parameters of the CRISP model, would like to find that maximizes the log-likelihood log (12

). However, since this is intractable, we propose to use the Monte Carlo Expectation-Maximization (EM) algorithm

Bishop2006 . We will use EM to refine in successive iterations. Let be the value of computed in the previous iteration. Then, in the E step of the current iteration, we will estimate the expected complete-data log-likelihood


We will use the block-Gibbs sampling procedure described in Algorithm 1 to approximate the posterior distribution over the latent infection status of individuals . If the samples drawn from the posterior are , then in the M step, we will compute that maximizes the expected complete-data log-likelihood


where is the infection state for individual at time in sample . If denotes the number of initial states in the sample infection trace , we note that by virtue of (4) only the first terms depend on which reduces the above maximization term to only

We use stochastic gradient descent to compute the

values that maximize the above expression. We also note that for numerical stability, we re-parameterize via as which allows for an unconstrained optimization over .

4.4 Federated Block-Gibbs Sampling

We can extend the block-Gibbs sampling algorithm in CRISP to a federated learning setting McMahan2017 where local contact and test outcome data for an individual are utilized to compute the block-Gibbs sample on the individual’s mobile device without ever needing to be shared with anyone else. This has two benefits: (1) We distribute the block-Gibbs sampling algorithm across hundreds of millions of mobile devices in the world and thereby utilize their distributed computational power, and (2) Contact and test outcome data for an individual are stored only on the individual’s mobile device and not shared with other mobile devices––this preserves a user’s privacy. In the federated setting, the contact data is never centralized—instead for each individual , her device executes the block-Gibbs sampling step to draw sample only using the locally available contacts and test outcome data for , as well as additional “minimal statistics” sent to by the devices of its past contacts. In the following two paragraphs, we explain how to compute the factors , and in (13) in a federated setting (see Algorithm 2 for the pseudo-code which runs on every mobile device).

Factors and

A key observation is that the terms , , in as well as the factor can all be computed locally on the device with the contact and test outcome information available on the device. In order to compute the remaining term in , we only require information on the individuals who had a contact with at each time step and the infection status of at the time of the contact. Individual ’s mobile device already has the contact information for ; thus all that is required to compute are the current infection traces for all individuals who have had contacts with . Since each infection trace is uniquely characterized by a triple, we require the mobile devices of all individuals who have had a contact with to send ’s device the triple corresponding to .


In order to compute as defined in (21), we require the term for each individual who has had a contact with at time and whose infection state . Let be defined as in (5) over all contacts of