COVID-19 modeling challenges. The coronavirus disease 2019 (abbreviated “COVID-19”), caused by the SARS-CoV-2 virus, was declared a pandemic by the World Health Organization (WHO) on March 11, 2020. COVID-19 differs fundamentally from other epidemics, including SARS and Ebola, and has caused unprecedented, all-round challenges, devastation and crises to health, society, the economy, and many other aspects, beyond the over 4M deaths and 200M confirmed cases reported worldwide. The existing medical, epidemic and modeling research shows some significant characteristics and challenges of COVID-19 [guan2020clinical, li2020substantial, weitz2020modeling, CaoLc21]. They include: (1) unknown transmission processes: it is still unclear how the coronavirus infects people, how the virus transfers from infected people to others, whether recovered patients remain contagious, and how COVID-19 spreads during the so-called 2-14-day incubation period [han2020covid]; (2) easy infection and fast transmission and mutation: COVID-19 seems to spread and mutate much more easily and quickly than other known epidemics; (3) asymptomatic infection and contagion: a high proportion of infections spread between susceptible and infectious people, stranger to stranger, and household to household, during incubation and asymptomatically (corresponding to undocumented cases); (4) significant contagion: a higher transmission (infection) rate than, e.g., SARS [wang2020unique], transmission through various channels and media including airborne routes, and a high survival rate; and (5) unquantified effects of various mitigation policies and actions on containing the virus and disease.
These complexities become even more apparent in the current overwhelming COVID-19 resurgences, which are dominated by virus mutations such as delta and lambda and occur in partially vaccinated populations: younger people with stronger immunity and vaccinated people with partial immunity remain susceptible to, or are broken through by, the more infectious mutants, but likely show mild symptoms or none.
These COVID-19 complexities also bring forward significant modeling opportunities and challenges [CaoLc21]. First, the unknown transmission processes of COVID-19 incur much uncertainty (e.g., the randomness of infection and contagion, particularly during the incubation period and for asymptomatic infectious cases), which is hard to model properly. Second, many observable and hidden factors (e.g., related to asymptomatic contagion and habitual behaviors) and mitigation-related factors (e.g., lockdown, social distancing and people’s cooperation) interact with each other and jointly affect the COVID-19 transmission processes and dynamics. Third, the infection and contagion processes and the transitions between different states such as the susceptible, infectious and recovered seem to be highly complex: random, nonlinear, time-varying, and noisy. Lastly, the documented COVID-19 data with confirmed, death and recovery case numbers (e.g., in JHU CSSE [dong2020interactive]) are macroscopic and subject to significant data uncertainty (i.e., quality issues), including acquisition inconsistencies, noise and errors, under-reporting and missing reports, and randomness in case confirmation and reporting across countries and regions. The publicly available case data does not disclose the full picture and hidden nature of COVID-19 dynamics and may not reflect their reality; e.g., inaccurate statistics and missing reports likely exist for the considerable number of asymptomatic infections. The actual compartments of susceptible, infectious and recovered populations may be hard to obtain, leading to highly unreliable data and poor-quality ground truth for evaluation. As a result, modeling COVID-19 needs to pay special attention to the above uncertainties, in addition to the epidemic attributes.
Modeling the poor-quality and uncertain reported COVID-19 case numbers appears highly challenging and easily results in overfitting, underfitting or non-actionable results [CaoLc21].
Modeling gap analysis. Among the massive references reported on modeling COVID-19 [CaoLc21], we roughly categorize the COVID-19 modeling into three major directions: epidemic compartmental modeling of COVID-19 infection and transmission processes built on epidemiological compartments and models for existing epidemics; data-driven modeling of COVID-19 intrinsic characteristics and infection processes on the relevant COVID-19 data; and hybrid modeling by integrating knowledge and modeling methods for a compound or more powerful epidemic understanding and insight of the COVID-19. A typical epidemic compartmental model following the conventional epidemics is the susceptible-infected-recovered (SIR) model. SIR simplifies the transmission process and separates the population into three compartments: the susceptible, the infectious, and the removed. A large number of SIR variants are available with more specific compartments. For example, SEIR [ma2009mathematical] adds an extra exposed compartment, and TSIR [finkenstadt2000time] incorporates the time-dependent transmission into SIR to model the varying transmission and removal rates over time. These classic SIR-based compartmental models were designed for the past epidemics and their transmission process, which do not directly capture the above COVID-19 complexities.
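As a minimal sketch of the classic three-compartment SIR dynamics described above, the following Python snippet advances the susceptible, infectious and removed populations with a one-day forward-Euler step. The function name `simulate_sir`, the daily discretization, and all parameter values are illustrative assumptions and are not taken from the cited works.

```python
def simulate_sir(population, infected0, beta, gamma, days):
    """Forward-Euler SIR with a one-day step; returns a list of (S, I, R) tuples.

    beta is a constant transmission rate and gamma a constant removal rate
    (both hypothetical values chosen for illustration).
    """
    s, i, r = float(population - infected0), float(infected0), 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population  # S -> I
        new_removals = gamma * i                    # I -> R
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        history.append((s, i, r))
    return history


history = simulate_sir(population=1_000_000, infected0=10, beta=0.3, gamma=0.1, days=60)
```

Because both rates are constants here, the sketch cannot express the time-varying or density-varying transmission that the SIR variants discussed in this section try to capture.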
Several very recent SIR-based extensions are available for modeling COVID-19. For example, Chen et al. [chen2020time] explore a time-dependent SIR for the time-varying transmission of COVID-19. Such models simply assume the SIR variables are temporal, while the actual COVID-19 processes may evolve over multiple factors, e.g., enforced interventions and diversified cooperation levels. Further, fine-grained SIR models like SIDARTHE [giordano2020modelling] and SEI_DI_UQHRD [nabi2020forecasting] divide the infection process into more specific stages to mimic the features of COVID-19, but they overfit specific country or regional data and lack general applicability. In addition, SIR-based probabilistic models like SIR-Poisson [hassen2020sir] assume the infected case numbers follow specific distributions such as Poisson distributions, while the actual conditions of COVID-19 case developments may be much more complicated. A critical reason for the above problems is that these COVID-19 models mainly focus on fitting the COVID-19 data (e.g., by regression) or reproducing the transmission processes (e.g., with specific hypotheses) rather than directly addressing the aforementioned COVID-19 complexities.
SUDR modeling the COVID-19 uncertainties. In addition to the aforementioned COVID-19 characteristics including asymptomatic and undocumented infections, social reinforcement is another phenomenon embedded in a COVID-19-affected community. In social systems, a stimulus from one person may increase the frequency of the behavior that immediately precedes it. Such an interpersonal stimulus is called social reinforcement, which characterizes the reinforced influence of social behaviors [centola2010spread]. The COVID-19 pandemic also demonstrates large-scale social behaviors and interactions, so social reinforcement is an important aspect of understanding COVID-19 transmission. Examples of social reinforcement in COVID-19 are infections through dense and close social contacts, household-to-household infections, household and local community infections, and the phenomenon that raising infection awareness may slow down the spread of infectious diseases. In this work, we are motivated to directly characterize the above COVID-specific characteristics and address their modeling challenges by integrating both domain-driven (the epidemic and social attributes of COVID-19) and data-driven (quantifying COVID-19 attributes and factors) modeling, which can leverage multiple resources about COVID-19 and multi-aspect modeling power to address the aforementioned uncertainties and various COVID-19 challenges [CaoLc21].
Combining this domain- and data-driven thinking, we aim to characterize the COVID-19 epidemic processes by capturing asymptomatic and undocumented infections and social reinforcement that are essential but hidden in the COVID-19 systems and processes. This is achieved by a hybrid approach: (1) capturing and incorporating new knowledge and compartments about the COVID-19 epidemiology into enhanced epidemic SIR models; (2) incorporating data-driven probabilistic mechanisms into the epidemic SIR-based extension to model the uncertainties of COVID-19; and (3) creating factors and mechanisms to capture social characteristics of COVID-19. Accordingly, a density-dependent Bayesian probabilistic Susceptible-Undocumented infectious-Documented infectious-Recovered (SUDR) model is proposed. First, to capture the confirmed and undocumented asymptomatic infections, SUDR replaces the infection compartment in the SIR model by two compartments: undocumented infection (U) and documented infection (D) and assumes that, when infected by the virus, the susceptible first transfers into the undocumented infectious compartment and then steps into the documented infected compartment only if detected. Second, we take a density-dependent view of the COVID-19 infection development and characterize undocumented infections and social reinforcement in COVID-19 contagion. Third, we incorporate probabilistic mechanisms to model the density likelihood-based prevalence, unknown infections, and the uncertain and noisy conditions of COVID-19 data. Lastly, Bayesian inference is applied to approximate the solution of SUDR. To capture the imperfect and noisy statistics of COVID-19 data, we elaborate the model as a probabilistic extension with certain priors and solve it by sampling from the mean-field posterior distribution.
The accompanying schematic illustrates the SUDR rationale of modeling the undocumented and asymptomatic infections and the social interactions between infected individuals (in red) and susceptible individuals (in green) in COVID-19. We assume all infections are undocumented at the beginning. Then, some of them transit to documented once they are confirmed by COVID-19 tests. Since the majority of infected symptomatic individuals are identified as documented infections and then quarantined, they have a low probability of further infecting other susceptible individuals. Hence, we assume only the undocumented infectious individuals can infect the susceptibles, and there are safe interactions between uninfected susceptibles and unsafe interactions with asymptomatic infections. More interactions and denser contacts with asymptomatic infections increase the chance of being infected. Accordingly, the central green nodes in scenarios (a) and (c) share the same probability of being infected, since they have the same density of unsafe interactions and close contacts with the infected. However, more unsafe interactions, as shown in (b), increase the infection probability of the susceptible individuals, showing social reinforcement and cluster infection in the COVID-19 transmission [liu2020cluster]. As a result, the infection rate of the central green node in scenario (b) is much higher (e.g., three times if the effect is linearly additive) than that in scenario (a). Thus, SUDR models the transmission rate as a function of the undocumented infection density.
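The compartment flow described above can be sketched as a one-day update in Python. This is only an illustration under stated assumptions, not the paper's formulation: it assumes a simple chain S → U → D → R, a forward-Euler daily step, and a hypothetical linear infection-rate function `linear_reinforcement`. All function names and parameter values are invented for the example.

```python
def linear_reinforcement(density, base=0.4, slope=20.0):
    """Hypothetical infection-rate function that grows linearly with undocumented density."""
    return base * (1.0 + slope * density)


def sudr_step(s, u, d, r, infection_rate_fn, detection_rate, removal_rate):
    """Advance S, U, D, R by one day with a forward-Euler step (assumed chain S->U->D->R)."""
    n = s + u + d + r
    density = u / n                      # prevalence of undocumented infections
    beta = infection_rate_fn(density)    # density-dependent infection rate
    new_u = min(s, beta * s * density)   # S -> U: only undocumented cases infect
    new_d = min(u, detection_rate * u)   # U -> D: confirmed by testing
    new_r = min(d, removal_rate * d)     # D -> R: removed (recovered or deceased)
    return s - new_u, u + new_u - new_d, d + new_d - new_r, r + new_r


def simulate_sudr(population, u0, days, detection_rate=0.2, removal_rate=0.1):
    """Start with all infections undocumented and roll the model forward day by day."""
    state = (float(population - u0), float(u0), 0.0, 0.0)
    trajectory = [state]
    for _ in range(days):
        state = sudr_step(*state, linear_reinforcement, detection_rate, removal_rate)
        trajectory.append(state)
    return trajectory


trajectory = simulate_sudr(population=1_000_000, u0=10, days=60)
```

Passing the infection-rate function as an argument keeps the compartment update separate from the choice of density dependence, so a constant rate (`lambda density: 0.4`) can be swapped in for comparison.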
In summary, this work discloses the following insights and contributions in modeling COVID-19:
A susceptible-undocumented infectious-documented infectious-recovered model SUDR explicitly captures the undocumented infections corresponding to asymptomatic infections and unknown contagion, often missed in COVID-19 modeling.
A probabilistic density-dependent infection function models both the COVID-19 uncertainty w.r.t. the infection rate over the density of undocumented infections and the exogenous contagion reinforcement through social interactions, tackling the gaps with a constant or time-dependent assumption of infections.
Bayesian inference with a mean-field method solves the SUDR optimization, which also copes with the poor COVID-19 data conditions including uncertainty, noise and sparsity.
We empirically verify the effectiveness of our method in detecting undocumented infections with the COVID-19 data from different countries and the data with noise and sparsity. Experimental results show that our model outperforms the classic SIR model, time-dependent SIR model, and probabilistic SIR model on the COVID-19 data.
We evaluate the ability of the SUDR model to detect undocumented infections under imperfect conditions (e.g., the reporting noise and under-reported numbers in the publicly available data) with real-world 60-day COVID-19 data from 11 typical European countries [flaxman2020estimating]. This data is a subset of the global COVID-19 case dataset reported by JHU CSSE [dong2020interactive], which records the worldwide daily case numbers (including confirmed, recovered, and death case numbers). Here, we only extract the initial period (the first 60 days) of the COVID-19 outbreak in these countries, since this early stage is more likely to involve undocumented cases and its epidemic dynamics are more challenging to model and control. In general, the first waves and the new resurgences with new variants are often more challenging for modeling and intervention, since they usually come with limited COVID-19 tests and test coverage, and with poor knowledge and awareness of the COVID-19 complexities, including transmissions, incubation periods, mutated attributes, and differences from the original strains. At this stage, many cases may only be documented after obvious symptoms appear and sufficient test kits are available, thus incurring a larger proportion of undocumented infections.
Inferring undocumented infections
As discussed earlier, there are often a large number of undocumented or unreported infected cases, including asymptomatic or mildly symptomatic infections, throughout the COVID-19 transmission process. This is even more prominent in the early stage of the epidemic outbreak, due to limited tests and the lack of preparedness, and in vaccinated communities, due to enhanced immunity. Here we verify this observation.
With the documented infected case numbers, the undocumented infected case numbers in the 11 European countries are inferred by the SUDR model, as shown in Figure 2. We carry out the inference over the first two months from the beginning of the COVID-19 outbreak in each country for case studies and evaluation. While undocumented infections may exist throughout the whole process of COVID-19 transmission, under-reporting is even more prominent at the early stage of the outbreak due to limited tests and lack of preparedness. The specific time period of each country is shown in the third column of Table 1. In Figure 2, the posterior samples of the undocumented infections converge and the posterior samples of documented infections fit the observations well. The results show that the undocumented infections far outnumber the documented ones in this time period (a more quantitative comparison follows below). Further, the prevalence curves of undocumented infections show a similar trend of first increasing and then decreasing in most of the countries where COVID-19 spread fast, except Germany and the United Kingdom, as shown in Figure 2. This common trend of undocumented infections across countries also reflects the increasing COVID-19 test capacity, governments’ enforcement of testing, and people’s increased willingness to be tested, which is consistent with real-world scenarios.
In Figure 2, the fluctuation of the two colored curves illustrates the different stages of the epidemic contagion in the two-month period. At the initial stage of the epidemic, most countries had a limited ability to test for the COVID-19 virus. Also, due to the long incubation period and asymptomatic infections, most of the infected individuals may not be tested immediately after infection. Hence, at the early stage of outbreaks, there may be a large proportion of undocumented infections, so that the green curves significantly exceed the orange ones. Then, with the increase of test availability and coverage and the enhanced public willingness to be tested, the number of undocumented infections drops gradually. If all undocumented infections were detected immediately, the curve of undocumented infections would be just a horizontal shift of the curve of documented infections, because undocumented infections become documented once detected. In practice this is not the case; however, the overall undocumented-to-documented trend shift still holds, explaining why the peak of documented infections always lags behind that of the undocumented ones in each country in Figure 2.
Further, the results in Figure 2 also show different COVID-19 transmission and evolution states in each country. For instance, the COVID-19 transmission was likely under better control at the end of the first 60-day period in Austria, Denmark and Switzerland, since they had passed the peaks of both undocumented and documented daily infections. In contrast, the United Kingdom and Germany were still at their early outbreak stages, as the curves, especially the green curves, rise sharply. The rapid increase of their undocumented infections indicates an infection burst without effective interventions.
Both undocumented and documented infection case numbers evolve over time. Since the fluctuation of the documented infection case number lags behind that of the undocumented infection case number, it is hard to compare them without proper time and data alignment. Hence, we only compare their peak values. Table 1 lists the peak value of undocumented infections and the peak value of documented infections for each country. If a curve is still increasing and has not reached its summit, we simply replace the peak value by the maximum value. For documented infections, the observed maximum daily active case number in that period is listed in the fourth column, while for undocumented infections, we compute the mean peak value from the samples (the green curves shown in Figure 2) inferred by the SUDR model. The 95% confidence interval is also given along with each mean peak value of undocumented infections. The last column shows the ratio of the peak value of undocumented infections to that of documented infections, which reflects how big the quantitative gap is between the maximum numbers of undocumented and documented infections.
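The peak comparison described above can be sketched in a few lines of Python. This is a hedged illustration: the helper names `percentile` and `peak_ratio` are hypothetical, the input numbers are toy values rather than results from the paper, and the interval is assumed to be an empirical 2.5%-97.5% percentile range over the per-sample peaks, which may differ from how the reported interval was computed.

```python
import statistics


def percentile(sorted_values, q):
    """Linear-interpolation percentile for q in [0, 100] on a pre-sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def peak_ratio(undocumented_samples, documented_series):
    """Mean peak of sampled undocumented curves, its interval, and the ratio to the documented peak."""
    # The maximum also serves as the "peak" when a curve is still rising.
    peaks = sorted(max(curve) for curve in undocumented_samples)
    mean_peak = statistics.mean(peaks)
    interval = (percentile(peaks, 2.5), percentile(peaks, 97.5))
    documented_peak = max(documented_series)
    return mean_peak, interval, mean_peak / documented_peak


# Toy posterior samples of undocumented infections (one list per sampled curve).
undocumented_samples = [
    [100, 400, 900, 700],
    [120, 500, 1000, 800],
    [90, 450, 1100, 600],
]
documented_series = [10, 80, 200, 250]  # toy observed active documented cases

mean_peak, interval, ratio = peak_ratio(undocumented_samples, documented_series)
```

For these toy inputs the mean undocumented peak is 1000 against a documented maximum of 250, giving a ratio of 4.0.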
For most countries, the ratio ranges from around 2 to 6 in the 60-day period of the first wave of COVID-19. Some existing studies have reported similar results [bohning2020estimating]. For example, the number of infected in Italy was estimated at around 3.5 times higher than that reported at the end of February 2020. However, two outliers are identified in the results: 12.86 (Germany) and 10.88 (the United Kingdom), which are much larger than the ratios estimated for the other countries. That is because, in the initial stage, the increase of documented infections lags behind the evolving undocumented infections. When the peak value of undocumented infections is compared with the still-early value of documented infections, the ratio becomes larger than the actual value. We notice that the active undocumented infections gradually decrease to a low level as the first wave was finally brought under control.
Overall, Figure 2 shows that detecting undocumented infections and inferring their relationship with documented infections can provide reliable indications about the COVID-19 contagion in the first two months of COVID-19 outbreaks. Table 1 further shows the quantitative peak values of documented and undocumented infections. The ratio gives an intuitive evaluation of the gap between the reported and unreported infections. These results may assist in understanding the infection movement, forecasting the increase of detected infection cases, and initiating and adjusting the corresponding mitigation policies. In addition, since no individual indicator paints a whole picture of the evolving documented or undocumented cases, readers should cross-reference all indicators to infer more comprehensive and trustworthy insights for making intervention policies and choosing the corresponding control measures.
Inferring the epidemic attributes
The main attributes describing the COVID-19 epidemic include the infection rate, the detection rate, and the removal rate. Here, we infer them by SUDR on the reported data from the 11 European countries.
First, the infection rate is one of the most important epidemiological attributes describing the transmission and reproduction features of COVID-19. In existing studies, the infection rate is typically modeled as a constant or time-varying variable. However, this assumption cannot accurately reflect the characteristics and complexities of COVID-19 transmission processes (see the discussion in Section ‘Introduction’). In fact, cluster infection is a prominent characteristic of COVID-19 spreading, and the virus transmission routes and circumstances usually involve household, local community and nosocomial infections, etc. [liu2020cluster, song2020clinical]. Considering this particular epidemiological feature, we model the infection rate as a density-varying (or prevalence-varying) complex function in the SUDR model, which provides a much better capacity to capture the COVID-19 complexities. However, it is difficult to obtain an accurate closed-form solution for the complex prevalence-varying infection rate function. Possible reasons include: the micro-level transmission mechanism and its expression form are unknown; and the infection rate can only be inferred at discrete points (i.e., the observed prevalence of reported infection), which are extremely sparse. Hence, we summarize some important statistical characteristics of the infection rates sampled by our model over the inferred undocumented infection densities and present them as a box-and-whisker plot in Figure 3.
The spreading of the SARS-CoV-2 virus in the initial stage shows different transmission dynamics with changing infection rates among the 11 European countries. The box plot depicts the distribution of the infection rate. As shown in Figure 3, countries like Austria, Germany, Spain and Switzerland have relatively higher average infection rates (23.2, 23.1, 21.9 and 21.0, respectively) than France and Sweden (11.3 and 12.9, respectively). Besides, the variation range is reflected by the minimum, the lower quartile, the upper quartile, and the maximum. Since the infection prevalence is defined on the domain [0, 1], whereas the observed densities are usually close to 0 and never reach 1 [hebert2020macroscopic], it can also be inferred that the larger the variation range, the more sensitive the complex contagion function is to the infection density.
Lastly, in addition to verifying the infection rate, SUDR also infers two other epidemiological attributes from the data: the detection rate and the removal rate. As shown in Table 2, the detection rate indicates the average COVID-19 test ability and test coverage in a country. The higher the detection rate, the faster the undocumented infection cases drop. For instance, as shown in Table 2, the detection rates in four countries (Austria, Denmark, Spain and Switzerland) are much higher than in the others. In Figure 2, the undocumented infection cases in these four countries drop quickly until they approach the level of documented infection cases. We can also find that the removal rates in these four countries are relatively higher. Considering that most undocumented infections occur in asymptomatic or mildly symptomatic patients, who are easier to cure, the number of removed cases per unit time will increase when more undocumented asymptomatic or mild infections are detected.
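The statistics that a box-and-whisker plot summarizes (minimum, lower quartile, median, upper quartile, maximum) and the mean can be computed from a set of sampled infection rates as in the sketch below. The function name `box_summary` and the sample values are hypothetical, and the "inclusive" quartile convention of Python's `statistics.quantiles` is an assumption; plotting libraries may use a slightly different convention.

```python
import statistics


def box_summary(samples):
    """Five-number summary plus mean and range for a list of sampled infection rates."""
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {
        "min": min(samples),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(samples),
        "mean": statistics.mean(samples),
        "range": max(samples) - min(samples),  # the variation range discussed above
    }


# Toy infection-rate samples for one country (illustrative values only).
summary = box_summary([18.0, 21.5, 23.2, 24.0, 29.3])
```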
| Country | Detection rate (mean with 95% CI) | Removal rate (mean with 95% CI) |
|---|---|---|
| Austria | 0.97 [0.90, 1.00] | 3.38 [3.14, 3.60] |
| Belgium | 0.68 [0.60, 0.80] | 0.35 [0.25, 0.47] |
| Denmark | 0.91 [0.69, 1.00] | 3.67 [3.17, 4.00] |
| France | 0.74 [0.41, 0.98] | 0.77 [0.16, 1.32] |
| Germany | 0.56 [0.34, 0.86] | 2.33 [0.72, 4.08] |
| Italy | 0.83 [0.75, 0.95] | 1.15 [1.07, 1.23] |
| Norway | 0.68 [0.53, 0.92] | 1.18 [1.06, 1.31] |
| Spain | 0.98 [0.93, 1.00] | 1.21 [1.12, 1.27] |
| Sweden | 0.73 [0.54, 0.93] | 0.15 [0.003, 0.55] |
| Switzerland | 0.97 [0.89, 1.00] | 2.07 [1.88, 2.23] |
| United Kingdom | 0.70 [0.47, 0.95] | 1.06 [0.14, 2.23] |
As aforementioned, the reported COVID-19 case data contains various uncertainties and quality issues, including the randomness of case reporting, statistical errors, missing undocumented infection cases, missing reportings, the inconsistencies of reporting standards (e.g., different confirmation criteria, and the inclusion of suspected cases with clinical diagnosis into confirmed cases in Hubei, China on Feb 12th, 2020), etc. With such prominent uncertainties in the COVID-19 data, as a probabilistic compartmental model, SUDR is more robust and applicable than existing SIR and its variants, since SUDR assumes the parameters follow a certain distribution instead of a fixed constant or function.
Here, we evaluate the SUDR robustness through backtesting validation on the COVID-19 case numbers of Hubei province, China, from Jan 12, 2020 to Mar 23, 2020, collected by JHU CSSE [dong2020interactive]. We choose this data to validate the SUDR robustness because it poses even tougher challenges. Hubei was the first place with a large-scale outbreak of COVID-19. When the epidemic started to spread, limited knowledge was available about the virus and its containment. The data also shows obvious uncertainties, e.g., a sudden jump in the reported cases due to the inclusion of suspected cases with clinical diagnosis into confirmed cases on Feb 12, 2020. In comparison with other, later reported data, this data is more complex in its case reporting uncertainty, noise and statistics. Comparatively, the above European data may be less uncertain and noisy, since some reporting mistakes have already been corrected [dong2020interactive]. As the Hubei case numbers already contain noise types like statistical errors and missing values, here we incorporate various degrees of sparsity into the data by randomly masking some of its values, resulting in four sets: the complete data, 5% sparsity, 10% sparsity, and 20% sparsity. In this experiment, the degrees of the Bernstein polynomials of the function, the deviation hyper-parameters, and the HMC parameters of SUDR are the same as those in the above experiment.
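The random masking used to build the sparse data sets can be sketched as follows. The function name `mask_series`, the use of `None` as the missing-value marker, the fixed seed, and the toy series are all assumptions made for illustration; the original masking procedure may differ in detail.

```python
import random


def mask_series(series, sparsity, seed=0):
    """Return a copy of series with round(sparsity * len(series)) random entries set to None."""
    rng = random.Random(seed)  # fixed seed so the same mask can be reused across models
    n_masked = round(sparsity * len(series))
    masked_idx = set(rng.sample(range(len(series)), n_masked))
    return [None if i in masked_idx else value for i, value in enumerate(series)]


daily_cases = list(range(100, 200))  # toy series of 100 daily case numbers

# The four data conditions: complete, 5%, 10% and 20% sparsity.
datasets = {level: mask_series(daily_cases, level) for level in (0.0, 0.05, 0.10, 0.20)}
```

Reusing one seed per sparsity level means every compared model sees the same missing entries, which keeps the comparison fair.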
Three baselines are chosen for the robustness comparison. First, SIR is a classic compartmental model with fundamental biological insight. Second, time-dependent SIR [chen2020time] is an SIR with time-dependent functions that model the transmission rate and removal rate, and it applies ridge regression for the model solution. Lastly, complex SIR [hebert2020macroscopic] is a probabilistic extension of SIR that replaces the constant transmission rate with a density-dependent function relying on the infection case numbers. These baselines only model the explicitly documented infections and cannot detect the undocumented infections. For the sake of fairness, the comparison experiments only test how well these models fit the reported cases under complex data conditions. The settings of the time-dependent SIR and complex SIR models are the same as in their original designs for optimal performance.
In the backtesting, we infer the infection rate and the removal rate by these models from the known case numbers (including the population, the documented infection numbers, and the recovered and death case numbers). Then, with the initial values, the case number series can be obtained step by step by the ODE functions of the models. The robustness and effectiveness of the models can be estimated by how well the computed case number series fit the observed daily cases in the data under different noise conditions.
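A minimal sketch of this backtesting loop is given below, using a plain SIR rollout as a stand-in model and root-mean-square error as the fit measure. Both choices are assumptions for illustration: the source does not state which error metric is used, the names `rollout_sir` and `rmse` are hypothetical, and the "observations" here are synthetic.

```python
import math


def rollout_sir(s0, i0, r0, beta, gamma, days):
    """Roll an SIR model forward from initial values; returns the infectious series."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    infected = [i]
    for _ in range(days):
        new_inf = beta * s * i / n
        new_rem = gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rem, r + new_rem
        infected.append(i)
    return infected


def rmse(simulated, observed):
    """Root-mean-square error over the days that have an observation (None = masked)."""
    pairs = [(x, y) for x, y in zip(simulated, observed) if y is not None]
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))


# Synthetic "observed" series with one masked day, standing in for reported data.
observed = rollout_sir(9990, 10, 0, beta=0.35, gamma=0.1, days=30)
observed[5] = None

good_fit = rmse(rollout_sir(9990, 10, 0, beta=0.35, gamma=0.1, days=30), observed)
poor_fit = rmse(rollout_sir(9990, 10, 0, beta=0.20, gamma=0.1, days=30), observed)
```

Rates that match the data-generating process reproduce the series exactly, while a mis-inferred infection rate yields a strictly larger error; skipping masked days lets the same metric be applied at every sparsity level.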
As shown in Figure 4, SUDR and complex SIR show similar performance. SUDR performs better in the first half stage (before day 30), while the complex SIR performs better in the second half stage (after day 50). This suggests that SUDR pays more attention to the data before day 30 in the epidemiological parameter inference, while the complex SIR does just the opposite. However, both models perform better than the time-dependent SIR and classic SIR at different levels of sparsity. With the increase of sparsity, the performance of SUDR and complex SIR drops gradually, but they still outperform the others. The classic SIR model (the blue curve) shows a quite different trend from the real observation data, indicating significant inaccuracy in the inferred transmission rate and removal rate. It is thus unreliable to infer the trend of COVID-19 merely from constant mean values of the transmission rate and removal rate. The time-dependent SIR model performs better than the classic SIR model, as it captures some changes in the observations and is only slightly affected by the sparsity level. However, the time-dependent SIR is fragile to noise. It is noteworthy that the Hubei data involves more confirmed cases due to the relaxed case confirmation criteria since Feb 12, 2020 [chen2020time]. This specification adjustment leads to a lift of the infectious cases around the adjustment date, as shown in Figure 4. After this adjustment cutoff point, the time-dependent SIR does not fit the actual infectious case numbers, especially in the second half stage. In summary, the probabilistic compartmental models, namely SUDR and complex SIR, are robust enough to cope with the noise and sparsity in the data reporting.
The comparison results in Figure 4 show some general insights. On one hand, the compared models represent three typical directions of epidemic modeling: the epidemiological compartments, the time dependency of case numbers, and the uncertainty of case reporting. These are important concerns in understanding the COVID-19 complexities by epidemic modeling: the classic compartmental model (e.g., SIR), time-dependent compartmental model (e.g., time-dependent SIR), and probabilistic compartmental model (e.g., complex SIR and SUDR). On the other hand, the complex conditions of COVID-19 data must be captured in COVID-19 modeling, including missing values, statistical errors, rectification, and sparsity. In addition, it is observable that probabilistic compartmental models like SUDR outperform the classic compartmental models and time-dependent compartmental models, as shown by the results.
Accurately inferring the undocumented infection case numbers of COVID-19 is one of the most challenging tasks in modeling COVID-19. The challenge comes from various uncertainties related to not only the COVID-19 epidemics represented by sophisticated epidemiological attributes of the coronavirus, in particularly, a high proportion of asymptomatic and mildly-symptomatic infections with high contagion to the susceptible, but also diversified data uncertainties. These issues are still apparent in the current COVID-19 resurgences mainly caused by coronavirus mutations (such as delta and lambda variants) and in the vaccine breakthrough infections. This study proposes an inference approach from the macro-level perspective for this complex social-tech problem. Since there is no true knowledge about the actual underlying interactions between entities and in the process of COVID-19 transmissions, the density-dependent infection function better captures complex contagion dynamics, including social reinforcement and non-monotonous relations between the expected epidemic size and their average transmission rate, than other typical methods of modeling constant and time-dependent infection rate. Contrary to complex contagion functions, we adopt a concise and plain four-compartment SIR-like model to characterize the COVID-19 transmission processes. The proposed SUDR shows a stronger generalization ability than the elaborative compartmental models which may include seven or more states. Due to lacking knowledge about the underlying contagion interactions and spread patterns, it is thus appropriate to design a generalized model that can avoid vital deviations and mismodelling errors in characterizing the actual contagion mechanisms.
The second observation from this work is that probabilistic compartmental models are a good choice for characterizing complex data conditions in COVID-19 reporting. Within a Bayesian framework, probabilistic compartmental models outperform other mathematical epidemic models by assuming that the central epidemiological parameters follow certain distributions. This naturally captures the uncertainty in both the COVID-19 processes and the case data, which cannot be matched by models with constant parameters (as in the classic compartmental models SIR and SEIR) or with time-varying functions (e.g., time-dependent compartmental models). In addition, probabilistic compartmental models offer better robustness and interpretability than classic and time-dependent compartmental models.
However, our work and similar probabilistic compartmental models can be further enhanced in future work. First, we can hardly obtain an accurate infection function due to the extreme sparsity of the prevalence and the sampling method; how the infection rate varies with the undocumented infection density remains unresolved by the current model. Second, SUDR assumes that the clusters are isomorphic and homogeneous, whereas in fact the population stratification and the interaction structure within a cluster may influence COVID-19 contagion, which requires further study. Lastly, probabilistic compartmental models depend heavily on prior knowledge of distributions and hyperparameters, which is difficult to obtain.
SUDR is a new compartmental epidemic model embedded with Bayesian statistical methods. It jointly models the COVID-19 epidemic processes, unknown infections, social reinforcement of contagion, and imperfect data conditions.
Modeling COVID-19 transmission mechanisms
Figure 5 illustrates the SUDR model for the epidemiological compartmental characterization of COVID-19. SUDR comprises four compartments to simulate the entire transmission process, including asymptomatic infections and the transfer from the undocumented to the documented state. Accordingly, the COVID-19 transmission dynamics are formulated per Eqs. (1)-(4) over time steps (each corresponding to one day in daily case reporting).
The susceptible compartment S contains the susceptible individuals who are not epidemically contained and thus may be exposed to the virus at the infection rate (given by the infection rate function). When infected, a susceptible individual transits to the undocumented infectious compartment U (Eq. (1)). The model considers only the subpopulation involved in the epidemic, which is assumed to be a part of the entire population (this is particularly applicable to the first COVID-19 waves and to new resurgences after full zero-infection containment). As superspreading events (SSEs) and cluster infections are common in the COVID-19 pandemic [xu2020reconstruction, ryu2020effect], not all people in the entire population are susceptible, particularly when they stay geographically far from the epicenter or adopt effective self-protection measures (e.g., wearing face masks or staying at home). In other words, SUDR does not include such individuals in the epidemic transmission processes to be modeled. Accordingly, we assume that only a fraction of the entire population is involved in the active epidemic shown in Figure 5.
The undocumented infectious compartment U contains the undocumented individuals who have contracted the virus and can thus infect susceptible individuals, for example through close contacts or household infections. They are undocumented because they may be either in an incubation period or asymptomatic. This undocumented group is an important determinant of the pathogen’s pandemic potential, as these infections are likely undiagnosed but highly contagious [li2020substantial]. Undocumented infectious individuals, once confirmed with the virus infection (e.g., by a diagnostic test) at the detection rate, transit to the documented infectious compartment D (Eq. (2)); they are then quarantined and will hardly infect other susceptible individuals. We assume that the observed cases fall in this group. People in D will then either be cured or die, and transit directly to the removed compartment R at the removal rate (see Eq. (3)). Both and are time-dependent over time . R combines the recovered and deceased individuals who are converted from the undocumented and documented infectious compartments (see Eq. (4)). We further assume that the recovered and deceased individuals are immune to the virus and do not infect other people.
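The compartment transitions described above can be illustrated with a minimal discrete-time sketch. It is not the exact form of Eqs. (1)-(4), which are not reproduced here. It assumes that all quantities are densities of the involved subpopulation, that one time step is one day, and that a single removal rate applies to both infectious compartments. The names `sudr_step`, `simulate_sudr`, `infection_rate`, `detection_rate` and `removal_rate` are hypothetical.

```python
from typing import Callable, List, Tuple

# Densities of the four compartments (s, u, d, r); assumed to sum to 1.
State = Tuple[float, float, float, float]


def sudr_step(
    state: State,
    infection_rate: Callable[[float], float],  # assumed: rate as a function of the undocumented density
    detection_rate: float,                     # assumed constant within a run
    removal_rate: float,                       # assumed identical for U and D
) -> State:
    """Advance the four compartments by one day."""
    s, u, d, r = state
    # Only undocumented individuals infect the susceptible.
    new_infections = infection_rate(u) * s * u
    new_detections = detection_rate * u
    removed_undocumented = removal_rate * u
    removed_documented = removal_rate * d
    return (
        s - new_infections,
        u + new_infections - new_detections - removed_undocumented,
        d + new_detections - removed_documented,
        r + removed_undocumented + removed_documented,
    )


def simulate_sudr(
    initial: State,
    infection_rate: Callable[[float], float],
    detection_rate: float,
    removal_rate: float,
    days: int,
) -> List[State]:
    """Return the trajectory of states, including the initial state."""
    trajectory = [initial]
    for _ in range(days):
        trajectory.append(
            sudr_step(trajectory[-1], infection_rate, detection_rate, removal_rate)
        )
    return trajectory


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    path = simulate_sudr((0.999, 0.001, 0.0, 0.0), lambda u: 0.4, 0.1, 0.05, 60)
    print("documented density after 60 days:", path[-1][2])
```

Because every outflow from one compartment is an inflow to another, the four densities keep summing to one, which is a quick sanity check for any variant of this update rule.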
Modeling the unknown infections
As illustrated in Figure 1, COVID-19 infectious individuals may infect the susceptible during their incubation periods or when they are asymptomatic, and both scenarios are unobserved. In addition, a large proportion of asymptomatic infections cannot be detected immediately. These unknown infections pose a great challenge to tracing and containing the infections before symptom onset and before they infect other people, leading to a significant delay in treating the infected and mitigating the spread of contagion. To address the incubative and asymptomatic infections, we partition the infectious population into undocumented and documented infectious individuals. The undocumented individuals could be in incubation or asymptomatic, and we assume that all COVID-19 infections are initially undocumented. Those who develop symptoms and are diagnosed will be detected, transferring to the documented compartment at the detection rate.
We further assume that only undocumented infectious individuals are infectious to the susceptible, since those detected are likely quarantined and, without close contacts, hardly infect the susceptible further. Undocumented infected individuals may have a much higher probability than documented ones of interacting with the susceptible when they have minimal symptoms or are unaware of their infection. This assumption is consistent with reality, especially at the early stage of the COVID-19 outbreak, when both viral testing and effective protection were limited.
Modeling the contagion reinforcement
The contagion of COVID-19 may be reinforced through unsafe social interactions, as COVID-19 can be regarded as a complex socially reinforced contagion network. When a susceptible individual is infected, their close contacts may have a higher probability of being infected. The infections of close contacts will further be passed to their contacts. Consequently, the population infection probability increases nonlinearly with the density of infected neighbours in a chained way. This explains the commonly seen cluster infections, such as those in households, parties and hospitals, which dominate the spread of COVID-19.
SUDR thus models this COVID-19 contagion reinforcement, which may be caused by various contagious factors. We model the transmission rate as a function of the density of the infected population, inspired by [hebert2020macroscopic]. Compared with assuming a time-dependent transmission rate, a density-dependent transmission rate function can more reasonably characterize the social reinforcement of COVID-19 contagion and better explain the dominance of cluster infections.
Modeling data uncertainty, sparsity and noise
To model the aforementioned COVID-19 data quality issues, including noise, sparsity and randomness, we incorporate Bayesian inference into SUDR. For this, we refer to the density of documented infections at time t as the COVID-19 prevalence used for the measurement. The prevalence is much closer to 0 than to 1 due to the large population size. By assuming that the population is well mixed, the likelihood of the prevalence can be obtained as:
The state set at time t consists of the susceptible, undocumented infectious, documented infectious, and removed people; its value at the start of the series is the initial state. The noise component is shown in Eq. (6), which is a normal distribution with a mean equal to the density of the infectious individuals in the state set at time t and a given standard deviation.
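The normal noise model can be illustrated with a minimal sketch of a Gaussian log-likelihood. It assumes independent noise across days and a single standard deviation shared by all days; neither assumption is confirmed by the text above. The names `gaussian_log_likelihood`, `observed`, `predicted` and `sigma` are hypothetical.

```python
import math
from typing import Sequence


def gaussian_log_likelihood(
    observed: Sequence[float],   # reported prevalence per day
    predicted: Sequence[float],  # model-implied density per day (the mean)
    sigma: float,                # assumed common standard deviation
) -> float:
    """Sum of normal log-densities of the observations around the predictions."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    # Terms of the normal log-density that do not depend on the data.
    constant = -0.5 * math.log(2.0 * math.pi) - math.log(sigma)
    return sum(
        constant - 0.5 * ((y - m) / sigma) ** 2
        for y, m in zip(observed, predicted)
    )


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    reported = [1.0e-5, 2.0e-5, 4.0e-5]
    model_mean = [1.1e-5, 2.1e-5, 3.8e-5]
    print(gaussian_log_likelihood(reported, model_mean, sigma=1.0e-5))
```

Working on the log scale avoids numerical underflow when many daily observations are multiplied together, which matters here because the prevalence values are very close to zero.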
Since Eq. (5) has no closed-form solution, we use a mean-field approximation for the inference. Similar to the inference in [hebert2020macroscopic], we only consider the largest contribution in Eq. (5), leading to
With the prevalence likelihood, we further obtain the posterior distribution of the prevalence data:
Before sampling, we assume priors for the parameters in the likelihood. We first parameterize the infection rate function, since priors cannot be placed directly on functions. Bernstein polynomials are adopted for the parameterization as shown in Eq. (9), which is defined by the degree of the Bernstein polynomial and its coefficients.
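As an illustration of this parameterization, the following minimal sketch evaluates a function written in the Bernstein basis. It assumes that the argument has been scaled to the interval [0, 1]; how the density is scaled in Eq. (9) is not stated here. The names `bernstein_rate` and `coefficients` are hypothetical.

```python
from math import comb
from typing import Sequence


def bernstein_rate(x: float, coefficients: Sequence[float]) -> float:
    """Evaluate sum_k c_k * C(n, k) * x**k * (1 - x)**(n - k) for x in [0, 1].

    The polynomial degree n is implied by the number of coefficients (n + 1).
    """
    if not coefficients:
        raise ValueError("at least one coefficient is required")
    if not 0.0 <= x <= 1.0:
        raise ValueError("x is assumed to lie in [0, 1]")
    n = len(coefficients) - 1
    return sum(
        c * comb(n, k) * x**k * (1.0 - x) ** (n - k)
        for k, c in enumerate(coefficients)
    )


if __name__ == "__main__":
    # Illustrative coefficients only; they are not taken from the paper.
    coeffs = [0.2, 0.5, 0.9]
    for density in (0.0, 0.25, 0.5, 1.0):
        print(density, bernstein_rate(density, coeffs))
```

One convenient property of this basis is that the function value always lies between the smallest and largest coefficient, so priors placed on the coefficients translate directly into bounds on the infection rate.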
The SUDR model summary
In summary, the SUDR model for inferring the COVID-19 prevalence at time t is as follows:
where the model function returns a mean-field time series of the prevalence given a contagion function parameterized by the coefficients of a Bernstein polynomial of the chosen degree, the detection rate, the removal rate, and the initial conditions. Since there is no information about the initial cases, we assume that the initial conditions follow distributions:
Figure 6 further shows the probabilistic graphical model of SUDR, where the grey circle refers to the observed data, namely the reported infections; the white circles stand for variables to be inferred by the model. The hyperparameter is represented by the black dot, and the capital letter in the box indicates the number of the variables contained in the box. The probabilistic graphical model clearly demonstrates the dependency relationship between variables.
SUDR is implemented in the Stan probabilistic programming language for statistical inference [gelman2015stan]. The Hamiltonian Monte Carlo (HMC) algorithm is adopted to generate samples from the posterior distribution in Eq. (8). The observed daily infectious case numbers are divided by the corresponding population of each country to obtain the density (the prevalence). For simplicity, we set the involved fraction of the population to a constant value of 0.01 in our experiments, indicating that 1% of the whole population in the country is involved in the epidemic transmission process. We use a low degree for the Bernstein polynomial of the infection rate function, since a low-degree Bernstein polynomial performs well enough for the inference. For the deviation hyper-parameters in Eq. (10), we set , , , , , and . For the HMC algorithm, the default four chains are adopted for sampling. Other sampling parameters, such as the number of iterations and the control parameters, are adjusted for each country until convergence.
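The data preparation described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' preprocessing code: the function names and the example numbers are hypothetical, and the only values taken from the text are the division by the country population and the involved fraction of 0.01.

```python
from typing import List, Sequence


def to_prevalence(daily_cases: Sequence[int], population: int) -> List[float]:
    """Divide the observed daily case numbers by the country population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return [cases / population for cases in daily_cases]


def involved_population(population: int, involved_fraction: float = 0.01) -> float:
    """Size of the subpopulation assumed to take part in the transmission."""
    if not 0.0 < involved_fraction <= 1.0:
        raise ValueError("involved_fraction must be in (0, 1]")
    return involved_fraction * population


if __name__ == "__main__":
    # Hypothetical country with one million inhabitants.
    print(to_prevalence([0, 50, 100], 1_000_000))
    print(involved_population(1_000_000))
```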