Introduction
COVID-19 modeling challenges. The coronavirus disease 2019 (abbreviated “COVID-19”), caused by the SARS-CoV-2 virus, was declared a pandemic by the World Health Organization (WHO) on March 11, 2020. COVID-19 differs fundamentally from other epidemics, including SARS and Ebola, and has caused unprecedented and all-round challenges, devastation and crises to health, society, the economy, and many other aspects, beyond the more than 4M deaths and 200M confirmed cases reported worldwide. Existing medical, epidemic and modeling research shows some significant characteristics and challenges of COVID-19 [guan2020clinical, li2020substantial, weitz2020modeling, CaoLc21]. They include: (1) unknown transmission processes: it is still unclear how the coronavirus infects people, how the virus transfers between people, whether recovered patients remain contagious, and how COVID-19 spreads during the so-called 2-14 day incubation period [han2020covid]; (2) easy infection and fast transmission and mutation: COVID-19 seems to spread and mutate much more easily and quickly than other known epidemics; (3) asymptomatic infection and contagion: a high proportion of infections spread between susceptible and infectious people, stranger-to-stranger and household-to-household, during incubation and asymptomatically (corresponding to undocumented cases); (4) significant contagion: a higher transmission (infection) rate than, e.g., SARS [wang2020unique], transmission through various channels and media including airborne routes, and a high survival rate; and (5) unquantified effects of various mitigation policies and actions on containing the virus and disease.
These complexities become even more apparent in the current overwhelming COVID-19 resurgences dominated by virus mutations such as delta and lambda and in partially vaccinated populations, where younger people with stronger immunity and vaccinated people with partial immunity are still susceptible to, or suffer breakthrough infections from, the more infectious mutants but are likely to show mild symptoms or none.
These COVID-19 complexities also bring forward significant modeling opportunities and challenges [CaoLc21]. First, the unknown transmission processes of COVID-19 incur much uncertainty (e.g., the randomness of infection and contagion, particularly during the incubation period and for asymptomatic infectious cases), which is hard to model properly. Second, many observable and hidden factors (e.g., related to asymptomatic contagion and habitual behaviors) and mitigation-related factors (e.g., lockdown, social distancing and people’s cooperation) interact with each other and jointly affect the COVID-19 transmission processes and dynamics. Third, the infection and contagion processes and the transitions between states such as susceptible, infectious and recovered seem to be highly complex: random, nonlinear, time-varying, and noisy. Lastly, the documented COVID-19 data with confirmed, death and recovery case numbers (e.g., in JHU CSSE [dong2020interactive]) are macroscopic and subject to significant data uncertainty (i.e., quality issues), including acquisition inconsistencies, noise and errors, under-reporting and missing reports, and the randomness of case confirmation and reporting in different countries and regions. The publicly available case data do not disclose the full picture and hidden nature of COVID-19 dynamics and may not reflect reality; e.g., inaccurate statistics and missing reports likely exist for a considerable number of asymptomatic infections. The actual sizes of the susceptible, infectious and recovered compartments may be hard to obtain, leading to highly unreliable data and poor-quality ground truth for evaluation. As a result, modeling COVID-19 needs to pay special attention to the above uncertainties, in addition to the epidemic attributes.
Modeling the reported poor-quality and uncertain COVID-19 case numbers appears highly challenging and easily results in overfitting, underfitting or non-actionable results [CaoLc21].
Modeling gap analysis. Among the large body of literature on modeling COVID-19 [CaoLc21], we roughly categorize COVID-19 modeling into three major directions: epidemic compartmental modeling of COVID-19 infection and transmission processes, built on epidemiological compartments and models for existing epidemics; data-driven modeling of COVID-19 intrinsic characteristics and infection processes on the relevant COVID-19 data; and hybrid modeling that integrates knowledge and modeling methods for a compound or more powerful epidemic understanding of COVID-19. A typical epidemic compartmental model following conventional epidemics is the susceptible-infected-recovered (SIR) model. SIR simplifies the transmission process and separates the population into three compartments: the susceptible, the infectious, and the removed. A large number of SIR variants are available with more specific compartments. For example, SEIR [ma2009mathematical] adds an extra exposed compartment, and TSIR [finkenstadt2000time] incorporates time-dependent transmission into SIR to model transmission and removal rates that vary over time. These classic SIR-based compartmental models were designed for past epidemics and their transmission processes, and do not directly capture the above COVID-19 complexities.
Several very recent SIR-based extensions are available for modeling COVID-19. For example, Chen et al. [chen2020time] explore a time-dependent SIR for the time-varying transmission of COVID-19. Such models simply assume the SIR variables are temporal, while the actual COVID-19 processes may evolve with multiple factors, e.g., enforced interventions and diversified cooperation levels. Further, fine-grained SIR models like SIDARTHE [giordano2020modelling] and SEI_DI_UQHRD [nabi2020forecasting] divide the infection process into more specific stages to mimic the features of COVID-19, but they overfit specific country or regional data and lack general applicability. In addition, SIR-based probabilistic models like SIR-Poisson [hassen2020sir] assume the infected case numbers follow specific distributions such as the Poisson distribution, while the actual conditions of COVID-19 case developments may be much more complicated. A critical reason for these problems is that the models mainly focus on fitting the COVID-19 data (e.g., by regression) or reproducing the transmission processes (e.g., with specific hypotheses) rather than directly addressing the aforementioned COVID-19 complexities.
SUDR modeling the COVID-19 uncertainties. In addition to the aforementioned COVID-19 characteristics, including asymptomatic and undocumented infections, social reinforcement is another phenomenon embedded in a COVID-19-affected community. In social systems, a stimulus from one person may increase the frequency of the behavior that immediately precedes it. Such interpersonal stimulus is called social reinforcement, which characterizes the reinforced influence of social behaviors [centola2010spread]. The COVID-19 pandemic also involves large-scale social behaviors and interactions, so social reinforcement is an important aspect of understanding COVID-19 transmission. Examples of social reinforcement in COVID-19 are infections through dense and close social contacts, household-to-household infections, household and local community infections, and the phenomenon that raising infection awareness may slow down the spread of infectious diseases. In this work, we are motivated to directly characterize the above COVID-specific characteristics and address their modeling challenges by integrating both domain-driven (the epidemic and social attributes of COVID-19) and data-driven (quantifying COVID-19 attributes and factors) modeling, which can leverage multiple resources about COVID-19 and multi-aspect modeling power to address the aforementioned uncertainties and various COVID-19 challenges [CaoLc21].
Combining this domain-driven and data-driven thinking, we aim to characterize the COVID-19 epidemic processes by capturing asymptomatic and undocumented infections and social reinforcement, which are essential but hidden in the COVID-19 systems and processes. This is achieved by a hybrid approach: (1) capturing and incorporating new knowledge and compartments about the COVID-19 epidemiology into enhanced epidemic SIR models; (2) incorporating data-driven probabilistic mechanisms into the epidemic SIR-based extension to model the uncertainties of COVID-19; and (3) creating factors and mechanisms to capture the social characteristics of COVID-19. Accordingly, a density-dependent Bayesian probabilistic Susceptible-Undocumented infectious-Documented infectious-Recovered (SUDR) model is proposed. First, to capture the confirmed and undocumented asymptomatic infections, SUDR replaces the infection compartment in the SIR model with two compartments, undocumented infection (U) and documented infection (D), and assumes that, when infected by the virus, a susceptible individual first transfers into the undocumented infectious compartment and then moves into the documented infectious compartment only if detected. Second, we take a density-dependent view of the COVID-19 infection development and characterize undocumented infections and social reinforcement in COVID-19 contagion. Third, we incorporate probabilistic mechanisms to model the density likelihood-based prevalence, unknown infections, and the uncertain and noisy conditions of COVID-19 data. Lastly, Bayesian inference is applied to approximate the solution of SUDR. To capture the imperfect and noisy statistics of COVID-19 data, we formulate the model as a probabilistic extension with certain priors and solve it by sampling from the mean-field posterior distribution.
Figure 1 illustrates the SUDR rationale of modeling the undocumented and asymptomatic infections and the social interactions between infected individuals (in red) and susceptible individuals (in green) in COVID-19. We assume all infections are undocumented at the beginning. Then, some of them transition to documented once they are confirmed by COVID-19 tests. Since the majority of infected symptomatic individuals are identified as documented infections and then quarantined, they have a low probability of further infecting other susceptible individuals. Hence, we assume only the undocumented infectious individuals can infect the susceptible, and there are safe interactions between uninfected susceptible individuals and unsafe interactions with asymptomatic infections. More interactions and denser contacts with asymptomatic infections increase the chance of being infected. Accordingly, the central green nodes in scenarios (a) and (c) share the same probability of being infected since they have the same density of unsafe interactions and close contacts with the infected. However, more unsafe interactions, as shown in (b), increase the infection probability of the susceptible individuals, showing social reinforcement and cluster infection in COVID-19 transmission [liu2020cluster]. As a result, the infection rate of the central green node in scenario (b) is much higher than that in scenario (a) (e.g., three times as high if the effect is linearly additive). Thus, SUDR models the transmission rate as a function of the undocumented infection density.
In summary, this work offers the following insights and contributions in modeling COVID-19:

A susceptible-undocumented infectious-documented infectious-recovered model, SUDR, explicitly captures the undocumented infections corresponding to asymptomatic infections and unknown contagion, which are often missed in COVID-19 modeling.

A probabilistic density-dependent infection function models both the COVID-19 uncertainty w.r.t. the infection rate over the density of undocumented infections and the exogenous contagion reinforcement through social interactions, addressing the gaps left by assuming a constant or time-dependent infection rate.

Bayesian inference with a mean-field method solves the SUDR optimization and also copes with the poor COVID-19 data conditions, including uncertainty, noise and sparsity.
We empirically verify the effectiveness of our method in detecting undocumented infections with COVID-19 data from different countries and with data subject to noise and sparsity. Experimental results show that our model outperforms the classic SIR model, the time-dependent SIR model, and the probabilistic SIR model on the COVID-19 data.
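To make the compartment flow described above concrete, the following minimal Python sketch advances the four compartments by one day. It is an illustration under stated assumptions, not the formulation used in this work: the names `linear_infection_rate` and `sudr_step`, the parameter values, the linearly additive rate function, and the choice that removal happens only from the documented compartment are all hypothetical.

```python
def linear_infection_rate(density, base_rate=0.5):
    """Hypothetical linearly additive rate: the per-susceptible infection
    hazard grows in proportion to the undocumented infection density."""
    return base_rate * density


def sudr_step(s, u, d, r, infection_rate, detection_rate, removal_rate):
    """Advance S, U, D, R by one day (illustrative discretization only).

    Assumptions: only undocumented cases (U) infect the susceptible (S),
    detection moves U to D, and removal moves D to R.
    """
    n = s + u + d + r
    density = u / n                                  # undocumented prevalence
    new_infections = infection_rate(density) * s     # S -> U
    new_detections = detection_rate * u              # U -> D
    new_removals = removal_rate * d                  # D -> R
    return (
        s - new_infections,
        u + new_infections - new_detections,
        d + new_detections - new_removals,
        r + new_removals,
    )


# Example: 10 undocumented infections in a population of 1000.
state = sudr_step(990.0, 10.0, 0.0, 0.0, linear_infection_rate, 0.2, 0.1)
```

Under the linear assumption, tripling the undocumented density triples the hazard, which mirrors the comparison between scenarios (a) and (b) in Figure 1; a nonlinear `infection_rate` could be passed in instead to express stronger or weaker reinforcement.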
Results
We evaluate the ability of the SUDR model to detect undocumented infections under imperfect conditions (e.g., reporting noise and under-reported numbers in the publicly available data) with real-world 60-day COVID-19 data from 11 typical European countries [flaxman2020estimating], a subset of the global COVID-19 case dataset reported by JHU CSSE [dong2020interactive], which records the worldwide daily case numbers (including confirmed, recovered, and death case numbers). Here, we only extract the initial period (the first 60 days) of the COVID-19 outbreak in these countries, since this early stage is more likely to contain undocumented cases and its epidemic dynamics are more challenging to model and control. In general, the first waves and the resurgences with new variants are often more challenging to model and to intervene in, since they usually come with limited COVID-19 tests and test coverage and poor knowledge and awareness of the COVID-19 complexities, including transmissions, incubation periods, mutated attributes, and differences from the original strains. At this stage, many cases may only be documented after obvious symptoms appear and sufficient test kits are available, thus incurring a larger proportion of undocumented infections.
Inferring undocumented infections
As discussed earlier, there are often a large number of undocumented or unreported infected cases, including asymptomatic or mildly symptomatic infections, throughout the COVID-19 transmission process. This is even more prominent in the early stage of an outbreak, due to limited tests and the lack of preparedness, and in vaccinated communities, due to enhanced immunity. Here we verify this observation.
With the documented infected case numbers, the undocumented infected case numbers in the 11 European countries are inferred by the SUDR model, as shown in Figure 2. We carry out the inference over the first two months from the beginning of the COVID-19 outbreak in each country for case studies and evaluation. While undocumented infections may exist throughout the whole process of COVID-19 transmission, under-reporting is even more prominent at the early stage of the outbreak due to limited tests and lack of preparedness. The specific time period of each country is shown in the third column of Table 1. In Figure 2, the posterior samples of the undocumented infections converge and the posterior samples of the documented infections fit the observations well. The results show that the undocumented infections far outnumber the documented ones in this time period (a more quantitative comparison follows below). Further, the prevalence curves of undocumented infections show a similar trend of first increasing and then decreasing in most of the fast-spreading countries, except Germany and the United Kingdom, as shown in Figure 2. This common trend of undocumented infections across countries also reflects the increasing COVID-19 test capacity, governments’ enforcement of testing, and people’s increased willingness to be tested, which is consistent with real-world scenarios.
In Figure 2, the fluctuation of the two colored curves illustrates the different stages of the epidemic contagion in the two-month period. At the initial stage of the epidemic, most countries had a limited ability to test for the COVID-19 virus. Also, due to the long incubation period and asymptomatic infections, most infected individuals may not have been tested immediately after infection. Hence, at the early stage of outbreaks, there may be a large proportion of undocumented infections, so that the green curves significantly exceed the orange ones. Then, with the increase of test availability and coverage and the enhanced public willingness to be tested, the number of undocumented infections drops gradually. If all undocumented infections were detected immediately, the curve of undocumented infections would be just a horizontal shift of the curve of documented infections, because undocumented infections become documented once detected. This is not the case in practice; however, the overall undocumented-to-documented trend shift still holds, explaining why the peak of documented infections always lags behind that of the undocumented ones in each country in Figure 2.
Further, the results in Figure 2 also show different COVID-19 transmission and evolution states in each country. For instance, COVID-19 transmission was likely under better control at the end of the first 60-day period in Austria, Denmark and Switzerland, since they had passed the peaks of both undocumented and documented daily infections. In contrast, the United Kingdom and Germany were still at their early outbreak stages, as their curves, especially the green curves, rise sharply. The rapid increase of their undocumented infections indicates an infection burst without effective interventions.
Both undocumented and documented infection case numbers evolve over time. Since the fluctuation of the documented infection case number lags behind that of the undocumented infection case number, it is hard to compare them without proper time and data alignment. Hence, we only compare their peak values. Table 1 lists the peak value of undocumented infections and the peak value of documented infections for each country. If a curve is still increasing and has not reached its summit, we simply replace the peak value with the maximum value. For documented infections, the observed maximum daily active case number in that period is listed in the fourth column, while for undocumented infections, we compute the mean peak value from the samples (the green curves shown in Figure 2) inferred by the SUDR model. The 95% confidence interval is also given along with each mean peak value of undocumented infections. The last column shows the ratio of the peak undocumented infections to the peak documented infections, which reflects how large the quantitative gap is between the maximum numbers of undocumented infections and documented ones.
For most countries, the ratio ranges from around 2 to 6 in the 60-day period of the first wave of COVID-19. Some existing studies have obtained similar results [bohning2020estimating]. For example, the number of infected in Italy was estimated at around 3.5 times higher than that reported at the end of February 2020. However, two outliers are identified in the results: 12.86 (Germany) and 10.88 (the United Kingdom), which are much larger than the average estimated ratio. This is because, in the initial stage, the increase of documented infections lags behind the evolving undocumented infections. When the peak value of undocumented infections is compared with an initial value of documented infections, the ratio becomes larger than the actual value. We also notice that the active undocumented infections gradually decrease to a low level once the first wave is finally under control.
Overall, Figure 2 shows that detecting undocumented infections and inferring their relationship with documented infections can provide reliable indications about the COVID-19 contagion in the first two months of COVID-19 outbreaks. Table 1 further shows the quantitative peak values of documented and undocumented infections. The ratio gives an intuitive measure of the gap between the reported and unreported infections. These results may assist in understanding the infection movement, forecasting the increase of detected infection cases, and initiating and adjusting the corresponding mitigation policies. In addition, since no individual indicator paints a whole picture of the evolving documented or undocumented cases, readers should cross-reference all indicators to infer more comprehensive and trustworthy insights for making intervention policies and choosing the corresponding control measures.
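The peak comparison described above can be sketched in a few lines of Python. This is a hedged illustration rather than the code behind Table 1: the function name `peak_ratio`, the array layout (one row per posterior sample, one column per day), and the use of a percentile-based 95% interval are assumptions.

```python
import numpy as np


def peak_ratio(undocumented_samples, documented_series):
    """Summarize peak undocumented infections against peak documented ones.

    undocumented_samples: array of shape (n_samples, n_days), one inferred
        undocumented-infection curve per posterior sample (assumed layout).
    documented_series: array of shape (n_days,), observed active cases.
    """
    samples = np.asarray(undocumented_samples, dtype=float)
    documented = np.asarray(documented_series, dtype=float)

    # Peak (or maximum, if the curve is still rising) of each sampled curve.
    peaks = samples.max(axis=1)
    mean_peak = peaks.mean()

    # Percentile-based 95% interval over the sampled peaks (an assumption).
    lower, upper = np.percentile(peaks, [2.5, 97.5])

    # Ratio of mean undocumented peak to observed documented maximum.
    ratio = mean_peak / documented.max()
    return mean_peak, (lower, upper), ratio
```

Because both quantities are reduced to their maxima, the two curves do not need to be aligned in time, which is the reason the comparison above is restricted to peak values.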
Inferring the epidemic attributes
The main attributes describing the COVID-19 epidemic include the infection rate, the detection rate, and the removal rate. Here, we infer them by SUDR on the reported data in the 11 European countries.
First, the infection rate is one of the most important epidemiological attributes for describing the transmission and reproduction features of COVID-19. In existing studies, the infection rate is typically modeled as a constant or a time-varying variable. However, this assumption cannot accurately reflect the characteristics and complexities of COVID-19 transmission processes (see the discussion in Section ‘Introduction’). In fact, cluster infection is a prominent characteristic of COVID-19 spreading, and the virus transmission routes and circumstances usually involve household, local community and nosocomial infections, etc. [liu2020cluster, song2020clinical]. Considering this particular epidemiological feature, we model the infection rate as a density-varying (or prevalence-varying) complex function in the SUDR model, which provides a much better capacity to capture the COVID-19 complexities. However, it is difficult to obtain an accurate closed-form solution for the complex prevalence-varying infection rate function. The reasons include: the micro-level transmission mechanism and its functional form are unknown; and the infection rate can only be inferred at discrete points (i.e., the observed prevalence of reported infection), which are extremely sparse. Hence, we summarize some important statistical characteristics of the infection rates sampled by our model over the inferred undocumented infection densities and present them as a box-and-whisker plot in Figure 3.
The spreading of the SARS-CoV-2 virus in the initial stage shows different transmission dynamics with changing infection rates among the 11 European countries. The box plot depicts what the distribution of the infection rate may look like. As shown in Figure 3, countries like Austria, Germany, Spain and Switzerland have relatively higher average infection rates (23.2, 23.1, 21.9 and 21.0, respectively) compared with France and Sweden (11.3 and 12.9, respectively). In addition, the variation range is reflected by the minimum, the lower quartile, the upper quartile, and the maximum. Since the infection prevalence is defined on the domain [0, 1], whereas the observed densities are usually close to 0 and never reach 1 [hebert2020macroscopic], it can also be inferred that the larger the variation range, the more sensitive the complex contagion function is to the infection density.
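The statistics that such a box plot displays can be computed directly from sampled rates, as in the following sketch. The function name `box_summary` and the example values are hypothetical; the whiskers here are taken at the minimum and maximum, matching the description above, which may differ from the defaults of a given plotting library.

```python
import numpy as np


def box_summary(rate_samples):
    """Return the mean and five-number summary of sampled infection rates."""
    samples = np.asarray(rate_samples, dtype=float)
    minimum, q1, median, q3, maximum = np.percentile(
        samples, [0, 25, 50, 75, 100]
    )
    return {
        "mean": float(samples.mean()),
        "min": float(minimum),
        "lower_quartile": float(q1),
        "median": float(median),
        "upper_quartile": float(q3),
        "max": float(maximum),
        # Variation range, read here as the spread between the extremes.
        "range": float(maximum - minimum),
    }


# Illustrative values only, not taken from Figure 3.
summary = box_summary([18.0, 21.5, 23.0, 24.5, 29.0])
```

Comparing the `range` entry across countries gives a rough numerical counterpart to the visual comparison of box heights.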
Lastly, in addition to the infection rate, SUDR also infers two other epidemiological attributes from the data: the detection rate and the removal rate. As shown in Table 2, the detection rate indicates the average COVID-19 test ability and test coverage in a country. The higher the detection rate, the faster the undocumented infection cases drop. For instance, the detection rates in four countries, Austria, Denmark, Spain and Switzerland, are much higher than in the others. In Figure 2, the undocumented infection cases in these four countries drop quickly until they approach the level of documented infection cases. The removal rates in these four countries are also relatively higher. Considering that most undocumented infections occur in asymptomatic or mildly symptomatic patients, who are easier to cure, the number of removed cases per unit time increases when more undocumented asymptomatic or mild infections are detected.
Country  Detection rate (mean with 95% CI)  Removal rate (mean with 95% CI)
Austria  0.97 [0.90, 1.00]  3.38 [3.14, 3.60] 
Belgium  0.68 [0.60, 0.80]  0.35 [0.25, 0.47] 
Denmark  0.91 [0.69, 1.00]  3.67 [3.17, 4.00] 
France  0.74 [0.41, 0.98]  0.77 [0.16, 1.32] 
Germany  0.56 [0.34, 0.86]  2.33 [0.72, 4.08] 
Italy  0.83 [0.75, 0.95]  1.15 [1.07, 1.23] 
Norway  0.68 [0.53, 0.92]  1.18 [1.06, 1.31] 
Spain  0.98 [0.93, 1.00]  1.21 [1.12, 1.27] 
Sweden  0.73 [0.54, 0.93]  0.15 [0.003, 0.55] 
Switzerland  0.97 [0.89, 1.00]  2.07 [1.88, 2.23] 
United Kingdom  0.70 [0.47, 0.95]  1.06 [0.14, 2.23] 
Robustness analysis
As mentioned above, the reported COVID-19 case data contain various uncertainties and quality issues, including the randomness of case reporting, statistical errors, missing undocumented infection cases, missing reports, and inconsistencies in reporting standards (e.g., different confirmation criteria, and the inclusion of suspected cases with clinical diagnosis into confirmed cases in Hubei, China on Feb 12, 2020). With such prominent uncertainties in the COVID-19 data, SUDR, as a probabilistic compartmental model, is more robust and applicable than the existing SIR and its variants, since SUDR assumes that the parameters follow a certain distribution instead of being a fixed constant or function.
Here, we evaluate the SUDR robustness through backtesting validation on the COVID-19 case numbers of Hubei province, China from Jan 12, 2020 to Mar 23, 2020, collected by JHU CSSE [dong2020interactive]. We choose this data to validate the SUDR robustness because it poses even tougher challenges. Hubei was the first place with a large-scale outbreak of COVID-19. When the epidemic started to spread, limited knowledge was available about the virus and its containment. The data also show obvious uncertainties, e.g., a sudden jump in the reported cases due to the inclusion of suspected cases with clinical diagnosis into confirmed cases on Feb 12, 2020. In comparison with data reported later, this data is more complex in its case reporting uncertainty, noise and statistics. Comparatively, the above European data may be less uncertain and noisy since some reporting mistakes have already been corrected [dong2020interactive]. As the Hubei case numbers already contain noise types such as statistical errors and missing values, here we incorporate various degrees of sparsity into the data by randomly masking some of its values, resulting in four sets: the complete data, 5% sparsity, 10% sparsity, and 20% sparsity. In this experiment, the degrees of the Bernstein polynomials of the function, the deviation hyperparameters, and the HMC parameters of SUDR are the same as those in the above experiment.
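The random masking used to create the sparse variants can be sketched as follows. This is one plausible reading of the procedure, not the exact script used for the experiment: the function name `mask_series`, the use of NaN to mark masked entries, the fixed seed, and the placeholder series are assumptions.

```python
import numpy as np


def mask_series(values, sparsity, seed=0):
    """Return a copy of `values` with a fraction `sparsity` set to NaN.

    Masked positions are drawn uniformly at random without replacement.
    """
    rng = np.random.default_rng(seed)
    masked = np.asarray(values, dtype=float).copy()
    n_masked = int(round(sparsity * masked.size))
    if n_masked > 0:
        idx = rng.choice(masked.size, size=n_masked, replace=False)
        masked[idx] = np.nan
    return masked


# Build the four data conditions from a placeholder daily series.
daily_cases = np.arange(100, dtype=float)  # stand-in for the reported counts
datasets = {
    level: mask_series(daily_cases, level)
    for level in (0.0, 0.05, 0.10, 0.20)
}
```

Fixing the seed keeps the masked positions reproducible across models, so that every baseline is evaluated on the same missing entries.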
Three baselines are chosen for the robustness comparison. First, SIR is a classic compartmental model with fundamental biological insight. Second, the time-dependent SIR [chen2020time] is an SIR with time-dependent functions that model the transmission rate and removal rate, and it applies ridge regression for the model solution. Lastly, the complex SIR [hebert2020macroscopic] is a probabilistic extension of SIR that replaces the constant transmission rate with a density-dependent function of the infection case numbers. These baselines only model the explicitly documented infections and cannot detect the undocumented infections. For the sake of fairness, the comparison experiments only test how well these models fit the reported cases under complex data conditions. The settings of the time-dependent SIR and complex SIR models are the same as in their original designs for optimal performance.
In the backtesting, given the known case numbers (including the population, the documented infection numbers, and the recovered and death case numbers), we infer the infection rate and the removal rate with these models. Then, starting from the initial values, the case number series is obtained step by step by the ODE functions of the models. The robustness and effectiveness of the models are estimated by how well the computed case number series fit the observed daily cases in the data under different noise conditions.
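The step-by-step backtesting can be illustrated with the classic SIR baseline. The sketch below is a simplified stand-in, not the evaluation code of this work: it uses a daily forward-Euler discretization, takes the rates as given rather than inferring them, and scores the fit with a root-mean-square error that skips masked days; the names `simulate_sir` and `rmse` and the choice of metric are assumptions.

```python
import numpy as np


def simulate_sir(s0, i0, r0, beta, gamma, n_days):
    """Roll the classic SIR equations forward one day at a time."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    infectious = [i]
    for _ in range(n_days - 1):
        new_infections = beta * s * i / n   # S -> I
        new_removals = gamma * i            # I -> R
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        infectious.append(i)
    return np.array(infectious)


def rmse(simulated, observed):
    """Root-mean-square error over the days that are not masked (NaN)."""
    simulated = np.asarray(simulated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    valid = ~np.isnan(observed)
    return float(np.sqrt(np.mean((simulated[valid] - observed[valid]) ** 2)))


# Illustrative rates and initial values, not inferred from the Hubei data.
curve = simulate_sir(s0=990, i0=10, r0=0, beta=0.3, gamma=0.1, n_days=60)
```

Running the same simulation against the complete and the masked series and comparing the resulting errors gives a simple numerical view of how a model degrades as sparsity increases.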
As shown in Figure 4, SUDR and the complex SIR show similar performance. SUDR performs better in the first half stage (before day 30), while the complex SIR performs better in the second half stage (after day 50). This suggests that SUDR pays more attention to the data before day 30 in the epidemiological parameter inference, while the complex SIR does just the opposite. However, both models perform better than the time-dependent SIR and the classic SIR at different levels of sparsity. With the increase of sparsity, the performance of SUDR and the complex SIR drops gradually but still exceeds that of the others. The classic SIR model (the blue curve) shows a quite different trend from the real observation data, indicating significant inaccuracy in the inferred transmission rate and removal rate. Clearly, it is not reliable to infer the trend of COVID-19 merely from the constant mean values of the transmission rate and removal rate. The time-dependent SIR model performs better than the classic SIR model, as it captures some changes in the observations and is only marginally affected by the sparsity level. However, the time-dependent SIR is fragile to noise. It is noteworthy that the Hubei data involve more confirmed cases due to the relaxed case confirmation criteria since Feb 12, 2020 [chen2020time]. This specification adjustment leads to a jump in the infectious cases around that day, as shown in Figure 4. After this adjustment cut-off point, the time-dependent SIR does not fit the actual infectious case numbers, especially in the second half stage. In summary, the probabilistic compartmental models, namely SUDR and the complex SIR, are robust enough to cope with the noise and sparsity in the data reporting.
The comparison results in Figure 4 offer some general insights. On the one hand, the compared models represent three typical directions of epidemic modeling, each reflecting an important concern in understanding the COVID-19 complexities: the epidemiological compartments (the classic compartmental model, e.g., SIR), the time dependency of case numbers (the time-dependent compartmental model, e.g., time-dependent SIR), and the uncertainty of case reporting (the probabilistic compartmental model, e.g., complex SIR and SUDR). On the other hand, the complex conditions of COVID-19 data, including missing values, statistical errors, rectification, and sparsity, must be captured in COVID-19 modeling. In addition, the results show that probabilistic compartmental models like SUDR outperform the classic and time-dependent compartmental models.
Discussion
Accurately inferring the undocumented infection case numbers of COVID-19 is one of the most challenging tasks in modeling COVID-19. The challenge comes from various uncertainties related not only to the COVID-19 epidemic, represented by sophisticated epidemiological attributes of the coronavirus (in particular, a high proportion of asymptomatic and mildly symptomatic infections that are highly contagious to the susceptible), but also to diversified data uncertainties. These issues are still apparent in the current COVID-19 resurgences, mainly caused by coronavirus mutations (such as the delta and lambda variants), and in vaccine breakthrough infections. This study proposes an inference approach from the macro-level perspective for this complex socio-technical problem. Since there is no true knowledge about the actual underlying interactions between entities in the process of COVID-19 transmission, the density-dependent infection function captures complex contagion dynamics, including social reinforcement and non-monotonic relations between the expected epidemic size and the average transmission rate, better than typical methods that model a constant or time-dependent infection rate. In contrast to the complex contagion function, we adopt a concise and plain four-compartment SIR-like model to characterize the COVID-19 transmission processes. The proposed SUDR shows a stronger generalization ability than elaborate compartmental models, which may include seven or more states. Given the lack of knowledge about the underlying contagion interactions and spread patterns, it is appropriate to design a generalized model that avoids major deviations and mismodeling errors in characterizing the actual contagion mechanisms.
The second observation from this work is that probabilistic compartmental models are a good choice for characterizing complex data conditions in COVID-19 reporting. Within Bayesian frameworks, probabilistic compartmental models outperform other mathematical epidemic models by assuming that the central epidemiological parameters follow certain distributions. This naturally captures the uncertainty in both the COVID-19 processes and the case data, which cannot be captured by typical constant-parameter models (as in the classic compartmental models SIR and SEIR) or time-varying function models (e.g., time-dependent compartmental models). In addition, probabilistic compartmental models offer better robustness and interpretability than classic and time-dependent compartmental models.
However, our work and similar probabilistic compartmental modeling leave several opportunities for enhancement in future work. First, we can hardly obtain an accurate infection function due to the extreme sparsity of the prevalence and the sampling method. How the infection rate varies with the undocumented infection density remains unknown under the current model. Second, SUDR assumes that the clusters are isomorphic and homogeneous, whereas in fact the population stratification and the interaction structure within a cluster may influence the COVID-19 contagion, which requires further study. Lastly, probabilistic compartmental models depend heavily on prior knowledge of distributions and hyperparameters, which is difficult to obtain.
Methods
SUDR is a new compartmental epidemic model embedded with Bayesian statistical methods. It jointly models the COVID-19 epidemic processes, unknown infections, social reinforcement of contagion, and imperfect data conditions.
Modeling COVID-19 transmission mechanisms
Figure 5 illustrates the SUDR model for the epidemiological compartmental characterization of COVID-19. SUDR comprises four compartments to simulate the entire transmission process, including asymptomatic infections and the transfer from the undocumented to the documented state. Accordingly, the COVID-19 transmission dynamics are formulated in Eqs. (1)-(4) over discrete time steps (each corresponding to one day in daily case reporting).
(1)  
(2)  
(3)  
(4) 
S refers to the number of susceptible individuals who are not epidemically contained and thus may be exposed to the virus at a rate given by the infection rate function. When infected, a susceptible individual transits to the undocumented infectious compartment (Eq. (1)). The subpopulation involved in the epidemic is assumed to be a part of the entire population (this is particularly applicable to the first COVID-19 waves and to new resurgences after full zero-infection containment). As superspreading events (SSEs) and cluster infections are common in the COVID-19 pandemic [xu2020reconstruction, ryu2020effect], not all people in the entire population are susceptible, particularly when they stay geographically far away from the epicenter or adopt effective self-protection measures (e.g., wearing face masks or staying at home). In other words, SUDR does not include such individuals in the epidemic transmission processes to be modeled. Accordingly, we assume that only a fraction of the entire population is involved in the active epidemic shown in Figure 5.
U is the number of undocumented individuals who have contracted the virus and can thus infect susceptible individuals, for example through close contacts or household infections. They are undocumented because they may be either in an incubation period or asymptomatic. This undocumented group forms an important determinant of the pathogen’s pandemic potential, as these infections are likely undiagnosed but highly contagious [li2020substantial]. Undocumented infectious individuals, once confirmed with the virus infection (e.g., by a diagnostic test), transit at the detection rate to the documented infectious compartment D (Eq. (2)); they are then quarantined and will hardly infect other susceptible individuals. We assume that the observed cases fall in this group. People in D will then either be cured or die, and directly transit to the removed compartment R at the removal rate (see Eq. (3)). Both and are time-dependent over time . R combines both recovered and deceased individuals who are converted from the undocumented and documented infectious compartments (see Eq. (4)). We further assume that recovered individuals are immune to the virus, so that removed individuals do not further infect other people.
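The compartment flows described above can be illustrated with a minimal discrete-time sketch. This is not the authors' formulation of Eqs. (1)-(4): the update rules, the names `theta`, `beta` and `gamma`, the use of a single removal rate for both infectious compartments, and the choice to work with densities (fractions of the involved subpopulation) are all assumptions made here for illustration.

```python
def simulate_sudr(theta, beta, gamma, s0, u0, d0=0.0, r0=0.0, n_steps=100):
    """Simulate an assumed discrete-time SUDR system on densities.

    theta: callable mapping the undocumented density to an infection rate
           (hypothetical stand-in for the infection rate function)
    beta:  assumed detection rate (U -> D)
    gamma: assumed removal rate, applied to both U -> R and D -> R
    Returns a list of (s, u, d, r) tuples, one per time step.
    """
    s, u, d, r = s0, u0, d0, r0
    trajectory = [(s, u, d, r)]
    for _ in range(n_steps):
        # Flows are computed from the current state before any update.
        new_infections = theta(u) * s * u   # S -> U, only U is infectious
        new_detections = beta * u           # U -> D
        removed_from_u = gamma * u          # U -> R
        removed_from_d = gamma * d          # D -> R
        s = s - new_infections
        u = u + new_infections - new_detections - removed_from_u
        d = d + new_detections - removed_from_d
        r = r + removed_from_u + removed_from_d
        trajectory.append((s, u, d, r))
    return trajectory


# Example with a constant infection rate; all values are illustrative only.
traj = simulate_sudr(lambda u: 0.5, beta=0.1, gamma=0.05, s0=0.99, u0=0.01, n_steps=50)
```

Because every outflow from one compartment is an inflow to another, the four densities always sum to their initial total, which is a quick sanity check for any variant of these update rules.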
Modeling the unknown infections
As illustrated in Figure 1, COVID-19 infectious individuals may infect the susceptible during their incubation periods or while they are asymptomatic, and both scenarios are unobserved. In addition, a large proportion of asymptomatic infections cannot be detected immediately. These unknown infections make it very challenging to trace and contain infections before the infected show symptoms and infect other people, leading to a significant delay in treating the infected and mitigating the spread of contagion. To address the incubative and asymptomatic infections, we partition the infectious population into undocumented and documented infectious individuals. The undocumented could be in incubation or asymptomatic, and we assume that all COVID-19 infections are initially undocumented. Those who develop symptoms and are diagnosed are detected and transfer to the documented compartment at the detection rate.
We further assume that only undocumented infectious individuals are infectious to the susceptible, since those detected are likely quarantined and can then hardly infect the susceptible without close contact. Undocumented infectious individuals may have a much higher probability than the documented of interacting with the susceptible when they have minimal symptoms or are unaware of their infection. This assumption is consistent with reality, especially at the early stage of the COVID-19 outbreak, when both viral testing and effective protection were limited.
Modeling the contagion reinforcement
The contagion of COVID-19 may be reinforced through unsafe social interactions, so COVID-19 can be regarded as a complex, socially reinforced contagion network. When a susceptible individual is infected, their close contacts have a higher probability of being infected. The infections of close contacts are in turn passed on to their own contacts. Consequently, the population infection probability increases nonlinearly with the density of infected neighbours in a chained way. This explains the commonly seen cluster infections, such as those in local communities like households, parties and hospitals, which dominate the spread of COVID-19.
SUDR thus models this COVID-19 contagion reinforcement, which may be caused by various contagious factors. We model the transmission rate as a function of the density of the infected population, inspired by [hebert2020macroscopic]. Compared with assuming a time-dependent transmission rate in epidemic modeling, a density-dependent transmission rate function can more reasonably characterize the social reinforcement of COVID-19 contagion and provide better interpretability of the dominating cluster infections.
Modeling data uncertainty, sparsity and noise
To model the aforementioned COVID-19 data quality issues, including noise, sparsity and randomness, we incorporate Bayesian inference into SUDR. For this, we refer to the density of documented infections at time t as the COVID-19 prevalence used for the measurement. This density is very close to 0 due to the large population size. By assuming that the population is well mixed, the likelihood of the prevalence can be obtained as:
(5) 
The state set comprises the susceptible, undocumented infectious, documented infectious, and removed people at time t, together with the corresponding initial state. The noise component is shown in Eq. (6); it is a normal distribution whose mean is the density of the infectious individuals in the state set at time t, with a given standard deviation.
(6)
Since there is no closed-form solution for Eq. (5), we take a mean-field approximation approach to the inference. Similar to the inference in [hebert2020macroscopic], we only consider the largest contribution in Eq. (5), leading to
(7) 
where the mean-field term is the time series of the density of infectious individuals, computed from Eqs. (1)-(4) given the initial condition.
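As a rough illustration of the mean-field likelihood idea, the sketch below scores an observed prevalence series against a deterministic mean-field trajectory using the normal noise model described for Eq. (6). The function name, the argument names, and the assumption that daily observations are conditionally independent given the trajectory are ours, not taken from the paper.

```python
import math


def prevalence_log_likelihood(observed, mean_field, sigma):
    """Gaussian log-likelihood of observed prevalence around a mean-field series.

    observed:   observed prevalence values, one per day
    mean_field: prevalence predicted by the deterministic compartmental
                dynamics for the same days (hypothetical input)
    sigma:      assumed standard deviation of the observation noise
    """
    if len(observed) != len(mean_field):
        raise ValueError("observed and mean_field must have the same length")
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Log of the normal density's normalizing constant, shared by all days.
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(
        log_norm - 0.5 * ((y - m) / sigma) ** 2
        for y, m in zip(observed, mean_field)
    )
```

A trajectory that tracks the data closely receives a higher log-likelihood than one that deviates from it, which is the signal a sampler uses to weigh candidate parameter values.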
With the prevalence likelihood, we further obtain the posterior distribution given the prevalence data:
(8) 
Before sampling, we specify priors for the unknown quantities in the likelihood. We first parameterize the infection rate function, since we cannot directly place priors on functions. Bernstein polynomials are adopted for the parameterization as shown in Eq. (9), which expresses the infection rate function as a Bernstein polynomial of a given degree with a set of coefficients.
(9) 
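For readers unfamiliar with Bernstein polynomials, the following sketch evaluates the standard Bernstein form, the sum over k of c_k * C(K, k) * x^k * (1 - x)^(K - k), for an argument x in [0, 1]. The function name `bernstein_rate` and the convention of passing the K + 1 coefficients as a list are illustrative assumptions; how the paper scales the density before evaluating the polynomial is not specified here.

```python
from math import comb


def bernstein_rate(x, coeffs):
    """Evaluate a Bernstein polynomial with the given coefficients at x.

    x:      argument in [0, 1], e.g. a (possibly rescaled) infection density
    coeffs: list of K + 1 coefficients c_0, ..., c_K for a degree-K polynomial
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    degree = len(coeffs) - 1
    return sum(
        c * comb(degree, k) * x ** k * (1.0 - x) ** (degree - k)
        for k, c in enumerate(coeffs)
    )
```

Two properties make this basis convenient for placing priors: the polynomial equals the first coefficient at x = 0 and the last at x = 1, and non-negative coefficients guarantee a non-negative rate over the whole interval.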
The SUDR model summary
In summary, the SUDR model for inferring the COVID-19 prevalence at time t is as follows:
(10) 
where the mean-field function returns a time series of prevalence given a contagion function parameterized by the coefficients of a Bernstein polynomial of the chosen degree, the detection rate, the removal rate, and the initial conditions. Since there is no information about the initial cases, we assume that the initial conditions follow the distributions:
(11) 
Figure 6 further shows the probabilistic graphical model of SUDR, where the grey circle refers to the observed data, namely the reported infections; the white circles stand for variables to be inferred by the model. The hyperparameter is represented by the black dot, and the capital letter in the box indicates the number of the variables contained in the box. The probabilistic graphical model clearly demonstrates the dependency relationship between variables.
Model implementation
SUDR is implemented in the Stan probabilistic programming language for statistical inference [gelman2015stan]. The Hamiltonian Monte Carlo (HMC) algorithm is adopted to generate samples from the posterior distribution in Eq. (8). The observed daily infectious case numbers are divided by the corresponding population of each country to obtain the density (the prevalence). For simplicity, we set the fraction of the population involved in the epidemic to a constant value of 0.01 in our experiments, indicating that 1% of the whole population of the country is involved in the epidemic transmission process. We use a low degree for the Bernstein polynomial of the infection rate function, since a low-degree Bernstein polynomial performs well enough for the inference. The deviation hyperparameters in Eq. (10) are set to fixed values. For the HMC algorithm, the default four chains are adopted for sampling. Other sampling parameters, such as the number of iterations and the control parameters, are adjusted for each country until convergence.
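The preprocessing step described above, converting daily case counts into a prevalence series, can be sketched as follows. The function name and the sample numbers are hypothetical and are not data from the paper.

```python
def to_prevalence(daily_cases, population):
    """Convert daily infectious case counts to densities of the population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return [cases / population for cases in daily_cases]


# Illustrative numbers only: three days of counts in a population of one million.
prevalence = to_prevalence([0, 50, 100], 1_000_000)
```

The resulting values are very small for realistic populations, which matches the earlier remark that the prevalence is close to 0.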