The coronavirus pandemic has rapidly evolved into an unprecedented crisis. The susceptible-infectious-removed (SIR) model and its variants have been used for modeling the pandemic. However, time-independent parameters in the classical models may not capture the dynamic transmission and removal processes, governed by virus containment strategies taken at various phases of the epidemic. Moreover, very few models account for possible inaccuracies of the reported cases. We propose a Poisson model with time-dependent transmission and removal rates to account for possible random errors in reporting and estimate a time-dependent disease reproduction number, which may be used to assess the effectiveness of virus control strategies. We apply our method to study the pandemic in several severely impacted countries, and analyze and forecast the evolving spread of the coronavirus. We have developed an interactive web application to facilitate readers' use of our method.
Coronaviruses are enveloped single-stranded positive-sense RNA viruses belonging to the broad family Coronaviridae and are widely harbored in animals [18, 31, 56]. Most coronaviruses cause only mild respiratory infections, but SARS-CoV-2, a newly identified member of the coronavirus family, initiated the contagious and lethal coronavirus disease 2019 (COVID-19) in December 2019 [53, 54]. After the detection of the first case in Wuhan, the COVID-19 pandemic evolved into a global crisis within only four months. As of May 26, 2020, the virus has infected more than 5.5 million individuals, caused about 348,000 deaths [25], and altered the lives of billions of people.
The pandemic has been closely monitored by the international community. For example, the World Health Organization (WHO) and Johns Hopkins University’s Coronavirus Resource Center [25] have, since the outbreak, reported the daily numbers of infectious and recovered cases, and deaths for nearly every country. The governmental websites of many countries, such as Australia, the US, and Singapore, have also been tracking these numbers starting from various time points. These websites have become valuable resources to help advance the understanding of the spread of the virus. We have access to a time-series data repository on GitHub (https://github.com/ulklc/covid19-timeseries), which consolidates and updates information obtained from these data sources. Our data analysis is based on the data obtained from this GitHub data repository.
Much effort has been devoted by the affected countries to battling the disease. However, the crisis is not over, with new infections detected every day. To forecast when the pandemic will be brought under control and to evaluate the effects of virus control measures, it is imperative to develop appropriate models to describe and understand the trend of the pandemic [21, 20, 7, 40].
The susceptible-infectious-removed (SIR) model was utilized to explain the rapid rise and fall of the infected individuals from the epidemics of severe acute respiratory syndrome (SARS), influenza A virus subtype (H1N1) and Middle East respiratory syndrome (MERS) [29, 49, 24, 34, 19]. The key idea is to divide a total population into three compartments: the susceptible, $S$, who are healthy individuals capable of contracting the disease; the infectious, $I$, who have the disease and are infectious; and the removed, $R$, who have recovered from the disease and gained immunity or who have died from the disease [5]. The model assumes a one-way flow from susceptible to infectious to removed, and is reasonable for infectious diseases that are transmitted from human to human and for which recovery confers lasting resistance [13]. SIR models originated from the Kermack-McKendrick model [27], consisting of three coupled differential equations to describe the dynamics of the numbers in the $S$, $I$ and $R$ compartments, which tend to fluctuate over time. For example, the number of infectious individuals increases drastically at the start of the epidemic, with a surge in susceptible individuals becoming infectious. As the epidemic develops, the number of infectious individuals decreases when more infectious individuals die or recover than susceptible individuals become infectious. The epidemic ends when the infectious compartment ceases to exist [27, 5].
SIR models and the modified versions, such as susceptible-exposed-infectious-recovered model (SEIR), were applied to analyze the COVID-19 outbreak [35, 8, 41, 59, 33]. Many of these models assume constant transmission and removal rates, which may not hold in reality. For example, as a result of various virus containment strategies, such as self-quarantine and social distancing mandates, the transmission and removal rates may vary over time [52].
Recently, a number of researchers [10, 6, 50] considered time-dependent SIR models adapted to the dynamical epidemiological processes evolving over time. However, few considered random errors in reporting, such as under-reporting (e.g. asymptomatic cases or virus mutation) or over-reporting (e.g. false positives of testing), or characterized the uncertainty of predictions.
Poisson models naturally fit count data [22]. Several works [28, 37, 60] used Poisson distributions to model the numbers of infectious and removed individuals from frequentist or Bayesian perspectives; however, most of the works only considered constant transmission and removal rates. How to extend these works to accommodate time-dependent rates remains elusive.

We propose to adopt a Poisson model to estimate the time-varying transmission and removal rates, and understand the trends of the pandemic across countries. For example, we can predict the number of the infectious persons and the number of removed persons at a certain time for each country, and forecast when the curves of cases become flattened.
An important epidemiological index that characterizes the transmission potential is the basic reproduction number, $\mathcal{R}_0$, defined as the expected number of secondary cases produced by an infectious case [16, 15, 17]. Our model leads to a temporally dynamical $\mathcal{R}_0(t)$, which measures at a given time how many people one infectious person, during the infectious period, will infect [32]. This may help evaluate the quarantine policies implemented by various authorities. A recent work [32] demonstrated that $\mathcal{R}_0$ is likely to vary “due to the impact of the performed intervention strategies and behavioral changes in the population.”
The merits of our work are summarized as follows. First, unlike the deterministic ODE-based SIR models, our method does not require transmission and removal rates to be known, but estimates them using the data. Second, we allow these rates to be time-varying. Some time-varying SIR approaches [50] directly integrate into the model the information on when governments enforced, for example, quarantine, social-distancing, compulsory mask-wearing and city lockdowns. Our method differs by computing a time-varying $\mathcal{R}_0(t)$, which gauges the status of coronavirus containment and assesses the effectiveness of virus control strategies. Third, our Poisson model accounts for possible random errors in reporting, and quantifies the uncertainty of the predicted numbers of susceptible, infectious and removed. Finally, we apply our method to analyze the data collected from the aforementioned GitHub time-series data repository. We have created an interactive web application (https://younghhk.shinyapps.io/tvSIRforCOVID19/) to facilitate users’ application of the proposed method.
We introduce a Poisson model with time-varying transmission and removal rates, denoted by $\beta(t)$ and $\gamma(t)$. Consider a population with $N$ individuals, and denote by $S(t)$, $I(t)$, $R(t)$ the true but unknown numbers of susceptible, infectious and removed, respectively, at time $t$, and by $s(t) = S(t)/N$, $i(t) = I(t)/N$, $r(t) = R(t)/N$ the fractions of these compartments.
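The following ordinary differential equations (ODE) describe the change rates of $s(t)$, $i(t)$ and $r(t)$:

$$\frac{ds(t)}{dt} = -\beta(t)\, s(t)\, i(t), \tag{1}$$

$$\frac{di(t)}{dt} = \beta(t)\, s(t)\, i(t) - \gamma(t)\, i(t), \tag{2}$$

$$\frac{dr(t)}{dt} = \gamma(t)\, i(t), \tag{3}$$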
with an initial condition: $i(0) = i_0$ and $r(0) = r_0$, where $i_0 > 0$ in order to let the epidemic develop [9]. Here, $\beta(t)$ is the time-varying transmission rate of an infection at time $t$, which is the number of infectious contacts that result in infections per unit time, and $\gamma(t)$ is the time-varying removal rate at $t$, at which infectious subjects are removed from being infectious due to death or recovery [15]. Moreover, $\gamma^{-1}(t)$ can be interpreted as the infectious duration of an infection caught at time $t$ [26].
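To make the dynamics in (1)-(3) concrete, the following is a minimal sketch that iterates a daily-step (forward-difference) version of the equations. The discretization, the function name `simulate_sir`, and the example rate functions are assumptions made for illustration; they are not taken from the authors' code.

```python
import math

def simulate_sir(beta, gamma, i0, r0, T):
    """Iterate a daily-step SIR model with time-varying rates.

    beta and gamma are functions of the day index t; the returned lists
    hold the fractions s(t), i(t), r(t) for t = 0, ..., T.
    """
    s, i, r = [1.0 - i0 - r0], [i0], [r0]
    for t in range(T):
        new_inf = beta(t) * s[t] * i[t]   # fraction newly infected on day t
        new_rem = gamma(t) * i[t]         # fraction newly removed on day t
        s.append(s[t] - new_inf)
        i.append(i[t] + new_inf - new_rem)
        r.append(r[t] + new_rem)
    return s, i, r

# Hypothetical rates: transmission decays after day 20, removal stays constant.
beta = lambda t: 0.4 if t < 20 else 0.4 * math.exp(-0.1 * (t - 20))
gamma = lambda t: 0.1

s, i, r = simulate_sir(beta, gamma, i0=1e-4, r0=0.0, T=100)
peak_day = max(range(len(i)), key=lambda t: i[t])
print("infectious fraction peaks on day", peak_day)
```

In this sketch the infectious fraction keeps growing while $\beta(t)s(t) > \gamma(t)$ and starts to fall once the decaying transmission rate drops below that threshold.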
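An important quantity derived from (1)-(3) is the time-dependent reproduction number $\mathcal{R}_0(t) = \beta(t)/\gamma(t)$, which relates the growth of the infectious fraction to that of the removed fraction. To see this, dividing (2) by (3) leads to

$$\mathcal{R}_0(t) = \frac{\beta(t)}{\gamma(t)} = \frac{1}{s(t)}\left\{\frac{di(t)/dt}{dr(t)/dt} + 1\right\}, \tag{4}$$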
where $\frac{di(t)/dt}{dr(t)/dt}$ is the ratio of the change rate of $i(t)$ to that of $r(t)$. Therefore, compared to its time-independent counterpart, $\mathcal{R}_0(t)$ is an instantaneous reproduction number and provides a real-time picture of an outbreak. For example, at the onset of the outbreak and in the absence of any containment actions, we may see a rapid ramp-up of cases compared to those removed, leading to a large ratio $\frac{di(t)/dt}{dr(t)/dt}$ in (4), and hence a large $\mathcal{R}_0(t)$. With the implemented policies for disease mitigation, we will see a drastically decreasing ratio and, therefore, a decline of $\mathcal{R}_0(t)$ over time. The turning point is the time $t_0$ such that $\mathcal{R}_0(t_0) = 1$, when the outbreak is controlled with $di(t)/dt < 0$ at $t = t_0$.
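The relation in (4) can be checked numerically. The sketch below uses a daily-step analogue of (4), replacing the derivatives by one-day increments; this discretization and the helper name `r0_from_increments` are assumptions for illustration only.

```python
def r0_from_increments(s, i, r):
    """Daily-step analogue of (4): R0(t) = (delta_i / delta_r + 1) / s(t)."""
    out = []
    for t in range(len(i) - 1):
        delta_i = i[t + 1] - i[t]
        delta_r = r[t + 1] - r[t]
        out.append((delta_i / delta_r + 1.0) / s[t])
    return out

# Short trajectory generated with hypothetical constant rates beta = 0.3, gamma = 0.1.
beta, gamma = 0.3, 0.1
s, i, r = [0.99], [0.01], [0.0]
for t in range(5):
    new_inf, new_rem = beta * s[t] * i[t], gamma * i[t]
    s.append(s[t] - new_inf)
    i.append(i[t] + new_inf - new_rem)
    r.append(r[t] + new_rem)

r0_path = r0_from_increments(s, i, r)
print([round(x, 6) for x in r0_path])
```

Because the trajectory was generated with $\beta/\gamma = 3$, every entry of `r0_path` should recover 3 up to floating-point error.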
To fit the model and estimate the time-dependent parameters, we can use nonparametric techniques, such as splines [14, 45, 30, 2, 3, 55], local polynomial regression [46] and the reproducing kernel Hilbert space method [1]. In particular, we consider a cubic B-spline approximation [42].
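Denote by $B_1(t), \ldots, B_q(t)$ the $q$ cubic B-spline basis functions over $[0, T]$ associated with a given set of knots. As $q$ is unknown, we use cross-validation to choose it in our numerical experiments. We specify

$$\log \beta(t) = \sum_{j=1}^{q} \beta_j B_j(t), \qquad \log \gamma(t) = \sum_{j=1}^{q} \gamma_j B_j(t). \tag{8}$$

When $\beta_1 = \cdots = \beta_q$ and $\gamma_1 = \cdots = \gamma_q$, the model reduces to a constant SIR model [42].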
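The reduction to a constant SIR model follows from the fact that B-spline basis functions sum to one at every point of the interval. A minimal sketch, assuming a clamped knot vector, a log-linear form for the rate, and hypothetical knot positions (the helper names `bspline_basis` and `log_linear_rate` are illustrative, and the basis is built with the Cox-de Boor recursion rather than the R `bs` function):

```python
import math

def bspline_basis(x, knots, degree=3):
    """Evaluate all B-spline basis functions at x via the Cox-de Boor recursion.

    knots must be a non-decreasing clamped knot vector, i.e. each boundary
    knot is repeated degree + 1 times.
    """
    n_intervals = len(knots) - 1
    # Degree-0 basis: indicator of each half-open knot interval.
    basis = [1.0 if knots[j] <= x < knots[j + 1] else 0.0
             for j in range(n_intervals)]
    if x == knots[-1]:
        # Close the last non-empty interval so the right endpoint is covered.
        last = max(j for j in range(n_intervals) if knots[j] < knots[j + 1])
        basis[last] = 1.0
    for d in range(1, degree + 1):
        nxt = []
        for j in range(n_intervals - d):
            left_den = knots[j + d] - knots[j]
            right_den = knots[j + d + 1] - knots[j + 1]
            left = (x - knots[j]) / left_den * basis[j] if left_den > 0 else 0.0
            right = ((knots[j + d + 1] - x) / right_den * basis[j + 1]
                     if right_den > 0 else 0.0)
            nxt.append(left + right)
        basis = nxt
    return basis

def log_linear_rate(t, coefs, knots):
    """Rate whose logarithm is a linear combination of the basis functions."""
    return math.exp(sum(c * b for c, b in zip(coefs, bspline_basis(t, knots))))

# Hypothetical setup: 60 days, interior knots at days 20 and 40, so q = 6.
knots = [0.0] * 4 + [20.0, 40.0] + [60.0] * 4
equal_coefs = [-1.2] * 6
rates = [log_linear_rate(t, equal_coefs, knots) for t in (0.0, 7.5, 20.0, 33.3, 60.0)]
print(rates)
```

With equal coefficients every evaluated rate equals `exp(-1.2)`, which is the constant-rate special case; unequal coefficients let the rate bend smoothly between knots.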
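Denote by $\beta = (\beta_1, \ldots, \beta_q)^T$ and $\gamma = (\gamma_1, \ldots, \gamma_q)^T$ the unknown parameters, by $Z_I(t)$ and $Z_R(t)$ the reported numbers of infectious and removed, respectively, and by $z_I(t) = Z_I(t)/N$ and $z_R(t) = Z_R(t)/N$ the reported proportions. Also, denote by $I(t)$ and $R(t)$ the true numbers of infectious and removed, respectively, at time $t$. We propose a Poisson model to link $Z_I(t)$ and $Z_R(t)$ to $I(t)$ and $R(t)$ as follows:

$$Z_I(t) \sim \mathrm{Poisson}\{I(t)\}, \qquad Z_R(t) \sim \mathrm{Poisson}\{R(t)\}. \tag{9}$$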
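We also assume that, given $I(t)$ and $R(t)$, the observed daily numbers $\{Z_I(t), Z_R(t)\}$ are independent across $t = 1, \ldots, T$, meaning the random reporting errors are “white” noise. We note that (9) is directly based on “true” numbers of infectious cases and removed cases derived from the discrete-time version of the SIR model (1)-(3). This differs from the Markov process approach, which is based on the past observations.

With the discrete-time SIR model and (8), $I(t)$ and $R(t)$ are functions of $\beta$ and $\gamma$, since $I(t) = N i(t)$ and $R(t) = N r(t)$. Given the data $\{Z_I(t), Z_R(t)\}$, $t = 1, \ldots, T$, we obtain $(\hat\beta, \hat\gamma)$, the estimates of $(\beta, \gamma)$, by maximizing the following likelihood

$$L(\beta, \gamma) = \prod_{t=1}^{T} \frac{e^{-I(t)}\, I(t)^{Z_I(t)}}{Z_I(t)!} \cdot \frac{e^{-R(t)}\, R(t)^{Z_R(t)}}{Z_R(t)!}$$

or, equivalently, maximizing the log likelihood function

$$\ell(\beta, \gamma) = \sum_{t=1}^{T} \left\{ -I(t) + Z_I(t) \log I(t) - R(t) + Z_R(t) \log R(t) \right\} + C, \tag{10}$$

where $C$ is a constant free of $\beta$ and $\gamma$. See the Appendix for additional details of optimization.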
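As a rough illustration of maximizing the Poisson log-likelihood (10), the sketch below fits constant rates to synthetic counts with a coarse grid search. Everything here is an assumption for illustration: the daily-step recursion, the constant rates, the synthetic data, and the grid search, which merely stands in for a numerical optimizer such as the `nlm` routine mentioned later in the text.

```python
import math

def sir_expected_counts(beta, gamma, N, i0, r0, T):
    """Expected counts N*i(t) and N*r(t), t = 1..T, under a daily-step SIR model."""
    i, r = i0, r0
    mu_I, mu_R = [], []
    for _ in range(T):
        s = 1.0 - i - r
        # Tuple assignment: both updates use the current-day value of i.
        i, r = i + beta * s * i - gamma * i, r + gamma * i
        mu_I.append(N * i)
        mu_R.append(N * r)
    return mu_I, mu_R

def poisson_loglik(mu_I, mu_R, z_I, z_R):
    """Log-likelihood (10) without the constant C."""
    return sum(-a + x * math.log(a) - b + y * math.log(b)
               for a, b, x, y in zip(mu_I, mu_R, z_I, z_R))

# Synthetic "reported" counts: rounded expected counts at hypothetical true rates.
N, i0, r0, T = 1_000_000, 1e-4, 0.0, 30
true_I, true_R = sir_expected_counts(0.30, 0.10, N, i0, r0, T)
z_I = [round(m) for m in true_I]
z_R = [round(m) for m in true_R]

# Coarse grid search over candidate (beta, gamma) pairs.
grid = [(b, g) for b in (0.20, 0.25, 0.30, 0.35, 0.40) for g in (0.05, 0.10, 0.15)]
best = max(grid, key=lambda p: poisson_loglik(
    *sir_expected_counts(p[0], p[1], N, i0, r0, T), z_I, z_R))
print(best)
```

In the time-varying model the two scalars would be replaced by the spline coefficient vectors, and the grid by a gradient-based optimizer; the structure of the objective stays the same.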
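We then estimate the variance-covariance matrix of $(\hat\beta, \hat\gamma)$ by inverting the second derivative of $-\ell(\beta, \gamma)$ evaluated at $(\hat\beta, \hat\gamma)$. Finally, for $t = 1, \ldots, T$, we estimate $I(t)$ and $R(t)$ by $\hat I(t) = N \hat i(t)$ and $\hat R(t) = N \hat r(t)$, where $\hat i(t)$ and $\hat r(t)$ are obtained from the discrete-time SIR model with all unknown quantities replaced by their estimates; estimate $\beta(t)$ and $\gamma(t)$ by $\hat\beta(t)$ and $\hat\gamma(t)$, obtained by using (8) with $(\beta, \gamma)$ replaced by $(\hat\beta, \hat\gamma)$; and estimate $\mathcal{R}_0(t)$ by $\hat{\mathcal{R}}_0(t) = \hat\beta(t)/\hat\gamma(t)$.

Summary of estimation and inference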
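Estimation: Let $N$ be the size of population of a given country. The date when the first case was reported is set to be the starting date with $t = 1$, and the initial values $i_0$ and $r_0$ are set to the reported proportions $z_I(1)$ and $z_R(1)$.
The observed data are $\{Z_I(t), Z_R(t)\}$, $t = 1, \ldots, T$.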
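Inference: The estimated variance-covariance matrix of $(\hat\beta, \hat\gamma)$, denoted by $\hat V$, can be obtained by inverting the second derivative of $-\ell(\beta, \gamma)$ evaluated at $(\hat\beta, \hat\gamma)$. For each $t$, as $\hat\beta(t)$, $\hat\gamma(t)$, $\hat{\mathcal{R}}_0(t)$, $\hat I(t)$ and $\hat R(t)$ are smooth functions of $\hat\beta$ and $\hat\gamma$, we apply the delta method [38] to estimate their variances and obtain the confidence intervals. As an illustration, we compute

$$\widehat{\mathrm{Var}}\{\hat I(t)\} = D_I(t)^T\, \hat V\, D_I(t) \quad \text{and} \quad \widehat{\mathrm{Var}}\{\hat R(t)\} = D_R(t)^T\, \hat V\, D_R(t),$$

where $D_I(t)$ and $D_R(t)$ are the partial derivative vectors of $\hat I(t)$ and $\hat R(t)$ with respect to $(\beta, \gamma)$, evaluated at $(\hat\beta, \hat\gamma)$.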
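The two inference steps, inverting the second derivative of the negative log-likelihood and applying the delta method, can be sketched on a toy problem. The example below is an assumption-laden stand-in, not the SIR likelihood: it uses two independent Poisson samples with log-rates as parameters, a finite-difference Hessian (in place of the `numDeriv` routines), and a ratio of rates as the smooth function of interest, loosely mirroring $\beta/\gamma$.

```python
import math

def numerical_hessian(f, theta, h=1e-4):
    """Central-difference Hessian of f at theta (a list of floats)."""
    p = len(theta)
    H = [[0.0] * p for _ in range(p)]
    for a in range(p):
        for b in range(p):
            def shifted(da, db):
                x = list(theta)
                x[a] += da
                x[b] += db
                return f(x)
            H[a][b] = (shifted(h, h) - shifted(h, -h)
                       - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)
    return H

def invert_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def delta_method_se(gradient, cov):
    """Standard error sqrt(g' V g) of a smooth function of the parameters."""
    p = len(gradient)
    return math.sqrt(sum(gradient[a] * cov[a][b] * gradient[b]
                         for a in range(p) for b in range(p)))

# Toy data: two Poisson samples; parameters are the log-rates.
counts_a, counts_b = [3, 5, 4], [10, 8]

def neg_loglik(theta):
    return (len(counts_a) * math.exp(theta[0]) - sum(counts_a) * theta[0]
            + len(counts_b) * math.exp(theta[1]) - sum(counts_b) * theta[1])

theta_hat = [math.log(4.0), math.log(9.0)]   # closed-form maximizers (sample means)
cov = invert_2x2(numerical_hessian(neg_loglik, theta_hat))

ratio = math.exp(theta_hat[0] - theta_hat[1])   # function of interest: rate_a / rate_b
grad = [ratio, -ratio]                          # its gradient in the log-rates
se = delta_method_se(grad, cov)
lower, upper = ratio - 1.96 * se, ratio + 1.96 * se
print(round(ratio, 4), round(lower, 4), round(upper, 4))
```

For this toy problem the Hessian is diagonal with entries equal to the two sample totals, so the numerical result can be checked against the closed-form variances 1/12 and 1/18.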
Since the first case of COVID-19 was detected in China, it quickly spread to nearly every part of the world [25]. COVID-19, conjectured to be more contagious than the previous SARS and H1N1 [47], has put great strain on healthcare systems worldwide, especially among the severely affected countries [51]. We apply our method to assess the epidemiological processes of COVID-19 in some severely impacted countries.
The country-specific time-series data of confirmed, recovered, and death cases were obtained from a GitHub data repository website (https://github.com/ulklc/covid19-timeseries). This site collects information from various sources listed below on a daily basis at GMT 0:00, converts the data to the CSV format, and conducts data normalization and harmonization if inconsistencies are found. The data sources include
World Health Organization (WHO): https://www.who.int/
DXY.cn. Pneumonia 2020: http://3g.dxy.cn/newh5/view/pneumonia.
National Health Commission of China (NHC):
http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml
China CDC (CCDC): http://weekly.chinacdc.cn/news/TrackingtheEpidemic.htm
Hong Kong Department of Health: https://www.chp.gov.hk/en/features/102465.html
Macau Government: https://www.ssm.gov.mo/portal/
Government of Canada:
https://www.canada.ca/en/public-health/services/diseases/coronavirus.html
Australia Government Department of Health:
https://www.health.gov.au/news/coronavirus-update-at-a-glance
European Centre for Disease Prevention and Control (ECDC):
https://www.ecdc.europa.eu/en/geographical-distribution-2019-ncov-cases
Ministry of Health Singapore (MOH): https://www.moh.gov.sg/covid-19
Italy Ministry of Health: http://www.salute.gov.it/nuovocoronavirus
Johns Hopkins CSSE: https://github.com/CSSEGISandData/COVID-19
WorldoMeter: https://www.worldometers.info/coronavirus/
In particular, the current population size of each country, $N$, came from the website of WorldoMeter. Our analyses covered the periods between the date of the first reported coronavirus case in each nation and May 26, 2020. In the beginning of the outbreak, assessment of $i_0$ and $r_0$ was problematic as infectious but asymptomatic cases tended to be undetected due to lack of awareness and testing. To investigate how our method depends on the correct specification of the initial values $i_0$ and $r_0$, we conducted Monte Carlo simulations. As a comparison, we also studied the performance of the deterministic SIR model in the same settings. Figure 1 shows that, when the initial value was mis-specified to be 5 times the truth, the curves of $I(t)$ and $R(t)$ obtained by the deterministic SIR model were considerably biased. On the other hand, our proposed Poisson model (9), by accounting for the randomness of the observed data, was robust toward the mis-specification of $i_0$ and $r_0$: the estimates of $I(t)$ and $R(t)$ had negligible biases even with mis-specified initial values. In an omitted analysis, we mis-specified $i_0$ and $r_0$ to be only twice the truth, and obtained similar results.
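The sensitivity of a deterministic SIR curve to its initial value can be seen in a small sketch. This is not the Monte Carlo study described above: it only propagates a daily-step SIR recursion with hypothetical constant rates from a correct and a five-fold inflated initial infectious fraction, and compares the resulting curves.

```python
def sir_infectious_path(beta, gamma, i0, r0, T):
    """Infectious fraction i(0..T) from a daily-step SIR model with constant rates."""
    i, r, path = i0, r0, [i0]
    for _ in range(T):
        s = 1.0 - i - r
        i, r = i + beta * s * i - gamma * i, r + gamma * i
        path.append(i)
    return path

# Hypothetical rates and initial values chosen only for illustration.
true_path = sir_infectious_path(0.3, 0.1, i0=1e-5, r0=0.0, T=120)
wrong_path = sir_infectious_path(0.3, 0.1, i0=5e-5, r0=0.0, T=120)

ratio_day10 = wrong_path[10] / true_path[10]
peak_true = max(range(121), key=lambda t: true_path[t])
peak_wrong = max(range(121), key=lambda t: wrong_path[t])
print(round(ratio_day10, 3), peak_true, peak_wrong)
```

In the early exponential phase the error in the initial value carries through almost unchanged, and the mis-specified curve peaks earlier, which is the kind of bias a deterministic fit cannot absorb.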
Our numerical experiments also suggested that using the time series, starting from the date when both cases and removed were reported, may generate more reasonable estimates.
Using the cubic B-splines (8), we estimated the time-dependent transmission rate $\beta(t)$ and removal rate $\gamma(t)$, based on which we further estimated $\mathcal{R}_0(t)$, $I(t)$ and $R(t)$. To choose the optimal number of knots for each country when implementing the spline approach, we used 5-fold cross-validation by minimizing the combined mean squared error for the estimated infectious and removed cases.
Figure 2 shows sharp variations in transmission rates and removal rates across different time periods, indicating the time-varying nature of these rates. The estimated $\hat I(t)$ and $\hat R(t)$ overlapped well with the observed number of infectious and removed cases, indicating the reasonableness of the method. The pointwise 95% confidence intervals (in yellow) represent the uncertainty of the estimates, which may be due to error in reporting. Figure 3 presents the estimated time-varying reproduction number, $\hat{\mathcal{R}}_0(t)$, for several countries. The curves capture the evolving trends of the epidemic for each country.
In the US, though the first confirmed case was reported on January 20, 2020, lack of immediate actions in the early stage let the epidemic spread widely. As a result, the US had seen soaring infectious cases, and reached its peak around mid-March. From mid-March to early April, the US tightened the virus control policy by suspending foreign travels and closing borders, and the federal government and most states issued mandatory or advisory stay-home orders, which seemed to have substantially contained the virus.
The high reproduction numbers for China, Italy, and Sweden at the onset of the pandemic imply that the spread of the infectious disease was not well controlled in its early phases. With the extremely stringent mitigation policies such as city lockdown and mandatory mask-wearing implemented at the end of January, China was reported to bring its epidemic under control with a quickly dropping $\mathcal{R}_0(t)$ in February. This indicates that China might have contained the epidemic, with more people removed from infectious status than those who became infectious.
Sweden is among the few countries that imposed more relaxed measures to control coronavirus and advocated herd immunity. The Swedish approach has initiated much debate. While some criticized that this may endanger the general population in a reckless way, some felt this might terminate the pandemic more effectively in the absence of vaccines [39]. Figure 3 demonstrates that Sweden has a large reproduction number, which however keeps decreasing. The “big V” shape of the reproduction number around May 1 might be due to the reporting errors or lags. Our investigation found that the reported number of infectious cases in that period suddenly dropped and then quickly rose back, which was unusual.
Around February 18, a surge in South Korea was linked to a massive cluster of more than 5,000 cases [57]. The outbreak was clearly depicted in the time-varying $\mathcal{R}_0(t)$ curve. Since then, South Korea appeared to have slowed its epidemic, likely due to expansive testing programs and extensive efforts to trace and isolate patients and their contacts [36].
More broadly, Figure 3 categorizes countries into two groups. One group features the countries which have contained coronavirus. Countries such as China and South Korea took aggressive actions after the outbreak and presented sharper downward slopes. Some European countries such as Italy and Spain and Middle Eastern countries such as Iran, which were hit later than the East Asian countries, share a similar pattern, though with much flatter slopes. On the other hand, the US, Brazil, and Sweden are still struggling to contain the virus, with the $\mathcal{R}_0(t)$ curves still hovering over 1. We also caution that, among the countries with $\mathcal{R}_0(t)$ dropping below 1, the curves of the reproduction numbers are beginning to tick up, possibly due to the resumed economic activities.
We have developed an interactive web application (https://younghhk.shinyapps.io/) to facilitate users’ application of the proposed method to compute the time-varying reproduction number, and to estimate and predict the daily numbers of active cases and removed cases for the presented countries and other countries; see Figure 4 for an illustration.
Our code was written in R [44], using the bs function in the splines package for cubic B-spline approximation, the nlm function in the stats package for nonlinear minimization, and the jacobian function in the numDeriv package for computation of gradients and Hessian matrices. Graphs were made by using the ggplot2 package. Our code can be found on the aforementioned shiny website.
The rampaging pandemic of COVID-19 has called for developing proper computational and statistical tools to understand the trend of the spread of the disease and evaluate the efficacy of mitigation measures [23, 48, 58, 43]. We propose a Poisson model with time-dependent transmission and removal rates. Our model accommodates possible random errors and estimates a time-dependent disease reproduction number, $\mathcal{R}_0(t)$, which can serve as a metric for timely evaluating the effects of health policies.
There have been substantial issues, such as biases and lags, in reporting infectious cases, recovery, and deaths, especially at the early stage of the outbreak. As opposed to the deterministic SIR models that heavily rely on accurate reporting of initial infectious and removed cases, our model is more robust towards mis-specifications of such initial conditions. Applications of our method to study the epidemics in selected countries illustrate the results of the virus containment policies implemented in these countries, and may serve as the epidemiological benchmarks for the future preventive measures.
Several methodological questions need to be addressed. First, we analyzed each country separately, without considering the traffic flows among these countries. We will develop a joint model for the global epidemic, which accounts for the geographic locations of and the connectivity among the countries.
Second, incorporating the timing of public health interventions such as the shelter-in-place order into the model might be interesting. However, we opted not to follow this approach as no such information exists for the majority of countries. On the other hand, the impact of the interventions or the change point can be embedded into our nonparametric time-dependent estimates.
Third, the validity of the results of statistical models eventually hinges on the data transparency and accuracy. For example, the results of Chinazzi et al. [11] suggested that in China only one of four cases was detected and confirmed. Also, asymptomatic cases might have been undetected in many countries. All of these might have led to underestimation of the actual number of cases. Moreover, the collected data could be biased toward patients with severe infection and with insurance, as these patients were more likely to seek care or get tested. More in-depth research is warranted to address the issue of selection bias.
Finally, our present work is within the SIR framework, where removed individuals include recovery and deaths, who hypothetically are unlikely to infect others. Although this makes the model simpler and widely adopted, the interpretation of the parameter is not straightforward. Our subsequent work is to develop a susceptible-infectious-recovered-deceased (SIRD) model, in which the number of deaths and the number of recovered are separately considered. We will report this elsewhere.
Containment of COVID-19 requires the concerted effort of health care workers, health policy makers as well as citizens. Measures, e.g. self-quarantine, social distancing, and shelter in place, have been executed at various phases by each country to prevent the community transmission. Timely and effective assessment of these actions constitutes a critical component of the effort. SIR models have been widely used to model this pandemic. However, constant transmission and removal rates may not capture the timely influences of these policies.
We propose a time-varying SIR Poisson model to assess the dynamic transmission patterns of COVID-19. With the virus containment measures taken at various time points, $\mathcal{R}_0(t)$ may vary substantially over time. Our model provides a systematic and daily updatable tool to evaluate the immediate outcomes of these actions. It is likely that the pandemic is ending and many countries are now shifting gear to reopen the economy, while preparing to battle the second wave of virus attack [4, 12]. Our tool may shed light on and aid the implementation of future containment strategies.
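To maximize (10), we differentiate $\ell(\beta, \gamma)$ with respect to $(\beta, \gamma)$. Then $(\hat\beta, \hat\gamma)$ solves the following estimating equations:

$$\sum_{t=1}^{T} \left[ \left\{ \frac{Z_I(t)}{I(t)} - 1 \right\} \frac{\partial I(t)}{\partial \theta} + \left\{ \frac{Z_R(t)}{R(t)} - 1 \right\} \frac{\partial R(t)}{\partial \theta} \right] = 0, \qquad \theta \in \{\beta_1, \ldots, \beta_q, \gamma_1, \ldots, \gamma_q\},$$

where the involved partial derivatives can be computed recursively. Specifically, taking partial derivatives on both sides of the discrete-time version of (1)-(3) yields, for $t = 0, \ldots, T-1$, recursions that express the partial derivatives of $i(t+1)$ and $r(t+1)$ in terms of those of $i(t)$ and $r(t)$. Here, the partial derivatives of $\beta(t)$ and $\gamma(t)$ with respect to the spline coefficients follow from (8), and the partial derivatives of $i(0)$ and $r(0)$ are zero by using the initial conditions.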
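As a worked sketch of these recursions, suppose the discrete-time model takes the daily-step form $i(t+1) = i(t) + \beta(t)s(t)i(t) - \gamma(t)i(t)$ and $r(t+1) = r(t) + \gamma(t)i(t)$ with $s(t) = 1 - i(t) - r(t)$, and that the rates are log-linear in the B-spline basis. Both forms are assumptions made here for illustration, and $\theta$ denotes any single spline coefficient. Differentiating each side then gives:

```latex
\begin{align*}
\frac{\partial i(t+1)}{\partial\theta}
  &= \{1 + \beta(t)s(t) - \gamma(t)\}\,\frac{\partial i(t)}{\partial\theta}
   + i(t)\left\{ s(t)\,\frac{\partial\beta(t)}{\partial\theta}
   + \beta(t)\,\frac{\partial s(t)}{\partial\theta}
   - \frac{\partial\gamma(t)}{\partial\theta} \right\}, \\
\frac{\partial r(t+1)}{\partial\theta}
  &= \frac{\partial r(t)}{\partial\theta}
   + \gamma(t)\,\frac{\partial i(t)}{\partial\theta}
   + i(t)\,\frac{\partial\gamma(t)}{\partial\theta}, \\
\frac{\partial s(t)}{\partial\theta}
  &= -\frac{\partial i(t)}{\partial\theta} - \frac{\partial r(t)}{\partial\theta}, \\
\frac{\partial\beta(t)}{\partial\beta_j} &= \beta(t)B_j(t), \qquad
\frac{\partial\gamma(t)}{\partial\gamma_j} = \gamma(t)B_j(t), \qquad
\frac{\partial\beta(t)}{\partial\gamma_j} = \frac{\partial\gamma(t)}{\partial\beta_j} = 0, \\
\frac{\partial i(0)}{\partial\theta} &= \frac{\partial r(0)}{\partial\theta} = 0 .
\end{align*}
```

Starting from the zero derivatives at $t = 0$, one forward pass over $t = 0, \ldots, T-1$ produces every derivative needed in the estimating equations, with $\partial I(t)/\partial\theta = N\,\partial i(t)/\partial\theta$ and $\partial R(t)/\partial\theta = N\,\partial r(t)/\partial\theta$.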