Data-driven COVID-19 forecasts and detection of unreported cases
Forecasting the effect of COVID-19 is essential to design policies that may prepare us to handle the pandemic. Many methods have already been proposed, particularly, to forecast reported cases and deaths at country-level and state-level. Many of these methods are based on traditional epidemiological model which rely on simulations or Bayesian inference to simultaneously learn many parameters at a time. This makes them prone to over-fitting and slow execution. We propose an extension to our model SIkJα to forecast deaths and show that it can consider the effect of many complexities of the epidemic process and yet be simplified to a few parameters that are learned using fast linear regressions. We also present an evaluation of our method against seven approaches currently being used by the CDC, based on their two weeks forecast at various times during the pandemic. We demonstrate that our method achieves better root mean squared error compared to these seven approaches during majority of the evaluation period. Further, on a 2 core desktop machine, our approach takes only 3.18s to tune hyper-parameters, learn parameters and generate 100 days of forecasts of reported cases and deaths for all the states in the US. The total execution time for 184 countries is 11.83s and for all the US counties (> 3000) is 101.03s.READ FULL TEXT VIEW PDF
Accurate forecasts of COVID-19 is central to resource management and bui...
Using a hybrid of machine learning and epidemiological approaches, we pr...
The number of Covid-19 cases is increasing dramatically worldwide. There...
The study aimed to forecast the total earnings lost of the tourism indus...
Since the beginning of the COVID-19 pandemic, many dashboards have emerg...
During the COVID-19 pandemic, a significant effort has gone into develop...
Accurate forecasts for COVID-19 are necessary for better preparedness an...
Data-driven COVID-19 forecasts and detection of unreported cases
COVID-19 has severely affected global health and economy. Accurate forecasts of the effect of such a pandemic is essential to design policies regarding resource management and economic decisions. Center for Disease Control is using around 20 different death forecasting approaches for this purpose, including our method [CDCforecasts]. Many of these models are based on the traditional SEIR model [li1999global], and yet they produce different results. This suggests that the choice of the training approach is critical in generating accurate forecasts. Further, fast training and forecasting is desirable so that the policy-makers can generate forecasts for different future scenarios for many regions (countries, states, counties) to choose policies and evaluate possible effects.
We extend our prior model SIkJ [srivastava2020learning] to forecast deaths and demonstrate that it provides accurate and fast predictions. Often epidemic models are trained through numerical solutions to differential equations [cintron2020estimation] or through Bayesian inference [lekone2006statistical, dukic2012tracking]. These training approaches may be prone to overfitting as they attempt to simultaneously learn many parameters of a non-linear model without providing learnability guarantees, with the exception of few works [srivastava2020data, ducrot2020identifying, magal2018parameter]. Instead, we transform our model into a system of simple linear models for learning the parameters. We show that our model considers many complexities such as unreported cases due to any reason (asymptomatic, mild symptoms, willingness to get tested), immunity (if any) or complete isolation, and reporting delay, and yet, it can be reduced to a system of two linear equations which can be fitted one after the other resulting in fast yet reliable forecasts. We compare our approach with 7 other models that are being used for forecasting deaths in the United States by the CDC [CDCforecasts]. We show that our approach outperforms all the other methods during majority of the evaluation period (May 10, 2020 – June 28, 2020). On a 2 core desktop machine, our approach takes only 3.18s to tune hyper-parameters, learn parameters and generate 100 days of forecasts of reported cases and deaths for all the states in the US. The total execution time for 184 countries is 11.83s and for all the US counties ( 3000) is 101.03s.
Our US state-level forecasts and global country-level forecasts can be found on our interactive web-page111https://scc-usc.github.io/ReCOVER-COVID-19/, which is updated 3-4 times a week. All the codes for generating the forecasts and performing comparisons with other methods are available on github222https://github.com/scc-usc/ReCOVER-COVID-19.
In [srivastava2020learning], we proposed the SI-kJ model for the spread of a virus like COVID-19 across the world which captures (i) temporally varying infection rates (ii) arbitrary regions, and (iii) human mobility patterns. Within every region (hospital/city/state/country), an individual can exist in either one of two states: susceptible and infected. A susceptible individual gets infected when in contact with an infected individual at a rate depending on when that individual got infected, i.e., rate of infection is for an individual infected between and , for an individual infected between and , and so on, thus resulting in sub-states of infection.
is a hyperparameters introduced for a smoothing effect to deal with noisy data. It also avoids over-fitting the model by using a smallto capture dependency on the last days. The hypothesis is that how actively one passes on the infection is affected by when they get infected. We assume that after being infected for a certain time, individuals no longer spread the infection, i.e., , such that .
Also, people traveling from other regions can increase the number of infections in a given region. Suppose represents mobility from region to region . Further extensions were proposed in [srivastava2020data]
to account for the fraction of the population that is either immune or completely isolated, thus not participating in the epidemic (non-carriers). The probability of reporting a case is also accounted for. A case may not have been reported if a person is asymptomatic or has mild symptoms, or has symptoms but decides not to get tested. Our model is represented by the following system of recurrence relations for a regionat time .
Equation 1 suggests that the number of susceptible individuals is determined by the difference between the population that is not immune/isolated and cumulative infections so far. Here represents the fraction of population out of total population that is immune/isolated. Equation 2 suggests that an infected case is reported with probability after a constant lag , resulting in observed reported cases. Equation 3 describes how new infections are created after time . It consists of infections from the same region (community spread) accounting for heterogeneous infection rates . The new infections may also be created by incoming travel (travel spread). Parameter captures the influence of passengers coming into the region. Observe that the the new infections at time are dependent only on the new infections in the last time units. Therefore, the model implicitly assumes that after days an individual is not infectious, for any reason such as recovery, death or being quarantined.
We extend this model to measure number of deaths as a function of infected cases:
Equation 4 suggests that new deaths occur at time as individuals infected between time and with probability . Note that we do not require or . For state-level forecasts when many states have experienced a rapid rise of the pandemic, travel is being avoided. Therefore, we may assume that the community spread is dominant and travel spread can be ignored. Ignoring travel spread in Equation 3 and combining with 1, we get
Combining the above with Equation 2 and setting , we get
Assuming that the time period between contracting the virus and reporting of the positive case is shorter than contracting the virus and death, we can rewrite Equation 7 as:
Figure 1 summarizes the transformation of the model that we started with and its simplification.
The learning of parameters is described next.
We consider and as learnable parameters, and and are treated as hyper-parameters. We let each region have independent hyper-parameters. The reason for considering as a hyper-parameter is to keep the model linear during training. We have tested the model in a non-linear setting and it produces worse results. Further, these parameters may not provide meaningful insights in a non-linear setting – as shown in [srivastava2020data], learning the true value of can be done reliably only under certain condition. We used the results obtained in [srivastava2020data] for some US states () and assume that all states will stay in the same range. We observed that generally produced good results, however, the difference in short-term forecasts was not significant when changing . We perform a grid search for and , while ensuring that . This is along the lines of the motivation for 14 days of quarantine333https://www.cdc.gov/coronavirus/2019-ncov/travelers/after-travel-precautions.html. The value of below 7 was not considered driven by the observed weekly periodicity in the data [ricon2020seven], suggesting that we should smooth over at least 7 days. These hyper-parameters are allowed to be different for different states. The values of and can also be selected by a grid search. However is allowed to be higher. We observed that and generally performed well. Note that the evaluations to identify best hyper-parameters were performed on a held-out portion of training data (validation set), and no part of test data is used. For each state, the evaluation on the validation set was performed using the root mean squared error (RMSE) of the predictions on the last days of training data.
where is the observed value of reported cases for state at time .
We learn the parameters as described in [srivastava2020learning], which incorporates the fast evolving trend of COVID-19 due to changing policies using weighted least squares. The best fit in the least-squares sense minimizes the sum of squared weighted residuals, i.e., the difference between observed data and predicted values provided by our learned model. The weighing scheme is determined by a hyper-parameter to put more weight to the recent infection trend when learning the model. Lower implies more emphasis on the more recent data. We minimize the sum of squares of weighted errors as
Where is given by the linear equation 6. For learning , we assume that the death rates do not evolve very rapidly and can be learned considering a window of the last time steps. This introduces another hyper-parameter, which we find to generally perform well at . We minimize the unweighted least squared error to obtain .
where is the true number of new deaths at time in region . Note that before we attempt to learn the parameters, we perform a moving average smoothing over 7 days to reduce noise due to periodicity in reporting. This is performed in irrespective of the smoothing hyper-parameter .
We picked those methods as baselines from COVID-19 Forecast Hub444https://github.com/reichlab/covid19-forecast-hub/tree/master/data-processed that had their reports available every Sunday in the month of May and June. This day was chosen based on CDC’s definition of a week [CDCweek]. Here we provide an overview of the baselines.
The Shaman Group from Columbia university uses a meta-population county-level SEIR model [pei2020initial]
to make COVID-19 projections in the US based on various contact rates between counties. SELECT is a selection of their forecast reports that tries to fit the realistic contact rates during different time periods given current observations and planned intervention policies. The method considers county-to-county commuters and random visitors, and formulates the transmission during day-time and night-time separately. They use a combination of iterated filtering (IF) and the Ensemble Adjustment Kalman Filter (EAKF)[anderson2001ensemble] framework [ionides2006inference, king2008inapparent] to calibrate the model’s parameters. They integrate their model equations using a Poisson process to represent the stochasticity of the transmission process.
The Johns Hopkins ID Dynamics COVID-19 Group has developed a scenario modeling pipeline [Lemaitre2020.06.11.20127894] for the epidemic projection. The pipeline contains three module: 1) module 1 is an epidemic seeding module that incorporates the first case appearance in data and air travel import model; 2) module 2 is a disease transmission model that takes in the seeding information and simulates a county-level meta-population model with stochastic SEIR disease dynamics; and 3) module 3 is a health outcomes projection model that takes in consideration of realistic time delays between infection, symptoms, hospitalization, intensive care, ventilator use, and death to model deaths and hospitalizations in the population. As pointed out by the authors [Lemaitre2020.06.11.20127894], the model does not account for asymptomatic transmissions.
UCLA Statistical Machine Learning Lab555https://covid19.uclaml.org/ has proposed a variant of the SEIR model called SuSEIR [zou2020epidemic]. It accounts for the untested and unreported COVID-19 cases. The model can be described by the following ODE:
The key claim of the SuEIR model is that it can infer the untested cases as well as unreported cases, although a discussion on learnability [srivastava2020data]
is not provided. More specifically, they treat the “Exposed” group in the SEIR model as the individuals that have already been infected and have not been tested, which are also capable of infecting the “Susceptible” group. Moreover, some of the people in the “Exposed” group can get tested and transferred to the “Infectious” group (which are reported to the public), while the rest of them will recover/die and transit to the so-called “Unreported Recovered” group (which are not reported to the public). They use gradient-based optimizers to minimizing a loss function composed of squared errors on infected and removed cases.
Lily Wang’s Research Group from Iowa State University666https://covid19.stat.iastate.edu has developed a non-parametric spatio-temporal epidemic transmission model [wang2020spatiotemporal]. They have designed a space-time epidemic modeling framework by including area-level characteristics to the traditional SIR model. The model assumes that there are local characteristics of a given area that will not vary with time. It further assumes that such characteristics will influence the daily new cases in the local area and its surrounding areas.
COVID-19 Simulator is developed to simulate the trajectory of COVID-19 in US state-level by researchers from Mass General Hospital, Harvard Medical School, Georgia Tech and Boston Medical Center777https://covid19sim.org/. The compartment model is defined using the SEIR model with continuous time progression. The model parameters are learned by first defining sensible ranges on the parameters and then using simulated annealing888https://covid19sim.org/images/docs/COVID-19_simulator_methodology_download_20200507.pdf. The model can simulate three intervention strategies: minimal restriction, current intervention in each state, and complete lockdown. The forecast reports we use is the simulation of the current intervention in each state.
CovidActNow is a team of computer scientists, epidemiologists, health experts, and public policy leaders999https://covidactnow.org/. They use an SEIR model to forecast trajectory of COVID-19 in US state-level. Their forecasts rely on fitting predicted cases, deaths, and hospitalizations to the observations using known ranges of the parameters and maximum-likelihood optimization.
YYG_ParamSearch uses an SEIR model101010https://covid19-projections.com/. It fixes some of the parameters that it considers to be known such as the latency of reporting, infectious period, time between illness onset to hospitalization, time between illness onset to death, hospital stay time, and time to recovery. It learns the basic reproduction number , mortality rate, and the reproduction numbers in various mitigation scenarios using a grid-search, evaluated using means squared error.
We use three sources of data for positive cases and number of deaths in the US states.
JHU: The JHU CSSE COVID19 dataset [JHUdata]. This dataset is used by YYG_ParamSearch and Covid19Sim_Simulator.
NYT: The New York Times dataset [NYTdata]. This dataset is used by UCLA_SuEIR, COVIDActNow_SEIR_CAN, and IowaStateLW_STEM.
USF: The US Facts dataset [USFdata]. This dataset is used by JHU_IDD_CovidSP and CU_select.
The three data sources have some inconsistencies as shown in Figure 1(a) as three heatmaps obtained by subtracting the data of NYT from JHU (JHU NYT), USF from NYT (NYT USF), and JHU from USF (USFJHU). The y-axis represents the days of the evaluation period and the x-axis represents the states. It can be observed that the discrepancy is limited to few states, but it can be significant – over 150 deaths. Due to such inconsistencies, we present separate evaluation for methods using different data sources. Figure 1(b) shows the number of states and deaths covered by various ranges of deviations (absolute value) in data. For a fixed value of the deviation , we pick the states such that the range of the number of deaths for that state at any given point in the evaluation period is less than . We observe that all three datasets are consistent across 31 states () that approximately cover approximately 33.7% of total deaths at the end of our evaluation period (June 28, 2020). For these 31 states, we are able to compare all the methods together.
|Step||US states||184 countries||3000 counties|
|Hyper-parameter selection||2.80 s||10.46 s||2.81 s|
|Reported cases param learning||0.17 s||0.55 s||29.01 s|
|Reported cases forecasts (100 days)||0.07 s||0.33 s||43.46 s|
|Deaths param learning||0.09 s||0.29 s||8.37 s|
|Death forecasts (100 days)||0.08 s||0.20 s||17.38 s|
|Total||3.18 s||11.83 s||101.03 s|
We implemented our approach in Matlab R2020a on an Intel(R) Core(TM) i3-3220, 3.30GHz CPU (2 cores) running Windows 10 with 8GB RAM. Since our approach reduces to fitting two linear functions, the parameters can be learned very quickly. Table 1 shows the runtimes of all the steps involved in generating forecasts – hyper-parameter tuning, parameter learning for reported cases and deaths, and forecasting reported cases and deaths for 100 days. For all the states in the US, the entire process takes 3.18s. This enables extremely fast scenario analysis. We do not include hyper-parameter tuning for death forecasting, as so far, we have the hyper-parameters fixed, and did not see a significant improvement by identifying new hyper-parameters every time. To demonstrate scalability of our approach we have also presented the runtimes for global forecasts for 184 countries, and that for more than 3000 US counties. Our approach takes 11.83s to for the whole process for 184 countries and 101.03s for the US counties. Note that we do not attempt to learn different hyper-parameters for different counties. Instead, we pick the hyper-parameters corresponding to the state that county belongs to. We have observed that picking hyper-parameters independently for regions with small number of infections (which is the case with most counties) can lead to over-fitting.
|Daily RMSE||Weekly RMSE|
|Methods||JHU||NYT||USF||No Conflict||JHU||NYT||USF||No Conflict|
We evaluate the incident (new) death forecasts using the Root Mean Squared Error (RMSE). We consider two windows of forecasts, daily and weekly. The provided forecasts are daily which can be aggregated over a week to evaluate weekly forecasts. For a given date of forecast release, we compare the true incident deaths to the forecasted deaths over the next 14 days, i.e, over points for weekly evaluation and over points for daily evaluation. The RMSE is calculated separately for each state and averaged to get the final error. Mathematically, for a given set of states , we evaluate
denotes the forecasted death in the period (day/week) and denotes the observed incident deaths.
Table 2 provides a comparison of all the methods based on RMSE measured daily and weekly, averaged ovber the evaluation period. Our approach (SIkJ) produces the best forecasts in all the datasets based on daily RMSE. For RMSE calculated weekly, UCLA_SuEIR performs slightly better in case of NYT dataset. For other datasets, our approach tops the list. When considering the 31 states which do not show any conflict among the three datasets, our approach performs the best based on weekly RMSE, while UCLA_SuEIR outperforms our approach by a close margin. Overall, in most cases, our approach produces the best RMSE, and in other cases it is close to the performance of best performing baseline.
Figures 3 shows the performance of all the forecasts released weekly based on daily RMSE aggregated over two weeks. Note that a higher RMSE on one week compared to another does not imply worse performance as higher ground truth may lead to higher RMSE. These plots should be assessed based on relative performance on every day of forecast release rather than absolute numbers across the weeks. We observe that our approach SIkJconsistently performs well over the evaluation period on all three datasets.
To observe the effect of data on the performance, we split the dataset into two halves - high death states consisting of the top 25 states with most deaths at the start of the evaluation, and low death states constituting the rest of the states. The results are shown in Figures 4 and 5 . We observe that the relative performance all the methods are closer to each other on high death states compared to low death states. This is expected as states with higher deaths are likely to average out noise/anomalies, making it easier to train models and forecast.
We also evaluated all the methods on the 31 states for which there is no conflict among the three datasets (see Figure 10). We compare this against the errors obtained by all the methods on the three datasets calculated on each forecast day. We observe that the relative errors of the methods are closer to each other when considering the states with no conflict. This suggests that the inconsistency in reporting may have introduced additional noise making it difficult to train the models.
We repeated the above experiments for the forecasts released weekly based on weekly RMSE aggregated over two weeks. We observe similar trends in all the figures as noted above, with the exception of YYG_ParamSearch and UCLA_SuEIR performing slightly better than our approach on some forecasts. Our online interactive website has a dedicated page to update and show the comparison our approach with the baselines111111https://scc-usc.github.io/ReCOVER-COVID-19/#/leaderboard.
Based on the learned values of at a time for a region , we can identify the dynamic reproduction number [huang2020dynamic] of the epidemic in that region, defined as the number of new infections created by one infected individual in the remaining susceptible population. Suppose, the individual is infected at time , then according to Equations 3, they remain active for time steps. From to they will contribute to new infections at the rate of , from to at the rate of , and so on. Therefore, the total infections created by one individual is:
Note that if we assume that the new reported cases in Equation 6 in the last time steps are roughly equal, then
which suggests that the traditional interpretation of reproduction number applies to our model as well – if the disease will die out, as it will create fewer new infections. The basic reproduction number, that assumes the entire population to be susceptible, is .
Case fatality rate is given by the fraction of reported positive cases that end up in death. A simple way of calculating this is by taking the ratio of total deaths and total reported cases. However, this does not consider the lag between reporting positive and then dying, and also does not account for recent changes in the dynamics. The learned values of can be used to compute model-based case fatality rate (MCFR). According to Equation 8, an individual reported positive at time dies on each of the time steps among to with probability , to , and so on. Therefore, MCFR can be calculated as expected number of new deaths created by one reported case as
The weekly updated reproduction numbers and model-based case fatality rates for all US states and all the countries can be found on our webpage121212https://scc-usc.github.io/ReCOVER-COVID-19/#/score.
This work was supported by National Science Foundation Award No. 2027007.