Since the beginning of the reported cases in December 2019, the outbreak of COVID-19 has spread globally within weeks. On March 11, 2020, the World Health Organization (WHO) deemed COVID-19 to be a pandemic (WHO, 2020), and on March 24, they warned that the U.S. could be the next epicenter of the global coronavirus pandemic. The reported confirmed cases in the U.S. have soared in the following weeks. The coronavirus is spreading from the biggest cities in the U.S. to its suburbs, and it has begun encroaching on the nation’s rural regions. According to the New York Times, as of April 24 8:01 A.M. EST, there are now at least 867,105 confirmed cases and 44,464 deaths from COVID-19 in the U.S.
An essential question for developing a defense against COVID-19 is how far the virus will spread and how many lives it will claim. It is not clear to anyone where this crisis will lead us. Understanding the dynamics of the disease is therefore undoubtedly critical. One way to answer these questions is through scientific modeling. Several attempts have been made to model and forecast the spread and mortality of COVID-19 (Sun et al., 2020; Zhang et al., 2020; Fanelli and Piazza, 2020; Elmousalami and Hassanien, 2020; Kucharski et al., 2020).
In epidemiology, the fundamental concept of infectious disease is the investigation of how infections spread. Mathematical methods, such as the class of susceptible-infectious-recovered (SIR) models (Weiss, 2013; Chen et al., 2020; Lawson et al., 2016; Allen et al., 2008; Pfeiffer et al., 2008; Wakefield et al., 2019)
, are widely used in epidemics to capture the dynamic process of the spread of the infectious disease. However, pure mathematical modeling is deterministic, and only demonstrates the average behavior of the epidemic. In addition, its focus is often on the form of models, not the parameter estimation for observed data; thus, it is difficult to quantify the uncertainty. Instead, statistical models provide a varying characterization of different types of errors. When it comes to analyzing the reported numbers of infectious diseases, other factors may also be responsible for temporal or spatial patterns. The spread of the disease varies a lot across different geographical regions. Local area features, like socioeconomic factors and demographic conditions, can dramatically influence the course of the epidemic. These data are usually supplemented with the population information at the county level. In addition, the capacity of the health care system, and control measures, such as government-mandated social distancing, also have a significant impact on the spread of the epidemic. SIR models with assumptions of random mixing can overestimate the health service needed by not taking into account the behavioral change and government-mandated action. In this paper, we propose a class of novel nonparametric dynamic epidemic models to analyze the infectious disease data by incorporating the spatiotemporal structure and the effect of explanatory variables.
In this paper, we borrow the mechanistic rules from the SIR model by including three compartments: infected, susceptible and removed states, and develop a class of data-driven statistical models to reconstruct the spatiotemporal dynamics of the disease transmission. We build a novel space-time epidemic modeling framework for the infected count data, to study the spatial-temporal pattern in the spread of COVID-19 at the county level. The proposed methodology can be used to dissect the spatial structure and dynamics of spread, as well as to assess how this outbreak may unfold through time and space.
Given an parametric epidemic model, the typical inference problem involves estimating the parameters associated with the parametric models from the data to hand. Such specifications are ad hoc, and if misspecified, can lead to substantial estimation bias problems. In practice, this question might be addressed by considering alternative parametric models, or sensitivity analyses if some of the underlying model parameters are assumed to be known. Nonparametric approaches to fitting epidemic models to the data have received relatively little attention in the literature possibly due to the lack of data. By allowing the infection to depend on time and location, we consider a generalized additive varying coefficient model to estimate the unobserved process of the disease transmission. By adopting a nonparametric approach, we do not impose a particular parametric structure, which significantly enhances the flexibility of the epidemic models that practitioners use. For our model estimation, we propose a quasi-likelihood approach via the penalized spline approximation and the iteratively reweighted least squares technique.
Prediction models for COVID-19 at the county-level that combine local characteristics and actions are very beneficial for the community to understand the dynamics of the disease spread and support decision making at a time when they are urgently needed. Models can help predict rates of new infections, and estimate when the strain on the hospital system could peak. In this paper, we consider both the short-term and long-term impact of the virus. To assess the uncertainty associated with the prediction, we develop a projection band constructed based on the envelope of the bootstrap forecast paths, which are closest to the forecast path obtained on the basis of the original sample. Based on our research findings, we develop multiple R shiny apps embedded into a COVID-19 dashboard, which provides a 7-day forecast and a 4-month forecast of COVID-19 infected and death count at both the county level and state level.
The rest of the paper is organized as follows. Section 2 introduces our case study on COVID-19, including a detailed description of the data. Section 3 outlines the nonparametric spatiotemporal modeling framework and describes how to incorporate additional covariates. Section 4 introduces our estimation method, presents our algorithms, and discusses the details of the implementation. Section 5 starts with a description of the prediction of the infection count and provides the uncertainty with the band of the forecast path. Section 6 shows the results and findings of the case study. Section 7 concludes the paper with a discussion. The supplementary materials (Wang et al., 2020b) contain additional figures, and an animation of the estimation results.
2 COVID-19 Case Study and Data
2.1 Research Goal of the Study
The goal of this study is threefold. First, we develop a new dynamic epidemic modeling framework for public health surveillance data to study the spatial-temporal pattern in the spread of COVID-19. We aim to investigate whether the proposed model could be used to guide the modeling of the dynamic of the spread at the county level by moving beyond the typical theoretical conceptualization of context where a county’s infection is only associated with its own features. Second, to understand the factors that contribute to the spread of COVID-19, we model the daily infected cases at the county level in consideration of the demographic, environmental, behavioral, socioeconomic factors in the U.S. Third, we project the spatial-temporal pattern of the spread of the virus in the U.S. For the short-term forecast, we provide the prediction of the daily infection count and death count up to the county level. As for the long-term forecast, we project the total infected and death cases in the next three months.
2.2 Epidemic Data from the COVID-19 Outbreak in the U.S.
This study analyzes data from the reported confirmed COVID-19 infections and deaths at the county level, which are reported by the 3,104 counties from the 48 mainland U.S. states and the District of Columbia. The aggregated COVID-19 cases are from January 20 until April 25, 2020. The data are collected, compiled and cleaned from a combination of public sources that aim to facilitate the research effort to confront COVID-19, including Health Department Website in each state or region, the New York Times (NYT, 2020), the COVID-19 Data Repository by the Center for Systems Science and Engineering at Johns Hopkins University (CSSE, 2020), and the COVID Tracking Project (Atlantic, 2020). These data sources automatically updated every day or every other day. We have created a dashboard https://covid19.stat.iastate.edu/ to visualize and track the infected and death cases, which was launched on March 27, 2020.
2.3 Information of the covariates
We consider a variety of county-level characteristics as covariate information in our study, which can be divided into seven groups. The data sources and the operational definitions of these features are discussed as follows.
Policies. Government declarations are used to identify the dates that different jurisdictions implemented various social distancing policies (emergency declarations, school closures, bans on large gatherings, limits on bars, restaurants and other public places, the deployment of severe travel restrictions, and “stay-at-home” or “shelter-in-place” orders). President Trump declared a state of emergency on March 13, 2020, to enhance the federal government response to confront the COVID-19. By March 16, 2020, every state had made an emergency declaration. Since then, more severe social distancing actions have been taken by the majority of the states, especially those hardest hit by the pandemic. We compiled the dates of executive orders by checking national and state governmental websites, news articles and press releases.
Demographic Characteristics. To capture the demographic characteristics of a county, five variables are considered in the analysis to describe the racial, ethnic, sexual and age structures: the percent of the population who identify as African American, the percent of the population who identify as Hispanic or Latino, the rate of aged people ( years) per capita, the ratio of male over female and population density over square mile of land area. The former two variables were obtained from the 2010 Census (U.S. Census Bureau, 2010), and the latter three variables are extracted from the 2010–2018 American Community Survey (ACS) Demographic and Housing Estimates (U.S. Census Bureau, 2010).
Healthcare Infrastructure. We incorporated three components in our analysis to describe the healthcare infrastructure in each county: percent of the population aged less than 65 years without health insurance, local government expenditures for health per capita, and total counts of hospital beds per 1,000 population. These components measure the access for residents to public health resources within and across counties. The first component is available in the USA Counties Database (U.S. Census Bureau, 2010), the second is from Economic Census 2012 (U.S. Census Bureau, 2012), and the last is compiled from Homeland Infrastructure Foundation-level Data (U.S. Department of Homeland Security).
Socioeconomic Status. A diverse of factors are considered to describe the socioeconomic status in each county. We first apply the factor analysis to seven factors collected from the 2005–2009 ACS 5-year estimates (U.S. Census Bureau, 2010), and generate two factors: social affluence and concentrated disadvantage. To be specific, the former is comprised of the percent of families with annual incomes higher than $75,000 (factor loading = 0.86), percent of the population aged 25 years or older with a bachelor’s degree or higher (factor loading = 0.92), percent of the people working in management, professional, and related occupations (factor loading = 0.73), and the median value of owner-occupied housing units (factor loading = 0.74); whereas the latter includes the percent of the households with public assistance income (factor loading = 0.34), the percent of households with female householders and no husband present (factor loading = 0.81), and civilian labor force unemployment rate (factor loading = 0.56). These two factors, affluence and disadvantage, explain more than 60% of the variation.
We also incorporate the Gini coefficient to measure income inequality. The Gini coefficient, also known as Gini index, is a well-known measure for income inequality and wealth distribution in economics, with value ranging from zero (complete equality, where everyone has exactly the same income) to one (total inequality, where one person occupies all of the income). The 2005–2009 ACS (U.S. Census Bureau, 2010) provided the household income data that allow us to calculate the Gini coefficient.
Rural/urban Factor. In the literature, rural/urban residence has been found to be associated with the spread of epidemics. Specifically, rural counties are often characterized by poor socio-economic profiles and limited access to healthcare services, indicating a potential higher risk. To capture rural/urban residence, we use the urban rate from the 2010 Census (U.S. Census Bureau, 2010).
Geographic Information. The longitude and latitude of the geographic center for each county in the U.S. are available in Gazetteer Files (U.S. Census Bureau, 2019).
3 Space-time Epidemic Modeling
In this section, we propose a class of nonparametric space-time models to estimate the infection count at the area level. In the following, let be the number of new cases at time for area , . Also for area , let , and be the cumulative number of active infectious, death and recovered cases at time , and let be the number of cumulative confirmed cases up to time . Then, it is clear that . Further, denote the population for area , and the number of susceptible subjects at time would be . Define .
We denote be the GPS coordinates of the geographic center of area , which ranges over a bounded domain of the region under study. Let be the covariates of area that is not varying with time, see the description in Section 2.3. For example, the socioeconomic factors, health service resources, and demographic conditions. Let denotes the
th dummy variable of actions or measures taken for areaat time , and let , which varies with the time.
In this paper, we consider the exponential families of distributions. The conditional density of given can be represented as
for some known functions and , dispersion parameter and the canonical parameter . Let be the conditional expectation of given .
We assume that the determinants of the daily new cases of a certain area can be explained not only by the features of this area but also by the characteristics of the surrounding areas. Based on the idea of the SIR models, we propose a discrete-time spatial epidemic model comprising the susceptible, infected and removed states, and area-level characteristics. At time point , we assume , which is modeled via a link function as follows:
where ’s are unknown time-varying coefficients, and are unknown bivariate coefficient functions, , , are univariate functions to be estimated. The parameter in ’s denotes a small delay time allowing for the control measure to be effective. For model identifiability, we assume , . Note that illustrates the transmission rate at location , , are the mixing parameters of the contact process. The rationale for including () is to allow for deviations from mass action and to account for the discrete-time approximation to the continuous time model; see Finkenstädt and Grenfell (2000); Wakefield et al. (2019). In many cases, the standard bilinear form may not necessarily hold. The above proposed epidemic model incorporates the nonlinear incidence rates, which represents a much wider range of dynamical behavior than those with bilinear incidence rates (Liu et al., 1987). These dynamical behaviors are determined mainly by and . When and are both 1, it is corresponding to the standard assumption of homogeneous mixing in De Jong et al. (1995).
Since is the number of new cases at time for area , , Poisson or negative binomial (NB) might be an appropriate option for random component; see Yu et al. (2020), and Kim and Wang (2020). We assume that
where can be modeled via the same log link as follows:
At the beginning of the outbreak, infected and death cases could be rare, so “Poisson” might be a reasonable choice of the random component to describe the distribution of rare events in a large population. As the disease progresses, the variation of infected/death count increases across counties and states. So, at the acceleration phase of the disease, the negative binomial random component might be an appropriate option to account for the presence of over-dispersion.
The above spatiotemporal epidemic model (STEM) is developed based on the foundation of epidemic modeling, but it is able to provide a rich characterization of different types of errors for modeling the uncertainty. In addition, it accounts for both spatiotemporal nonstationarity and area-level local features simultaneously. It also offers more flexibility in assessing the dynamics of the spread at different times and locations than various parametric models in the literature.
4 Estimation of the STEM
4.1 Penalized Quasi-likelihood Method
In this section, we describe how to estimate the parameters and nonparameteric components in the proposed STEM model (2).
To capture the temporal dynamics, we consider the moving window approach. For the current time , and nonnegative smoothness parameters for , we consider the following penalized quasi-likelihood problem:
where is the window width for the model fitting, and it can can be selected by minimizing the prediction errors or maximizing the correlation between the predicted and observed values. The energy functional is defined as follows:
where is the th order derivative in the direction , , at any location .
Note that, except for parameters
, other functions are related to curse of dimensionality due to the nature of functions. To overcome this difficulty, we introduce the basis expansion for univariate and bivariate functions discussed below.
The univariate additive components and the spatially varying coefficient components in model (2) are approximated using univariate polynomial spline and bivariate penalized splines over triangulation (BPST), respectively. The BPST method is well known to be computationally efficient to deal with data distributed on complex domains with irregular shape or with holes inside; see the details in Lai and Schumaker (2007), Lai and Wang (2013) and Sangalli et al. (2013). We introduce a brief review of univariate splines and bivariate splines in the following.
Suppose that the covariate is distributed on an interval , . For , denote a partition of with interior knots. Let be the space of the polynomial splines of order , which are polynomial functions with -degree (or less) on intervals , and , and have continuous derivatives globally. Next, let , which ensures that the spline functions are centered; see Yu et al. (2020).
Let be the original B-spline basis functions for the th covariate, where is the index set of the basis functions. Let , , , then and . Suppose the nonlinear component can be well approximated by a spline function so that, for all , , where and
is a vector of coefficients.
For the bivariate coefficient functions and in the STEM model (2), we introduce bivariate spline over triangulation. The spatial domain with either an arbitrary shape or holes inside can be partitioned into finitely many triangles, , that is, , and any nonempty intersection between a pair of triangles in is either a shared vertex or a shared edge. A collection of these triangles, , is called a triangulation of the domain Lai and Schumaker (2007); Lai and Wang (2013). For a triangle in with vertices , for numbered in counter-clockwise order, we can write . Then, any point can be uniquely represented as such that , where the coefficients are called the “barycentric coordinates” of point . The Bernstein basis polynomials of degree relative to are defined as , for .
Given an integer , let be the space of all polynomials of degree on . Note that the barycentric coordinates of are all linear functions of the Cartesian coordinates, therefore, the set of Bernstein basis polynomials forms a basis for . For a triangle and coefficients , any polynomial can be uniquely written as called the B-form of relative to . Let be the space of th continuously differentiable functions over the domain . Given and a triangulation , the spline space of degree and smoothness over is defined as
For triangulation with triangles, denote a set of bivariate Bernstein basis polynomials for as , where is an index set for basis functions on triangulation with cardinality . Then, we can approximate the bivariate functions in the STEM model (2) by , where at a location point , and are the vector of bivariate basis functions and the corresponding spline coefficient vector at a time point , respectively.
In practice, the triangulation can be obtained through varieties of software; see for example, the “Delaunay” algorithm (delaunay.m in MATLAB or DelaunayTriangulation in MATHEMATICA), the R package “Triangulation” (Wang and Lai, 2019), and the “DistMesh” Matlab code. The bivariate spline basis are generated via the R package “BPST” (Wang et al., 2019).
Considering the basis expansion, for the current time , the maximization problem (3) is changed to minimize
In addition, we consider the energy functional in (4) can be approximated by , for , where is the block diagonal penalty matrix. Introducing the constraint matrix which satisfies , is a common strategy to reflect global smoothness in in (5).
Directly solving the optimization problem in (6) is not straightforward due to the smoothness constraints inside. Instead, suppose that the rank matrix is decomposed into , where is the first
columns of an orthogonal matrix, and is a matrix of zeros, which is a submatrix of an upper triangle matrix . Then, reparametrization of for some , , enforces . Thus, the constraint problem in (6) can be changed to an unconstrained optimization problem as follows:
Let , , and be the maximizers of (7) at time point . We obtain the estimators of :
the estimator of is , , and the spline estimator is , .
4.2 A Penalized Iteratively Reweighted Least Squares Algorithm
For the current time , let
be the vector of the response variable where. Denote , , and , where , and and . Let , and . In addition, let the mean vector
, the variance function matrix, the diagonal matrix with the derivative of link function as element, and the weight matrix , where .
In order to numerically solve the minimization in (7), we design the penalized iteratively reweighted least squares (PIRLS) algorithm as described below. Suppose at the th iteration, we have , and . Then at th iteration, we consider the following objective function:
Take the first order Taylor expansion of around , then
where with for . The PIRLS procedure is represented in Algorithm 1. In the numerical analysis, we set and as the initial values to start the iteration.
4.3 Modeling the Number of Fatal and Recovered Cases
To fit the proposed STEM and make predictions for cumulative positive cases, one obstacle is the lack of direct observations for the number of active cases, . Instead, the most commonly reported number is the count of total confirmed cases, . Some departments of public health also release information about fatal cases and recovered cases , while such kind of data tends to suffer from missingness, large error and inconsistency due to its difficulty in data collection; see the discussions in KCRA (2020).
Based on the fact that , we attempt to modeling and in order to facilitate the estimation and prediction of newly confirmed cases based on the proposed STEM model. Let be the new fatal cases on day , and following similar notations in the STEM model (2), we assume that
Ideally, if sufficient data for recovered cases can be collected from each area, a similar model can be fitted to explain the growth of the recovered cases. However, there are no uniform criteria to collect recovery reports across the U.S. (CNN, April 4). According to the U.S. Centers for Disease Control and Prevention, severe cases with COVID-19 often require medical care and receive supportive care in the hospital. At the same time, in general, most people with the mild illness are not hospitalized and suggested to recover at home. Currently, only a few states regularly update the number of recovered patients, but seldom can the counts be mapped to counties.
Due to the lack of data, we are no longer able to use all the explanatory variables discussed above to model daily new recovered cases. Instead, we mimic the relationship between the number of recovered and active cases from some Compartmental models in epidemiology (Anastassopoulou et al., 2020; Siettos and Russo, 2013). At current time point , we assume that , in which the recovery rate enables us to make reasonable predictions for future recovered patients counts and provide researchers with the foresight of when the epidemic will end. The rate can be either estimated from available state-level data, or obtained from prior medical studies due to the under-reporting issue in actual data.
4.4 Zero-inflated Models at the Early Stage of the Outbreak
Early in an epidemic, the quality of data on infections, deaths, tests, and other factors often are limited by underdetection or inconsistent detection of cases, reporting delays, and poor documentation, all of which affect the quality of any model output. There are many counties with zero daily counts at the early stage of disease spread. Therefore, we consider zero-inflated models based on a zero-inflated probability distribution, which allows for frequent zero-valued observations. Following the works byArab et al. (2012), Beckett et al. (2014) and Wood et al. (2016), we assume the observed counts
contributes to a zero-inflated Poisson distribution
Similarly, we also consider zero-inflated models, in which we assume the observed count contributes to a zero-inflated Poisson distribution
where follows (9), , and with and estimated in a parallel fashion to .
5 Forecast and Band of the Forecast Path
To understand the impact of COVID-19, it requires accurate forecast for the spread of infectious cases along with analysis of the number of death and recovery cases. In this section, we describe our prediction procedure of these counts, specifically, we are interested in predicting , and . We also provide the prediction intervals to quantify the uncertainty of the prediction.
We consider an -step ahead prediction. As described in Section 3, if we observe for , then the infection model and fatal cases model can be fitted by regressing , on , respectively. The predictions of infectious count at time and iteratively at are
respectively, where and . Meanwhile, let
and , where we predict by , and by . Then, the predicted number of active cases and susceptible cases are , and . The above one-step predicted values can be thus plugged back into equation (10) to obtain the predictions for the following days by repeating the same procedure.
There is substantial interest in the problem of how to quantify the uncertainty for the forecasts with a succession of periods. To construct the band for forecast path , we consider the bootstrap method (Staszewska-Bystrova, 2009), in which the bootstrap samples are generated using the bootstrap-after-bootstrap procedure; see Algorithms 2 and 3 for the details.