1 Introduction
With every new pandemic, epidemiologists struggle to come up with models consistent with observed results. With the death toll of the COVID-19 pandemic reaching over 200,000 in the US alone, it is crucial that local policymakers have the tools they need to marshal resources and target policies that respond to the severity of the epidemic in their region. Current modeling techniques focus on predicting the course of the epidemic at the state and national level; there is a lack of understanding of how to chart the course of the epidemic in smaller regions, like counties in the United States. This poses a serious issue for local governments, since the course of the epidemic can vary widely within a state (especially in large states like California), making it difficult to know where to deploy resources and when to ease social distancing restrictions on a county basis.
Current models use either epidemiological techniques or basic machine learning to predict the mean number of deaths. We propose a novel method for predicting deaths from COVID-19 at the county level: we combine state-of-the-art machine learning architectures (LSTMs, gradient boosted trees, random forests, etc.) and epidemiological techniques (SEIR-QD) to produce quantile estimates for daily deaths over a variable-length forecast period. This allows us to:

Convey more information about the daily death distribution (quantile forecasts)

Capture the severity of the pandemic on a local scale (county level)
Our goal is to provide policymakers with a model that is useful for predicting the course of the pandemic in their area.
In Section 2 we describe the evaluation metric used to improve and choose models. In Section 3 we describe the data processing pipelines feeding the models. In Section 4 we describe the architectures of all the machine learning and epidemiological models we use. In Section 5 we report the results of the various models (as per the evaluation metric in Section 2). We then discuss these results in Section 6.

2 Evaluation
Most COVID-19 prediction models focus on predicting the mean mortality rate, in that they output a single number to reflect the death count or mortality rate (1), (2). Our goal is to provide a better description of the death distribution; we approach this by providing an estimate of COVID-19 deaths in the United States with error bars. Specifically, we provide estimates for the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, and 90th quantiles. The metric used to evaluate models was the 9-quantile pinball loss, where for quantile $\tau$, the loss given ground truth $y$ and prediction $z$ is:

$$L_\tau(y, z) = \begin{cases} \tau\,(y - z) & \text{if } y \geq z \\ (1 - \tau)\,(z - y) & \text{if } y < z \end{cases}$$
Pinball loss is one of the common ways to evaluate quantile forecasts (3), since it weights errors depending on the quantile: the error incurred by a prediction varies as we vary the quantile $\tau$. For example, under-predicting for the 90th quantile incurs much more error than over-predicting; similarly, over-predicting for the 10th quantile incurs more error than under-predicting. The figure below shows how the loss function behaves for different quantiles.
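As a concrete reference, the pinball loss above can be computed as follows (a minimal sketch; the function and variable names are our own):

```python
import numpy as np

def pinball_loss(y_true, y_pred, tau):
    """Pinball (quantile) loss for quantile level tau in (0, 1).

    Under-prediction (y_true > y_pred) is weighted by tau and
    over-prediction by (1 - tau), so high quantiles penalize
    under-prediction more heavily, and vice versa.
    """
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.where(diff >= 0, tau * diff, (tau - 1.0) * diff))

# For the 90th quantile, under-predicting costs far more than over-predicting:
under = pinball_loss([100.0], [90.0], 0.9)   # 0.9 * 10 = 9.0
over = pinball_loss([100.0], [110.0], 0.9)   # 0.1 * 10 = 1.0
```

The asymmetry is what makes the metric informative about the whole distribution rather than just the mean.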
Given the nine quantile estimates for an individual county $c$ on a particular day $d$, the pinball loss at the individual county level is:

$$L(c, d) = \frac{1}{9} \sum_{i=1}^{9} L_{\tau_i}\!\left(y_{c,d},\, \hat{y}_{c,d,\tau_i}\right), \qquad \tau_i \in \{0.1, 0.2, \ldots, 0.9\}$$
The final evaluation metric is the average of $L(c, d)$ taken over every county and every day in the forecast period. The ground truth dataset used for evaluation is the New York Times coronavirus dataset (4).
3 Data Usage and Preprocessing
We describe the data processing and feature engineering done to generate training and test data for all the models used. Labels for daily death values at the county level are obtained from the New York Times dataset. Initial preprocessing to reallocate and filter the ground truth data is required due to inconsistent reporting and "data dumps", wherein counties under-report on certain days and then account for it by accumulating these deaths and over-reporting on a single day. We deal with these "data dumps" by spreading the deaths equally among the days in the period since the last "data dump".
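The redistribution step can be sketched as follows (a simplified illustration with hypothetical counts; detecting the dump days in practice requires additional heuristics):

```python
import numpy as np

def spread_dump(daily_deaths, dump_idx, last_dump_idx):
    """Spread a single-day reporting spike evenly over the days since
    the previous dump (indices last_dump_idx+1 .. dump_idx, inclusive)."""
    deaths = np.asarray(daily_deaths, dtype=float).copy()
    window = slice(last_dump_idx + 1, dump_idx + 1)
    total = deaths[window].sum()        # deaths accumulated since the last dump
    n_days = dump_idx - last_dump_idx   # length of the period
    deaths[window] = total / n_days     # distribute equally
    return deaths

# A county reports 0 deaths for three days, then dumps 12 on day 4:
smoothed = spread_dump([2, 0, 0, 0, 12], dump_idx=4, last_dump_idx=0)
# → [2, 3, 3, 3, 3]; the cumulative total is preserved.
```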
This dataset is then combined with the aggregated Berkeley dataset (5), which captures a wide range of stationary features for each county. As these features are constant across time for a given county, they are much less important than the dynamic features, but certain density features (e.g. population density) proved relevant in differentiating counties. This is finally combined with the Descartes Labs mobility data (6), with missing values imputed in a forward-fill fashion using the most recent value. The main mobility feature used is a mobility index (m50_index), which captures the percentage of normal mobility present in a region.
The features with the most predictive power are the dynamic features such as cases, mobility, and deaths. For several models, these are processed to create derived features with different fixed lags (e.g. deaths 14 days ago, deaths 7 days ago). The amount of lag is constrained by the length of the forecast period and informed by the inherent lag between infection onset and death. For example, for a forecast period of 14 days, the set of lags used (in days) was {15, 16, 17, 18}.
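For the 14-day forecast, constructing the lagged feature set {15, 16, 17, 18} with pandas might look like this (column names are illustrative):

```python
import pandas as pd

def add_lag_features(df, cols=("deaths", "cases"), lags=(15, 16, 17, 18)):
    """Append fixed-lag copies of dynamic columns for one county's
    time-ordered frame; the lags exceed the 14-day forecast horizon,
    so every feature is known at prediction time."""
    out = df.copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out

# Toy single-county frame, one row per day.
county = pd.DataFrame({"deaths": range(40), "cases": range(0, 400, 10)})
features = add_lag_features(county)
```

Rows early in the series get NaNs for the lagged columns and would be dropped from training.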
The focus of the feature engineering is to augment the time-related information available in the data. Such features capture what stage of the pandemic a given county is in. Other information (e.g. day of the week) is also useful in capturing under- and over-reporting trends. Finally, we add categorical features such as the state a county belongs to, as well as the index of the cluster a county belongs to based on the clustering technique described in section 4.5.2.
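One-hot encoding these categorical features is straightforward in pandas (the frame below uses made-up values; column names are illustrative):

```python
import pandas as pd

# Hypothetical per-county rows with state and cluster-index categoricals.
df = pd.DataFrame({
    "county": ["Alameda", "Kings", "New York"],
    "state": ["CA", "CA", "NY"],
    "cluster": [0, 2, 1],
})

# One-hot encode the state and the cluster assignment (section 4.5.2).
encoded = pd.get_dummies(df, columns=["state", "cluster"])
```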
4 Approach
Our modeling approaches can be grouped into two main categories: machine learning models and epidemiological models. While the epidemiological models show promising results in capturing broad trends, they fail to capture complex patterns in the data; the machine learning models, in contrast, are able to capture more granular trends and variations. All of the individual models are combined using a nonlinear ensemble described in section 4.5.
4.1 Epidemiological Models
4.1.1 Imperial College Model
One of the main models used is an adaptation of the Imperial College model (7), a semi-mechanistic Bayesian hierarchical model trained to predict daily deaths that is able to incorporate intervention data on dates of lockdowns and other closures. The original study was done on 11 European countries; we adapted it to train on US counties instead. The intervention data is used to learn parameters that determine the time-varying reproduction number, which is used in combination with an internal infection model to predict the number of true cases and deaths.
A noteworthy change made to the original model is that multiple parameters are used to learn the uncertainty in death predictions per county, instead of a single shared parameter across all counties. In particular, the top 10 counties by cumulative deaths are trained on individually, separately from the main model used to train on the rest of the counties. In addition, since information on infection fatality rates (IFR) is not available on a per-county basis, these values are initially set to the IFR of the United Kingdom and a multiplicative deviation parameter is learned per county.
4.1.2 Generalized SEIR Models
Another of our initial models is the SEIR-QD model, a variant of the generalized SEIR model (both discussed in (8)). The SEIR-QD model is a compartmental model with states for susceptible, exposed, infected, recovered, quarantined, and dead, with ordinary differential equations governing disease transmission and movement between these states. This variant of SEIR models has been shown to model the COVID-19 epidemic well (9). The model parameters for these ODEs are learned at a per-county level using a least squares optimization approach to minimize the error in both predicted cases and deaths. For counties in the early stages of the epidemic, a weighted loss function favoring the accuracy of predicted cases is used, since the ground truth data on deaths is very noisy; for counties with more severe outbreaks, the weighting is shifted to favor the accuracy of predicted deaths, since the true data on deaths is more reliable there. Moreover, an unweighted loss metric yields poor results because the number of reported cases is significantly higher than the number of reported deaths. The prediction uncertainty is generated using the negative binomial technique described in section 4.5.1.

4.2 Neural Networks
4.2.1 LSTM
The architecture of the LSTM follows an encoder-decoder approach, taking into account both temporally variable data and fixed variables. The temporal variables used are cases, deaths, and a one-hot encoded weekday feature; the fixed variables are population features and hospital bed counts in the county. First, a ConvLSTM with a 5-day convolution window is applied to the data to generate a feature vector. Using this approach rather than a standard LSTM seemed to help the network handle the noise in the data without having to predict a moving average of deaths, which would eliminate some interesting trends that may be beneficial to capture. The output of the ConvLSTM is stacked on top of the vector of fixed features, which is then fed into a dense layer that outputs the final encoding of the death time series.
Multiheaded Decoders
Especially early in the epidemic, directly training the encoder-decoder network to predict for the given forecast period from static lagged features (e.g. cases and deaths from 14 days ago) was problematic, as it greatly diminishes the number of training samples we can generate for the network. For example, there are few 14-day periods available in the first 2 months of the epidemic in the United States. To increase the data available to train the encoder segment of the network, multiple decoders are created, each of which is an LSTM that generates an output sequence of vectors, one per day of its forecast horizon, each representing the quantile predictions for that day. In the 14-day forecast example, multiple decoder heads (sharing the same encoder backbone) are created for 3-, 7-, and 14-day time horizons; the 14-day decoder is used in the final encoder-decoder for the 2-week prediction.
LSTM Training Pipeline
Given the lack of data on larger outbreaks, especially early in the epidemic, a multi-step training pipeline is used: parts of the model are first trained on larger datasets to increase the probability of quick convergence, and then fine-tuned on smaller datasets to improve performance on the specific targets.
First, the shared encoder is trained along with the shorter-timescale decoders for a few epochs, in increasing order of horizon length. Empirically, we found that pre-training the encoder using decoders designed to predict shorter sequences of deaths (in the 14-day forecast example, the 3- and 7-day decoders) achieved the best performance across stages of the epidemic. Once the encoder is trained with these decoder networks, the final decoder head is attached, and the entire system is trained for a longer period (empirically, about 10 times the number of epochs used for each smaller timescale works best). Initializing the network weights by first training on the easier task of predicting shorter sequences improved the final validation performance of the full forecast period network, in addition to the overall stability of training. This effect is especially prevalent in the earlier days of the epidemic when data was sparse (which, as mentioned in (9), is a difficult part of modeling an epidemic). This model is then fine-tuned to generate separate models predicting deaths for the top 50 counties (in terms of total deaths) and for the less affected counties. This improves performance considerably on more affected counties like New York County.
4.2.2 Standard Neural Network
Motivated by the model ensembling performance described in section 4.5, one of the individual models used is a standalone neural network. We use a small network with a series of hidden dense layers (three layers of dimensions {20, 10, 10}, respectively) activated by rectified linear units. Dropout is used for both hidden and input layers. These hyperparameters are tuned using automated grid searches. Early stopping is also used with a specified tolerance. The model is trained on all counties together to optimize for mean squared error, and a quantile prediction distribution is obtained using the negative binomial method described in section 4.5.1.

4.2.3 Quantile Neural Network
The quantile neural network is constructed by training nine independent neural networks, each optimizing the quantile loss for one of the nine quantiles described in section 2 (i.e. nine loss functions $L_{\tau_1}, \ldots, L_{\tau_9}$). The input data and architecture of each of these models are identical to those described in 4.2.2. The predictions of these models are concatenated to form the final quantile forecast.
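To illustrate what optimizing a predictor directly for one quantile does, here is a toy subgradient-descent example (not our network architecture; a minimal sketch): minimizing pinball loss with a constant predictor converges to the empirical quantile of the targets.

```python
import numpy as np

def fit_constant_quantile(y, tau, lr=1.0, n_iter=3000):
    """Minimize pinball loss over a single constant prediction z.
    The subgradient of the mean loss at z is (1 - tau) on samples with
    y < z and -tau on samples with y > z, so gradient descent drives z
    toward the tau-quantile of y."""
    y = np.asarray(y, dtype=float)
    z = float(y.mean())
    for _ in range(n_iter):
        grad = np.mean(np.where(y > z, -tau, 1.0 - tau))
        z -= lr * grad
    return z

y = np.arange(1.0, 101.0)             # targets 1, 2, ..., 100
q90 = fit_constant_quantile(y, 0.9)   # close to the 90th percentile
q50 = fit_constant_quantile(y, 0.5)   # close to the median
```

The same principle applies when the constant is replaced by a neural network: each of the nine networks settles on its own quantile of the conditional death distribution.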
4.3 TreeBased Models
4.3.1 Random Forest
One of the decision-tree-based methods we use is a random forest regressor. This model performs poorly early in the pandemic, mainly due to a lack of data (which leads to poor partitioning); the quality of the predictions vastly improves as more data becomes available. The model is trained on all counties together to minimize MSE. The quantile distribution is obtained by forming an empirical distribution from the predictions of the individual trees. One advantage of this model is that it captures weekly reporting trends well. However, it struggles to generate reasonable quantile distributions for some of the top counties due to a lack of data in similar death ranges. This is again an issue of poor leaf partitioning for very high death values, leading to unstable predictions for counties like New York City and exploding top quantiles for the highest-death counties. We clip the top quantiles to combat this.
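The per-tree empirical distribution can be extracted from scikit-learn's random forest as follows (a sketch on synthetic data; the features and target are stand-ins):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = np.exp(X[:, 0]) + rng.normal(scale=0.1, size=500)  # skewed target

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Collect each individual tree's prediction, then read off empirical quantiles.
x_new = X[:5]
per_tree = np.stack([tree.predict(x_new) for tree in forest.estimators_])
quantiles = np.percentile(per_tree, [10, 50, 90], axis=0)  # shape (3, 5)
```

Clipping the top quantiles amounts to an extra `np.minimum` on the highest rows of this array.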
4.3.2 Quantile Gradient Boosted Decision Trees
Another decision-tree-based method we explored is gradient boosted decision trees (GBDTs). For each county, separate GBDT models are trained directly on quantile loss, and the average of those predictions (for every day in the forecast period) is taken. The benefit of this model is that it optimizes directly for quantile loss and predicts county-level trends separately for each county. The main issue is that many of the predictions miss trends shared across counties (since this is an individual county-level model) and tend to look linear.
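The per-quantile GBDT approach can be sketched with scikit-learn, which supports pinball loss natively via `loss="quantile"` (synthetic data; hyperparameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 3))
y = 10 * X[:, 0] + rng.normal(scale=1.0, size=400)

# One booster per quantile level, each trained directly on pinball loss.
quantile_levels = [0.1, 0.5, 0.9]
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                 n_estimators=100).fit(X, y)
    for q in quantile_levels
}
preds = np.stack([models[q].predict(X[:10]) for q in quantile_levels])
```

In our setting the nine quantile models would be fit per county on the lagged features described in Section 3.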
4.4 Gaussian Processes
Another model that we explored is a Gaussian process regressor (as per (10)). As with many statistical methods, this method introduces a marginal likelihood function, where the prior over functions is taken to be a Gaussian process with a user-defined kernel $k(x, x')$. This marginal likelihood is maximized via an optimizer (in this case, the L-BFGS-B optimizer); at the end of training we have a predicted mean and variance, which define a distribution from which to generate output for the forecast period. Instead of sampling from this distribution directly to generate quantiles, we take the predicted mean and use it as the mean of a negative binomial distribution for the method described in 4.5.1 to generate quantiles. The only significant improvement in the model is given by the choice of kernel, specifically a combination of a constant kernel and a rational quadratic kernel.

4.5 Ensemble
Early on, we noticed that different models have unique strengths and weaknesses in their predictions, so we began exploring ensembling techniques that could combine predictions from all of our individual models. Our initial attempts focused on simple methods (linear regression, averaging, and gradient boosted decision tree ensembling), but we found that combining the predictions of these models using a neural network trained directly on pinball loss improved the consistency, stability, and accuracy of the final predictions. We use a shallow network with a series of hidden dense layers activated by rectified linear units (ReLU), with dropout regularization. Smaller networks improve the reliability of the predictions and the consistency of model convergence; experiments with deeper or more complicated architectures resulted in overfitting due to the lack of data.
The primary inputs to the aggregation network are the individual model predictions. However, we observed some additional features that impact ensemble performance: county characteristics, the number of days into the prediction period, and the day of the week. To help the network identify trends in these features, we provide as additional input one-hot encoded features for: the state the county is in, the county cluster found using the clustering technique in section 4.5.2, the number of days into the prediction timeline, and the day of the week.
To generate final predictions, we train each of the individual models daily and form an aggregation set consisting of the predictions of models trained over 28 consecutive days. This aggregation set is then used to train the ensemble network, which generates predictions for the forecast period.
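Schematically, the aggregation set can be assembled as a matrix with one row per (county, day) pair (random stand-ins for real model predictions; the array shapes and feature choices are illustrative):

```python
import numpy as np

n_counties, n_days, n_models = 100, 28, 9
# Hypothetical predictions: one value per (model, county, day).
preds = np.random.default_rng(2).random((n_models, n_counties, n_days))

# Flatten to one row per (county, day); columns are the model predictions.
rows = preds.reshape(n_models, -1).T             # (n_counties * n_days, n_models)

# Append context features, e.g. a day-of-week one-hot for each row.
day_index = np.tile(np.arange(n_days), n_counties)
dow_onehot = np.eye(7)[day_index % 7]
aggregation_set = np.hstack([rows, dow_onehot])  # input to the ensemble net
```

The state and cluster one-hots from Section 3 would be appended in the same way.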
4.5.1 Negative Binomial Technique
Many of our individual models are trained to predict mean daily death values. To generate a quantile distribution from these mean predictions, we assume the daily death values follow a negative binomial distribution, with the mean and variance of the daily death random variable $D$ given by:

$$\mathbb{E}[D] = \mu, \qquad \operatorname{Var}[D] = \mu + \frac{\mu^2}{\psi}$$

Here, $\mu$ is the model's prediction for the day, and $\psi$ is a parameter that controls the spread/uncertainty of the distribution. Instead of trying to fit this $\psi$ value, we use a manual heuristic: the variance of daily deaths over a fixed trailing window is used to estimate the value of $\psi$. Using this parametrization of the distribution, we then sample from it to generate quantile predictions. This trick is motivated by the sampling method used in (7).

4.5.2 Clustering
Initially, we attempted to cluster with dynamic time warping (DTW) as a metric, since it has been shown to work well in other time-series tasks, specifically in the audio domain (11), (12). However, clustering with DTW as a metric and K-Means or DBSCAN as the clustering algorithm usually assigned all counties to one main cluster and a few small clusters (this behavior persisted across different choices of hyperparameters). We also experimented with using the raw time series as the featurization of a county and running K-Means clustering on that, but this resulted in similar behavior (most counties clustered together, and the differences between time series within individual clusters seemed random).
Instead, we use a more intuitive approach to clustering, motivated by (13), wherein we consider changes in time vs. changes in daily death magnitude (this featurizes all time series on a relative scale). Specifically, using the daily death data for each county since March 10, we create a 2D histogram over differences in magnitude ($dm$) and differences in time ($dt$). We create 8 magnitude-difference bins and 8 time-difference bins, where each "bin" represents a range of values; considering all combinations of bins gives an $8 \times 8$ 2D grid. All pairs of two points on the time series are then considered, each yielding a corresponding pair $(dm, dt)$, where $dm$ is the difference in magnitude and $dt$ is the difference in time. Compiling these $(dm, dt)$ values over all pairs of points on the time series, we fill in the 2D histogram with the number of pairs that fall in each bin.

Each time series is thus turned into an $(8, 8)$ matrix, which when plotted looks like figure 1(b). Each such $(dm, dt)$ image is then pooled using a $(2, 2)$ block average across the grid (so that the $(8, 8)$ representation is turned into a $(4, 4)$ representation); see figure 1(c) for reference. These $(4, 4)$ representations of the daily death time series are flattened into arrays of size 16. These feature representations are given as input to K-Means clustering, and the resulting cluster label for each county is used as a feature in various models.
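The featurization and clustering steps above can be sketched as follows (the bin edges, toy series, and number of clusters are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

def dmdt_image(series, n_bins=8):
    """2D histogram over all pairwise (dm, dt) = (magnitude diff, time diff)
    of a daily-death series, then 2x2 average pooling to a flattened 4x4 grid."""
    y = np.asarray(series, dtype=float)
    t = np.arange(len(y))
    i, j = np.triu_indices(len(y), k=1)      # all pairs of points, i < j
    dm, dt = y[j] - y[i], t[j] - t[i]
    hist, _, _ = np.histogram2d(dm, dt, bins=n_bins)
    # 2x2 average pooling: (8, 8) -> (4, 4), then flatten to length 16.
    pooled = hist.reshape(4, 2, 4, 2).mean(axis=(1, 3))
    return pooled.ravel()

# Featurize each county's series and cluster the representations.
series_by_county = [np.random.default_rng(c).poisson(5, 60) for c in range(30)]
features = np.stack([dmdt_image(s) for s in series_by_county])
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
```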
5 Results
5.1 Model Performance
All models are evaluated using pinball loss and RMSE (based on the predictions for the 50th quantile), as shown in Table 1. We show results for 14-day forecasts over two periods: Period 1 (May 10, 2020 to May 23, 2020) and Period 2 (May 25, 2020 to June 4, 2020). The relative rankings of the models are more or less consistent as we increase the forecast period, with the LSTM and the ensemble as our best models.
Model  Pinball Loss (Period 1)  RMSE (Period 1)  Pinball Loss (Period 2)  RMSE (Period 2)
Ensemble Model 4.5  0.1054  1.6391  0.0896  1.4543 
Gaussian Processes 4.4  0.1560  4.6376  0.1038  1.9252 
Imperial Model 4.1.1  0.1337  2.0439  0.1012  1.6285 
LSTM 4.2.1  0.1182  1.7416  0.0907  1.4036 
Standard Neural Network 4.2.2  0.1308  1.5577  0.0994  1.5068 
Quantile Gradient Boosted Decision Trees 4.3.2  0.1356  3.2682  0.1155  1.9947 
Quantile Neural Network 4.2.3  0.1163  1.6557  0.0929  1.4639 
Random Forest 4.3.1  0.1431  3.0462  0.0937  1.4425 
Random Forest (Moving)  0.1199  1.6663  0.1183  1.4834 
SEIR-QD 4.1.2  0.1455  1.9874  0.1189  1.7460
The quantiles predicted by the model for 4 of the counties most affected by the pandemic in Period 1 and Period 2 are shown in Figures 3 and 4 below. The 10 days of deaths preceding our model predictions (part of the training set) are also shown on the same plots.
6 Discussion and Conclusion
As outlined in the paper, the best performing model is ultimately an ensemble of epidemiological and machine learning models. Our initial focus was on epidemiological techniques, the predominant class of models used in epidemiology to model pandemics. While these models are able to generate stable prediction distributions over longer time horizons, they are unable to take full advantage of all of the available data sources to predict finer-resolution trends in the death data.
As the pandemic progressed, our models were able to train on more data. This motivated us to explore machine learning models that require minimal injection of priors on the dynamics of disease spread, and instead rely on learning patterns in the data. Sequence models (described in 4.2.1) show the best performance, as expected, but even non-time-series models such as tree-based models (4.3.2, 4.3.1) and standard neural networks (4.2.2, 4.2.3) are able to capture daily fluctuations in the data while maintaining reasonable predictions over longer horizons.
While most machine learning models are difficult to interpret, some of them offer information on the importance of the various input features in predicting death rates. Lagged dynamic features of cases, deaths, and mobility are the most important variables, with lagged cases the most important among them. In addition, temporal features such as date offsets and day of the week are important, presumably helping to differentiate which stage of the pandemic a region is in (as well as capturing weekly fluctuations in reporting, the so-called "Sunday effect"). Static features are much less important, but certain density features such as population density have some predictive power.
To the best of our knowledge, this is the first geographically local (county-level), variable-length-forecast US COVID-19 daily death model combining epidemiological and machine learning methods. We have found a model that provides a more complete description of the US COVID-19 daily death distribution and has a consistently low RMSE on daily deaths across different time periods. We hope this work can be useful for local and state-level policymakers as they decide how to deal with the ongoing pandemic in the US, and how to prepare for future waves of COVID-19-like epidemics.
References
 (1) Shreshth Tuli, Shikhar Tuli, Rakesh Tuli, and Sukhpal Singh Gill. Predicting the growth and trend of COVID-19 pandemic using machine learning and cloud computing. Internet of Things, page 100222, 2020.
 (2) Samuel Lalmuanawma, Jamal Hussain, and Lalrinfela Chhakchhuak. Applications of machine learning and artificial intelligence for COVID-19 (SARS-CoV-2) pandemic: A review. Chaos, Solitons & Fractals, page 110059, 2020.
 (3) Ingo Steinwart, Andreas Christmann, et al. Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1):211–225, 2011.
 (4) The New York Times. nytimes/covid-19-data. GitHub, Jun 2020.
 (5) Nick Altieri, Rebecca Barter, James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robert Netzorg, Briton Park, Chandan Singh, Yan Shuo Tan, et al. Curating a COVID-19 data repository and forecasting county-level death counts in the United States. arXiv preprint arXiv:2005.07882, 2020.
 (6) Michael S. Warren and Samuel W. Skillman. Mobility changes in response to COVID-19. arXiv preprint arXiv:2003.14228, 2020.
 (7) Seth Flaxman, Swapnil Mishra, Axel Gandy, H Unwin, Helen Coupland, T Mellan, Harisson Zhu, T Berah, J Eaton, P Perez Guzman, et al. Report 13: Estimating the number of infections and the impact of non-pharmaceutical interventions on COVID-19 in 11 European countries. 2020.
 (8) Liangrong Peng, Wuyue Yang, Dongyan Zhang, Changjing Zhuge, and Liu Hong. Epidemic analysis of COVID-19 in China by dynamical modeling. arXiv preprint arXiv:2002.06563, 2020.
 (9) Wuyue Yang, Dongyan Zhang, Liangrong Peng, Changjing Zhuge, and Liu Hong. Rational evaluation of various epidemic models based on the COVID-19 data of China. arXiv preprint arXiv:2003.05666, 2020.
 (10) Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. The Journal of Machine Learning Research, 11:3011–3015, 2010.
 (11) Lindasalwa Muda, Mumtaj Begam, and Irraivan Elamvazuthi. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083, 2010.
 (12) Xianglilan Zhang, Jiping Sun, and Zhigang Luo. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions. PLoS ONE, 9(2):e85458, 2014.
 (13) Ashish Mahabal, Kshiteej Sheth, Fabian Gieseke, Akshay Pai, S George Djorgovski, Andrew J Drake, and Matthew J Graham. Deep-learnt classification of light curves. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pages 1–8. IEEE, 2017.