Curse of Small Sample Size in Forecasting of the Active Cases in COVID-19 Outbreak

11/06/2020 ∙ by Mert Nakıp, et al. ∙ Sapienza University of Rome 10

During the COVID-19 pandemic, a massive number of attempts have been made to predict the number of cases and other future trends of the pandemic. However, these attempts fail to predict the medium- and long-term evolution of the fundamental features of the COVID-19 outbreak reliably and within acceptable accuracy. This paper explains the failure of machine learning models in this particular forecasting problem. It shows that simple linear regression models provide high prediction accuracy reliably, but only for a two-week period, and that relatively complex machine learning models, which have the potential to learn long-term predictions with low errors, cannot obtain good predictions with high generalization ability. The paper suggests that the lack of a sufficient number of samples is the source of the low prediction performance of the forecasting models. The reliability of the forecasting results for the active cases is measured in terms of the cross-validation prediction errors, which are used as expectations for the generalization errors of the forecasters. To exploit the information most relevant to the active cases, we perform feature selection over a variety of variables. We apply different feature selection methods, namely Pairwise Correlation, Recursive Feature Selection, and feature selection using Lasso regression, and compare them with each other and with models that do not employ any feature selection. Furthermore, we compare Linear Regression, Multi-Layer Perceptron, and Long-Short Term Memory models, each of which is used to predict the active cases together with the mentioned feature selection methods. Our results show that, because of the small sample size of the COVID-19 data, accurate forecasting of the active cases with high generalization ability is possible only up to 3 days.


I Introduction

Since the first COVID-19 case was confirmed in December 2019 in Wuhan, the COVID-19 outbreak has been spreading at an accelerating pace all around the world. Owing to the rapid spread of this pandemic, the first concern in most countries was that the medical facilities might not be sufficient to handle the massive number of patients. To plan necessary actions, such as expanding the facilities or taking preventive decisions to flatten the curve of daily cases, determining the future pattern of the active cases has become one of the most important issues. Consequently, many studies on forecasting the number of active cases have been published in the literature [benvenuto2020application, anastassopoulou2020data, petropoulos2020forecasting]. Although there are many valuable results in the published works, some of the publications optimistically make long-term predictions for the pandemic [anastassopoulou2020data, yang2020modified]. Furthermore, in most of the works, the test performance of the forecasting results has not been demonstrated well due to the restricted size of the available time series data, which covers only several months [benvenuto2020application, anastassopoulou2020data, yang2020modified].

In this study, we analyze the generalization ability of data-dependent forecasters to explain why the forecasting models used for determining the trend of COVID-19 cases possess poor medium- and long-term prediction performance in the special case of machine learning models. For this purpose, a forecasting system that consists of a feature selection module and a machine learning based forecasting module is designed and implemented. In the early phase of our studies, we observed that such an architecture provides the best forecasting performance among the considered models, which include the standard architectures of the state-of-the-art Linear Regression (LR), Multi-Layer Perceptron (MLP), and Long-Short Term Memory (LSTM) models with or without feature selection. The rationale behind the choice of these models relies on the following three facts: 1) LR is a linear static model which is the least complex architecture and therefore possesses high generalization ability [yan2009linear]. 2) MLP is a nonlinear static neural network model which has the universal function approximation property and can be said to be the most widely used neural network model, producing successful results in many applications [hastie2009elements]. 3) LSTM is a recurrent neural network model which is capable of approximating nonlinear dynamics and has proved itself as the best model in many challenging applications that require capturing the temporal relations hidden in inherently nonlinear dynamics [sak2014long]. In order to determine the best performance provided by this feature selection based forecasting model in terms of the cross-validation error, we trained and tested all of the combinations of feature selection method and forecaster. For the feature selection module, No Feature Selection (No FS), iterative feature selection based on the Pairwise Correlation (PCorr), Recursive Feature Selection (RFS), and feature selection by using the Lasso regression (Lasso, in short) [muthukrishnan2016lasso] are used. For the machine learning module, LR, MLP, and LSTM are chosen as the forecasters to process the selected features.

I-A Relationship to the State of the Art

Now, we present the relationship between our study and the works that aim to forecast the active cases in the COVID-19 outbreak. To the best of the authors’ knowledge, with respect to the method of forecasting, the studies that forecast the active cases in the COVID-19 outbreak can be classified into 3 categories as follows: (1) the SIR (Susceptible, Infected, Recovered) family [ndiaye2020analysis]; (2) statistical time series analysis methods (for example, Auto-Regressive Integrated Moving Average (ARIMA)) [benvenuto2020application]; (3) machine learning methods [pereira2020forecasting, rizk2020COVID, villalobos2020using].

The SIR model is a dynamical compartmental model that describes the time evolution of a disease transmitted from human to human within a population by a set of nonlinear ordinary differential equations. In the SIR model, the total population is assumed to be constant and divided into the following classes: Susceptible (S), Infected (I), and Recovered (R) [ndiaye2020analysis]. The works in [ndiaye2020analysis, anastassopoulou2020data, roda2020difficult] use the SIR model as the estimator for the number of active cases. In [ndiaye2020analysis], the authors focus on forecasting the active cases for all countries, where the forecasting horizon is 5 days. In [anastassopoulou2020data], the SIRD (Susceptible, Infectious, Recovered, Dead) method is used to predict the active cases under 3 different scenarios for only the Hubei province of China. Furthermore, in [roda2020difficult], the authors compare SIR with SEIR (Susceptible, Exposed, Infectious, Recovered) considering only active cases in Wuhan. They show that the SIR model performs much better than the SEIR model in representing the information contained in the confirmed-case data. This indicates that predictions using more complex SIR-like dynamical models may not be reliable in comparison to the ones using simpler SIR-like models. On the other hand, although SIR-like models explain the rise-and-fall nature of the growth of the pandemic, they fail to capture the peak and the whole time evolution of the disease within a reasonable accuracy. The reason is that the time waveform of the solutions to the SIR differential equations depends sensitively on model parameters such as the average number of contacts per person per time. It should be noted that SIR-like models could be used for an accurate prediction of active cases only when highly accurate parameter estimates can be obtained from the real data.

Besides SIR, statistical time series analysis methods are also used to predict the active cases in the COVID-19 outbreak. For the COVID-19 outbreak in Italy, the active cases are forecast by using ARIMA, which is a linear time-invariant dynamical model with stochastic input [ceylan2020estimation, ribeiro2020short, kumar2020forecasting], for 2 days [benvenuto2020application], and by using Seasonal ARIMA (SARIMA) for 60 days [chintalapudi2020COVID]. In [petropoulos2020forecasting], exponential smoothing based models are used to predict the future of the cumulative number of cases. The results of the works in this category show that these forecasters produce predictions that follow the trend of the past data. Thus, although those forecasters are able to capture the increasing trend of the active cases until the peak point, they are not capable of determining the whole time evolution of the disease.

There are a few works that use machine learning methods in order to forecast the active cases in the COVID-19 pandemic. In the works [pereira2020forecasting, zandavi2020forecasting, pal2020neural, vadyala2020prediction], LSTM based models are used to forecast the future of the pandemic by training the model on the past COVID-19 data for each of the selected countries. Similarly, the MLP based models in [rizk2020COVID, tamang2020forecasting], the support vector machine models in [yadav2020analysis], and the logistic regression in [villalobos2020using] were trained and then tested on the COVID-19 data of each country that was selected for testing. In [yang2020modified], the authors took into account the problem of the small sample size for the COVID-19 pandemic and trained their model on the 2003 SARS coronavirus outbreak data. The results reported by the works in this category show that the forecasting models perform well with respect to the error metric measurements; however, the graphs show that the models are not able to capture the peak days and values or the non-increasing parts of the time series.

The rest of this paper is organized as follows: In Section II, we state the problem and our method proposed for the forecasting of the number of active cases. In Section III, we present the feature selection methods and the parameter optimization for each method. In Section IV, we present the implementation of the considered forecasting methods. In Section V, we present our results on the forecasting of the number of active cases in COVID-19 pandemic. In Section VI, we present our conclusions.

II Statement of the Problem

In this section, we describe the forecasting problem for the active cases in COVID-19 outbreak. We aim to examine the generalization ability of the machine learning based forecasters for identifying their predictive powers on the COVID-19 data. To this end, we first analyze the effects of the different features on the forecasting of active cases and then select the important features that increase the forecasting accuracy. Second, we design forecasting models that perform prediction of the number of active cases. Furthermore, we analyze the performance of the forecasters for different forecasting horizons in an increasing order, and we provide the most reliable forecasting horizon for this problem by means of an empirical analysis.

II-A System Design

For the forecasting of the number of active cases, we design a system shown in Fig. 1 that consists of the Feature Selection module and the Forecasting module. The output of the system is the predicted value of each of the number of active cases for - to -day ahead forecasting. Furthermore, the detailed explanation of the methods that are used in the Feature Selection module and the Forecasting module are given in Section III and Section IV, respectively.

Fig. 1: System Design of the Integrated Feature Selection - Forecasting for the Number of Active Cases in the COVID-19 Pandemic

II-B Selection of the Important Features

Since many different features may affect the spread of the COVID-19 outbreak, we analyze the features that we are able to access and select a subset of them. Each feature in this subset has an important effect on the number of active cases. In order to improve the performance of the overall system, we perform the feature analysis combined with the forecasting module. That is, by using the feature selection methods in Section III, we select the feature subset that achieves the best forecasting performance under the considered forecasting scheme.

II-C Forecasting of the Active Cases

In the forecasting problem, we aim to compute the future value of the active cases. To this end, we use machine learning models with supervised learning whose output is the future value of the active cases at the th day. To the best of the authors’ knowledge, there is no study that examines the maximum length of the forecasting horizon that provides forecasting within a reasonable accuracy for the COVID-19 pandemic. In order to determine this horizon (which is the value of ), we forecast the total number of active cases for increasing forecasting horizon lengths from -day to -days. We give the details of the machine learning methods that are used as the forecasting models and their input-output structures in Section IV.

III Analysis and Selection of the Features

In this section, we describe the methods for the selection of the relevant subset of features. For each country, first, we take past values of each of the following daily time series datasets: The number of total cases, the total number of deaths, and the total number of recovered patients. In order to convert the number of total cases to the number of active cases, which is actually the important value that will affect the control of the hospital facilities during the pandemic, we subtract the total sum of the deaths and recovered from the total cases. Then, in order to convert each of the total deaths and the total recovered into the per day basis, we take the difference between the samples. That is, the resulting three time series are the number of active cases, the number of deaths per day, and the number of recovered patients per day.
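The conversion described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `to_active_and_daily` and the toy counts are assumptions, not values from the dataset. The per-day series come out one sample shorter than the cumulative ones because differencing needs two consecutive days.

```python
from typing import List, Sequence, Tuple


def to_active_and_daily(
    total_cases: Sequence[int],
    total_deaths: Sequence[int],
    total_recovered: Sequence[int],
) -> Tuple[List[int], List[int], List[int]]:
    """Turn cumulative series into active cases and per-day deaths/recoveries."""
    # Active cases = total cases - (total deaths + total recovered), day by day.
    active = [c - (d + r) for c, d, r in zip(total_cases, total_deaths, total_recovered)]
    # Per-day values are the first differences of the cumulative series.
    daily_deaths = [b - a for a, b in zip(total_deaths, total_deaths[1:])]
    daily_recovered = [b - a for a, b in zip(total_recovered, total_recovered[1:])]
    return active, daily_deaths, daily_recovered


# Hypothetical cumulative counts for four consecutive days.
active, daily_deaths, daily_recovered = to_active_and_daily(
    [10, 25, 60, 100], [0, 1, 3, 6], [0, 4, 12, 30]
)
```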

Second, we have selected 36 additional features that might affect the spread of COVID-19 and that are available online for all countries [COVID19_merged, countryinfo]. Note that none of these features is a time series. The details of these features are given in Section V-A.

In order to select the subset of these feature candidates, we apply three different feature selection methods: Iterative feature selection based on the pairwise correlation (PCorr) of each feature candidate pair (in short, correlation matrix), Recursive Feature Selection (RFS), and feature selection by using the Lasso regression (Lasso, in short). For each of the forecasting models given in Section IV for each value of , we choose one of these feature selection methods by calculating the overall performance based on the cross-validation.

III-A Feature Selection based on the Pairwise Correlation (PCorr)

In this method, first, for each feature, we calculate the correlation of this feature with each of the other features. Second, after completing the calculation of the pairwise correlation values for all of the feature candidate pairs, we find the indices of the feature candidate pairs whose correlation value is greater than the threshold value or less than . Third, from each such pair of feature candidates, we eliminate the feature whose average correlation with the other features is the greatest. Note that we select the optimal value of by using an exhaustive search in the range of with the increment of .

In order to calculate the correlation matrix, we used the Pearson product-moment correlation, which is defined by the formula in Equation 1. Each of the coefficients calculated by using this formula measures the strength and the direction of the linear relationship between two feature candidates.

$$\rho_{ij} = \frac{\frac{1}{N}\sum_{n=1}^{N}\left(x_i^{(n)}-\mu_i\right)\left(x_j^{(n)}-\mu_j\right)}{\sigma_i\,\sigma_j} \qquad (1)$$

In Equation 1, $\rho_{ij}$ denotes the correlation value between the $i$th and the $j$th features. $x_i^{(n)}$ denotes the value of the $i$th feature candidate at sample $n$, $\mu_i$ denotes the mean, and $\sigma_i$ denotes the standard deviation of the $i$th feature candidate. $N$ denotes the total number of samples. The numerator of this formula is the covariance between the $i$th and the $j$th features.
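As a rough sketch of the PCorr procedure, the snippet below computes the Pearson correlation as covariance over the product of standard deviations and then prunes one feature from every highly correlated pair. Several details are assumptions made for illustration: the helper names, the use of the absolute value when averaging a feature's correlations, the tie-breaking rule, and the made-up feature values.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)  # assumes neither feature is constant


def prune_correlated(features: Dict[str, Sequence[float]], threshold: float) -> List[str]:
    """Drop one feature from each pair whose |correlation| exceeds the threshold."""
    names = list(features)  # assumes at least two features
    corr = {(a, b): pearson(features[a], features[b]) for a in names for b in names if a != b}

    def mean_abs_corr(name: str) -> float:
        # Assumption: "average correlation" is taken over absolute values.
        return sum(abs(corr[(name, o)]) for o in names if o != name) / (len(names) - 1)

    kept = list(names)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept and abs(corr[(a, b)]) > threshold:
                # Eliminate the member of the pair that is more correlated overall.
                kept.remove(a if mean_abs_corr(a) >= mean_abs_corr(b) else b)
    return kept


# Made-up values; only the feature names come from the dataset description.
toy = {
    "Latitude": [1, 2, 3, 4, 5],
    "Avg-Temperature": [10, 8, 6, 4, 2],
    "GDP-2019": [3, 1, 4, 1, 5],
}
kept = prune_correlated(toy, threshold=0.9)
```

In this toy example the first two features are perfectly inversely correlated, so exactly one of them is removed while the third feature is kept.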

III-B Recursive Feature Selection (RFS)

The aim of the RFS is to shrink the set of the selected features in a recursive way. The RFS algorithm takes the set of feature candidates as the input. First, it trains an LR model with all of the feature candidates and keeps the coefficients of this LR. Note that we selected LR as the coefficient determination model in order to increase the generalization. The loop of the RFS searches for the optimal value of the desired number of features between and the total number of feature candidates. For each value of , RFS works as follows: 1) It updates the input set by pruning features for which the LR coefficients are smaller than others. 2) The forecasting model is trained and tested with the resulting feature set by using cross-validation. 3) The mean of the test scores over the folds of the cross-validation is appended to the score list. When the end of the loop for is reached, the optimal value of is computed empirically. Finally, the resulting feature set is selected as to contain the first features that have the LR coefficients with the highest-values.
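The loop above can be sketched as follows. This is a simplified illustration, not the authors' implementation: `score_fn` is a hypothetical callback standing in for "train the forecaster on these features and return the mean cross-validation test score", the lower bound of the search is assumed to be one feature, and features are ranked by the absolute value of their LR coefficients.

```python
from typing import Callable, List, Sequence, Tuple


def recursive_feature_selection(
    coefficients: Sequence[float],
    score_fn: Callable[[List[int]], float],
) -> Tuple[List[int], List[float]]:
    """Pick the number of top-ranked features that maximizes the CV score."""
    # Rank feature indices by coefficient magnitude, largest first (assumption).
    ranked = sorted(range(len(coefficients)), key=lambda j: abs(coefficients[j]), reverse=True)
    scores = []
    for n in range(1, len(ranked) + 1):
        # Keep the n features with the largest coefficients and score them.
        scores.append(score_fn(ranked[:n]))
    best_n = max(range(len(scores)), key=scores.__getitem__) + 1
    return ranked[:best_n], scores


# Toy stand-ins: four LR coefficients and a lookup table instead of a real CV run.
toy_coefficients = [0.1, -2.0, 0.5, 0.0]
toy_scores = {1: 0.60, 2: 0.90, 3: 0.85, 4: 0.80}
selected, score_list = recursive_feature_selection(
    toy_coefficients, lambda feats: toy_scores[len(feats)]
)
```

In practice `score_fn` would wrap the cross-validated training of the LR, MLP, or LSTM forecaster, which is why the selected subset can differ between forecasters.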

III-C Feature Selection by using the Lasso Regression (Lasso)

In order to select the subset of the feature candidates, we use the classical Lasso Regression with 5-fold cross-validation over the input sample set. For each fold of the cross-validation, we split the data into training and validation sets, train the Lasso model on the training set, evaluate it on the validation set, and finally calculate the validation score. As the best Lasso Regression model, we select the model that achieves the highest validation performance among the models trained in the cross-validation folds. Finally, we select the features whose Lasso coefficients are not equal to zero.
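The selection rule can be illustrated with a small, dependency-free sketch. A plain coordinate-descent solver stands in here for whatever Lasso implementation was actually used; the regularization strength `alpha`, the interleaved fold layout, the absence of an intercept, and the use of the validation mean squared error as the score are all assumptions made for the example.

```python
from typing import List, Sequence


def soft_threshold(z: float, g: float) -> float:
    """Shrink z toward zero by g; values within [-g, g] become exactly zero."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_fit(X: Sequence[Sequence[float]], y: Sequence[float], alpha: float, n_iter: int = 100) -> List[float]:
    """Coordinate descent for (1/(2n))*||y - Xw||^2 + alpha*||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction that ignores feature j, then its partial residual.
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w


def select_by_lasso(X, y, alpha: float, n_folds: int = 5) -> List[int]:
    """Train one Lasso per fold, keep the best one, return its nonzero features."""
    n = len(X)
    best_w, best_mse = None, float("inf")
    for f in range(n_folds):
        val_idx = list(range(f, n, n_folds))  # interleaved folds (assumption)
        train_idx = [i for i in range(n) if i not in val_idx]
        w = lasso_fit([X[i] for i in train_idx], [y[i] for i in train_idx], alpha)
        mse = sum((y[i] - sum(a * b for a, b in zip(X[i], w))) ** 2 for i in val_idx) / len(val_idx)
        if mse < best_mse:
            best_w, best_mse = w, mse
    return [j for j, c in enumerate(best_w) if c != 0.0]


# Toy data: the target depends only on the first feature.
x0 = [1, -1, 2, -2, 3, -3, 1, -1, 2, -2]
x1 = [1, 1, -1, 1, -1, -1, 1, -1, 1, 1]
toy_X = [[a, b] for a, b in zip(x0, x1)]
toy_y = [2.0 * a for a in x0]
selected = select_by_lasso(toy_X, toy_y, alpha=0.5)
```

The L1 penalty drives the coefficient of the irrelevant second feature to exactly zero, which is what makes the "nonzero coefficient" rule usable as a feature selector.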

IV Forecasting of the Number of Active Cases

In this section, we describe how we forecast the future values of the number of active cases, and the detailed design of the forecasting module in Fig. 1.

In the forecasting module shown in Fig. 1, we use different forecasters for to step ahead forecasting. Each of these forecasters is defined by its inputs, its output, and its internal model parameters. For each forecasting step , we set the input of each forecasting model to the features that are selected as explained in Section III. We set the output of this forecaster to the value of the number of active cases at the th step in the future, denoted by .
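To make the input-output structure of a single forecaster concrete, the sketch below builds supervised samples for one forecasting step from a time series. It is a simplified illustration: the window length `n_lags` is a hypothetical parameter, and only past values of one series are used as inputs, whereas the system described here feeds each forecaster the features chosen by the feature selection module.

```python
from typing import List, Sequence, Tuple


def make_supervised(
    series: Sequence[float], n_lags: int, step: int
) -> Tuple[List[List[float]], List[float]]:
    """Pair the last `n_lags` observations with the value `step` days ahead."""
    inputs, targets = [], []
    # t is the index of the most recent day available to the forecaster.
    for t in range(n_lags - 1, len(series) - step):
        inputs.append(list(series[t - n_lags + 1 : t + 1]))
        targets.append(series[t + step])
    return inputs, targets


# Hypothetical active-case counts; one (inputs, targets) set per forecasting step.
toy_series = [10, 20, 45, 64, 80, 90]
X3, y3 = make_supervised(toy_series, n_lags=2, step=3)
```

Note that a longer forecasting step leaves fewer usable samples, since the last `step` days have no known target.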

In order to forecast the future values of the number of active cases, we perform a comparative study with LR, MLP, and LSTM. We now describe the design and implementation of each of these models.

IV-A Linear Regression

We have selected the well-known linear regression model as a benchmark forecaster. In the implementation of this model, we use the Linear Regression module from scikit-learn library [scikitlearn]. The module fits a linear model with the coefficients to minimize the residual sum of squares between the observed targets in the dataset, and the predicted values by the linear approximation.

IV-B Multi-Layer Perceptron

We design an MLP model, which consists of two hidden layers. We let denote the number of neurons at hidden layer . In order to find the local optimal architecture of the MLP model, we search for the values of for within the range of for each integral power of two. We present the resulting architecture of the MLP model and compare the performances of these models in Section V-D. Furthermore, we set the activation function of each neuron to . In the implementation of the MLP model, we use the Keras library in Python [chollet2015keras].

IV-C Long-Short Term Memory

Our implementation of the LSTM model, which is coded by using the Keras library, consists of one LSTM layer, two fully connected layers, and an output layer. We let denote the number of LSTM units at the LSTM layer and denote the number of neurons at each fully connected layer . We exhaustively search for the local optimal values of and within the range of for each integral power of two.
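The exhaustive search over layer sizes restricted to integral powers of two can be sketched as below. The bounds passed to `powers_of_two` and the `evaluate` callback are assumptions for illustration: `evaluate` stands in for building the Keras model with the given layer sizes and returning its mean cross-validation score.

```python
from itertools import product
from typing import Callable, List, Tuple


def powers_of_two(low: int, high: int) -> List[int]:
    """All integral powers of two in the closed interval [low, high]."""
    values, v = [], 1
    while v <= high:
        if v >= low:
            values.append(v)
        v *= 2
    return values


def grid_search(
    evaluate: Callable[[Tuple[int, ...]], float], low: int, high: int, n_layers: int
) -> Tuple[int, ...]:
    """Try every combination of layer sizes and return the best-scoring one."""
    candidates = product(powers_of_two(low, high), repeat=n_layers)
    return max(candidates, key=evaluate)


# Toy scoring function that peaks at (16, 8); a real one would train a network.
best_architecture = grid_search(
    lambda sizes: -abs(sizes[0] - 16) - abs(sizes[1] - 8), low=4, high=64, n_layers=2
)
```

Restricting the grid to powers of two keeps the number of trained models small, at the cost of finding only a locally optimal architecture.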

V Results

V-A Dataset

In this paper, we have considered two different data domains. The first data domain is the time series data, which consists of the number of active cases, the number of deaths, and the number of recoveries for 71 different countries from the 22nd of January 2020 to the 20th of July 2020. This domain contains one dataset collected from [novel-corona-virus-2019-dataset]. The second data domain consists of two different datasets, each of which includes different features that pertain to each country. The first dataset in the second domain consists of 63 different features for 173 countries and is taken from [COVID19_merged]. The second dataset in this domain consists of 58 different features for 194 countries and is taken from [countryinfo].

We first took the intersection of all of the datasets with respect to the countries. Then, in the resulting dataset, we eliminated the features that are not available for all of the countries. Furthermore, we chose a subset of the country specific features and obtained 36 features. These country specific features are as follows: (1) latitude, (2) longitude, (3) population, (4) the number of people per square kilometer (in short, Density), (5) urban population (in short, Urban-Pop), (6) fertility, (7) median of the age (in short, Median-Age), (8) average of the temperature between January 2020 and March 2020 (in short, Avg-Temperature), (9) average of the humidity between January 2020 and March 2020 (in short, Avg-Humidity), (10) the number of male children born per female giving birth (in short, Male-Birth), (11-16) the number of males per female overall (in short, MF) and in the age groups 0-14 (in short, MF-14), 15-25 (in short, MF-25), 26-54 (in short, MF-54), 55-64 (in short, MF-64), and 65+ (in short, MF-65+), (17) the percentage of smokers (in short, Smokers), (18) the number of beds in hospitals (in short, Bed-Capacity), (19-21) the percentage of each of the females (in short, % Female-Lung), males (in short, % Male-Lung), and both females and males (in short, % Lung) that have lung diseases, (22) the death rate per 100000 caused by flu pneumonia (in short, Pneumonia-Death-100K), (23) the binary flag that is equal to one for a country if the number of H1N1 cases was underestimated in 2009 for this country (in short, H1N1-Underestimate), (24) the total number of confirmed cases per country during the H1N1 pandemic in 2009 (in short, H1N1-Confirmed), (25) the total number of confirmed deaths caused by H1N1 during the H1N1 pandemic in 2009 (in short, H1N1-Deaths), (26) the annual precipitation (in short, Annual-Precipitation), (27) the ratio of median property prices to the median familial disposable income (in short, Property-Affordability), (28) the estimation [HealthCareIndex] of the overall quality of the health care system, health care professionals, equipment, staff, doctors, and cost (in short, Health-Care), (29) the gross domestic product in 2019 (in short, GDP-2019), (30) the health expenses in USD (in short, Health-Expenses), (31) the health expenses per one million individuals (in short, Health-Expenses-1M), (32) the limit on the number of persons for gatherings (in short, Gathering-Limit), and (33-36) the number of days passed between the date of the first confirmed case and each of the closing date of the non-essential public places (in short, Nonessential-Close-Days), the starting date of the public gathering limitations (in short, Gathering-Limit-Days), the closing date of the schools (in short, School-Close-Days), and the closing date of the public places (in short, PublicPlace-Close-Days).

As a result, our dataset consists of features, where of them are the time series features during the COVID-19 pandemic, and of them are the general country related features.

Fig. 2: Heatmap of the pairwise Pearson correlation for the feature candidates that are not time series

V-B Performance Evaluation by using 10-Fold Cross-Validation

For each of the LR, MLP, and LSTM models, in order to measure the generalization ability of the model, we perform 10-fold Cross-Validation (CV).

In each fold of the CV, we first split the dataset into the training set and test set randomly with ratios of and , respectively. Second, we train the model on the training set and test it on the test set for the current fold. Third, we measure both of the training and test performance of the model by using the r metric [glantzslinker].
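The evaluation loop can be sketched in plain Python as follows. Two points are assumptions: the score is taken to be the coefficient of determination, and the `test_ratio` default is a placeholder rather than the split ratio used in the experiments. The `fit` and `predict` callables are hypothetical hooks for any of the three forecasters.

```python
import random
from typing import Any, Callable, Sequence, Tuple


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cross_validate(
    fit: Callable[[list, list], Any],
    predict: Callable[[Any, Any], float],
    X: Sequence[Any],
    y: Sequence[float],
    n_folds: int = 10,
    test_ratio: float = 0.2,  # placeholder value (assumption)
    seed: int = 0,
) -> Tuple[float, float]:
    """Repeat a random train/test split and return mean train and test scores."""
    rng = random.Random(seed)
    n_test = max(1, round(len(X) * test_ratio))
    train_scores, test_scores = [], []
    for _ in range(n_folds):
        idx = list(range(len(X)))
        rng.shuffle(idx)  # a fresh random split in every fold
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for ids, scores in ((train_idx, train_scores), (test_idx, test_scores)):
            preds = [predict(model, X[i]) for i in ids]
            scores.append(r2_score([y[i] for i in ids], preds))
    return sum(train_scores) / n_folds, sum(test_scores) / n_folds


# Toy forecaster: a single slope through the origin fitted by least squares.
toy_X = [float(v) for v in range(1, 21)]
toy_y = [2.0 * v for v in toy_X]
train_r2, test_r2 = cross_validate(
    fit=lambda xs, ys: sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs),
    predict=lambda slope, x: slope * x,
    X=toy_X,
    y=toy_y,
)
```

The difference between the two returned means is the generalization gap that the later figures report for each forecaster.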

In the training of the MLP and LSTM models, we use the ADAM algorithm as the optimizer with the mean squared error (MSE) as the loss. We set the parameters of the ADAM algorithm as follows: the initial learning rate to , beta1 to , and beta2 to . Furthermore, we set the batch size to . During the training of the MLP and LSTM models, we set the maximum number of epochs to and use early stopping, which halts the training at the epoch where the training loss has not decreased for the last successive epochs.
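The early stopping rule can be illustrated without any deep learning library. In the sketch below, `run_epoch` is a hypothetical callback that trains for one epoch and returns the training loss, and `max_epochs` and `patience` are placeholder parameters. In Keras, a comparable behavior is available through the `EarlyStopping` callback with `monitor="loss"` and a `patience` argument.

```python
from typing import Callable, List


def train_with_early_stopping(
    run_epoch: Callable[[int], float], max_epochs: int, patience: int
) -> List[float]:
    """Stop once the training loss has not improved for `patience` epochs in a row."""
    best_loss = float("inf")
    stale_epochs = 0
    history = []
    for epoch in range(max_epochs):
        loss = run_epoch(epoch)
        history.append(loss)
        if loss < best_loss:
            best_loss, stale_epochs = loss, 0  # improvement resets the counter
        else:
            stale_epochs += 1
        if stale_epochs >= patience:
            break
    return history


# Toy loss curve: it plateaus after the third epoch, so training stops early.
toy_losses = [1.0, 0.8, 0.7, 0.7, 0.71, 0.72, 0.5]
history = train_with_early_stopping(lambda e: toy_losses[e], max_epochs=100, patience=3)
```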

V-C Feature Selection Methods for the Forecasting of Active Cases

In this subsection, for the forecasting problem of the active cases, we give the resulting parameters and the computation results for each of the PCorr, RFS, and Lasso feature selection methods. Furthermore, in Section V-D, based on the r score, we present the forecasting results for all of these feature selection methods for each forecasting model.

V-C1 Feature Selection based on the Pairwise Correlation (PCorr)

In Fig. 2, we present the heatmap of the resulting pairwise Pearson correlation for the feature candidates that are not time series. In this figure, as the color of a pixel becomes whiter, the correlation value of the feature pair for the corresponding pixel increases. That is, the pixels whose color is very close to white or very close to black indicate the feature pairs that are highly correlated. For example, the Latitude of each country is inversely correlated with the Avg-Temperature in that country, where the correlation value is . In addition, Median-Age and Fertility are inversely correlated with the correlation value . On the other hand, in Fig. 2, we see that all of the % Lung, % Male-Lung, and % Female-Lung features are highly correlated with correlation values above .

V-C2 Recursive Feature Selection (RFS)

In Fig. 3, we present the optimal number of features that are selected by the RFS algorithm. In this figure, we see that the RFS selects all of the features for the LR forecaster. The reason is that eliminating features does not improve the LR performance. In addition, we see that the RFS selects at most 9 features for the LSTM forecaster, which is the lowest number of selected features among the forecasters. The features that are selected for the LSTM model are the time series features with one exception: one of the selected features for the forecasting of day is GDP-2019. Furthermore, we see that the numbers of selected features for the LR and LSTM models do not change significantly with the increasing forecasting step; however, the number of selected features for the MLP forecaster varies with the increasing forecasting step.

Fig. 3: Comparison of the total number of selected features by RFS for each of LR, MLP, and LSTM forecasters

V-C3 Feature Selection by using the Lasso Regression (Lasso)

In Fig. 4, we present the total number of features selected by Lasso for the increasing forecasting step. Note that the features selected by Lasso are the same for all forecasting models. We see that the Lasso tends to select more features as the forecasting step increases, because the problem of forecasting the number of active cases becomes harder as the forecasting horizon widens. Furthermore, in the problem of forecasting the near future, the number of active cases for the last day in the past has the strongest relationship with the desired output.

Fig. 4: The total number of selected features by Lasso for the increasing number of forecasting step

V-D Forecasting Results for the Active Cases

In this subsection, we discuss the predictability of the number of active cases for the COVID-19 pandemic. We also compare the forecasting performances of the LR, MLP and LSTM models.

Fig. 5: Forecasting performance of the Linear Regression for training and test sets under different feature selection methods.

In Fig. 5, we show the forecasting performance of the LR model under no feature selection (No FS), PCorr, RFS, and Lasso with respect to the r metric. In this figure, for all of the feature selection methods, we see that the generalization gap between the training and the test performance of LR widens as the number of time steps increases. In addition, the training performance (in the r metric) of the LR method is above ; however, the test performance is under after . Even though the LR is a linear model with high generalization ability, our results show that the LR model overfits the training data on average over the CV folds. Furthermore, since the of the COVID-19 patients develop symptoms and are hospitalized within -days [lauer2020incubation], we see a significant performance improvement for for all of the feature selection methods. That is, the LR model is able to capture the linear relationship between the past and the -day ahead. Although the LR model cannot achieve a reasonable performance after -days, it performs the best under Lasso.

Fig. 6: Forecasting performance of the MLP for training and test sets under different feature selection methods.

In Fig. 6, we present the CV results on both the training and test sets for the MLP model under four different feature selection methods. First, we see that the mean of the training performance of MLP does not fall under for any of the feature selection methods. However, the test performance of the MLP model decreases significantly as the value of increases. That is, MLP, as a nonlinear model, heavily overfits the training data. Next, the MLP achieves its best r performance under RFS up to , and its performances under No FS and under Lasso are comparable after .

Fig. 7: Forecasting performance of the LSTM for training and test sets under different feature selection methods.

In Fig. 7, we show both of the training and test r performances of the LSTM model under four feature selection methods. For the LSTM model under each of the No FS, PCorr, RFS, and Lasso, since the training r performance of the LSTM model does not fall under , the test performance of that is under after . Furthermore, we see that RFS outperforms all of the other feature selection methods for the LSTM model. However, even the performance of the LSTM under RFS decreases to r value for .

Fig. 8: Comparison of the forecasting performance of LR, MLP, and LSTM under the best feature selection method for each value of with respect to the mean of the CV test scores.

In Fig. 8, we compare the LR, MLP, and LSTM models, each of which is applied together with the best performing feature selection method for each value of . First, we see that only up to , the r performances of all models are higher than . However, the performance of the MLP model decreases significantly at , whereas this point is for the LSTM and for the LR. Second, we see that after , there is no forecaster that achieves an r score higher than . We conclude that, since the number of samples for the problem of forecasting the number of active cases during the COVID-19 pandemic is too small to represent the feature space, LR outperforms the other two models for all values of , except .

According to our results shown in Fig. 5, Fig. 6, Fig. 7, and Fig. 8, we see that, due to the curse of small sample size, it is hard to forecast the number of active cases in the COVID-19 outbreak with high generalization ability beyond 3 days. The exception is the 14th day, for which LR produces high prediction accuracy; this might be due to the linear relation caused by the 14-day quarantine period applied to suspected persons.

V-D1 Forecasting of Active Cases on Extended Dataset

In order to see the performance improvement with increasing sample size, we extended the dataset (which was collected from January 22 to July 20, 2020) with the number of active cases, the number of deaths, and the number of recoveries for 71 different countries in the COVID-19 pandemic until July 20, 2020. For this extended dataset, we repeat the methodology (in Section II-A) to generate the results in the rest of this section.
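The effect of adding countries on the sample count can be illustrated with a minimal sketch. It assumes that each training sample is a fixed-length input window paired with the value a given number of steps ahead; the function names, the window length of 7, the horizon of 3, and the synthetic series are hypothetical and are not taken from the methodology of Section II-A.

```python
def make_samples(series, window, horizon):
    """Turn one series into (input window, target) pairs.

    The target is the value `horizon` steps after the last input value.
    """
    samples = []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        inputs = series[start:start + window]
        target = series[start + window + horizon - 1]
        samples.append((inputs, target))
    return samples


def pool_samples(series_by_country, window, horizon):
    """Concatenate the samples of every country into one training set."""
    pooled = []
    for country in sorted(series_by_country):
        pooled.extend(make_samples(series_by_country[country], window, horizon))
    return pooled


# Synthetic placeholder series (not real COVID-19 counts), 30 days each.
toy = {
    "country_a": [10 * day for day in range(30)],
    "country_b": [5 * day + 3 for day in range(30)],
    "country_c": [day * day for day in range(30)],
}

single = make_samples(toy["country_a"], window=7, horizon=3)
pooled = pool_samples(toy, window=7, horizon=3)
print(len(single), len(pooled))  # prints: 21 63
```

Under this windowing assumption, one series of length n yields n - window - horizon + 1 samples, so a longer horizon leaves fewer samples, while pooling more countries increases the count linearly.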

Fig. 9: Comparison of the forecasting performance of LR, MLP, and LSTM under the best feature selection method for each value of with respect to the mean of the CV test scores on the extended dataset.

In Fig. 9, we display the r performance of each forecasting scheme: LR, LSTM, and MLP. We see that for each value of , LR outperforms both the MLP and LSTM forecasters; however, even the performance of LR is around for . In addition, the r performance of LSTM decreases after , and that of MLP decreases significantly after . Furthermore, due to the increased sample size of the dataset from Fig. 8 to Fig. 9, we see that the performances of all forecasting schemes increase significantly for all values of .

V-D2 Forecasting of Active Cases for Turkey

Now, in Fig. 10, we present the forecasting results for the number of active cases in Turkey between 26th of March 2020 and 20th of July 2020 for an increasing number of time steps ahead. From Fig. 10(a) to Fig. 10(f), we respectively set the value of equal to , , , , and . In Fig. 10, for each value of , we concatenate the th-step ahead forecasts over the sliding windows, with -day sliding at each step.
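The concatenation of step-ahead forecasts over sliding windows can be sketched as follows. This is only an illustration under stated assumptions: the forecaster here is a least-squares line fitted to each window and extrapolated, which stands in for the LR, MLP, and LSTM models; the names `fit_line` and `rolling_forecast`, the window length, and the synthetic series are hypothetical.

```python
def fit_line(ys):
    """Least-squares slope and intercept for ys observed at t = 0..n-1."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    s_ty = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = s_ty / s_tt
    return slope, y_mean - slope * t_mean


def rolling_forecast(series, window, horizon, slide=1):
    """Concatenate horizon-step-ahead forecasts over sliding windows.

    Returns (target_index, forecast) pairs so that each forecast can be
    plotted against the real value at the same index.
    """
    forecasts = []
    for start in range(0, len(series) - window - horizon + 1, slide):
        slope, intercept = fit_line(series[start:start + window])
        # The last point of the window sits at local time window - 1.
        t_target = window - 1 + horizon
        forecasts.append((start + t_target, slope * t_target + intercept))
    return forecasts


# Synthetic, exactly linear series: the extrapolation should be exact.
series = [2.0 * t + 1.0 for t in range(20)]
forecasts = rolling_forecast(series, window=5, horizon=3)
```

Each window contributes exactly one forecast, so the concatenated curve has one point per slide, and it starts `window - 1 + horizon` steps after the beginning of the series.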

Fig. 10: Comparison of the LR, MLP, and LSTM with respect to each of , , , , and -step ahead forecasting of the number of active cases in Turkey between 26th of March 2020 and 20th of July 2020.

In Fig. 10(a), we see that the LR and MLP perform better than the LSTM forecaster, which is not able to forecast the number of active cases around the peak day. In this figure, except for the days between 20th April and 10th May, all of the LR, MLP, and LSTM models produce forecasts that are close to the real number of active cases. From Fig. 10(a) to Fig. 10(f), as the value of increases, we see that the forecasting performances of all forecasting schemes decrease, and LR produces the forecasts closest to the real number of active cases. Thus, in Fig. 10(f), although the MLP forecasts are close to the real values until 1st of May and the LR forecasts are close to the real values between 1st of June and 15th of June, we see that none of the forecasting models is able to capture the general trend of the number of active cases or to forecast the number of active cases for the peak day correctly for Turkey when .

VI Conclusions

In this paper, we perform a study to determine the horizon over which the number of active cases in the COVID-19 pandemic can be forecast accurately. To this end, we compare the performance of Linear Regression (LR), Multi-Layer Perceptron (MLP), and Long-Short Term Memory (LSTM) models for a variety of forecasting horizon lengths. Herein, the linear static model LR is chosen for its potentially high generalization ability. The most widely used static nonlinear neural network model, MLP, is preferred due to its powerful approximation property. The recurrent neural network LSTM is taken as the third benchmark model since it is the state-of-the-art model that is highly successful in capturing temporal relations in time series data. Considering that only a limited number of samples exists for the COVID-19 pandemic, we perform feature selection on the inputs of each of the three forecasters to reduce model complexity and thereby achieve acceptable generalization ability. The forecasters under no feature selection (No FS) are then compared to the forecasters with feature selection based on the Pairwise Correlation (PCorr), Recursive Feature Selection (RFS), and Lasso regression (Lasso), respectively.
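As a rough illustration of the simplest of these feature selection methods, the sketch below ranks candidate features by the absolute value of their Pearson correlation with the target and keeps the top few. The ranking rule, the `top_k` cut-off, the feature names, and the numbers are all assumptions made for the example; the selection rule actually used for PCorr is not restated in this section.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def select_by_correlation(features, target, top_k):
    """Keep the top_k feature names with the largest |correlation| to target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)[:top_k]


# Hypothetical feature columns and target (illustrative values only).
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "lagged_cases": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0, 7.0],
}

selected = select_by_correlation(features, target, top_k=2)
```

Dropping weakly correlated inputs reduces the number of parameters a forecaster has to estimate, which is the motivation for feature selection when samples are scarce.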

Our main conclusion is that long-term forecasting (in other words, prediction) of the number of active cases in the COVID-19 pandemic is not possible with high test accuracy, at least for the three benchmark models considered. This is a consequence of their poor generalization abilities under the very limited number of samples available, up to now, for the COVID-19 pandemic. This study is not conclusive. Other machine learning models, such as 1-dimensional or multi-dimensional Convolutional Neural Networks, may be applied for forecasting COVID-19 features such as active cases. However, all of these forecasting models will suffer from the small sample size problem.

The study presented in this paper shows that the active cases can be forecast with high performance and generalization ability only up to 3 days ahead. This statement is also valid for the 14th day ahead, but only when a linear model is used. Furthermore, even the best-performing model is not able to perform better than fitting the mean of the data (which corresponds to an r value equal to ) after -days.
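The comparison with fitting the mean can be made concrete with a short sketch. It assumes that the reported score is the coefficient of determination, defined as one minus the ratio of the residual sum of squares to the total sum of squares; the numbers below are illustrative and are not data from this study.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only.
y_true = [10.0, 12.0, 15.0, 11.0, 12.0]

# A forecaster that always outputs the mean of the data.
mean_forecast = [sum(y_true) / len(y_true)] * len(y_true)
# A forecaster that is worse than the mean.
worse_forecast = [20.0] * len(y_true)

baseline_score = r2_score(y_true, mean_forecast)  # 0.0
worse_score = r2_score(y_true, worse_forecast)    # negative
perfect_score = r2_score(y_true, y_true)          # 1.0
```

Under this definition, a score of zero marks the mean baseline, and any forecaster scoring below zero on the test set is less useful than predicting the average.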

References