Predictive Analysis of COVID-19 Time-series Data from Johns Hopkins University

We provide a predictive analysis of the spread of COVID-19, the disease caused by the SARS-CoV-2 virus, using the dataset made publicly available online by Johns Hopkins University. Our main objective is to provide predictions of the number of infected people for different countries. The predictive analysis is done using time-series data transformed on a logarithmic scale. We use two well-known methods for prediction: polynomial regression and neural networks. As the amount of training data for each country is limited, we use a single-layer neural network, the extreme learning machine (ELM), to avoid over-fitting. Due to the non-stationary nature of the time-series, a sliding window approach is used to provide a more accurate prediction.

I Goal

The COVID-19 pandemic has led to a massive global crisis, caused by the rapid spread rate and severe fatality, especially among those with a weak immune system. In this work, we use the available COVID-19 time-series of infected cases to build models for predicting the number of cases in the near future. In particular, given the time-series up to a particular day, we make predictions for the number of cases in the next $\tau$ days, where $\tau \in \{1, 3, 7\}$. This means that we predict for the next day, after 3 days, and after 7 days. Our analysis is based on the time-series data made publicly available on the COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) (https://systems.jhu.edu/research/public-health/ncov/) [JHU_article].

Let $x(n)$ denote the number of confirmed cases on the $n$-th day of the time-series after the start of the outbreak. Then, we have the following:

  • The input consists of the last $w$ samples of the time-series, given by $x(n-w+1), \dots, x(n)$.

  • The predicted output is $\hat{x}(n+\tau)$, where $\tau \in \{1, 3, 7\}$.

  • Due to the non-stationary nature of the time-series data, a sliding window of size $w$ is used over the time-series to make the prediction, and $w$ is found via cross-validation.

  • The predictive function $f$ is modeled either by a polynomial or a neural network, and is used to make the prediction: $\hat{x}(n+\tau) = f\big(x(n-w+1), \dots, x(n)\big)$.

TABLE I: Countries considered from the JHU dataset.
Sweden, Denmark, Finland, Norway, France, Italy, Spain, UK, China, India, Iran, USA

II Dataset

The dataset from JHU contains the cumulative number of cases reported daily for different countries. We base our analysis on the 12 countries listed in Table I. For each country, we consider the time-series starting from the day when the first case was reported. Given the current day index $n$, we predict the number of cases for the day $n+\tau$ by considering as input the number of cases reported for the past $w$ days, that is, for the days $n-w+1$ to $n$.
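As an illustration of this windowing, the following is a minimal sketch (in Python) of how the input-output pairs could be constructed; the log-scale transform follows the abstract, while the function name, the use of log1p, and the array layout are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def make_training_pairs(cases, w, tau, log_scale=True):
    """Build (input, target) pairs from a cumulative case-count series.

    cases : 1-D array of cumulative confirmed cases x(1), ..., x(N)
    w     : sliding-window size (found via cross-validation in the paper)
    tau   : prediction horizon in days (1, 3, or 7)
    """
    x = np.log1p(cases) if log_scale else np.asarray(cases, dtype=float)
    inputs, targets = [], []
    for n in range(w - 1, len(x) - tau):
        inputs.append(x[n - w + 1 : n + 1])  # last w samples up to day n
        targets.append(x[n + tau])           # case count tau days ahead
    return np.array(inputs), np.array(targets)
```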

III Approaches

We use purely data-driven prediction approaches, without considering any other aspects such as models of infectious disease spread [folkhal]. We apply two approaches to analyze the data and make predictions, or in other words, to learn the function $f$:

  • Polynomial model approach: the simplest curve-fitting or approximation model, where the number of cases is approximated locally with polynomials, i.e., $f$ is a polynomial.

  • Neural network approach: a supervised learning approach that uses training data in the form of input-output pairs to learn a predictive model, i.e., $f$ is a neural network.

We describe each approach in detail in the following subsections.

III-A Polynomial model

III-A1 Model

We model the expected value of $x(n)$ as a third-degree polynomial function of the day number $n$:

$\hat{x}(n) = a_3 n^3 + a_2 n^2 + a_1 n + a_0.$

The set of coefficients $\{a_0, a_1, a_2, a_3\}$ is learned using the available training data. Given the highly non-stationary nature of the time-series, we consider local polynomial approximations of the signal over a window of $w$ days, instead of using all the data to estimate a single polynomial for the entire time-series. Thus, at the $n$-th day, we learn the corresponding polynomial using $x(n-w+1), \dots, x(n)$.

III-A2 How the model is used

Once the polynomial is determined, we use it to predict the number of cases for the $(n+\tau)$-th day as

$\hat{x}(n+\tau) = a_3 (n+\tau)^3 + a_2 (n+\tau)^2 + a_1 (n+\tau) + a_0.$

For every prediction, we construct the corresponding polynomial by using the most recent input data of size $w$, that is, $x(n-w+1), \dots, x(n)$. The appropriate window size $w$ is found through cross-validation.
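A minimal sketch of this local fit-and-extrapolate step under the notation above, using numpy's least-squares polynomial fit; the paper does not specify its fitting routine, so this is illustrative only.

```python
import numpy as np

def poly_predict(window, tau, degree=3):
    """Fit a third-degree polynomial to the last w samples and extrapolate tau days ahead.

    window : 1-D array holding x(n-w+1), ..., x(n) (log-scaled, as in the paper)
    tau    : prediction horizon in days
    """
    w = len(window)
    days = np.arange(w)                        # local day indices 0, ..., w-1
    coeffs = np.polyfit(days, window, degree)  # least-squares fit of the coefficients
    return np.polyval(coeffs, w - 1 + tau)     # evaluate tau days past the window
```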

III-B Neural networks

III-B1 Model

We use the extreme learning machine (ELM) as the neural network model to avoid overfitting to the training data. As the length of the time-series data for each country is limited, the number of training samples for the neural network is quite small, which can lead to severe overfitting in large-scale neural networks such as deep neural networks (DNNs) and convolutional neural networks (CNNs) [DNN_2013, CNN_2012]. ELM, on the other hand, is a single-layer neural network that uses random weights in its hidden layer [elm_HUANG2015]. The use of random weights has gained popularity due to its simplicity and effectiveness in training [giryes_randomweights, SSFN_2020, HNF_2020]. We now briefly describe ELM.

Consider a dataset containing $N$ samples of pair-wise $P$-dimensional input data and the corresponding $Q$-dimensional target vectors, given as $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N}$. We construct the feature vector as $\mathbf{z}_i = g(\mathbf{W}\mathbf{x}_i)$, where $\mathbf{W}$ is a random weight matrix and $g(\cdot)$ is the activation function applied element-wise.

To predict the target, we use a linear projection of the feature vector $\mathbf{z}_i$ onto the target. Let the predicted target for the $i$-th sample be $\hat{\mathbf{t}}_i = \mathbf{O}\mathbf{z}_i$, where $\mathbf{O}$ is the output matrix to be learned. By using $\ell_2$-norm regularization, we find the optimal solution of the following convex optimization problem:

$\mathbf{O}^{\star} = \arg\min_{\mathbf{O}} \ \sum_{i=1}^{N} \|\mathbf{t}_i - \mathbf{O}\mathbf{z}_i\|_2^2 + \lambda \|\mathbf{O}\|_F^2,$   (1)

where $\|\cdot\|_F$ denotes the Frobenius norm and $\lambda$ is the regularization hyperparameter. Once the matrix $\mathbf{O}^{\star}$ is learned, the prediction for any new input $\mathbf{x}$ is given by $\hat{\mathbf{t}} = \mathbf{O}^{\star} g(\mathbf{W}\mathbf{x})$.
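The training of the output matrix $\mathbf{O}$ amounts to a ridge-regression problem with a closed-form solution. A minimal sketch is given below, assuming a ReLU activation and Gaussian random hidden weights; the paper does not state which activation, weight distribution, or solver it uses.

```python
import numpy as np

def train_elm(X, T, n_hidden, lam, seed=0):
    """Train a single-hidden-layer ELM by solving the regularized problem in Eq. (1).

    X : (N, P) input matrix, one window of the time-series per row
    T : (N, Q) target matrix
    n_hidden : number of hidden neurons
    lam : l2 regularization hyperparameter (lambda in Eq. (1))
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_hidden, X.shape[1]))  # random, fixed hidden-layer weights
    Z = np.maximum(W @ X.T, 0.0)                     # hidden features z_i = g(W x_i), ReLU assumed
    # Closed-form ridge solution: O = T^T Z^T (Z Z^T + lam I)^{-1}
    O = T.T @ Z.T @ np.linalg.inv(Z @ Z.T + lam * np.eye(n_hidden))
    return W, O

def elm_predict(W, O, x):
    """Predict the target for a new input window x."""
    return O @ np.maximum(W @ x, 0.0)
```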

III-B2 How the model is used

When using ELM to predict the number of cases, we define the input as $\mathbf{x}_i = [x(n-w+1), \dots, x(n)]^\top$ and the target as $t_i = x(n+\tau)$. Note that $P = w$ and $Q = 1$ in this case. For a fixed $\tau$, we use cross-validation to find the proper window size $w$, the number of hidden neurons, and the regularization hyperparameter $\lambda$.
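One way this hyperparameter search could look is a plain grid search over $(w, \text{hidden size}, \lambda)$, reusing the earlier sketches; the validation split, grid values, and error criterion are illustrative assumptions, as the paper does not detail its cross-validation procedure.

```python
from itertools import product
import numpy as np

def cross_validate_elm(cases, tau, windows, hidden_sizes, lambdas, n_val=7):
    """Grid-search (w, n_hidden, lambda) for a fixed horizon tau.

    The last n_val windowed samples are held out for validation.
    """
    best, best_err = None, np.inf
    for w, h, lam in product(windows, hidden_sizes, lambdas):
        X, T = make_training_pairs(cases, w, tau)          # sketch from Sec. II
        X_tr, T_tr = X[:-n_val], T[:-n_val].reshape(-1, 1)
        X_va, T_va = X[-n_val:], T[-n_val:]
        W, O = train_elm(X_tr, T_tr, n_hidden=h, lam=lam)  # sketch from Sec. III-B1
        preds = np.array([elm_predict(W, O, x)[0] for x in X_va])
        err = np.mean(np.abs(preds - T_va))                # mean absolute validation error
        if err < best_err:
            best, best_err = (w, h, lam), err
    return best
```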

IV Experiments

IV-A With the available data till May 4, 2020

In this subsection, we make predictions based on the time-series data available until May 4, 2020. We estimate the number of cases for the last 31 days for each of the countries in Table I. For each value of $\tau$, we compare the estimated number of cases with the true value and report the estimation error in percentage, i.e.,

$e(n) = 100 \times \frac{|x(n) - \hat{x}(n)|}{x(n)}.$   (2)

We carry out two sets of experiments for each of the two approaches (polynomial and ELM) to examine their sensitivity to newly arriving training samples. In the first set of experiments, we carry out cross-validation to find the hyperparameters once, without using the newly observed samples of the time-series as we proceed through the 31-day span. In the second set of experiments, we carry out cross-validation daily as we observe new samples of the time-series. In the latter setup, the window size $w$ varies with time as the optimal hyperparameters are re-estimated each day. We refer to this setup as 'ELM time-varying' and 'Poly time-varying' in the rest of the manuscript.
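To make the daily re-validation concrete, the sketch below re-runs the hyperparameter search each day of the 31-day span before predicting, reusing the earlier illustrative helpers (make_training_pairs, cross_validate_elm, train_elm, elm_predict); the index bookkeeping and the inverse log transform are our own assumptions.

```python
import numpy as np

def evaluate_time_varying(cases, tau, windows, hidden_sizes, lambdas, span=31):
    """Re-run cross-validation every day over the last `span` days and
    record the daily error percentage of Eq. (2)."""
    errors = []
    for n in range(len(cases) - span, len(cases)):
        history = cases[: n - tau + 1]  # data observed up to the day the prediction is made
        w, h, lam = cross_validate_elm(history, tau, windows, hidden_sizes, lambdas)
        X, T = make_training_pairs(history, w, tau)
        W, O = train_elm(X, T.reshape(-1, 1), n_hidden=h, lam=lam)
        x_in = np.log1p(cases[n - tau - w + 1 : n - tau + 1])  # last w samples before the horizon
        pred = np.expm1(elm_predict(W, O, x_in)[0])            # undo the log transform
        errors.append(100 * abs(cases[n] - pred) / cases[n])   # error percentage, Eq. (2)
    return np.array(errors)
```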

We first show the reported and estimated number of infection cases for Sweden by using ELM time-varying for different values of $\tau$ in Figure 1. For each $\tau$, we estimate the number of cases up to $\tau$ days after the last day for which JHU data is available. In our later experiments, we show that ELM time-varying is typically more accurate than the other three methods (polynomial, Poly time-varying, and ELM). This better accuracy is consistent with the non-stationary behavior of the time-series data, or in other words, with the fact that the best model parameters change over time. Hence, the result of ELM time-varying is shown explicitly for Sweden. According to our experimental results, we predict that a total of 23039, 23873, and 26184 people will be infected in Sweden by May 5, May 7, and May 11, 2020, respectively.

Histograms of the error percentage of the four methods are shown in Figure 2 for different values of $\tau$. The histograms are calculated by using a nonparametric kernel-smoothing distribution over the past 31 days for all 12 countries. The daily error percentage for each country in Table I is shown in Figures 3-11. Note that the reported error percentage of ELM is averaged over 100 Monte Carlo trials. The average and the standard deviation of the error over 31 days are reported (in percentage) in the legend of each figure for all four methods. It can be seen that daily cross-validation is crucial to preserve a consistent performance throughout the pandemic, resulting in a more accurate estimate. In other words, the variations of the time-series as $n$ increases are significant enough to change the statistics of the training and validation sets, which, in turn, leads to different optimal hyperparameters as the length of the time-series grows. It can also be seen that ELM time-varying provides a more accurate estimate, especially for large values of $\tau$. Therefore, for the rest of the experiments, we only focus on ELM time-varying as our favored approach.

Another interesting observation is that the performance of ELM time-varying improves as $n$ increases. This observation is in line with the general principle that neural networks typically perform better as more data becomes available. We report the average error percentage of ELM time-varying over the last 10 days of the time-series in Table II. We see that as $\tau$ increases, the estimation error increases. When $\tau = 7$, ELM time-varying still works well for most of the countries, but it does not perform well for France and India. This poor estimation for a few countries could be due to a significant amount of noise in the time-series data, possibly caused by inaccurately reported daily cases.

V Conclusion

We studied the estimation capabilities of two well-known approaches to deal with the spread of the COVID-19 pandemic. We showed that a small-sized neural network such as ELM provides a more consistent estimation compared to its polynomial regression counterpart. We found that a daily update of the model hyperparameters is of paramount importance to achieve a stable prediction performance. The proposed models currently use only the samples of the time-series data to predict the future number of cases. A potential future direction to improve the estimation accuracy is to incorporate constraints such as infectious disease spread models, non-pharmaceutical interventions, and authority policies [folkhal].

TABLE II: Average estimation error (in percentage) over the last 10 days for ELM time-varying.

Country   1-day pred.   3-day pred.   7-day pred.
Sweden    0.9           2.6           2
Denmark   0.5           0.7           4.8
Finland   0.9           0.7           2.2
Norway    0.2           0.6           1.2
France    0.8           2             18.2
Italy     0.1           0.3           1.1
Spain     0.5           2.5           3.1
UK        0.3           1.3           3
China     0             0             0.2
India     0.7           2.1           8.8
Iran      0.1           0.2           0.6
USA       0.4           1.7           4.9
Fig. 1: Reported and estimated number of cases after $\tau$ days over the last 31 days of Sweden for ELM time-varying (panels (a)-(c)).
Fig. 2: Histogram of estimation error percentage over 31 days of all 12 countries for different values of $\tau$ (panels (a)-(c)).
Figs. 3, 6, 9: Daily error percentage of the last 31 days for ELM and polynomial regression, one fixed value of $\tau$ per figure; panels (a) Sweden, (b) Denmark, (c) Finland, (d) Norway.
Figs. 4, 7, 10: Daily error percentage of the last 31 days for ELM and polynomial regression, one fixed value of $\tau$ per figure; panels (a) France, (b) Italy, (c) Spain, (d) UK.
Figs. 5, 8, 11: Daily error percentage of the last 31 days for ELM and polynomial regression, one fixed value of $\tau$ per figure; panels (a) China, (b) India, (c) Iran, (d) USA.

References