COVID-19 Time-series Prediction by Joint Dictionary Learning and Online NMF

by Hanbaek Lyu, et al.

Predicting the spread and containment of COVID-19 is a challenge of utmost importance that the broader scientific community is currently facing. One of the main sources of difficulty is that only a very limited amount of daily COVID-19 case data is available. With few exceptions, the majority of countries are currently in the "exponential spread stage," so little information is available that would enable one to predict the phase transition between spread and containment. In this paper, we propose a novel approach to predicting the spread of COVID-19 based on dictionary learning and online nonnegative matrix factorization (online NMF). The key idea is to learn dictionary patterns of short evolution instances of the new daily cases in multiple countries at the same time, so that their latent correlation structures are captured in the dictionary patterns. We first learn such patterns by minibatch learning from the entire time-series and then further adapt them to the time-series by online NMF. As we progressively adapt and improve the learned dictionary patterns to the more recent observations, we also use them to make one-step predictions by partial fitting. Lastly, by recursively applying the one-step predictions, we can extrapolate our predictions into the near future. Our prediction results can be directly attributed to the learned dictionary patterns due to their interpretability.







I Introduction

The rapid spread of coronavirus disease (COVID-19) has had devastating effects globally. The virus first started to grow significantly in China and then in South Korea around January of 2020, and then had a major outbreak in European countries within the next month, and as of April the US alone has over 400,000 cases with over 12,000 deaths. Predicting the rapid spread of COVID-19 is a challenge of utmost importance that the broader scientific community is currently facing.

A conventional approach to this problem is to use compartmental models (see, e.g., [keeling2005networks, brauer2008compartmental]), which are mathematical models used to simulate the spread of infectious diseases governed by stochastic differential equations describing interactions between different compartments of the population (e.g. susceptible, infectious, and recovered). Namely, one may postulate a compartmental model tailored to COVID-19 and find optimal parameters for the model by fitting it to the available data. An alternative approach is to use data-driven machine learning techniques, especially deep learning algorithms [deng2013recent, bengio2013deep, deng2014tutorial], which have had great success in various problems including image classification, computer vision, and voice recognition [krizhevsky2012imagenet, boureau2010theoretical, hannun2014deep, amodei2016end].
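To make the model-fitting idea concrete, the following is a minimal sketch of the simplest compartmental model (SIR) with one parameter fitted to data. It is an illustration only: it uses a deterministic, discrete-time simplification rather than the stochastic differential equations mentioned above, and the function names, the grid search, and all numerical values are assumptions chosen for the example, not taken from the cited works.

```python
import numpy as np

def simulate_sir(beta, gamma, s0, i0, r0, days):
    """Deterministic SIR model advanced one day at a time (illustrative only)."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population, conserved by the update below
    traj = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / n
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        traj.append((s, i, r))
    return np.array(traj)  # shape (days + 1, 3): columns are S, I, R

def fit_beta(observed_infectious, gamma, s0, i0, r0, grid):
    """Pick the infection rate on `grid` that minimizes squared error to the data."""
    days = len(observed_infectious) - 1
    errors = [
        np.sum((simulate_sir(b, gamma, s0, i0, r0, days)[:, 1] - observed_infectious) ** 2)
        for b in grid
    ]
    return grid[int(np.argmin(errors))]

# Synthetic "observations" generated with hypothetical parameters beta=0.3, gamma=0.1.
truth = simulate_sir(0.3, 0.1, 9990, 10, 0, 40)
grid = np.linspace(0.1, 0.5, 41)
beta_hat = fit_beta(truth[:, 1], 0.1, 9990, 10, 0, grid)
```

In practice one would fit several parameters at once and account for noise in the observations; the grid search here only illustrates the "postulate a model, then fit its parameters" workflow.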

In this paper, we propose an entirely different approach to predicting the spread of COVID-19 based on dictionary learning (or topic modeling), which is a machine learning technique that is typically applied to text or image data in order to extract important features of a complex dataset so that one can represent said dataset in terms of a reduced number of extracted features, or dictionary atoms [steyvers2007probabilistic, blei2010probabilistic]. Although dictionary learning has seen a wide array of applications in data science, to the best of our knowledge this work is the first to apply such an approach to time-series data and time-series prediction.

Our proposed method has four components:



1. (Minibatch learning) Use online nonnegative matrix factorization (online NMF) to learn “elemental” dictionary atoms which may be combined to approximate short time evolution patterns of correlated time-series data.

2. (Online learning) Further adapt the minibatch-learned dictionary atoms by traversing the time-series data using online NMF.

3. (Partial fitting and one-step prediction) Progressively improve our learned dictionary atoms by online learning while concurrently making one-step predictions by partial fitting.

4. (Recursive extrapolation) By recursively using the one-step predictions above, extrapolate into the future to predict future values of the time-series.
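The four components above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a plain batch NMF with multiplicative updates stands in for online NMF, the online adaptation step is omitted, the data are synthetic, and all function names and parameter values (window length `k`, number of atoms `r`) are hypothetical.

```python
import numpy as np

def extract_patches(series, k):
    """Slice a (d, T) multivariate series into all length-k windows.
    Returns a (d*k, T-k+1) matrix whose columns are flattened windows."""
    d, T = series.shape
    return np.stack([series[:, t:t + k].reshape(-1) for t in range(T - k + 1)], axis=1)

def nmf(X, r, n_iter=300, seed=0, eps=1e-10):
    """Batch NMF X ~ W @ H by multiplicative updates (stand-in for online NMF)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], r)) + eps
    H = rng.random((r, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def nonneg_code(W, x, n_iter=300, eps=1e-10):
    """Approximate nonnegative code h for x ~ W @ h with the dictionary W fixed."""
    h = np.ones(W.shape[1])
    WtW, Wtx = W.T @ W, W.T @ x
    for _ in range(n_iter):
        h *= Wtx / (WtW @ h + eps)
    return h

def predict_one_step(W, recent, d, k):
    """Partial fitting: fit the code on the first k-1 days of each atom,
    then read the prediction off the last day of the atoms."""
    W3 = W.reshape(d, k, -1)                            # (series, day, atom)
    W_past = W3[:, :k - 1, :].reshape(d * (k - 1), -1)
    h = nonneg_code(W_past, recent.reshape(-1))
    return W3[:, k - 1, :] @ h                          # shape (d,)

def extrapolate(W, series, k, steps):
    """Recursive extrapolation: feed each one-step prediction back as input."""
    d = series.shape[0]
    out = series.copy()
    for _ in range(steps):
        nxt = predict_one_step(W, out[:, -(k - 1):], d, k)
        out = np.column_stack([out, nxt])
    return out

# Synthetic example: two correlated nonnegative series of 60 days.
t = np.arange(60)
series = np.vstack([100 / (1 + np.exp(-(t - 30) / 5)),
                    50 / (1 + np.exp(-(t - 35) / 6))])
k = 6
X = extract_patches(series, k)      # component 1: windows used for learning
W, H = nmf(X, r=4)                  # dictionary of r = 4 atoms
forecast = extrapolate(W, series, k, steps=7)  # components 3 and 4
```

Because every window stacks all series together, each atom encodes a joint short-time pattern, which is how correlations between the series enter the prediction.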

Our method enables us to learn dictionary atoms from a diverse collection of correlated time-series data (e.g. new daily cases of COVID-19, number of fatal and recovered cases, and degree of observance of social distancing measures). The learned dictionary atoms describe “elemental” short-time evolution patterns from the correlated data which may be superimposed to recover and even predict the original time-series data. Online Nonnegative Matrix Factorization is at the core of our learning algorithm, which continuously adapts and improves the learned dictionary atoms to newly arrived time-series data sets.
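As a rough illustration of how a dictionary can be adapted to newly arrived data, the sketch below performs one online update per incoming sample in the style of standard online dictionary learning: compute a nonnegative code for the new sample, update running averages of two aggregate matrices, then refresh each atom by a projected coordinate step. The names `online_nmf_step`, `A`, and `B`, the simple `1/t` averaging, and the synthetic data stream are assumptions made for this example; the exact algorithm and its guarantees are those of [lyu2019online], not of this sketch.

```python
import numpy as np

def nonneg_code(W, x, n_iter=200, eps=1e-10):
    """Approximate nonnegative code h for x ~ W @ h with W fixed."""
    h = np.ones(W.shape[1])
    WtW, Wtx = W.T @ W, W.T @ x
    for _ in range(n_iter):
        h *= Wtx / (WtW @ h + eps)
    return h

def online_nmf_step(W, A, B, x, t):
    """One online update after observing the t-th sample x (t = 1, 2, ...)."""
    h = nonneg_code(W, x)
    # Running averages of the aggregate statistics h h^T and h x^T.
    A = (1 - 1 / t) * A + (1 / t) * np.outer(h, h)   # shape (r, r)
    B = (1 - 1 / t) * B + (1 / t) * np.outer(h, x)   # shape (r, m)
    W = W.copy()
    for j in range(W.shape[1]):
        if A[j, j] > 1e-12:
            # Coordinate step on atom j, projected onto the nonnegative orthant.
            w = np.maximum(W[:, j] + (B[j] - W @ A[:, j]) / A[j, j], 0.0)
            # Keep each atom inside the unit ball so the dictionary stays bounded.
            W[:, j] = w / max(np.linalg.norm(w), 1.0)
    return W, A, B

# Synthetic stream: samples are nonnegative mixtures of 3 hidden atoms in dimension 8.
rng = np.random.default_rng(0)
m, r = 8, 3
W_true = rng.random((m, r))
W = rng.random((m, r))
W /= np.linalg.norm(W, axis=0)
A, B = np.zeros((r, r)), np.zeros((r, m))
for t in range(1, 201):
    x = W_true @ rng.random(r)
    W, A, B = online_nmf_step(W, A, B, x, t)
```

Each update touches only the new sample and two small matrices, so the cost per step does not grow with the length of the time-series, which is what makes continuous adaptation practical.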

Our proposed approach has a number of advantages that may complement some of the shortcomings of the more traditional model-based approach or the large-data-based machine learning approach. First, our method is completely model-free and applies to a wide range of data types, as the dictionary atoms learned directly from the data serve as the ‘model’ for prediction. Hence a similar method could be applied to predict not only the spread of the virus but also other related quantities, such as the spread of COVID-19 media information, medical and food supply shortages and demands, patient subgroup infections, immunity, and many more. Second, our method does not lose interpretability as some deep-learning-based approaches do, which is particularly important when making predictions in health-related areas. Third, our method is computationally efficient and can be executed on a standard personal computer in a short time. This enables our method to be applied in real time in an online setting to generate continuously improving predictions. Lastly, our method has a strong theoretical foundation based on the recent work [lyu2019online].

In this article, we demonstrate our general online NMF-based time-series prediction method on COVID-19 time-series data by learning a small number of fundamental time evolution patterns in the joint time-series of six countries in three case categories (confirmed/death/recovered) concurrently. Our analysis shows that we can indeed extract interpretable dictionary atoms for the short-time evolution of such correlated time-series and use them to obtain accurate short-time predictions with small variation. This approach could be further extended by incorporating other types of correlated time-series data sets that may contain nontrivial information on the spread of COVID-19 (e.g. time-series quantifying commodity, movement, and media data).

This paper is organized as follows. We first give a brief overview of dictionary learning by nonnegative matrix factorization and provide the full statement of our learning and prediction algorithms. We then describe the time-series data set of new COVID-19 cases and discuss a number of pre-processing methods for regularizing high fluctuations in the data set, followed by our data analysis scheme and simulation setup. In the subsequent subsections, from minibatch learning through prediction, we present our main simulation results. Finally, we conclude and suggest further directions.