Vector quantisation and partitioning of COVID-19 temporal dynamics in the United States

by Chris von Csefalvay, et al.

The statistical dynamics of a pathogen within a population depend on a range of factors: population density, the effectiveness of and investment into social distancing, public policy measures and non-pharmaceutical interventions (NPIs) are only some examples of factors that influence the number of cases over time by state. This paper outlines an analysis of time series vector quantisation and partitioning of COVID-19 cases in the United States, using a soft-DTW (Dynamic Time Warping) k-means clustering and a k-shape based clustering algorithm to identify internally consistent clusters of case counts over time. The identification of characteristic types of time-dependent variations can lead to the identification of patterns within sets of time series. This, in turn, can help discern the future of infectious dynamics in an area and, through identifying the most likely cluster-wise trajectory by calculating the cluster barycenter, inform public health decision-making.




1 Introduction

The emergence of SARS-CoV-2, and its associated viral syndrome COVID-19, has raised important questions about the ways we analyse and identify dynamic temporal processes. In particular, by identifying similarities in principal time-dependent indicators of epidemic dynamics, such as cumulative prevalence (the running total of confirmed cases over time), we can gain insight into similarities that are likely to emerge across various regions. Such similarities may be reflective of various hidden processes, be they related to the pathogen, to the response thereto or to various predisposing factors. By way of this, time series clustering has the potential to play a significant role in understanding the spatio-temporal factors governing the dynamic processes that drive an outbreak.

Vector quantisation and partitioning, or clustering, is the wider set of algorithms within unsupervised statistical learning that identify similar patterns among data in arbitrarily high-dimensional vector spaces, effectively taking a set of vectors $X$ in an $n$-dimensional vector space and assigning to each of these a label from the label set $L$, so that the assignment of each element of $X$ to the groups defined by the labels comprising $L$ minimises some objective function (typically referred to as the distance metric of the clustering). Clustering algorithms are widely used today and their practical applications are manifold, ranging from identifying clinical phenotypes in medicine and population health [1, 7, 13, 26, 28] through fraud detection [2, 12, 18, 20, 23] to image segmentation [4, 5, 10, 11, 17, 27].

Time series clustering presents a particular complication of this problem insofar as the subject of clustering is not a vector representing a single value, but rather a time series. These time series are typically not in synchrony, but rather exhibit a range of delays, lags and leads, and may depend on extrinsic and/or hidden variables. We may formulate the essential task of time series clustering as follows. Let $T = \{t_1, \ldots, t_N\}$ be a set of $N$ time series. Further, let $k$ denote the cardinality of the label set $L$ – in other words, the number of partitions we wish to split the data into, with $k \leq N$. Then, the mapping $f: T \to L$ is a clustering of the set of time series $T$ if it assigns to any element $t_i \in T$ one (and only one) label $l \in L$, so as to minimise an objective function (typically referred to in this context as a distance metric) within each cluster defined by its label.

This paper examines the use of two time series clustering algorithms – soft-DTW k-means clustering and k-shape clustering – to identify different patterns in COVID-19 prevalence in the continental United States, and compares the results of the classifiers for inter-classifier consistency. By isolating the barycenters of the time-shifted clusters, we can identify consistent patterns in prevalence dynamics across multiple states. This in turn can be used to quantify the overall effect of pre-existing characteristics, population dynamics and non-pharmaceutical interventions (NPIs) between states.

2 Methods

2.1 Source data

Source data for the 48 states of the continental United States was obtained from the Starschema COVID-19 Data Set [24] and filtered for confirmed case counts only. Data was loaded into Python 3.7 using pandas [14], and values were scaled using tslearn.preprocessing’s TimeSeriesScalerMeanVariance to $\mu = 0$ and $\sigma = 1$. The results of this transformed raw data set are laid out, by state, in Figure 1.
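The per-series scaling step can be illustrated with a minimal standard-library sketch. It is not the tslearn implementation: the function name `z_normalise`, the use of the population standard deviation and the two example series are assumptions made purely for illustration.

```python
from statistics import fmean, pstdev


def z_normalise(series):
    """Rescale one time series to mean 0 and population standard deviation 1."""
    mu = fmean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series has no variance to rescale; map it to all zeros.
        return [0.0 for _ in series]
    return [(value - mu) / sigma for value in series]


# Hypothetical case counts for two states on very different scales.
state_a = [10, 20, 40, 80, 160]
state_b = [1000, 2000, 4000, 8000, 16000]

scaled_a = z_normalise(state_a)
scaled_b = z_normalise(state_b)
```

After scaling, the two hypothetical series coincide, which is the point of the step: clustering then compares the shape of each trajectory rather than the absolute size of each state's outbreak.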

Figure 1: Scaled raw data of prevalence by state.

2.2 Soft-DTW k-means clustering

Since first described by Sakoe and Chiba (1978) [22], the dynamic time warping algorithm has been expressed in multiple formulations. The presentation below is based on Cuturi and Blondel’s 2017 paper introducing Soft-DTW, with the marginal difference of using $d$ instead of $\delta$ to represent the distance function [6].

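Given two time series $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_m)$ and a distance function $d$, we can derive the cost matrix

$$\Delta(x, y) = \left[ d(x_i, y_j) \right]_{ij} \in \mathbb{R}^{n \times m}$$

For the two above-mentioned series, we may describe the set of matrices of all possible alignments as $\mathcal{A}_{n,m}$, which is a strict subset of $\{0, 1\}^{n \times m}$. Then, DTW can be defined as the function that for any pair of time series identifies an alignment $A \in \mathcal{A}_{n,m}$ so as to minimise the inner product of $A$ with the cost matrix as

$$\mathrm{DTW}(x, y) = \min_{A \in \mathcal{A}_{n,m}} \langle A, \Delta(x, y) \rangle$$

Thus, DTW can be conceived of as a search task, in which $\mathcal{A}_{n,m}$ is the search space within which we search for an alignment $A$ given $x$ and $y$ so as to minimise the inner product $\langle A, \Delta(x, y) \rangle$.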
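In practice this search is not carried out by enumerating alignments but by dynamic programming over prefixes of the two series. The following is a minimal sketch under stated assumptions: univariate series, a squared-difference cost as the default distance function, and a hypothetical function name `dtw`.

```python
import math


def dtw(x, y, dist=lambda a, b: (a - b) ** 2):
    """Classical DTW cost between two univariate series via dynamic programming."""
    n, m = len(x), len(y)
    # r[i][j] holds the cheapest alignment cost of x[:i] against y[:j].
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            # An alignment may advance in x, in y, or in both at once.
            r[i][j] = cost + min(r[i - 1][j - 1], r[i - 1][j], r[i][j - 1])
    return r[n][m]
```

For example, `dtw([0, 1, 2], [0, 0, 1, 2])` is zero even though the second series lags the first by one step, which is the property that makes the measure useful for outbreaks that begin at different times in different states.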

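Soft-DTW universalises the notion underlying the DTW cost metric and the global alignment kernel metric

$$k_{\mathrm{GA}}^{\gamma}(x, y) = \sum_{A \in \mathcal{A}_{n,m}} e^{-\langle A, \Delta(x, y) \rangle / \gamma}$$

into a single metric [9]. Given the generalisation of the minimum metric with a smoothing factor $\gamma \geq 0$ as

$$\min{}^{\gamma} \{a_1, \ldots, a_n\} = \begin{cases} \min_{i \leq n} a_i & \gamma = 0 \\ -\gamma \log \sum_{i=1}^{n} e^{-a_i / \gamma} & \gamma > 0 \end{cases}$$

we may now define Soft-DTW as

$$\mathrm{dtw}_{\gamma}(x, y) = \min{}^{\gamma} \left\{ \langle A, \Delta(x, y) \rangle, \; A \in \mathcal{A}_{n,m} \right\} \tag{3}$$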
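Importantly, Soft-DTW – unlike the original DTW approach by Sakoe and Chiba [22] – is explicitly differentiable. In particular, as Saigo et al. (2006) noted [21], the gradient of Equation (3) can be calculated quite conveniently. Let $E_{\gamma}[A]$ be the average alignment matrix following the Boltzmann distribution $p_{\gamma} \propto e^{-\langle A, \Delta(x, y) \rangle / \gamma}$ for all $A \in \mathcal{A}_{n,m}$. Then,

$$E_{\gamma}[A] = \frac{1}{k_{\mathrm{GA}}^{\gamma}(x, y)} \sum_{A \in \mathcal{A}_{n,m}} e^{-\langle A, \Delta(x, y) \rangle / \gamma} A$$

and consequently

$$\nabla_x \mathrm{dtw}_{\gamma}(x, y) = \left( \frac{\partial \Delta(x, y)}{\partial x} \right)^{T} E_{\gamma}[A]$$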
This can be easily calculated using backward recursion, as described in Algorithm 2 of Cuturi and Blondel (2017) [6]. In addition, the notion of a clustering centroid can be generalised to the metric space comprising the time series to yield Fréchet means, also referred to in this context as barycenters. For a metric space $(M, d)$, $\bar{x} \in M$ is a Fréchet mean of the time series $x_1, \ldots, x_N \in M$

if it minimises the Fréchet variance, i.e.

$$\bar{x} = \arg\min_{p \in M} \sum_{i=1}^{N} d^{2}(p, x_i)$$
Thus, based on the soft-DTW metric laid out above, we can extract from the time series of COVID-19 cumulative incidence a clustering that iteratively minimises the within-cluster sum of squares (k-means clustering). For the purposes of this paper, soft-DTW clustering was performed using tslearn 0.4.1 [25] under Python 3.7, with a parameter of .
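To make the role of the smoothing parameter concrete, the soft-DTW value can be computed with a forward recursion in which the hard minimum over the three neighbouring cells is replaced by the smoothed minimum. The sketch below is illustrative only and is not the tslearn implementation: it assumes univariate series and a squared-difference cost, and the names `soft_min` and `soft_dtw` are hypothetical.

```python
import math


def soft_min(values, gamma):
    """Smoothed minimum; reduces to the ordinary minimum when gamma == 0."""
    lowest = min(values)
    if gamma == 0:
        return lowest
    # Shift by the minimum before exponentiating for numerical stability.
    total = sum(math.exp(-(v - lowest) / gamma) for v in values)
    return lowest - gamma * math.log(total)


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value for two univariate series with squared-difference cost."""
    n, m = len(x), len(y)
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min(
                (r[i - 1][j - 1], r[i - 1][j], r[i][j - 1]), gamma
            )
    return r[n][m]
```

With `gamma=0` the function returns the classical DTW cost. For larger values every alignment contributes to the result, so the value can fall below the DTW cost and even below zero; this smoothing is what makes the quantity differentiable.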

2.3 k-shape clustering

k-shape clustering is a novel, robust clustering algorithm for time series that relies on iteratively refining clusters, with cross-correlation as the underlying distance metric [16]. Specifically, k-shape relies on a normalised version of cross-correlation, referred to in this context as Shape-Based Distance (SBD): time series are Z-normalised (i.e. $\mu = 0$ and $\sigma = 1$), and the resulting cross-correlation sequence is divided by the geometric mean of the individual time series’ autocorrelations. In this sense, k-shape can be understood as a k-means clustering that uses a cross-correlation based metric $\mathrm{SBD}(x, y)$. Let $x_{(s)}$ be the series $x$ shifted, with zero-padding, by $s$, and the same be true for $y_{(s)}$ and $y$ respectively, mutatis mutandis. For two time series of equal length $x = (x_1, \ldots, x_m)$ and $y = (y_1, \ldots, y_m)$, we recursively define shift-wise cross-correlation for shifts $k$ in the range $-(m - 1) \leq k \leq m - 1$ as

$$R_k(x, y) = \begin{cases} \sum_{l=1}^{m-k} x_{l+k} \, y_l & k \geq 0 \\ R_{-k}(y, x) & k < 0 \end{cases}$$

Then, for the cross-correlation sequence, we obtain the cross-correlation for any value of $w \in \{1, \ldots, 2m - 1\}$ as

$$CC_w(x, y) = R_{w-m}(x, y)$$

Now, we can define the distance metric by

$$\mathrm{SBD}(x, y) = 1 - \max_{w} \frac{CC_w(x, y)}{\sqrt{R_0(x, x) \, R_0(y, y)}}$$
Because of the convolution theorem, which states that under certain conditions convolution in one domain of a time series (or more generally, any signal) is equivalent to elementwise multiplication in the other domain [15], we can efficiently compute $CC(x, y)$ by multiplying the discrete Fourier transform of one series with the complex conjugate of the discrete Fourier transform of the other, $\mathcal{F}(x) \cdot \mathcal{F}(y)^{*}$, where $^{*}$ is the complex conjugate operator [16]. Then, given the inverse discrete Fourier transform $\mathcal{F}^{-1}$,

$$CC(x, y) = \mathcal{F}^{-1} \left\{ \mathcal{F}(x) \cdot \mathcal{F}(y)^{*} \right\}$$

and as Paparrizos and Gravano showed [16], Fast Fourier Transforms allow this to be calculated efficiently in $O(m \log m)$ time rather than $O(m^2)$ time.
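The FFT-based computation and the resulting shape-based distance can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the k-shape reference implementation: the function names `cross_correlation_fft` and `sbd` are hypothetical, the series are assumed to be of equal length, and the example inputs are not Z-normalised.

```python
import numpy as np


def cross_correlation_fft(x, y):
    """Full cross-correlation sequence CC_w(x, y), w = 1..2m-1, via the FFT."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(x)
    n = 2 * m - 1  # zero-pad so the circular correlation does not wrap around
    spectrum = np.fft.rfft(x, n) * np.conj(np.fft.rfft(y, n))
    circular = np.fft.irfft(spectrum, n)
    # Negative shifts sit at the end of the circular result; move them first.
    return np.concatenate((circular[m:], circular[:m]))


def sbd(x, y):
    """Shape-based distance: 1 minus the peak normalised cross-correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denominator = np.sqrt(np.dot(x, x) * np.dot(y, y))
    return 1.0 - cross_correlation_fft(x, y).max() / denominator


# Hypothetical example: the same pulse, delayed by one time step.
pulse = [0, 1, 2, 1, 0, 0]
delayed_pulse = [0, 0, 1, 2, 1, 0]
distance = sbd(pulse, delayed_pulse)
```

Because the distance takes the maximum over all shifts, the delayed pulse is at (numerically) zero distance from the original, whereas a pointwise comparison would report a difference.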

Similarly to the cluster analysis carried out in Subsection 2.2, k-shape clustering was performed using the tslearn package’s clustering.KShape implementation, with an n_init setting at 16 iterations for centroid seeds, using the result with the lowest inertia, random initialization and a convergence tolerance of .

3 Results

3.1 Clustering time dynamics of disease prevalence

Figure 2: Mutually consistent clusters (rows) between the k-means and k-shape cluster algorithms. Data is time adjusted and barycenters are displayed in red.

After fitting the soft-DTW k-means and k-shape clustering models on the data set described in Section 2.1 with a label set cardinality (i.e. number of clusters) of 3, indicators for goodness of fit were obtained using sklearn.metrics. These show that the clustering is relatively sound. The silhouette scores (soft-DTW k-means: 0.249, k-shape: 0.276) indicate that, while there is some chance of an overlap, the clustering is a relatively good fit [19]. This is confirmed by a strong Variance Ratio Criterion (Calinski-Harabasz score) of 18.809 and 20.155 for soft-DTW k-means and k-shape, respectively [3]. As Figure 2 clearly indicates, there are three distinctly characterisable patterns based on the barycenters:

  1. Late peaking (k-means cluster 1, k-shape cluster 1): states in this cluster typically have a steady, consistent pattern affected only by weekly periodicities, and begin to surge around mid-June 2020.

  2. Early peaking (k-means cluster 2, k-shape cluster 2): states in this cluster display a rapid-onset initial peak in April to May 2020, thereafter tapering off.

  3. Bimodal (k-means cluster 3, k-shape cluster 3): within this cluster, states appear to exhibit a steady number of cases and the beginnings of a bimodal distribution over time, with a peak in April-May 2020 that subsides in June, then follows on to another rise in July and August.

Figure 2 highlights in red the barycenters or Fréchet means of the time series, expanding the notion of a centroid as a central tendency to the metric space of the time series. While the barycenters are different between the clustering algorithms (largely due to small differences in clustering, thus leading to different compositions for the barycenter calculation), they consistently identify the underlying pattern characteristic of the cluster. Notably, the barycenters calculated by the k-shape classification exhibit much stronger short-term (weekly) periodicity in all three clusters. At the same time, the second cluster’s abnormal peak in June is much less reflected in the barycenter based on the k-shape clustering than in the one based on the k-means clustering, and the k-shape cluster presents a barycenter with a much flatter ‘peak’ in mid-April than the k-means barycenter.

Figure 3: Combined time traces of k-means and k-shape classifications for the major consensus groups. Bimodal behaviour accounts for 21% of states, early-peaking behaviour covers 17% and late-peaking, ascending behaviour accounts for over half (56%) of states. Three states do not fall within the major consensus groups.

The distribution of time series (i.e. states) over the permutations of soft-DTW k-means and k-shape cluster assignments (see Figure 3) shows that the majority of states fall into matching soft-DTW k-means and k-shape categories, with only 3 states falling outside. Over half (56%) of states fall into the soft-DTW k-means cluster 3 and k-shape cluster 3, while 8 states (17%) fall into the soft-DTW k-means cluster 2 and k-shape cluster 2, and 9 states (21%) fall into the soft-DTW k-means cluster 1 and k-shape cluster 1. This distribution is displayed in the inter-classifier agreement matrix in Figure 4.

3.2 Cross-cluster agreement

Figure 4: Inter-classifier agreement between k-shape (k-shape) and soft-DTW k-means (kmeans) classification.

In order to ascertain cross-cluster agreement, the Adjusted Rand Index (ARI) was used to quantify consensus between the k-shape and soft-DTW k-means classifiers [8]. This index, first proposed by Hubert and Arabie in 1985, is symmetric; thus, it can be used to identify consensus between clusters with different metrics. At 0.864, the ARI indicates strong concurrence between the soft-DTW k-means and the k-shape classifiers.
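For readers unfamiliar with the index, the ARI can be computed directly from the contingency table of two labelings. The following standard-library sketch is illustrative: the function name `adjusted_rand_index` and the example labels are assumptions, not the paper's code or data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index (Hubert and Arabie, 1985) for two labelings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Count pairs of samples that share a cell, a row or a column.
    contingency = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(count, 2) for count in contingency.values())
    sum_rows = sum(comb(count, 2) for count in Counter(labels_a).values())
    sum_cols = sum(comb(count, 2) for count in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings place everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical labelings of six samples by two classifiers.
classifier_one = [0, 0, 1, 1, 2, 2]
classifier_two = [2, 2, 0, 0, 1, 1]
agreement = adjusted_rand_index(classifier_one, classifier_two)
```

The index depends only on which samples are grouped together, not on the label values themselves, so the two hypothetical classifiers above agree perfectly despite numbering their clusters differently. This is why it is suitable for comparing two algorithms whose cluster numbering is arbitrary.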

Cross-cluster agreement is illustrated in Figure 4. As is evident therefrom, over half of the states fall into the late-peaking (k-means cluster 3, k-shape cluster 3) category, with relatively few cases and no pronounced peaks until June 2020, after which the data evidences an oscillating but gradually increasing case count.

The strong cross-cluster agreement, covering 96% of all samples, indicates that despite their methodological differences, both the soft-DTW k-means clustering algorithm and the k-shape algorithm yield largely identical results when it comes to assigning states’ time series to clusters. The strong concurrence and favourable ARI indicate that the cluster assignments are unlikely to be artefactual results of the underlying algorithms but rather reflect truly significantly distinct groupings of states by their case count time series.

4 Discussion

k-shape and soft-DTW k-means classification strongly concur in identifying the three fundamental behavioural clusters of confirmed COVID-19 case count in the 48 states of the continental United States: a bimodal pattern, an early peaking pattern and a late, slower pattern that is largely stationary until approx. June 2020, then displays a rapid rise of cases.

Figure 5: Choropleth map of the United States displaying the permutations of k-shape and soft-DTW k-means clustering results by state.

The geographical distribution of these is worth noting. As Figure 5 shows, at the time of writing, most of the area of the continental United States follows the late peaking regime, and the calculated barycenters indicate these states are currently poised to experience further growth in case counts. Only a few states (green shades) have followed an early outbreak with a significant reduction in cases and no further resurgence, as may be considered evidence of successful mitigation/suppression efforts on their part. Finally, a number of states (blue shades) have experienced early outbreaks and are exhibiting a bimodal pattern, whereby an initial surge in April to late May 2020 has been followed not by successful suppression but a reduction followed by yet another rise in the number of reported cases of COVID-19.

As this paper has shown, time series clustering allows for finding commonalities between time series that are by necessity out of synchrony. In doing so, it can be helpful in illuminating geographical and regional patterns of disease dynamics. In particular, by using two different methods – a soft-DTW based, time-shifted k-means classifier and the correlation-based k-shape classifier – a significant consensus between such classifications has been demonstrated where the number of confirmed COVID-19 cases in the continental United States is concerned. This lends credence to the hypothesis that epidemic dynamics of COVID-19 follow three distinct temporal patterns. These are in all likelihood conditioned by a combination of spatio-temporal factors (position along the epidemic’s ‘wavefront’), mitigation measures such as NPIs, their relative effectiveness, as well as pre-existing factors of resilience and vulnerability.

Thus, by identifying the case count response, we can recognise different internally consistent clusters of case count progression over time. This may assist in understanding the governing patterns and dynamics of the SARS-CoV-2 pandemic, and assist in tailoring responses to the needs of individual areas and communities based on the temporal patterns of epidemic dynamics they exhibit.

Competing interests

The author declares no competing interests.

Supplementary data

All simulations, code and data are available on Github and under the DOI 10.5281/zenodo.3970209. Shape files for the choropleth diagram in Figure 5 have been obtained from the United States Census Bureau, and are included in the data set noted above.


References

  • T. Ahmad, M. J. Pencina, P. J. Schulte, E. O’Brien, D. J. Whellan, I. L. Piña, D. W. Kitzman, K. L. Lee, C. M. O’Connor, and G. M. Felker (2014) Clinical implications of chronic heart failure phenotypes defined by cluster analysis. Journal of the American College of Cardiology 64 (17), pp. 1765–1774. Cited by: §1.
  • T. K. Behera and S. Panigrahi (2015) Credit card fraud detection: a hybrid approach using fuzzy clustering & neural network. In 2015 Second International Conference on Advances in Computing and Communication Engineering, pp. 494–499. Cited by: §1.
  • T. Caliński and J. Harabasz (1974) A dendrite method for cluster analysis. Communications in Statistics – Theory and Methods 3 (1), pp. 1–27. Cited by: §3.1.
  • K. Chuang, H. Tzeng, S. Chen, J. Wu, and T. Chen (2006) Fuzzy c-means clustering with spatial information for image segmentation. Computerized Medical Imaging and Graphics 30 (1), pp. 9–15. Cited by: §1.
  • G. B. Coleman and H. C. Andrews (1979) Image segmentation by clustering. Proceedings of the IEEE 67 (5), pp. 773–785. Cited by: §1.
  • M. Cuturi and M. Blondel (2017) Soft-DTW: a differentiable loss function for time-series. arXiv preprint arXiv:1703.01541. Cited by: §2.2, §2.2.
  • P. Haldar, I. D. Pavord, D. E. Shaw, M. A. Berry, M. Thomas, C. E. Brightling, A. J. Wardlaw, and R. H. Green (2008) Cluster analysis and clinical asthma phenotypes. American Journal of Respiratory and Critical Care Medicine 178 (3), pp. 218–224. Cited by: §1.
  • L. Hubert and P. Arabie (1985) Comparing partitions. Journal of Classification 2 (1), pp. 193–218. Cited by: §3.2.
  • H. Janati, M. Cuturi, and A. Gramfort (2020) Spatio-temporal alignments: optimal transport through space and time. In International Conference on Artificial Intelligence and Statistics, pp. 1695–1704. Cited by: §2.2.
  • X. Jin, G. Xie, K. Huang, and A. Hussain (2018) Accelerating infinite ensemble of clustering by pivot features. Cognitive Computation 10 (6), pp. 1042–1050. Cited by: §1.
  • K. Lafata, Z. Zhou, J. Liu, and F. Yin (2018) Data clustering based on Langevin annealing with a self-consistent potential. arXiv preprint arXiv:1806.10597. Cited by: §1.
  • Q. Liu and M. Vasarhelyi (2013) Healthcare fraud detection: a survey and a clustering model incorporating geo-location information. In 29th World Continuous Auditing and Reporting Symposium (29WCARS), Brisbane, Australia, Cited by: §1.
  • C. Lochner, S. M. Hemmings, C. J. Kinnear, D. J. Niehaus, D. G. Nel, V. A. Corfield, J. C. Moolman-Smook, S. Seedat, and D. J. Stein (2005) Cluster analysis of obsessive-compulsive spectrum disorders in patients with obsessive-compulsive disorder: clinical and genetic correlates. Comprehensive Psychiatry 46 (1), pp. 14–19. Cited by: §1.
  • W. McKinney et al. (2011) Pandas: a foundational python library for data analysis and statistics. Python for High Performance and Scientific Computing 14 (9). Cited by: §2.1.
  • A. V. Oppenheim, J. R. Buck, and R. W. Schafer (2001) Discrete-time signal processing. vol. 2. Upper Saddle River, NJ: Prentice Hall. Cited by: §2.3.
  • J. Paparrizos and L. Gravano (2015) K-shape: efficient and accurate clustering of time series. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1855–1870. Cited by: §2.3, §2.3, §2.3.
  • T. N. Pappas and N. S. Jayant (1989) An adaptive clustering algorithm for image segmentation. In International Conference on Acoustics, Speech, and Signal Processing, pp. 1667–1670. Cited by: §1.
  • Y. Peng, G. Kou, A. Sabatka, Z. Chen, D. Khazanchi, and Y. Shi (2006) Application of clustering methods to health insurance fraud detection. In 2006 International Conference on Service Systems and Service Management, Vol. 1, pp. 116–120. Cited by: §1.
  • P. J. Rousseeuw (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, pp. 53–65. Cited by: §3.1.
  • A. S. Sabau (2012) Survey of clustering based financial fraud detection research. Informatica Economica 16 (1), pp. 110. Cited by: §1.
  • H. Saigo, J. Vert, and T. Akutsu (2006) Optimizing amino acid substitution matrices with a local alignment kernel. BMC Bioinformatics 7 (1), pp. 246. Cited by: §2.2.
  • H. Sakoe and S. Chiba (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1), pp. 43–49. Cited by: §2.2, §2.2.
  • S. Subudhi and S. Panigrahi (2017) Use of optimized fuzzy c-means clustering and supervised classifiers for automobile insurance fraud detection. Journal of King Saud University-Computer and Information Sciences. Cited by: §1.
  • F. Tamás and C. von Csefalvay (2020) Starschema COVID-19 data set. Cited by: §2.1.
  • R. Tavenard, J. Faouzi, G. Vandewiele, F. Divo, G. Androz, C. Holtz, M. Payne, R. Yurchak, M. Rußwurm, K. Kolar, and E. Woods (2020) Tslearn, a machine learning toolkit for time series data. Journal of Machine Learning Research 21 (118), pp. 1–6. Cited by: §2.2.
  • M. Weatherall, J. Travers, P. Shirtcliffe, S. Marsh, M. Williams, M. Nowitz, S. Aldington, and R. Beasley (2009) Distinct clinical phenotypes of airways disease defined by cluster analysis. European Respiratory Journal 34 (4), pp. 812–818. Cited by: §1.
  • Z. Wu and R. Leahy (1993) An optimal graph theoretic approach to data clustering: theory and its application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (11), pp. 1101–1113. Cited by: §1.
  • L. Ye, G. W. Pien, S. J. Ratcliffe, E. Björnsdottir, E. S. Arnardottir, A. I. Pack, B. Benediktsdottir, and T. Gislason (2014) The different clinical faces of obstructive sleep apnoea: a cluster analysis. European Respiratory Journal 44 (6), pp. 1600–1607. Cited by: §1.