1 Introduction
The emergence of SARSCoV2, and its associated viral syndrome COVID19, has raised important questions about the ways we analyse and identify dynamic temporal processes. In particular, by identifying similarities in principal timedependent indicators of epidemic dynamics, such as cumulative prevalence (the running total of confirmed cases over time), we can gain insight into similarities that are likely to emerge across various regions. Such similarities may be reflective of various hidden processes, be they related to the pathogen, to the response thereto or to various predisposing factors. By way of this, time series clustering has the potential to play a significant role in understanding the spatiotemporal factors governing the dynamic processes that drive an outbreak.
Vector quantisation and partitioning, or clustering, is the wider set of algorithms within unsupervised statistical learning that identify similar patterns among data in arbitrarily highdimensional vector spaces, effectively taking a set of vectors in an dimensional vector space and assigning to each of these a label from the label set , so that the assignment of each element of to the groups defined by the labels comprising minimise some objective function (typically referred to as the distance metric of the clustering). Cluster algorithms are widely used today and their practical applications are manifold, ranging from identifying clinical phenotypes in medicine and population health^{1, 7, 13, 26, 28} through fraud detection^{2, 12, 18, 20, 23} to image segmentation.^{4, 5, 10, 11, 17, 27}
Time series clustering presents a particular complication of this problem insofar as the subject of clustering is not a vector representing a single value, but rather a time series. These time series are typically not in synchrony, but rather exhibit a range of delays, lags and leads, and may depend on extrinsic and/or hidden variables. We may formulate the essential task of time series clustering as follows. Let be a set of time series . Further, let denote the cardinality of the label set – in other words, the number of partitions we wish to split the data into, with . Then, the mapping is a clustering of the set of time series if it assigns to any element one (and only one) label , so as to minimise an objective function (typically referred to in this context as a distance metric) within each cluster defined by its label.
This paper examines the use of two time series clustering algorithms – softDTW kmeans clustering and kshape clustering – to identify different patterns in COVID19 prevalence in the continental United States, and comparing the results of the classifiers for interclassifier consistency. By isolating the barycenters of the timeshifted clusters, we can identify consistent patterns in prevalence dynamics across multiple states. This in turn can be used to quantify the overall effect of preexisting characteristics, population dynamics and nonpharmaceutical interventions (NPIs) between states.
2 Methods
2.1 Source data
Source data for the 48 states of the continental United States was obtained from the Starschema COVID19 Data Set,^{24} and filtered only for confirmed case counts. Data was loaded into Python 3.7 using pandas,^{14} and values were scaled using tslearn.preprocessing’s TimeSeriesScalerMeanVariance to and . The results of this transformed raw data set are laid out, by state, in Figure 1.
2.2 SoftDTW kmeans clustering
Since first described by Sakoe and Chiba (1978),^{22} the dynamic time warping algorithm has been expressed in multiple formulations. The presentation below is based on Cuturi and Blondel’s 2017 paper introducing SoftDTW, with the marginal difference of using instead of to represent the distance function.^{6}
Given two time series and , there exists a cost matrix for the distance function , from which we can derive the cost matrix
(1) 
For the two abovementioned series, we may describe the set of matrices of all possible alignments as , which is a strict subset of . Then, DTW can be defined as the function that for any pair of time series identifies an alignment so as to minimise the inner product of A with the cost matrix as
(2) 
Thus, DTW can be conceived of as a search task, in which is the search space within which we search for an alignment given and so as to minimise the inner product .
SoftDTW universalises the notion underlying the DTW cost metric and the global alignment kernel metric
(3) 
into a single metric.^{9} Given the generalisation of the minimum metric with a smoothing factor as
(4) 
we may now define SoftDTW as
(5) 
Importantly, SoftDTW – unlike the original DTW approach by Sakoe and Chiba^{22} – is explicitly differentiable. In particular, as Saigo (2006) noted,^{21} the gradient of Equation (3) can be calculated quite conveniently. Let be the average alignment matrix following the Boltzmann distribution for all . Then,
(6) 
and consequently
(7) 
This can be easily calculated using backward recursion, as described in Algorithm 2 of Cuturi and Blondel (2017).^{6} In addition, the notion of a clustering centroid can be generalised to the metric space comprising the time series to yield Frêchet means, also referred to in this context as barycenters. For a metric space , is a Frêchet mean of order of the time series
if it minimises the Frêchet variance, i.e.
(8) 
Thus, based on the softDTW metric, laid out above, we can extract from the time series of COVID19 cumulative incidence a clustering that iteratively minimises the withincluster sum of squares (kmeans clustering). For the purposes of this paper, softDTW clustering was performed using tslearn 0.4.1^{25} using Python 3.7, with a parameter of .
2.3 kshape clustering
kshape clustering is a novel, robust clustering algorithm for time series that relies on iteratively refining clusters, with crosscorrelation as the underlying distance metric.^{16} Specifically, kshape relies on a normalised version of crosscorrelation, referred to in this context as Shape Base Distance (SBD): time series are Znormalised (i.e. and
), and the resulting crosscorrelation sequence is divided by the geometric mean of the individual time series’ autocorrelations. In this sense, kshape can be understood as a kmeans clustering that uses a crosscorrelation based metric
. Let be the seriesshifted, with zeropadding, by s, and the same be true for
respectively, mutatis mutandis. For two time series of equal length and , we recursively define shiftwise crosscorrelation for shifts in the range as(9) 
Then, for the crosscorrelation sequence, we obtain the crosscorrelation for any value of as
(10) 
Now, we can define the distance metric by
(11) 
Because of the convolution theorem, which states that under certain conditions convolution in one domain of a time series (or more generally, any signal) is equivalent to elementwise multiplication in the other domain,^{15} we can efficiently compute
by taking the complex conjugate of the discrete Fourier transform of each series
, where is the complex conjugate operator.^{16} Then, given the inverse discrete Fourier transform ,(12) 
and as Paparrizzos and Gravano showed,^{16} Fast Fourier Transforms allow this to be calculated efficiently in time rather than time.
Similarly to the cluster analysis carried out in Subsection
2.2, kshape clustering was performed using the tslearn package’s clustering.KShape implementation, with an n_init setting at 16 iterations for centroid seeds, using the result with the lowest inertia, random initialization and a convergence tolerance of .3 Results
3.1 Clustering time dynamics of disease prevalence
After fitting the softDTW kmeans and kshape clustering models on the data set described in Section 2.1 with a label set cardinality (i.e. number of clusters) of 3, indicators for goodness of fit were obtained using sklearn.metrics. These show that the clustering is relatively sound. The silhouette scores (softDTW kmeans: 0.249, kshape: 0.276) indicate while there is some chance of an overlap, the clustering is a relatively good fit.^{19} This is confirmed by a strong Variance Ratio Criterion (CalinskiHarabasz score) of 18.809 and 20.155 for softDTW kmeans and kshape, respectively.^{3} As Figure 2 clearly indicates, there are three distinctly characterisable patterns based on the barycenters:

Late peaking (kmeans cluster 1, kshape cluster 1): states in this cluster typically have a steady, consistent pattern affected only by weekly periodicities, and begin to surge around midJune 2020.

Early peaking (kmeans cluster 2, kshape cluster 2): states in this cluster display a rapidonset initial peak in April to May 2020, thereafter tapering off.

Bimodal (kmeans cluster 3, kshape cluster 3): within this cluster, states appear to exhibit a steady number of cases and the beginnings of a bimodal distribution over time, with a peak in AprilMay 2020 that subsides in June, then follows on to another rise in July and August.
Figure 2 highlights in red the barycenters or Frêchet means of the time series, expanding the notion of a centroid as a central tendency to the metric space of the time series. While the barycenters are different between the clustering algorithms (largely due to small differences in clustering, thus leading to different compositions for the barycenter calculation), they identify consistently the underlying pattern characteristic of the cluster. Notably, the barycenters calculated by the kshape classification exhibit much stronger shortterm (weekly) periodicity in all three clusters. At the same time, the second cluster’s abnormal peak in June is much less reflected in the barycenter based on the kshape clustering than it is on the kmeans cluster, and the kshape cluster presents a barycenter with a much flatter ’peak’ in midApril than the kmeans barycenter.
The distribution of time series (i.e. states) over the permutations of softDTW kmeans and kshape cluster assignments (see Figure 3) shows that the majority of states fall into matching softDTW kmeans and kshape categories, with only 3 states falling outside. Over half (56%) of states fall into the softDTW kmeans cluster 3 and kshape cluster 3, while 8 states (17%) fall into the softDTW kmeans cluster 2 and kshape cluster 2, and 9 states (21%) fall into the softDTW kmeans cluster 1 and kshape cluster 1. This distribution is displayed in the interclassifier agreement matrix in Figure 4.
3.2 Crosscluster agreement
In order to ascertain crosscluster agreement, the Adjusted Rand Index (ARI) was used to quantify consensus between the kshape and softDTW kmeans classifiers.^{8} This index, first proposed by Hubert and Arabie in 1985, is symmetric, thus it can be used to identify consensus between clusters with different metrics. At 0.864, the ARI indicates strong concurrence between the softDTW kmeans and the kshape classifiers.
Crosscluster agreement is illustrated in Figure 4. As it is evident therefrom, over half of the states fall into the latepeaking (kmeans cluster 3, kshape cluster 3) category, with relatively few cases and no pronounced peaks until June 2020, after which the data evidences an oscillating but gradually increasing case count.
The strong crosscluster agreement, covering 96% of all samples, indicates that despite their methodological differences, both the softDTW kmeans clustering algorithm and the kshape algorithm yield largely identical results when it comes to assigning states’ time series to clusters. The strong concurrence and favourable ARI indicate that the cluster assignments are unlikely to be artefactual results of the underlying algorithms but rather reflect truly significantly distinct groupings of states by their case count time series.
4 Discussion
kshape and softDTW kmeans classification strongly concur in identifying the three fundamental behavioural clusters of confirmed COVID19 case count in the 48 states of the continental United States: a bimodal pattern, an early peaking pattern and a late, slower pattern that is largely stationary until approx. June 2020, then displays a rapid rise of cases.
The geographical distribution of these is worth noting. As Figure 5 shows, at the time of writing, most of the area of the continental United States follows the late peaking regime, and the calculated barycenters indicate these states are currently poised to experience further growth in case counts. Only a few states (green shades) have followed an early outbreak with a significant reduction in cases and no further resurgence, as may be considered evidence of successful mitigation/suppression efforts on their part. Finally, a number of states (blue shades) have experienced early outbreaks and are exhibiting a bimodal pattern, whereby an initial surge in April to late May 2020 has been followed not by successful suppression but a reduction followed by yet another rise in the number of reported cases of COVID19.
As this paper has shown, time series clustering allows for finding commonalities between time series that are by necessity out of synchrony. In doing so, it can be helpful in illuminating geographical and regional patterns of disease dynamics. In particular, by using two different methods – a softDTW based, timeshifted kmeans classifier and the correlationbased kshape classifier –, a significant consensus between such classifications has been demonstrated where the number of confirmed COVID19 cases in the continental United States is concerned. This lends credence to the hypothesis that epidemic dynamics of COVID19 follow three distinct temporal patterns. These are in all likelihood conditioned by a combination of spatiotemporal factors (position along the epidemic’s ’wavefront’), mitigation measures such as NPIs, their reltive effectiveness, as well as preexisting factors of resilience and vulnerability.
Thus, by identifying the case count response, we can recognise different internally consistent clusters of case count progression over time. This may assist in understanding the governing patterns and dynamics of the SARSCoV2 pandemic, and assist in tailoring responses to the needs of individual areas and communities based on the temporal patterns of epidemic dynamics they exhibit.
Competing interests
The author declares no competing interests.
Supplementary data
All simulations, code and data are available on Github and under the DOI 10.5281/zenodo.3970209. Shape files for the choropleth diagram in Figure 5 have been obtained from the United States Census Bureau, and are included in the data set noted above.
References
 Clinical implications of chronic heart failure phenotypes defined by cluster analysis. Journal of the American College of Cardiology 64 (17), pp. 1765–1774. Cited by: §1.

Credit card fraud detection: a hybrid approach using fuzzy clustering & neural network
. In 2015 Second International Conference on Advances in Computing and Communication Engineering, pp. 494–499. Cited by: §1.  A dendrite method for cluster analysis. Communications in Statistics – Theory and Methods 3 (1), pp. 1–27. Cited by: §3.1.
 Fuzzy cmeans clustering with spatial information for image segmentation. Computerized Medical Imaging and Graphics 30 (1), pp. 9–15. Cited by: §1.
 Image segmentation by clustering. Proceedings of the IEEE 67 (5), pp. 773–785. Cited by: §1.

Softdtw: a differentiable loss function for timeseries
. arXiv preprint arXiv:1703.01541. Cited by: §2.2, §2.2.  Cluster analysis and clinical asthma phenotypes. American Journal of Respiratory and Critical Care Medicine 178 (3), pp. 218–224. Cited by: §1.
 Comparing partitions. Journal of Classification 2 (1), pp. 193–218. Cited by: §3.2.

Spatiotemporal alignments: optimal transport through space and time.
In
International Conference on Artificial Intelligence and Statistics
, pp. 1695–1704. Cited by: §2.2.  Accelerating infinite ensemble of clustering by pivot features. Cognitive Computation 10 (6), pp. 1042–1050. Cited by: §1.
 Data clustering based on Langevin annealing with a selfconsistent potential. arXiv preprint arXiv:1806.10597. Cited by: §1.
 Healthcare fraud detection: a survey and a clustering model incorporating geolocation information. In 29th World Continuous Auditing and Reporting Symposium (29WCARS), Brisbane, Australia, Cited by: §1.
 Cluster analysis of obsessivecompulsive spectrum disorders in patients with obsessivecompulsive disorder: clinical and genetic correlates. Comprehensive Psychiatry 46 (1), pp. 14–19. Cited by: §1.
 Pandas: a foundational python library for data analysis and statistics. Python for High Performance and Scientific Computing 14 (9). Cited by: §2.1.
 Discretetime signal processing. vol. 2. Upper Saddle River, NJ: Prentice Hall. Cited by: §2.3.
 Kshape: efficient and accurate clustering of time series. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1855–1870. Cited by: §2.3, §2.3, §2.3.
 An adaptive clustering algorithm for image segmentation. In International Conference on Acoustics, Speech, and Signal Processing, pp. 1667–1670. Cited by: §1.
 Application of clustering methods to health insurance fraud detection. In 2006 International Conference on Service Systems and Service Management, Vol. 1, pp. 116–120. Cited by: §1.
 Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, pp. 53–65. Cited by: §3.1.
 Survey of clustering based financial fraud detection research. Informatica Economica 16 (1), pp. 110. Cited by: §1.
 Optimizing amino acid substitution matrices with a local alignment kernel. BMC Bioinformatics 7 (1), pp. 246. Cited by: §2.2.
 Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1), pp. 43–49. Cited by: §2.2, §2.2.
 Use of optimized fuzzy cmeans clustering and supervised classifiers for automobile insurance fraud detection. Journal of King Saud UniversityComputer and Information Sciences. Cited by: §1.
 Starschema covid19 data set External Links: Document, Link Cited by: §2.1.

Tslearn, a machine learning toolkit for time series data
. Journal of Machine Learning Research 21 (118), pp. 1–6. External Links: Link Cited by: §2.2.  Distinct clinical phenotypes of airways disease defined by cluster analysis. European Respiratory Journal 34 (4), pp. 812–818. Cited by: §1.
 An optimal graph theoretic approach to data clustering: theory and its application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (11), pp. 1101–1113. Cited by: §1.
 The different clinical faces of obstructive sleep apnoea: a cluster analysis. European Respiratory Journal 44 (6), pp. 1600–1607. Cited by: §1.
Comments
There are no comments yet.