I Introduction
Many biomedical measurements, including pain, are taken at irregular intervals. Within a hospital environment, for example, patient vital signs are measured based on hospital staff schedule and patient availability [1]. For selfreported data collected through a mobile app [2], the irregularity may be even greater. Unevenly sampled data present a unique challenge for analysis, because the data do not allow for direct onetoone comparison of measurements [3]. This becomes even more challenging when the sampling is also sparse [4].
If this challenge can be overcome, however, doctors and patients may benefit through a deeper understanding of the different ways in which patients can experience complex diseases [5] (e.g., by identifying patient subgroups sharing particular dynamics). In the present study, we consider methods for aligning sparsely and unevenly sampled data into a common coordinate system, applied to both synthetic data and a real data of patients’ selfreported pain over time, and seek to identify a classification for the ways that patients with sickle cell disease experience pain.
Other examples of problems with unevenly sampled data comprise topics as varied as the brightness of celestial bodies [6], gene expression in cells [7], bidding in online auctions [8], and paleoclimate reconstruction [9]
. Approaches to handling unevenly sampled data are similarly varied, including recurrent neural networks
[6], Gaussian processes [4], modelbased clustering [10], and frequencydomain analysis
[9]. Here, we focus on ways to align unevenly sampled data into consistent coordinates as a means to generalize standard clustering techniques to unevenly sampled data.The realworld data considered in the present study measures the longterm pain dynamics of individuals with sickle cell disease. Sickle cell disease (SCD) is a cardiovascular disease affecting over 5 million people worldwide in which rigid misshapen red blood cells can occlude blood vessels, causing pain and eventual organ damage [11]. Pain is the most common symptom associated with SCD, through either acute pain events or through chronic pain. More than 50% of patients have at least one hospitalization each year for pain, and 1% of patients have 6 or more [12, 13]. Chronic pain is clinically defined as pain on more than half of the days over the previous 6 month period; about 30% of adults with SCD are diagnosed with chronic pain [14]. Beyond the classification of chronic pain, there do not currently exist descriptions of the different ways in which patients’ pain varies over time.
The SMART app was developed to monitor and interact with patients with sickle cell disease [2]. A key feature of the app included selfreporting of pain scores on a scale of 010. Participants were asked to report pain scores twice a day, but actual participation varied significantly. In all, we consider data from 39 individuals with various data volumes and frequencies; the same data have been used previously to develop a hybrid statistical and mechanistic model for the relationship between pain and medication [15]
. Within this data set, the 39 participants reported their pain an average of 67.2 times over an average of 164.6 days. These pain scores show how patients’ experiences with pain changed over time. In the present study, we investigate whether it is possible to classify patients’ experiences with pain over time based only on their reported pain scores. This may be particularly helpful in the future, as promising studies have shown the potential to infer subjective pain from physiological data from wearable devices and electronic health records
[1, 16].To discover whether classes of patients with different typical pain dynamics exist, we wish to generalize existing clustering methods to unevenly and sparsely sampled data. We propose and compare four different methods for aligning the data into common coordinates: two interpolationbased approaches (nearestneighbor and linear) and two based on the distribution of reported data.
Unfortunately, existing methods for irregularly sampled are not applicable to our problem. Approaches based on leastsquares spectral analysis (e.g., the LombScargle (LS) periodogram [17, 18]
) are oriented toward frequencydomain analysis and can’t be applied directly to clustering, especially of sparsely sampled trajectories. Model based estimators
[19] would require us to propose a model for pain dynamics before clustering.The simple interpolation methods we test have the disadvantage that they require strong assumptions about what happens between measurements of reported pain, but they do permit standard time series analysis techniques. The distribution methods do not require any assumptions about pain dynamics between measurements (beyond the assumption of a stationary process), but they lose the information contained in the temporal ordering of pain reports. We compare these four approaches by applying spectral clustering and evaluating the results through a variety of cluster metrics described below.
Ii Methodology
To cluster pain trajectories into groups, one needs a measure of distance or similarity between two trajectories, which is particularly challenging when they do not have a consistent number of measurements or times at which measurements are taken. This challenge is highlighted in Figure 1, which shows three example trajectories as they evolve in time. Most distance metrics rely on data being of a consistent dimension, and so the key step of this work is that of data alignment, transforming the data into a common space for similarity measures.
Iia Preprocessing
We first window the data into twoweek periods. This windowing was done to increase the number of trajectories and to standardize their lengths. The length of 14 days was chosen based on clinical experience to be short enough that pain dynamics are approximately invariant over that time period, but long enough to ensure that acute pain events can be fully captured within a trajectory. Windows containing fewer than three pain recordings are removed from the analysis.
We also include synthetic data for comparison with the real data. This synthetic data set is designed to mimic the irregularly sampled behavior of the real data, but have a predictable clustering structure. We designed the data set to have three clusters of trajectories centered around pain values of 2, 5, and 8 to provide three distinct groups. To generate the synthetic data, the number of measurements is first taken from a uniform distribution between 3 and 28. Then, the timestamp for each of those measurements is drawn from a uniform distribution between 0 and 14 days and the measurements are sorted. Finally, the pain value at each measurement is drawn from a normal distribution centered around the cluster center value with unit standard deviation. Values above 10 or below 0 are set to the boundary value.
IiB Data alignment
After windowing the data into two week groups, the data are aligned into common coordinates for clustering. Four different data alignment methods were considered, which are described here. These alignment approaches are also highlighted in Figure 2.
In Nearestneighbor interpolation and linear interpolation, the 14 day window is divided into regularly spaced 30 minute increments, and values between entries are interpolated to either equal the nearest measurement or linearly between measurements, respectively. Linear interpolation assumes constant values outside of the first and last measurements. These interpolations are densely sampled to ensure that the data from closelyspaced measurements are not lost.
Bulk statistical properties of reported pain levels
: For the first data transformation, we consider properties of the distribution of reported pain levels within each sample. Properties considered include the mean, standard deviation, skew, kurtosis, fraction of entries where pain is 0, fraction of entries where pain is high (
), and mean slope of the trajectory. Each bulk property is then normalized by dividing all entries by the maximum magnitude, so that each property scales from either 0 to 1 or from 1 to 1. This way, no single property will dominate the distance between trajectories. Figure 2 shows these properties before transformation.Histograms of reported pain levels
: Another approach involves directly comparing the histograms of reported pain values. The reported pain values are binned into 20 evenly spaced bins and normalized by the total number of reported values within each trajectory. In calculating the distance between two histograms, we use the cumulative distribution functions. To understand why, consider two constant trajectories at pain levels of 4 and 5. Their histograms would each be entirely contained within a single bin, and the distance between the histograms would be large (assuming a norm like
or ), even though the trajectories are very similar. Their cumulative distributions, on the other hand, would be step functions which increase from 0 to 1 at 4 and 5 respectively. The distance between these histograms would be small, as desired.The different transformations of the data provide vastly different representations, and each approach could potentially be justified. The transformations also present the data in wildly different dimensionalities, with the bulk properties approach having just 7 dimensions, the histogram approach having 20, and the interpolations representing the data in 672 dimensions. The histogram approach loses all time information contained in the data, and the bulk properties approach contains time information only in the average slope. However, the interpolation approaches make strong assumptions about what is happening between pain entries within each sample—assumptions likely (based on clinical experience) to be unjustified.
IiC Clustering
To cluster the transformed trajectories into groups, we first constructed a matrix of similarities between trajectories, where the similarity, is defined as,
where is the distance between points and and are the minimum and maximum of all distances, respectively. The Euclidean distance is used for the bulk properties as well as the nearestneighbor and linear interpolated trajectories. The Manhattan distance (i.e., the distance, the sum of the absolute values of differences) was applied to the cumulative distribution functions from the histograms in order to calculate the integrated area between curves. These similarities form a dense, weighted network of trajectories, where nearby trajectories have similarity near 1, and trajectories which are further apart have a similarity approaching 0.
We apply spectral clustering [20, 21]
to the normalized similarity matrix between trajectories. In spectral clustering, kmeans clustering
[22]is performed on normalized eigenvalues of the graph Laplacian
[22] of this similarity network, which allows for clusters of different sizes to be more easily identified. The similarity matrices for the synthetic data and the real data, ordered into clusters, are shown in Figures 3 and 4, respectively.IiD Evaluation
For each data alignment, the clustering process was repeated over an increasing number of specified clusters, and the results of those clusterings were evaluated by the Dunn index [23] and the DaviesBouldin index [24]. The Dunn index and DaviesBouldin index are cluster validity measures used to evaluate the quality of a clustering.
The Dunn index measures the ratio of the minimum distance between cluster centers to the cluster size, given by the maximum mean distance between samples within a single cluster:
(1) 
where is the mean distance of points in cluster to the cluster center , is the number of elements in cluster , and is the distance between cluster centers. A larger Dunn index indicates a better clustering; the distance between clusters is larger than the biggest cluster.
The DaviesBouldin index is similar, but instead calculates the size of each cluster, , and the intercluster distance between each pair of clusters, , and then calculates the DaviesBouldin index with,
(2) 
where is the number of clusters. The DaviesBouldin index gives the worst ratio of cluster size to intercluster distance for each cluster, and then takes the average over all clusters. A lower DaviesBouldin index indicates that the cluster sizes are small relative to the intercluster distance for all pairs of clusters, and therefore lower values of represent better clusterings. For the real data, the relevance of clusters we discovered was validated based on the clinical experience of coauthor Shah.
Iii Results
Iiia Clustering results
By evaluating the clustering metrics for each data alignment method, we can directly compare them. Figure 3 shows the clustered similarity matrix for the optimal number of clusters according to the DaviesBouldin index, as well as the Dunn and DaviesBouldin indices for each number of clusters up to 6 for the synthetic data. Figure 4 shows the same comparison for the real data.
The first notable result from Figure 3 is that 3 clusters was not identified to be the optimal number of clusters for all data alignment methods applied to the synthetic data. Both the bulk properties approach and the nearestneighbor interpolation identify that the data should be partitioned into two clusters, according to the DaviesBouldin index. The DaviesBouldin index for the nearestneighbor interpolation of synthetic data is nearly identical for 2 and 3 clusters, so the approach still seems reasonable. The Dunn index is also highest at 3 clusters for nearestneighbor interpolation.
However, for the bulk properties approach, neither the similarity matrix nor the DaviesBouldin index shows the three cluster behavior that we would expect. This is likely due to the contrast between the construction of the synthetic data and the statistical properties selected. Each synthetic sample is given the same standard deviation and is expected to be approximately constant in time, the means of the three clusters are all on the edges of the middle (rather than all the way at the extreme), and the samples are drawn from normal distributions, so we would expect an approximately constant kurtosis and zero skewness. Therefore, within the bulk statistical properties selected, the data should mostly vary in the mean, which is just 1 out of 7 dimensions. The normalization methods used may also have the effect of amplifying noise in this case. Despite these issues, however, the identification of three clusters is only marginally worse than two according to either the DB or the Dunn index.
The real data (see Figure 4) also show a disagreement in optimal number of clusters among the data alignment methods. Both the bulk properties approach and the histogram yield 3 clusters according to both cluster validation measures, although they disagree on the sizes of the clusters. The interpolation approaches yield 2 clusters. Both of the interpolation schemes yield only slightly higher DaviesBouldin indices for 3 clusters than for 2 clusters. The Dunn index, on the other hand, is clearly higher for 2 clusters than for 3.
IiiB Cluster behaviors for real data
To show the lessons learned from clustering, we will select one set of clusters and investigate what the groupings say about the data. Although there is some ambiguity in the optimal number of clusters, a clustering of 3 groups gives more information than a clustering of 2, allowing for more distinctions between individual trajectories. Further, even in situations where 2 clusters is optimal, the improvement from 2 over 3 is smaller than the improvement from 3 clusters over 4, so 3 clusters seems to be a reasonable choice. Here, we will describe in more detail the clusters identified by the histogram method in Figure 4. Figure 5 shows the histograms of reported pain data and the time dynamics of the identified classes.
Class I represents the low pain group, comprising 47 trajectories with a mean of 1.12 and a standard deviation of 1.57. Patients in this group tend to report no pain most of the time, with occasional low to medium pain entries. This class shows 12 entries of individuals reporting a very high pain of 8 and 9, as shown in Figure 5.
Class II represents the medium pain group. It is the largest group, with 128 trajectories and a mean and standard deviation of 4.22 and 2.12. This group has the highest standard deviation, and individual trajectories often have large variation over time: patients in this group tend to have highly variable pain, usually nonzero.
Class III represents the highest pain group, with a mean and standard deviation of 8.57 and 1.23 from 46 trajectories. Patients in this group tend to have persistent high levels of pain. Nearly all measurements lie between 8 and 10.
IiiC Comparison with chronic pain diagnosis
Next, we compare the groups themselves with the diagnosis of chronic pain. For our purposes, this diagnosis is defined by whether or not the patients have been prescribed longacting opioids. Chronic pain patients comprised 22 out of 47 (46.8%) trajectories in Class I, 107 out of 128 (83.6%) trajectories in Class II, and 39 out of 46 (84.8%) trajectories in Class III.
Iv Discussion
When data are sparsely and irregularly sampled in time, it can be difficult to make clear decisions about the way to handle clustering. Here, we present four different methods for transforming the data into a consistent dimensional space, so that trajectories may be directly compared in order to calculate the distance or similarity between trajectories.
The methods we present do not agree on the optimal number of clusters, even for the case of wellseparated synthetic data. The distributionbased approaches lose time information, instead focusing on the distribution of entries, but make no assumptions about what happens between entries. The interpolation approaches do carry time information and allow for traditional time series methods, but make strong assumptions about what happens between entries which may or may not be accurate. For instance, it is possible that some patients only chose to enter pain when they took pain medication, which would imply that their pain should be significantly lower between entries. Unfortunately, unless we apply a specific mechanistic model for pain dynamics, we have no way of knowing what happens between measurements. Some promising work has been done on developing such mechanistic models [15], but more work remains to be done to establish better models of pain dynamics.
The classes discussed in Section IIIB and shown in Figure 5 give an interesting comparison with the diagnosis of chronic pain. Both Class II and Class III have more than 80% of the trajectories from patients with chronic pain, according to their medication, but the pain dynamics are substantially different. Within Class III, patients experience persistent extreme pain, with very low standard deviation. On the other hand, within Class II, patients have a lot of variation in the pain they experience, both within the cluster and within individual trajectories. These patients generally experience frequent pain, and therefore would fall under the definition of chronic pain. However, their pain varies a lot from day to day, and is generally not as extreme as the patients within Class III.
A recent study classified patient pain into groups through Latent Dirichlet Analysis of patient survey responses [25]. In their findings, the authors described three groups: (1) patients with low pain in both the past 7 days and past 6 months, (2) patients with moderate pain over the past 7 days and high pain over the past 6 months, and (3) patients with high pain over both the past 7 days and the past 6 months. These groups align nicely with the three classes discovered in the present study. Some patients experience relatively little pain, some patients experience elevated pain levels but not consistently, and some patients experience persistently high pain.
V Conclusions
In the present study, we have applied unsupervised learning to the pain dynamics of patients with sickle cell disease, as selfreported by patients using a mobile app. The nature of data collection led to the challenge of aligning irregularly sampled data. We have compared several data alignment methods applied to the pain dynamics data as well as to synthetic data, and shown that the choices made in aligning the data may significantly affect the results of clustering. For pain associated with sickle cell disease, three clusters appears to be the most reasonable choice, and we have presented a detailed view of the classes discovered.
More fundamental theoretical work is clearly needed to carefully quantify the robustness and confidence level for proposed clusterings in cases like these, where data is sparse and irregularly sampled. We speculate that, given the everincreasing reserve of medical timeseries data, automated grouping of patients into similarity clusters will have great clinical value in the future.
Acknowledgment
The authors would like to thank the the National Institutes of Health for support through grant R01AT010413.
References

[1]
S. Padhee, A. Alambo, T. Banerjee, A. Subramaniam, D. M. Abrams, G. K. Nave,
and N. Shah, “Pain intensity assessment in sickle cell disease patients
using vital signs during hospital visits,” in
International Conference on Pattern Recognition
. Springer, 2021, pp. 77–85.  [2] C. R. Jonassaint, N. Shah, J. Jonassaint, and L. De Castro, “Usability and feasibility of an mHealth intervention for monitoring and managing pain symptoms in sickle cell disease: The sickle cell disease mobile application to record symptoms via technology (SMART),” Hemoglobin, vol. 39, no. 3, pp. 162–168, 2015.
 [3] D. M. Kreindler and C. J. Lumsden, “The effects of the irregular sample and missing data in time series analysis.” Nonlinear dynamics, psychology, and life sciences, 2006.

[4]
Z. Liu and M. Hauskrecht, “Learning adaptive forecasting models from
irregularly sampled multivariate clinical data,” in
Proceedings of the AAAI Conference on Artificial Intelligence
, vol. 30, no. 1, 2016.  [5] P. Schulam, F. Wigley, and S. Saria, “Clustering longitudinal clinical marker trajectories from electronic health data: Applications to phenotyping and endotype discovery,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015.

[6]
B. Naul, J. S. Bloom, F. Pérez, and S. van der Walt, “A recurrent neural network for classification of unevenly sampled variable stars,”
Nature Astronomy, vol. 2, no. 2, pp. 151–155, 2018.  [7] C. MöllerLevet, F. Klawonn, K.H. Cho, H. Yin, and O. Wolkenhauer, “Clustering of unevenly sampled gene expression timeseries data,” Fuzzy Sets and Systems, vol. 152, no. 1, pp. 49–66, 2005, fuzzy Sets in Bioinformatics.
 [8] J. Peng and H.G. Müller, “Distancebased clustering of sparsely observed stochastic processes, with applications to online auctions,” The Annals of Applied Statistics, vol. 2, no. 3, pp. 1056–1077, 2008.
 [9] K. Rehfeld, N. Marwan, J. Heitzig, and J. Kurths, “Comparison of correlation analysis techniques for irregularly sampled time series,” Nonlinear Processes in Geophysics, vol. 18, no. 3, pp. 389–404, 2011.
 [10] G. M. James and C. A. Sugar, “Clustering for sparsely sampled functional data,” Journal of the American Statistical Association, vol. 98, no. 462, pp. 397–408, 2003.
 [11] D. C. Rees, T. N. Williams, and M. T. Gladwin, “Sicklecell disease,” The Lancet, vol. 376, no. 9757, pp. 2018–2031, 2010.
 [12] O. S. Platt, B. D. Thorington, D. J. Brambilla, P. F. Milner, W. F. Rosse, E. Vichinsky, and T. R. Kinney, “Pain in sickle cell disease: rates and risk factors,” New England Journal of Medicine, vol. 325, no. 1, pp. 11–16, 1991.
 [13] Z. J. Wang, D. J. Wilkie, and R. Molokie, “Neurobiological mechanisms of pain in sickle cell disease,” Hematology 2010, the American Society of Hematology Education Program Book, vol. 2010, no. 1, pp. 403–408, 2010.
 [14] A. M. Brandow, K. J. Zappia, and C. L. Stucky, “Sickle cell disease: a natural model of acute and chronic pain,” Pain, vol. 158, no. Suppl 1, p. S79, 2017.
 [15] S. M. Clifton, C. Kang, J. J. Li, Q. Long, N. Shah, and D. M. Abrams, “Hybrid statistical and mechanistic mathematical model guides mobile health intervention for chronic pain,” Journal of Computational Biology, vol. 24, no. 7, pp. 675–688, 2017.
 [16] M. J. Panaggio, D. M. Abrams, F. Yang, T. Banerjee, and N. R. Shah, “Can subjective pain be inferred from objective physiological data? evidence from patients with sickle cell disease,” PLOS Computational Biology, vol. 17, no. 3, pp. 1–20, 03 2021.
 [17] J. D. Scargle, “Studies in astronomical time series analysis. iistatistical aspects of spectral analysis of unevenly spaced data,” The Astrophysical Journal, vol. 263, pp. 835–853, 1982.
 [18] J. T. VanderPlas, “Understanding the lomb–scargle periodogram,” The Astrophysical Journal Supplement Series, vol. 236, no. 1, p. 16, 2018.
 [19] W. Harteveld, R. Mudde, and H. Van den Akker, “Estimation of turbulence power spectra for bubbly flows from laser doppler anemometry signals,” Chemical engineering science, vol. 60, no. 22, pp. 6160–6168, 2005.
 [20] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 4, pp. 395–416, 2007.

[21]
A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in
Advances in neural information processing systems, 2002, pp. 849–856.  [22] M. Newman, Networks. Oxford university press, 2018.
 [23] J. C. Dunn, “Wellseparated clusters and optimal fuzzy partitions,” Journal of cybernetics, vol. 4, no. 1, pp. 95–104, 1974.
 [24] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE transactions on pattern analysis and machine intelligence, no. 2, pp. 224–227, 1979.
 [25] M. R. Knisely, P. J. Tanabe, Q. Yang, R. Masese, M. Jiang, and N. R. Shah, “Severe pain profiles and associated sociodemographic and clinical characteristics in individuals with sickle cell disease,” The Clinical journal of pain, July 2021.
Comments
There are no comments yet.