Internal hemorrhage, also known as internal bleeding, is defined as the leakage of blood from the circulatory system into surrounding tissue and neighboring body cavities [ho2005rapid], and is the “most frequent complication of a major surgery” [lozano2012global]. It causes an estimated 1.9 million deaths worldwide annually [lozano2012global; falck2018deephemorrhage]. Survivors of hemorrhage suffer long-term adverse outcomes, such as multiple organ failure and an increased mortality rate [mitra2014long; halmin2016epidemiology]. Thus, analyzing hemorrhage and its underlying physiological patterns could provide clinicians with a better understanding of the progression of a bleed event, inform effective interventions, and help reduce adverse outcomes.
Machine learning can provide a framework to harness the high-dimensional, raw time series data produced by vital sign monitoring equipment and yield valuable medical insights. For example, [falck2018deephemorrhage] first showed that hemorrhage can be reliably predicted from raw, multivariate vital sign data using recurrent and convolutional neural networks, although these were generally inferior to a random forest classifier trained on handcrafted statistical features [nagpal2019hemo]. [zambrano2015detection] demonstrated that a random forest classifier trained on features derived from the shapes of arterial blood pressure waveforms can detect bleeding. [li2018graph] predicted hemorrhagic shock in pigs from pre-bleed blood draw data by analyzing Graphs of Temporal Constraints (GTC) [guillame2017classification].
While prior work has focused on rapid detection of hemorrhage and survival prediction, research on understanding the underlying changes in physiological responses associated with blood loss is lacking. Supervised learning techniques cannot be used to discover these patterns, since ground truth data on physiological states is unavailable. Unsupervised learning techniques, however, allow us to examine unlabeled physiological data and may enable the discovery of such state changes. [lei2017learning] demonstrated that such patterns could be found in CVP through correlation clustering. The observed clusters, however, correspond to broad physiological states: one cluster corresponded mostly to the pre-bleed phase, a second cluster took over after bleeding started, and a third cluster appeared even further into the bleed. We therefore argue for a finer-grained grouping, in particular one based on multiple vital signs.
In this paper, we propose an embedding algorithm using a state-of-the-art deep unsupervised dilated, causal convolutional encoder model [franceschi2019unsupervised] to find informative embeddings from continuous time series of six vital signs recorded during hemorrhage. We use agglomerative clustering to obtain groups (clusters) of latent embeddings that may correspond to different physiological states the model detects during the observation period. A schematic overview of our methodological pipeline is shown in Fig. 1. Our contributions are two-fold: (1) an embedding algorithm with a novel sampling methodology (described in more detail in Appendix A.3) that encodes high-dimensional, raw vital sign data into lower-dimensional embeddings, and (2) an analysis of the hypothesized underlying physiological response patterns of subjects through clustering of the embeddings.
Our data consists of vital sign measurements of 16 healthy pigs (the study protocol was approved by the University of Pittsburgh IACUC). We used pigs because their hemorrhage data is more readily available, and expert opinion indicates that their response to hemorrhage is similar to that of humans. The pigs are anesthetized and allowed to rest for a 30-minute period to establish a stable baseline. Following this baseline period, the pigs are bled at a fixed rate until they reach a mean arterial pressure (MAP) of 40 mmHg. This rate mimics what might be expected in a difficult-to-detect, occult post-surgical bleed. We used six physiologic measurements (captured at 250 Hz): aortic, pulmonary arterial, and central venous blood pressures (ART, PAP, and CVP, respectively), electrocardiogram (ECG), photoplethysmograph from a pulse oximeter, and airway pressure. Physicians’ intuition indicates that these signals may contain important semantic information about the physiological status of the pigs. In addition, blood draws for laboratory testing were performed on each pig regularly throughout the entire experiment. The data collection methodology follows [pinsky1984instantaneous].
2.2 Causal Dilated Convolutional Neural Network
For our deep unsupervised embedding model, we use a convolutional encoder as proposed by [franceschi2019unsupervised], which is inspired by the Temporal Convolutional Networks (TCN) of [bai2018empirical]. The encoder obtains meaningful embeddings that perform well on time series classification and regression tasks and trains significantly faster than a traditional RNN encoder-decoder model [franceschi2019unsupervised]. We choose this model over traditional statistical feature extraction since our goal is to automatically discover interesting patterns with no assumptions about the underlying hemorrhage data. The generic dilated, causal convolution architecture is depicted in Appendix A.1, Fig. 3. We train this model in an unsupervised fashion with triplet loss (details provided in Appendix A.2). Our sampling algorithm extracts positive, negative, and reference samples, as shown in Appendix A.3 (Fig. 4 and Algorithm 1).
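To make the encoder’s core operation concrete, the following is a minimal numpy sketch of a single causal dilated convolution; the actual encoder stacks many such layers with exponentially increasing dilations, and the kernel values and sizes here are illustrative only, not taken from the paper’s implementation.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    """Causal dilated 1-D convolution: output[t] depends only on
    x[t], x[t - dilation], ..., x[t - (k-1)*dilation] (never on the future)."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so no future leakage
    out = np.zeros_like(x, dtype=float)
    for t in range(len(x)):
        # taps at x[t], x[t - dilation], ... in the padded signal
        taps = xp[t + pad - np.arange(k) * dilation]
        out[t] = np.dot(kernel, taps)
    return out

# stacking layers with dilations 1, 2, 4, ... grows the receptive field
# exponentially while keeping each layer causal
signal = np.arange(8, dtype=float)
y = causal_dilated_conv1d(signal, kernel=np.array([1.0, 1.0]), dilation=2)
```

Causality is what lets the encoder process a vital sign stream online: the embedding at time t never peeks at measurements recorded after t.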
To obtain embeddings of the bleed sequences, we split the time series hemorrhage data of each subject into nonoverlapping windows of 600 timesteps, yielding one embedding of the vital sign sequence for every 2.4 seconds. The window length is a hyperparameter; we found that 600 timesteps provided reasonable computation times while yielding interesting results. In total, we obtain ~70,000 time windows. Furthermore, we explore two different training methodologies: (1) we let the model discriminate between subjects by sampling training windows across different subjects (allowing it to consider both intra-subject and inter-subject differences), and (2) we restrict sampling of training windows to within a subject (allowing it to learn only intra-subject differences).
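The windowing step can be sketched as follows (function and variable names are ours, not from the paper’s code):

```python
import numpy as np

FS = 250          # sampling rate (Hz), as in the paper
WINDOW = 600      # timesteps per window -> 600 / 250 = 2.4 s

def split_windows(x, window=WINDOW):
    """Split a (T, channels) array into nonoverlapping (window, channels)
    chunks, discarding any incomplete trailing window."""
    n = len(x) // window
    return x[: n * window].reshape(n, window, x.shape[1])

# e.g. one hour of 6-channel data at 250 Hz
data = np.random.randn(3600 * FS, 6)
windows = split_windows(data)
# windows.shape == (1500, 600, 6)
```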
2.3 Clustering and Evaluation
Since we have no ground truth, validating these embeddings is difficult. However, by clustering the data along the time dimension, we can evaluate the embeddings indirectly by comparing the sequential clusters with clinical rationale. We use agglomerative clustering with Ward linkage [mullner2011modern]. Intuitively, adjacent time windows within one subject’s data should belong to the same cluster. We also look for consistency in the order of the clusters over time across subjects. In an attempt to explain these clusters, we use a random forest classifier and a two-layer fully connected network to predict cluster labels from explainable features: the mean, median, standard deviation, range, and power transforms of the vital signs. Finally, we show the robustness of our encoder through 4-fold cross-validation (by subject) in Fig. 5.
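A minimal sketch of this evaluation pipeline, using scipy’s Ward linkage on toy embeddings in place of the real encoder outputs (the data, dimensions, and feature set below are illustrative stand-ins, not the paper’s actual values):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy embeddings: two well-separated groups, a hypothetical stand-in
# for the encoder's latent vectors
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (50, 8)),
                 rng.normal(5, 0.1, (50, 8))])

# agglomerative clustering with Ward linkage, cutting the tree at 2 clusters
Z = linkage(emb, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

def explainable_features(window):
    """Per-channel summary statistics used to probe what a cluster encodes."""
    return np.concatenate([
        window.mean(axis=0),
        np.median(window, axis=0),
        window.std(axis=0),
        window.max(axis=0) - window.min(axis=0),   # range
    ])
```

In the paper’s setting, the cluster labels produced this way become the targets that the random forest and two-layer network try to predict from the explainable features.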
Fig. 2 shows our latent embedding clusters plotted over time and split by subject. When clustering with 2-3 groups ((a) and (b)), we see a cluster that corresponds to the healthy state (light blue) shifting into another cluster that corresponds to an unhealthy (bleeding) state (brown and dark blue). If the algorithm considers both inter-subject and intra-subject differences (a), using just two clusters is impractical, as one cluster fits entirely to the laboratory blood draws (the clusters near the black dots). This may be because the model has to learn extra information compared to the model considering only intra-subject differences. One interesting observation is that the transition from the first cluster to the second is not immediate; the delay in the observed physiologic state is presumably because self-defense response mechanisms triggered by the subject’s physiology early in the hemorrhage make this change subtler to detect.
When clustering with 10 groups ((c) and (d)), we obtain many clusters, yet they overlap little in time (compared to 11 or 12 groups). We are also able to discern differences in cluster progression throughout the bleed between subjects; however, all subjects generally end up in the same eventual state. This follows the clinical intuition that hemodynamic presentations among subjects tend to show substantial heterogeneity while they are stable but become more homogeneous as the stress escalates. Additionally, we are able to detect blood draws: they show up as noise (the cluster that corresponds with the black dots) occurring regularly, approximately every 30 minutes, in the plot of the latent clusters.
Another interesting observation is that even when trained to distinguish differences within a single subject at a time, the model still learns that some subjects are more similar to each other in the initial state than others. We also found that hemodynamic responses are not universal across subjects, as some subjects skip clusters or start their response to the bleed in different clusters than others. For example, subjects 10 and 12 start off in the same cluster, but subject 11 is in a different cluster for both sampling methods in Fig. 2 (c) and (d). We also see that subjects can pass through as many as five and as few as two distinct clusters in the process.
The confusion matrix for the random forest classifier in Fig. 7 shows that the classifier best predicted the blue and orange clusters in Fig. 2 (c) and (d), which correspond to the start and end of the vital sign sequences for both training methodologies. However, the classifier was unable to predict the intermediate clusters with high accuracy, while the simple neural network was. This may suggest that the encoder and the two-layer network learned more sophisticated non-linear relationships between the features than the random forest did. Considering both intra-subject and inter-subject differences looks better in terms of noise in Fig. 2 (c) and also in mean cluster prediction accuracy. This may be because the model extracts more useful information from the vital signs, producing a more separated latent space, when allowed to consider differences between subjects. Additionally, our model is robust: Fig. 5 shows that encodings predicted on unseen data are generally clustered similarly to those of the training subjects, indicating that the model learns consistent, potentially meaningful embeddings.
We presented a proof-of-concept method for discerning patterns of hemodynamic stress response in raw vital sign data. We found clusters that generally correspond to the start, intermediate phase, and end of the bleed. Additionally, we found that the initial clusters usually vary among subjects while they are stable but become more homogeneous as the subjects undergo escalating stress, which corresponds to clinical intuition. When considering only 2 clusters, we found that the shift between the first and second cluster does not occur immediately after the start of the bleed. This makes sense, as a significant shift in vital signs (which may represent a substantial shift in hemodynamic response) should not be immediate given the slow rate of bleed. Further research is necessary to validate the identified clusters, including more quantitatively or theoretically rigorous evaluation metrics for the clusters we observed.
Appendix A
A.1 Additional Methodology
The full dataset consists of health metrics of 93 pigs in total. These pigs are separated into four groups, which are bled at different rates: 60 mL/min, 20 mL/min, 5 mL/min, and 0 mL/min, respectively. Each pig was monitored for 11 synchronized vital signs at 250 Hz: arterial and venous blood pressures (CVP; fluid-filled and Millar arterial pressure; pulmonary pressure), arterial and venous oxygen saturations (SpO2, SvO2), EKG, plethysmograph, CCO, stroke volume variation (Vigileo), and airway pressure. The data collection methodology is similar to that of [pinsky1984instantaneous].
Additionally, we explored other clustering techniques that we could not show due to page constraints. The results we obtained from these methods were not substantially different from the graphs shown previously in Fig. 2. With the time embeddings, the repetition of clusters over time practically disappeared. We explored:
Clustering methods (all implemented in scikit-learn [scikit-learn])
Weighted time embeddings (scaled to be less than 2 standard deviations of the embeddings) added to the latent embeddings. The type of time embedding is taken from attention transformers [vaswani2017attention]. Note that adding time information allowed training without considering intra-subject and inter-subject differences to avoid repeated clusters over time, but there is no guarantee that the model is not overfitting to the time embeddings.
No time information added
Adding time information for the full length of the sequence
Adding time information only from the start of bleed (since we know the exact location of the start of bleed from the physician annotations, we add temporal information only to the embeddings obtained after the subject starts bleeding; the pre-bleed embeddings are left alone. This effectively adds information about the amount of blood lost, since the bleed rate is constant after the subject starts bleeding).
64-, 128-, and 256-dimensional latent embeddings. We chose 128-dimensional embeddings since their clusters looked best: the 64-dimensional model was unable to learn more than 3 clusters per subject, and the 256-dimensional model was too noisy.
Training schemes. These can be implemented through the sampling procedure in Algorithm 1 by passing in various batch sizes.
Allowing the encoder to discriminate between subjects (i.e., sampling negative examples from different subjects’ bleed time series within the batch, which allows the model to learn inter-subject differences as well as differences in the raw data over time).
Disallowing the encoder to discriminate between subjects (negative examples are sampled only from the same subject, restricting the model to learning differences in the raw data over time).
Number of clusters: in addition to all of these methods, we also explored 11 different numbers of clusters passed to the clustering algorithms (from 2 to 12).
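The transformer-style time embedding described above can be sketched as follows; the scaling rule that keeps the embedding within two standard deviations of the latent vectors is our interpretation of the description in the first bullet.

```python
import numpy as np

def sinusoidal_time_embedding(positions, dim):
    """Transformer-style positional encoding (Vaswani et al., 2017):
    pe[p, 2i]   = sin(p / 10000^(2i/dim))
    pe[p, 2i+1] = cos(p / 10000^(2i/dim))"""
    positions = np.asarray(positions, dtype=float)[:, None]
    i = np.arange(0, dim, 2, dtype=float)
    angles = positions / np.power(10000.0, i / dim)
    pe = np.zeros((len(positions), dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# scale the embedding before adding it to the latent vectors
latent = np.random.randn(100, 128)
pe = sinusoidal_time_embedding(np.arange(100), 128)
scale = 2 * latent.std() / (np.abs(pe).max() + 1e-8)  # stay within ~2 std
augmented = latent + scale * pe
```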
A.2 Triplet Loss
We train our encoder with triplet loss as specified by [franceschi2019unsupervised]. Triplet loss is a loss function with a natural intuition as its basis: similar things should have embeddings that are closer together than embeddings of dissimilar things. This is reflected in the following loss function. Let $f$ be our encoder that obtains latent vectors from the time series data. Let $x^{\mathrm{ref}}$ be the reference time series, $x^{\mathrm{pos}}$ a positive time series example, and $x^{\mathrm{neg}}_k$ a negative time series example (see Alg. 1 and Fig. 4). Let $K$ be the number of negative samples to take. Then, the loss is shown in Eqn. 1:

$$-\log\left(\sigma\left(f(x^{\mathrm{ref}})^{\top} f(x^{\mathrm{pos}})\right)\right) - \sum_{k=1}^{K} \log\left(\sigma\left(-f(x^{\mathrm{ref}})^{\top} f(x^{\mathrm{neg}}_{k})\right)\right) \quad (1)$$

where $\sigma$ denotes the sigmoid function.
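A minimal numpy sketch of this triplet loss, following the formulation of [franceschi2019unsupervised], with toy two-dimensional embeddings standing in for the encoder outputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def triplet_loss(f_ref, f_pos, f_negs):
    """Triplet loss of Franceschi et al. (2019): pull the positive embedding
    toward the reference, push each of the K negatives away.
    f_ref, f_pos: (d,) embeddings; f_negs: (K, d) negative embeddings."""
    loss = -np.log(sigmoid(f_ref @ f_pos))          # positive term
    for f_neg in f_negs:                             # K negative terms
        loss -= np.log(sigmoid(-(f_ref @ f_neg)))
    return loss

ref = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])       # similar to ref -> small penalty
negs = np.array([[-1.0, 0.0]])   # dissimilar to ref -> small penalty
```

The loss is small when the reference and positive embeddings have a large inner product and the reference and negatives have a small one, which is exactly the geometry the clustering step later exploits.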
Triplet loss is popular in natural language processing, e.g., word2vec [mikolov2013distributed], as it is effective for training unsupervised models that obtain latent vectors encoding semantic meaning from words. [franceschi2019unsupervised] demonstrated that it is also useful for unsupervised learning of embeddings of general multivariate time sequences.
A.3 Sampling Methodology
We use a modified version of the sampling algorithm of [franceschi2019unsupervised] to obtain choices of the reference $x^{\mathrm{ref}}$, positive example $x^{\mathrm{pos}}$, and negative examples $x^{\mathrm{neg}}_k$. It differs from the original implementation in that the negative samples are chosen randomly only from regions with no overlap with the reference time series; this guarantees that a negative example cannot also be a positive example. This should allow the model to learn better, as there is a clearer difference between positive and negative samples. Algorithm 1 and Fig. 4 show the proposed sampling algorithm for one gradient update. Once our samples have been taken, we pass them into our dilated causal CNN to obtain embeddings and then update the weights of the network using the loss function in Eqn. 1.
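The non-overlap constraint can be sketched in numpy as follows; the window-length distributions here are placeholders of ours (including the cap on the reference length, which guarantees a non-overlapping negative always exists), while the actual choices follow Algorithm 1.

```python
import numpy as np

def sample_triplet(series_len, rng):
    """Sketch of the modified sampling: a reference window, a positive
    subwindow inside it, and a negative window guaranteed not to overlap
    the reference. Window-length distributions are illustrative only."""
    # reference window (capped at a third of the series so a same-length
    # negative window always fits somewhere outside it)
    ref_len = int(rng.integers(2, series_len // 3 + 1))
    ref_start = int(rng.integers(0, series_len - ref_len + 1))
    # positive: a subwindow of the reference
    pos_len = int(rng.integers(1, ref_len + 1))
    pos_start = ref_start + int(rng.integers(0, ref_len - pos_len + 1))
    # negative: same length as the positive, drawn strictly outside the reference
    left = np.arange(0, ref_start - pos_len + 1)           # fits before ref
    right = np.arange(ref_start + ref_len, series_len - pos_len + 1)
    neg_start = int(rng.choice(np.concatenate([left, right])))
    return (ref_start, ref_len), (pos_start, pos_len), (neg_start, pos_len)
```

Because the negative start positions are drawn only from the two regions flanking the reference, a negative window can never coincide with a positive one, which is the property the modification is meant to enforce.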
A.4 Additional Results
See Fig. 6 for a clustering with K-means. Fig. 5 shows a 4-fold cross-validation demonstrating the robustness of our method, Table 1 shows an example of the accuracies that we use for the confusion matrix, and Fig. 7 shows the confusion matrices for our random forest and two-layer neural network models.
A.5 Future Work
In future work, we aim to explore other encoder architectures, such as variational autoencoders [fabius2014variational; vincent2008extracting] and BERT [devlin2018bert]. We could also find clusters or separations using Hidden Markov Models (HMMs) or change point detection methods known from multivariate forecasting. Layer-wise Relevance Propagation [samek2017explainable], a method that explains a model’s output in terms of its input, could allow us to interpret the latent embeddings in terms of the raw input time series. Finally, our goal is to have physicians carefully analyze the validity of the different clusters that we find.