The performance of a predictive model is largely dependent on the availability of training data. As of 2014, a substantial share of invasive, therapeutic surgeries in the US took place in hospitals with medium or small numbers of beds (Steiner et al., 2017; Wen, 2016). These institutions may lack sufficient data or computational resources to train accurate models. Further, patient privacy considerations mean larger institutions are unlikely to publicly release their patients' data, leaving many institutions on their own. In the face of this insufficiency, one natural way to train performant models is transfer learning, which has already shown success for medical images as well as clinical text (Tajbakhsh et al., 2016; Ravishankar et al., 2016; Lv et al., 2014). Yet transfer learning remains underexplored for physiological signals, even as wearable sensors for health monitoring become widespread (Majumder et al., 2017) and physiological signals account for a significant portion of the hundreds of petabytes of currently available worldwide health data (Roski et al., 2014; Orphanidou, 2019).
Parallels with computer vision (CV) and natural language processing (NLP), two exemplars of representation learning, suggest that physiological signals are well suited to neural network embeddings (i.e., transformations of the original inputs into a space more suitable for making predictions). In particular, CV and NLP share two notable traits with physiological signals. The first is consistency. For CV, the domain has consistent features: edges, colors, and other visual attributes (Raina et al., 2007; Shin et al., 2016). For NLP, the domain is a particular language with semantic relationships consistent across bodies of text (Conneau et al., 2017). For sequential signals, we argue that physiological patterns are consistent across individuals. The second attribute is complexity. Each of these domains is complex enough that learning embeddings is non-trivial. Together, consistency and complexity suggest that distinct research scientists spend time learning embeddings that may ultimately be quite similar. In order to avoid this negative externality, NLP and CV have made great progress on standardizing and evaluating their embeddings; in health, physiological signals are a natural next step.
Furthermore, physiological signals have unique properties that make them better suited to representation learning. First, physiological signals are prevalent in health data, which is constrained by patient privacy concerns. These concerns make sharing data between hospitals difficult; however, sharing models between hospitals does not directly expose patient information. Second, a key component to successful transfer learning is a community of researchers that work on related problems. According to Faust et al. (2018), there were at least 53 research publications using deep learning methods for physiological signals in the past ten years. We discuss additional examples of neural networks for physiological signals in Section 1.1.
Broadly, the goal of this manuscript is to address heterogeneous domain adaptation in a physiological signal setting. Domain adaptation is a field that aims to transfer models trained in a source domain to a target domain with a different underlying distribution (i.e., distribution divergence) (Daumé III, 2009). Heterogeneous domain adaptation seeks to address the same problem, but the source and target domains may have different sets of features. This setting is of particular importance for physiological signal data, because different hospitals often measure different physiological signals.
Work discussing domain adaptation for healthcare applications exists: Choi et al. (2016) investigated graph-based attention models for representation learning in healthcare, and Choi et al. (2017) investigated multi-layer representation learning for medical concepts. These methods do not focus on physiological signals; instead, they focus primarily on medical codes and concepts from temporally ordered electronic health record (EHR) visit data. Alternatively, Wiens et al. (2014) investigate transfer learning in a cross-hospital setting, with the assumption of heterogeneity of features across hospitals. This approach differs from ours in that they do not utilize physiological signals and focus on linear classifiers instead of deep networks. In another related piece of work, Gupta et al. (2018) investigate transferring a single network on a clinical time series data set using recurrent neural networks. We differ in that we create several per-signal embeddings that can address heterogeneous feature sets in a cross-hospital setting.
Our approach, PHASE (PHysiological SignAl Embeddings), learns an embedding function (feature extractor) for each physiological signal. Then, utilizing these embeddings in a target data set, we train a downstream prediction model for a different, but related task (Figure 1). In particular, our contributions are the following:
We demonstrate that in comparison to the raw signals or naive transformations of the signals (exponential moving averages and variances), using LSTM feature extractors for the signals significantly improves predictive accuracy for XGB downstream models (Section 3.2.2) and for MLP downstream models (Appendix Section 7.3) used to forecast five downstream tasks in an operating room: hypoxemia (low blood oxygen), hypotension (low blood pressure), hypocapnia (low end-tidal carbon dioxide), hypertension (high blood pressure), and phenylephrine administration.
We demonstrate that transferring fixed LSTM models across two distinct hospitals (a general academic medical center and a level one trauma center) yields performant downstream models. Furthermore, we show that in a heterogeneous setting, transferring from an ICU data set to both OR data sets is performant.
We demonstrate that fine tuning transferred models reduces computational cost relative to training from scratch and consistently improves performance.
We demonstrate that despite using embedding models, PHASE still allows meaningful explanation of downstream models because it uses a single LSTM for each signal.
1.1. Related work
1.1.1. Representation learning in the health domain
One particularly natural instance of representation learning in the health domain is medical image analysis, e.g., mammography analysis, kidney detection in ultrasound images, optical coherence tomography image analysis, diagnosing pneumonia using chest X-ray images, lung pattern analysis, otitis media image analysis, and more (Arevalo et al., 2016; Ravishankar et al., 2016; Kermany et al., 2018; Liao et al., 2013; Christodoulidis et al., 2016; Shie et al., 2015). Outside of image analysis, additional examples of transfer learning in the medical domain include Lv et al. (2014), Wiens et al. (2014), Brisimi et al. (2018), Choi et al. (2017), Choi et al. (2016), and Che et al. (2016). Even within physiological signals, some examples of embedding learning are beginning to sprout up, including Wu et al. (2013), who utilize kNNs to perform transfer learning for brain-computer interaction. Comparatively, PHASE transfers neural networks as embedding functions.
1.1.2. Neural networks for physiological signals
To our knowledge, our work is among the first to transfer deep neural networks for embedding physiological signals. One caveat is that supervised deep learning can be said to inherently learn embeddings. For physiological signals, there are several examples of supervised learning tasks addressed with neural networks, ranging from detecting biological or mental phenomena from physiological signals (Srinivasan et al., 2007; Koike and Kawato, 1995; Guo et al., 2010; Wilson and Russell, 2003; Wagner et al., 2005; Chanel et al., 2006; Yang and Hsieh, 2016) to machine learning tasks such as reconstruction of missing signals (Sullivan et al., 2010). In the vein of embedding learning, Martinez et al. (2013) applied autoencoders to blood volume pulse and skin conductance measured from 36 people and used the encodings to predict affective state. Given this substantive community of research scientists working on physiological signals, there is a clear opportunity to unify independent research by appropriately using partially supervised feature embedding learning.
Two pieces of work that utilize transfer learning for physiological signals include Gupta et al. (2019) and Tan et al. (2018). Gupta et al. (2019) transfers a single network in a clinical time series data set. Tan et al. (2018) investigates transferring networks for EEG data by characterizing the data using EEG optical flow. We differ from these previous approaches by utilizing per-signal networks (one for each physiological signal) to create embeddings. Furthermore, we evaluate our method in a real world cross-hospital, cross-department setting.
1.1.3. Forecasting health outcomes
We are interested in forecasting five health outcomes. The first is hypoxemia (low blood oxygen). Hypoxemia was the leading cause of anesthesia-related mortality prior to the adoption of pulse oximetry and anesthesia monitoring standards (Cooper et al., 1978, 1984). Today, hypoxemia remains an important cause of anesthesia-related morbidity: it precipitates acute heart failure (Cross et al., 1963) and acute renal failure (Brezis and Rosen, 1995), and it has harmful effects on nearly every end organ in a variety of animal models (Korner, 1959; Ehrenfeld et al., 2010). The next three outcomes we consider are hypocapnia (low blood carbon dioxide), hypotension (low blood pressure), and hypertension (high blood pressure). Negative physiological effects associated with hypocapnia include reduced cerebral blood flow and reduced cardiac output (Pollard and Gibb, 1977). Prolonged episodes of perioperative hypotension have been associated with postoperative myocardial ischemic events and other adverse postoperative outcomes (Lienhart et al., 2006; Chang et al., 2000), and hypotension has been tied to increased mortality risk in traumatic brain injury patients (Jeremitsky et al., 2003). It is generally known that hypertension is linked to premature death by heart disease, ischemia, and stroke (Janeway, 1913; Pickering, 1972). In perioperative settings, hypertension has been tied to increased risk of postoperative intracranial hemorrhage in craniotomies (Basali et al., 2000) and end organ dysfunction (Varon and Marik, 2008). The final outcome we are interested in forecasting is the administration of phenylephrine, a medication frequently used to address hypotension during anesthesia administration (Kee et al., 2004). Predicting phenylephrine would provide useful information about a patient's hypotension and its response to phenylephrine. It also serves to further evaluate PHASE because it represents a "clinical decision" rather than just patient physiology.
Lundberg et al. (2018) most recently proposed an approach that achieved state-of-the-art performance forecasting hypoxemia. They show that an XGBoost model with exponential moving average/variance features (analogous to ema in Figure 2) outperforms many other baselines, including practicing anesthesiologists in a simulated setting. One goal of PHASE is to replace the manual feature extraction in Lundberg et al. (2018) with a deep learning approach to further improve performance. Additionally, there are papers that discuss forecasting hypotension (Chen et al., 2009; Mancini et al., 2008) and hypertension (Ma and Wang, 2010); however, these papers do not consider transfer learning and focus purely on classification.
2.1. Data cohort
| | OR (academic medical center) | OR (trauma center) | ICU |
|---|---|---|---|
| Age (yr) Mean | 51.859 | 48.701 | 63.956 |
| Age (yr) Std. | 16.748 | 18.419 | 17.708 |
| Weight (lb) Mean | 185.273 | 181.608 | 176.662 |
| Weight (lb) Std. | 54.042 | 54.194 | 55.448 |
| Height (in) Mean | 66.913 | 67.502 | 66.967 |
| Height (in) Std. | 8.268 | 8.607 | 6.181 |
| ASA Code I | 11.58% | 16.57% | - |
| ASA Code II | 41.16% | 43.93% | - |
| ASA Code III | 39.52% | 31.57% | - |
| ASA Code IV | 7.54% | 7.30% | - |
| ASA Code V | 0.19% | 0.48% | - |
| ASA Code VI | 0.01% | 0.16% | - |
| ASA Code Emergency | 7.65% | 15.31% | - |
| # Hypoxemia Samples | | | |
| Hypoxemia Base Rate | 1.09% | 2.19% | 3.93% |
| # Hypocapnia Samples | | | - |
| Hypocapnia Base Rate | 9.76% | 8.06% | - |
| # Hypotension Samples | | | - |
| Hypotension Base Rate | 7.44% | 3.53% | - |
| # Hypertension Samples | | | - |
| Hypertension Base Rate | 1.70% | 1.66% | - |
| # Phenylephrine Samples | | | - |
| Phenylephrine Base Rate | 7.23% | 9.15% | - |
The operating room (OR) data sets were collected via the Anesthesia Information Management System (AIMS), which includes static information as well as real-time measurements of physiological signals sampled minute by minute. One OR data set is drawn from an academic medical center and the other from a level one trauma center. Two clear differences between the patient distributions of the two OR data sets are the gender ratio (57% female in the academic medical center versus 38% in the trauma center) and the proportion of ASA codes classified as emergencies (7.65% versus 15.31%). The ICU data set is a sub-sampled version of the publicly available MIMIC data set from PhysioNet, which contains data obtained from an intensive care unit (ICU) in Boston, Massachusetts (Johnson et al., 2016). Although the ICU data contains several physiological signals sampled at a high frequency, we solely use a minute-by-minute SaO2 signal for our experiments because the other physiological signals had a substantial amount of missingness. Furthermore, the ICU data contained neonatal data that we filtered out. For all three data sets, any remaining missing values in the signal features are imputed by the training mean, and each feature is standardized to have zero mean and unit variance for training neural networks. Additional details about the distributions of patients in all three data sets are in Table 1, and a list of the prevalent diagnoses is in Appendix Section 7.1.
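The imputation and standardization steps described above can be sketched as follows. This is a minimal numpy sketch (function names are hypothetical); the key point is that the statistics are fit on the training split only and reused when transforming validation/test data.

```python
import numpy as np

def fit_preprocessor(train: np.ndarray):
    """Compute per-feature statistics on the training split only."""
    mean = np.nanmean(train, axis=0)           # per-feature mean, ignoring missing values
    imputed = np.where(np.isnan(train), mean, train)
    std = imputed.std(axis=0)
    std[std == 0] = 1.0                        # guard against constant features
    return mean, std

def transform(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Impute missing values by the training mean, then standardize
    to zero mean and unit variance."""
    x = np.where(np.isnan(x), mean, x)
    return (x - mean) / std
```

At test time, `transform` is called with the statistics fit on training data, so no information leaks from the evaluation split.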
In the OR data sets, we utilize fifteen physiological signals:
SAO2 - Blood oxygen saturation
ETCO2 - End-tidal carbon dioxide
NIBP[S/M/D] - Non-invasive blood pressure (systolic, mean, diastolic)
FIO2 - Fraction of inspired oxygen
ETSEV/ETSEVO - End-tidal sevoflurane
ECGRATE - Heart rate from ECG
PEAK - Peak ventilator pressure
PEEP - Positive end-expiratory pressure
PIP - Peak inspiratory pressure
RESPRATE - Respiration rate
TEMP1 - Body temperature
In addition, we utilize six static features: Height, Weight, ASA Code, ASA Code Emergency, Gender, and Age.
2.2. Downstream Prediction Tasks
In order to validate our embeddings, we focus on health forecasting tasks; forecasting tasks facilitate preventative healthcare by enabling healthcare providers to mitigate risk preemptively (Soyiri and Reidpath, 2013). In particular, we consider the following five tasks:
Hypoxemia: is blood oxygen saturation less than 93% in the next five minutes of surgery?
Hypocapnia: is end-tidal carbon dioxide less than 35 mmHg in the next five minutes of surgery?
Hypotension: is mean blood pressure less than 60 mmHg in the next five minutes of surgery?
Hypertension: is mean blood pressure higher than 110 mmHg in the next five minutes of surgery?
Phenylephrine: is phenylephrine administered in the next five minutes of surgery?
For more detailed information regarding our labelling schemes, refer to Section 7.2.
In this section, we describe PHASE. Informally, PHASE first trains one neural network (upstream embedding model) for each physiological signal in a source data set (Figure 1). Then, the network without its final output layer serves as the embedding function, because neural networks implicitly learn representations of their inputs in intermediate layers. This gives a mapping from the original signal space to a latent space in which predictions are easier to make. After this, we evaluate the embedding models by training a downstream prediction model on target data pre-processed by the upstream embedding models. This confers advantages including better performance and reduced computational requirements.
More formally, we train our upstream embedding models as follows: we train one neural network g_j for each signal j = 1, …, p (where p is the number of signals), drawing batches of length-t signal windows x_j ∈ R^t from a source data set (t is the number of minutes in a window). Because these embedding models are neural networks, we can define ĝ_j to return the vector of activations prior to the final dense layer of the network, so that ĝ_j(x_j) ∈ R^d (d is the number of hidden nodes of the penultimate layer). The embedding of a given signal is then ĝ_j(x_j). Finally, we train our downstream prediction model on data from a target domain. Here, the inputs are in R^{dp+q} because we use the upstream embedding models to extract features from all p signals and concatenate them with the q static features.
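The shapes involved can be illustrated with stand-in embedding functions. The random projections below are purely illustrative placeholders for the trained, truncated LSTMs ĝ_j (the function name and projection trick are our own, not from the paper); the point is the concatenation into a dp + q dimensional downstream input.

```python
import numpy as np

# Stand-in for the p per-signal embedding functions g_j: each maps the last
# t minutes of one signal to a d-dimensional embedding. In PHASE these are
# LSTMs truncated at the penultimate layer; here a fixed random projection
# is used purely to illustrate the shapes involved.
t, p, d, q = 60, 15, 200, 6   # minutes, signals, hidden nodes, static features
rng = np.random.default_rng(0)
projections = [rng.normal(size=(t, d)) for _ in range(p)]

def embed_sample(signals: np.ndarray, static: np.ndarray) -> np.ndarray:
    """signals: (p, t) array of per-signal windows; static: (q,) array.
    Returns the concatenated downstream input in R^{d*p + q}."""
    parts = [signals[j] @ projections[j] for j in range(p)]
    return np.concatenate(parts + [static])

x = embed_sample(rng.normal(size=(p, t)), rng.normal(size=q))
assert x.shape == (d * p + q,)   # 200 * 15 + 6 = 3006 features
```

With d = 200 hidden nodes and p = 15 signals, this matches the 3000 signal-derived features plus static features described in Section 3.1.2.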
2.3.1. Upstream embedding model (LSTM)
Our embedding models are LSTMs trained with the last sixty minutes of each signal as an input. We choose LSTMs because our physiological signal data is time series in nature. Although training the LSTMs is relatively straightforward, an important design decision is the choice of source task for training the LSTM. One trivial option is a source task that matches the downstream task exactly. In other words, if you know you want to forecast hypoxemia, simply train all embedding models to forecast hypoxemia (hypo in Figure 1a). Analogously, for hypocapnia, train all embedding models to forecast hypocapnia and for hypotension, train all embedding models to forecast hypotension. One clear downside of this approach is that the resultant embeddings will be specific to the chosen source task and likely be less helpful for forecasting other outcomes (task adaptation). Furthermore, in Section 3.2.2 we find that using such a specific source task is unnecessary and in Section 3.2.3 we find that it does not transfer well.
In order to choose a loss function for the LSTM, we train the LSTMs with different source tasks/outputs (i.e., the task used to train the embeddings), in decreasing order of similarity to the target task (outlined in Figure 2a).
The first source task is exactly the same as the downstream task: all fifteen LSTMs are trained to forecast either hypoxemia, hypocapnia, or hypotension using the last sixty minutes of the signal as input (hypo).
For the next three source tasks, we train with an output derived from the same signal that is used as input. We start by specifying our interest in forecasting low signals: predicting the minimum of the signal over the next five minutes (min).
Then we omit the minimum function and simply forecast the signal value five minutes ahead (next).
Then, as a baseline, we omit our emphasis on forecasting and train the LSTM to reconstruct its own input (auto).
Finally, we omit the training process altogether and use a randomly initialized LSTM (rand).
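Under the assumptions of a minute-resolution signal, a 60-minute input window, and a five-minute horizon, the trainable source-task targets above can be sketched as follows (the helper name and the hypoxemia-style threshold are hypothetical illustrations, not the paper's code):

```python
import numpy as np

def make_labels(signal: np.ndarray, t: int, horizon: int = 5, thresh: float = 93.0):
    """Given one minute-by-minute signal and a window ending at time t,
    build the four trainable source-task targets described above."""
    window = signal[t - 60:t]                   # input: last sixty minutes
    future = signal[t:t + horizon]              # next five minutes
    return {
        "hypo": float(future.min() < thresh),   # thresholded minimum (classification)
        "min":  future.min(),                   # minimum over the next five minutes
        "next": signal[t + horizon - 1],        # signal value five minutes ahead
        "auto": window,                         # reconstruct the input itself
    }
```

The rand variant needs no labels at all, since the LSTM weights are left at their random initialization.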
The less similar the source task is to the target task, the more likely the features extracted by the LSTM will generalize to other tasks. Conversely, if the source task is more similar to the target task, we would expect a downstream model trained on the extracted features to be more performant. Once we have trained these models in a given source domain , we obtain embeddings of the physiological signals by passing them through to the final hidden layer.
The next, min, and auto embeddings are unsupervised in the sense that training them requires only a single signal (at different time steps). Yet next and min are still supervised prediction tasks, and this matters: we find that a fully unsupervised approach (auto) is not performant for forecasting adverse outcomes (Section 3.2.2). We recommend using the next task as the source task on the basis of prediction performance and robustness across downstream tasks (Section 3).
The aforementioned tasks use a source domain that is the same as the target domain. We have two additional variants:
The first is denoted by an apostrophe (e.g., next′). This signifies that if the target data set (where the XGBoost model is trained) is one OR data set, the source data set (where the LSTMs are trained) is the other.
The second is denoted by a superscript P (e.g., next^P). This signifies that the LSTM for SAO2 is from the ICU data set and the remaining fourteen LSTMs are from the target operating room data set. This constitutes a heterogeneous setting, where we test whether the SAO2 feature extractor from an ICU can be used in conjunction with ones from an OR.
One reason we use per-signal networks (trained on a single signal) is heterogeneity in features. By using per-signal networks, the source data set and target data set can have different sets of features. The traditional transfer learning approach of transferring a single end-to-end model in heterogeneous settings often leads to a drop in performance due to missing features. With PHASE, one can simply obtain LSTMs for all the signals in one's data set and flexibly use them as feature extractors for a downstream model. Another reason for per-signal networks is that data in a single hospital are often collected at different points in time, and new measurement devices may be introduced to data collection systems. For traditional pipelines, it may be necessary to re-train the entire machine learning pipeline when new features are introduced; with per-signal networks, pre-existing embedding learners would not necessarily need to be re-trained. A final reason for using per-signal networks is that they make explanation of downstream models easier (Section 3.3).
2.3.2. Downstream prediction model (XGB)
PHASE can use any prediction model for the target task. In this paper, we focus on gradient boosting machine (GBM) trees because prior work found that they outperform several other models for the operating room data we use (Lundberg et al., 2018). Gradient boosting machines were introduced by Friedman (2001). This technique iteratively creates an ensemble of weak prediction models to perform classification/regression tasks. In particular, we utilize XGBoost, a popular implementation of gradient boosting machines that uses additive regression trees (Chen and Guestrin, 2016). On Kaggle, a platform for predictive modeling competitions, seventeen out of twenty-nine challenge-winning solutions used XGBoost in 2015 (Chen and Guestrin, 2016). For PHASE, we find that utilizing embeddings of time series signals provides stronger features for the ultimate prediction with XGB (Section 3.2.1).
3.1. Experimental Setup
3.1.1. LSTM (upstream embedding) architecture and training
We utilize LSTMs with forget gates, introduced by Gers et al. (2000), trained with either regression (next, auto, and min embeddings) or classification (hypo) objectives. For regression, we optimize using Adam with an MSE loss function. For classification, we optimize using RMSProp with a binary cross-entropy loss function (additionally, we upsample to maintain balanced batches during training). Our model architectures consist of two hidden layers, each with 200 LSTM cells, with dense connections between all layers. We found that important steps in training LSTM networks for our data are to impute missing values by the training mean, standardize the data, and randomize sample ordering prior to training. To prevent overfitting, we utilized dropout between layers as well as recurrent dropout for the LSTM nodes. We utilized a learning rate of 0.001. Hyperparameter optimization was done by manual coordinate descent. The LSTM models were each run for 200 epochs, and the final model was selected according to validation loss. In order to train these models, we utilized three GPUs (GeForce GTX 1080 Ti graphics cards).
3.1.2. GBM (downstream prediction) architecture and training
We train GBM trees in Python using XGBoost, an open source library for gradient boosting trees. XGBoost works well in practice in part due to its ease of use and flexibility. Imputing and standardizing are unnecessary because GBM trees are based on splits in the training data, so scale does not matter and missing data is informative as is. We train the GBM trees with embedding features from 15 physiological signals, resulting in a total of 3000 features for DeepPHASE methods. In addition, we concatenate static features to the signal features to train and evaluate the models. We found that a learning rate of 0.02 for hypoxemia (0.1 for hypotension and hypocapnia), a max tree depth of 6, a subsampling rate of 0.5, and a logistic objective gave good performance. We fix hyperparameter settings across experiments so that we can focus on comparing different representations of our signal data. All XGB models were run until their validation accuracy was non-improving for five rounds of adding estimators (trees). In order to train these models, we utilized 72 CPUs (Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz).
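As a sketch, the stated settings map onto an XGBoost parameter dictionary roughly as follows (parameter names follow the XGBoost API; the training call in the comment is illustrative, with early stopping supplied at fit time rather than in the parameters):

```python
# XGBoost parameters matching the settings described above
# (learning rate 0.02 for hypoxemia, 0.1 for hypotension/hypocapnia;
# depth 6; subsampling 0.5; logistic objective).
xgb_params = {
    "eta": 0.02,                  # 0.1 for hypotension and hypocapnia
    "max_depth": 6,
    "subsample": 0.5,
    "objective": "binary:logistic",
}
# At training time, validation-based early stopping is applied, e.g.:
# bst = xgb.train(xgb_params, dtrain, num_boost_round=10000,
#                 evals=[(dval, "val")], early_stopping_rounds=5)
```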
3.2. Evaluating the performance of PHASE
In the following sections we describe the application of PHASE to the problem of forecasting adverse outcomes. Our metric of evaluation is area under the precision-recall curve, otherwise known as average precision (AP), which is often preferable for binary predictions with low base rates. As a brief overview of the following sections:
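Average precision summarizes the precision-recall curve as AP = Σ_n (R_n − R_{n−1}) P_n, where P_n and R_n are precision and recall at the n-th score threshold. A minimal numpy implementation (our own sketch, assuming no tied scores) is:

```python
import numpy as np

def average_precision(y_true: np.ndarray, scores: np.ndarray) -> float:
    """AP = sum_n (R_n - R_{n-1}) * P_n over descending-score thresholds.
    Recall only increases at positive examples, so AP equals the mean
    precision at the ranks of the positives (assuming no tied scores)."""
    order = np.argsort(-scores)                      # rank by descending score
    y = y_true[order]
    tp = np.cumsum(y)                                # true positives at each cutoff
    precision = tp / np.arange(1, len(y) + 1)
    return float(precision[y == 1].mean())
```

For a classifier that ranks a positive first, a negative second, and a positive third out of four examples, this gives (1 + 2/3) / 2 = 5/6, matching the step-function definition above.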
First, we establish baselines by comparing downstream models (LSTM to XGB) and a method comparable to practicing anesthesiologists (Lundberg et al. (2018)) (Section 3.2.1). Building upon these baselines, we evaluate PHASE, which uses a combination of LSTM feature extractors and a downstream XGB model, for three “hypo” tasks (hypoxemia, hypocapnia, and hypotension). To show that PHASE embedding functions work for a variety of downstream models, we also benchmark using a downstream MLP model in Appendix Section 7.3.
In order to find the best task to train our LSTMs with, we explore a setting where the LSTMs are learned in the same target domain as the downstream XGB model (Section 3.2.2) and then in a setting where the LSTMs are learned in a different source domain to the target domain the XGB model is trained in (Section 3.2.3). For these sections, we focus on forecasting hypoxemia, hypocapnia, and hypotension.
To further evaluate PHASE, we utilize two non-“hypo” downstream tasks, namely hypertension and phenylephrine (Section 3.2.4).
Finally, we examine how fine tuned versions of the next LSTMs (trained in one OR data set and fine tuned in the other) converge and perform compared to models trained from scratch (Section 3.2.5).
3.2.1. Establishing our baselines
In Figure 2b, we make a number of comparisons. Our first goal is to compare two downstream models: XGBoost and an LSTM (both trained on raw signal data). In order to obtain the best LSTM baseline, we tuned hyperparameters over the number of LSTM layers (1, 2, 3), number of nodes (100, 200, 300), optimizers (RMSProp, SGD, Adam), learning rates (0.01, 0.001, 0.0001), and dropout rates (0, 0.5) for forecasting hypoxemia on one OR data set and used the best model for subsequent analyses. Comparing these models, we see that an end-to-end LSTM model marginally helps for hypoxemia forecasting and hurts for hypocapnia and hypotension. This result is surprising given that one might expect LSTMs to be more suitable for time series data. In comparison, we aim to evaluate PHASE, which utilizes both model types to further improve performance (Section 3.2.2).
Next, we examine a natural baseline: ema, a simple yet surprisingly effective representation. In fact, Lundberg et al. (2018) previously showed that XGBoost with ema features outperformed practicing anesthesiologists for forecasting hypoxemia in a simulated setting. However, while the ema representation encodes much of the information from the original data, it does not significantly improve performance compared to raw (hypoxemia, hypocapnia) and at times even hurts performance (hypotension).
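As an illustration, an exponential moving average/variance feature of the kind the ema baseline uses can be computed with a standard exponentially weighted update (the smoothing constant here is a hypothetical single value; Lundberg et al. (2018) use several decay rates per signal):

```python
import numpy as np

def ema_features(signal: np.ndarray, alpha: float = 0.1):
    """Exponentially weighted moving average and variance of a
    minute-by-minute signal (West's incremental update)."""
    mean, var = float(signal[0]), 0.0
    for x in signal[1:]:
        delta = x - mean
        mean += alpha * delta                          # EW mean update
        var = (1 - alpha) * (var + alpha * delta * delta)  # EW variance update
    return mean, var
```

In the ema representation, such summary statistics replace the raw 60-minute windows as inputs to the downstream XGBoost model.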
3.2.2. PHASE improves downstream model performance
In this section we evaluate PHASE, which uses one LSTM per signal to extract features that are concatenated with static patient information and used to train a downstream XGBoost model. Here, the LSTMs are trained in the target data set (next, min, and hypo in Figure 2b). We compare to four baselines: raw, ema, rand, and auto. The first two baselines were discussed in Section 3.2.1. The rand representation uses an LSTM with random weights. The auto representation uses an LSTM trained to predict the last 60 minutes of a signal (i.e., to reconstruct its input). As a final note, all the downstream XGBoost models are trained with identical hyperparameters in order to fairly compare different representations of the input data.
First, rand transforms the data in a manner that makes the eventual prediction harder, serving as a lower bound for the performance of LSTM embeddings. Second, similarly to ema, auto does not appear to consistently improve or impair performance relative to raw. Third, next, min, and hypo consistently yield better-performing models for all three target tasks. Contrasting this result with auto suggests that for our target tasks (forecasting binary "hypo" outcomes), incorporating the future in the source task is crucial (as in next), although taking the minimum (min) and thresholding (hypo) do not further improve performance. Finally, PHASE embeddings improve over ema, a method on par with practicing anesthesiologists (Lundberg et al., 2018).
3.2.3. PHASE transfers across hospitals successfully
In Figure 2c, we show that the PHASE LSTM feature extractors transfer successfully across hospitals and departments. We see one trend consistent across all three target tasks: as one might expect, training the LSTMs on a source data set different from the target data set (denoted with an apostrophe, e.g., next′ and min′) yields lower downstream performance than training the LSTMs in the target data set (denoted next, min, and hypo).
Despite this, across the three target tasks, next and min significantly outperform raw, particularly in comparison to hypo. These results suggest that the choice of source task is extremely important and that a source task identical to the target task (hypo) is not the best for generalization. Finally, transference from the ICU data set (the superscript-P variant) is equally successful for forecasting hypoxemia in comparison to transference between OR data sets, suggesting successful transfer from an intensive care unit data set to both operating room data sets. This is particularly exciting for two reasons: 1.) intensive care units likely serve very different populations of patients compared to operating rooms, and 2.) we used an LSTM feature extractor for SAO2 from the ICU in conjunction with ones from the operating room data sets. For forecasting "hypo" outcomes, it appears that the best LSTM source task is either next or min. Of these two, next is preferable because it is likely to adapt to more downstream prediction tasks.
In summary, although transferring fixed models is consistently more performant than raw features, it is not as performant as training the LSTMs in the target domain to begin with. However, one important advantage of transferring fixed models is that an end user in the target domain can use them to improve predictions at no additional training cost. End users that lack either the computational resources or the deep learning expertise to train their own models from scratch can instead use an off-the-shelf, fixed embedding model. Given that machine learning is often not the primary concern of hospitals, fixed models offer a straightforward way to improve the performance of many models trained on physiological signal data.
3.2.4. Applying PHASE to additional tasks
One potential criticism of the results highlighted in Figures 2b and 2c is that they exclusively pertain to forecasting low signals ("hypo" outcomes). In Figure 2d, we evaluate PHASE on two tasks that are not "hypo" tasks. For hypertension (high blood pressure), we empirically demonstrate that, as we would expect, next representations are better than min representations. For phenylephrine, we see improved performance from both the next and min models. This makes sense because phenylephrine is typically administered in response to low blood pressure. Finally, we see that, as before, transferring fixed next and min embeddings significantly improves over raw, ema, and auto.
3.2.5. Fine-tuning embedding models
In this section, we describe how to improve the performance of transferred models (Section 3.2.3) by fine-tuning them in the target domain. Because next performed and generalized well in previous sections, we focus on this source task for the following experiment. In Figure 3, we evaluate the convergence and performance of fine-tuning the LSTM embedding models. First, in Figure 3a we show the convergence of fine-tuned models. In the top eight plots, we fix one OR data set as the source: in green we show the convergence of a randomly initialized LSTM trained for each signal, and in light green the convergence of an LSTM initialized using the weights of the best model trained in the source data set. The bottom eight plots are the analogous plots with the other OR data set as the source. From these plots we can see that fine-tuning LSTMs rather than training them from scratch consistently leads to much faster convergence; this suggests that end users capable of fine-tuning LSTMs should always do so. In Figure 3b, we see that LSTMs obtained from this fine-tuning approach (next) consistently improve downstream model (XGB) performance in comparison to LSTMs trained in a single source data set (next, next’). Overall, fine-tuning the LSTM feature extractors improves performance at lower cost to end users in the target domain.
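The warm-start logic behind fine-tuning can be illustrated on a toy problem. The paper fine-tunes LSTMs; the sketch below substitutes a linear model trained by gradient descent (all data, dimensions, and step counts are invented for illustration), but the mechanism is the same: initialize target-domain training from source-domain weights instead of from scratch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Gradient descent on squared error; returns final weights and the
# per-step loss curve so the two initializations can be compared.
def fit(X, y, w0, steps=50, lr=0.1):
    w, losses = w0.copy(), []
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        losses.append(float(np.mean((X @ w - y) ** 2)))
    return w, losses

# Source and target tasks are similar but not identical (shifted weights),
# mirroring transfer between related data sets.
w_true = rng.normal(size=5)
X_src = rng.normal(size=(200, 5)); y_src = X_src @ w_true + 0.1 * rng.normal(size=200)
X_tgt = rng.normal(size=(40, 5));  y_tgt = X_tgt @ (w_true + 0.05) + 0.1 * rng.normal(size=40)

w_source, _ = fit(X_src, y_src, np.zeros(5), steps=200)   # "source model"
_, loss_scratch  = fit(X_tgt, y_tgt, rng.normal(size=5))  # random init
_, loss_finetune = fit(X_tgt, y_tgt, w_source)            # warm start

print(round(loss_scratch[0], 3), round(loss_finetune[0], 3))
```

Warm-started training starts much closer to the target optimum, which is the convergence-speed effect visible in Figure 3a.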
3.3. Explanation of downstream models
In the following section we interpret the downstream models we used to evaluate our embeddings. The goal of this section is twofold: 1.) to validate that the downstream models are sensible and 2.) to demonstrate that downstream models trained using PHASE embeddings are explainable. Being able to interpret downstream models is important for ensuring models are fair/unbiased, trustworthy, valuable to scientific understanding, and more (Doshi-Velez and Kim, 2017). This is especially true when the results are used to make critical decisions involving human health.
3.3.1. Interpretation using SHAP values
In order to obtain explanations, we utilize the Interventional Tree Explainer, which provides exact SHAP values (feature attributions with game-theoretic properties) for complex tree-based models (Lundberg et al., 2019; Lundberg and Lee, 2017). SHAP values, or feature attributions, indicate how much each feature contributed to each prediction. We use SHAP values to explain which features were important for each sample’s prediction in our downstream XGB models.
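The defining property of SHAP values can be checked on a case where they have a closed form. For a linear model, the SHAP value of feature i is φᵢ = wᵢ(xᵢ − E[xᵢ]), and the attributions satisfy the efficiency property: they sum to f(x) − E[f(X)]. The sketch below uses this linear special case purely as a minimal stand-in for the Interventional Tree Explainer applied to the XGBoost models.

```python
import numpy as np

rng = np.random.default_rng(2)

# Background data and a linear model f(x) = w . x.
X = rng.normal(size=(500, 4))
w = np.array([2.0, -1.0, 0.5, 0.0])
f = lambda X: X @ w

# Closed-form SHAP values for a linear model: each feature's attribution
# is its weight times its deviation from the background mean.
x = X[0]
phi = w * (x - X.mean(axis=0))

# Efficiency: attributions sum exactly to f(x) minus the mean prediction.
print(np.isclose(phi.sum(), f(x) - f(X).mean()))  # prints True
```

Tree explainers compute the same game-theoretic quantity exactly for tree ensembles, where no simple closed form exists.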
In Figure 4, we show summary plots for the five most important features of both the raw and next XGBoost models. In these summary plots, each point represents a feature’s attribution for a given sample, and the coloring denotes the value of the feature. In the embedding case, the feature value is determined by the embedding model and has no inherent meaning; in the aggregated attributions (c), the colors are the sums of all values from the aggregated features. In both cases, the coloring of the signal features is generally not very informative.
3.3.2. Aggregating feature attributions by signal (Hypoxemia)
In Figure 4a we plot attributions for XGB models trained with raw and next data. For raw, the attributions are in the original signal space, where each number corresponds to the minute when the value was recorded (relative to the current time step). For next, the attributions are in the embedding space, where each attribution corresponds to an LSTM hidden node.
In this plot we can see a few trends. First, in the raw model, most of the important features come from the final minute of a given signal, which corresponds to the last minute of data before the prediction occurred. This makes sense because signals have temporal locality: in order to forecast the next five minutes, the most recent time points are the most helpful. Comparing across both models, SAO2 is naturally the most important feature for both (since we are forecasting low blood oxygen). In contrast to the raw model, for the important SAO2 embeddings higher values correspond to positive predictions; because these features are pre-processed, the embedding features’ values are not as naturally interpretable as the raw minutes of SAO2. Next, we can see a consistent trend across both models in the static Weight feature: low weights are protective and high weights contribute to higher hypoxemia risk predictions. Another interesting point is that in the next model, three SAO2 features rank among the most important, as opposed to one in raw. This suggests that the LSTM trained to forecast the next five minutes of SAO2 learned multiple representations of SAO2 that serve to forecast hypoxemia risk even though we did not specifically train it to do so. These representations give the downstream model more discriminative power.
Although per-feature attributions are useful, in most end-user scenarios, understanding which signal contributes most to the prediction is of primary importance. Because PHASE utilizes single-signal LSTM embedding functions, we are able to obtain per-signal feature attributions despite using representation learning. In Figure 4b, we sum the attributions for each signal across all features that correspond to that signal. We can see a good deal of overlap between the top features from the raw and next models. Furthermore, the union of the important variables from both models are all variables logically connected to blood oxygen. SAO2, ETCO2, and FIO2 are all variables associated with the respiratory system. TV and PIP are both variables tied to mechanical ventilation and are naturally linked to blood oxygen (Kiiski et al., 1992; Dreyfuss et al., 1988). Finally, the negative correlation between BMI and lung capacity (Jones and Nzekwu, 2006), the increased difficulty of ventilating heavier patients (De Jong et al., 2017), and the association of obesity with downstream effects on cardiovascular function (Vasan, 2003) and ischemic heart disease (Thomsen and Nordestgaard, 2014) could justify the importance of Weight that we see in Figures 4a and 4b.
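The per-signal aggregation is a simple grouped sum. A sketch, assuming the 15-signal, 200-dimensions-per-signal layout from Section 7.3 (sample count and attribution values are invented): because each embedding dimension belongs to exactly one signal, summing a signal's block of attributions yields a per-signal attribution, and the total attribution per sample is preserved.

```python
import numpy as np

rng = np.random.default_rng(3)

# Attribution matrix: one row per sample, one column per embedding
# feature. Features are laid out signal-by-signal, so each signal owns
# a contiguous block of 200 columns.
n_signals, embed_dim, n_samples = 15, 200, 8
attrib = rng.normal(size=(n_samples, n_signals * embed_dim))

# Sum each signal's block to get per-signal attributions.
per_signal = attrib.reshape(n_samples, n_signals, embed_dim).sum(axis=2)
print(per_signal.shape)  # (8, 15)

# Sanity check: grouping does not change the total attribution, so the
# efficiency property of SHAP values carries over to the signal level.
print(np.allclose(per_signal.sum(axis=1), attrib.sum(axis=1)))  # prints True
```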
3.3.3. Top features for downstream tasks
In Figure 4c, we plot the aggregated attributions for next models trained to predict the remaining four outcomes. Once again, we can verify that the top features are appropriate. Firstly, for hypocapnia it is natural that ETCO2 is the most important feature. Furthermore, it makes sense to utilize FIO2, RESPRATE, PIP, and TV to forecast hypocapnia because these variables are all related to either ventilation or respiration. As one would expect, for hypotension and hypertension, the most important variables are generally the three non-invasive blood pressure measurements: NIBPM, NIBPD, NIBPS. Furthermore, there are a number of studies validating the importance of ECGRATE (heart rate measured from ECG signals) to forecasting hypotension and hypertension (Palatini, 2011; Morcet et al., 1999). Finally, phenylephrine administration during surgery is typically done in response to hypotension, thus validating the importance of NIBPS, NIBPM, and ECGRATE. Similarly, older age being more important to forecasting phenylephrine may be tied to age being predictive of hypotension, as well as the heightened vigilance anesthesiologists have for hypotension in this higher-risk population (Lonjaret et al., 2014).
4. Conclusion
This paper presents PHASE, an approach to transfer learning in the domain of physiological signals. Transfer learning for physiological signals potentially has far-reaching impacts, because neural networks inherently create an embedding before the final output layer. In light of the number of researchers working on neural networks for physiological signals and the lack of exploration of transfer learning in this domain, PHASE offers a potential method of collaboration that can address domain and task adaptation. PHASE embeddings offer a number of benefits including improved performance, successful transference (in fixed and fine-tuned settings), lower computational cost, and preserved explainability. Potential future work includes handling multiple sampling rates for signals in addition to addressing more signals (e.g., electrocardiograms).
5. IRB Statement
The electronic data for this study was retrieved from institutional electronic medical record and data warehouse systems after receiving approval from the Institutional Review Board (University of Washington Human Subjects Division, Approval #46889). Protected health information was excluded from the data set that was used for machine learning methods.
6. Acknowledgments
We would like to thank Joseph D. Janizek, Alex Okeson, and Nicasia Beebe-Wang for their feedback on the manuscript. We would also like to thank all of the members of Professor Su-In Lee’s lab for their feedback on the project. In addition, this material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1762114.
- Representation learning for mammography mass lesion classification with convolutional neural networks. Computer methods and programs in biomedicine 127, pp. 248–257. Cited by: §1.1.1.
- Relation between perioperative hypertension and intracranial hemorrhage after craniotomy. Anesthesiology: The Journal of the American Society of Anesthesiologists 93 (1), pp. 48–54. Cited by: §1.1.3.
- Hypoxia of the renal medulla—its implications for disease. New England Journal of Medicine 332 (10), pp. 647–655. Cited by: §1.1.3.
- Federated learning of predictive models from federated electronic health records. International journal of medical informatics 112, pp. 59–67. Cited by: §1.1.1.
- Emotion assessment: arousal evaluation using eeg’s and peripheral physiological signals. In International workshop on multimedia content representation, classification and security, pp. 530–537. Cited by: §1.1.2.
- Adverse effects of limited hypotensive anesthesia on the outcome of patients with subarachnoid hemorrhage. Journal of neurosurgery 92 (6), pp. 971–975. Cited by: §1.1.3.
- Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, Vol. 2016, pp. 371. Cited by: §1.1.1.
- XGBoost: a scalable tree boosting system. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, New York, NY, USA, pp. 785–794. External Links: Cited by: §2.3.2.
- Forecasting acute hypotensive episodes in intensive care patients based on a peripheral arterial blood pressure waveform. In 2009 36th Annual Computers in Cardiology Conference (CinC), pp. 545–548. Cited by: §1.1.3.
- Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1495–1504. Cited by: §1.1.1, §1.
- GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 787–795. Cited by: §1.1.1, §1.
- Multi-source transfer learning with convolutional neural networks for lung pattern analysis. arXiv preprint arXiv:1612.02589. Cited by: §1.1.1.
- Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364. Cited by: §1.
- An analysis of major errors and equipment failures in anesthesia management: considerations for prevention and detection.. Anesthesiology 60 (1), pp. 34–42. Cited by: §1.1.3.
- Preventable anesthesia mishaps: a study of human factors.. Anesthesiology 49 (6), pp. 399–406. Cited by: §1.1.3.
- Understanding and forecasting hypoxia using machine learning algorithms. Journal of Hydroinformatics 13 (1), pp. 64–80. Cited by: §1.1.3.
- Effects of arterial hypoxia on the heart and circulation: an integrative study. American Journal of Physiology-Legacy Content 205 (5), pp. 963–970. Cited by: §1.1.3.
- Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815. Cited by: §1.
- Mechanical ventilation in obese icu patients: from intubation to extubation. Critical Care 21 (1), pp. 63. Cited by: §3.3.2.
- Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: §3.3.
- High inflation pressure pulmonary edema: respective effects of high airway pressure, high tidal volume, and positive end-expiratory pressure. American Review of Respiratory Disease 137 (5), pp. 1159–1164. Cited by: §3.3.2.
- The incidence of hypoxemia during surgery: evidence from two institutions. Canadian Journal of Anesthesia/Journal canadien d’anesthésie 57 (10), pp. 888–897. Cited by: §1.1.3.
- Deep learning for healthcare applications based on physiological signals: a review. Computer methods and programs in biomedicine. Cited by: §1.
- Greedy function approximation: a gradient boosting machine.. Ann. Statist. 29 (5), pp. 1189–1232. External Links: Cited by: §2.3.2.
- Learning to forget: continual prediction with lstm. Neural Computation 12 (10), pp. 2451–2471. Cited by: §3.1.1.
- Epileptic seizure detection using multiwavelet transform based approximate entropy and artificial neural networks. Journal of neuroscience methods 193 (1), pp. 156–163. Cited by: §1.1.2.
- Transfer learning for clinical time series analysis using deep neural networks. arXiv preprint arXiv:1904.00655. Cited by: §1.1.2.
- Transfer learning for clinical time series analysis using recurrent neural networks. arXiv preprint arXiv:1807.01705. Cited by: §1.
- A clinical study of hypertensive cardiovascular disease. Transactions of the Association of American Physicians 28, pp. 333. Cited by: §1.1.3.
- Harbingers of poor outcome the day after severe brain injury: hypothermia, hypoxia, and hypoperfusion. Journal of Trauma and Acute Care Surgery 54 (2), pp. 312–319. Cited by: §1.1.3.
- MIMIC-iii, a freely accessible critical care database. Scientific Data 3, pp. 160035. External Links: Cited by: §2.1.
- The effects of body mass index on lung volumes. Chest 130 (3), pp. 827–833. Cited by: §3.3.2.
- Prophylactic phenylephrine infusion for preventing hypotension during spinal anesthesia for cesarean delivery. Anesthesia & Analgesia 98 (3), pp. 815–821. Cited by: §1.1.3.
- Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), pp. 1122–1131. Cited by: §1.1.1.
- Effect of tidal volume on gas exchange and oxygen transport in the adult respiratory distress syndrome. American Review of Respiratory Disease 146, pp. 1131–1131. Cited by: §3.3.2.
- Estimation of dynamic joint torques and trajectory formation from surface electromyography signals using a neural network model. Biological cybernetics 73 (4), pp. 291–300. Cited by: §1.1.2.
- Circulatory adaptations in hypoxia. Physiological reviews 39 (4), pp. 687–730. Cited by: §1.1.3.
- Representation learning: a unified deep learning framework for automatic prostate mr segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 254–261. Cited by: §1.1.1.
- Survey of anesthesia-related mortality in france. Anesthesiology: The Journal of the American Society of Anesthesiologists 105 (6), pp. 1087–1097. Cited by: §1.1.3.
- Optimal perioperative management of arterial blood pressure. Integrated blood pressure control 7, pp. 49. Cited by: §3.3.3.
- Explainable ai for trees: from local explanations to global understanding. arXiv preprint arXiv:1905.04610. Cited by: §3.3.1.
- A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774. Cited by: §3.3.1.
- Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature biomedical engineering 2 (10), pp. 749. Cited by: §1.1.3, §2.3.2, item 1, §3.2.1, §3.2.2.
- Transfer learning based clinical concept extraction on data from multiple sources. Journal of Biomedical Informatics 52, pp. 55 – 64. Note: Special Section: Methods in Clinical Research Informatics External Links: Cited by: §1.1.1, §1.
- The application of artificial neural network in the forecasting on incidence of a disease. In 2010 3rd International Conference on Biomedical Engineering and Informatics, Vol. 3, pp. 1269–1272. Cited by: §1.1.3.
- Wearable sensors for remote health monitoring. Sensors 17 (1), pp. 130. Cited by: §1.
- Short term variability of oxygen saturation during hemodialysis is a warning parameter for hypotension appearance. In 2008 Computers in Cardiology, pp. 881–884. Cited by: §1.1.3.
- Learning deep physiological models of affect. IEEE Computational Intelligence Magazine 8 (2), pp. 20–33. Cited by: §1.1.2.
- Associations between heart rate and other risk factors in a large french population. Journal of hypertension 17 (12), pp. 1671–1676. Cited by: §3.3.3.
- A review of big data applications of physiological signal data. Biophysical reviews 11 (1), pp. 83–87. Cited by: §1.
- Role of elevated heart rate in the development of cardiovascular disease in hypertension. Hypertension 58 (5), pp. 745–750. Cited by: §3.3.3.
- Hypertension: definitions, natural histories and consequences. The American journal of medicine 52 (5), pp. 570–583. Cited by: §1.1.3.
- Some adverse physiological effects of hypocarbia and methods of maintaining normocarbia during controlled ventilation—a review. Anaesthesia and intensive care 5 (2), pp. 113–121. Cited by: §1.1.3.
- Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, pp. 759–766. Cited by: §1.
- Understanding the mechanisms of deep transfer learning for medical images. In Deep Learning and Data Labeling for Medical Applications, pp. 188–196. Cited by: §1.1.1, §1.
- Creating value in health care through big data: opportunities and policy implications. Health affairs 33 (7), pp. 1115–1122. Cited by: §1.
- Transfer representation learning for medical image analysis. In Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE, pp. 711–714. Cited by: §1.1.1.
- Deep convolutional neural networks for computer-aided detection: cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35 (5), pp. 1285–1298. Cited by: §1.
- An overview of health forecasting. Environmental health and preventive medicine 18 (1), pp. 1. Cited by: §2.2.
- Approximate entropy-based epileptic eeg detection using artificial neural networks. IEEE Transactions on information Technology in Biomedicine 11 (3), pp. 288–295. Cited by: §1.1.2.
- Surgeries in hospital-based ambulatory surgery and hospital inpatient settings, 2014. HCUP Statistical Brief. Cited by: §1.
- Reconstruction of missing physiological signals using artificial neural networks. In Computing in Cardiology, 2010, pp. 317–320. Cited by: §1.1.2.
- Convolutional neural networks for medical image analysis: full training or fine tuning?. IEEE Transactions on Medical Imaging 35 (5), pp. 1299–1312. External Links: Cited by: §1.
- Deep transfer learning for eeg-based brain computer interface. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 916–920. Cited by: §1.1.2.
- Myocardial infarction and ischemic heart disease in overweight and obesity with and without metabolic syndrome. JAMA internal medicine 174 (1), pp. 15–22. Cited by: §3.3.2.
- Perioperative hypertension management. Vascular health and risk management 4 (3), pp. 615. Cited by: §1.1.3.
- Cardiac function and obesity. BMJ Publishing Group Ltd. Cited by: §3.3.2.
- From physiological signals to emotions: implementing and comparing selected methods for feature extraction and classification. In Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on, pp. 940–943. Cited by: §1.1.2.
- An all-payer view of hospital discharge to postacute care, 2013. HCUP Statistical Brief. Cited by: §1.
- A study in transfer learning: leveraging data from multiple hospitals to enhance hospital-specific predictions. Journal of the American Medical Informatics Association 21 (4), pp. 699–706. Cited by: §1.1.1, §1.
- Real-time assessment of mental workload using psychophysiological measures and artificial neural networks. Human factors 45 (4), pp. 635–644. Cited by: §1.1.2.
- Collaborative filtering for brain-computer interaction using transfer learning and active class selection. PloS one 8 (2), pp. e56624. Cited by: §1.1.1.
- Classification of acoustic physiological signals based on deep learning neural networks with augmented features. In Computing in Cardiology Conference (CinC), 2016, pp. 569–572. Cited by: §1.1.2.
- Accuracy of symptoms and signs in predicting hypoxaemia among young children with acute respiratory infection: a meta-analysis. The International Journal of Tuberculosis and Lung Disease 15 (3), pp. 317–325. Cited by: §1.1.3.
7.1. Top diagnoses for our data
Top ten diagnoses (OR):
Calculus of Kidney
Complications due to other internal orthopedic device implant and graft
Senile Cataract NOS
Senile Cataract Unspecified
Carpal Tunnel Syndrome
CMP NEC D/T ORTH DEV NEC
Top ten diagnoses (OR):
Malignant Neoplasm of Breast (Female) Unspecified
Malignant Neoplasm of Breast NOS
Calculus of Kidney
Malignant Neoplasm of Prostate
Malignant Neoplasm of Bladder Part Unspecified
PREV C-SECT NOS-DELIVER
End stage renal disease
Top ten diagnoses (ICU):
Congestive Heart Failure
Coronary Artery Disease
Altered Mental Status
7.2. Label definitions
For hypoxemia, a particular time point is labeled one if the minimum of the next five minutes is hypoxemic. All points where the current time step is already hypoxemic are ignored, as are time points where the past ten minutes or the future five minutes were entirely missing. Hypocapnia, hypotension, and hypertension have slightly stricter label conditions. We label the current time point one if the current value is not “hypo” and the minimum of the next five minutes is “hypo”; we label it zero if the current value is not “hypo” and the minimum of the next ten minutes is not “hypo”. For hypertension, we use the maximum rather than the minimum, with an analogous filtering procedure. For hypocapnia, the thresholded signal is ETCO2; for hypotension and hypertension, it is NIBPM. All other time points were not considered. As a result, we have different sample sizes for different prediction tasks (reported in Table 1). For phenylephrine, we filter out procedures where phenylephrine is not administered because we would otherwise have too many negative samples.
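The hypoxemia-style labeling rule above can be sketched directly. The threshold value and the example signal below are invented for illustration (the paper's exact thresholds are not reproduced here); the structure follows the text: label 1 if the minimum over the next five minutes is below threshold, and ignore points that are already below threshold.

```python
import numpy as np

def hypo_labels(signal, thresh, horizon=5):
    """Label each time point per the "hypo" rule; NaN marks ignored points."""
    labels = np.full(len(signal), np.nan)
    for t in range(len(signal) - horizon):
        if signal[t] < thresh:
            continue  # currently "hypo": this point is ignored
        future_min = signal[t + 1 : t + 1 + horizon].min()
        labels[t] = float(future_min < thresh)
    return labels

# Illustrative SaO2-like trace and threshold.
sig = np.array([96, 95, 91, 94, 96, 97, 96, 95, 96, 96, 96], dtype=float)
print(hypo_labels(sig, thresh=92.0))
```

Points at the end of the trace, whose five-minute future is unavailable, are also left as NaN, echoing the missing-data filtering described above.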
7.3. MLP downstream model
One potential criticism of our evaluations in the previous sections is that we evaluate with only one downstream model type: XGBoost. There are clearly a number of benefits to using tree-based methods, such as ease of training, exact SHAP value attribution methods, performance on par with LSTMs for our data sets, and more. However, in order to show that PHASE embeddings improve performance and transference for a variety of downstream model types, we replicate Figures 2b and 2c using MLP downstream models in Figure 5. We see that, as with downstream XGB models, the PHASE embeddings offer a substantial improvement over raw embeddings for downstream MLP models.
We utilize multi-layer perceptrons implemented in the Keras library with a TensorFlow back-end. We train the MLPs with embedding features from 15 physiological signals, resulting in a total of 3000 features for DeepPHASE methods. In addition, we concatenate static features to the signal features to train and evaluate the models. The architecture consists of a dense layer with 100 nodes and ReLU activation, a dropout layer with rate 0.5, a second dense layer with 100 nodes and ReLU activation, another dropout layer with rate 0.5, and a dense output layer with one node and a sigmoid activation function. We utilize a learning rate of 0.00001, the Adam optimizer, and binary cross-entropy loss. We found that 200 epochs was sufficient for the downstream models to converge. We fix hyperparameter settings across experiments so that we can focus on comparing different representations of our signal data. In order to train these models, we utilize 72 CPUs (Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz).
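A numpy stand-in for the described architecture's forward pass is sketched below (the paper's actual implementation is in Keras). The static feature count is an illustrative assumption; the layer sizes and activations follow the description above. Dropout is omitted because it is inactive at inference time.

```python
import numpy as np

rng = np.random.default_rng(4)

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, params):
    # Dense(100, relu) -> Dense(100, relu) -> Dense(1, sigmoid),
    # matching the described MLP with dropout disabled for inference.
    h = relu(x @ params["W1"] + params["b1"])
    h = relu(h @ params["W2"] + params["b2"])
    return sigmoid(h @ params["W3"] + params["b3"])

n_in = 3000 + 5  # 15 signals * 200-d embeddings, plus 5 static features (illustrative count)
params = {
    "W1": rng.normal(scale=0.01, size=(n_in, 100)), "b1": np.zeros(100),
    "W2": rng.normal(scale=0.01, size=(100, 100)),  "b2": np.zeros(100),
    "W3": rng.normal(scale=0.01, size=(100, 1)),    "b3": np.zeros(1),
}
probs = mlp_forward(rng.normal(size=(16, n_in)), params)
print(probs.shape)  # (16, 1)
```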
7.4. Full summary plots
In this section, we show the full summary plots (Figures 6-10) of the per-feature and aggregated attributions for raw and next XGBoost models trained in target data set OR. These plots reveal more relationships between each of the five downstream tasks and the top 20 features, sorted by the mean absolute SHAP value of each feature.