Health status prediction (e.g., mortality risk prediction, disease prediction) is of great interest to physicians. Inpatients and patients with chronic diseases face severe life threats and receive long-term treatments, and their health conditions are complex and continually changing over time. By predicting a patient's health status, physicians can select personalized follow-up treatments, prevent adverse outcomes, assign medical resources effectively, and reduce medical costs. Normally, biomarkers such as blood albumin and blood glucose are recorded along the treatment trajectories and are further taken into consideration for the prediction. In a practical diagnosis process, physicians need to comprehensively evaluate the health of patients by identifying the high-risk factors. Such precise risk prediction requires a high level of clinical expertise and experience.
Nowadays, electronic healthcare information systems are widely used in various healthcare institutions and precisely record the lab test results and health information of patients in the form of Electronic Medical Records (EMR). As depicted in Figure 1, EMR can be seen as a type of multivariate time series data and provide essential healthcare information for data-driven healthcare prediction.
Recently, due to the remarkable representation learning ability of deep neural networks, many deep learning-based models have been developed to tackle such prediction tasks using EMR data, including mortality prediction [ma2020concare], disease diagnosis prediction [lee2018diagnosis], and patient phenotype identification [baytas2017patient]. Usually, those models first embed the EMR data into a low-dimensional feature space to learn a dense representation of the patient's health status and then perform specific clinical analysis tasks based on that representation. However, some issues have not yet been fully resolved by existing research, namely how to effectively embed temporal health information comprehensively, and how to ensure the trustworthiness of the representation learning model in terms of providing verifiable interpretations. These issues are summarized as follows:
Challenge 1: Historical variation patterns of biomarkers at different time scales. Not only the value of a lab test result but also its variation pattern contains essential information about the health status. The long-term variation trend and short-term abnormal variation of different biomarkers both reflect the health status of patients from different aspects. Thus, it is vital that the variation of different biomarkers can be captured at different time scales. For example, for end-stage renal disease (ESRD) patients, a long-term descending trend of blood albumin is a strong indicator of malnutrition and health deterioration [bharadwaj2016malnutrition]. On the contrary, a short-term abnormal rise of hypersensitive C-reactive protein indicates a high-risk clinical event (e.g., peritonitis) [Ma2018Predictors]. Although several research works try to utilize convolutional operators to extract temporal patterns of clinical events [cheng2016risk, ma2018health], none of them can capture such patterns at multiple time scales simultaneously.
Challenge 2: Adaptively making use of clinical features for patients in diverse conditions. The key factors that strongly indicate health risk differ among patients [valko2010feature]. It is therefore critical that the model utilize the features adaptively when learning the health status representation and performing prediction for patients in different conditions. The model adaptability can be expressed in two aspects:
The importance of features at different time scales varies among patients. For example, for a patient suffering from a chronic disease, features extracted over the long term may be more representative for depicting the health status. On the contrary, for a patient diagnosed with an acute disease, short-term features describe the health risk more precisely.
The importance of features should also be adaptive to different patient characteristics. For example, the model should pay more attention to creatinine and urea when they rise (plasma concentrations of creatinine and urea are usually associated with systemic manifestations (uremia) in chronic kidney disease patients [msdmanuals_ckd]), and the model should concentrate more on diastolic blood pressure when the patient has been diagnosed with cerebrovascular disease (the level of diastolic blood pressure is usually associated with cerebrovascular disease [rabkin1978predicting]).
Challenge 3: Interpretability for various patients. Medical experts need to understand how a certain decision is made by a model for a particular patient at different visits, so that the prediction results are trustworthy for developing individualized interventions and extracting medical knowledge [tangri2011determining]. For example, if the model triggers a health warning by taking creatinine and urea as key factors, the physician is alerted to assess the patient for possible systemic manifestations. Moreover, it may also remind physicians of a previously unknown correlation between a biomarker and a cause of death. However, most of the existing works can only provide visit-level or disease-feature-level interpretability via attention mechanisms [ma2017dipole, bai2018interpretable]. As far as we know, RETAIN [choi2016retain] is the only end-to-end model that can provide reasonable biomarker-feature-level interpretability by utilizing two-level attention, but its prediction accuracy is unsatisfactory [ma2018health, ma2018risk]. To bridge this research gap, our model AdaCare not only provides fine-grained feature-level interpretation for the model prediction, but also achieves state-of-the-art prediction accuracy.
By jointly considering the above research issues in clinical practice, we propose AdaCare, a clinical health status representation learning model via scale-adaptive feature extraction and recalibration (we release our code and case studies at https://github.com/Accountable-Machine-Intelligence/AdaCare). It monitors biomarkers at long and short time scales simultaneously to extract temporal variation patterns, depicting the health status comprehensively for patients in diverse conditions (e.g., diagnosed with chronic or acute diseases). AdaCare models the correlations between clinical features to abstract the input. At each visit, AdaCare selects the most indicative medical features to build the health status representation. Empirical studies show that AdaCare boosts the prediction performance and meanwhile offers the key features that lead to the prediction. Our main contributions are summarized as follows:
We build a general health status representation learning model, AdaCare, to effectively embed the health status and provide reasonable interpretability for patients in diverse conditions. AdaCare uses dilated convolution with multi-scale receptive fields to capture the long- and short-term variation patterns of biomarkers as clinical features and depict the patient health status more comprehensively (addressing Challenge 1).
We build the scale-adaptive feature recalibration module, which explicitly and adaptively models the feature relationships based on the squeeze-and-excitation block [hu2018squeeze] to selectively enhance high-risk features and meanwhile suppress useless ones (addressing Challenge 2). Thus, AdaCare can provide interpretability of the health status representation learning for patients in diverse health conditions as an end-to-end model, and further remind physicians of the precursors of health risk (addressing Challenge 3). Such interpretability is indicative for understanding how the model utilizes EMR data to make the assessment and for extracting valuable medical knowledge.
We conduct two prediction tasks (i.e., decompensation prediction and mortality prediction) on two real-world datasets (i.e., the MIMIC-III dataset and an end-stage renal disease dataset), respectively, to verify the performance. The results show that AdaCare outperforms the baseline approaches in both tasks. The interpretability of AdaCare is demonstrated by an overall observation of feature recalibration. Besides, the obtained medical knowledge has been positively confirmed by clinical experts.
Over the past ten years, there has been a massive explosion in the amount of digital information stored in electronic medical records, which opens a door for researchers to make secondary use of these records for various clinical applications. Deep learning-based models have shown the capability to perform mortality prediction, patient subtyping, and diagnosis prediction. Though these medical tasks vary from each other, their essence is usually learning health status representations of patients. There are two essential concerns in deep learning-based EMR analysis research:
Temporal Medical Feature Extraction
Some studies have tried to extract high-level temporal clinical features with convolution modules while performing healthcare prediction [cheng2016risk, ma2018health]. However, according to medical experience, the variation of biomarkers should be evaluated at different time scales simultaneously when assessing a patient's condition. To the best of our knowledge, there has not been any research extracting clinical features at multiple time scales effectively.
Interpretability of EMR Analysis
On the one hand, the interpretability shown in most existing works mainly focuses on visit-level attention. For example, some studies proposed RNN-based models with attention mechanisms to measure the relationships between different visits [ma2017dipole, lee2018diagnosis].
On the other hand, several studies have also explored interpretability at the medical-feature level. Timeline [bai2018interpretable] utilizes self-attention to generate clinical visit embeddings, but can only identify disease-code-level importance. Some studies show the importance of features via adversarial attack, which is not an end-to-end framework [sun2018identify]. RETAIN [choi2016retain] is more closely related to our work in terms of interpretability; it achieves feature-level interpretability by using attention mechanisms. However, the prediction performance of RETAIN is limited [ma2018health, ma2018risk], due to the lack of effective high-level clinical feature extraction. Existing studies still cannot capture the importance of biomarkers dynamically while also gaining a performance boost in an end-to-end deep learning-based healthcare prediction model.
A Motivating Example
We take the health status prediction of end-stage renal disease (ESRD) patients as the motivating example. Currently, many people around the world suffer from ESRD [tangri2011determining, isakova2011fibroblast]. They face severe life threats and need lifelong treatments with periodic visits to the hospital for multifarious tests (e.g., blood routine examination). The whole procedure needs dynamic patient health risk prediction, based on the medical records collected along with the visits, to help patients recover smoothly and prevent adverse outcomes. The core task of AdaCare is to learn the health status representation of the patient and perform the healthcare prediction.
| Notation | Description |
|---|---|
| $y_t$ | Ground truth of the prediction target at the $t$-th visit |
| $\hat{y}_t$ | Prediction result at the $t$-th visit |
| $\mathbf{x}_t$ | Multivariate visit record at the $t$-th visit |
| $\mathbf{w}_t$ | Feature recalibration weight of $\mathbf{x}_t$ |
| $\tilde{\mathbf{x}}_t$ | Weighted input record |
| $\mathbf{c}_t$ | Extracted convolutional embedding of the input records |
| $\mathbf{w}^c_t$ | Scale-adaptive recalibration weight of $\mathbf{c}_t$ |
| $\tilde{\mathbf{c}}_t$ | Weighted convolutional embedding |
| $\mathbf{v}_t$ | Visit embedding at the $t$-th timestep |
| $\mathbf{h}_t$ | Hidden state of the GRU at the $t$-th timestep |
We assume that a patient overall visits the clinic $T$ times, generating time-ordered EMR records denoted as $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T$. Each EMR record $\mathbf{x}_t$ contains $N$ features such as different lab test results. Thus the prediction problem in this paper can be formulated as: given the historical EMR data of a patient, i.e., $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_t$, how to predict the patient's health status $\hat{y}_t$,
which is the probability of suffering from a specific risk (e.g., mortality risk, disease diagnosis, decompensation). The next section details our solution, AdaCare.
Figure 4 shows the model structure of AdaCare: a Gated Recurrent Unit (GRU) based architecture is used to embed the health status at each clinical visit and perform the healthcare prediction. The visit record sequence is embedded by the GRU to obtain the hidden state $\mathbf{h}_t$. On the one hand, if the model depends on the latest record $\mathbf{x}_t$ alone, it would be overly sensitive to abnormal values of $\mathbf{x}_t$ brought by the missing data and noise of EMR, and thus the prediction may lack robustness. On the other hand, if the model only depends on the historical characteristics, the alertness of its prediction will be compromised. In AdaCare, both the historical characteristics extracted by the multi-scale dilated convolutional module and the latest record $\mathbf{x}_t$ are taken into consideration to build the visit embedding $\mathbf{v}_t$. Finally, we use $\mathbf{h}_t$ to predict $\hat{y}_t$. In summary, the novelty of AdaCare lies in the following two model components:
As Figure 3 shows, we develop a dilated convolution [yu2015multi] with multi-scale receptive fields to capture the variation characteristics of biomarkers (addressing Challenge 1).
As illustrated in Figure 4, we extend the squeeze-and-excitation block [hu2018squeeze] to dynamically capture the clinical features that strongly indicate the health risk (addressing Challenges 2 and 3).
Multi-Scale Dilated Convolution
One of our goals is to capture the dynamic variations of biomarkers over time and extract such local patterns as additional clinical features, which an RNN module alone is hard-pressed to achieve. The work [yu2015multi] demonstrated the effectiveness of dilated convolution in extracting local patterns from images. Thus, AdaCare adopts a similar idea by adding convolution filters before the GRU. Differently from [yu2015multi], we extend the dilated convolutional layers with different time spans (i.e., receptive fields), as depicted in Figure 4. By doing so, AdaCare can capture both long-term trends and short-term abnormal variations of biomarkers simultaneously.
Mathematically, the dilated convolution is a convolution applied to the input with defined gaps:

$$(\mathbf{x} *_d f)(t) = \sum_{i=0}^{l-1} f(i)\, \mathbf{x}_{t - d \cdot i}$$

where $\mathbf{x}$ is the input biomarker sequence of $T$ records, $(\mathbf{x} *_d f)$ is the output feature map, $f$ denotes the convolutional filter of length $l$, and $d$ corresponds to the dilation rate. We use multiple filters to generate different feature maps and concatenate them to get the final convolution output, denoted as $\mathbf{c}_t$. Figure 5 shows how dilated convolutional layers with different dilation rates work.
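To make the formula concrete, here is a minimal NumPy sketch of a single dilated-convolution step (the input values and filter are illustrative, not taken from the paper):

```python
import numpy as np

def dilated_conv_step(x, f, t, d):
    """Dilated convolution output at position t: sum_i f[i] * x[t - d*i].

    x : 1-D array of one biomarker over visits
    f : convolutional filter of length l
    d : dilation rate (d = 1 recovers an ordinary convolution)
    """
    l = len(f)
    # Out-of-range positions (t - d*i < 0) are simply skipped here.
    return sum(f[i] * x[t - d * i] for i in range(l) if t - d * i >= 0)

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
f = np.array([0.5, 0.5])  # simple averaging filter, l = 2

# d = 1 looks at consecutive visits; d = 2 skips one visit in between.
print(dilated_conv_step(x, f, t=4, d=1))  # 0.5*16 + 0.5*8 = 12.0
print(dilated_conv_step(x, f, t=4, d=2))  # 0.5*16 + 0.5*4 = 10.0
```

A larger dilation rate thus widens the receptive field (the time span the filter covers) without increasing the filter length.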
In AdaCare, the dilated convolutional module is extended with multi-scale receptive fields: it consists of multiple parallel convolutional branches with the same filter size and stride but different dilation rates. For a given dilation rate, each layer takes multiple records over a time span, and the filters scan across the records to generate feature maps, which are concatenated to represent the long- and short-term variations. Moreover, to prevent the leakage of follow-up records, we utilize causal padding [yu2015multi].
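A minimal PyTorch sketch of such a multi-branch causal module (the class and parameter names are ours, not from the released code; the dilation rates follow the 1, 3, 5 setting reported for MIMIC-III in the implementation details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDilatedConv(nn.Module):
    """Parallel causal dilated Conv1d branches; branch outputs are concatenated."""

    def __init__(self, n_features, n_filters=64, kernel_size=2, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(n_features, n_filters, kernel_size, dilation=d) for d in dilations
        )
        self.dilations = dilations
        self.kernel_size = kernel_size

    def forward(self, x):
        # x: (batch, n_features, T) -- biomarkers over T visits
        outs = []
        for conv, d in zip(self.branches, self.dilations):
            # Left-only ("causal") padding so position t never sees t+1, t+2, ...
            pad = (self.kernel_size - 1) * d
            outs.append(conv(F.pad(x, (pad, 0))))
        return torch.cat(outs, dim=1)  # (batch, n_filters * len(dilations), T)

conv = MultiScaleDilatedConv(n_features=10)
out = conv(torch.randn(4, 10, 30))
print(out.shape)  # torch.Size([4, 192, 30])
```

Because each branch pads only on the left, the output at visit $t$ depends solely on visits $\le t$, matching the causal-padding requirement above.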
Scale-Adaptive Clinical Feature Recalibration
As we extract the sophisticated variation patterns of biomarkers, the multi-scale dilated convolutional module also inevitably introduces redundancy into the model. Besides, some of the clinical features recorded in EMR are highly correlated with each other or even contribute little to the prediction target. Feeding such redundant information into the network reduces the interpretability and robustness of the learned representation. To improve the adaptability of AdaCare in terms of feature utilization, we design the scale-adaptive clinical feature recalibration module based on the squeeze-and-excitation block [hu2018squeeze]. This module is trained to explicitly model the nonlinear dependencies between clinical features. For a particular patient, it can selectively give more weight to the representative and predictive features and suppress the unimportant ones. For patients with different disease conditions (e.g., suffering from a chronic disease), the representative and predictive features at the corresponding time scale (e.g., the long-term dilated convolutional features) are enhanced. Concretely, we design two fully-connected layers to learn the abstract weight representation and then re-scale it to the original dimension:
$$\mathbf{w} = \phi\left(W_2\, \delta(W_1 \mathbf{z})\right)$$

where $\mathbf{z}$ is the input of the abstraction operation; the parameter matrices are $W_1 \in \mathbb{R}^{(N/r) \times N}$ and $W_2 \in \mathbb{R}^{N \times (N/r)}$; $r$ is the compress ratio that determines the abstraction degree of features; $W_2$ denotes the mapping matrix which rescales the input back into $N$ dimensions; $\delta$ is a nonlinear activation; and $\phi$ is the gating activation function. Then the learned weight $\mathbf{w}$ can be applied to the original features with an element-wise multiplication:

$$\tilde{\mathbf{z}} = \mathbf{w} \odot \mathbf{z}$$

The original input vector $\mathbf{z}$ is filtered to be sparser, and the redundancy of the network is reduced. Such feature recalibration can be adjusted adaptively and dynamically through the visits according to the patient's particular health condition.
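The recalibration step can be sketched in PyTorch as a squeeze-and-excitation-style gate (the class and parameter names are ours; sigmoid is used as the gating activation here, though the paper also explores a sparsemax variant):

```python
import torch
import torch.nn as nn

class FeatureRecalibration(nn.Module):
    """SE-style recalibration: compress to n/r dims, expand back, gate the input."""

    def __init__(self, n_features, r=2):
        super().__init__()
        self.compress = nn.Linear(n_features, n_features // r)  # plays the role of W1
        self.expand = nn.Linear(n_features // r, n_features)    # plays the role of W2

    def forward(self, z):
        # z: (batch, n_features); w is a per-feature importance weight in (0, 1)
        w = torch.sigmoid(self.expand(torch.relu(self.compress(z))))
        # Element-wise re-weighting; w itself serves as the interpretability signal.
        return w * z, w

recal = FeatureRecalibration(n_features=8, r=2)
x = torch.randn(4, 8)
x_tilde, w = recal(x)
print(x_tilde.shape, w.shape)  # both (4, 8)
```

Returning the weights alongside the gated features is what makes the feature-level importance inspectable per patient and per visit.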
Besides, in AdaCare, both the raw features and the features captured by the dilated convolutional layers are used to represent the current health status together via a recalibrated skip-connection. The selectively enhanced predictive features can be treated as a precursor of health risk for the given patient.
The weighted raw features and the weighted convolutional features are concatenated together to form the visit embedding: $\mathbf{v}_t = [\tilde{\mathbf{x}}_t, \tilde{\mathbf{c}}_t]$. The visit embeddings are fed into the GRU to obtain the hidden representations: $\mathbf{h}_t = \mathrm{GRU}(\mathbf{v}_t, \mathbf{h}_{t-1})$. An attention mechanism could easily be adopted on the hidden representations here, but it is not the primary concern of this paper. Finally, the healthcare prediction is obtained from the hidden state through an output layer (e.g., a fully-connected layer followed by a sigmoid): $\hat{y}_t = \sigma(W_{out} \mathbf{h}_t + b_{out})$.
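The remaining assembly, from visit embeddings through the GRU to a per-visit risk probability, can be sketched as follows; the linear-plus-sigmoid output head is our assumption of a standard binary-risk layer, not a detail spelled out in the text:

```python
import torch
import torch.nn as nn

class RiskPredictor(nn.Module):
    """Visit embeddings -> GRU hidden states -> per-visit risk probability."""

    def __init__(self, embed_dim, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, v):
        # v: (batch, T, embed_dim) -- concatenated recalibrated raw + conv features
        h, _ = self.gru(v)                              # (batch, T, hidden_dim)
        return torch.sigmoid(self.out(h)).squeeze(-1)   # (batch, T), risk in (0, 1)

model = RiskPredictor(embed_dim=32)
y_hat = model(torch.randn(4, 10, 32))
print(y_hat.shape)  # torch.Size([4, 10])
```

Since the GRU runs left to right, the prediction at visit $t$ uses only the embeddings up to $t$, consistent with the causal setup of the convolutional module.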
We conduct the decompensation prediction experiment on the MIMIC-III dataset and the mortality prediction experiment on the ESRD (i.e., end-stage renal disease) dataset. The effectiveness of dynamic feature recalibration is investigated by an overall observation. In order to intuitively show the implication and prediction process of AdaCare, we also develop a simple visualization prototype. The source code of AdaCare, statistics of datasets, case studies, and the visualization prototype are available at our GitHub repository (https://github.com/Accountable-Machine-Intelligence/AdaCare).
Data Preprocessing and Prediction Tasks
MIMIC-III Dataset (https://mimic.physionet.org). We use ICU data from the publicly available Medical Information Mart for Intensive Care (MIMIC-III) database [johnson2016mimic]. We perform the detection of physiologically decompensating patients, which is formulated as a binary classification problem based on patients' clinical events produced during ICU stays [harutyunyan2017multitask]. Physiologic decompensation is formulated as the problem of predicting whether a patient will die within the next 24 hours by continuously monitoring the patient within fixed time windows. Decompensation labels were curated based on the occurrence of the patient's date of death (DOD) within the next 24 hours, and only about 4.2% of samples are positive in the benchmark. There are 1,203 patients (about 2.89%) with overlong sequences (i.e., longer than 400 records). Without loss of fairness, we truncate the length of samples to a reasonable limit (i.e., 400). Eventually, a cohort of 41,602 unique patients with a total of 3,431,622 samples (i.e., records) is used. We fix a test set of 15% of patients and divide the rest of the dataset into a training set and a validation set with a proportion of 85%:15%. We resample the test set 1000 times using the bootstrap method [harutyunyan2017multitask] and calculate the standard deviation of the results.
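The bootstrap evaluation above can be sketched as follows (the labels and scores are synthetic stand-ins for real predictions; `average_precision_score` is scikit-learn's AUPRC estimate):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def bootstrap_std(y_true, y_score, metric=average_precision_score,
                  n_resamples=1000, seed=0):
    """Resample the test set with replacement; return mean and std of the metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # one bootstrap resample of the test set
        scores.append(metric(y_true[idx], y_score[idx]))
    return float(np.mean(scores)), float(np.std(scores))

rng = np.random.default_rng(1)
y_true = np.array([0] * 80 + [1] * 20)  # imbalanced labels, like decompensation
y_score = np.clip(0.4 * y_true + rng.random(100) * 0.6, 0.0, 1.0)
mean, std = bootstrap_std(y_true, y_score)
print(f"AUPRC = {mean:.3f} (std {std:.3f})")
```

The std of the resampled metric is what appears in parentheses in the results table.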
ESRD Dataset. We perform the mortality risk prediction on a real-world end-stage renal disease dataset. All end-stage renal disease patients who received therapy from January 1, 2006, to March 1, 2018, in a real-world hospital are included in this dataset. We select the features that are observed in more than 60% of patients' records. For missing values, we fill each missing cell with the patient's most recent previous observation to prevent the leakage of future information; if no previous observation exists, we impute the cell with the patient's first observed record. The cleaned dataset consists of 656 patients with static baseline information and 13,091 dynamic records. There are 1,196 records with positive labels (i.e., died within 12 months) and 10,804 records with negative labels. We evaluate the models with a 10-fold cross-validation strategy and report the average performance, similar to [ma2018health].
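The per-patient imputation scheme can be sketched with pandas (column and value names are illustrative): earlier observations are carried forward so that no future information leaks, and only a leading gap falls back to the first observed value:

```python
import numpy as np
import pandas as pd

records = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "albumin":    [np.nan, 3.8, np.nan, 4.1, np.nan],
})

def impute(group):
    # Carry the last earlier observation forward (no future leakage) ...
    filled = group.ffill()
    # ... and only for cells before the first observation, use that first value.
    return filled.bfill()

records["albumin"] = records.groupby("patient_id")["albumin"].transform(impute)
print(records["albumin"].tolist())  # [3.8, 3.8, 3.8, 4.1, 4.1]
```

Grouping by patient keeps one patient's values from ever filling another patient's gaps.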
Similar to related research, we assess performance using the area under the precision-recall curve (AUPRC) [keilwagen2014area], the minimum of precision and sensitivity min(Se, P+), and the area under the receiver operating characteristic curve (AUROC) [hanley1982meaning]. min(Se, P+) is calculated as the maximum of min(sensitivity, precision) over the precision-recall curve.
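The three metrics can be computed with scikit-learn; `min_se_pplus` below implements the max-over-the-PR-curve definition just described (the example labels and scores are illustrative):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score)

def min_se_pplus(y_true, y_score):
    """Maximum over the PR curve of min(sensitivity, precision)."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return float(np.max(np.minimum(precision, recall)))

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6])

print(f"AUPRC      = {average_precision_score(y_true, y_score):.3f}")
print(f"AUROC      = {roc_auc_score(y_true, y_score):.3f}")
print(f"min(Se,P+) = {min_se_pplus(y_true, y_score):.3f}")
```

AUPRC is emphasized over AUROC in the results discussion because it is more sensitive to performance on the rare positive class.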
Table 2: Mortality prediction on the ESRD dataset and decompensation prediction on MIMIC-III (standard deviations in parentheses).

| Model | AUPRC (ESRD) | min(Se, P+) (ESRD) | AUROC (ESRD) | AUPRC (MIMIC) | min(Se, P+) (MIMIC) | AUROC (MIMIC) |
|---|---|---|---|---|---|---|
| GRU | 27.14% (.025) | 31.66% (.030) | 80.66% (.013) | 27.84% (.003) | 32.60% (.004) | 89.83% (.003) |
| RETAIN | 26.18% (.021) | 29.98% (.033) | 79.25% (.027) | 25.97% (.004) | 29.00% (.005) | 87.64% (.002) |
| T-LSTM | 27.84% (.019) | 33.37% (.028) | 81.13% (.021) | 26.11% (.003) | 31.86% (.004) | 89.44% (.002) |
| SAnD | 26.31% (.033) | 29.94% (.037) | 79.54% (.032) | 25.24% (.003) | 28.99% (.004) | 88.25% (.003) |
| AdaCare (conv only) | 27.66% (.025) | 30.55% (.039) | 79.64% (.028) | 28.11% (.002) | 32.71% (.003) | 89.77% (.003) |
| AdaCare (conv + sparsemax recalibration) | 30.77% (.021) | 33.45% (.022) | 81.22% (.012) | 28.37% (.003) | 33.10% (.003) | 89.81% (.002) |
| AdaCare (conv + sigmoid recalibration) | 30.98% (.025) | 33.31% (.033) | 80.61% (.019) | 28.95% (.004) | 34.23% (.004) | 89.93% (.002) |
| AdaCare | 31.79% (.020) | 34.46% (.030) | 81.51% (.017) | 30.37% (.004) | 34.29% (.004) | 90.04% (.003) |
Implementation Details and Baselines
Several models share similar insights with AdaCare in learning the representation of patient status; some of them are taken as baseline approaches and listed as follows. We conduct a grid search over the hyper-parameter space for all models.
GRU is the standard Gated Recurrent Unit network.
RETAIN [choi2016retain] utilizes a two-level neural attention mechanism to detect influential visits and significant variables, which provides interpretability.
T-LSTM [baytas2017patient] handles irregular time intervals by enabling time decay. We modify it into a supervised learning model.
SAnD [song2018attend] models clinical time-series data solely based on self-attention. We re-implement SAnD using causal convolution to build the input embedding at each measurement position (i.e., causal padding [van2016wavenet]), instead of the embedding proposed in the original paper, to avoid the violation of causality. The kernel size of the convolutional embedding is set to 1.
We also compare AdaCare with variants of our approach. In Table 2, the conv-only variant uses the multi-scale dilated convolution alone, while the sigmoid and sparsemax variants additionally learn the raw feature recalibration module with the sigmoid and sparsemax activation functions, respectively.
For AdaCare, we set the number of hidden units to 64 for the ESRD dataset and 128 for the MIMIC-III dataset. We use 64 filters for the convolutional layers; the kernel size is set to 2, and the dilation rates are set to 1, 2, 3 / 1, 3, 5 for the ESRD/MIMIC-III dataset, respectively. For the feature recalibration block, we set the compress ratio to 2/4 for the ESRD/MIMIC-III dataset, respectively. We also utilize the dropout strategy (with a dropout rate of 0.5) between the RNN layer and the final output layer for all approaches. We utilize the Adam optimizer [kingma2014adam] with a mini-batch of 128 patients and a fixed learning rate. The training is done on a machine equipped with CPU: Intel Xeon E5-2630, 256GB RAM, and GPU: Nvidia Titan V. We implement AdaCare with PyTorch 1.1.0.
Results of Healthcare Prediction
Table 2 shows the performance of all approaches on the two datasets: MIMIC-III and the ESRD dataset. AdaCare outperforms all baseline models across both datasets in all evaluation metrics, especially AUPRC, which is the most informative and primary evaluation metric when dealing with a highly imbalanced and skewed dataset [davis2006relationship, choi2018mime] like real-world EMR data. Compared to the best baseline model, AdaCare achieves relative improvements of 14.2% and 9.1% in AUPRC on the ESRD and MIMIC-III datasets, respectively. Although RETAIN can provide interpretability, its performance is worse than the basic GRU model on both datasets, which is consistent with the results reported in [ma2018risk].
The conv-only variant (i.e., with multi-scale dilated convolution) outperforms the baselines mentioned above. It confirms our assumption that extracting the historical variation patterns of biomarkers at different time scales can depict the health status more comprehensively. The sigmoid variant (i.e., multi-scale dilated convolution plus feature recalibration with the sigmoid activation function) outperforms the baseline approaches including the conv-only variant. It suggests that the feature recalibration module can enhance the predictive features to build the representation effectively and improve the performance.
To further verify the effectiveness of the model when explicitly providing the most high-risk clinical features of the latest visit to the physician, we also test the performance of the sparsemax variant, which utilizes sparsemax as the activation function of the raw feature recalibration. Such recalibration enhances only a few of the most predictive features and suppresses most of the others. When performing the mortality prediction on the ESRD dataset, the performance of the sparsemax variant is slightly worse than that of the sigmoid variant, and consistently better than most of the comparative approaches. The result indicates that the sparsemax variant is still reliable when performing health prediction on the ESRD dataset.
Interpretability and Implications
A case study of a specific sample is usually used to verify interpretability in EMR analysis research, but it is not convincing enough due to the contingency of case studies. In order to quantitatively assess the reasonability of feature recalibration from an overall perspective, we calculate the average importance weights of biomarkers for different causes of death on the ESRD validation sets. The feature-to-death-cause importance weights are visualized in Figure 6. Some of the essential medical knowledge learned by AdaCare is summarized as follows:
Serum albumin is strongly related to adverse outcomes of ESRD patients, especially for Peritoneal Dialysis-Associated Peritonitis (PDAP). This is consistent with medical research [blake1993serum, spiegel1993serum, cheng2008relationship, meijers2008review], which found that an abnormal value and a decreasing trend of serum albumin usually indicate that the patient may suffer from inflammation and fluid overload.
Urea is related to gastrointestinal (GI) disease and cachexia. According to medical research [honda2006serum], an abnormal value and a decreasing trend of urea usually indicate that the patient may suffer from low protein intake and malnutrition.
Serum chloride (Cl) is strongly related to adverse outcomes of ESRD patients, especially cachexia and infection. According to medical experience, the serum chloride level reflects the renal function of the patient to some extent. However, the relationship between infection, cachexia, and serum chloride has not been fully explored by existing medical research. This noteworthy finding has already raised the interest of medical experts.
We conduct an application-grounded evaluation [doshi2017towards] by inviting 12 experienced medical practitioners (with 5-15 years of practice) from the nephrology departments of 5 different hospitals to evaluate the degree of agreement with the interpretability generated by AdaCare. The interpretability provided by AdaCare is highly consistent with the practical experience of the human experts. Some of the extracted medical knowledge has already been adopted as an ESRD management aid by physicians. More details about the experiment are described in our GitHub repository.
In this paper, AdaCare is proposed to learn clinical health status representations. Specifically, we utilize dilated convolutions with multi-scale receptive fields to capture the long- and short-term historical variations of biomarkers. AdaCare models the nonlinear dependencies of features by extending the SE-block. Such a feature recalibration process selectively enhances the predictive features extracted at the proper time scales and the most high-risk factors of the latest visit. It builds an effective health status representation and provides reasonable interpretability. Experimental results on the MIMIC-III dataset and the ESRD dataset show that AdaCare outperforms the baseline approaches with powerful interpretability. Medical knowledge learned by AdaCare has been positively confirmed by human medical experts and related medical literature.
This work is supported by National Science and Technology Major Project (No. 2018ZX10201002) and the fund of Peking University Health Science Center (BMU20160584). WR is supported by ORCA PRF Project (EP/R026173/1).