For each admitted patient, hospital intensive care units record large amounts data in electronic health records (EHRs). Clinical staff routinely chart vital signs during hourly rounds and when patients are unstable. EHRs record lab test results and medications as they are ordered or delivered by physicians and nurses. As a result, EHRs contain rich sequences of clinical observations depicting both patients’ health and care received. We would like to mine these time series to build accurate predictive models for diagnosis and other applications. Recurrent neural networks (RNNs) are well-suited to learning sequential or temporal relationships from such time series. RNNs offer unprecedented predictive power in myriad sequence learning domains, including natural language processing, speech, video, and handwriting. Recently,Lipton et al. (2016) demonstrated the efficacy of RNNs for multilabel classification of diagnoses in clinical time series data.
However, medical time series data present modeling problems not found in the clean academic datasets on which most RNN research focuses. Clinical observations are recorded irregularly, with measurement frequency varying between patients, across variables, and even over time. In one common modeling strategy, we represent these observations as a sequence with discrete, fixed-width time steps. Problematically, the resulting sequences often contain missing values (Marlin et al., 2012)
. These values are typically not missing at random, but reflect decisions by caregivers. Thus, the pattern of recorded measurements contain potential information about the state of the patient. However, most often, researchers fill missing values using heuristic or unsupervised imputation(Lasko et al., 2013), ignoring the potential predictive value of the missingness itself.
In this work we extend the methodology of Lipton et al. (2016) for RNN-based multilabel prediction of diagnoses. We focus on data gathered from the Children’s Hospital Los Angeles pediatric intensive care unit (PICU). Unlike Lipton et al. (2016), who approach missing data via heuristic imputation, we directly model missingness as a feature, achieving superior predictive performance. RNNs can realize this improvement using only simple binary indicators for missingness. However, linear models are unable to use indicator features as effectively. While RNNs can learn arbitrary functions, capturing the interactions between the missingness indicators the sequence of observation inputs, linear models can only learn substitution values. For linear models, we introduce an alternative strategy to capture this signal, using a small number of simple hand-engineered features.
Our experiments demonstrate the benefit modeling missing data as a first-class feature. Our methods improve the performance of RNNs, multilayer perceptrons (MLPs), and linear models. Additionally we analyze the predictive value of missing data information by training models on the missingness indicators only. We show that for several diseases,what tests are run
can be as predictive as the actual measurements. While we focus on classifying diagnoses, our methods can be applied to any predictive modeling problem involving sequence data and missing values, such as early prediction of sepsis(Henry et al., 2015) or real-time risk modeling (Wiens et al., 2012).
It is worth noting that we may not want our predictive models to rely upon the patterns of treatment, as argued by Caruana et al. (2015). Once deployed, our models may influence the treatment protocols, shifting the distribution of future data, and thus invalidating their predictions. Nonetheless, doctors at present often utilize knowledge of past care, and treatment signal can leak into the actual measurements themselves in ways that sufficiently powerful models can exploit. As a final contribution of this paper, we present a critical discussion of these practical and philosophical issues.
Our dataset consists of patient records extracted from the EHR system at CHLA (Marlin et al., 2012; Che et al., 2015) as part of an IRB-approved study. In all, the dataset contains PICU episodes. Each episode describes the stay of one patient in the PICU for a period of at least 12 hours. In addition, each patient record contains a static set of diagnostic codes, annotated by physicians either during or after each PICU visit.
In their rawest representation, episodes consist of irregularly spaced measurements of 13 variables: diastolic and systolic blood pressure, peripheral capillary refill rate, end-tidal CO (ETCO), fraction of inspired O (FIO), total Glascow coma scale, blood glucose, heart rate, pH, respiratory rate, blood oxygen saturation, body temperature, and urine output. To render our data suitable for learning with RNNs, we convert to discrete sequences of hourly time steps, where time step covers the interval between hours and , closed on the left but open on the right. Because actual admission times are not recorded reliably, we use the time of the first recorded observation as time step . We combine multiple measurements of the same variable within the same hour window by taking their mean.
Vital signs, such as heart rate, are typically measured about once per hour, while lab tests requiring a blood draw (e.g., glucose) are measured on the order of once per day (see appendix B for measurement frequency statistics). In addition, the timing of and time between observations varies across patients and over time. The resulting sequential representation have many missing values, and some variables missing altogether.
Note that our methods can be sensitive to the duration of our discrete time step. For example, halving the duration would double the length of the sequences, making learning by backpropagation through time more challenging(Bengio et al., 1994). For our data, such cost would not be justified because the most frequently measured variables (vital signs) are only recorded about once per hour. For higher frequency recordings of variables with faster dynamics, a shorter time step might be warranted.
To better condition our inputs, we scale each variable to the interval, using expert-defined ranges. Additionally, we correct for differences in heart rate, respiratory rate, (Fleming et al., 2011) and blood pressure (NHBPEP Working Group 2004) due to age and gender using tables of normal values from large population studies.
2.2 Diagnostic labels
In this work, we formulate phenotyping (Oellrich et al, 2015) as multilabel classification of sequences. Our labels include distinct diagnosis codes from an in-house taxonomy at CHLA, similar to ICD-9 codes (World Health Organization, 2004) commonly used in medical informatics research. These labels include a wide range of acute conditions, such as acute respiratory distress, congestive heart failure, and sepsis. A full list is given in appendix A. We focus on the most frequent, each having at least 50 positive examples in our dataset. Naturally, the diagnoses are not mutually exclusive. In our data set, the average patient is associated with diagnoses. Additionally, the base rates of the diagnoses vary widely (see appendix A).
3 Recurrent Neural Networks for Multilabel Classification
While our focus in this paper is on missing data, for completeness, we review the LSTM RNN architecture for performing multilabel classification of diagnoses introduced by Lipton et al. (2016). Formally, given a series of observations , we desire a classifier to generate hypotheses of the true labels , where each input and the output . Here, denotes the input dimension, denotes the number of labels, indexes sequence steps, and for any example, denotes the length of that sequence.
. As output, we use a fully connected layer followed by an element-wise logistic activation function. We apply log loss
(binary cross-entropy) as the loss function at each output node.
The following equations give the update for a layer of memory cells , where stands for the previous layer at the same sequence step (a previous LSTM layer or the input ) and stands for the same layer at the previous sequence step:
In these equations, stands for an element-wise application of the logistic function, stands for an element-wise application of the function, and is the Hadamard (element-wise) product. The input, output, and forget gates are denoted by , , and respectively, while is the input node and has a tanh activation.
The loss at a single sequence step is the average log loss calculated across all labels:
To overcome the difficulty of learning to pass information across long sequences, we use the target replication strategy proposed by Lipton et al. (2016), in which we replicate the static targets at each sequence step providing a local error signal. This technique is also motivated by our problem: we desire to make accurate predictions even if the sequence were truncated (as in early-warning systems). To calculate loss, we take a convex combination of the final step loss and the average of the losses over predictions at all steps :
where is a hyper-parameter determining the relative importance of performance on the intermediary vs. final targets. At inference time, we consider only the output at the final step.
4 Missing Data
In this section, we explain our procedures for imputation, missing data indicator sequences, engineering features of missing data patterns.
To address the missing data problem, we consider two different imputation strategies (forward-filling and zero imputation), as well as direct modeling via indicator variables. Because imputation and direct modeling are not mutually exclusive, we also evaluate them in combination. Suppose that is “missing.” In our zero-imputation strategy, we simply set whenever it is missing. In our forward-filling strategy, we impute as follows:
If there is at least one previously recorded measurement of variable at a time , we perform forward-filling by setting .
If there is no previous recorded measurement (or if the variable is missing entirely), then we impute the median estimated over all measurements in the training data.
This strategy is motivated by the intuition that clinical staff record measurements at intervals proportional to rate at which they are believed or observed to change. Heart rate, which can change rapidly, is monitored much more frequently than blood pH. Thus it seems reasonable to assume that a value has changed little since the last time it was measured.
4.2 Learning with Missing Data Indicators
Our indicator variable
approach to missing data consists of augmenting our inputs with binary variablesfor every , where if is imputed and 0 otherwise. Through their hidden state computations, RNNs can use these indicators to learn arbitrary functions of the past observations and missingness patterns. However, given the same data, linear models can only learn hard substitution rules. To see why, consider a linear model that outputs prediction , where . With indicator variables, we might say that where are the weights for each . If is set to and to , whenever the feature is missing, then the impact on the output is exactly equal to the contribution for some . In other words, the linear model can only use the indicator in a way that depends neither on the previously observed values (), nor any other evidence in the inputs.
Note that for a linear model, the impact of a missing data indicator on predictions must be monotonic. In contrast, the RNN might infer that for one patient heart rate is missing because they went for a walk, while for another it might signify an emergency. Also note that even without indicators, the RNN might learn to recognize filled-in vs real values. For example, with forward-filling, the RNN could learn to recognize exact repeats. For zero-filling, the RNN could recognize that values set to exactly were likely missing measurements.
4.3 Hand-engineered missing data features
To overcome the limits of the linear model, we also designed features from the indicator sequences. As much as possible, we limited ourselves to features that are simple to calculate, intuitive, and task-agnostic. The first is a binary indicator for whether a variable was measured at all
. Additionally, we compute the mean and standard deviation of the indicator sequence. The mean captures the frequency with which each variable is measured which carries information about the severity of a patient’s condition. The standard deviation, on the other hand, computes a non-monotonic function of frequency that is maximized when a variable is missing exactly 50% of the time. We also compute the frequency with which a variable switches from measured to missing or vice versa across adjacent sequence steps. Finally, we add features that capture the relative timing of the first and last measurements of a variable, computed as the number of hours until the measurement divided by the length of the full sequence.
We now present the training details and empirical findings of our experiments. Our LSTM RNNs each have hidden layers of LSTM cells each, non-recurrent dropout of , and weight decay of . We train on of data, setting aside each for validation and testing. We train each RNN for epochs, retaining the parameters corresponding to the epoch with the lowest validation loss.
We compare the performance of RNNs against logistic regression and multilayer perceptrons (MLPs). We applyregularization to the logistic regression model. The MLP has hidden layers with
nodes each, rectified linear unit activations, and dropout (with probability of 0.5), choosing the number of layers and nodes by validation performance. We train the MLP using stochastic gradient descent with momentum.
We evaluate each baseline with two sets of features: raw and hand-engineered. Note that our baselines cannot be applied directly to variable-length inputs. For the raw features, we concatenate three 12-hour subsequences, one each from the beginning, middle, and end of the time series. For shorter time series, these intervals may overlap. Thus raw representations contain features. We train each baseline on five different combinations of raw inputs: (1) measurements with zero-filling, (2) measurements with forward-filling, (2) measurements with zero-filling + missing data indicators, (4) forward-filling + missing data indicators, and (5) missing data indicators only.
Our hand-engineered features capture central tendencies, variability, extremes, and trends. These include the first and last measurements and their difference, maximum and minimum values, mean and standard deviation, median and 25th and 75th percentiles, and the slope and intercept of least squares line fit. We also computed the 8 missing data features described in section 4. We improve upon the baselines in Lipton et al. (2016) by computing the hand-engineered features over different windows of time, giving them access to greater temporal information and enabling them to better model patterns of missingness. We extract hand-engineered features from the entire time series and from three possibly overlapping intervals: the first and last 12 hours and the interval between (for shorter sequences, we instead use the middle 12 hours). This yields a total of and hand-engineered measurement and missing data features, respectively. We train baseline models on three different combinations of hand-engineered features: (1) measurement-only, (2) indicator-only, and (3) measurement and indicator.
We evaluate all models on the same training, validation, and test splits. Our evaluation metrics include area under the ROC curve (AUC) and F1 score (with threshold chosen based on validation performance). We report both micro-averaged (calculated across all predictions) and macro-averaged (calculated separately on each label, then averaged) measures to mitigate the weaknesses in each(Lipton et al., 2014). Finally we also report precision at , whose maximum is because we have on average diagnoses per patient. This metric seems appropriate because we could imagine this technology would be integrated into a diagnostic assistant. In that case, its role might be to suggest the most likely diagnoses among which a professional doctor would choose. Precision at 10 evaluates the quality of the top 10 suggestions.
|Classification performance for 128 ICU phenotypes|
|Model||Micro AUC||Macro AUC||Micro F1||Macro F1||P@10|
|Log Reg - Zeros||0.8108||0.7244||0.2149||0.0999||0.1014|
|Log Reg - Impute||0.8201||0.7455||0.2404||0.1189||0.1038|
|Log Reg - Zeros & Indicators||0.8143||0.7269||0.2239||0.1082||0.1017|
|Log Reg - Impute & Indicators||0.8242||0.7442||0.2467||0.1234||0.1045|
|Log Reg - Indicators Only||0.7929||0.6924||0.1952||0.0889||0.0939|
|MLP - Zeros||0.8263||0.7502||0.2344||0.1072||0.1048|
|MLP - Impute||0.8376||0.7708||0.2557||0.1245||0.1031|
|MLP - Zeros & Indicators||0.8381||0.7705||0.2530||0.1224||0.1067|
|MLP - Impute & Indicators||0.8419||0.7805||0.2637||0.1296||0.1082|
|MLP - Indicators Only||0.8112||0.7321||0.1962||0.0949||0.0947|
|LSTM - Zeros||0.8662||0.8133||0.2909||0.1557||0.1176|
|LSTM - Impute||0.8600||0.8062||0.2967||0.1569||0.1159|
|LSTM - Zeros & Indicators||0.8730||0.8250||0.3041||0.1656||0.1215|
|LSTM - Impute & Indicators||0.8689||0.8206||0.3027||0.1609||0.1196|
|LSTM - Indicators Only||0.8409||0.7834||0.2403||0.1291||0.1074|
|Models using Hand-Engineered Features|
|Log Reg HE||0.8396||0.7714||0.2708||0.1327||0.1118|
|Log Reg HE Indicators||0.8472||0.7752||0.2841||0.1376||0.1165|
|Log Reg HE Indicators Only||0.8187||0.7322||0.2287||0.1001||0.1020|
|MLP HE Indicators||0.8669||0.8160||0.2954||0.1610||0.1202|
|MLP HE Indicators Only||0.8371||0.7682||0.2351||0.1179||0.1028|
The best overall model by all metrics (micro AUC of ) is an LSTM with zero-imputation and missing data indicators. It outperforms both the strongest MLP baseline and LSTMs absent missing data indicators. For the LSTMs using either imputation strategy, adding the missing data indicators improves performance in all metrics. While all models improve with access to missing data indicators, this information confers less benefit to the raw input linear baselines, consistent with theory discussed in subsection 4.2.
The results achieved by logistic regression with hand-engineered features indicates that our simple hand-engineered missing data features do a reasonably good job of capturing important information that neural networks are able to mine automatically. We also find that LSTMs (with or without indicators) appear to perform better with zero-filling than with with imputed values. Interestingly, this is not true for either baseline. It suggests that the LSTM may be learning to recognize missing values implicitly by recognizing a tight range about the value zero and inferring that this is a missing value. If this is true, perhaps imputation interferes with the LSTM’s ability to implicitly recognize missing values. Overall, the ability to implicitly infer missingness may have broader implications. It suggests that we might never completely hide this information from a sufficiently powerful model.
6 Related Work
This work builds upon research relating to missing values and machine learning for medical informatics. The basic RNN methodology for phenotyping derives fromLipton et al. (2016), addressing a dataset and problem described by Che et al. (2015). The methods rely upon LSTM RNNs (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) trained by backpropagation through time (Hinton et al., 2006; Werbos, 1988). A comprehensive perspective on the history and modern applications of RNNs is provided by Lipton et al. (2015), while Lipton et al. (2016) list many of the previous works that have applied neural networks to digital health data.
While a long and rich literature addresses pattern recognition with missing data(Cohen and Cohen, 1975; Allison, 2001)
, most of this literature addresses fixed-length feature vectors(García-Laencina et al., 2010; Pigott, 2001). Indicator variables for missing data were first proposed by Cohen and Cohen (1975), but we could not find papers that combine missing data indicators with RNNs. Only a handful of papers address missing data in the context of RNNS. Bengio and Gingras (1996) demonstrate a scheme by which the RNN learns to fill in the missing values such that the filled-in values minimize output error. In 2001, Parveen and Green (2001) built upon this method to improve automatic speech recognition. Barker et al. (2001) suggests using a mask of indicators in a scheme for weighting the contribution of reliable vs corrupted data in the final prediction. Tresp and Briegel (1998) address missing values by combining an RNN with a linear state space model to handle uncertainty. This paper may be one of the first to engineer explicit features of missingness patterns in order to improve discriminative performance. Also, to our knowledge, we are the first to harness patterns of missing data to improve the classification of critical care phenotypes.
Data processing and discriminative learning have often been regarded as separate disciplines. Through this separation of concerns, the complementarity of missing data indicators and training RNNs for classification has been overlooked. This paper proposes that patterns of missing values are an underutilized source of predictive power and that RNNs, unlike linear models, can effectively mine this signal from sequences of indicator values. Our hypotheses are confirmed by empirical evidence. Additionally, we introduce and confirm the utility of a simple set of features, engineered from the sequence of missingness indicators, that can improve performance of linear models. These techniques are simple to implement and broadly applicable and seem likely to confer similar benefits on other sequential prediction tasks, when data is missing not at random. One example might include financial data, where failures to report accounting details could suggest internal problems at a company.
7.1 The Perils and Inevitability of Modeling Treatment Patterns
For medical applications, the predictive power of missing data raises important philosophical concerns. We train models with supervised learning, and verify their utility by assessing the accuracy of their classifications on hold-out test data. However, in practice, we hope to make treatment decisions based on these predictions, exposing a fundamental incongruity between the problem on which our models are trained and those for which they are ultimately deployed. As articulated inLipton (2016), these supervised models, trained offline, cannot account for changes that their deployment might confer upon the real world, possibly invalidating their predictions. Caruana et al. (2015) present a compelling case in which a pneumonia risk model predicted a lower risk of death for patients who also have asthma. The better outcomes of the asthma patients, as it turns out, owed to the more aggressive treatment they received. The model, if deployed, might be used to choose less aggressive treatment for the patients with both pneumonia and asthma, clearly a sub-optimal course of action.
On the other hand, to some degree, learning from treatment signal may be inevitable. Any imputation might leak some information about which values are likely imputed and which are not. Thus any sufficiently powerful supervised model might catch on to some amount of missingness signal, as was the case in our experiments with the LSTM using zero-filled missing values. Even physiologic measurements contain information owing to patterns of treatment, possibly reflecting the medications patients receive and the procedures they undergo.
Sometimes the patterns of treatments may be a reasonable and valuable source of information. Doctors already rely on this kind of signal habitually: they read through charts, noting which other doctors have seen a patient, inferring what their opinions might have been from which tests they ordered. While, in some circumstances, it may be problematic for learning models to rely on this signal, removing it entirely may be both difficult and undesirable.
7.2 Complex Models or Complex Features?
Our work also shows that using only simple features, RNNs can achieve state of the art performance classifying clinical time series. The RNNs far outperform linear models. Still, in our experience, there is a strong bias among practitioners toward more familiar models even when they require substantial feature engineering.
In our experiments, we undertook extensive efforts to engineer features to boost the performance of both linear models and MLPs. Ultimately, while RNNs performed best on raw data, we could approach its performance with an MLP and significantly improve the linear model by using hand-engineered features and windowing. A question then emerges: how should we evaluate the trade-off between more complex models and more complex features? To the extent that linear models are believed to be more interpretable than neural networks, most popular notions of interpretability hinge upon the intelligibility of the features (Lipton, 2016). When performance of the linear model comes at the price of this intelligibility, we might ask if this trade-off undermines the linear model’s chief advantage. Additionally, such a model, while still inferior to the RNN, relies on application-specific features less likely to be useful on other datasets and tasks. In contrast, RNNs seem better equipped to generalize to different tasks. While the model may be complex, the inputs remain intelligible, opening the possibility to various post-hoc interpretations (Lipton, 2016).
7.3 Future Work
We see several promising next steps following this work. First, we would like to validate this methodology on tasks with more immediate clinical impact, such as predicting sepsis, mortality, or length of stay. Second, we’d like to extend this work towards predicting clinical decisions. Called policy imitation in the reinforcement literature, such work could pave the way to providing real-time decision support. Finally, we see machine learning as cooperating with a human decision-maker. Thus a machine learning model needn’t always make a prediction/classification; it could also abstain. We hope to make use of the latest advances in mining uncertainty information from neural networks to make confidence-rated predictions.
Zachary C. Lipton was supported by the Division of Biomedical Informatics at the University of California, San Diego, via training grant (T15LM011271) from the NIH/NLM. David Kale was supported by the Alfred E. Mann Innovation in Engineering Doctoral Fellowship. The VPICU was supported by grants from the Laura P. and Leland K. Whittier Foundation. We acknowledge NVIDIA Corporation for Tesla K40 GPU hardware donation and Professors Charles Elkan and Greg Ver Steeg for their support and advice.
- Allison (2001) Paul D Allison. Missing data. 2001.
- Barker et al. (2001) Jon Barker, Phil Green, and Martin Cooke. Linking auditory scene analysis and robust asr by missing data techniques. 2001.
- Bengio and Gingras (1996) Yoshua Bengio and Francois Gingras. Recurrent neural networks for missing or asynchronous data. 1996.
- Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 1994.
- Caruana et al. (2015) Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In KDD, 2015.
- Che et al. (2015) Zhengping Che, David C. Kale, Wenzhe Li, Mohammad Taha Bahadori, and Yan Liu. Deep computational phenotyping. In KDD. ACM, 2015.
- Cohen and Cohen (1975) Jacob Cohen and Patricia Cohen. Applied multiple regression/correlation analysis for the behavioral sciences. 1975.
- Fleming et al. (2011) Susannah Fleming, Matthew Thompson, Richard Stevens, Carl Heneghan, Annette Plüddemann, Ian Maconochie, Lionel Tarassenko, and David Mant. Normal ranges of heart rate and respiratory rate in children from birth to 18 years: A systematic review of observational studies. The Lancet, 2011.
- García-Laencina et al. (2010) Pedro J García-Laencina, José-Luis Sancho-Gómez, and Aníbal R Figueiras-Vidal. Pattern classification with missing data: a review. Neural Computing and Applications, 2010.
- Gers et al. (2000) Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000.
- Gers et al. (2003) Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. JMLR, 2003.
- Henry et al. (2015) Katharine E Henry, David N Hager, Peter J Pronovost, and Suchi Saria. A targeted real-time early warning score (trewscore) for septic shock. Science Translational Medicine, 2015.
- Hinton et al. (2006) Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
- Lasko et al. (2013) Thomas A. Lasko, Joshua C. Denny, and Mia A. Levy. Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS ONE, 2013.
- Lipton (2016) Zachary C Lipton. The mythos of model interpretability. ICML Workshop on Human Interpretability in Machine Learning, 2016.
- Lipton et al. (2014) Zachary C Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. Optimal thresholding of classifiers to maximize F1 measure. In Machine Learning and Knowledge Discovery in Databases. 2014.
- Lipton et al. (2015) Zachary C. Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequence learning. arXiv:1506.00019, 2015.
- Lipton et al. (2016) Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzell. Learning to diagnose with lstm recurrent neural networks. ICLR, 2016.
- Marlin et al. (2012) Ben M. Marlin, David C. Kale, Robinder G. Khemani, and Randall C. Wetzel. Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, 2012.
- National High Blood Pressure Education Program Working Group on Children and Adolescents (2004) National High Blood Pressure Education Program Working Group on Children and Adolescents. The fourth report on the diagnosis, evaluation, and treatment of high blood pressure in children and adolescents. Pediatrics, 2004.
- Oellrich et al (2015) Anika Oellrich et al. The digital revolution in phenotyping. Briefings in Bioinformatics, 2015.
- Parveen and Green (2001) Shahla Parveen and P Green. Speech recognition with missing data using recurrent neural nets. In NIPS, 2001.
- Pigott (2001) Therese D Pigott. A review of methods for missing data. Educational research and evaluation, 2001.
- Tresp and Briegel (1998) Volker Tresp and Thomas Briegel. A solution for missing data in recurrent neural networks with an application to blood glucose prediction. In NIPS. 1998.
- Werbos (1988) Paul J Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1988.
- Wiens et al. (2012) Jenna Wiens, Eric Horvitz, and John V Guttag. Patient risk stratification for hospital-associated c. diff as a time-series classification task. In NIPS, 2012.
World Health Organization (2004)
World Health Organization.
International statistical classification of diseases and related health problems. World Health Organization, 2004.
Appendix A Per Diagnosis Classification Performance
In this appendix, we provide per-diagnosis AUC and F1 scores for three representative LSTM models trained with imputed measurements, with imputation plus missing indicators, and with indicators only. By comparing performance on individual diagnoses, we can gain some insight into the relationship between missing values and different conditions. Rows are sorted in descending order based on the F1 score of the imputation plus indicators model. It is worth noting that F1 scores are sensitive to threshold, which we chose in order to optimize per-disease validation F1, sometimes based on a very small number of positive cases. Thus, there are cases where one model will have superior AUC but worse F1.
|Classifier Performance on Each Diagnostic Code, Sorted by F1|
|Msmt.||Msmt. + indic.||Indic.|
|Diabetes mellitus with ketoacidosis||0.0125||1.0000||0.8889||0.9999||0.9333||0.9906||0.7059|
|Asthma with status asthmaticus||0.0202||0.9384||0.6800||0.8907||0.6383||0.8652||0.5417|
|Renal transplant, status post||0.0122||0.9667||0.2963||0.9544||0.4762||0.9490||0.5600|
|Liver transplant, status post||0.0106||0.7534||0.3158||0.8283||0.4762||0.8271||0.2581|
|Acute respiratory distress syndrome||0.0193||0.9696||0.3590||0.9705||0.4557||0.9361||0.3333|
|End stage renal disease (on dialysis)||0.0241||0.8548||0.2778||0.8800||0.4186||0.9043||0.4255|
|Acute respiratory failure||0.0981||0.8414||0.4128||0.8391||0.3835||0.8358||0.4542|
|Respiratory distress, other||0.0716||0.8411||0.3873||0.8502||0.3719||0.7857||0.2143|
|Intracranial injury, closed||0.0525||0.8886||0.2817||0.9002||0.3711||0.8442||0.3208|
|Seizures, status epilepticus||0.0348||0.8381||0.4158||0.8505||0.3704||0.8440||0.3226|
|Pneumonia due to adenovirus||0.0123||0.8604||0.1250||0.9065||0.3077||0.8792||0.1818|
|Leukemia (acute, without remission)||0.0287||0.8585||0.2783||0.8845||0.3059||0.8551||0.2703|
|Dissem. intravascular coagulopathy||0.0099||0.9556||0.5000||0.9532||0.2857||0.9555||0.2500|
|Congestive heart failure||0.0133||0.8748||0.1429||0.8756||0.2703||0.8326||0.1364|
|Upper airway obstruc. (UAO), other||0.0378||0.8206||0.2564||0.8573||0.2542||0.8350||0.1964|
|Diabetes mellitus type I, stable||0.0064||0.7105||0.0000||0.9625||0.2500||0.9356||0.3333|
|Cerebral palsy (infantile)||0.0262||0.8230||0.2609||0.8359||0.2500||0.6773||0.0980|
|UAO, ENT surgery, post-status||0.0302||0.9059||0.4058||0.8733||0.2400||0.8364||0.1975|
|Acute renal failure, unspecified||0.0191||0.9242||0.2381||0.9510||0.2381||0.9507||0.3291|
|Hepatic fail. (acute necrosis of liver)||0.0176||0.8489||0.2222||0.9239||0.2308||0.8598||0.1935|
|Craniosynostosis (anomalies of skull)||0.0064||0.7824||0.0000||0.9267||0.2286||0.8443||0.0315|
|Prematurity (37 weeks gestation)||0.0321||0.7520||0.1548||0.7542||0.2245||0.7042||0.1266|
|Hydrocephalus, other (congenital)||0.0381||0.7118||0.2099||0.7500||0.2241||0.7065||0.1961|
|Congenital muscular dystrophy||0.0121||0.8427||0.2500||0.8491||0.2143||0.7460||0.0800|
|Classifier Performance on Each Diagnostic Code, Sorted by F1|
|Msmt.||Msmt. + indic.||Indic.|
|Tumor, disseminated or metastatic||0.0180||0.7178||0.0938||0.7415||0.1967||0.6837||0.1062|
|Child abuse, suspected||0.0065||0.9544||0.2222||0.8642||0.1818||0.8227||0.0870|
|Pneumonia, bacterial (pneumococ.)||0.0186||0.8876||0.1333||0.8806||0.1600||0.8616||0.0000|
|Pneumonia due to inhalation||0.0078||0.7917||0.1111||0.8602||0.1538||0.8268||0.0566|
|Metabolic or endocrine disorder||0.0095||0.7718||0.0364||0.6929||0.1538||0.6319||0.2000|
|Disorder of kidney and ureter, other||0.0204||0.8486||0.2857||0.8650||0.1500||0.8238||0.2500|
|Urinary tract infection||0.0137||0.7478||0.1154||0.7402||0.1481||0.7229||0.0588|
|Cardiac arrest, outside hospital||0.0118||0.8932||0.0976||0.8791||0.1379||0.8881||0.0556|
|Suspected septicemia, rule out||0.0143||0.7378||0.0923||0.7402||0.1029||0.6769||0.0000|
|(Benign) intracranial hypertension||0.0099||0.8494||0.0000||0.9018||0.1020||0.8586||0.1224|
|Atrial septal defect||0.0107||0.7766||0.0727||0.7765||0.0909||0.7197||0.0000|
|Hematologic disorder, other||0.0114||0.6736||0.0800||0.6898||0.0870||0.8074||0.0800|
|Gastrointestinal bleed, other||0.0064||0.8325||0.0541||0.7974||0.0741||0.7996||0.0909|
|Classifier Performance on Each Diagnostic Code, Sorted by F1|
|Msmt.||Msmt. + indic.||Indic.|
|Respiratory syncytial virus||0.0130||0.8876||0.2857||0.9145||0.0645||0.8716||0.0930|
|(Hereditary) hemolytic anemia, other||0.0088||0.7582||0.0548||0.8544||0.0645||0.9125||0.0513|
|Obstructive sleep apnea||0.0185||0.7564||0.0613||0.8200||0.0645||0.8087||0.1111|
|Trauma, long bone injury||0.0096||0.8757||0.0952||0.9085||0.0597||0.7946||0.1176|
|Bowel (intestinal) obstruction||0.0104||0.7512||0.0984||0.6559||0.0597||0.6936||0.0424|
|Neurologic disorder, other||0.0288||0.7628||0.1481||0.6978||0.0588||0.5971||0.0769|
|Spinal cord lesion||0.0133||0.7298||0.0585||0.7052||0.0488||0.8168||0.0414|
|Pneumonia, other (mycoplasma)||0.0188||0.8589||0.1613||0.8792||0.0476||0.8424||0.1164|
|Obstructed ventriculoperitoneal shunt||0.0073||0.6824||0.0267||0.7114||0.0331||0.7516||0.0667|
|Ventricular septal defect||0.0119||0.6641||0.1081||0.5680||0.0294||0.5593||0.0444|
|Croup Syndrome, UAO||0.0069||0.9418||0.2222||0.9834||0.0000||0.9682||0.2222|
|Sickle-cell anemia, unspecified||0.0080||0.6262||0.0000||0.9627||0.0000||0.8661||0.1250|
|Metabolic acidosis (¡7.1)||0.0083||0.9475||0.1818||0.9046||0.0000||0.9143||0.1538|
|Immunologic disorder, other||0.0094||0.9539||0.1500||0.8868||0.0000||0.8969||0.1212|
|Pulmonary hypertension, other||0.0112||0.9259||0.2500||0.8826||0.0000||0.8098||0.0000|
|Spinal muscular atrophy||0.0052||0.9666||0.0000||0.8658||0.0000||0.8362||0.0000|
|Bone marrow transplant, status post||0.0097||0.8161||0.5217||0.8562||0.0000||0.8505||0.1695|
|Gastrointestinal bleed, upper||0.0063||0.8388||0.0000||0.8078||0.0000||0.7256||0.0000|
|Arrhythmia, supraventricular tachy.||0.0055||0.8178||0.0385||0.7867||0.0000||0.8199||0.0000|
|Congenital central alveolar hypovent.||0.0057||0.7067||0.0000||0.7716||0.0000||0.7282||0.0000|
|Tetralogy of fallot||0.0061||0.5759||0.0000||0.7614||0.0000||0.7637||0.0000|
|Cardiac disorder, other||0.0071||0.7229||0.0519||0.7552||0.0000||0.6287||0.0000|
|Hydrocephalus, shunt failure||0.0083||0.7715||0.0000||0.7542||0.0000||0.7986||0.0635|
|Cerebral infarction (CVA)||0.0058||0.6766||0.0000||0.7495||0.0000||0.7148||0.1333|
|Congenital heart disorder, other||0.0084||0.7590||0.0000||0.7277||0.0000||0.7803||0.0583|
|Gastrointestinal disorder, other||0.0139||0.6755||0.0336||0.6821||0.0000||0.6465||0.1026|
|UAO, extubation, status post||0.0085||0.8295||0.0672||0.6063||0.0000||0.6128||0.0000|
Appendix B Missing
In this appendix, we present information
about the sampling rates and missingness characteristics of our 13 variables.
The first column lists the average number of measurements per hour in all episodes with at least one measurement (excluding episodes where the variable is missing entirely). The second column lists the fraction of episodes in which the variable is missing completely (there are zero measurements). The third column lists the missing rate in the resulting discretized sequences.
|Variable||Msmt./hour||Missing entirely||Frac. missing|
|Diabstolic blood pressure||0.5162||0.0135||0.1571|
|Systolic blood pressure||0.5158||0.0135||0.1569|
|Peripheral capillary refall rate||1.0419||0.0140||0.5250|
|Fraction inspired O||1.3004||0.1545||0.7873|
|Total glasgow coma scale||1.0394||0.0149||0.5250|