In 2016 the average daily cost of an inpatient stay in California was $3,421, and it has continued to rise since then (AHA, 2016). As hospitals try to lower costs and improve quality of care, operational goals such as the timely discharge of patients who are ready to leave the hospital have become increasingly important. Timely discharge benefits patients and health care systems by freeing scarce beds for other patients and reducing exposure to iatrogenic conditions (J. Graham Atkinson, 2014), in addition to reducing costs per patient, improving post-discharge care (Colwell, 2014), and lowering the risk of readmission (Kaboli et al., 2012). In this paper, we present preliminary work on predicting which patients will be discharged in the next 24 hours. Accurate identification of such patients can help hospitals reduce unnecessary delays that add hospital days for discharge-ready patients, both by prioritizing these patients for critical services that must be completed prior to discharge (e.g., imaging, lab tests, arranging for transportation) and by load balancing so that discharge-ready patients are spread across different clinical staff.
We approach this task by formulating a supervised learning problem, taking advantage of the abundance of patient data available in Electronic Health Records (EHR). Such data have been successfully applied to a variety of clinical tasks, such as medical imaging diagnosis (Gulshan et al., 2016), mining medical notes (Nguyen et al., 2017), and disease onset prediction (Liu et al., 2018). We compare the performance of various machine learning models on the task of predicting 24-hour discharge. Our best performing model, a gradient boosted tree model, achieves good discrimination and calibration on held-out test data. We also perform a preliminary analysis of the marginal utility gains possible relative to trivial classifiers. We find some evidence that these models, with appropriately designed downstream interventions, may offer some benefit.
2 Related Work
Widespread adoption of EHR systems following the passage of the Affordable Care Act (Adler-Milstein et al., 2015) has fueled increasing interest in the secondary use of data collected during routine clinical care as input to machine learning algorithms to solve clinical problems. Prior work has focused on predicting diverse clinical outcomes such as prolonged length of stay, unplanned readmissions, inpatient mortality and diagnoses (Rajkomar et al., 2018; Harutyunyan et al., 2017; Suresh et al., 2018; Wang et al., 2014). There is also substantial prior work on forecasting length of stay (LOS) and discharge, but this work has typically been limited to specific patient populations. Chaou et al. (2017) modelled Emergency Department LOS using an accelerated failure time model and a relatively specific set of features, including a triage level index assigned by a specialized nurse, while Yakovlev et al. (2018) focused on predicting LOS for acute coronary syndrome patients. Barnes et al. (2015) sought to produce daily predictions of discharge for a single inpatient unit. Hachesu et al. (2013) made a similar attempt to study the same problem for cardiac patients. In contrast, our present work encompasses the entire inpatient population.
3.1 Data Source
3.2 Cohort Selection
EHR data on approximately 1 million adult patients at SHC were extracted from encounters occurring between January 1, 2010 and February 10, 2018. Nested encounters (encounters that happened during other encounters) were removed. Encounters from the same patient that happened within twelve hours were merged into a single encounter.
To approximate real-world deployment in our model evaluation, patients with inpatient encounters before Jan 1, 2017 were used for training, while those with inpatient encounters on or after Jan 1, 2017 were randomly split 50-50 into validation and test sets. For patients with multiple inpatient visits, only one randomly selected visit was kept in the validation and test sets, while all visits were preserved in the training set. To further safeguard the integrity of the test data and avoid potential data leakage, we removed from the training set all patients that are included in the test set. The prevalence of positive cases (inpatient discharge in the next day) is around 18% in all split data sets (Table 1).
|Data set||Time Period||Number of visits||Number of patients||Days in hospital|
|Training||Jan 1, 2010 - Jan 1, 2017||6,997,831||83,797||7,081,628|
|Validation||Jan 1, 2017 - Feb 10, 2018||852,554||7,212||859,766|
|Testing||Jan 1, 2017 - Feb 10, 2018||876,572||7,211||883,783|
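The cohort split described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the `(patient_id, admit_date)` visit schema, the function name `split_cohort`, and the seed are assumptions made for illustration. Following the text, only test-set patients are removed from the training set; whether validation patients were also removed is not stated, so the sketch leaves them in.

```python
import random
from datetime import date

CUTOFF = date(2017, 1, 1)  # split date stated in the text


def split_cohort(visits, seed=0):
    """Temporal split with patient-level leakage removal.

    visits: list of (patient_id, admit_date) tuples (hypothetical schema).
    Returns (train, validation, test) lists of visits.
    """
    rng = random.Random(seed)

    # Patients with any visit on or after the cutoff form the held-out pool.
    late_by_patient = {}
    for pid, admit in visits:
        if admit >= CUTOFF:
            late_by_patient.setdefault(pid, []).append((pid, admit))

    # Random 50-50 split of held-out patients into validation and test.
    held_out_ids = sorted(late_by_patient)
    rng.shuffle(held_out_ids)
    half = len(held_out_ids) // 2
    val_ids = held_out_ids[:half]
    test_ids = held_out_ids[half:]

    # Keep one randomly chosen visit per held-out patient.
    validation = [rng.choice(late_by_patient[p]) for p in val_ids]
    test = [rng.choice(late_by_patient[p]) for p in test_ids]

    # Training keeps every pre-cutoff visit except those of test patients.
    test_set = set(test_ids)
    train = [(p, d) for p, d in visits if d < CUTOFF and p not in test_set]
    return train, validation, test
```

Splitting by patient rather than by visit is what prevents the same person from contributing to both model fitting and final evaluation.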
We formulated a supervised learning problem in which the task is to map health care data for a given patient collected in the EHR prior to a specific day to the probability of the patient being discharged in the next 24 hours.
4.1 Patient Representation
4.1.1 Data Elements
We used diagnosis codes, procedure codes, medication codes, lab codes, and the corresponding lab results (categorized as normal, abnormal, or panic) from encounter records, procedure records, medication orders, and lab result records respectively. Diagnosis and procedure codes were ICD-9 (Centers for Disease Control and Prevention, 2013) and CPT (American Medical Association, 2007) codes respectively. Codes with fewer than 100 occurrences in the training set were excluded. Patient age, gender, race, ethnicity, insurance type, and whether the visit was for surgery were also included as background information.
4.1.2 Fixed Length Patient Representations
Many machine learning models, such as logistic regression and random forests, require a fixed length vector of input data. For such models we must summarize the longitudinal medical history of each patient into such a fixed length vector. For each day of an inpatient visit, the clinical codes and lab results from the day before were counted and considered recent data, while those that occurred 2 days to 6 months before were aggregated and considered historical data. Historical events (diagnosis codes, procedure codes, medication codes, lab codes, and lab results) were aggregated by summing up occurrences over the entire historical record, discarding their temporal ordering (Fig. 1). The two sets of past data, together with the patient demographics, were used as input features for models requiring fixed length representations.
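As an illustration of the counting scheme above, the following minimal sketch builds the recent and historical count vectors for one prediction day. The `(event_date, code)` event schema, the function name, the example codes, and the use of 183 days to stand in for "6 months" are all assumptions; demographics are omitted for brevity.

```python
from collections import Counter
from datetime import date


def fixed_length_features(events, prediction_day, vocabulary, history_days=183):
    """Summarize a patient's record as [recent counts] + [historical counts].

    events: list of (event_date, code) pairs (hypothetical schema).
    vocabulary: ordered list of retained codes; other codes are ignored,
        mirroring the exclusion of rare codes.
    history_days: assumed length of the 6-month look-back window.
    """
    recent, historical = Counter(), Counter()
    for event_date, code in events:
        age = (prediction_day - event_date).days
        if age == 1:                      # the day before the prediction day
            recent[code] += 1
        elif 2 <= age <= history_days:    # 2 days to about 6 months before
            historical[code] += 1
    # Counter returns 0 for codes that never occurred in a window.
    return [recent[c] for c in vocabulary] + [historical[c] for c in vocabulary]


vocab = ["ICD9:428.0", "CPT:71020"]
events = [
    (date(2017, 3, 9), "ICD9:428.0"),
    (date(2017, 3, 9), "ICD9:428.0"),
    (date(2017, 1, 15), "CPT:71020"),
    (date(2016, 1, 1), "CPT:71020"),   # older than the window, dropped
]
features = fixed_length_features(events, date(2017, 3, 10), vocab)
```

The vector length is always twice the vocabulary size, regardless of how long the patient's record is, which is what models such as logistic regression and random forests require.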
4.1.3 Data Inputs for Recurrent Neural Networks
In contrast, recurrent neural nets are able to handle variable length inputs, and in particular, variable length longitudinal medical records. For each patient, the same set of features described above was grouped by day and fed into the RNN as an ordered sequence, representing the timeline of that patient’s interaction with the hospital (Fig. 2). For more efficient learning updates, patients with similar total length of stay were batched together during training.
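Batching patients with similar sequence lengths reduces the amount of padding each batch needs. A minimal sketch of this idea follows; the padding token `PAD`, the function name, and the sort-then-chunk strategy are assumptions, not a description of the authors' data loader.

```python
PAD = None  # hypothetical padding token for missing days


def batch_by_length(sequences, batch_size):
    """Group sequences of similar length and pad each batch to its own maximum.

    sequences: list of per-patient lists, one element per day.
    Returns a list of (original_indices, padded_sequences) tuples.
    """
    # Sort patient indices by sequence length so neighbours are similar.
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        max_len = max(len(sequences[i]) for i in idx)
        padded = [sequences[i] + [PAD] * (max_len - len(sequences[i])) for i in idx]
        batches.append((idx, padded))
    return batches


batches = batch_by_length([[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]], batch_size=2)
```

In this toy input only two padding entries are needed in total, whereas padding all four sequences to the global maximum length would need six.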
4.2.1 Random Forest
Random Forests pool the responses of an ensemble of decision tree classifiers, each trained on a bootstrapped version of the original dataset. After hyperparameter tuning using the validation set, we obtained a Random Forest with 2,000 classification trees. All classification trees were grown until all leaves contained fewer than two samples. Random Forests were constructed using the scikit-learn library (version 0.19.0) in Python (version 3.6.1).
4.2.2 Gradient Boosting
Gradient Boosting Machines (GBM) are another form of decision tree ensemble, but they build the ensemble by iterative functional gradient descent. There are many implementations of gradient boosting, each of which uses various heuristics for regularization and computational efficiency. In this study, we used two implementations: scikit-learn’s standard gradient boosted classifier and XGBoost. For the former, we used 500 component trees, a sub-sampling fraction of 0.8, and considered 482 features at each split. Each classification tree was grown to a maximum depth of 50 or a minimum of 3 samples per leaf node, and we used a learning rate of 0.1. For XGBoost, we used an ensemble of 2000 classifiers, with a minimum of 2 samples per leaf node and a learning rate of 0.3.
4.2.3 Recurrent Neural Nets
We used a recurrent neural net comprising one recurrent layer of 64 Gated Recurrent Unit (GRU) (Cho et al., 2014) hidden units. Diagnosis codes, procedure codes, medication codes, lab codes, and encounter types were first passed through an embedding layer. We experimented with two different approaches for embedding: (i) embedding each feature separately as a vector of size 25, and (ii) dividing the features into 2 groups based on "internal" factors (diagnosis, lab) and "external" factors (medication, encounter) affecting the patient’s condition, and embedding each group as a vector of size 50. The outputs of the embedding layer at each time step were averaged and then concatenated with demographics and visit features before being fed as inputs into the recurrent layer. We used a fully connected layer before a sigmoid activation function at the output layer. The model was trained with the Adam optimizer for 20 epochs, with a learning rate of 0.003 and a dropout probability of 0.2 across the network. No weight decay was used. We also used auxiliary targets, i.e., multitask learning, to improve learning. At each time step, we updated gradients based on the cross entropy loss from one of three prediction tasks: (i) whether the patient is discharged in the next 24 hours, (ii) whether the patient was an inpatient at the current time step, and (iii) whether the patient would be an inpatient at the next time step. The last 2 auxiliary tasks used data from all patients instead of just inpatients. They were chosen because they were easy to integrate into the existing data processing and loading pipeline, and we hypothesized that they would help generate more robust representations that correlate with 24-hour discharge.
5.1 Evaluation Metrics
The primary metric used for model selection is the Area Under the Receiver Operating Characteristic Curve (AUROC), a measure of a model’s ability to discriminate between patients who will be discharged and those who will not, across classification thresholds. We also compared models by the Area Under the Precision-Recall Curve (AUPRC). For the best-performing model, we evaluated its calibration by calculating the Brier score, which measures how close the predicted probabilities are to the true probabilities. This is estimated by calculating the concordance between the empirical probability of discharge and the model’s predicted probability within bins of predicted probabilities.
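To make these metrics concrete, the sketch below computes AUROC, the Brier score, and a binned calibration table in plain Python. It uses the standard textbook definitions (AUROC as the probability that a random positive outranks a random negative, Brier score as the mean squared difference between predicted probability and the 0/1 outcome); the function names and the choice of equal-width bins are assumptions, and the pairwise AUROC is only practical for small samples.

```python
def auroc(labels, scores):
    """Probability that a random positive scores above a random negative.

    Ties count as one half. Quadratic in sample size, for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(labels, scores):
    """Mean squared difference between predicted probability and outcome."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


def calibration_bins(labels, scores, n_bins=10):
    """Per non-empty bin: (mean predicted probability, observed rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        # Equal-width bins on [0, 1]; a score of exactly 1.0 goes in the last bin.
        bins[min(int(s * n_bins), n_bins - 1)].append((y, s))
    table = []
    for members in bins:
        if members:
            mean_pred = sum(s for _, s in members) / len(members)
            observed = sum(y for y, _ in members) / len(members)
            table.append((mean_pred, observed, len(members)))
    return table


labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print(auroc(labels, scores))        # 0.75
print(brier_score(labels, scores))
print(calibration_bins(labels, scores, n_bins=2))
```

A well calibrated model has mean predicted probability close to the observed rate in every bin, which is what a calibration plot displays.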
5.2 Prediction Results on All Patients
Results for all models are shown in Table 2. The models achieved comparable performance, with XGBoost achieving the best AUROC (0.85 vs 0.84 for the second best model) and AUPRC (0.53 vs 0.50 for the second best model). All models were well calibrated, with Brier scores ranging from 0.11 down to 0.102 for XGBoost. Fig. 5 shows the calibration plot for the XGBoost model. Given a classification threshold of 0.5, the probability estimates were conservative for all the positive discharge predictions. Thus, the imperfect calibration would not affect the ranking of patient priority and should not pose a concern during clinical deployment.
5.3 Prediction Results on Service Lines
We worked with clinicians from SHC to identify specific service lines where discharge protocols are standardized and straightforward, and thus where our model’s predictions are most actionable. The XGBoost model performance for these service lines is shown in Fig. 3 and Fig. 4. We observe that the predictions in Orthopaedic Surgery and Neurosurgery have the highest AUROC and AUPRC respectively.
5.4 Analysis of Marginal Utility Gains
In this section, we estimate whether our best model provides any benefit in terms of marginal expected utility. We fix the prevalence of 24-hour discharge at 18% and make the following assumptions about the utility of false positives and true positives. For false positives, we assume that a reasonable range for the cost of a false positive relative to a true negative is -$10 to -$100. For true positives, we assume that the benefit of a true positive relative to a false negative is in the range $250 to $2500, with the upper bound set to approximately the average cost of an inpatient day in California. Given these assumptions, we can calculate the expected utility of our best model under four scenarios, one for each combination of utility differences. Fig. 6 shows expected utility as a function of decision threshold under these four scenarios. Under the most optimistic scenarios, C and D, in which the benefit of a true positive is large ($2500), the expected utility is positive and large at all thresholds. Even under the most pessimistic scenario, B, in which the cost of a false positive is high (-$100) and the benefit of a true positive is low ($250), this analysis indicates that we can set our decision threshold to a value that achieves a positive marginal utility.
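One way to read this analysis is as a per-prediction expected utility relative to flagging no one, in which true negatives and false negatives both contribute zero. Under that assumed formulation, the expected utility at a threshold is prevalence × TPR × benefit + (1 − prevalence) × FPR × cost. The sketch below implements this reading; the function name and the toy data are hypothetical, and it should not be taken as the exact computation behind Fig. 6.

```python
def expected_utility(labels, scores, threshold, benefit_tp, cost_fp, prevalence=0.18):
    """Expected utility per prediction, relative to flagging nobody.

    benefit_tp: utility of a true positive relative to a false negative (> 0).
    cost_fp: utility of a false positive relative to a true negative (< 0).
    prevalence: fixed at 18% as in the text.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    tpr = sum(s >= threshold for s in pos) / len(pos)  # sensitivity
    fpr = sum(s >= threshold for s in neg) / len(neg)  # 1 - specificity
    return prevalence * tpr * benefit_tp + (1 - prevalence) * fpr * cost_fp


# The four combinations of the utility bounds stated in the text.
scenarios = [(250, -10), (250, -100), (2500, -10), (2500, -100)]

# Toy labels and scores, for illustration only.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.2, 0.1]
for benefit, cost in scenarios:
    curve = [expected_utility(toy_labels, toy_scores, t / 10, benefit, cost)
             for t in range(11)]
    print(benefit, cost, max(curve))
```

Sweeping the threshold in this way yields one utility curve per scenario, and the threshold at which a curve peaks is a candidate operating point for that scenario.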
In this work, we compare machine learning models for predicting which inpatients will be discharged within 24 hours. Such models may be used to increase the rate of timely discharge by prioritizing these patients for discharge-related services and other interventions that help ensure timely discharge. The best performing model, XGBoost, achieved an AUROC of 0.85 and was well calibrated. A preliminary analysis of the utility gains possible with this model suggests that it could, with appropriate downstream actions, lead to a gain in expected utility. Note that our models were developed and evaluated on data from a single medical center, and we thus cannot claim that such results are possible elsewhere. We also acknowledge that our cost-benefit analysis relies on possibly unrealistic assumptions. These limitations notwithstanding, we believe these results are encouraging and suggest that predictive models for 24-hour discharge have the potential to benefit the health care system. In future work, we will elicit estimates of the relevant utilities from domain experts and perform simulations in order to obtain better estimates of the utility of our models and to suggest optimal decision thresholds prior to a prospective trial at Stanford Hospital.
- Adler-Milstein et al. (2015) Julia Adler-Milstein, Catherine M DesRoches, Peter Kralovec, Gregory Foster, Chantal Worzala, Dustin Charles, Talisha Searcy, and Ashish K Jha. Electronic health record adoption in us hospitals: progress continues, but challenges persist. Health affairs, 34(12):2174–2180, 2015.
- AHA (2016) AHA. Hospital adjusted expenses per inpatient day, 2016. URL https://www.kff.org/health-costs/state-indicator/expenses-per-inpatient-day/.
- Association (2007) American Medical Association. Current procedural terminology: CPT. American Medical Association, 2007.
- Barnes et al. (2015) Sean Barnes, Eric Hamrock, Matthew Toerper, Sauleh Siddiqui, and Scott Levin. Real-time prediction of inpatient length of stay for discharge prioritization. Journal of the American Medical Informatics Association, 23(e1):e2–e10, 2015.
- Chaou et al. (2017) Chung-Hsien Chaou, Hsiu-Hsi Chen, Shu-Hui Chang, Petrus Tang, Shin-Liang Pan, Amy Ming-Fang Yen, and Te-Fa Chiu. Predicting length of stay among patients discharged from the emergency department—using an accelerated failure time model. PloS one, 12(1):e0165756, 2017.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
- Colwell (2014) Janet Colwell. Length of stay: Timing it right. Strategies for achieving efficient, high-quality care. ACP Hospitalist, 2014.
- Centers for Disease Control and Prevention (2013) Centers for Disease Control, Prevention, et al. International classification of diseases, ninth revision, clinical modification (icd-9-cm). Atlanta, Georgia, USA. Available on: http://www.cdc.gov/nchs/icd/icd9cm.htm, 2013.
- Gulshan et al. (2016) Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402–2410, 2016.
- Hachesu et al. (2013) Peyman Rezaei Hachesu, Maryam Ahmadi, Somayyeh Alizadeh, and Farahnaz Sadoughi. Use of data mining techniques to determine and predict length of stay of cardiac patients. Healthcare informatics research, 19(2):121–129, 2013.
- Harutyunyan et al. (2017) Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.
- J. Graham Atkinson (2014) D.Phil. J. Graham Atkinson. The relationship between length of stay and the probability of incurring a hospital complication: a two-way interaction. 2014.
- Kaboli et al. (2012) Peter J Kaboli, Jorge T Go, Jason Hockenberry, Justin M Glasgow, Skyler R Johnson, Gary E Rosenthal, Michael P Jones, and Mary Vaughan-Sarrazin. Associations between reduced hospital length of stay and 30-day readmission rate and mortality: 14-year experience in 129 veterans affairs hospitals. Annals of internal medicine, 157(12):837–845, 2012.
- Liu et al. (2018) J. Liu, Z. Zhang, and N. Razavian. Deep EHR: Chronic Disease Prediction Using Medical Notes. ArXiv e-prints, August 2018.
- Lowe et al. (2009) Henry J Lowe, Todd A Ferris, Penni M Hernandez, and Susan C Weber. Stride–an integrated standards-based translational research informatics platform. In AMIA Annual Symposium Proceedings, volume 2009, page 391. American Medical Informatics Association, 2009.
- Nguyen et al. (2017) Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. Deepr: A convolutional net for medical records. IEEE journal of biomedical and health informatics, 21(1):22–30, 2017.
- Rajkomar et al. (2018) Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1(1):18, 2018.
- Suresh et al. (2018) Harini Suresh, Jen J Gong, and John Guttag. Learning tasks for multitask learning: Heterogenous patient populations in the icu. arXiv preprint arXiv:1806.02878, 2018.
- Wang et al. (2014) Xiang Wang, Fei Wang, and Jianying Hu. A multi-task learning framework for joint disease risk prediction and comorbidity discovery. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 220–225. IEEE, 2014.
- Yakovlev et al. (2018) Alexey Yakovlev, Oleg Metsker, Sergey Kovalchuk, and Ekaterina Bologova. Prediction of in-hospital mortality and length of stay in acute coronary syndrome patients using machine-learning methods. Journal of the American College of Cardiology, 71(11):A242, 2018.