Anesthesiologist-level forecasting of hypoxemia with only SpO2 data using deep learning

12/02/2017 ∙ by Gabriel Erion, et al. ∙ University of Washington

We use a deep learning model trained only on a patient's blood oxygenation data (measurable with an inexpensive fingertip sensor) to predict impending hypoxemia (low blood oxygen) more accurately than trained anesthesiologists with access to all the data recorded in a modern operating room. We also provide a simple way to visualize the reason why a patient's risk is low or high by assigning weight to the patient's past blood oxygen values. This work has the potential to provide cutting-edge clinical decision support in low-resource settings, where rates of surgical complication and death are substantially greater than in high-resource areas.




1 Introduction

Over 5 billion patients worldwide lack safe, affordable access to necessary surgical and anesthetic care (Meara et al., 2015). While anesthesia-related mortality has declined substantially since 1940 in high-Human Development Index (HDI) countries, it has actually increased in low-HDI countries. A recent meta-analysis showed two to three times greater anesthesia-related mortality in low-HDI countries, and in some major hospitals in Sub-Saharan Africa such mortality is up to 400 times greater (Pollach, 2013; Walker and Wilson, 2008; Cherian et al., 2010).

An important component of mortality and morbidity surrounding a surgery (perioperative mortality and morbidity) is hypoxemia, or low oxygen levels in the blood. Perioperative hypoxemia is common, and prolonged hypoxemia can cause serious adverse cardiac and neurological effects (Strachan and Noble, 2001). The risks posed by hypoxemia may be one reason that pulse oximeters (devices that measure blood oxygen) are promoted by nonprofit organizations such as Lifebox (Enright et al., 2016) as a way to improve surgical safety in low-HDI countries. Shortages of essential equipment (such as pulse oximeters) are indeed a major factor driving differences in surgical risk between low- and high-HDI countries. However, a lack of trained anesthesiologists is also a major cause that equipment alone cannot fix: one study found that 96 percent of anesthesia providers in Uganda are non-physicians (Cherian et al., 2010). This lack of training may be particularly important because delay in recognition and treatment of adverse events during surgery is an important component of perioperative mortality and morbidity (Stiegler and Tung, 2014).

Our paper addresses the dangers of delay in hypoxemia recognition by developing a system which can predict hypoxemia as well as or better than physician anesthesiologists, using pulse oximetry data alone. We believe this system can, at no additional cost, enable pulse oximeters currently being distributed in low-resource settings to provide advance warning of hypoxemia.

Related Work

Machine learning for clinical prediction is growing more popular, and several recent methods have achieved doctor-level prediction performance on certain tasks. Two recent papers showed that Inception-based convolutional neural networks performed as accurately as physicians in diagnosing images of possible skin cancer and diabetic retinopathy (Esteva et al., 2017; Gulshan et al., 2016). In non-image prediction, another paper classified heart arrhythmias as accurately as cardiologists by doing 1-dimensional convolutions over time-series electrocardiogram recordings (Rajpurkar et al., 2017). Finally, Lundberg et al. (2017) used all data gathered in the operating room during tens of thousands of surgeries to train a gradient boosting classifier that predicted perioperative hypoxemia more accurately than anesthesiologists. Inspired by this clinical task, and by the fact that the feature importance method of Lundberg et al. (2017) assigned by far the most importance to patient SpO2 (blood oxygen percentage), this paper attempts to outperform anesthesiologists using a subset of the same data consisting only of SpO2 values.

2 Operating-Room Hypoxemia Prediction

2.1 Data

Our data came from an academic medical center’s Anesthesia Information Management System (AIMS), which records all data measured in the operating room in real time during surgery. This includes demographic data (age, sex, height, weight), diagnosis and procedure codes, and free text in the medical record, as well as real-time measurements of vital signs, laboratory results, and drug doses as they are given. We trained on data from 57,173 cases, split into 8,088,523 training time points, and tuned our models using a 90-10 train-validation split. Time points from all cases were shuffled and treated as independent data points; investigating per-patient effects and dependency between time points could be a useful future goal. We compared all models’ performance on a set of 1,053,116 held-out testing time points from 7,569 separate cases. Finally, we compared the best-performing model against anesthesiologists’ predictions from a user study on an entirely separate set of 523 time points.

While other work with this system has used all available AIMS data, we limited our data at each time point to 60 SpO2 measurements, one per minute for the past hour of surgery. We imputed missing values and normalized each column to zero mean and unit variance. The label at each time point was one if hypoxemia, defined as a drop in SpO2 to 92 percent or lower, occurred in the next five minutes, and zero otherwise. Time points at which SpO2 had dropped below 95 percent in the past ten minutes were excluded. In the 523 anesthesiologist comparison time points, cases where SpO2 did not drop to 92 percent or lower in the next five minutes but did in the next ten were also excluded, so that doctors were only tested on clearly positive or negative cases. This shift in data distribution may explain the difference in the LSTM's AU-ROC between the model comparison (Figure 1, left) and the anesthesiologist performance comparison (Figure 1, right).
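The labeling rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and constant names are hypothetical, the series is assumed to be one already-imputed SpO2 value per minute, time points with less than 60 minutes of history are simply skipped, and the "past ten minutes" is taken to include the current minute. The source does not specify these details.

```python
from typing import List, Tuple

HYPOXEMIA_THRESHOLD = 92   # SpO2 percent; at or below this counts as hypoxemia
RECENT_LOW_THRESHOLD = 95  # skip points with SpO2 below this in the past 10 min
WINDOW = 60                # minutes of SpO2 history per time point
HORIZON = 5                # minutes ahead in which hypoxemia is predicted
LOOKBACK = 10              # minutes checked by the exclusion rule (assumed to include "now")


def make_examples(spo2: List[float]) -> List[Tuple[List[float], int]]:
    """Turn one case's per-minute SpO2 series into (window, label) pairs."""
    examples = []
    # t indexes the current minute; we need WINDOW values up to and
    # including t, and HORIZON values after t.
    for t in range(WINDOW - 1, len(spo2) - HORIZON):
        window = spo2[t - WINDOW + 1 : t + 1]
        if min(window[-LOOKBACK:]) < RECENT_LOW_THRESHOLD:
            continue  # SpO2 was already low recently: point not considered
        future = spo2[t + 1 : t + 1 + HORIZON]
        label = int(min(future) <= HYPOXEMIA_THRESHOLD)
        examples.append((window, label))
    return examples
```

Normalization to zero mean and unit variance would then be applied per column (per minute offset) across all windows before training.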

2.2 Models

We considered three complex models for hypoxemia prediction, as well as base rate, ARIMA(1,0,0) and logistic regression predictors (Figure 1, left), and compared the best-performing model against doctors (Figure 1, right). Because the base rate of hypoxemia in this data was very low (1.7 percent), we decided to compare models using area under the precision-recall curve (AU-PRC), which best distinguishes model performance with imbalanced classes. We trained the gradient boosted model on the full data as one batch, but fed data to the neural nets in balanced batches to improve convergence.
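One way to feed a network balanced batches when positives make up under 2 percent of the data is to sample each class separately. The sketch below is an assumption about how this could be done, not a description of the original training code: the 50/50 ratio, sampling with replacement, and the name `balanced_batches` are all hypothetical.

```python
import random
from typing import Iterator, List, Sequence


def balanced_batches(
    labels: Sequence[int], batch_size: int, n_batches: int, seed: int = 0
) -> Iterator[List[int]]:
    """Yield lists of example indices, half positive and half negative."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    half = batch_size // 2
    for _ in range(n_batches):
        # Positives are rare, so both classes are drawn with replacement.
        batch = rng.choices(pos, k=half) + rng.choices(neg, k=batch_size - half)
        rng.shuffle(batch)
        yield batch
```

A model trained this way sees a 50 percent positive rate rather than the true 1.7 percent, so its raw output probabilities will tend to be higher than those of a model trained on the unbalanced data.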

Gradient boosted trees As in (Lundberg et al., 2017), we used the fast XGBoost implementation of gradient boosting (Chen and Guestrin, 2016). In tuning on the validation set, we found that 4400 trees of depth 6 gave the best performance. Lowering the learning rate from 0.1 to 0.01 resulted in longer training time but improved performance. The AU-PRC on the test data was 0.22642.

Convolutional neural network We also considered a 1-dimensional convolutional network modeled on (Rajpurkar et al., 2017). The first convolutional layer consisted of a convolution followed by batch normalization and a ReLU activation. Five more layers were added, with the structure (batch normalization, ReLU, dropout, convolution). The final layer had the structure (batch normalization, ReLU, dense, sigmoid). Every convolution had kernel size 6. The first two convolutions used 64 filters; all subsequent ones used 128. Unlike in (Rajpurkar et al., 2017), residual connections did not improve performance on the validation data (likely due to the shallow network) and were not used. A shallower net seems justified as (Rajpurkar et al., 2017) used 6000 measurements per time point while we have only 60. The network was trained with the Adam optimizer (Kingma and Ba, 2015). Final AU-PRC on the test data was 0.22202, lower than both the XGBoost and LSTM models, which is somewhat surprising given the recent popularity of convolutional networks for time series classification. It is possible that a deeper network or more parameter tuning was needed.

Long short-term memory network

Our final model was a long short-term memory network (LSTM), whose recurrent structure is a natural fit for time series data. The LSTM consisted of a 200-node LSTM layer with recurrent dropout on top of the input data; this layer output a sequence which was fed to another 200-node LSTM layer with recurrent dropout. This second layer produced a single output, which was fed through a single dense node to a sigmoid output. There was dropout between all layers, and the network was optimized with RMSProp (Tieleman and Hinton, 2012). The final AU-PRC on test data was 0.23142.
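As a back-of-the-envelope check on model size, the parameter count of the stacked LSTM described above can be worked out by hand. The paper does not report this number; the calculation below assumes one input feature (SpO2) per time step, the standard four-gate LSTM parameterization with biases and no peephole connections, and a final dense node reading the 200-dimensional output of the second layer. The helper names are hypothetical.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Weights in a standard LSTM layer: four gates, each with an input
    kernel, a recurrent kernel, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    """Weights plus biases in a fully connected layer."""
    return input_dim * units + units


layer1 = lstm_params(input_dim=1, units=200)    # reads one SpO2 value per step
layer2 = lstm_params(input_dim=200, units=200)  # reads layer 1's output sequence
head = dense_params(input_dim=200, units=1)     # single node before the sigmoid
total = layer1 + layer2 + head

print(layer1, layer2, head, total)  # 161600 320800 201 482601
```

Under these assumptions the network has roughly half a million weights, which is small relative to the roughly 8 million training time points.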

Model                  AU-PRC      AU-ROC
Base rate              0.017313    0.49938
ARIMA(1,0,0)           0.022153    0.50529
Logistic regression    0.12918     0.74703
Convolutional net      0.22202     0.86134
Gradient boosting      0.22641     0.86363
LSTM                   0.23139     0.86571

Figure 1: Left (model comparison and selection): Performance of all models on 1,053,116 test points, measured as area under precision-recall and ROC curves. Right (anesthesiologist performance comparison): Comparison of deep learning LSTM model and anesthesiologist predictions of hypoxemia on 523 time points. LSTM was given access only to 60 minutes of oxygen data, while anesthesiologists were given access to all data recorded in the OR as plots and notes in a web interface. The performance difference is significant with P<0.0001.
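The base-rate row illustrates why AU-PRC is informative here: an uninformative predictor scores an AU-PRC close to the positive rate (about 0.017), while its AU-ROC sits near 0.5. The following minimal sketch computes a step-wise AU-PRC (average precision) in plain Python and checks this on synthetic data. It is not the evaluation code used for the figure; the function names are hypothetical and ties between scores are not given any special treatment.

```python
import random
from typing import Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("no positive examples")
    # Rank examples from highest to lowest predicted risk.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_pos += 1
            total += true_pos / rank  # precision at the moment this positive is recalled
    return total / n_pos


def random_baseline_ap(prevalence: float, n: int, seed: int = 0) -> float:
    """Average precision of purely random scores on synthetic labels."""
    rng = random.Random(seed)
    y = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    s = [rng.random() for _ in range(n)]
    return average_precision(y, s)
```

Calling `random_baseline_ap(0.017, 20000)` should return a value near 0.017, far below the 0.22 to 0.23 reached by the three complex models.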

2.3 Comparison with Doctors

We used the receiver operating characteristic (ROC) curve, a widespread standard for medical diagnostics, to compare the best-performing model, the LSTM, against anesthesiologists. Because a precision-recall curve had been used to choose the best-performing model, we verified that the ranking of models (LSTM>XGBoost>CNN) was the same under both AU-PRC and AU-ROC (Figure 1, left). The LSTM model had noticeably better AU-ROC (0.731) than doctors’ pooled AU-ROC of 0.659, with P<0.0001 calculated by bootstrap (Figure 1, right). The LSTM curve also dominates the doctors’ curve: at every false positive rate, it achieves a higher true positive rate.
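The exact bootstrap procedure is not described, so the following is only one plausible sketch: a paired bootstrap that resamples time points with replacement and counts how often the model fails to beat the comparison scores. It assumes both sets of predictions cover the same time points; `bootstrap_auroc_diff` and its one-sided P-value definition are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AU-ROC via the rank-sum statistic, with tied scores sharing a mid-rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks: List[float] = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auroc_diff(
    y: Sequence[int],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed AU-ROC difference a - b, one-sided bootstrap P-value)."""
    rng = random.Random(seed)
    n = len(y)
    observed = auroc(y, scores_a) - auroc(y, scores_b)
    not_better = 0
    done = 0
    while done < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if sum(yb) in (0, n):
            continue  # AU-ROC is undefined unless both classes are present
        diff = auroc(yb, [scores_a[i] for i in idx]) - auroc(yb, [scores_b[i] for i in idx])
        not_better += diff <= 0
        done += 1
    return observed, not_better / n_boot
```

With 2000 resamples the smallest nonzero P-value this sketch can report is 0.0005, so supporting a claim such as P<0.0001 would require at least 10,000 resamples.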

2.4 Model Interpretation

We agree with the observation in (Lundberg et al., 2017) that explaining the predictions of a clinical model is essential for doctors to trust and use the model. Unlike in (Lundberg et al., 2017), our model only uses 60 sequential values per prediction, so it is actually possible to visualize the entire data input and the importance of each feature. Recent methods have been developed that estimate feature importances for individual predictions in each of the models we use: Tree SHAP for XGBoost (Lundberg and Lee, 2017) and Integrated Gradients for LSTM/CNN (Sundararajan et al., 2017; Hiranuma, 2017). In Figure 2, we show both the data for a single case and the feature importances for each model at each time point. In general, the models behave as one would predict; most features have little contribution to risk but drops in oxygen have a large positive one. The closer a drop in oxygen is to the present, the larger its contribution to the risk. The convolutional network sometimes exhibits confusing behavior, handling slow increases in SpO2 by creating large periodic waves of risk contribution (see row 3, columns 1 and 2). This could be due to the fixed-size window of the convolutions trying to average out to a long term trend, or it could be due to instability from the fact that the convolutional model is the deepest. Some of this effect can occasionally be seen in the LSTM (row 4, column 2), though it appears overall more stable.
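To make the Integrated Gradients attribution concrete, the sketch below applies it to a toy stand-in model rather than to the LSTM or CNN themselves. The logistic risk function, its weights (which grow toward the present), the bias, and the all-zeros baseline (the mean after normalization) are all hypothetical choices made for illustration; the source does not state which baseline was used.

```python
import math
from typing import Callable, List, Sequence

N_MINUTES = 60
# Hypothetical weights: lower SpO2 raises risk, more so for recent minutes.
WEIGHTS = [-0.5 * (i + 1) / N_MINUTES for i in range(N_MINUTES)]
BIAS = -4.0


def risk(x: Sequence[float]) -> float:
    """Toy logistic risk model over 60 normalized SpO2 values."""
    z = BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))


def risk_gradient(x: Sequence[float]) -> List[float]:
    """Analytic gradient of the toy risk model with respect to its inputs."""
    p = risk(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 200,
) -> List[float]:
    """Midpoint-rule approximation of Integrated Gradients along the
    straight line from the baseline to the input."""
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Example input: SpO2 at its mean for 55 minutes, then two standard
# deviations below the mean for the last five minutes.
example = [0.0] * 55 + [-2.0] * 5
baseline = [0.0] * N_MINUTES
attributions = integrated_gradients(risk_gradient, example, baseline)
```

The attributions sum (up to numerical error) to `risk(example) - risk(baseline)`, which is the completeness property of Integrated Gradients; here the unchanged minutes receive zero attribution and the recent drop receives all of the increase in risk.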

Figure 2: Model explanations for all models. Each column is 60 minutes of SpO2 data from a test time point shown to doctors (60 minutes ago at left, current time at right). Top row is observed SpO2 data over 60 minutes. Subsequent rows are the importance of each minute’s SpO2 value for the XGBoost, CNN, and LSTM models. Predicted risk is shown above each model’s explanation plot. Note that XGBoost predicts lower risks than neural nets, as it was not fed data in balanced batches.

3 Discussion

We have presented a method that builds on previous work by giving advance warning of hypoxemic events using only easily-measured SpO2 data. The system outperforms anesthesiologists who have access to all the data recorded in a modern operating room (SpO2 plus other vital signs, demographics, medical record, drugs given, etc). Because the user study was done on computers, anesthesiologists did not have physical access to the patient; however, they still enjoyed a substantial data advantage over our model. The model’s success raises the interesting question of whether doctors might have been better able to focus on the most important variable and make more accurate predictions if given a more limited dataset, for example just SpO2. It is also worth noting that anesthesiologists do not train to be experts in predicting hypoxia but in clinical management of an unconscious, non-breathing patient (while radiologists, a common comparison for machine learning methods, do indeed train to be experts in diagnosis based on images). This does not detract from the clinical value of our problem; in fact, algorithms to supplement clinician judgment on ancillary tasks like hypoxemia prediction may free up attention for the many other jobs an anesthesiologist must do during surgery.

Another important contribution of this work is demonstrating that inexpensive sensors like pulse oximeters can have great predictive power in the operating room (building on the use by Rajpurkar et al. (2017) of a single-lead EKG sensor to classify arrhythmias). This implies two future research directions. First, other perioperative adverse events may be predictable with other simple sensors. Hypocapnia (low expired CO2, measured with an end-tidal CO2 sensor) is associated with longer ICU stays after surgery, while hypotension (low blood pressure, measured with a non-invasive blood pressure cuff) is associated with increased mortality (Jeremitsky et al., 2003). Predictive models for these end points would likely have substantial clinical value. Second, waveform sensors like oximeters (and ETCO2 capnography) record signals at 100 Hz or greater; AIMS stores a far lower-resolution signal. Access to the raw waveform signal would augment our data by a factor of 6000 (at 100 Hz, 6000 samples for each per-minute value) and almost certainly lead to more accurate predictions.

Finally, this work has important implications for clinical decision support in low-resource settings. While an AIMS system costs tens to hundreds of thousands of dollars (Ehrenfeld and Rehman, 2011), pulse oximeters distributed by nonprofit organizations cost as little as $250 (Enright et al., 2016) and may often be the only monitoring equipment available. Algorithms like ours have the potential to, at no extra cost, turn these oximeters into a source of decision support that is as effective as (or more effective than) a trained physician in anticipating adverse events. We hope that such a tool would contribute to reducing the dramatic disparity in perioperative risk faced in low-resource settings and help to make surgery safer worldwide.


  • Meara et al. (2015) John G Meara, Andrew JM Leather, Lars Hagander, Blake C Alkire, Nivaldo Alonso, Emmanuel A Ameh, Stephen W Bickler, Lesong Conteh, Anna J Dare, Justine Davies, et al. Global surgery 2030: evidence and solutions for achieving health, welfare, and economic development. The Lancet, 386(9993):569–624, 2015.
  • Pollach (2013) Gregor Pollach. Anaesthetic-related mortality in sub-Saharan Africa. The Lancet, 381(9862):199, 2013.
  • Walker and Wilson (2008) Isabeau A Walker and Iain H Wilson. Anaesthesia in developing countries—a risk for patients. The Lancet, 371(9617):968–969, 2008.
  • Cherian et al. (2010) Meena Cherian, Shelly Choo, Iain Wilson, Luc Noel, Mubashar Sheikh, Manuel Dayrit, and Steffen Groth. Building and retaining the neglected anaesthesia health workforce: is it crucial for health systems strengthening through primary health care? Bulletin of the World Health Organization, 88(8):637–639, 2010.
  • Strachan and Noble (2001) L Strachan and DW Noble. Hypoxia and surgical patients–prevention and treatment of an unnecessary cause of morbidity and mortality. Journal of the Royal College of Surgeons of Edinburgh, 46(5):297–302, 2001.
  • Enright et al. (2016) Angela Enright, Alan Merry, Isabeau Walker, and Iain Wilson. Lifebox: a global patient safety initiative. A&A Case Reports, 6(12):366–369, 2016.
  • Stiegler and Tung (2014) Marjorie Podraza Stiegler and Avery Tung. Cognitive processes in anesthesiology decision making. Anesthesiology: The Journal of the American Society of Anesthesiologists, 120(1):204–217, 2014.
  • Esteva et al. (2017) Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118, 2017.
  • Gulshan et al. (2016) Varun Gulshan, Lily Peng, Marc Coram, Martin C Stumpe, Derek Wu, Arunachalam Narayanaswamy, Subhashini Venugopalan, Kasumi Widner, Tom Madams, Jorge Cuadros, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402–2410, 2016.
  • Rajpurkar et al. (2017) Pranav Rajpurkar, Awni Y Hannun, Masoumeh Haghpanahi, Codie Bourn, and Andrew Y Ng. Cardiologist-level arrhythmia detection with convolutional neural networks. arXiv preprint arXiv:1707.01836, 2017.
  • Lundberg et al. (2017) Scott M Lundberg, Bala Nair, Monica S Vavilala, Mayumi Horibe, Michael J Eisses, Trevor Adams, David E Liston, Daniel King-Wai Low, Shu-Fang Newman, Jerry Kim, et al. Explainable machine learning predictions to help anesthesiologists prevent hypoxemia during surgery. bioRxiv, page 206540, 2017.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
  • Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015.
  • Tieleman and Hinton (2012) T. Tieleman and G. Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • Lundberg and Lee (2017) Scott M Lundberg and Su-In Lee. Consistent feature attribution for tree ensembles. arXiv preprint arXiv:1706.06060, 2017.
  • Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 3319–3328, 2017.
  • Hiranuma (2017) Naozumi Hiranuma. Integrated gradients, 2017.
  • Jeremitsky et al. (2003) Elan Jeremitsky, Laurel Omert, C Michael Dunham, Jack Protetch, and Aurelio Rodriguez. Harbingers of poor outcome the day after severe brain injury: hypothermia, hypoxia, and hypoperfusion. Journal of Trauma and Acute Care Surgery, 54(2):312–319, 2003.
  • Ehrenfeld and Rehman (2011) Jesse M Ehrenfeld and Mohamed A Rehman. Anesthesia information management systems: a review of functionality and installation considerations. Journal of clinical monitoring and computing, 25(1):71–79, 2011.