Unifying Cardiovascular Modelling with Deep Reinforcement Learning for Uncertainty Aware Control of Sepsis Treatment

01/21/2021 ∙ by Thesath Nanayakkara, et al. ∙ 40

Sepsis is the leading cause of mortality in the the ICU, responsible for 6 of all hospitalizations and 35 there is no universally agreed upon strategy for vasopressor and fluid administration. It has also been observed that different patients respond differently to treatment, highlighting the need for individualized treatment. Vasopressors and fluids are administrated with specific effects to cardiovascular physiology in mind and medical research has suggested that physiologic, hemodynamically guided, approaches to treatment. Thus we propose a novel approach, exploiting and unifying complementary strengths of Mathematical Modelling, Deep Learning, Reinforcement Learning and Uncertainty Quantification, to learn individualized, safe, and uncertainty aware treatment strategies. We first infer patient-specific, dynamic cardiovascular states using a novel physiology-driven recurrent neural network trained in an unsupervised manner. This information, along with a learned low dimensional representation of the patient's lab history and observable data, is then used to derive value distributions using Batch Distributional Reinforcement Learning. Moreover in a safety critical domain it is essential to know what our agent does and does not know, for this we also quantity the model uncertainty associated with each patient state and action, and propose a general framework for uncertainty aware, interpretable treatment policies. This framework can be tweaked easily, to reflect a clinician's own confidence of of the framework, and can be easily modified to factor in human expert opinion, whenever it's accessible. Using representative patients and a validation cohort, we show that our method has learned physiologically interpretable generalizable policies.



There are no comments yet.


page 10

page 12

page 13

page 14

page 18

page 25

Code Repositories


Reinforcement Learning based 'Learning' of Dynamic Sepsis Treatment Strategies

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Sepsis is an overwhelming and extreme response to infection which can result in tissue damage, organ damage and death. Sepsis has been described as the leading cause of mortality in U.S. hospitals [liu2014hospital]. Moreover sepsis has been attributed to approximately 6 % of all hospitalizations and 35% of all in-hospital deaths in U.S [rhee2017incidence].

Reinforcement Learning (RL) [sutton1998rli]

is a sub-field of machine learning and control engineering dealing with sequential decision making problems. RL, especially when used in conjunction with deep neural network-based function approximation (Deep Reinforcement Learning/DRL) has achieved super-human performance in various domains

[mnih2015human] [silver2016mastering], [fuchs2020super]. It also has many desirable properties for healthcare problems, including applications to personalized medicine [liu2020reinforcement], [yu2019reinforcement]. But there remains considerable challenges in applications to practical medical decision making.

Herein, we identify major challenges in applying RL algorithms directly to find optimal treatment for sepsis and leverage distinct strengths of mechanistic physiological models, deep learning and RL to propose an uncertainty aware, RL based decision support system, which addresses each of the challenges.

Consistent with previous work [komorowski2018artificial], [raghu2017deep], [peng2018improving], we focus on vasopressor and IV fluid treatments, which whilst regarded important, studies have shown significant variation among patient responses to them [lazuar2019precision],[douglas2020fluid] [rech2011use]. Further there is little agreement among medical researchers on best practices that guide fluid or vasopressor administration beyond initial resuscitation. Both have been shown to be associated with negative effects on some patients. [waechter2014interaction]

Vasopressors and fluids are intended to counteract hypovolemia (abnormally low extracellular fluid), sepsis induced vasodilation (dilatation of blood vessels) vasoplegia (decreased response to compensatory mechanisms that increase vascular tone in normal physiological states) and physiological disturbances observed in sepsis. For instance it has been observed that vasodilatation of systemic resistance vessels in severe sepsis can decrease by up to 75% [young2004heart]. Recent recommendations [marik2016rational] have argued for a hemodynamically-guided fluid resuscitation strategies for patients with severe sepsis and septic shock.

Therefore it is clear that accurate cardiovascular states should provide crucial information for deciding on proper treatment. For this purpose we introduce a novel inference structure which is in essence a denoising recurrent autoencoder

[Goodfellow-et-al-2016], implicitly regularized by constraining the decoder to be a physiologic cardiovascular model. This dynamically learns a patient specific, robust cardiovascular state representation, in an unsupervised manner.

Although Deep Reinforcement Learning (DRL) has achieved super-human performance in the online setting (where the agent could interact with environment and collect more data whilst learning), our problem belongs to the subset of Batch Reinforcement Learning [lange2012batch] (Batch RL), where we do not have access to the real-time environment and all the learning is carried out using a retrospective fixed dataset. It is also well known that DRL could perform quite poorly when it is trained using a fixed dataset, and that its generalizability depends on the training data used [fujimoto2019off]. Deep neural networks in general also have the tendency to yield unreliable and sometimes blatantly wrong results for data far from the training distribution. Therefore, we claim that it is essential to address model uncertainty and quantify what the model does know and what it doesn’t. For this purpose we propose a method to quantify model uncertainty 222Here, and throughout this work we use the term ’model uncertainty’ to mean the uncertainty associated with neural networks used for RL. This should be not be confused with the model-based vs model-free RL distinction, because (once we have inferred latent states), our approach qualifies as ’model-free’. Uncertainty Quantification literature also uses the term epistemic uncertainty for model uncertainty , and we then use this to propose uncertainty aware treatment strategies, which are more suited for safety critical domains such as healthcare.

In summary we identify three significant challenges in directly applying DRL to compute optimal, personalized treatment strategies.

  • Partial Observability : Vasopressor and fluid therapy are administrated with specific cardiovascular goals in mind, however cardiovascular states cannot be observed, and despite the richness of data collected in the ICU, the mapping between patient states and clinical observables is often ambiguous, arguably a major contributor to variation in clinical practice.

  • Model Uncertainty: Deep neural networks are usually used as black box models and the learning of function approximating networks dependent on the training data. Thus, in situations where the extent and the variety of the training data cannot possibly cover the feasible range of situations (physiologic states and actions), as is the case in complex clinical situations such as sepsis, it is critical to quantify what the model does not know.

  • Problems associated with Batch RL : When the learning is done using a fixed retrospective dataset, it has been shown that the learned policies could be sub-optimal [fujimoto2019off].

Our work extends previous work in applying DRL for sepsis treatment, addressing each of the challenges above.

  • Partial Observability : We propose a general approach to combine physiologic modelling with deep networks to learn patient specific physiological states in an unsupervised manner, and use this framework to infer cardiovascular states of ICU patients. We further encode the entire patient laboratory history using a denoising, stacked, recurrent autoencoder architecture. This representation learning is expected to provide valuable hidden information to further characterize states which can be exploited by our agent.

  • Model Uncertainty Quantification : We associate an uncertainty measure for each state-action pair which quantifies model uncertainty and the randomness introduced by training process.

  • Problems Associated with Batch RL: We use a distributional approach to RL. This approach has shown to be achieve superior performance in Batch RL [agarwal2020optimistic]. Poor results in Batch RL is usually attributed to the fact that some state-action pairs are sparsely represented in the training dataset. To counter this, we use a discrete set of actions, all of which are observed frequently in the training dataset.

Figure 1: a : Proposed decision making process : We use the compete patient history, which includes, vitals, scores, and labs, and previous treatment, to infer hidden states. These would all combine to make the state . Our trained agent, takes this state and outputs value distributions for each treatment, it’s own uncertainty, and an approximate clinician’s policy. We then factor in all 3 to propose uncertainty-aware treatment strategies. b :Complete physiology-driven autoencoder network structure

Patient history is sequentially encoded using three neural networks. A patient encoder computes initial cardiovascular state estimates using patient characteristics, a recurrent neural network (RNN) encodes the past history of vitals and scores, up to and including the current time point, and a transition network which takes the previous cardiovascular state, action and the history representation to output new cardiovascular state estimates.

c :The electrical analog of the cardiovascular model This provides a lumped representation of the resistive and elastic properties of the entire arterial circulation using just two elements, a resistance R and a capacitance C. This model is used to derive algebraic equations relating R, C, stroke volume (SV), filling time (T), to heart rate (F) and pressures. These equations define the decoder of the physiology-driven autoencoder.

2 Background

2.1 Reinforcement Learning

Reinforcement Learning [sutton1998rli]

is a general framework for optimizing sequential decision making. In its standard form, a (discounted) Markov Decision Process (MDP), consisting of a 5-tuple (

,,,,) is the framework considered. Where and are state and action spaces. is a (possibly stochastic) reward function, denotes the unknown environment dynamics, which specifies the distribution of the next state , given the state-action pair and is a discount rate applied to rewards.

In the partially observed setting there is a distinction between the observations, usually denoted as , and the true state

, and the environment dynamics include the conditional probability density

. This yields the probability density of a particular observation, given the true underlying state. This extends the MDP formalism to that of Partially Observed Markov Decision Process (POMDP).

A policy is (a possibly stochastic) mapping from to .

The agent aims to compute the policy which maximizes the expected future reward , where the expected value is with respect to the policy and the unknown environment dynamics .

The value for a state , is defined to be the expected future discounted rewards, by following policy , starting from the state .


The function is the expected future reward resulting from choosing action , and then continuing the policy .


Central to most value-based RL algorithms is the Bellman equation [bellman1965dynamic],


and the Bellman optimality equation,


(where is the optimal function, and denotes the the random next state)

Most modern DRL algorithms use parameterized neural network function approximators, to approximate the function, and learn using 3 or 4

as a loss function.

Distributional Reinforcement Learning

Distributional Reinforcement Learning [bellemare2017distributional][pmlr-v84-rowland18a][barth2018distributed], tries to learn (an approximation of) the entire return distribution associated with a state-action pair. It has been shown [agarwal2020optimistic] that distributional reinforcement learning has achieved superior performance in Batch Reinforcement Learning.

More formally, the return associated with a policy

now is considered as a random variable

, and the function is the expectation of . Following the notation from [bellemare2017distributional], the distributional Bellman equation can be written down as,


Where indicates distributional equivalence, is a random variable tuple representing next state, action.

We use the categorical distributional reinforcement learning algorithm, proposed by [bellemare2017distributional], where the value distribution is approximated by a discrete distribution with equally spaced support.

Imitation Learning

Unlike in Reinforcement Learning where the agent tries to find the optimal policy, in Imitation Learning the goal is to successfully imitate an expert.

We use Imitation Learning (more specifically behavior cloning) to learn a clinician policy which is then used, when recommending an action. Further details of this is included under methods. (section 4.5)

2.2 Unsupervised Representation Learning

Representation learning is described as learning representations of the data that make it easier to extract useful information. [bengio2013representation]

. Most of the success of modern machine learning and neural networks can be attributed learning effective representations of data. One common way of learning useful representations is to use outputs of hidden layers of a neural network trained for one, possibly easy, auxiliary task for another task. This is the path we follow, when we encode the entire lab history of a patient, using a denoising autoencoder. Our physiology-driven autoencoder is itself a representation learner. However the representations have physiological meanings.

We argue that in Batch RL, where obtaining new data is impossible, learning useful representations can be the key to learning effective and generalizable RL policies.

2.3 Uncertainty Quantification

Uncertainty Quantification (UQ) [book] is omnipresent in computational sciences, applied mathematics, engineering and machine learning. As the name suggests, UQ is the science and methods of identifying and quantifying uncertainty in wide range of computational, statistical and engineering domains.

Uncertainty is usually categorized into two categories: aleatoric uncertainty, the inherent randomness in the environment, and epistemic or model uncertainty, which reflects how uncertain a model is of its own results. It is the later that we will be focusing on this work.

3 Related Work

Reinforcement Learning for Sepsis treatment :

RL has been used for various healthcare applications. [yu2019reinforcement] and [liu2020reinforcement] provide comprehensive surveys of these applications of RL in healthcare and critical care applications respectively.

Regarding using vasopressor and intravenous fluids for treatment of sepsis, Komorowski et al. [komorowski2018artificial] used a discrete state representation created by clustering patient physiological readouts, and a 25 dimensional discrete action space to compute optimal treatment strategies using dynamic programming based methods. This work was both confirmed and criticized (e.g. Jeter [jeter2019does]). This approach was extended by the authors to alternative reward schemes more in line with current medical decision making, and continuous state spaces. Raghu et al. [raghu2017deep] used DRL (with a Duelling DDQN [wang2016dueling] algorithm) and a continuous state representation. Peng et al. [peng2018improving] considers partial observability. This approach learns a representation of the patient history using a recurrent autoencoder, and uses a mixture of experts approach combing kernel based RL, and Deep RL. The idea presented here, of combining two different policies, has some similarities with the framework we introduce in section 4.5 to factor in human behavior likelihood and uncertainty. Recent work [li2019optimizing] has focused on considering partial observability and continuous action.

Uncertainty aware Reinforcement Learning :

Various UQ methods to identify, quantify and sometimes reduce, uncertainty in machine learning have been proposed in recent years. Abdar et al [abdar2020review] reviews these methods and investigates their applications. Most of these methods use one or more of Bayesian Neural Networks (BNN), Concrete Dropout (CD), and Deep Ensembles (DE) [caldeira2020deeply].

In RL, model uncertainty has been considered previously for various safety critical learning tasks. Kahn et al. considers uncertainty aware navigation of mobile robots [kahn2017uncertainty]. They use a model-based RL algorithm based on a predictive model comprising of bootstrapped neural networks using dropout. Lotjens et al. [lotjens2019safe] uses an ensemble of collision prediction networks, bootstrapping and Monte-Carlo Dropout to provide uncertainty estimates for safe RL for pedestrian collision avoidance. Bootstrapping and Deep Q Learning were also used in combination towards efficient exploration in online learning [osband2016deep]. These authors suggest an ensemble of Q networks, which then approximate a distribution over the Q values.

Our approach belongs to the class of Deep Ensembles and is similar to previous work in computational aspects. Yet, we aim to answer a slightly different question. As we are interested in Batch RL, we wish to capture the uncertainty resulting from using a limited, fixed subset of the data distribution, the stochastic gradient based learning process, and uncertainty of the output of the neural networks.

Thus we approach the problem from a pure uncertainty quantification point of view. We do not change the training algorithm. Rather, we formalize the concept of a state, action conditioned uncertainty associated with our problem and Batch RL in general. We then carry out an uncertainty quantification step which computes a Monte-Carlo estimate of uncertainty using bootstrapping. This is explained in more detail under methods.(section 4.4)

Cardiovascular Modelling :

There is a vast amount of rich literature, on mathematical models being used to model cardiovascular physiology. The mechanistic model we use (explained in detail under methods) is used in [bighamian2014prediction], and [bighamian2013analytic] amongst other work. The overall idea of combining mechanistic models with neural networks was partially inspired by clinical bias and a recent manuscript by Raissi et al. titled "Physics informed neural networks" [raissi2019physics]

, Which uses deep networks to solve forward and inverse problems involving nonlinear partial differential equations.

4 Methods

4.1 Data sources and preprocessing

Our cohort consisted of adult patients ( who satisfied the Sepsis 3 [johnson2018comparative] criteria from the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC-III v1.4) database [mimiciii], [mimiciiidata]. We excluded patients with more than 25 % missing values after creating hourly trajectories, and patients with no weight measurements recorded. Our starting point of trajectories is ICU admission.

We further excluded patients who got discharged from the ICU but ended up dying a few days or weeks later at the hospital as we don’t have access to their patient data after ICU release, and treating final ICU data as a terminal state would damage generalizability.

Actions were selected by considering hourly total volume of fluids (adjusted for tonicity), and norepinephrine equivalent hourly dose (mcg/kg) for vasopressors. In computing the equivalent rates of each treatment, we followed the exact same queries as Komorowski et al [komorowski2018artificial]. When different fluids were administrated, we summed up the total fluid intake within the hour, and discretized the resulting distribution. For vasopressors, we considered the maximum norepinephrine equivalent rate administrated within the hour to infer the hourly dose. We used 0.15 mcg/kg/min norepinephrine equivalent rate, and 500 ml for fluids, as the 1,2 cutoff when discretizing.333These were chosen, considering the mean, median of non zero rates and medical knowledge, We also observe that due to the low dimensional action space, there is flexibility in choosing the cutoffs. A separate 0 action for each was added to denote no treatment.

Missing vitals and lab values were imputed using a last value carried forward scheme, as long as missingness remained less than 25% of values. A detailed description on extracting, cleaning and implementation specific processing as well as additional cohort details are included in the appendices.

4.2 Models

4.2.1 Physiology-driven Autoencoder

Autoencoders are a type of neural networks which learn a useful latent, typically lower-dimensional representation of input data, while assessing the fidelity of this representation by minimizing data reconstruction error. Our autoencoder architecture provides an implicit regularization by constraining the latent states to have physiological meaning, and the decoder to be a fixed physiologic model described in the next section. We further use a denoising scheme by randomly zeroing out input with a probability of 10-25% , when feeding into the network. This random corruption forces the network to take the whole patient trajectory (prior to the current time point) and previous treatment into account when producing it’s output, because it prevents the network from memorizing the current observation. In essence we ask the inference network to predict observable blood pressures and the heart rate using corrupted versions of itself, by first projecting it into the cardiovascular latent state, and then decoding that to reconstruct.

Figure 1

(b), shows the complete architecture of our inference network. As shown in the figure, the encoder is comprised of three neural networks, a patient encoder which computes initial hidden state estimates, a Gated recurrent Unit (GRU)

[cho2014learning] based recurrent neural network to encode the past history of vitals and scores up to and including the current time point, and a transition network which takes the previous state, action and the history representation to output new cardiovascular state estimates. We train this structure end-to-end by minimizing the reconstruction loss, using stochastic gradient-based optimization. Appendices provide a detailed description of model and architecture hyper-parameters, and training details.

Cardiovascular Model

The cardiovascular model, is based on a two-element Windkessel model illustrated using the electrical analog in Figure 1 (c). This model provides a lumped representation of the resistive and elastic properties of the entire arterial circulation using just two elements, a resistance and a capacitance , which represent the systemic vascular resistance (SVR), and the elastance properties of the entire systemic circulation. Despite it’s simplicity, this model has been previously used to predict hemodynamic responses to vasopressors [bighamian2013analytic] and as an estimator of cardiac output and SVR [bighamian2014prediction].

The differential equation representing this model is:


were represents the volume of blood in the arterial system. As explained in [bighamian2013analytic], over the interval (where is the filling time of the arterial system) we can write as , where stands for Stroke Volume, the volume of blood ejected from the heart in a heartbeat. When the system is integrated over the interval we obtain the following expressions for , i.e., the systolic,diastolic, mean arterial pressures, respectively,


is the filling time and is the heart rate, which is determined by . This system of algebraic equations is used for the decoder of our autoencoder. Since heart rate can it self be affected by vasopressors and fluids, we added heart rate () as an additional cardiovascular state despite it being observable.

Therefore we have a multivariate function , represented by the equations above, and the trivial relationship (Despite the obvious relationship we used both and , for ease of training and stability.) As stated previously, to prevent it from just using the current observations, we use a denoising scheme for training. This ensures at a fixed time, the model cannot memorize the current observation and learn to invert , since there is nonzero probability of corruption. Thus it has to learn to factor in the history and the treatments when determining the cardiovascular states. Once is inferred, the cardiac output (CO), can be computed as .

Since is not one to one, typically not all states are identifiable. To arrive at a better approximation we used the latent space to only model deviations from fixed baselines. We also posit that identifiable combinations of states when trained with a denoising scheme should provide important cardiovascular representations in the POMDP setting.

4.2.2 Denoising GRU autoencoder for representing Lab history

We use another recurrent autoencoder to represent patient lab history, motivated by the fact that labs are recorded only once every 12 hours. Forward filling the same observation for 12 time points, is more than likely sub-optimal, and the patterns of change in lab history can be helpful in learning a more faithful representation. Thus we use a denoising GRU autoencoder constructed by stacking three multi-layer GRU networks on top of each other, with decreasing number of nodes in each layer, the last 10 dimensional hidden layer was used as our representation. This architecture is motivated by architectures used in speech recognition [chan2015listen].

This model was also trained by corrupting the input, where each data-point was zeroed with a probability of up to 50%. (The rate was gradually increased from 0 to 50%). As with the previous autoencoder, this provides an extra form of regularization, and forces the learned representation to encode the entire history.

Model architecture and training details and presented in the appendix.

4.2.3 Imitation Learner

We use a standard multi-layer neural network as our imitation learner. This model is trained using stochastic gradient-based optimization by minimizing the negative log-likelihood loss, between the predicted action and the observed clinician action, with added regularization to prevent overfitting.

We do mention that there are many other options that could be used as a imitation learner, including nearest neighbor-based method as in [peng2018improving].

4.3 POMDP Formulation


A state is represented by 41 dimensional real-valued vector.

  • Demographics: Age, Gender, Weight.

  • Vitals: Heart Rate, Systolic Blood Pressure, Diastolic Blood Pressure, Mean Arterial Blood Pressure, Temperature, SpO2, Respiratory Rate.

  • Scores: 24 hour based scores of, SOFA, Liver, Renal, CNS, Cardiovascular

  • Labs: Anion Gap, Bicarbonate, Creatinine, Chloride, Glucose, Hematocrit, Hemoglobin, Platelet, Potassium, Sodium, BUN, WBC.

  • Latent States: Cardiovascular states and 10 dimensional lab history representation.


To ensure each action has a considerable representation in the dataset, we discretize vasopressor and fluid administrations into 3 bins, instead of 5 as in previous work [raghu2017deep], [komorowski2018artificial] [peng2018improving]. Where 0 indicates no treatment.

This results in 9 dimensional action space.

Timestep :

1 hour

Rewards :

We use the reward structure that suggested by Raghu et. al [raghu2017deep], with a minor modification. Since lactate was very sparse amongst out cohort we only considered SOFA based intermediate rewards. Our terminal rewards were +-15 depending on survival, or ICU death. More specifically our reward is of the form,



whenever is not terminal, and if it is, , depending on survival or non-survival.

4.3.1 Training

We only mention important details of training the RL algorithms here. Representation Learning related training and implementations are detailed out in the appendices.

We train the Q networks using a weighted random sampling based experience replay, analogous to the prioritized experienced replay [schaul2015prioritized], which has resulted in superior performance in classical DRL domains such as atari games.

In particular for each batch, we sample our transitions from a multinomial distribution, with higher weights given to terminal death states, ’near death’ states (measured by time of eventual death), and terminal surviving states. We used a batch size of 100, and adjusted weights such that on average there is 1 surviving state, and 1 death state in each batch.

This does introduce bias, with respect to the existing transition dataset, however we argue that this would correspond to sampling transitions from a different data distribution, which is closer to the true patient transition distribution, we are interested in, as we are necessarily interested in reducing mortality.

A same weighting scheme was used for all ensemble networks, which are trained to estimate uncertainty.

4.4 Uncertainty

In this work, we are concerned with model uncertainty, and not the inherent environment uncertainty. Model uncertainty stems from the data used in training, neural network architectures, training algorithms and the process itself.

Inspired by statistical learning theory

[vapnik1992principles], and the associated structured risk minimization problem [Zhang2010StructuralRM], we define the model uncertainty,(conditioned on a state and a action , given our learning algorithm, and model architecture as :


Where denotes the unknown distribution of ICU patient transitions, we are ideally looking to learn our policies with respect to. is a random variable which characterizes the value distributions. (For the C51 algorithm this can be interpreted as an element in ). This is outputted by our networks trained on a dataset sampled from , for a given state action pair. This random variable is certainly dependent on the training data, and the randomness stems from the inherent stochastic gradient based optimizations, random weights initialization.

is divergence metric appropriate for comparing probability distributions. We use the Kullback–Leibler divergence

[kullback1951information] for .

4.4.1 Estimating the uncertainty measure

We find a Monte-Carlo estimate of this quantity, by bootstrapping 25 different datasets each substantially smaller than the full training dataset, and training identical distributional RL algorithms in each. This can be done efficiently due to the sample efficiency of distributional methods. And we can approximate either by the ensemble value distribution, or by the value distribution of the model trained on the full training dataset.

4.5 Uncertainty aware treatment

In previous work using distributional RL, the policy is selected with respect to the expected value. (i.e. The action which has the highest expected value among all value distributions, is selected).

In this section we describe a general framework for choosing actions, which factors in model uncertainty.

Notice that, as our reinforcement learning algorithm, learns (an approximation of) the optimal value distributions, making decisions by considering additional information does not violate any assumption underlying the learning process.

When suggesting safe treatment strategies, we want the proposed action to have high expected value, however we also would like our agent to flexible enough propose an action with less model uncertainty, if two actions have very close expected values, to each other. Another important factor to consider is how likely is an action to be taken by a human clinician, although due to the nature of our 9 dimensional discrete action set this issue is significantly mitigated, as opposed to a higher dimensional or continuous action space, we do want our propose action to be an action which is considered likely be a human clinician.

To satisfy all three goals, we propose a general framework for choosing actions, based on a action preference score, parameterized by three parameters. This general framework is flexible, yet simple, and the end user can choose the parameters to reflect their own expert knowledge, and confidence of the model. Although this framework does introduce additional parameters, and thus additional complexity, we propose suitable values for each parameter which could be used as a fully automated, uncertainty aware decision support system.

First let’s define , to be a human behavior likelihood score function. For our work we use the probabilities outputted by the imitation learning network described in section 4.2.3.

Given a state , we introduce associated with each action , given by;


where , , is the uncertainty associated with the state, action pair and is the behavior likelihood score.

penalizes, uncertainty, and a high forces the action to be close to a clinician action. We could recover the expected value criteria by setting and we could use the system as a pure behavior cloner, by setting . The first two terms of 10 belong to the interval . Therefore control how much away from the highest expected value/behavior likelihood score can the agent choose an action from.

In choosing , we also need to be cognizant of the scale of the scores used for behavior likelihood and the scale of rewards. For a neural network which outputs log probabilities or probabilities we could use either, for our work we chose the latter.

For our system, we propose . We do reiterate that , depends on the fact that we use probabilities (outputs of a softmax) as behavior scores, and thus much smaller in scale that the expected value. As , the restriction of staying close to a clinician action gets more and more relaxed.

5 Results

In presenting our results, for the most part we focus on analyzing RL results. However we briefly mention the reconstruction results of the physiology-driven autoencoder. The results for one particular validation patient trajectory are presented in Figure 2. As the figure indicates, model has been successful in reconstructing the observable outputs and their trends when the corruption probability was up to 25%. The results do falter when the corruption level is too high, illustrated by the output of our network when the trajectory was corrupted with a 50% probability and fed to the network. This was typical among all patient trajectories, both for training and validation datasets.

Figure 2: Reconstruction of a validation patient trajectory, using different levels of corruption using the physiology-driven autoencoder, We have used a lower width for higher corruption rates for better visibility.

In analysing our main work, we present results in three ways.

First as with any Machine Learning task it is essential to inquire about generalizability. We do this by investigating value distributions of validation cohort patients. We further provide heuristic explanations of our models’ results by showing the dependence on important input features, including identifiable cardiovascular states.

Then we consider individual, representative patients, (a) a survivor whose health deteriorated initially, and (b) a non-survivor from our cohort, and analyze the expected value trajectories, uncertainty and treatment strategies for these two patients.

Later we summarize global results, which include averaged treatment strategies and uncertainty patterns we could observe empirically. However we note that at it’s core our approach strives to extract and then use patient specific recurrent representations to learn personalized treatments. Therefore global analysis is unlikely to provide much insight into the intricacies that underline the decision making process. Further when analyzing the proposed treatment, it should also be noted that each action is proposed considering only the current, actual state. Therefore for a fixed patient trajectory at a fixed time, agent does not know what it has proposed previously, nor how it’s action would have impacted the patient state.

5.1 Value distributions and expected values

We first investigate if the value distributions learned are generalizable, and if they are in agreement with clinical knowledge. We further expect our networks to identify the mortality risk in advance, for patients who ended up dying at the ICU.

For this purpose, we consider value distributions outputted for each time-step, for all our validation patients. Then we group these distributions, by considering patients who died and patients who survived separately. We further group these distributions by time to eventual death or discharge and consider the average distributions. Figure 3 shows these plots for 48, 24, 12 and 1 hour from death or discharge.444For computational efficiency we only have considered a random subset of survivors, equal in size to the number of non survivors We reiterate that our network only sees the patient state at a given time, when it outputs the value distributions associated with the state. It is not provided with any information of the future, the aggregation with respect to the time to death is done purely for analysis.

Figure 3: Value distributions for validation patients averaged according to different times from death or discharge, Top Non Survivors, Bottom : Survivors
Figure 3: Value distributions for validation patients averaged according to different times from death or discharge, Top Non Survivors, Bottom : Survivors
Figure 4: Scatter plots of scaled features : Marker colors indicate if (Blue) or (Red)
Figure 5: Scatter plots of scaled features: Marker colors indicate the vasopressor treatment under expected value

From figure 3, we could observe a clear bi-modality for patients who died, even 48 hours in advance, and as the patient state gets closer to eventual death, left peaks increase in area while the right peaks decrease, which results in the shift of the expected values (centers of mass) of each distribution to the left. This agrees with patient’s health deteriorating, and at a certain point death becomes more likely an eventuality. We could also notice that the distribution associated with no treatment has a larger left peak than others, highlighting that for these patient states lack of treatment for even one hour could result in faster death.

In contrast for survivors, peaks of the distributions maintain the same areas and their expected values remain closer to the right limit. This again agrees with our expectation that when a patient is less than 2 days from eventual discharge, on average their health state should be significantly healthier than non-survivors. And value distributions between different actions are barely noticeable, which is reasonable considering a change in treatment for one hour may not significantly alter the patient state, when a patient is at a less risky state.

We, next investigate the dependence of features and inferred states on the value distributions, and thus the optimal value function , computed as . Figure 4 shows three scatter plots, between systolic blood pressure vs SOFA score, SpO2 vs Heart Rate and RC vs (CO)R. In each scatter plot we mark points corresponding to in blue and in red. (5 as a threshold was chosen arbitrarily, and we could observe similar results for any reasonable threshold) We can see a perceptible separation of red and blue in each plot, and in each case this separation is consistent with medical knowledge. For example low systolic blood pressure (hypotension) is associated with sepsis shock and higher mortality risk and high SOFA scores indicate organ failure. Similarly as explained before, systemtic vascular resistance, and cardiac output both can be dramatically reduced close to death in septic patients. Therefore we could notice that our agent has learnt to discriminate between low risk (in the ICU context) and high risk septic patients states in an explainable manner. This separation is more impressive when we consider that 89% of our patient states have the property .

Further, Figure 5 shows the same scatter plots, but now the marker color indicate the suggested vasopressor treatment under expected value. (i.e. the treatment corresponding to the distribution which has the highest expected value). These observations again agree with general goal of vasopressor treatment. Low systolic blood pressure, low resistance, and cardiac output are all used as psychologic indications for vasopressor administration.

Next, we consider representative patients from our cohort, and analyze the expected values of all 9 distributions, and model uncertainty for the patient trajectory. Top figure of Figure 6, shows the evolution of expected values of all nine actions each for a patient (ICU ID: 263969) who had died, with time. This was typical among all patients who have died, initially there’s less variability among the expected values, especially the non-vasopressor actions are very close in expected value, but as the patient’s health deteriorates the variation becomes more drastic, and there is a clear preferences towards vasopressor based actions. We anecdotally observe that the difference between vasopressor based treatment and non vasopressor based treatment becomes more significant whenever systolic blood pressure becomes low and when the identifiable cardiovascular combination (CO)R becomes low. The marker size of each plot shows, how much our agent is uncertain of it’s on results. (distribution associated with each state and action). We could also observe that the model is less certain of it’s results when the patient’s health starts deteriorating. This can be attributed to the fact that these states are uncommon in the dataset, and for septic patients the underlying cause behind health deterioration could be very different. However when the patient is within 2 hours away from death, the uncertainty does go down.

Figure 6: Top : Expected value evolution for a patient who died in the ICU. The marker size indicates the uncertainty associated with a particular action. Also shown are the standardized values of SOFA score, Systolic blood pressure, and the unidentifiable cardiovascular state (CO)R. x axis indicates the hours from ICU admission. Bottom : Expected value evolution for a patient who survived. Same comments as above
Figure 6: Top : Expected value evolution for a patient who died in the ICU. The marker size indicates the uncertainty associated with a particular action. Also shown are the standardized values of SOFA score, Systolic blood pressure, and the unidentifiable cardiovascular state (CO)R. x axis indicates the hours from ICU admission. Bottom : Expected value evolution for a patient who survived. Same comments as above

For comparison, the bottom figure of Figure 6 shows the evolution of expected values of a survivor (ICU ID: 279413 ). Here we can notice that the expected values take a downward slide at around 25 hours from admission, with the values associated with no treatment considerably lower. This coincides SOFA scores increasing, systolic blood pressure,(CO)R rapidly decreasing, clearly indicating that patient’s health has been deteriorating. However as SOFA scores improves (goes down), and pressure and (CO)R goes up, as expected, expected values do go up, and now the difference between expected values of each value distribution is considerably less. The uncertainty levels are also much lower.

The fact that expected values of different actions being close to each other at healthy patient states, can be explained by the RL framework. By equation 4, we can observe that is determined by the intermediate reward and the next state it transits to (because by definition we choose the optimal action at the next state ). Further our intermediate rewards are much smaller compared to terminal rewards. Therefore for a healthy patient state if the choice of action does not significantly change the next state in an hour, the functions would be close to each other. Another way to interpret this is that a potentially wrong action can be reversed by taking the correct action in the next hour if the patient is non critical, however if the patient is at a more critical state a wrong action can have more drastic consequences.

In the appendices, we also also present the evolution of expected values of 6 validation patients.

The results presented in this section indicate that our network, has learned generalizable value distributions, which in most cases identify the risk of death considerably before eventual death. And since our treatment strategies are based on the learnt value distributions, it we have confidence that the learned policies will also be generalizable.

5.2 Model uncertainty

In this section, we briefly mention results of uncertainty quantification.

In figure 11 (Appendix C) we provide radar plots of uncertainty measures associated with each action, for randomly generated patients, at various times throughout their ICU stay. The common pattern is that for most patients who have died, the model is less confident about its value distributions as they become closer to death. Uncertainty among each action varies from patient to patient. However for survivors this behavior is the exact opposite, as the agent is more confident of it’s results and becomes even more confident as the patient gets closer to discharge from the ICU.

Appendix C also presents summary statistics of uncertainty results for survivors and non-survivors.

As we mentioned previously and also detailed out in the appendix, this behavior is not surprising since, death states are relatively uncommon, and also there are a wide variety of ways a septic patient may face increased mortality risk. However for survivors, we do expect all of them to approach a healthy state as they approach eventual discharge.

Figure 7: Recommended actions under various combinations of , and , x axis :the time in hours from ICU admission,y axis: the discretized action a : Recommended actions under expected value and behavior cloning, b: =0.1 and is varied, c : and is varied, d : Clinician Action
Figure 8: Recommended actions under various combinations of , and , x axis :the time in hours from ICU admission,y axis: the discretized action a : Recommended actions under expected value and behavior cloning, b: =0.1 and is varied, c : and is varied, d : Clinician Action

5.3 Uncertainty aware treatment

In this sub-section we present treatment strategies for patients made by the model. As mentioned before one should be cognizant of the Markovian assumption of the agent (as with all RL agents) when interpreting the results.

First we will consider the two patients we presented in figure 6, and analyze the suggested treatment under various different parameter combinations. These results are presented in Figures 7 and 8.

First, panel (a) shows the treatments under expected value, and behavior cloning respectively.(corresponding to and , and and )

To facilitate better comparison, we fixed , and consider different levels of and . Panel (b) keeps , fixed and varies , whilst panel (c) keeping the first two fixed.

By considering Figure 7, we can notice that just as we discussed in the previous sub-section, as the patient’s health deteriorates, (evinced by the sharp fall in systolic blood pressure, (CO)R, and increase in SOFA scores shown in figure 6), vasopressors are recommended by all of the combinations.

Figure 8, whilst much harder to interpret, we could again notice, that around 20-30 hours, and 40-50 hours vasopressors are proposed, this does coincide with the steep downward slide of SBP and (CO)R.

For comparison with previous work in RL for sepsis treatment [raghu2017deep], we also provide heat-maps, of clinician and agent proposed treatments, overall, low SOFA (<5),medium SOFA ( SOFA ), and high SOFA (), and all non survivors. Figure one shows agent policies under ( and a pure expected value policy.

As expected , pushes the treatments closer, to the clinician’s recommendations.

As mentioned above however, these kind of global averaged treatment analysis may not reflect the intricacies underlying the decision recommending system, as our methods all are individualized.

Figure 9: L : Heat-plots for recommended actions, under ,, and Expected Values, Shown are cinician’s vs Agent for overall (orange), low sofa (green), medium sofa (blue), high score (purple) and non survivors (red).

5.3.1 A comment on Off Policy Evaluation

Off policy evaluation, is the quantitative or statistical evaluation of the value of a learnt policy, using another dataset, usually a validation cohort.

Although this is attractive in theory, most common unbiased off policy value estimates, depend on the behavior policy, which generated the data, being known. However this is not the case when the data was generated by human clinician’s actions. It could be argued that our imitation learner could be used as an approximate behavior policy, however it is trained to predict the likely clinician action, and unlikely to be a good approximator to be used as a behavior stochastic policy.

Further previous research have claimed at all off policy evaluation methods are unreliable in the sepsis management context [gottesman2018evaluating]. These problems are exacerbated when the policy which is to be evaluated is deterministic.

Due to the above reasons we do not consider statistical off policy evaluation in this work, however we note that developing off policy evaluation techniques suited for the critical care domain is a very important area of research to explore in the future.

6 Discussion

In this work, we present an inter-disciplinary approach which we believe takes an important and significant step towards both better

(both is terms of end goals and interpretability) vasopressor and fluid administration strategies, as well as automated medical decision support. We believe, the maximum benefit of Artificial Intelligence applied to medicine, can be realized when used in conjunction with traditional sciences and engineering, as well maximizing human expert knowledge.

We have shown that our methods, have learnt clinically interpretible value distributions, which can be computed sequentially using data routinely calculated at the ICU. These value distributions quantify the mortality risk associated with a septic patient, and thus can be used as risk score. Further we proposed a framework which enables using physiologic models in unison with Deep learning for physiologic representation learning. We anticipate that this framework can be build and improved to provide important information for clinical decision making.

We also discussed uncertainty quantification, and proposed a framework for a fully automated decision support system, which is aware of it’s own uncertainty and also factors in what it’s best guess is at a human action.

6.1 Future Work

There are other avenues we would like to explore in the future.

Model based RL with physiological models: Model based RL, aims to explicitly model the underlying environment and then use this information in various ways in figuring out control strategies. 666It could in theory be argued that our work itself is a model based and model free hybrid method. This paradigm provides a natural place to incorporate mechanistic models, which could potentially help both control and interpretability.

Reward Structure: This work our reward structure, was based on previous work. However, the reward is an essential component of any RL algorithm and is the only place where the agent can judge how good it’s actions were. This is potentially another place to include physiological knowledge. Ideally, we would want our reward structure to capture an accurate mortality risk, and an organ damage score, with each state.


7 Appendix A: Cohort Details

Our total patient cohort consists of 18472 patients, out of which 1828 were non-survivors.

% Female Mean Age Mean ICU Stay Total Population
Overall 42.33 % 66.05 7 days 15 hours 18472
Non-Survivors 42.67 % 68.8 9 days 13 hours 1828
Survivors 42.14 % 65.91 5 days 13 hours 16644
Table 1: Cohort Details

This resulted in an experience replay consisting a total of 2596604 transitions.

8 Appendix B: Neural Network architectures and Implementation Details

8.1 Physiology-driven Autoendcoder

Encoder :

  • Patient Encoder: Multi-layer feed-forward neural network, with 3 hidden layers with 64 nodes each, followed by exponential linear unit, (eLU) non-linearity applied element wise.

  • Transition: Multi-layer feed-forward neural network, with 8 hidden layers with 128 nodes each, followed by exponential linear unit,(eLU) non-linearity applied element wise.

  • RNN: Gated recurrent unit, based RNN, with 1 hidden layer, with 64 nodes.

We note that, as for inputs for the network specifically the RNN, we included all vitals, and sofa related scores, including the four dimensional observations, systolic blood pressure, diastolic blood pressure, mean blood pressure and heart rate.

For training, we used Adam [kingma2014adam]

, with a low learning rate (1e-5), the corruption is only introduced after the model has been trained for several epochs. Notice, that we feed hourly trajectory including all vitals, and scores for the RNN to learn it’s representation, but the loss is clearly minimized with respect to the pressures and heart rate.

We observe that using a low learning rate is essential for numerical stability.

8.2 Denoising Lab Autoencoder

This is comprised of three GRU networks stacked on top of each other.

  • Network 1 : 12 hidden units, with 512 nodes, outputs a 128 node vector.

  • Network 2 : 5 hidden units 128 nodes each, outputs a 10 dimensional vector.

  • Network 3: 3 hidden units, with 10 node each, the last of which is taken as our hidden lab representation.

We train this again using, Adam, corruption is gradually introduced starting from 0% to 50%. We use the network trained under 50% corrupted inputs, when inferring the hidden lab representation for RL.

We standardized all the labs before feeding into the network.

8.3 Imitation Learning

Muli-layer neural network with, 4 hidden layers : 3 with node size 512, and the last 256. All hidden (and input) layers are followed by a rectified linear unit (reLU) non-linearity.

Training was again using Adam, with a standard learning rate, and we minimized a negative log likelihood loss, which is standard in classification problems.

8.4 Bootstrapping and ensemble training

To learn each bootstrapped network, we first sampled from the all patients, to arrive at a bootstrapped patient list. Then we train the networks, for 3 epochs each, using a process identical to training the main RL algorithm.

8.5 Distributional Q learning

We use the standard C51 training algorithm as in [bellemare2017distributional]. Q network was a multi-layer neural network. Apart from the weighted sampling described in the main body of the text, training steps were all standard.

We use a target network, and update the target networks using polyak target updating with . (i.e. after every iteration/training step we set the target network weights to a linear combination of it’s own weights, weighted by (1-) an the Q network weights, weighted by ). This kind of target network is common amongst all deep Q learning, algorithms.

We summarize the hyper-parameters involved in table below.

Support size 51
Maximum value 18
Minimum value -18
Batch size 100
Number of iterations 51932
Optimizer Adam
Learning rate
Table 2: RL algorithm hyper-parameters

9 Appendix C: More Results

9.1 Representation Learning

The physiology-driven autoencoder was capable of almost perfectly reconstruction of the input when it was trained, and evaluated without corruption. The unnormalized error, on the 4 dimensional output was around 6.

With 10% corruption probability, the loss, (evaluated also with corruption) was around 45 and, with 25% corruption this went up to close to 60. We experimented with 50% corruption rates, but however results were far less impressive as indicated in the main text.

Figure 10: Expected Values of random validation patients, Top: Non-survivors, Bottom :Survivors, As with Figure 6 the blob size indicate the uncertainty
Figure 10: Expected Values of random validation patients, Top: Non-survivors, Bottom :Survivors, As with Figure 6 the blob size indicate the uncertainty

9.2 RL Results

In this section, we present further results of the distributional RL algorithm, and uncertainty quantification.

Figure 10 shows, expected value trajectories, of randomly selected validation patients. As we have mentioned previously these results indicate the generalizability of our value networks.

Next, we show results on model uncertainty. Table 3 presents average uncertainty, among all patient states, grouped by the training and validation datasets and whether the patient was a survivor or a non-survivor. As we mentioned in the main text, the uncertainty is much higher for non survivors than survivors. Further uncertainties for validation non-survivors are higher than training non-survivors. However for survivors the training and validation uncertainty are very similar on average. Figure 11 presents radar plots, of model uncertainty for various randomly drawn patients. As we can see, and in a sense as we could anticipate, there is significant variation, across patients and actions which the averaged results do not show. But in general we could see for patients who have died, the agent becomes less confident of it’s value distributions as they approach death. The action wise uncertainty however,does not exhibit an observable pattern. For survivors however, this pattern is the opposite, the agent becomes, more and more confident of it’s results as the patient, gets closer to eventual discharge.

Both Table 3 and Figure 11 agree with our expectations, because near death states, are relatively uncommon, and also there could be a lot of different ways a septic patient may have increased mortality risk. However for survivors, we do expect our agent to be confident of their survival, as their states should approach a healthy state.

Non-Survivors Training Survivors Training Non Survivors Validation Survivors Validation
Vaso 0 Fluids 0 0.1916 0.0887 0.3617 0.0861
Vaso 0 Fluids 1 0.1591 0.0855 0.3085 0.0894
Vaso 0 Fluids 2 0.1587 0.0846 0.3104 0.0879
Vaso 1 Fluids 0 0.1547 0.0820 0.3066 0.0850
Vaso 1 Fluids 0 0.1451 0.0815 0.2676 0.0847
Vaso 1 Fluids 0 0.1482 0.0827 0.2776 0.0850
Vaso 1 Fluids 0 0.1634 0.0839 0.3135 0.0853
Vaso 2 Fluids 0 0.1498 0.0831 0.2710 0.0860
Vaso 2 Fluids 0 0.1488 0.0808 0.2850 0.0832
Table 3: Mean Model Uncertainty for Survivors and Non-Survivors in training and validation datasets
Figure 11: Uncertainty at admission, 12 hours from death/discharge and 5 hours from death/discharge Top : Non survivors, Bottom : Survivors Note: To better present differences, we change the radius of each radar plot, so the radii are not uniform
Figure 11: Uncertainty at admission, 12 hours from death/discharge and 5 hours from death/discharge Top : Non survivors, Bottom : Survivors Note: To better present differences, we change the radius of each radar plot, so the radii are not uniform