Introduction
Patients in the intensive care unit (ICU) are among the sickest in the hospital, and require many different types of interventions to control and respond to their unstable physiological conditions. For instance, antibiotics are given to control infections [1], and anticoagulants are given to dialysis patients to prevent thrombosis [2]. Patients with the highest acuity may be given more aggressive and invasive interventions such as mechanical ventilation [3] as well.
In this paper, we focus on decisions to give fluid bolus therapy [4] and vasopressors [5] when treating hypotension and shock. Hypotension is associated with higher overall morbidity and mortality across several populations, including populations with sepsis [6] and populations in the emergency department [7]. However, despite the importance of addressing this problem, decision making for hypotension management is not standardized, and treating these patients effectively is challenging. Although hypotension management has been studied extensively [8], the choice of bolus size and timing, as well as which vasopressor to use and in what dosing regimen, is not well understood.
Reinforcement learning (RL), a branch of machine learning focused on learning how to make a sequence of decisions toward some desired outcome [9], has the potential to help us use past data to assist with these decisions. Recent applications of RL to healthcare include managing sepsis [10], schizophrenia [11], mechanical ventilation [12], and heparin dosing [13]. However, as noted in [14], quantifying the quality of a proposed treatment policy is challenging. Observational data create hard limitations on the kinds of policies that can be credibly evaluated: one cannot evaluate policies that recommend treatments that were never or rarely performed, and even when the recommended treatments have support in the observed data, the value of different choices may be impossible to statistically differentiate.
Thus, instead of attempting to identify a single optimal treatment policy from observational data (which is often impossible), in this work we focus on identifying a collection of distinct, plausible policies. Having such a collection of options can provide insights into multiple versions of treatments that may be of similar efficacy, and it is also a step toward personalized recommendations, because it creates a space of reasonable treatment options. One way to think about this approach is to note that the variation that we see in clinician actions is likely to be safe, since patients are typically treated conservatively to avoid iatrogenic harm. Amid this variation, our goal is to identify a collection of treatment policies that are both distinct (that is, different from each other, so as to provide a choice of options) and likely (that is, not too far from current practice). To this end, we develop SODARL: Safely Optimized, Diverse, and Accurate Reinforcement Learning, a technique to identify a collection of plausible high-efficacy policies. By drawing potential treatment policies from the variation in current practice (that is, actions currently taken by clinicians), we ensure that our options are likely to be safe, or at least as safe as current practice.
Our results on a cohort of hypotensive ICU patients demonstrate that all three components of SODARL (Safety, Diversity, and Quality/Accuracy) are necessary. The distinct policies learned by SODARL achieve roughly the same estimated value as the observed clinician policy, and our qualitative results suggest that the different policies do indeed pick up on real underlying options for treatments.
Background
We will model the problem of hypotension management as a Markov Decision Process (MDP), a standard formalism in reinforcement learning [9]. An MDP is defined by a state space $\mathcal{S}$ that describes the current setting of the environment (e.g. clinical variables describing a patient's current physiological state), and an action space $\mathcal{A}$ of possible actions that can be taken (e.g. treatments to administer such as IV fluids or vasopressors). The Markov in MDP refers to the assumption of Markovianity in the state transition distribution. That is, we assume that at time $t$, the next state $s_{t+1}$ is determined solely from the current state $s_t$ and action $a_t$, i.e. $p(s_{t+1} \mid s_{0:t}, a_{0:t}) = p(s_{t+1} \mid s_t, a_t)$, where $s_{0:t}$ and $a_{0:t}$ refer to the complete history of previous states and actions. To complete the specification of the MDP, we define a discount factor $\gamma \in [0, 1]$ that balances the value of current vs. future rewards, along with a reward function $r(s, a)$ that assesses how good the actions being taken are. For instance, the reward function might take positive values for physiologically stable states that lead to improved patient outcomes, and take negative values for states leading to physiological instability and decompensation.
We refer to a decision making strategy as a policy $\pi$, and let $\pi(a \mid s)$ indicate the probability that action $a$ is taken when in state $s$. In this work we focus on stochastic policies, although it is also possible to learn deterministic policies where the same action is always taken from a given state. A trajectory is a sequence of states, actions and rewards received in an interaction with the environment: $\tau = (s_0, a_0, r_0, \ldots, s_T, a_T, r_T)$. We define the value of a policy as its expected sum of future discounted rewards:
$$V^{\pi} = \mathbb{E}_{\tau \sim p_{\pi}(\tau)} \left[ \sum_{t=0}^{T} \gamma^{t} r_t \right] \tag{1}$$
where $p_{\pi}(\tau)$ denotes the distribution over trajectories generated by following the policy $\pi$ and transitioning between states according to the distribution $p(s_{t+1} \mid s_t, a_t)$.
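As a small illustration of the value definition in eq. 1, the sketch below computes the discounted return of one reward sequence and the plain average over a batch of logged sequences. The function names are hypothetical and chosen only for this example; averaging logged returns estimates the value of whichever policy generated the data.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the rewards of a single trajectory."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def average_return(reward_sequences, gamma):
    """Monte Carlo estimate of the value of the policy that generated the data."""
    returns = [discounted_return(rewards, gamma) for rewards in reward_sequences]
    return sum(returns) / len(returns)
```

For example, with `gamma = 0.5` the reward sequence `[1, 1, 1]` has return 1 + 0.5 + 0.25.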
An optimal policy is one that achieves the highest possible value (eq. 1). The field of reinforcement learning (RL) provides a suite of tools for learning an optimal policy via interactions with the environment. That is, we typically do not have direct access to the transition distribution and must instead learn by trying actions and seeing their results (e.g. giving a treatment to a patient and observing the outcome). However, such experimentation is obviously both unethical and impractical in clinical domains, as unsafe actions may be recommended. The subfield of batch RL attempts to learn policies based on previously collected trajectories (e.g. from information in the electronic health record describing the clinical states and treatments given to patients).
A key question in batch RL is off-policy evaluation, that is, how to estimate the value of a proposed policy $\pi$ given only a collection of trajectories collected according to some other (potentially suboptimal) policy $\pi_b$. One class of methods for accomplishing this relies on importance sampling, a general technique for estimating properties of a distribution of interest (e.g. the distribution of rewards if we follow our policy $\pi$), given only samples generated from a different distribution (e.g. the distribution of rewards if we follow the clinician behavior policy, $\pi_b$). In this work, we will use a state-of-the-art estimator, Consistent Weighted Per-Decision Importance Sampling (CWPDIS) [15], to estimate the value of the policies we learn using a retrospective set of $N$ clinician behavior trajectories $\mathcal{D} = \{\tau_n\}_{n=1}^{N}$:
$$\hat{V}_{\text{CWPDIS}}(\pi) = \sum_{t=0}^{T} \gamma^{t} \, \frac{\sum_{n=1}^{N} \rho_{nt} \, r_{nt}}{\sum_{n=1}^{N} \rho_{nt}}, \qquad \rho_{nt} = \prod_{t'=0}^{t} \frac{\pi(a_{nt'} \mid s_{nt'})}{\pi_b(a_{nt'} \mid s_{nt'})} \tag{2}$$
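The following is a minimal sketch of a CWPDIS-style estimate, assuming the standard form of the estimator: at each time step, rewards are averaged across trajectories with weights equal to the cumulative ratio of evaluation-policy to behavior-policy action probabilities, and these weighted averages are then discounted and summed. The input layout (one `(p_eval, p_behavior, reward)` tuple per step) and the handling of trajectories that end early are assumptions made for this example.

```python
def cwpdis(trajectories, gamma):
    """CWPDIS-style value estimate.

    trajectories: list of trajectories, each a list of per-step tuples
        (p_eval, p_behavior, reward), where p_eval and p_behavior are the
        probabilities of the logged action under the evaluated policy and
        the behavior policy.
    """
    horizon = max(len(traj) for traj in trajectories)
    rho = [1.0] * len(trajectories)  # cumulative importance weight per trajectory
    value = 0.0
    for t in range(horizon):
        numerator = 0.0
        denominator = 0.0
        for n, traj in enumerate(trajectories):
            if t >= len(traj):
                continue  # assumption: finished trajectories contribute nothing
            p_eval, p_behavior, reward = traj[t]
            rho[n] *= p_eval / p_behavior
            numerator += rho[n] * reward
            denominator += rho[n]
        if denominator > 0.0:
            value += (gamma ** t) * numerator / denominator
    return value
```

When the evaluated policy equals the behavior policy, every weight is 1 and the estimate reduces to the discounted sum of per-step average rewards.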
The quality of the estimate in eq. 2 will depend on how many trajectories are effectively retained after reweighting by the importance weights $\rho_{nt}$ (the cumulative ratios of $\pi$ to $\pi_b$ action probabilities along trajectory $n$ up to time $t$), a quantity known as the effective sample size (ESS) [16]. Informally, the ESS gauges how many samples from the true distribution of interest would provide an estimator with similar quality. Even though the number of trajectories $N$ may be large, high variance in the distribution of the importance weights $\rho_{nt}$ may cause the resulting estimate to be very unreliable, placing non-negligible weight on only a few trajectories. For instance, if $N$ is large but the ESS is only some much smaller number $M$, then our estimate using $N$ trajectories from $\pi_b$ to estimate the value of our policy $\pi$ will perform about as well as using only $M$ trajectories actually collected according to $\pi$.
We focus on the ESS of the CWPDIS estimator at time $T$, the end of the trajectory:
$$\text{ESS} = \frac{\left( \sum_{n=1}^{N} \rho_{nT} \right)^{2}}{\sum_{n=1}^{N} \rho_{nT}^{2}} \tag{3}$$
If all the importance weights are equal, it is easy to see that the ESS is simply $N$. In this work, we use the ESS as an indicator of the reliability of the estimate of a proposed policy's value. If the ESS is low, then even if the value estimate is high, the proposed policy is not trustworthy and may not actually be high-quality, because that high value estimate was effectively measured from only a few trajectories.
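To make the ESS concrete, here is a short sketch using the common definition (squared sum of the final importance weights divided by the sum of their squares). The function name and the convention of returning 0 when all weights are zero are choices made for this example.

```python
def effective_sample_size(final_weights):
    """ESS of a set of end-of-trajectory importance weights."""
    total = sum(final_weights)
    total_of_squares = sum(w * w for w in final_weights)
    if total_of_squares == 0.0:
        return 0.0  # convention for this sketch: no trajectory has any weight
    return total * total / total_of_squares
```

With equal weights the ESS equals the number of trajectories, while a single dominant weight drives it toward 1, which is the failure mode described above.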
Related Work
The batch or off-policy RL literature generally focuses on safe and efficient learning using off-policy evaluation techniques [17, 18, 14]. In this work, the notion of safety we use is implied by the assumption that clinicians generally perform well and very rarely take unsafe actions. This is somewhat distinct from other concepts of safety in the area of safe RL, such as those comparing the bounds on different off-policy evaluation metrics (e.g. [19]). Moreover, there has been limited exploration of learning collections of distinct agents within the off-policy RL community.
Within RL more broadly, most prior work involves notions of diversity that are not aligned with the kind of efficient exploration-amongst-safe-options setting we are interested in. [20, 21] use notions of diversity that do not directly compare action probabilities, but rather compare features such as neural network parameter differences or the entropy in a single agent's action probabilities. More related is [22], who learn a policy over options and can train multiple options (in an off-policy manner) using a rollout from a single option. Although the distinct options can give rise to agents with distinct behaviors, there is no explicit diversity component in the objective, and it is unclear how to summarize the kinds of distinct trajectories that are possible and what combination of options leads to the most interesting policies.
Our motivation for seeking a collection of distinct policies in the reinforcement learning setting is aligned most closely with the end goal of [23]: presenting a broad set of representative solutions as a tool for hypothesis generation and to discover specific directions of interest for further inquiry. Their primary application focus is on malware detection, and they first learn a set of good policies followed by a post-hoc clustering step to identify diverse candidates, whereas we learn diverse policies via a joint optimization. [24] learn collections of distinct policies using a divergence metric between the distributions of trajectories induced by policies. However, their work focuses largely on on-policy settings where a simulator of the environment is available and the collection of policies is learned sequentially rather than jointly. In our case, we jointly optimize to find a collection of distinct, plausible alternatives from already-collected observational data, which can inform clinicians of multiple hypotheses for treatment strategies.
Finally, there exist several papers using data to inform decisions in the ICU. [10] and [25] also use RL to learn fluid and vasopressor treatment strategies, but specifically in septic patients, and their focus is on optimality and not safety and diversity. [8] focuses on predicting response to fluid bolus therapy, as the treatment does not always work. There are also many papers that attempt to predict onset of various kinds of interventions (e.g. [26]) and onset of hypotension events (e.g. [27, 28]). All of these works try to identify one policy, rather than providing reasonable alternatives.
Cohort and Data Processing
We draw our trajectories from the publicly available MIMIC-III database [29]. The full database contains static and dynamic information for nearly 60,000 patients treated in the critical care units of Beth Israel Deaconess Medical Center in Boston between 2001 and 2012. We use version 1.4 of MIMIC-III, released in September 2016.
From the database, we considered adults (at least age 18) with MetaVision data (only patients for whom we could reliably and easily extract both start and end times for interventions). We then removed patients with very short ICU stays of less than 12 hours. For all other ICU stays, we only consider the first 72 hours within the ICU admission, as patients who are in the ICU for extended periods of time often receive different care than the initial treatments in the crucial first few days after admission. We required at least three distinct measurements of mean arterial pressure (MAP) below 65 mmHg, indicating probable hypotension, and used only the first ICU admission if a single patient had multiple admissions. The ICU stays that remained after this filtering form our cohort. We split the dataset into a training portion (part of which we hold out as a validation set for hyperparameter selection) and a held-out test set for final evaluation. See Table 1 for baseline characteristics and demographics of the selected cohort.

Characteristic  Summary Statistic
Age, mean years (25/50/75% quantiles)  67.3 (57.5, 69.3, 80.5)
Female (%)  47.8%
Surgical ICU (%)  48.7%
Nonwhite (%)  23.9%
Emergency Admission (%)  81.5%
Urgent Admission (%)  1.2%
Hospital Admit to ICU Admit Time, mean hours (25/50/75% quantiles)  25.7 (0.02, 0.04, 15.97)

In addition to these 7 baseline variables, we also include features derived from 10 different vital signs (e.g. heart rate, MAP) and 20 laboratory measurements (e.g. lactate, creatinine). Vitals are typically recorded about once an hour from (continuous) bedside monitors, while labs are typically only measured a few times a day from blood samples drawn from patients. We also include indicator variables that encode whether or not a variable was recently measured, as the decision to measure certain variables may itself be very informative [30].
Lastly, we extracted information on the interventions of interest: fluid bolus therapy and vasopressor administration. We combine different types of fluids and blood products when forming our fluid action variable (we include only the common NaCl 0.9% solution, lactated Ringer's, packed red blood cells, fresh frozen plasma, and platelets). We include five different vasopressors for the vasopressor action: dopamine, epinephrine, norepinephrine, vasopressin, and phenylephrine. We map these five drugs into a common dosage based on norepinephrine equivalents, following the preprocessing in [10], where the infusion rates are in mcg/kg, normalized by body weight.
To apply RL to a problem, we must formalize the state and action spaces, as well as define a reward and a timescale. We describe each of these pieces below.
State Space, Time Discretization, and Imputation
We discretize time into hourly windows, and derive an 89-dimensional state vector, consisting of the baseline variables in Table 1 and values of the physiological and indicator variables, as shown in detail in Table 3 in the appendix. We impute any unobserved variable with the population median. Once a variable is observed in a given hospital admission, we carry forward the last observed measurement until a new value is measured. If more than one value is measured in a given hourly window, we take the most recent value, except for the three blood pressure variables, where we use the minimum value, as clinicians typically treat patients based on their most recent worst blood pressure value.
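The discretization and imputation rules above can be sketched for a single variable of a single ICU stay as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: the function name, the `(time_in_hours, value)` input layout, and the per-hour `measured` flag (a simplified stand-in for the "recently measured" indicators) are all hypothetical.

```python
import math


def hourly_series(measurements, n_hours, population_median, use_minimum=False):
    """Turn irregular (time_in_hours, value) measurements into hourly values.

    Within an hour, keep the most recent value, or the minimum if
    use_minimum is True (as described for blood pressure variables).
    Before the first measurement, use the population median; afterwards,
    carry the last observed value forward.
    """
    per_hour = {}
    for time_hours, value in sorted(measurements, key=lambda m: m[0]):
        hour = math.floor(time_hours)
        if not 0 <= hour < n_hours:
            continue  # outside the window being modeled
        if use_minimum and hour in per_hour:
            per_hour[hour] = min(per_hour[hour], value)
        else:
            per_hour[hour] = value  # later measurements overwrite earlier ones

    values, measured = [], []
    current = population_median
    for hour in range(n_hours):
        if hour in per_hour:
            current = per_hour[hour]
        values.append(current)
        measured.append(hour in per_hour)
    return values, measured
```

Calling it with `n_hours=72` would cover the 72-hour window used for each ICU stay.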
Action Space
We discretize the two types of interventions, fluid boluses and vasopressors, into 4 and 5 discrete dose levels respectively, so that in total there are 20 unique actions (see Figures 3, 4, 5, and 6 in the appendix for details). To compute the dose of a vasopressor, we aggregate the total amount of vasopressors given in each hourly window, normalized by weight. For fluids, we only include fluid boluses of at least 200 mL administered in an hour or less.
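One way to encode the resulting 4 x 5 grid as a single action index is sketched below. The bin edges are hypothetical placeholders (the actual cut points are given only in the appendix figures); the only value taken from the text is the 200 mL minimum bolus size. Level 0 means no fluid bolus or no vasopressor.

```python
import bisect

N_FLUID_LEVELS = 4
N_VASO_LEVELS = 5

# Hypothetical cut points, for illustration only.
FLUID_EDGES_ML = [200.0, 500.0, 1000.0]  # below 200 mL counts as "no bolus"
VASO_EDGES = [0.1, 0.2, 0.4]             # norepinephrine-equivalent dose


def fluid_level(volume_ml):
    """Map an hourly bolus volume to a level in 0..3."""
    return bisect.bisect_right(FLUID_EDGES_ML, volume_ml)


def vasopressor_level(dose):
    """Map an hourly weight-normalized dose to a level in 0..4."""
    if dose <= 0.0:
        return 0
    return 1 + bisect.bisect_right(VASO_EDGES, dose)


def encode_action(vaso, fluid):
    """Combine the two levels into one index in 0..19."""
    return vaso * N_FLUID_LEVELS + fluid


def decode_action(index):
    """Recover (vasopressor level, fluid level) from an action index."""
    return divmod(index, N_FLUID_LEVELS)
```

A flat index like this is convenient when a policy outputs a single categorical distribution over all 20 actions.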
Reward
We use the common target of a mean arterial pressure (MAP) of 65 mmHg. We consider MAP values above 65 mmHg as acceptable (reward 1), and decrease the reward for lower values using a piecewise linear function, with inflection points at 60 mmHg and 55 mmHg, down to a minimum at 28 mmHg (the lowest observed MAP in our data, which we assign a reward of 0). When urine output is sufficient, the penalty for moderately low MAP values of 55 mmHg or higher is waived, as clinically a slightly lower MAP is less concerning if the patient's fluids are well balanced. See Figure 7 in the appendix for a visual depiction of the chosen reward function. We leave a more thorough investigation of potential reward functions to future work. However, it is important to note that when we present SODARL in the next section, rewards are not included in the optimization, so the algorithm is agnostic to the choice of reward, which only affects the post-hoc value estimates.
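A sketch of a reward with this shape is given below. Only the breakpoint locations (28, 55, 60, and 65 mmHg) and the end values (0 and 1) come from the text; the reward values at 55 and 60 mmHg are made-up placeholders, and whether urine output is "sufficient" is passed in as a flag because the text gives no threshold.

```python
# (MAP in mmHg, reward) knots. The rewards at 55 and 60 mmHg are
# hypothetical placeholders chosen only to illustrate the shape.
REWARD_KNOTS = [(28.0, 0.0), (55.0, 0.5), (60.0, 0.75), (65.0, 1.0)]


def map_reward(map_mmhg, urine_output_sufficient=False):
    """Piecewise linear reward on MAP with a urine-output exemption."""
    if map_mmhg >= 65.0:
        return 1.0
    if urine_output_sufficient and map_mmhg >= 55.0:
        return 1.0  # moderately low MAP is not penalized in this case
    if map_mmhg <= 28.0:
        return 0.0
    for (x0, y0), (x1, y1) in zip(REWARD_KNOTS, REWARD_KNOTS[1:]):
        if x0 <= map_mmhg <= x1:
            return y0 + (y1 - y0) * (map_mmhg - x0) / (x1 - x0)
    raise ValueError("MAP must be a finite number")
```

Because the learning objective does not use rewards, swapping in different knot values would change only the post-hoc value estimates.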
Methods
When treating hypotension, there may legitimately exist different treatment strategies that are equally effective for a particular patient (e.g. one that focuses on vasopressor use and one that focuses on fluid use). There may also exist treatment strategies whose quality cannot be distinguished from the observational data.
Below, we introduce an algorithm, SODARL: Safely Optimized, Diverse, and Accurate Reinforcement Learning, for learning a collection of distinct, reliably high-quality policies from a batch of data. Doing so requires three parts. First, we want to make sure that any policy $\pi_k$ that we recommend never takes potentially dangerous actions, i.e. actions that the behavior policy $\pi_b$ essentially never takes in a given state. Second, we want the policy to be high-performing. Finally, we want the collection of policies $\{\pi_k\}_{k=1}^{K}$ to be distinct (that is, not repeating the same recommendations). The following objective function, which we minimize, incorporates all of these desiderata:
$$\mathcal{L}_{\text{quality}}\left( \{\text{safe}(\pi_k)\}_{k=1}^{K}, \pi_b \right) - \lambda \, \mathcal{L}_{\text{div}}\left( \{\text{safe}(\pi_k)\}_{k=1}^{K} \right) \tag{4}$$
where $\mathcal{L}_{\text{quality}}$ is a loss function that measures discrepancy between our collection of policies and the behavior policy, and $\mathcal{L}_{\text{div}}$ is a term (with associated regularization strength $\lambda$) that measures diversity within the collection of policies. Note that before SODARL can be run, we first need to estimate the clinician behavior policy, $\pi_b$. Following [31], we do this using a k-nearest-neighbors approach that counts the proportion of each action observed in the 100 nearest states. To quantify distance between states, we use a manually constructed distance function that weights each of the 89 state variables differently depending on their relative importance to this clinical application.
Safety:
The goal of the safety constraint is to ensure that a policy does not take a dangerous action. For our purposes, we define dangerous as unknown or rarely performed: assuming that clinicians are choosing amongst reasonable decisions most of the time, there is likely a good reason why some treatments are not chosen. And even if there is not, there is no way to tell from the current data the potential consequences of a never-tried treatment.
The safe operator, $\text{safe}(\pi)$, uses an indicator function ($\mathbb{1}[\cdot]$) to only allow state-action pairs where the behavior action probability is at least some threshold $\epsilon$. For a given state, if some actions are allowed but others are not, the action probabilities are renormalized over only the allowable actions:
$$\text{safe}(\pi)(a \mid s) = \frac{\pi(a \mid s) \, \mathbb{1}[\pi_b(a \mid s) \geq \epsilon]}{\sum_{a'} \pi(a' \mid s) \, \mathbb{1}[\pi_b(a' \mid s) \geq \epsilon]} \tag{5}$$
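The two ingredients of the safety constraint, a nearest-neighbor estimate of the behavior policy and the masking-and-renormalizing safe operator, can be sketched together as follows. Several details are assumptions made for illustration: a weighted squared Euclidean distance stands in for the manually constructed clinical distance (which is not specified here), the threshold test uses "at least epsilon", and the function names are hypothetical.

```python
import heapq


def knn_behavior_policy(query_state, states, actions, n_actions, feature_weights, k=100):
    """Estimate behavior action probabilities at query_state as the action
    proportions among the k nearest logged states."""
    def distance(state):
        # Assumption: weighted squared Euclidean distance.
        return sum(w * (x - y) ** 2
                   for w, x, y in zip(feature_weights, query_state, state))

    nearest = heapq.nsmallest(k, range(len(states)),
                              key=lambda i: distance(states[i]))
    counts = [0] * n_actions
    for i in nearest:
        counts[actions[i]] += 1
    return [c / len(nearest) for c in counts]


def safe(policy_probs, behavior_probs, epsilon):
    """Zero out actions with behavior probability below epsilon, then
    renormalize over the remaining (allowed) actions."""
    allowed = [b >= epsilon for b in behavior_probs]
    masked = [p if ok else 0.0 for p, ok in zip(policy_probs, allowed)]
    total = sum(masked)
    if total == 0.0:
        raise ValueError("policy places no probability on any allowed action")
    return [m / total for m in masked]
```

With 100 neighbors, a threshold of 0.03 corresponds to keeping only actions seen in at least 3 of the nearest states.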
Distinct ($\mathcal{L}_{\text{div}}$), Likely ($\mathcal{L}_{\text{quality}}$) Collections:
The safety operation simply ensures that we do not take actions that are completely non-evaluable. However, it does not ensure that the policies will be of high quality. One option is to directly optimize policies with respect to the CWPDIS estimator in equation 2. However, [32] note that gradient-based optimization of importance sampling estimates is difficult with complex policies and long rollouts, and we experienced difficulty attempting to optimize this directly.
Thus, we will instead follow a different strategy: our goal will be to identify a collection of likely, distinct strategies. This objective is based on the intuition that the current clinician behaviors are generally reasonable. Our goal is essentially to disentangle the distinct treatment strategies that clinicians are currently using in practice, so that each one can then be evaluated and filtered using a value estimate from equation 2.
We measure how likely a proposed policy $\pi_k$ is given current clinician behavior at a particular state $s$ as the difference $\ell(\pi_k(\cdot \mid s), \pi_b(\cdot \mid s))$, where $\ell$ is some loss function. We consider the average difference over all $K$ policies in the collection and over all states in the batch $\mathcal{D}$ (whose number we denote $|\mathcal{D}|$) as the overall similarity (or quality) of the collection of proposed policies and clinician behavior:
$$\mathcal{L}_{\text{quality}} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\mathcal{D}|} \sum_{s \in \mathcal{D}} \ell\left( \pi_k(\cdot \mid s), \pi_b(\cdot \mid s) \right) \tag{6}$$
Of course, the optimal solution to equation 6 alone is to make all policies in the collection identical to the clinician policy. To separate out the strategies that clinicians may be using, we add a diversity term, weighted by hyperparameter $\lambda$, that encourages us to discover a distinct collection of policies. We define the diversity between two policies as an average of the symmetric KL divergence between their action probabilities, over all observed states in the batch $\mathcal{D}$:
$$\text{div}(\pi_i, \pi_j) = \frac{1}{|\mathcal{D}|} \sum_{s \in \mathcal{D}} \text{KL}_{\text{sym}}\left( \pi_i(\cdot \mid s), \pi_j(\cdot \mid s) \right) \tag{7}$$
For a collection of policies $\{\pi_k\}_{k=1}^{K}$, we define the diversity measure as the average of the pairwise diversity measure over pairs that are distinct:
$$\mathcal{L}_{\text{div}} = \frac{1}{K(K-1)} \sum_{i \neq j} \text{div}(\pi_i, \pi_j) \tag{8}$$
Together, equations 6 and 8 represent the tension between finding policies that are likely (that is, have high support in the observed data) and yet distinct. By identifying this collection, we provide a space of potential policies that may be useful in any situation, and the opportunity for clinicians to optimize over the range of actions they are already performing.
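The tension between the quality and diversity terms can be illustrated with a small sketch on tabular policies, where each policy is a list of per-state action distributions. Several choices here are assumptions rather than details confirmed by the text: the symmetric KL is taken as the sum of the two directed KL divergences, a small constant is added inside the logarithms because safe policies can assign exactly zero probability to an action, and the two terms are combined as quality minus `lam` times diversity, to be minimized.

```python
import math
from itertools import combinations

EPS = 1e-8  # smoothing constant; an assumption to keep the logarithms finite


def symmetric_kl(p, q):
    """Sum of KL(p || q) and KL(q || p) for two action distributions."""
    forward = sum(a * math.log((a + EPS) / (b + EPS)) for a, b in zip(p, q))
    backward = sum(b * math.log((b + EPS) / (a + EPS)) for a, b in zip(p, q))
    return forward + backward


def mean_symmetric_kl(policy_a, policy_b):
    """Average symmetric KL over the states in the batch."""
    return sum(symmetric_kl(p, q) for p, q in zip(policy_a, policy_b)) / len(policy_a)


def quality_loss(policies, behavior):
    """Average discrepancy between each policy and the behavior policy."""
    return sum(mean_symmetric_kl(pi, behavior) for pi in policies) / len(policies)


def diversity(policies):
    """Average pairwise discrepancy between distinct policies."""
    pairs = list(combinations(policies, 2))
    return sum(mean_symmetric_kl(a, b) for a, b in pairs) / len(pairs)


def objective(policies, behavior, lam):
    """Quantity to minimize: stay close to behavior, but be mutually distinct."""
    return quality_loss(policies, behavior) - lam * diversity(policies)
```

Setting `lam` to zero recovers pure imitation of the behavior policy, while larger values reward policies for moving apart from one another.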
Experimental Setup
In this section, we provide details for the setup of our experiments on the particular task of hypotension management in the ICU. We try two different variants of the loss function in equation 6. The first is the standard cross-entropy (CE) loss, which encourages a policy's action probabilities at each state in the batch to place high probability on the action that was actually taken. The second is the symmetric KL divergence (symKL; also used for the diversity term), here computed between the action probabilities of the behavior policy and those of the policy to be learned.
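For the cross-entropy variant, the quality term compares each policy with the logged actions rather than with the estimated behavior probabilities. A minimal sketch on tabular policies follows; the function name, input layout, and smoothing constant are assumptions made for this example.

```python
import math


def cross_entropy_quality(policies, taken_actions, eps=1e-8):
    """Average negative log-probability that each policy assigns to the
    action actually taken at each state in the batch.

    policies: list of policies, each a list of per-state action distributions.
    taken_actions: list with the index of the logged action at each state.
    """
    total = 0.0
    for policy in policies:
        losses = [-math.log(probs[action] + eps)
                  for probs, action in zip(policy, taken_actions)]
        total += sum(losses) / len(losses)
    return total / len(policies)
```

A policy that concentrates probability on the logged actions receives a lower loss than one that spreads probability elsewhere.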
In practice, we try a range of values for the diversity weight $\lambda$ (1, 0.4, 0.1, 0.01, 0.001), and try several values for the safety threshold $\epsilon$ in equation 5 (0.01, 0.03, 0.05; corresponding to only considering actions seen in at least 1, 3, and 5 of the 100 nearest neighbors of a given state, respectively). To actually learn a policy $\pi_k$ that maps states to action probabilities, we use a simple three-layer feedforward neural network (multilayer perceptron), with 128 units per layer. Thus, the parameters to learn are three sets of weight matrices and bias vectors for each policy $\pi_k$. In our experiments we jointly learn 4 policies at once. We train our methods using the Adam optimizer with a batch size of 100 trajectories at a time, and use a modest multiplier on a regularization term on all policy parameters.
Evaluation Metrics
While our optimization objective aimed to identify distinct, likely treatment policies from the data, our original goal was to identify distinct, effective policies that can serve as options for clinicians. We evaluate the effectiveness of a policy via the CWPDIS estimator in equation 2. We also report the effective sample size of a policy using equation 3. Together, these metrics provide an estimate of the effectiveness of a policy: the CWPDIS value is an estimate of the policy's value, while the ESS is a measure of confidence in that estimate. We also present the CE and symKL loss functions that are optimized in the quality term, as additional metrics of how likely a given policy is with respect to the behavior policy and the behavior actions taken. We measure the distinctness of the collection using the average symmetric KL between each pair of policies, i.e. equation 8. Lastly, to measure safety, we count the number of times a policy places a non-negligible probability on an action disallowed by the safety term in equation 5.
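The safety count in the last metric can be sketched as below. The cutoff for a "non-negligible" probability is not recoverable from the text, so `negligible` is a hypothetical parameter with an arbitrary default, and the function name is likewise chosen for illustration.

```python
def count_unsafe_recommendations(policy, behavior, epsilon, negligible=0.01):
    """Count state-action pairs where the policy puts more than `negligible`
    probability on an action whose behavior probability is below epsilon.

    policy, behavior: lists of per-state action distributions.
    """
    count = 0
    for policy_probs, behavior_probs in zip(policy, behavior):
        for p, b in zip(policy_probs, behavior_probs):
            if b < epsilon and p > negligible:
                count += 1
    return count
```

A policy passed through the safe operator should score zero on this metric by construction, which makes the count mainly informative for the ablations without the safety constraint.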
Baselines
We consider ablations of our approach to determine which aspects are most important to identifying a collection of effective policies. In particular, we explore variants where we turn off various combinations of the diversity term, the quality term, and the safety constraint. We ran experiments with all three (the full method) using both the CE and symKL losses to measure quality, and also ran versions with: a diversity term and safety constraint, but no quality term; a diversity term and quality term, but no safety constraint; a quality term and safety constraint, but no diversity term; and a quality term alone, with no diversity term or safety constraint.
Results
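Setting  Diversity ($\lambda$)  Safety Constraint  Quality Loss  Policies Retained (ESS ≥ 50)
Diverse and Safe  High  Yes  CE  3
Diverse and Safe  High  Yes  SymKL  3
Diverse and Safe  Low  Yes  CE  0
Diverse and Safe  Low  Yes  SymKL  4
Diverse and Safe  High  Yes  None  4
Diverse, not Safe  High  No  CE  0
Diverse, not Safe  High  No  SymKL  2
Diverse, not Safe  Low  No  CE  0
Diverse, not Safe  Low  No  SymKL  0
Diverse, not Safe  High  No  None  0
Safe, not Diverse  None  Yes  CE  4
Safe, not Diverse  None  Yes  SymKL  4
Not Safe or Diverse  None  No  CE  0
Not Safe or Diverse  None  No  SymKL  0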
Table 2: Quantitative results (means and standard deviations) for each collection of learned policies. For comparison, note that the number of trajectories in the test set is the highest achievable ESS. Furthermore, the empirical average of the returns in the test set is a reasonable estimate of the value of the behavior policy. We only show results for agents who learned a policy that had an ESS of at least 50.

Table 2 presents our quantitative results. To constrain our results to only include policies whose value we can reliably estimate, we prune out learned policies that have an individual ESS of less than 50 on the test set of trajectories, regardless of their value estimate. In general, most policies that we learn have value estimates that are quite close to the average return on the test set, which is an unbiased and reliable estimate of the value of the clinician behavior policy.
A major takeaway is that without the safety constraint, the optimization is very likely to end up learning a policy with an unacceptably low ESS. Moreover, even if the ESS is reasonable, there will be a large number of transitions where the agent recommends unknown, never-before-seen actions for patients similar to the current state. Without the diversity term but with the safety constraint, it is possible to achieve better CE and SymKL loss values, i.e. policies closer to the behavior policy, but at the cost of very low to no diversity. Without a quality term of some sort, the combination of diversity and safety learns a very diverse set of policies that still has good value and ESS, but is substantially further away from the behavior policy: it often confidently recommends actions that were unlikely, but still possible, under the behavior policy. Lastly, using only a quality term also typically fails to learn a policy with a reasonable ESS. In contrast, the full SODARL method using all three terms strikes a middle ground, still learning a fairly diverse set of policies while sticking much closer to the behavior policy.
Lastly, we present qualitative results from the policies in the second row of Table 2, i.e. high diversity, a safety constraint, and the symKL loss in the quality term. Figure 1 illustrates the local diversity learned by this collection of 3 policies at a particular state. The blue bars in the figure show the estimated behavior policy action probabilities, while orange, green, and red show the SODARL probabilities. Agent 1 (correctly) places high confidence in the low-vasopressor, no-fluid action (v1,f0), while agent 2 places high confidence on the medium-vasopressor, no-fluid action (v2,f0), and agent 3 assigns moderate probability to several other actions.
Figure 2 presents a more global picture of the type of diversity that the policies learn. Agent 1 primarily places high probability on high doses of vasopressors with fluids, low doses of vasopressor with no fluids, and medium doses of fluids with no vasopressors. Agent 2 mostly focuses on lower doses of vasopressors, regardless of fluid amount. Lastly, agent 3 largely recommends various amounts of fluids across a range of low to moderate vasopressor doses.
For additional qualitative results similar to these two, see Figures 8-18 in the appendix. Figures 8-14 illustrate additional states with high local diversity among agents, and Figures 15-18 show the distribution of action probabilities across subsets of states where different types of actions were taken and where patients were in states with high physiological instability (i.e. low MAP and high lactate).
Discussion
In this paper we introduced SODARL, a reinforcement learning approach for identifying a collection of effective treatment policies from observational data. When applied to the task of hypotension management in the ICU, we found that it is crucial that all three components in Equation 4 are utilized so that the learned policies are diverse, safe, and not that far from current clinical practice. Additionally, our qualitative results on a learned collection of policies suggest that they are each picking up on diverse sets of practices in the treatment of hypotension.
However, one of the major assumptions that we make is that the current set of features that comprise our definition of state is actually sufficient for a clinician to act on (i.e., that our defined state actually satisfies the Markov assumption). This is likely an unrealistic assumption, but future work could explore other ways of learning state statistics, and our methods can be seamlessly combined with any state representation.
Another interesting line of future work would be to explore how and why different types of vasopressors are given, especially in settings where more than one is given (e.g. vasopressin, which is often combined with another drug like norepinephrine). Finally, blood pressure targets themselves are an area of active research [33]. We focused on achieving certain targets in our rewards, as that ensures that the actions are closely linked to the outcomes. More general forms of patient outcomes (e.g. mortality) may be more interesting, but have their own challenges, as these outcomes depend on many factors outside of how a patient's hypotension is managed.
Overall, we believe SODARL represents an important and underexplored direction in reinforcement learning for healthcare. It is often statistically impossible to identify optimal treatment strategies from observational data. However, it is possible to identify a collection of plausible alternatives, drawn from current practice variation. This collection can provide a starting point for clinical experts to perform a targeted review, starting with chart review and perhaps ending in a trial of different treatment options; once vetted, it could be used to help patients and providers think about options in the context of the patient's specific presentation and the provider's experience and expertise. Our proposed SODARL algorithm ensures that those alternatives are distinct and have sufficient support in the data, enabling what we believe will be a more practical and impactful way for clinicians to draw treatment policy insights from observational sources.
Acknowledgements
FDV and JF acknowledge support from NSF Project 1750358. MAM and FDV acknowledge support from AFOSR FA 95501710155. JF additionally acknowledges Oracle Labs, a Harvard CRCS fellowship, and a Harvard Embedded EthiCS fellowship.
References
 [1] Emad H Ibrahim, Glenda Sherman, Suzanne Ward, Victoria J Fraser, and Marin H Kollef. The influence of inadequate antimicrobial treatment of bloodstream infections on patient outcomes in the icu setting. Chest, 118(1):146–155, 2000.
 [2] AN Berbece and RMA Richardson. Sustained lowefficiency dialysis in the icu: cost, anticoagulation, and solute removal. Kidney international, 70(5):963–968, 2006.
 [3] Andrés Esteban, Antonio Anzueto, Fernando Frutos, Inmaculada Alía, Laurent Brochard, Thomas E Stewart, Salvador Benito, Scott K Epstein, Carlos Apezteguía, Peter Nightingale, et al. Characteristics and outcomes in adult patients receiving mechanical ventilation: a 28day international study. Jama, 287(3):345–355, 2002.
 [4] Neil J Glassford, Glenn M Eastwood, and Rinaldo Bellomo. Physiological changes after fluid bolus therapy in sepsis: a systematic review of contemporary data. Critical care, 18(6):696, 2014.
 [5] Christof Havel, Jasmin Arrich, Heidrun Losert, Gunnar Gamper, Marcus Müllner, and Harald Herkner. Vasopressors for hypotensive shock. Cochrane Database of Systematic Reviews, (5), 2011.
 [6] Kamal Maheshwari, Brian H Nathanson, Sibyl H Munson, Victor Khangulov, Mitali Stevens, Hussain Badani, Ashish K Khanna, and Daniel I Sessler. The relationship between icu hypotension and in-hospital mortality and morbidity in septic patients. Intensive care medicine, 44(6):857–867, 2018.
 [7] Alan E Jones, Vasilios Yiannibas, Charles Johnson, and Jeffrey A Kline. Emergency department hypotension predicts sudden unexpected in-hospital mortality: a prospective cohort study. Chest, 130(4):941–946, 2006.
 [8] Uma M Girkar, Ryo Uchimido, Liwei H Lehman, Peter Szolovits, Leo Celi, and Wei-Hung Weng. Predicting blood pressure response to fluid bolus therapy using attention-based neural networks for clinical interpretability. arXiv preprint arXiv:1812.00699, 2018.
 [9] Richard S Sutton. Introduction to reinforcement learning, volume 2. 1998.
 [10] Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716, 2018.
 [11] Susan M Shortreed, Eric Laber, Daniel J Lizotte, T Scott Stroup, Joelle Pineau, and Susan A Murphy. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine learning, 84(1-2):109–136, 2011.
 [12] Niranjani Prasad, Li-Fang Cheng, Corey Chivers, Michael Draugelis, and Barbara E Engelhardt. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. arXiv preprint arXiv:1704.06300, 2017.
 [13] Shamim Nemati, Mohammad M Ghassemi, and Gari D Clifford. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 2978–2981. IEEE, 2016.
 [14] Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature medicine, 25(1):16–18, 2019.
 [15] Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
 [16] Jun S Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and computing, 6(2):113–119, 1996.
 [17] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
 [18] Philip S Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 [19] Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, pages 2380–2388, 2015.
 [20] Yang Liu, Prajit Ramachandran, Qiang Liu, and Jian Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
 [21] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165, 2017.
 [22] Matthew Smith, Herke Hoof, and Joelle Pineau. An inference-based policy gradient method for learning options. In International Conference on Machine Learning, pages 4710–4719, 2018.
 [23] Shirin Sohrabi, Anton V Riabov, Octavian Udrea, and Oktie Hassanzadeh. Finding diverse high-quality plans for hypothesis generation. In ECAI, pages 1581–1582, 2016.
 [24] Muhammad Masood and Finale Doshi-Velez. Diversity-inducing policy gradient: Using maximum mean discrepancy to find a set of diverse policies. In IJCAI, 2019.
 [25] Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment - a deep reinforcement learning approach. arXiv preprint arXiv:1705.08422, 2017.
 [26] Marzyeh Ghassemi, Mike Wu, Michael C Hughes, Peter Szolovits, and Finale Doshi-Velez. Predicting intervention onset in the icu with switching state space models. AMIA Summits on Translational Science Proceedings, 2017:82, 2017.
 [27] Feras Hatib, Zhongping Jian, Sai Buddi, Christine Lee, Jos Settels, Karen Sibert, Joseph Rinehart, and Maxime Cannesson. Machine-learning algorithm to predict hypotension based on high-fidelity arterial pressure waveform analysis. Anesthesiology: The Journal of the American Society of Anesthesiologists, 129(4):663–674, 2018.
 [28] Shameek Ghosh, Mengling Feng, Hung Nguyen, and Jinyan Li. Risk prediction for acute hypotensive patients by using gap constrained sequential contrast patterns. In AMIA Annual Symposium Proceedings, volume 2014, page 1748. American Medical Informatics Association, 2014.
 [29] Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Liwei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific data, 3:160035, 2016.
 [30] Denis Agniel, Isaac S Kohane, and Griffin M Weber. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. bmj, 361:k1479, 2018.
 [31] Aniruddh Raghu, Omer Gottesman, Yao Liu, Matthieu Komorowski, Aldo Faisal, Finale Doshi-Velez, and Emma Brunskill. Behaviour policy estimation in off-policy policy evaluation: Calibration matters. arXiv preprint arXiv:1807.01066, 2018.
 [32] Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
 [33] Pierre Asfar, Ferhat Meziani, Jean-François Hamel, Fabien Grelon, Bruno Megarbane, Nadia Anguel, Jean-Paul Mira, Pierre-François Dequin, Soizic Gergaud, Nicolas Weiss, et al. High versus low blood-pressure target in patients with septic shock. New England Journal of Medicine, 370(17):1583–1593, 2014.
Appendix
Our state space contains 89 clinical and demographic features, which we now briefly describe. There are 7 baseline variables (demographics and other characteristics available on ICU admission) in Table 1. We also include in the state a continuous variable denoting how far into the first 72 hours of the ICU stay the current time point is. The remaining 81 clinical variables are in Table 3, which summarizes the measured values of the time series variables as well as the indicator variables. Lastly, there are indicator variables for the most recent type of treatment administered, and variables for the total amount of each treatment administered thus far and in the last 8 hours.
Clinical Variable  Mean (25th percentile, median, 75th percentile), or percentage for indicator variables
Bicarbonate  24.3 (21.0, 24.0, 27.0)  
Bicarbonate, indicator if measured last hour  14.2%  
BUN  27.9 (14.0, 21.0, 34.0)  
BUN, indicator if measured last hour  8.5%  
Creatinine  1.5 (0.7, 1.0, 1.6)  
Creatinine, indicator if measured last hour  8.5%  
GFR  69.9 (36.5, 63.7, 94.8)  
FiO2  54.9 (40.0, 50.0, 60.0)  
FiO2, indicator if measured last hour  13.4%  
FiO2, indicator if ever measured  61.6%  
Glucose  139.6 (106.0, 127.0, 156.0)  
Glucose, indicator if measured last hour  28.6%  
Hct  30.6 (26.9, 30.0, 33.8)  
Hct, indicator if measured last hour  10.8%  
HR  84.6 (72.0, 83.0, 96.0)  
HR, indicator if measured last hour  94.4%  
Lactate  2.6 (1.3, 1.9, 3.0)  
Lactate, indicator if measured last hour  5.2%  
Lactate, indicator if measured in last 8 hours  28.3%  
Lactate, indicator if ever measured  78.1%  
Magnesium  2.1 (1.8, 2.0, 2.3)  
Magnesium, indicator if measured last hour  6.9%  
Platelets  200.0 (125.0, 182.0, 250.0)  
Platelets, indicator if measured last hour  8.1%  
Potassium  4.2 (3.8, 4.1, 4.5)  
Potassium, indicator if measured last hour  11.6%  
Sodium  138.1 (135.0, 138.0, 141.0)  
Sodium, indicator if measured last hour  9.8%  
SPO2  96.8 (95.0, 97.0, 99.0)  
SPO2, indicator if measured last hour  92.2%  
Spontaneous RR  19.4 (16.0, 19.0, 23.0)  
Spontaneous RR, indicator if measured last hour  93.9%  
Temp  36.9 (36.4, 36.8, 37.4)  
Temp, indicator if measured last hour  28.9%  
Urine Output in last hour  115.0 (40.0, 75.0, 140.0)  
Urine Output, indicator if measured last hour  63.5%  
WBC  11.9 (7.6, 10.5, 14.4)  
WBC, indicator if measured last hour  7.9%  
ALT  212.3 (18.0, 32.0, 79.0)  
ALT, indicator if measured last hour  2.6%  
ALT, indicator if ever measured  66.5%  
AST  285.7 (25.0, 44.0, 112.0)  
AST, indicator if measured last hour  2.6%  
AST, indicator if ever measured  66.5%  
Bilirubin Total  1.375 (0.5, 0.9, 0.9)  
Bilirubin Total, indicator if measured last hour  3.6%  
Bilirubin Total, indicator if ever measured  66.1%  
CO2  24.5 (22.0, 24.0, 27.0)  
CO2, indicator if measured last hour  8.3%  
DBP  57.2 (49.0, 56.0, 64.0)  
DBP, indicator if measured last hour  91.5%  
Hgb  10.4 (9.1, 10.2, 11.6)  
Hgb, indicator if measured last hour  13.1%  
MAP  72.8 (64.0, 71.0, 80.0)  
MAP, indicator if measured last hour  91.9%  
PCO2  41.4 (35.0, 40.0, 46.0)  
PCO2, indicator if measured last hour  8.3%  
PCO2, indicator if measured in last 8 hours  35.3%  
PCO2, indicator if ever measured  70.2%  
PO2  149.9 (86.0, 117.0, 176.0)  
PO2, indicator if measured last hour  8.3%  
PO2, indicator if measured in last 8 hours  35.3%  
PO2, indicator if ever measured  70.2%  
SBP  113.6 (100.0, 112.0, 126.0)  
SBP, indicator if measured last hour  91.6%  
Weight  83.6 (66.8, 80.0, 96.7)  
Weight, indicator if measured last hour  3.6%  
GCS  12.0 (10.0, 14.0, 15.0)  
GCS, indicator if measured last hour  28.3%  
Indicator for if vasopressor action 1 administered last hour  5.1%  
Indicator for if vasopressor action 2 administered last hour  5.8%  
Indicator for if vasopressor action 3 administered last hour  4.7%  
Indicator for if vasopressor action 4 administered last hour  2.4%  
Indicator for if fluid action 1 administered last hour  1.7%  
Indicator for if fluid action 2 administered last hour  2.1%  
Indicator for if fluid action 3 administered last hour  1.9%  
Total amount of vasopressor administered during ICU stay  129.7 (0.0, 0.0, 51.7)  
Total amount of fluids administered during ICU stay  1891.0 (0.0, 850.0, 2940.7)  
Total amount of vasopressor administered in last 8 hours  28.6 (0.0, 0.0, 0.0)  
Total amount of fluids administered in last 8 hours  342.1 (0.0, 0.0, 202.18)  
Reward value during last hour  0.978 (1.000, 1.000, 1.000) 
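The "measured last hour", "measured in last 8 hours", and "ever measured" indicators in the table can be derived from the times at which each variable was recorded. The following is a minimal sketch, assuming hourly time steps; the helper name `measurement_indicators`, the input layout, and the exact window boundaries are illustrative assumptions and not necessarily the preprocessing used in this work.

```python
def measurement_indicators(measured_hours, current_hour):
    """Return (last_hour, last_8_hours, ever) indicators for one variable.

    measured_hours: iterable of integer hours (since ICU admission) at which
        the variable was recorded. This input layout is hypothetical.
    current_hour: integer hour of the current time step.
    """
    # Only measurements taken up to the current time step are visible.
    past = [h for h in measured_hours if h <= current_hour]
    last_hour = int(any(current_hour - h < 1 for h in past))
    last_8_hours = int(any(current_hour - h < 8 for h in past))
    ever = int(len(past) > 0)
    return last_hour, last_8_hours, ever


# Lactate drawn at hours 2 and 5 of the stay, evaluated at hour 9.
print(measurement_indicators([2, 5], current_hour=9))  # (0, 1, 1)
```

Averaging each indicator over all time steps in the cohort would give percentages of the kind reported in the table.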
We now present histograms of the actual treatment amounts given, which illustrate how we discretized them to obtain our final action space of 20 possible actions.
We now show additional figures exploring specific states observed in the test set where the three retained policies learned by SODARL exhibit high degrees of diversity.
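One way to arrive at 20 actions is to combine a "none" level plus four nonzero vasopressor levels with a "none" level plus three nonzero fluid levels, which matches the four vasopressor and three fluid action indicators listed in the table above (5 x 4 = 20). The sketch below illustrates this under that assumption; the bin edges, their units, and the function names are hypothetical placeholders, since the actual cut points are chosen from the dose histograms.

```python
from bisect import bisect_right

# Hypothetical bin edges; the real cut points come from the dose histograms.
VASO_EDGES = [5.0, 15.0, 40.0]   # split nonzero vasopressor doses into 4 levels
FLUID_EDGES = [500.0, 1000.0]    # split nonzero fluid volumes into 3 levels


def dose_level(amount, edges):
    """Map a dose to 0 (none given) or to 1..len(edges)+1 for nonzero doses."""
    if amount <= 0:
        return 0
    return 1 + bisect_right(edges, amount)


def joint_action(vaso_amount, fluid_amount):
    """Combine 5 vasopressor levels and 4 fluid levels into one of 20 actions."""
    n_fluid_levels = len(FLUID_EDGES) + 2  # 3 nonzero levels plus "none"
    vaso = dose_level(vaso_amount, VASO_EDGES)
    fluid = dose_level(fluid_amount, FLUID_EDGES)
    return vaso * n_fluid_levels + fluid


print(joint_action(0.0, 0.0))      # 0: no treatment
print(joint_action(50.0, 2000.0))  # 19: highest level of both treatments
```

Indexing the joint action as a single integer keeps the policy a simple categorical distribution over 20 outcomes while still allowing the two treatment levels to be recovered by integer division and remainder.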
Finally, we show additional histograms of action probabilities for the 3 learned policies, along with the behavior policy, for several different subsets of states. We show how the behavior and learned policies focus on different actions in states where a fluid action was subsequently taken (Figure 15), states where a vasopressor action was subsequently taken (Figure 16), and states where the patient is in a stage of especially high acuity, as measured by elevated lactate (Figure 17) and severely low MAP (Figure 18).
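Histograms of this kind can be produced by averaging each policy's action probabilities over the states in a subset of interest. The sketch below shows the idea on toy data; the function name, the three-action toy policy, the MAP values, and the low-MAP threshold are all hypothetical and chosen only for illustration.

```python
def mean_action_probs(policy_probs, mask):
    """Average action-probability rows over the states selected by mask.

    policy_probs: list of per-state probability lists (one entry per action).
    mask: list of booleans, True for states in the subset of interest.
    """
    selected = [row for row, keep in zip(policy_probs, mask) if keep]
    if not selected:
        raise ValueError("no states selected")
    n_actions = len(selected[0])
    return [sum(row[a] for row in selected) / len(selected)
            for a in range(n_actions)]


# Toy example with 3 actions and 3 states; all values are made up.
probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.4, 0.3]]
map_values = [72.0, 48.0, 52.0]
low_map = [m < 55.0 for m in map_values]  # hypothetical threshold
print(mean_action_probs(probs, low_map))
```

Applying the same mask to the behavior policy and to each learned policy yields one bar chart per policy that can be compared side by side.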