POPCORN: Partially Observed Prediction COnstrained ReiNforcement Learning

01/13/2020, by Joseph Futoma et al.

Many medical decision-making settings can be framed as partially observed Markov decision processes (POMDPs). However, popular two-stage approaches that first learn a POMDP model and then solve it often fail because the model that best fits the data may not be the best model for planning. We introduce a new optimization objective that (a) produces both high-performing policies and high-quality generative models, even when some observations are irrelevant for planning, and (b) does so in the kinds of batch, off-policy settings common in medicine. We demonstrate our approach on synthetic examples and a real-world hypotension management task.


1 Introduction

Reinforcement learning (RL) has the potential to assist with health contexts that involve sequences of decisions, especially in settings where no evidence-based guidelines exist. For example, in this work we focus on the task of managing patients with acute hypotension, a life-threatening emergency that occurs when a patient’s blood pressure drops to dangerously low levels. In these situations, it is not always clear which treatment might be most effective for a particular patient in a particular context, in what amount, when, and for how long (García et al., 2015). That said, applying RL to healthcare settings is challenging, as highlighted by Gottesman et al. (2019). Two key considerations for our chosen clinical setting are that decisions must be made under partial observability—the current observations of a patient do not capture their history—and that learning must occur in batch—we must learn from retrospective data alone.

We will further focus this work on pushing the limits of model-based approaches with discrete hidden state representations. In addition to being able to recommend actions (our primary goal), building generative models allows us to create forecasts of future observations (a form of validation), learn in the presence of missing data (frequent in clinical settings), and generally learn more sample-efficiently than model-free approaches (important as we are often data-limited in clinical settings). Building directly-inspectable models (via simple, discrete structures) further tends to increase sample efficiency over deep models, and also enables easy inspection for clinical sensibility.

Specifically, we introduce POPCORN, or Partially Observed Prediction COnstrained ReiNforcement learning, a new optimization objective for the well-known partially observable Markov decision process (POMDP) (Kaelbling et al., 1998). POMDPs have traditionally been trained in a two-stage process, where the first stage is generally learned by maximizing the likelihood of observations and is not tied to the decision-making task. However, the two-stage approach can fail to find good policies when the model is (inevitably) misspecified; in particular, the maximum likelihood model may spend effort modeling irrelevant signal rather than the signal that matters for the task at hand. We demonstrate this effect, and also demonstrate how POPCORN—which constrains maximum likelihood training of the POMDP generative model so that the value of the model’s induced policy achieves satisfactory performance—does not suffer from these issues. Our approach is compatible with both on-policy and off-policy value estimation, making use of differentiable importance sampling estimators (Thomas, 2015).

2 Related Work

Reinforcement learning in healthcare.

Healthcare applications of reinforcement learning have proliferated in recent years, in diverse clinical areas such as management of schizophrenia (Shortreed et al., 2011), sepsis (Komorowski et al., 2018; Raghu et al., 2017), mechanical ventilation (Prasad et al., 2017), HIV (Ernst et al., 2006), and dialysis (Martín-Guerrero et al., 2009). However, most health applications of RL focus on model-free approaches, largely because learning accurate models from noisy biological data is challenging. In addition, all of these works assume full observability, which is typically not accurate in health settings.

A few prior works have explicitly modeled partial observability. POMDPs have been applied to heart disease management (Hauskrecht and Fraser, 2000), sepsis treatment in off-policy or simulated settings (Li et al., 2018; Oberst and Sontag, 2019; Peng et al., 2018), and HIV management (Parbhoo et al., 2017). All of these approaches take a two-stage approach to learning. In contrast, we are decision-aware through the whole optimization process.

End-to-end learning for RL.

“End-to-end” optimization methods that directly incorporate a downstream decision-making task during model training are an area of growing popularity, be it for graphical models (Lacoste–Julien et al., 2011) or submodular optimization (Wilder et al., 2019). Within RL, decision-aware optimization efforts have explored partially-observed problems in both model-free (Karkus et al., 2017) and model-based settings (Igl et al., 2018).

These efforts differ from ours in two key respects. First, they exclusively focus on on-policy settings for environments such as Atari where data are easily collected. Second, these methods all rely heavily on black-box neural networks for feature extraction, which are neither sample-efficient nor easily interpreted. In many cases (e.g. Karkus et al. (2017)), the model is treated as an abstraction and there is no way to set the importance of the model’s ability to accurately generate trajectories. Perhaps closest in spirit to our approach is recent work on value-aware model learning in RL (Farahmand, 2018), although their contribution is mainly theoretical.

3 Background

Here we describe the standard two-stage approach for model-based RL, where the first stage learns the model and the second stage solves for a policy given the model.

POMDP Model.

We consider a POMDP with $K$ discrete latent states and $A$ discrete actions. We assume continuous $D$-dimensional observations and deterministic rewards. The overall model parameters are $\theta = \{\tau, \mu, \sigma\}$. The transition parameter $\tau_{k,a,k'}$ describes the probability of moving to the next (unobserved) state $s_{t+1} = k'$, given the current state $s_t = k$ and action $a_t = a$. Emission parameters $\mu_{k,a,d}$ and $\sigma^2_{k,a,d}$ provide the mean and variance for observation dimension $o_{t+1,d}$, when in state $s_{t+1} = k$ and given prior action $a_t = a$ (assumed in this work to be independent Gaussians across the $D$ dimensions).

Concretely, we assume the following generative model for states and observations, indexed by time $t$:

$$ s_{t+1} \mid s_t = k,\, a_t = a \sim \text{Cat}(\tau_{k,a,:}), \qquad o_{t+1,d} \mid s_{t+1} = k',\, a_t = a \sim \mathcal{N}\big(\mu_{k',a,d},\, \sigma^2_{k',a,d}\big). \tag{1} $$

We let $n$ index each trajectory in a dataset $\mathcal{D}$ of $N$ sequences, and let $T_n$ denote the length of trajectory $n$. Completing the POMDP specification are the deterministic reward function $R(k, a)$, specifying the reward from taking action $a$ in state $k$, and the discount factor $\gamma \in [0, 1)$.

Given a POMDP, we can compute the posterior belief vector $b_t$ over the current state given all actions and observations up to time $t$, i.e. $b_t(k) = p(s_t = k \mid o_{1:t}, a_{1:t-1})$. The belief is a sufficient statistic for the entire previous history, and can be computed efficiently via forward-filtering algorithms (Rabiner, 1989). We can also solve the POMDP using a planning algorithm to learn a policy $\pi(b)$, yielding a function that returns an action (or a distribution over actions for stochastic policies) for any queried belief vector. The ultimate goal is to learn a policy that maximizes the expected sum of discounted future rewards $\sum_t \gamma^{t-1} r_t$.

Learning Parameters: Input-Output HMM.

The model in Eq. (1) is an input-output hidden Markov model (IO-HMM) (Bengio and Frasconi, 1995), where the actions are the inputs and the observations are the outputs. Thus, the model parameters that maximize the marginal likelihood of observed histories can be efficiently computed using the EM algorithm and standard dynamic programming subroutines for HMMs (Rabiner, 1989; Chrisman, 1992). If one is using Bayesian approaches, efficient algorithms for sampling from the posterior over POMDP models also exist (Doshi-Velez, 2012). Deterministic rewards can be estimated in a later stage that minimizes squared error.

Solving for the Policy.

The value function of a discrete-state POMDP can be approximated arbitrarily closely by the upper envelope of a finite set of linear functions of the belief (Sondik, 1978). However, exact solutions are intractable for even very small POMDPs. In this work, we start with point-based value iteration (PBVI) (Pineau et al., 2003), which performs approximate Bellman backups at a fixed set of belief points. For the moderate state space sizes required for our applications, PBVI is an efficient solver. We require two key innovations beyond standard PBVI. First, we adapt ideas from Hoey and Poupart (2005) to handle continuous observations. Second, we relax the algorithm so each step in the solver is differentiable. Additional details can be found in the supplement.

Off-Policy Value Estimation.

Let $\pi_\theta$ be the optimal policy given model parameters $\theta$; that is, the policy that maximizes expected future rewards, obtained by solving the model. However, the fact that $\pi_\theta$ is optimal for a learned model does not mean it is optimal in reality. In the batch setting, we lack the ability to interact with the environment and collect trajectories by following a specific policy. Instead, we must turn to off-policy evaluation (OPE) to estimate a policy’s value. Let $\pi_b$ be the behavior policy, that is, the policy under which the observed data were collected (e.g. clinician behavior). (In our experiments with simulated environments, we assume $\pi_b$ is given; in the real data setting, we estimate the behavior policy following the k-nearest neighbors approach of Raghu et al. (2018).)

Let $\mathcal{D} = \{(o_{n,1:T_n}, a_{n,1:T_n}, r_{n,1:T_n})\}_{n=1}^{N}$ be a set of trajectories collected under the behavior policy $\pi_b$. Then the consistent weighted per-decision importance sampling (CWPDIS; Thomas, 2015) estimate of the value of our proposed policy $\pi_\theta$ is given by:

$$ \hat{V}_{\text{CWPDIS}}(\pi_\theta; \mathcal{D}) = \sum_{t} \gamma^{t-1}\, \frac{\sum_{n} \rho_{nt}\, r_{nt}}{\sum_{n} \rho_{nt}}, \qquad \rho_{nt} = \prod_{t'=1}^{t} \frac{\pi_\theta(a_{nt'} \mid b_{nt'})}{\pi_b(a_{nt'} \mid o_{n,1:t'}, a_{n,1:t'-1})}. \tag{2} $$

In general, importance sampling (IS) estimators have lower bias than other approaches but suffer from high variance. Another class of OPE methods learns a separate model to simulate trajectories in order to estimate policy values (e.g. Chow et al. (2015)), but such methods typically suffer from high bias in real-world, noisy settings.
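To make Eq. (2) concrete, the following is a minimal NumPy sketch of a CWPDIS-style estimate computed from padded batch arrays; the array layout and names are our own illustration, not the paper's code.

```python
import numpy as np

def cwpdis_value(pi_probs, behav_probs, rewards, mask, gamma=0.99, eps=1e-12):
    """CWPDIS estimate of a policy's value from batch data (sketch).

    pi_probs, behav_probs, rewards, mask: (N, T_max) arrays, zero-padded with
    mask = 0 after each trajectory ends. pi_probs[n, t] is the proposed
    policy's probability of the logged action at step t of trajectory n;
    behav_probs[n, t] is the behavior policy's probability of that action.
    """
    # Per-step likelihood ratios; padded steps contribute a ratio of 1.
    ratios = np.where(mask > 0, pi_probs / np.maximum(behav_probs, eps), 1.0)
    rho = np.cumprod(ratios, axis=1) * mask          # cumulative importance weights
    T_max = rewards.shape[1]
    discounts = gamma ** np.arange(T_max)
    # Self-normalized average reward at each time step, then discounted sum.
    stepwise = (rho * rewards).sum(axis=0) / np.maximum(rho.sum(axis=0), eps)
    return float((discounts * stepwise).sum())
```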

4 Prediction-Constrained POMDPs

We now introduce POPCORN, our proposed prediction-constrained optimization framework for learning POMDPs. We seek to learn parameters that assign high likelihood to the observed data while also yielding a policy that has high value. As noted in Sec. 2, previous approaches for learning POMDPs generally fall into two categories. Two-stage methods (e.g. Chrisman (1992)) that first learn a model and then solve it often fail to find good policies under severe model misspecification. End-to-end methods (e.g.  Karkus et al. (2017)) that focus only on the “discriminative” task of policy learning fail to produce accurate generative models of the environment. They also generally lack the ability to handle missing observations, which is especially problematic in medical contexts where there is often missing data.

Our approach offers a balance between these purely maximum likelihood-driven (generative) and purely reward-driven (discriminative) extremes. We retain the strengths of the generative approach—the ability to plan under missing observations, simulate accurate dynamics, and inspect model parameters to inspire scientific hypotheses—while benefiting from model parameters that are directly informed by the decision task, as in end-to-end frameworks.

4.1 POPCORN Objective

Our proposed framework seeks POMDP model parameters that maximize the log marginal likelihood of the observed dataset $\mathcal{D}$, while enforcing that the solved policy’s (estimated) value is high enough to be useful. Formally, we seek a POMDP parameter vector $\theta$ that solves the constrained optimization problem:

$$ \max_{\theta}\; \mathcal{L}_{\text{gen}}(\theta; \mathcal{D}) \quad \text{subject to} \quad V(\pi_\theta; \mathcal{D}) \geq \epsilon, \tag{3} $$

with the generative term $\mathcal{L}_{\text{gen}}$ and the value term $V$ defined below. The scalar threshold $\epsilon$ defines a minimum value the policy must achieve (to be determined by a task expert).

If we had a perfect optimizer for Eq. (3), we would prefer this constrained formulation as it best expresses our model-fitting goals: we want as good a generative model as possible, but we will not accept poor decision-making. The constraint in Eq. (3) is similar to the prediction-constrained objective used by Hughes et al. (2018) for optimizing supervised topic models; here we apply it in a batch, off-policy RL context.

In practice, solving constrained problems is challenging, so we transform Eq. (3) into an equivalent unconstrained objective using a Lagrange multiplier $\lambda > 0$:

$$ \max_{\theta}\; \mathcal{L}_{\text{gen}}(\theta; \mathcal{D}) + \lambda\, V(\pi_\theta; \mathcal{D}). \tag{4} $$

Here, setting $\lambda = 0$ recovers classic two-stage training’s purely generative focus on $\mathcal{L}_{\text{gen}}$, while the limit $\lambda \to \infty$ approximates end-to-end approaches that focus exclusively on policy value. In our experiments, we compare against both of these baseline approaches, referring to the $\lambda = 0$ case as “2-stage EM” and the $\lambda \to \infty$ case as “Value term only” (in practice, we implement the latter by optimizing only the value term).
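As a minimal illustration of Eq. (4), the sketch below simply combines the two terms; the two callables are assumptions standing in for the generative likelihood and the differentiable off-policy value estimate described next, not a released API.

```python
def popcorn_objective(theta, data, lam, gen_term, value_term):
    """Unconstrained POPCORN-style objective (sketch).

    gen_term(theta, data)   -> log marginal likelihood of the observations
    value_term(theta, data) -> differentiable off-policy value estimate of
                               the policy induced by theta
    lam = 0 recovers 2-stage training; very large lam approximates the
    value-term-only baseline.
    """
    return gen_term(theta, data) + lam * value_term(theta, data)
```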

Computing the Generative Term.

We define $\mathcal{L}_{\text{gen}}(\theta; \mathcal{D})$ to denote the log marginal likelihood of the observations, given the actions in $\mathcal{D}$ and the parameters $\theta$:

$$ \mathcal{L}_{\text{gen}}(\theta; \mathcal{D}) = \sum_{n=1}^{N} \log p\big(o_{n,1:T_n} \mid a_{n,1:T_n}, \theta\big). $$

This IO-HMM likelihood marginalizes over uncertainty about the hidden states, can be computed efficiently via dynamic programming routines (Rabiner, 1989), and is also amenable to automatic differentiation.
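The sketch below shows one way to compute this log marginal likelihood for a single trajectory with the scaled forward algorithm, skipping missing observation dimensions; the shapes and indexing conventions are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.stats import norm

def iohmm_log_marginal_lik(obs, actions, pi0, T, mu, sigma):
    """Forward-algorithm log marginal likelihood of one IO-HMM trajectory.
      obs:     (L, D) observations, np.nan marks missing entries
      actions: (L,)   action preceding each observation
      pi0:     (K,)   initial state distribution
      T:       (K, A, K) transitions, mu/sigma: (K, A, D) Gaussian emissions
    """
    def emit_lik(o, a):
        # Product of Gaussian densities over the *observed* dimensions only.
        lik = np.ones(mu.shape[0])
        for d in range(mu.shape[2]):
            if not np.isnan(o[d]):
                lik *= norm.pdf(o[d], loc=mu[:, a, d], scale=sigma[:, a, d])
        return lik

    alpha = pi0 * emit_lik(obs[0], actions[0])
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for t in range(1, len(obs)):
        a = actions[t]
        alpha = (alpha @ T[:, a, :]) * emit_lik(obs[t], a)   # predict, then correct
        log_lik += np.log(alpha.sum())
        alpha /= alpha.sum()                                  # rescale for stability
    return log_lik
```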

Computing the Value Term.

Computation of the value term entails two distinct parts: solving for the optimal policy $\pi_\theta$ given the POMDP model $\theta$, and estimating the value of this policy using OPE and our observed dataset $\mathcal{D}$. We require both to be differentiable to permit gradient-based optimization.

To solve for the policy, we apply a relaxation of PBVI, replacing the argmax operations in the PBVI backup updates with softmax operations. Likewise, we use a stochastic policy instead of a deterministic one, replacing the usual argmax over $\alpha$-vectors with a softmax to select an action probabilistically. However, we emphasize that our framework is general, and other solvers could be used as long as they are differentiable.
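For concreteness, the sketch below shows the action-selection half of this relaxation: a softmax over $\alpha$-vector values at a belief, with probability mass accumulated onto each $\alpha$-vector's action (the softened backup inside the solver follows the same pattern). Names, shapes, and the temperature parameter are illustrative.

```python
import numpy as np

def soft_policy_action_probs(belief, alpha_vectors, alpha_actions, temp=1.0):
    """Stochastic policy from a PBVI-style solution (sketch).
      belief:        (K,)   current belief
      alpha_vectors: (M, K) alpha-vectors
      alpha_actions: (M,)   integer action associated with each alpha-vector
    """
    scores = alpha_vectors @ belief                 # value of each alpha at this belief
    w = np.exp((scores - scores.max()) / temp)
    w /= w.sum()                                    # softmax over alpha-vectors
    n_actions = int(alpha_actions.max()) + 1
    probs = np.zeros(n_actions)
    np.add.at(probs, alpha_actions, w)              # sum softmax weights per action
    return probs
```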

To compute the value of the policy, we use the CWPDIS estimator in Eq. (2). The CWPDIS estimator is a differentiable function of the model parameters $\theta$, and thus our unconstrained objective in Eq. (4) can be optimized via first-order gradient ascent.

4.2 Optimizing the Objective

We optimize the objective via gradient ascent using gradients computed from the full dataset (we do not use stochastic minibatches to avoid the extra variance, but in principle they may be used). We use the Rprop optimizer (Igel and Hüsken, 2003) with default settings. Our objective is challenging due to non-convexity, as even the generative term alone admits many local optima. To improve convergence, we utilize the best of many (in practice, 10) random restarts for POPCORN as well as the 2-stage EM and value term only baselines.
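A skeleton of this training loop might look like the following, using the HIPS autograd package for gradients; plain gradient-ascent steps stand in for Rprop here, and the objective and initializer callables are placeholders to be supplied by the user.

```python
import autograd.numpy as np
from autograd import grad

def fit_with_restarts(objective, init_params, n_restarts=10, n_iters=500, lr=1e-2):
    """Full-batch gradient ascent with random restarts (sketch).
    objective(theta) returns the scalar training objective on the full dataset;
    init_params() returns a fresh random parameter vector."""
    obj_grad = grad(objective)
    best_theta, best_val = None, -np.inf
    for _ in range(n_restarts):
        theta = init_params()
        for _ in range(n_iters):
            theta = theta + lr * obj_grad(theta)   # ascent step (Rprop in the paper)
        val = objective(theta)
        if val > best_val:                         # keep the best restart
            best_theta, best_val = theta, val
    return best_theta, best_val
```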

Stabilizing the Off-Policy Estimate.

Although the CWPDIS estimator was reliable in our simulated environments, on our real dataset it had unusably high variance, as is common with IS estimators. We address this in two ways.

First, we add an extra term to the objective favoring a larger effective sample size (ESS), following Metelli et al. (2018). Our final objective includes an ESS penalty with weight $\lambda_{\text{ESS}}$:

$$ \max_{\theta}\; \mathcal{L}_{\text{gen}}(\theta; \mathcal{D}) + \lambda \left( \hat{V}_{\text{CWPDIS}}(\pi_\theta; \mathcal{D}) - \frac{\lambda_{\text{ESS}}}{\sqrt{\text{ESS}(\pi_\theta; \mathcal{D})}} \right). \tag{5} $$

The ESS can be computed from the IS weights using the standard formula $\text{ESS}_t = (\sum_n \rho_{nt})^2 / \sum_n \rho_{nt}^2$ (Kong, 1992). Since CWPDIS estimates the average discounted reward at each time step $t$, we sum the stepwise ESS values to yield a single ESS term.
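Concretely, the per-step ESS can be computed directly from the same weights used by CWPDIS and then summed, as in this sketch (array names are illustrative):

```python
import numpy as np

def summed_ess(rho, eps=1e-12):
    """Effective sample size summed over time steps (sketch).
    rho: (N, T_max) cumulative importance weights, zero after a trajectory ends.
    Uses Kong's formula (sum_n w_nt)^2 / sum_n w_nt^2 at each step t."""
    ess_t = rho.sum(axis=0) ** 2 / np.maximum((rho ** 2).sum(axis=0), eps)
    return float(ess_t.sum())
```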

Second, we mask a policy’s action probabilities so that only actions with at least a minimum probability under the behavior policy are allowed. This forces the policy to stay relatively close to the behavior policy. As a result, it both improves the reliability of the OPE and provides a notion of “safety,” as only previously seen actions are allowed. This prevents a policy from recommending dangerous or unknown actions.
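A sketch of this masking step is below; the 5% default threshold is an illustrative placeholder, not necessarily the value used in the paper.

```python
import numpy as np

def mask_policy(policy_probs, behavior_probs, threshold=0.05, eps=1e-12):
    """Restrict a proposed policy to actions the behavior policy takes with at
    least `threshold` probability (sketch). Both inputs: (A,) arrays."""
    allowed = behavior_probs >= threshold
    masked = np.where(allowed, policy_probs, 0.0)
    total = masked.sum()
    if total < eps:                 # fall back to the behavior policy if nothing is allowed
        return behavior_probs
    return masked / total           # renormalize over the allowed actions
```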

Hyperparameter Selection.

The key hyperparameter for our approach is the scalar Lagrange multiplier $\lambda$. We generally try a range of 3–5 values of $\lambda$ per environment, spaced evenly on a log scale. In practice, rescaling the generative term by the number of observed scalars in the dataset keeps magnitudes similar across different datasets and keeps the range of $\lambda$ values to explore more consistent. We must also select the weight $\lambda_{\text{ESS}}$ on the ESS penalty, which we find is only necessary in our real-data experiments. We again experiment with 3–5 values evenly spaced on a log scale, and combine the ESS penalty with the overall hyperparameter $\lambda$ (i.e. the overall weight on the ESS term is $\lambda \cdot \lambda_{\text{ESS}}$). Lastly, we set modest Dirichlet priors on the transitions $\tau$ and vague normal priors on the emission means $\mu$, and add these as an additional term in the overall loss.

5 Results on Simulated Domains

We first evaluate POPCORN on several simulated environments to validate its utility across a range of model misspecifications, as well as on a harder sepsis simulator task. For the synthetic experiments in this section, everything is conducted in the batch, off-policy setting. The simulator is only used to produce the initial data set, and to evaluate the final policy after training concludes. We separate each experiment into a description of procedure and highlights of the results.

Our goal is to develop a method to learn simple—and therefore interpretable—models that perform robustly in misspecified settings. As such, we compare against an approach that does not attempt to model the dynamics (value-only), an approach that first learns the dynamics and then plans (2-stage), and a known optimal solution (when available). In all cases, we are interested in how these methods trade off between explaining the data well (log marginal likelihood of data) and making good decisions (policy value).

5.1 Synthetic Domains with Misspecified Models: Description

Figure 1: Solutions from all competitor methods in the 2D fitness landscape (policy value on y-axis; log marginal likelihood on x-axis). An ideal method would score in the top right corner of each plot. Left: Results from Tiger with Irrelevant Noise Dimensions. Middle: Results from Tiger with Missing Data. Right: Results from Tiger with Wrong Likelihood.

We demonstrate how POPCORN overcomes various kinds of model misspecification in the context of the classic POMDP tiger problem (Kaelbling et al., 1998). The classic tiger problem consists of a room with 2 doors: one door is safe, and the other has a tiger behind it. The agent has 3 possible actions: open one of the doors, thereby ending the episode, or listen for noisy evidence about which door is safe to open. Revealing a tiger incurs a large negative reward, opening the safe door yields a positive reward, and listening incurs a small cost. The goal is to maximize rewards over many repeated trials, with the “safe” door’s location randomly chosen each time.

In all cases, we set the discount factor $\gamma < 1$ to encourage the agent to act quickly. We collect training data from a random policy that first listens for 5 time steps, and then randomly either opens a door or listens again. We train in the batch setting given a single collection of 1000 trajectories of length 5–15. After optimization completes, we evaluate each policy via an additional 1000 Monte Carlo rollouts, as we are able to simulate trajectories under our learned policy in these simulated settings.

Tiger with Irrelevant Noise: Finding dimensions that signal reward. We assume that whenever the agent listens for the tiger, it receives a 2-dimensional observation. The first dimension provides a noisy signal as to the location of the safe door: it is drawn from a Gaussian whose mean is the safe door’s index. The second dimension is irrelevant to the safe door’s location: it is drawn from a Gaussian whose mean is unrelated to the doors and changes in each trial. Thus, many more latent states than we allow would be needed to perfectly explain both the relevant and irrelevant signals across all trials.

We measure performance while allowing only a small number of latent states, to assess how each method spends its limited capacity across the generative and reward-seeking goals. We expect the 2-stage baseline to spend its states on the irrelevant dimension, as it has lower standard deviation (0.1 vs. 0.3 for the signal dimension) and thus matters more for maximizing likelihood. In contrast, we expect POPCORN to focus on the relevant signal dimension and recover high-value policies.

Tiger with Missing Data: Finding relevant dimensions when some data is missing. We continue with the previous setting, in which the listen action produces 2-dimensional observations, where the first dimension signals the safe door’s location and the second is irrelevant. However, this time the dimension with the relevant signal is often missing: we select 80% of the signal observations uniformly at random and mark them as missing. This simulates clinical settings where some measurements are rare but useful (e.g. lab tests), while others are common but not directly useful (e.g. temperature for a blood pressure task).

The expected outcome with a limited number of states is that a 2-stage approach driven by maximizing likelihood would prefer to model the always-present irrelevant dimension. In contrast, POPCORN should learn to favor the signal dimension even though it is rarely available and contributes less overall to the likelihood. This ability to gracefully handle missing dimensions is a natural property of generative models and would not be easily achieved with a model-free approach.

Tiger with Wrong Likelihood: Overcoming a misspecified density model. Finally, we consider a situation in which our generative model’s density family cannot match the true observation distribution. This time, the listen action produces a 1-dimensional observation whose true distribution is a truncated mixture of two Gaussians. If the first door is safe, listening results in strictly negative observations; if the second door is safe, listening results in strictly positive observations.

While the true observation densities are not Gaussian, we fit POMDP models with Gaussian likelihoods and a small number of states. We expect POPCORN to still deliver high-value policies, even though the likelihood will be suboptimal.

5.2 Synthetic Domains with Misspecified Models: Conclusions

Across all 3 variants of the Tiger problem, we observe several common conclusions from Fig. 1. Together, these results demonstrate how POPCORN is robust to many different kinds of model misspecification.

POPCORN delivers higher policy value than 2-stage (EM then PBVI).

Across all 3 panels of Fig. 1, POPCORN (red) delivers higher value (y-axis) than the 2-stage baseline (purple).

Only optimizing for value hurts generative quality.

In 2 of the 3 panels, the value-only baseline (green) has noticeably worse likelihood (x-axis) than POPCORN; the far-right panel shows indistinguishable performance. Notably, optimizing this objective is significantly less stable than the full POPCORN objective. This aligns with Levine and Koltun (2013), who note that gradient-based optimization of importance sampling estimates is notoriously challenging.

POPCORN solutions are consistent with manually-designed solutions.

In all 3 panels, POPCORN (red) is the closest method to the ideal manually-designed solution (gray), indicating our optimization procedures are effective.

5.3 Sepsis Simulator: Medically-motivated environment with known ground truth.

We now move from simple toy problems—each designed to demonstrate a particular robustness of our method—to a more challenging simulated domain. In real-world medical decision-making tasks, it is impossible to evaluate the value of a learned policy using data collected under that policy’s decisions. However, in a simulated setting, we can evaluate any given policy to assess its true value. We emphasize that each policy is still learned in the batch setting, but after optimization we use the simulator to allow for accurate evaluation of policy values.

Specifically, we use the simulator from Oberst and Sontag (2019), which is a coarse physiological model for sepsis with discrete-valued observations: 4 vitals and an indicator for diabetes. The true simulator is governed by an underlying known Markov decision process (MDP) with 1440 possible discrete states. To make the environment more challenging, we add a small amount of Gaussian noise to each observation to make it continuous-valued. The environment has 8 actions (all combinations of 3 binary treatments), and is run for at most 20 time steps. Rewards are sparse, with 0 reward at intermediate time steps and a positive or negative reward at the terminal time step depending on the patient outcome.

The true discrete-state MDP for this environment can be solved via exact value iteration. We then generate trajectories under an ε-greedy behavior policy, so that each non-optimal action has a small probability of being taken at each time step. Given the observed trajectories, we learn POMDP models with a modest number of latent states.

Results and Conclusions. The results in Figure 2 again show POPCORN delivers higher value than 2-stage EM. Additionally, while the value-term-only baseline learns a policy on par with POPCORN, the corresponding likelihood is substantially lower.

Figure 2: Quantitative results from the sepsis simulator. An ideal method would score in the top right corner. No method recovers a fully optimal policy (grey line), but POPCORN consistently outperforms 2-stage EM, whose policy barely outperforms one that takes actions uniformly at random.

6 Application to Hypotension Management

To showcase the utility of our method on a real-world medical decision-making task, we apply POPCORN to the challenging problem of managing acutely hypotensive patients in the ICU. Although hypotension is associated with high morbidity and mortality (Jones et al., 2006), management of these patients is difficult and treatment strategies are not standardized, in large part because there are many potential underlying causes of hypotension. Girkar et al. (2018) attempted to predict the efficacy of fluid therapy for hypotensive patients, and found considerable noise in assessing whether the treatment would be successful. We study the same trade-offs between generative and reward-seeking performance as in Sec. 5. We further perform an in-depth evaluation of the learned policy and our confidence in it (via effective sample sizes).

6.1 Description

Cohort. We use 10,142 ICU stays from MIMIC-III (Johnson et al., 2016), filtering to adult patients with at least 3 abnormally low mean arterial pressure (MAP) values in the first 72 hours of ICU admission. Our observations consist of 9 vitals and laboratory measurements: MAP, heart rate, urine output, lactate, Glasgow coma score, serum creatinine, FiO2, total bilirubin, and platelet count. We discretized time into 1-hour windows and set up the RL task to begin 1 hour after ICU admission, to ensure a sufficient amount of data exists before starting a policy. Trajectories end either at ICU discharge or at 72 hours into the ICU admission, so at most 71 actions are taken. This problem formulation was made in consultation with a critical care physician, who advised that most acute cases of hypotension present early during an ICU admission. We expressly do not impute missing observations; only new observation measurements contribute to the overall likelihood.

Setup. Our action space consists of the two main treatments for acute hypotension: fluid bolus therapy and vasopressors, both of which are designed to quickly raise blood pressure and increase perfusion to the organs. We discretize fluids into 4 doses (none, low, medium, high) and vasopressors into 5 doses (none, low, medium, high, very high), for a total of 20 discrete actions. To assign rewards to individual time steps, we use a simple piecewise-linear reward function created in consultation with a critical care physician. A MAP of 65 mmHg is a common target level (Asfar et al., 2014), so if an action is taken and the next MAP is 65 mmHg or higher, the next reward is +1, the highest possible value. Otherwise, rewards decrease as MAP values drop, with the lowest MAP values yielding a reward of 0, the smallest possible value. See the supplement for a plot of this reward function.
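To illustrate the shape of such a reward, here is a hypothetical single-segment piecewise-linear function of the next MAP value; the 40 mmHg floor and the single breakpoint are our own assumptions, and the paper's actual function (plotted in the supplement) has additional inflection points.

```python
import numpy as np

def map_reward(next_map, target=65.0, floor=40.0):
    """Hypothetical piecewise-linear MAP reward (sketch): +1 at or above the
    65 mmHg target, decreasing linearly to 0 at an assumed 40 mmHg floor."""
    if next_map >= target:
        return 1.0
    return float(np.clip((next_map - floor) / (target - floor), 0.0, 1.0))
```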

We split the dataset into 5 distinct test sets for cross-validation, and throughout present results on the test sets, with standard errors across folds where appropriate. We fix the discount factor $\gamma$ and the probability threshold for the “safety” mask (any action whose estimated behavior probability falls below this threshold is forbidden). Lastly, we study several possible values of the Lagrange multiplier $\lambda$. For each fold we use the best of 10 random restarts, selected by training objective value.

Figure 3: Quantitative results from the hypotension data, showing the trade-offs between policy value and likelihood (an estimate of ESS is in legend).
Figure 4: Forecasting results. Top to bottom, left to right: MAP (scale zoomed out); MAP (value-only out of pane); lactate; urine output (scale zoomed out); urine output (value-only out of pane); heart rate.
Figure 5: Visualization of learned MAP distributions. Left: 2-stage EM. Middle: POPCORN. Right: Value-only. Each subplot visualizes all learned distributions of MAP values for a given method across the actions and states. Each pane in a subplot corresponds to a different action, and shows distributions across the states. Vasopressor actions vary across rows, and IV fluid actions vary across columns.
Figure 6: Action probabilities for the behavior policy, a value-only policy, a POPCORN policy, and a 2-stage EM policy. Actions are split from the full set of 20 actions by type. Left: action probabilities for the 4 doses of IV fluids. Right: action probabilities for the 5 doses of vasopressors.

6.2 Conclusions

POPCORN achieves the best balance of high-performing policies and high-likelihood models. As in previous environments, Figure 3 demonstrates how POPCORN trades off between generative and decision-making performance, with darker red indicating higher values of $\lambda$ and thus improved policy values. The policy values for the 2-stage baseline and the likelihood values for the value-only baseline both substantially underperform POPCORN.

POPCORN enables reasonably accurate forecasting. To demonstrate the ability of the models to predict future observations, Figure 4 shows results from a forecasting experiment. Each method is given the first 12 hours of a trajectory, and then must predict observations up to 12 hours into the future. Importantly, only measured observations are used to calculate the mean absolute error between model predictions and the true values. Unsurprisingly, the 2-stage baseline generally performs the best, although POPCORN with small values of $\lambda$ often performs similarly. On the other hand, the value-only baseline fares significantly worse. For some observations (MAP and urine output; see the left-most column of Figure 4), it makes nonsensical predictions far outside the range of observed data, with errors several orders of magnitude worse than POPCORN and the 2-stage baseline.

POPCORN enables inspection of whether learned models are clinically sensible. We visualize the learned emission distributions for MAP across the states and actions for each method in Figure 5. Note that densities may appear non-Gaussian, as they were modeled on a log scale and are shown back-transformed to the scale of the data. POPCORN’s distributions are more spread out and better differentiate between states than those of the 2-stage baseline, which learns very similar states with high overlap. As a result, the 2-stage policy ends up recommending similar actions for most patients. Value-only learns states that are even more diverse, allowing it to learn an effective policy but at the expense of not modeling the observed data well. Additional results for lactate, urine output, and heart rate can be found in the appendix. Although these results are highly exploratory, such simple visualizations of what the models have learned are only possible due to the white-box nature of our HMM-based approach, in contrast to deep reinforcement learning methods.

Lastly, Figure 6 visualizes the action probabilities for the behavior policy, a value-only policy, a POPCORN policy, and a 2-stage policy. In general, the POPCORN policy most closely aligns with the behavior, although it is also quite similar to value-only. On the other hand, the 2-stage policy seems in general much more conservative and tends to have lower probabilities on more aggressive actions. In future work we plan to work with clinical partners to explore individual patient trajectories, digging into where, how, and why these treatment policies differ.

7 Discussion

In this paper we propose POPCORN, an improved optimization objective for off-policy batch RL with partial observability. POPCORN can balance the trade-off between learning a model with high likelihood and a model that is well-suited for planning, even if evaluation must be done off-policy on historical data. Synthetic experiments demonstrate the primary advantages of our approach over alternatives: POPCORN achieves good policies and decent models even in the face of misspecification (in the number of states, the choice of the likelihood, or the availability of data). Performance on a real-world medical problem of hypotension management suggests we may be able to learn a policy on par with or even slightly better than the observed clinician behavior policy. Future directions include scaling our methods to environments with more complex state structures or long-term temporal dependencies, investigating semi-supervised settings where not all sequences have observed rewards, and learning Pareto-optimal policies that balance multiple competing reward signals. We hope that methods such as ours may ultimately become decision support tools that are integrated into actual clinical practice to assist providers in improving the treatment of critically ill patients.

Acknowledgements

FDV and JF acknowledge support from NSF Project 1750358. JF additionally acknowledges Oracle Labs, a Harvard CRCS fellowship, and a Harvard Embedded EthiCS fellowship.

References

  • García et al. (2015) M. I. M. García, P. G. González, M. G. Romero, A. G. Cano, C. Oscier, A. Rhodes, R. M. Grounds, and M. Cecconi. Effects of fluid administration on arterial load in septic shock patients. Intensive care medicine, 41(7):1247–1255, 2015.
  • Gottesman et al. (2019) O. Gottesman, F. Johansson, M. Komorowski, A. Faisal, D. Sontag, F. Doshi-Velez, and L. A. Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25, 2019. URL https://doi.org/10.1038/s41591-018-0310-5.
  • Kaelbling et al. (1998) L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
  • Thomas (2015) P. S. Thomas. Safe Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, 2015. URL https://people.cs.umass.edu/~pthomas/papers/Thomas2015c.pdf.
  • Shortreed et al. (2011) S. M. Shortreed, E. Laber, D. J. Lizotte, T. S. Stroup, J. Pineau, and S. A. Murphy. Informing sequential clinical decision-making through reinforcement learning: An empirical study. Machine learning, 84(1-2):109–136, 2011. ISSN 0885-6125. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3143507/.
  • Komorowski et al. (2018) M. Komorowski, L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716, 2018.
  • Raghu et al. (2017) A. Raghu, M. Komorowski, L. A. Celi, P. Szolovits, and M. Ghassemi. Continuous State-Space Models for Optimal Sepsis Treatment: A Deep Reinforcement Learning Approach. In Machine Learning for Healthcare Conference, pages 147–163, 2017. URL http://proceedings.mlr.press/v68/raghu17a.html.
  • Prasad et al. (2017) N. Prasad, L. Cheng, C. Chivers, M. Draugelis, and B. E. Engelhardt. A Reinforcement Learning Approach to Weaning of Mechanical Ventilation in Intensive Care Units. In Uncertainty in Artificial Intelligence, page 10, 2017. URL http://auai.org/uai2017/proceedings/papers/209.pdf.
  • Ernst et al. (2006) D. Ernst, G. Stan, J. Goncalves, and L. Wehenkel. Clinical data based optimal sti strategies for hiv: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 667–672. IEEE, 2006.
  • Martín-Guerrero et al. (2009) J. D. Martín-Guerrero, F. Gomez, E. Soria-Olivas, J. Schmidhuber, M. Climente-Martí, and N. V. Jiménez-Torres. A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients. Expert Systems with Applications, 36(6):9737–9742, 2009. ISSN 0957-4174. URL http://www.sciencedirect.com/science/article/pii/S0957417409001699.
  • Hauskrecht and Fraser (2000) M. Hauskrecht and H. Fraser. Planning treatment of ischemic heart disease with partially observable Markov decision processes. Artificial Intelligence in Medicine, 18(3):221–244, 2000. ISSN 09333657. URL https://linkinghub.elsevier.com/retrieve/pii/S0933365799000421.
  • Li et al. (2018) L. Li, M. Komorowski, and A. A. Faisal. The Actor Search Tree Critic (ASTC) for Off-Policy POMDP Learning in Medical Decision Making. In arXiv:1805.11548 [Cs], 2018. URL http://arxiv.org/abs/1805.11548.
  • Oberst and Sontag (2019) M. Oberst and D. Sontag. Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models. In International Conference on Machine Learning, 2019. URL http://arxiv.org/abs/1905.05824.
  • Peng et al. (2018) X. Peng, Y. Ding, D. Wihl, O. Gottesman, M. Komorowski, L. H. Lehman, A. Ross, A. Faisal, and F. Doshi-Velez. Improving sepsis treatment strategies by combining deep and kernel-based reinforcement learning. In AMIA Annual Symposium Proceedings, volume 2018, page 887. American Medical Informatics Association, 2018.
  • Parbhoo et al. (2017) S. Parbhoo, J. Bogojeska, M. Zazzi, V. Roth, and F. Doshi-Velez. Combining kernel and model based learning for hiv therapy selection. AMIA Summits on Translational Science Proceedings, 2017:239, 2017.
  • Lacoste–Julien et al. (2011) S. Lacoste–Julien, F. Huszár, and Z. Ghahramani. Approximate inference for the loss-calibrated Bayesian. In Artificial Intelligence and Statistics, 2011.
  • Wilder et al. (2019) B. Wilder, B. Dilkina, and M. Tambe. Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. In AAAI Conference on Artificial Intelligence, page 8, 2019.
  • Karkus et al. (2017) P. Karkus, D. Hsu, and W. S. Lee. QMDP-Net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • Igl et al. (2018) M. Igl, L. Zintgraf, T. A. Le, F. Wood, and S. Whiteson. Deep variational reinforcement learning for pomdps. ICML, 2018.
  • Farahmand (2018) A. Farahmand. Iterative value-aware model learning. In Advances in Neural Information Processing Systems, pages 9072–9083, 2018.
  • Rabiner (1989) L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257–286, 1989.
  • Bengio and Frasconi (1995) Y. Bengio and P. Frasconi. An input output HMM architecture. In Advances in neural information processing systems, 1995.
  • Chrisman (1992) L. Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In AAAI, 1992. URL https://www.aaai.org/Papers/AAAI/1992/AAAI92-029.pdf.
  • Doshi-Velez (2012) F. Doshi-Velez. Bayesian Nonparametric Approaches for Reinforcement Learning in Partially Observable Domains. PhD thesis, Massachusetts Institute of Technology, 2012.
  • Sondik (1978) E. J. Sondik. The optimal control of partially observable markov processes over the infinite horizon: discounted costs. Operations Research, 26(2), 1978.
  • Pineau et al. (2003) J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for pomdps. IJCAI, 2003.
  • Hoey and Poupart (2005) J. Hoey and P. Poupart. Solving pomdps with continuous or large discrete observation spaces. IJCAI, 2005.
  • Raghu et al. (2018) A. Raghu, O. Gottesman, Y. Liu, M. Komorowski, A. A. Faisal, F. Doshi-Velez, and E. Brunskill. Behaviour policy estimation in off-policy policy evaluation: Calibration matters. arXiv preprint arXiv:1807.01066, 2018.
  • Chow et al. (2015) Y. Chow, M. Petrik, and M. Ghavamzadeh. Robust policy optimization with baseline guarantees. arXiv preprint arXiv:1506.04514, 2015.
  • Hughes et al. (2018) M. C. Hughes, G. Hope, L. Weiner, T. H. Mccoy, R. H. Perlis, E. Sudderth, and F. Doshi-Velez. Semi-supervised prediction-constrained topic models. In AISTATS, 2018.
  • Igel and Hüsken (2003) C. Igel and M. Hüsken. Empirical evaluation of the improved rprop learning algorithms. Neurocomputing, 50:105–123, 2003.
  • Metelli et al. (2018) A. M. Metelli, M. Papini, F. Faccio, and M. Restelli. Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, pages 5442–5454, 2018.
  • Kong (1992) A. Kong. A Note on Importance Sampling using Standardized Weights. Technical Report 348, University of Chicago Department of Statistics, 1992. URL https://galton.uchicago.edu/techreports/tr348.pdf.
  • Levine and Koltun (2013) S. Levine and V. Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
  • Jones et al. (2006) A. E. Jones, V. Yiannibas, C. Johnson, and J. A. Kline. Emergency department hypotension predicts sudden unexpected in-hospital mortality: a prospective cohort study. Chest, 130(4):941–946, 2006.
  • Girkar et al. (2018) U. M. Girkar, R. Uchimido, L. H. Lehman, P. Szolovits, L. A. Celi, and W. Weng. Predicting blood pressure response to fluid bolus therapy using attention-based neural networks for clinical interpretability. arXiv preprint arXiv:1812.00699, 2018.
  • Johnson et al. (2016) A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035, 2016.
  • Asfar et al. (2014) P. Asfar, F. Meziani, J. Hamel, F. Grelon, B. Megarbane, N. Anguel, J. Mira, P. Dequin, S. Gergaud, and N. Weiss. High versus low blood-pressure target in patients with septic shock. New England Journal of Medicine, 370(17):1583–1593, 2014.
  • Shani et al. (2013) G. Shani, J. Pineau, and R. Kaplow. A survey of point-based pomdp solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51, 2013.


Appendix A Additional Details on Point-Based Value Iteration

Point-based value iteration (PBVI) is an algorithm for efficiently solving POMDPs (Pineau et al., 2003). See Shani et al. (2013) for a thorough survey of related algorithms and extensions in this area.

In PBVI, we do not perform full Bellman backups over the space of all possible beliefs, as this is typically intractable. Instead, we only perform backups at a fixed set of belief points. Given the current set of $\alpha$-vectors $\Gamma$, the value at a belief $b$ after a Bellman backup is

$$ V(b) = \max_a \Big[ \sum_k b(k)\, R(k, a) + \gamma \sum_o p(o \mid b, a) \max_{\alpha \in \Gamma} \sum_{k'} p(k' \mid b, a, o)\, \alpha(k') \Big], \tag{6–7} $$

where $p(o \mid b, a) = \sum_{k, k'} b(k)\, \tau_{k,a,k'}\, p(o \mid k', a)$ and $p(k' \mid b, a, o) \propto \sum_k b(k)\, \tau_{k,a,k'}\, p(o \mid k', a)$. We can then use this computation to efficiently construct the new $\alpha$-vector that would have been optimal for $b$, had we run the complete Bellman backup:

$$ \alpha^b_{a,o} = \arg\max_{\alpha \in \Gamma} \sum_{k'} \alpha(k') \sum_k b(k)\, \tau_{k,a,k'}\, p(o \mid k', a), \qquad \alpha^b(k) = R(k, a^b) + \gamma \sum_o \sum_{k'} \tau_{k,a^b,k'}\, p(o \mid k', a^b)\, \alpha^b_{a^b,o}(k'), \tag{8–9} $$

where $a^b$ is the maximizing action from Eq. (6–7).
PBVI: Sampling Approximation to Deal with Complex Observation Models

In normal PBVI, we are limited by how complex our observation space is. The PBVI backup crucially depends on a summation over observation space (or integration, for continuous observations). Dealing with multi-dimensional, non-discrete observations is generally intractable to compute exactly.

Instead, we utilize ideas from Hoey and Poupart (2005) to circumvent this issue. The main idea is to learn a partition of observation space: conditional on a given belief $b$ and action $a$, we group together all observations that share the same maximizing $\alpha$-vector, i.e. for each $\alpha \in \Gamma$ we want the region $O^b_{a,\alpha} = \{o : \alpha = \arg\max_{\alpha' \in \Gamma} \sum_{k'} \alpha'(k') \sum_k b(k)\, \tau_{k,a,k'}\, p(o \mid k', a)\}$. We can then treat this collection of regions as a set of “meta-observations,” which allows us to replace the intractable sum/integral over observation space with a sum over the number of $\alpha$-vectors, by swapping out the per-observation likelihood term in the backup of Eq. (8–9) with the aggregate probability mass over all observations in each meta-observation. In particular, we can now express the value of a belief as

$$ V(b) = \max_a \Big[ \sum_k b(k)\, R(k, a) + \gamma \sum_{\alpha \in \Gamma} \sum_{k'} \alpha(k') \sum_k b(k)\, \tau_{k,a,k'}\, p\big(o \in O^b_{a,\alpha} \mid k', a\big) \Big]. \tag{10–13} $$

We make use of a sampling approximation that admits arbitrary observation functions in order to approximate these aggregate probabilities of each “meta-observation.”

To do this, we first sample $M$ observations $o^{(i)}_{k',a} \sim p(o \mid k', a)$ for each pair of state $k'$ and action $a$. Then we approximate the aggregate probability $p(o \in O^b_{a,\alpha} \mid k', a)$ by the fraction of sampled observations for which $\alpha$ was the optimal $\alpha$-vector, i.e.

$$ \hat{p}\big(o \in O^b_{a,\alpha} \mid k', a\big) = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\Big[ \alpha = \arg\max_{\alpha' \in \Gamma} \sum_{j} \alpha'(j) \sum_k b(k)\, \tau_{k,a,j}\, p\big(o^{(i)}_{k',a} \mid j, a\big) \Big], \tag{14} $$

where ties are broken by favoring the $\alpha$-vector with the lowest index. Using this approximate discrete observation function, we can perform point-based backups at a set of beliefs as before. Our backup operation is now

$$ \alpha^b(k) = R(k, a^b) + \gamma \sum_{\alpha \in \Gamma} \sum_{k'} \tau_{k,a^b,k'}\; \hat{p}\big(o \in O^b_{a^b,\alpha} \mid k', a^b\big)\; \alpha(k'), \tag{15–17} $$

where $a^b$ is the action maximizing the approximated value in Eq. (10–13) at belief $b$.
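A minimal sketch of the Monte Carlo step in Eq. (14): given, for each sampled observation, the index of the maximizing $\alpha$-vector, the aggregate meta-observation probabilities are just normalized counts (array names are illustrative, and ties are assumed to have been broken toward the lowest index upstream).

```python
import numpy as np

def meta_obs_probs(winning_alpha_idx, n_alphas):
    """Approximate aggregate meta-observation probabilities for one (k', a) pair.
    winning_alpha_idx: (M,) integer array giving, for each sampled observation
    from p(o | k', a), the index of the maximizing alpha-vector."""
    counts = np.bincount(winning_alpha_idx, minlength=n_alphas)
    return counts / counts.sum()   # fraction of samples won by each alpha-vector
```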

Appendix B Additional Results from ICU Hypotension Domain

Figure 7: The true reward function used in the hypotension experiments of Sec. 6. The agent is rewarded for keeping the mean arterial pressure (MAP) above the known good value of 65 mmHg.
Reward function specification

Figure 7 shows the reward function used on this dataset. Note the inflection points at 55 and 60 MAP values. Also, patients who had adequate urine output had a lower threshold for MAP values that start to yield worse rewards.

Figure 8: Visualization of learned heart rate distributions. Left: 2-stage EM. Middle: POPCORN. Right: Value-only. Each subplot visualizes all learned distributions of heart rate values for a given method across the actions and states. Each pane in a subplot corresponds to a different action, and shows distributions across the states. Vasopressor actions vary across rows, and IV fluid actions vary across columns.
Figure 9: Visualization of learned lactate distributions. Left: 2-stage EM. Middle: POPCORN. Right: Value-only. Each subplot visualizes all learned distributions of lactate values for a given method across the actions and states. Each pane in a subplot corresponds to a different action, and shows distributions across the states. Vasopressor actions vary across rows, and IV fluid actions vary across columns.
Figure 10: Visualization of learned urine output distributions. Left: 2-stage EM. Middle: POPCORN. Right: Value-only. Each subplot visualizes all learned distributions of urine output values for a given method across the actions and states. Each pane in a subplot corresponds to a different action, and shows distributions across the states. Vasopressor actions vary across rows, and IV fluid actions vary across columns.

Figures 8, 9, and 10 show additional qualitative results about the learned models for POPCORN, 2-stage EM, and value-only, for the heart rate, lactate, and urine output variables. As in Figure 5 in the main text, we again see that the 2-stage approach largely learns states that exhibit very high overlap. Likewise, the value-only baseline learns states that are much more spread apart, and occasionally even learns bizarre distributions that are close to a point mass at a single value. As expected, POPCORN learns models in between these two extremes, with states diverse enough to learn a good policy while also fitting the data decently well.