1 Introduction
Reinforcement learning (RL) has the potential to assist with health contexts that involve sequences of decisions, especially in settings where no evidence-based guidelines exist. For example, in this work we will focus on the task of managing patients with acute hypotension, a life-threatening emergency that occurs when a patient’s blood pressure drops to dangerously low levels. In these situations, it is not always clear which treatment might be most effective for a particular patient in a particular context, in what amount, when, and for how long (García et al., 2015). That said, applying RL to healthcare settings is challenging, as highlighted by Gottesman et al. (2019). Two key considerations for our chosen clinical setting will be the fact that decisions must be made under partial observability—the current observations of a patient don’t capture their history—and learning must occur in batch—we must learn from retrospective data alone.
We will further focus this work on pushing the limits of model-based approaches with discrete hidden state representations. In addition to being able to recommend actions (our primary goal), building generative models allows us to create forecasts of future observations (a form of validation), learn in the presence of missing data (frequent in clinical settings), and generally learn more sample-efficiently than model-free approaches (important as we are often data-limited in clinical settings). Building directly-inspectable models (via simple, discrete structures) further tends to increase sample efficiency over deep models, and also enables easy inspection for clinical sensibility.
Specifically, we introduce POPCORN, or Partially Observed Prediction COnstrained ReiNforcement learning, a new optimization objective for the well-known partially observable Markov decision process (POMDP) (Kaelbling et al., 1998). POMDPs have traditionally been trained in a two-stage process, where the first stage is generally learned by maximizing the likelihood of observations and is not tied to the decision-making task. However, the two-stage approach can fail to find good policies when the model is (inevitably) misspecified; in particular, the maximum likelihood model may spend effort on modeling irrelevant signal rather than signal that matters for the task at hand. We demonstrate this effect, and also demonstrate how POPCORN, which constrains maximum likelihood training of the POMDP generative model so that the value of that model’s induced policy achieves satisfactory performance, does not suffer from these issues. Our approach is compatible with both on-policy and off-policy value estimation, making use of differentiable importance sampling estimators (Thomas, 2015).

2 Related Work
Reinforcement learning in healthcare.
Healthcare applications of reinforcement learning have proliferated in recent years, in diverse clinical areas such as management of schizophrenia (Shortreed et al., 2011), sepsis (Komorowski et al., 2018; Raghu et al., 2017), mechanical ventilation (Prasad et al., 2017), HIV (Ernst et al., 2006), and dialysis (Martín-Guerrero et al., 2009). However, most health applications of RL focus on model-free approaches, largely because learning accurate models from noisy biological data is challenging. In addition, all of these works assume full observability, which is typically not accurate in health settings.
A few prior works have explicitly modeled partial observability. POMDPs have been applied to heart disease management (Hauskrecht and Fraser, 2000), sepsis treatment in off-policy or simulated settings (Li et al., 2018; Oberst and Sontag, 2019; Peng et al., 2018), and HIV management (Parbhoo et al., 2017). All of these approaches take a two-stage approach to learning. In contrast, we are decision-aware through the whole optimization process.
End-to-end learning for RL.
“End-to-end” optimization methods that directly incorporate a downstream decision-making task during model training are an area of growing popularity, be it for graphical models (Lacoste-Julien et al., 2011) or submodular optimization (Wilder et al., 2019). Within RL, decision-aware optimization efforts have explored partially-observed problems in both model-free (Karkus et al., 2017) and model-based settings (Igl et al., 2018).
These efforts differ from ours in two key respects. First, they exclusively focus on on-policy settings for environments such as Atari where data are easily collected. Second, these methods all rely heavily on black-box neural networks for feature extraction, which are neither sample-efficient nor easily interpreted. In many cases (e.g. Karkus et al. (2017)), the model is treated as an abstraction and there is no way to set the importance of the model’s ability to accurately generate trajectories. Perhaps closest in spirit to our approach is recent work on value-aware model learning in RL (Farahmand, 2018), although their contribution is mainly theoretical.

3 Background
Here we describe the standard two-stage approach for model-based RL, where the first stage learns the model and the second stage solves for a policy given the model.
POMDP Model.
We consider a POMDP with $K$ discrete latent states and $A$ discrete actions. We assume we have continuous $D$-dimensional observations and deterministic rewards. The overall model parameters are $\theta = \{\tau, \mu, \sigma\}$. The transition parameter $\tau_{k,a,k'}$ describes the probability of moving to the next (unobserved) state $s_{t+1} = k'$, given the current state $s_t = k$ and action $a_t = a$. Emission parameters $\mu_{k,a,d}$ and $\sigma^2_{k,a,d}$ provide the mean and variance for observation dimension $o_{t+1,d}$, when in state $k$ and given prior action $a$ (assumed in this work to be independent Gaussians across the $D$ dimensions). Concretely, we assume the following generative model for states and observations, indexed by time $t$:
$$s_t \mid s_{t-1}, a_{t-1} \sim \text{Cat}\big(\{\tau_{s_{t-1},a_{t-1},k}\}_{k=1}^K\big), \qquad o_{t,d} \mid s_t, a_{t-1} \sim \mathcal{N}\big(\mu_{s_t,a_{t-1},d},\ \sigma^2_{s_t,a_{t-1},d}\big) \quad (1)$$
We let $n$ index each trajectory in a dataset $\mathcal{D}$ of $N$ sequences, and let $T_n$ denote the length of trajectory $n$. Completing the POMDP specification are the deterministic reward function $R(s,a)$ specifying the reward from taking action $a$ in state $s$, and the discount factor $\gamma \in [0,1)$.
Given a POMDP, we can compute the posterior belief-vector $b_t$ over the current state given all actions and observations until time $t$, i.e. $b_t(k) = p(s_t = k \mid o_{1:t}, a_{0:t-1})$. The belief is a sufficient statistic for the entire previous history, and can be computed efficiently via forward filtering algorithms (Rabiner, 1989). We can also solve the POMDP using a planning algorithm to learn a policy $\pi$, yielding a function that returns an action (or distribution over actions for stochastic policies) for any queried belief vector. The ultimate goal is to learn a policy that will maximize the sum of discounted future rewards $\sum_t \gamma^t r_t$.
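As an illustration, one forward-filtering step can be sketched as follows. This is a minimal NumPy sketch under our reconstructed notation; the array layouts are our own convention, not the paper's:

```python
import numpy as np

def belief_update(b, a, o, tau, mu, sigma):
    """One step of forward filtering for a discrete-state POMDP.

    b     : (K,) current belief over states
    a     : int, action just taken
    o     : (D,) next observation
    tau   : (K, A, K) transition probabilities tau[k, a, k']
    mu    : (K, A, D) emission means (state, prior action, dim)
    sigma : (K, A, D) emission standard deviations
    """
    # Predict: push the belief through the transition model.
    b_pred = b @ tau[:, a, :]                                   # (K,)
    # Correct: weight each state by the Gaussian log-likelihood of o.
    log_lik = -0.5 * np.sum(((o - mu[:, a, :]) / sigma[:, a, :]) ** 2
                            + np.log(2 * np.pi * sigma[:, a, :] ** 2),
                            axis=1)                             # (K,)
    b_new = b_pred * np.exp(log_lik - log_lik.max())
    return b_new / b_new.sum()
```

Iterating this update over a trajectory yields the belief at every time step in a single forward pass.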
Learning Parameters: Input-Output HMM.
The model in Eq. (1) is an input-output hidden Markov model (IOHMM) (Bengio and Frasconi, 1995), where the actions are the inputs and the observations are the outputs. Thus, the model parameters that maximize the marginal likelihood of observed histories can be efficiently computed using the EM algorithm and standard dynamic programming subroutines for HMMs (Rabiner, 1989; Chrisman, 1992). If one is using Bayesian approaches, efficient algorithms for sampling from the posterior over POMDP models also exist (Doshi-Velez, 2012). Deterministic rewards can be estimated in a later stage that minimizes squared error.

Solving for the Policy.
The value function of a discrete-state POMDP can be approximated arbitrarily closely as the upper envelope of a finite set of linear functions of the belief (Sondik, 1978). However, exact solutions are intractable for even very small POMDPs. In this work, we start with point-based value iteration (PBVI) (Pineau et al., 2003), which performs approximate Bellman backups at a fixed set of belief points. For the moderate state space sizes required for our applications, PBVI is an efficient solver. We require two key innovations beyond standard PBVI. First, we adapt ideas from Hoey and Poupart (2005) to handle continuous observations. Second, we relax the algorithm so each step in the solver is differentiable. Additional details can be found in the supplement.
Off-Policy Value Estimation.
Let $\pi^*_\theta$ be the optimal policy given model parameters $\theta$; that is, the policy that maximizes expected future rewards, obtained by solving the model. However, the fact that $\pi^*_\theta$ is optimal for a learned model does not mean it is optimal in reality. In the batch setting, we lack the ability to interact with the environment and collect trajectories by following a specific policy. Instead, we must turn to off-policy evaluation (OPE) to estimate a policy’s value. Let $\pi_b$ be the behavior policy, that is, the policy under which the observed data were collected (e.g. clinician behavior).¹
¹ In our experiments with simulated environments, we assume $\pi_b$ is given. In the real data setting, we estimate the behavior policy following the k-nearest neighbors approach of Raghu et al. (2018).
Let $\mathcal{D}$ be a set of $N$ trajectories collected under the behavior policy $\pi_b$. Then, the consistent weighted per-decision importance sampling (CWPDIS; Thomas, 2015) estimate of the value of our proposed policy $\pi$ is given by:
$$\hat{V}_{\text{CWPDIS}}(\pi) = \sum_{t} \gamma^{t}\, \frac{\sum_{n} \rho_{nt}\, r_{nt}}{\sum_{n} \rho_{nt}}, \qquad \rho_{nt} = \prod_{s=1}^{t} \frac{\pi(a_{ns} \mid b_{ns})}{\pi_b(a_{ns} \mid h_{ns})} \quad (2)$$
In general, importance sampling (IS) estimators have lower bias than other approaches but suffer from high variance. Another class of OPE methods learns a separate model to simulate trajectories in order to estimate policy values (e.g. Chow et al. (2015)), but typically suffers from high bias in real-world, noisy settings.
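For concreteness, the CWPDIS estimate of Eq. (2) can be computed from per-step action probabilities as follows. This is a minimal NumPy sketch; the equal-length padding convention is our own assumption:

```python
import numpy as np

def cwpdis_value(pi_probs, pib_probs, rewards, gamma=0.99):
    """CWPDIS estimate of a policy's value from behavior-policy trajectories.

    pi_probs  : (N, T) probability the evaluated policy assigns to each taken action
    pib_probs : (N, T) probability the behavior policy assigned to the same action
    rewards   : (N, T) observed rewards
    Trajectories are assumed padded to length T with pi/pib = 1 and reward = 0.
    """
    # Cumulative product of per-step ratios gives the IS weights rho_{n,t}.
    rho = np.cumprod(pi_probs / pib_probs, axis=1)   # (N, T)
    # Per-timestep weighted average of rewards, then a discounted sum over time.
    num = (rewards * rho).sum(axis=0)                # (T,)
    den = rho.sum(axis=0)                            # (T,)
    t = np.arange(rewards.shape[1])
    return np.sum(gamma ** t * num / den)
```

Note that normalizing by the summed weights at each step (rather than by N) is what makes the estimator "consistent weighted", trading a small bias for much lower variance.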
4 Prediction-Constrained POMDPs
We now introduce POPCORN, our proposed prediction-constrained optimization framework for learning POMDPs. We seek to learn parameters that assign high likelihood to the observed data while also yielding a policy that has high value. As noted in Sec. 2, previous approaches for learning POMDPs generally fall into two categories. Two-stage methods (e.g. Chrisman (1992)) that first learn a model and then solve it often fail to find good policies under severe model misspecification. End-to-end methods (e.g. Karkus et al. (2017)) that focus only on the “discriminative” task of policy learning fail to produce accurate generative models of the environment. They also generally lack the ability to handle missing observations, which is especially problematic in medical contexts where there is often missing data.
Our approach offers a balance between these purely maximum likelihood-driven (generative) and purely reward-driven (discriminative) extremes. We retain the strengths of the generative approach—the ability to plan under missing observations, simulate accurate dynamics, and inspect model parameters to inspire scientific hypotheses—while benefiting from model parameters that are directly informed by the decision task, as in end-to-end frameworks.
4.1 POPCORN Objective
Our proposed framework seeks POMDP model parameters $\theta$ that maximize the log marginal likelihood of the observed dataset $\mathcal{D}$, while enforcing that the solved policy’s (estimated) value is high enough to be useful. Formally, we seek a POMDP parameter vector $\theta$ that solves the constrained optimization problem:
$$\max_\theta \ \mathcal{L}_{\text{gen}}(\theta) \quad \text{subject to} \quad \mathcal{L}_{\text{value}}(\theta) \geq \epsilon \quad (3)$$
with $\mathcal{L}_{\text{gen}}$ and $\mathcal{L}_{\text{value}}$ functions defined below. The scalar threshold $\epsilon$ defines a minimum value the policy must achieve (to be determined by a task expert).
If we had a perfect optimizer for Eq. (3), we would prefer this constrained formulation as it best expresses our model-fitting goals: we want as good a generative model as possible, but we will not accept poor decision-making. The constraint in Eq. (3) is similar to the prediction-constrained objective used by Hughes et al. (2018) for optimizing supervised topic models; here we apply it in a batch, off-policy RL context.
In practice, solving constrained problems is challenging, so we transform to an equivalent unconstrained objective using a Lagrange multiplier $\lambda \geq 0$:
$$\max_\theta \ \mathcal{L}_{\text{gen}}(\theta) + \lambda\, \mathcal{L}_{\text{value}}(\theta) \quad (4)$$
Here, setting $\lambda = 0$ recovers classic two-stage training’s purely generative focus on $\mathcal{L}_{\text{gen}}$, while the limit $\lambda \to \infty$ approximates end-to-end approaches that focus exclusively on policy value. In our experiments, we compare against both of these baseline approaches, referring to the $\lambda = 0$ case as “2-stage EM”, and the $\lambda \to \infty$ case as “Value term only” (in practice, we do this by only optimizing the $\mathcal{L}_{\text{value}}$ term).
Computing the Generative Term.
We define $\mathcal{L}_{\text{gen}}(\theta)$ to denote the log marginal likelihood of observations, given the actions in $\mathcal{D}$ and the parameters $\theta$:
$$\mathcal{L}_{\text{gen}}(\theta) = \sum_{n=1}^{N} \log p\big(o^{(n)}_{1:T_n} \mid a^{(n)}_{0:T_n-1}, \theta\big)$$
This IOHMM likelihood marginalizes over uncertainty about the hidden states, can be computed efficiently via dynamic programming routines (Rabiner, 1989), and is also amenable to automatic differentiation.
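A minimal sketch of this dynamic-programming computation for a single trajectory follows. We use our own simplified convention in which each observation is conditioned on the action index stored at the same step; the paper's exact indexing may differ:

```python
import numpy as np

def log_marginal_lik(obs, actions, pi0, tau, mu, sigma):
    """Forward-algorithm log marginal likelihood for one IOHMM trajectory.

    obs     : (T, D) observations;  actions : (T,) prior action indices
    pi0     : (K,) initial state distribution
    tau     : (K, A, K) transitions;  mu, sigma : (K, A, D) Gaussian emissions
    """
    def log_emit(t):
        # (K,) log-density of obs[t] under each state's Gaussian emission.
        a = actions[t]
        return -0.5 * np.sum(((obs[t] - mu[:, a, :]) / sigma[:, a, :]) ** 2
                             + np.log(2 * np.pi * sigma[:, a, :] ** 2), axis=1)

    log_alpha = np.log(pi0) + log_emit(0)
    for t in range(1, len(obs)):
        a = actions[t]
        # logsumexp over previous states, then add the emission term.
        m = log_alpha.max()
        log_alpha = m + np.log(np.exp(log_alpha - m) @ tau[:, a, :]) + log_emit(t)
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())
```

Because every operation here is a differentiable array computation, the same recursion can be written in an autodiff framework to obtain gradients of the likelihood with respect to the parameters.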
Computing the Value Term.
Computation of the value term $\mathcal{L}_{\text{value}}(\theta)$ entails two distinct parts: solving for the optimal policy $\pi^*_\theta$ given the POMDP model $\theta$, and estimating the value of this policy using OPE and our observed dataset $\mathcal{D}$. We require both to be differentiable to permit gradient-based optimization.
To solve for the policy, we apply a relaxation of PBVI, replacing the argmax operations in the PBVI backup updates with softmax operations. Likewise, we use a stochastic policy instead of a deterministic one, replacing the usual argmax over $\alpha$-vectors with a softmax to select an action probabilistically. However, we emphasize that our framework is general and other solvers could be used as long as they are differentiable.
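To make the relaxation concrete, a softmax-relaxed point-based backup might look like the following sketch. For clarity we use discrete observations rather than the continuous-observation extension described above, so this is an illustrative simplification, not the paper's solver:

```python
import numpy as np

def softmax(x, temp=0.01):
    # Low-temperature softmax: a differentiable stand-in for argmax.
    z = (x - x.max()) / temp
    e = np.exp(z)
    return e / e.sum()

def soft_pbvi_backup(B, alphas, R, tau, obs_prob, gamma=0.99, temp=0.01):
    """One softmax-relaxed PBVI backup at a fixed set of belief points.

    B        : (n_b, K) belief points;  alphas : (n_old, K) current alpha-vectors
    R        : (K, A) rewards;  tau : (K, A, K) transitions
    obs_prob : (K, A, O) p(o | s', a), with discrete observations for clarity
    Returns a new (n_b, K) set of alpha-vectors, one per belief point.
    """
    n_b, K = B.shape
    _, A = R.shape
    O = obs_prob.shape[2]
    new_alphas = np.zeros((n_b, K))
    for i, b in enumerate(B):
        vals = np.zeros(A)
        cand = np.zeros((A, K))
        for a in range(A):
            alpha_a = R[:, a].copy()
            for o in range(O):
                # Backed-up vectors for each existing alpha; a softmax-weighted
                # average replaces PBVI's argmax over those vectors.
                g = gamma * (tau[:, a, :] * obs_prob[:, a, o]) @ alphas.T  # (K, n_old)
                w = softmax(b @ g, temp)                                   # (n_old,)
                alpha_a += g @ w
            cand[a] = alpha_a
            vals[a] = b @ alpha_a
        # Soft action selection over the candidate vectors.
        new_alphas[i] = softmax(vals, temp) @ cand
    return new_alphas
```

As the temperature goes to zero this recovers the hard PBVI backup, while any positive temperature keeps the whole update differentiable in the model parameters.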
4.2 Optimizing the Objective
We optimize the objective via gradient ascent using gradients computed from the full dataset (we do not use stochastic minibatches to avoid the extra variance, but in principle they may be used). We use the Rprop optimizer (Igel and Hüsken, 2003) with default settings. Our objective is challenging due to non-convexity, as even the generative term alone admits many local optima. To improve convergence, we utilize the best of many (in practice, 10) random restarts for POPCORN as well as the 2-stage EM and value-term-only baselines.
Stabilizing the Off-Policy Estimate.
Although the CWPDIS estimator was reliable in our simulated environments, on our real dataset it had unusably high variance, as is common with IS estimators. We address this in two ways.
First, we add an extra term in the objective favoring larger effective sample size (ESS), following Metelli et al. (2018). Our final objective includes an ESS penalty with weight $\lambda_{\text{ESS}}$:
$$\max_\theta \ \mathcal{L}_{\text{gen}}(\theta) + \lambda \big( \mathcal{L}_{\text{value}}(\theta) + \lambda_{\text{ESS}} \cdot \text{ESS}(\theta) \big) \quad (5)$$
The ESS can be computed given the IS weights using the standard formula (Kong, 1992). Since CWPDIS estimates the average discounted reward at each time step $t$, we sum together all stepwise ESS values to yield a single ESS term.
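The summed stepwise ESS can be computed directly from the importance weights. A small NumPy sketch, with the weight-array layout being our own assumption:

```python
import numpy as np

def stepwise_ess(rho):
    """Summed effective sample size of importance weights across time steps.

    At each step t, ESS_t = (sum_n rho_nt)^2 / sum_n rho_nt^2 (Kong, 1992);
    the stepwise values are then summed into a single scalar.

    rho : (N, T) per-trajectory, per-step importance weights
    """
    ess_t = rho.sum(axis=0) ** 2 / (rho ** 2).sum(axis=0)
    return ess_t.sum()
```

When the proposed policy matches the behavior policy, all weights equal 1 and each ESS_t attains its maximum of N; heavily skewed weights drive the ESS toward 1, signaling an unreliable estimate.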
Second, we mask a policy’s action probabilities so only actions with at least a minimum probability under the behavior policy are allowed. This forces a policy to be relatively close to the behavior policy. As a result, it both improves the reliability of the OPE and provides a notion of “safety”, as only previously seen actions are allowed. This prevents a policy from recommending dangerous or unknown actions.
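The masking step amounts to zeroing out rarely-taken actions and renormalizing. An illustrative sketch, where the default threshold value is a placeholder and not the paper's setting:

```python
import numpy as np

def mask_policy(pi_probs, behavior_probs, eps=0.05):
    """Forbid actions the behavior policy rarely took, then renormalize.

    pi_probs       : (A,) proposed policy's action probabilities
    behavior_probs : (A,) estimated behavior policy probabilities
    eps            : minimum behavior probability for an action to be allowed
    """
    masked = np.where(behavior_probs >= eps, pi_probs, 0.0)
    return masked / masked.sum()
```

Because the surviving probabilities are simply rescaled, the masked policy remains a valid distribution over the allowed actions.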
Hyperparameter Selection.
The key hyperparameter for our approach is the scalar $\lambda$. We generally try a range of 3–5 $\lambda$’s per environment, spaced evenly on a log scale. In practice, rescaling the $\mathcal{L}_{\text{gen}}$ term by the number of observed scalars in the dataset keeps magnitudes similar across different datasets and keeps the range of $\lambda$ values to explore more consistent. We also must select the weight on the ESS penalty $\lambda_{\text{ESS}}$, which we find is only necessary in our real data experiments. We also experiment with 3–5 values evenly spaced on a log scale, and combine the ESS penalty with the overall hyperparameter $\lambda$ (i.e. the overall weight is $\lambda \cdot \lambda_{\text{ESS}}$). Lastly, we set modest Dirichlet priors on the transitions $\tau$ and vague normal priors on the emission means $\mu$, and add these as an additional term in the overall loss.

5 Results on Simulated Domains
We first evaluate POPCORN on several simulated environments to validate its utility across a range of model misspecifications, as well as on a harder sepsis simulator task. For the synthetic experiments in this section, everything is conducted in the batch, offpolicy setting. The simulator is only used to produce the initial data set, and to evaluate the final policy after training concludes. We separate each experiment into a description of procedure and highlights of the results.
Our goal is to develop a method to learn simple—and therefore interpretable—models that perform robustly in misspecified settings. As such, we compare against an approach that does not attempt to model the dynamics (valueonly), an approach that first learns the dynamics and then plans (2stage), and a known optimal solution (when available). In all cases, we are interested in how these methods trade off between explaining the data well (log marginal likelihood of data) and making good decisions (policy value).
5.1 Synthetic Domains with Misspecified Models: Description
We demonstrate how POPCORN overcomes various kinds of model misspecification in the context of the classic POMDP tiger problem (Kaelbling et al., 1998). The classic tiger problem consists of a room with 2 doors: one door is safe, and the other door has a tiger behind it. The agent has 3 possible actions: either open one of the doors, thereby ending the episode, or listen for noisy evidence of which door is safe to open. Revealing a tiger gives a large negative reward, the safe door yields a positive reward, and listening incurs a small negative reward. The goal is to maximize rewards over many repeated trials, with the “safe” door’s location randomly chosen each time.
In all cases, we set the discount factor $\gamma < 1$ to encourage the agent to act quickly. We collect training data from a random policy that first listens for 5 time steps, and then randomly either opens a door or listens again. We train in the batch setting given a single collection of 1000 trajectories of length 5–15. After optimization completes, we evaluate each policy via an additional 1000 Monte Carlo rollouts, as we are able to simulate trajectories under our learned policy in these simulated settings.
Tiger with Irrelevant Noise: Finding dimensions that signal reward. We assume that whenever the agent listens for the tiger, it receives an observation with 2 dimensions. The first dimension provides a noisy signal as to the location of the safe door: we set this “signal” dimension to be Gaussian with standard deviation 0.3, where the mean is the safe door’s index. The second dimension is irrelevant to the safe door’s location, and we set it to be Gaussian with standard deviation 0.1, with a mean drawn randomly in each trial. Thus, more latent states than we allow would be needed to explain perfectly both the relevant and irrelevant signals across all possible mean values.
We measure performance allowing only a small number of states to assess how each method spends its limited capacity across the generative and reward-seeking goals. We expect the 2-stage baseline will identify states that capture the irrelevant dimension, as it has lower standard deviation (0.1 vs. 0.3 for the signal dimension) and is thus more important for maximizing likelihood. In contrast, we expect POPCORN will focus on the relevant signal dimension and recover high-value policies.
Tiger with Missing Data: Finding relevant dimensions when some data is missing. We continue with the previous setting, in which the listen action produces 2-dimensional observations, where the first dimension signals the safe door’s location and the second is irrelevant. However, this time the dimension with the relevant signal is often missing: we select 80% of signal observations to be missing uniformly at random. This simulates clinical settings where some measurements are rare but useful (e.g. lab tests), while others are common but not directly useful (e.g. temperature for a blood pressure task).
The expected outcome with limited states is that a 2-stage approach driven by maximizing likelihood would prefer to model the always-present irrelevant dimension. In contrast, POPCORN should learn to favor the signal dimension even though it is rarely available and contributes less overall to the likelihood. This ability to gracefully handle missing dimensions is a natural property of generative models and would not be easily achieved with a model-free approach.
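The graceful handling of missing dimensions follows from the independent-Gaussian emission model: missing dimensions simply drop out of each state's likelihood rather than requiring imputation. A small sketch, where the NaN encoding for missing entries is our own convention:

```python
import numpy as np

def emission_log_lik_missing(o, mu, sigma):
    """Per-state Gaussian log-likelihood of one observation vector,
    marginalizing out missing dimensions (encoded as NaN) by omitting
    them from the sum over independent dimensions.

    o         : (D,) observation with NaNs for missing entries
    mu, sigma : (K, D) per-state emission means and standard deviations
    """
    observed = ~np.isnan(o)
    oo = o[observed]
    m, s = mu[:, observed], sigma[:, observed]
    return -0.5 * np.sum(((oo - m) / s) ** 2 + np.log(2 * np.pi * s ** 2), axis=1)
```

Plugging this into forward filtering means a fully-missing observation contributes a constant to every state, so the belief update reduces to a pure transition step.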
Tiger with Wrong Likelihood: Overcoming a misspecified density model. Finally, we consider a situation in which our generative model’s density family cannot match the true observation distribution. This time, the listen action produces a 1-dimensional observation whose true distribution is a truncated mixture of two Gaussians. If the first door is safe, listening results in strictly negative observations; if the second door is safe, listening results in strictly positive observations.
While the true observation densities are not Gaussian, we will fit POMDP models with Gaussian likelihoods and a small number of states. We expect our POPCORN approach to still deliver high-value policies, even though the likelihood will be suboptimal.
5.2 Synthetic Domains with Misspecified Models: Conclusions
Across all 3 variants of the Tiger problem, we observe many common conclusions from Fig. 1. Together, these results demonstrate how POPCORN is robust to many different kinds of model misspecification.
POPCORN delivers higher policy value than 2-stage (EM then PBVI).
Across all 3 panels of Fig. 1, POPCORN (red) delivers higher value (y-axis) than the 2-stage baseline (purple).
Only optimizing for value hurts generative quality.
In 2 of 3 panels, the value-only baseline (green) has noticeably worse likelihood (x-axis) than POPCORN. The far-right panel shows indistinguishable performance. Notably, optimizing this objective is significantly less stable than the full POPCORN objective. This aligns with Levine and Koltun (2013), who note that gradient-based optimization of importance sampling estimates is notoriously challenging.
POPCORN solutions are consistent with manually-designed solutions.
In all 3 panels, POPCORN (red) is the closest method to the ideal manually-designed solution (gray), indicating our optimization procedures are effective.
5.3 Sepsis Simulator: Medically-motivated environment with known ground truth.
We now move from simple toy problems—each designed to demonstrate a particular robustness of our method—to a more challenging simulated domain. In real-world medical decision-making tasks, it is impossible to evaluate the value of a learned policy using data collected under that policy’s decisions. However, in a simulated setting, we can evaluate any given policy to assess its true value. We emphasize that the policy is still learned in the batch setting, but after optimization we use the simulator to allow for accurate evaluation of policy values.
Specifically, we use the simulator from Oberst and Sontag (2019), which is a coarse physiological model for sepsis with discrete-valued observations: 4 vitals, and an indicator for diabetes. The true simulator is governed by an underlying known Markov decision process (MDP), which has 1440 possible discrete states. To make the environment more challenging, we add a small amount of Gaussian noise to each observation to make it continuous-valued. The environment has 8 actions (3 different binary actions), and is run for at most 20 time steps. Rewards are sparse, with 0 reward at intermediate time steps and a positive or negative reward at the terminal time.
The true discrete-state MDP for this environment can be solved via exact value iteration. We then generate trajectories under an $\epsilon$-greedy behavior policy, with $\epsilon$ set so that each non-optimal action has a small probability of being taken at each time. Given the observed trajectories, we learn POMDP models assuming a moderate number of possible states.
Results and Conclusions. The results in Figure 2 again show POPCORN delivers higher value than 2-stage EM. Additionally, while the value-term-only baseline learns a policy on par with POPCORN, the corresponding likelihood is substantially lower.
6 Application to Hypotension Management
To showcase the utility of our method on a real-world medical decision-making task, we apply POPCORN to the challenging problem of managing acutely hypotensive patients in the ICU. Although hypotension is associated with high morbidity and mortality (Jones et al., 2006), management of these patients is difficult and treatment strategies are not standardized, in large part because there are many potential underlying causes of hypotension. Girkar et al. (2018) attempted to predict the efficacy of fluid therapy for hypotensive patients, and found considerable noise in assessing whether the treatment would be successful. We study the same tradeoffs between generative and reward-seeking performance as in Sec. 5. We further perform an in-depth evaluation of the learned policy and our confidence in it (via effective sample sizes).
6.1 Description
Cohort. We use 10,142 ICU stays from MIMIC-III (Johnson et al., 2016), filtering to adult patients with at least 3 abnormally low mean arterial pressure (MAP) values in the first 72 hours of ICU admission. Our observations consist of 9 vitals and laboratory measurements: MAP, heart rate, urine output, lactate, Glasgow coma score, serum creatinine, FiO2, total bilirubin, and platelets count. We discretized time into 1-hour windows, and set up the RL task to begin 1 hour after ICU admission, to ensure a sufficient amount of data exists before starting a policy. Trajectories end either at ICU discharge or at 72 hours into the ICU admission, so there are at most 71 actions taken. This problem formulation was made in consultation with a critical care physician, who advised most acute cases of hypotension would present early during an ICU admission. We expressly do not impute missing observations, and only new observation measurements contribute to the overall likelihood.
Setup. Our action space consists of the two main treatments for acute hypotension: fluid bolus therapy and vasopressors, both of which are designed to quickly raise blood pressure and increase perfusion to the organs. We discretize fluids into 4 actions (none, low, medium, high), and discretize vasopressors into 5 actions (none, low, medium, high, very high), for a total of 20 discrete actions. To assign rewards to individual time steps, we use a simple piecewise-linear reward function created in consultation with a critical care physician. A MAP of 65 mmHg is a common target level (Asfar et al., 2014), so if an action is taken and the next MAP is 65 or higher, the next reward is +1, the highest possible value. Otherwise, rewards decrease as MAP values drop, with sufficiently low MAP values delivering a reward of 0, the smallest possible value. See the supplement for a plot of this reward function.
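A sketch of such a piecewise-linear reward follows. The 65 mmHg target comes from the text; the lower cutoff at which the reward reaches 0 is an illustrative placeholder, not the paper's value:

```python
def map_reward(next_map, target=65.0, floor=40.0):
    """Piecewise-linear reward on the next mean arterial pressure (MAP).

    next_map : MAP (mmHg) observed after the action
    target   : MAP at or above which the reward saturates at +1 (from the text)
    floor    : placeholder MAP at or below which the reward is 0 (assumed)
    """
    if next_map >= target:
        return 1.0
    # Linear ramp from 0 at `floor` up to 1 at `target`.
    return max(0.0, (next_map - floor) / (target - floor))
```

Shaping the reward this way gives dense per-step feedback tied to the clinical target, rather than a single sparse outcome at the end of the stay.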
We split the dataset into 5 distinct test sets for cross-validation, and throughout present results on the test sets, with standard errors across folds where appropriate. We set the weight $\lambda_{\text{ESS}}$ on the ESS penalty, and set a threshold for the “safety” mask, i.e. any action whose estimated behavior probability falls below the threshold is forbidden. Lastly, we study several possible values for the Lagrange multiplier $\lambda$. For each fold we use the best run among 10 random restarts, selecting the best in terms of training objective value.

6.2 Conclusions
POPCORN achieves the best balance of high-performing policies and high likelihood models. As in previous environments, Figure 3 demonstrates how POPCORN trades off between generative and decision-making performance, with darker red indicating higher $\lambda$’s and thus improved policy values. The policy values for the 2-stage baseline and the likelihood values for the value-only baseline both substantially underperform POPCORN.
POPCORN enables reasonably accurate forecasting. To demonstrate the ability of models to predict future observations, Figure 4 shows results from a forecasting experiment. Each method is given the first 12 hours of a trajectory, and then must predict future observations up to 12 hours in the future. Importantly, only measured observations are used to calculate the mean absolute error between model predictions and the true values. Unsurprisingly, the 2-stage baseline generally performs the best, although POPCORN with small values of $\lambda$ often performs similarly. On the other hand, the value-only baseline fares significantly worse. For some observations (MAP and urine output; see leftmost column of Figure 4), it makes nonsensical predictions far outside the range of observed data, with errors several orders of magnitude worse than POPCORN and the 2-stage baseline.
POPCORN enables inspection of whether learned models are clinically sensible. We visualize the learned emission distributions for MAP across the states and actions for each method in Figure 5. Note that densities may appear non-Gaussian, as they are back-transformed to the scale of the data but were modeled on a log scale. POPCORN’s distributions are more spread out and better differentiate between states compared to the 2-stage baseline, which learns very similar states with high overlap. As a result, the 2-stage policy ends up recommending similar actions for most patients. Value-only learns states that are even more diverse, allowing it to learn an effective policy but at the expense of not modeling the observed data well. Additional results for lactate, urine output, and heart rate can be found in the appendix. Although these results are highly exploratory, these simple visualizations of what the models have learned are only possible due to the white-box nature of our HMM-based approach, compared with deep reinforcement learning methods.
Lastly, Figure 6 visualizes the action probabilities for the behavior policy, a value-only policy, a POPCORN policy, and a 2-stage policy. In general, the POPCORN policy most closely aligns with the behavior, although it is also quite similar to value-only. On the other hand, the 2-stage policy seems in general much more conservative and tends to place lower probabilities on more aggressive actions. In future work we plan to work with clinical partners to explore individual patient trajectories, digging into where, how, and why these treatment policies differ.
7 Discussion
In this paper we propose POPCORN, an improved optimization objective for off-policy batch RL with partial observability. POPCORN can balance the tradeoff between learning a model with high likelihood and a model that is well-suited for planning, even if evaluation must be done off-policy on historical data. Synthetic experiments demonstrate the primary advantages of our approach over alternatives: POPCORN achieves good policies and decent models even in the face of misspecification (in the number of states, the choice of the likelihood, or the availability of data). Performance on a real-world medical problem of hypotension management suggests we may be able to learn a policy on par or even slightly better than the observed clinician behavior policy. Future directions include scaling our methods to environments with more complex state structures or long-term temporal dependencies, investigating semi-supervised settings where not all sequences have observed rewards, and learning Pareto-optimal policies that balance multiple competing reward signals. We hope that methods such as ours may ultimately become decision support tools that are integrated into actual clinical practice to assist providers in improving the treatment of critically ill patients.
Acknowledgements
FDV and JF acknowledge support from NSF Project 1750358. JF additionally acknowledges Oracle Labs, a Harvard CRCS fellowship, and a Harvard Embedded EthiCS fellowship.
References
 García et al. (2015) M. I. M. García, P. G. González, M. G. Romero, A. G. Cano, C. Oscier, A. Rhodes, R. M. Grounds, and M. Cecconi. Effects of fluid administration on arterial load in septic shock patients. Intensive care medicine, 41(7):1247–1255, 2015.
 Gottesman et al. (2019) O. Gottesman, F. Johansson, M. Komorowski, A. Faisal, D. Sontag, F. Doshi-Velez, and L. A. Celi. Guidelines for reinforcement learning in healthcare. Nature Medicine, 25, 2019. URL https://doi.org/10.1038/s4159101803105.
 Kaelbling et al. (1998) L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.
 Thomas (2015) P. S. Thomas. Safe Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, 2015. URL https://people.cs.umass.edu/~pthomas/papers/Thomas2015c.pdf.
 Shortreed et al. (2011) S. M. Shortreed, E. Laber, D. J. Lizotte, T. S. Stroup, J. Pineau, and S. A. Murphy. Informing sequential clinical decision-making through reinforcement learning: An empirical study. Machine Learning, 84(1-2):109–136, 2011. ISSN 0885-6125. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3143507/.
 Komorowski et al. (2018) M. Komorowski, L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716, 2018.
 Raghu et al. (2017) A. Raghu, M. Komorowski, L. A. Celi, P. Szolovits, and M. Ghassemi. Continuous State-Space Models for Optimal Sepsis Treatment: A Deep Reinforcement Learning Approach. In Machine Learning for Healthcare Conference, pages 147–163, 2017. URL http://proceedings.mlr.press/v68/raghu17a.html.
 Prasad et al. (2017) N. Prasad, L. Cheng, C. Chivers, M. Draugelis, and B. E. Engelhardt. A Reinforcement Learning Approach to Weaning of Mechanical Ventilation in Intensive Care Units. In Uncertainty in Artificial Intelligence, page 10, 2017. URL http://auai.org/uai2017/proceedings/papers/209.pdf.
 Ernst et al. (2006) D. Ernst, G. Stan, J. Goncalves, and L. Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 667–672. IEEE, 2006.
 Martín-Guerrero et al. (2009) J. D. Martín-Guerrero, F. Gomez, E. Soria-Olivas, J. Schmidhuber, M. Climente-Martí, and N. V. Jiménez-Torres. A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients. Expert Systems with Applications, 36(6):9737–9742, 2009. ISSN 0957-4174. URL http://www.sciencedirect.com/science/article/pii/S0957417409001699.
 Hauskrecht and Fraser (2000) M. Hauskrecht and H. Fraser. Planning treatment of ischemic heart disease with partially observable Markov decision processes. Artificial Intelligence in Medicine, 18(3):221–244, 2000. ISSN 0933-3657. URL https://linkinghub.elsevier.com/retrieve/pii/S0933365799000421.
 Li et al. (2018) L. Li, M. Komorowski, and A. A. Faisal. The Actor Search Tree Critic (ASTC) for Off-Policy POMDP Learning in Medical Decision Making. In arXiv:1805.11548 [Cs], 2018. URL http://arxiv.org/abs/1805.11548.
 Oberst and Sontag (2019) M. Oberst and D. Sontag. Counterfactual Off-Policy Evaluation with Gumbel-Max Structural Causal Models. In International Conference on Machine Learning, 2019. URL http://arxiv.org/abs/1905.05824.
 Peng et al. (2018) X. Peng, Y. Ding, D. Wihl, O. Gottesman, M. Komorowski, L. H. Lehman, A. Ross, A. Faisal, and F. Doshi-Velez. Improving sepsis treatment strategies by combining deep and kernel-based reinforcement learning. In AMIA Annual Symposium Proceedings, volume 2018, page 887. American Medical Informatics Association, 2018.
 Parbhoo et al. (2017) S. Parbhoo, J. Bogojeska, M. Zazzi, V. Roth, and F. Doshi-Velez. Combining kernel and model based learning for HIV therapy selection. AMIA Summits on Translational Science Proceedings, 2017:239, 2017.
 Lacoste-Julien et al. (2011) S. Lacoste-Julien, F. Huszár, and Z. Ghahramani. Approximate inference for the loss-calibrated Bayesian. In Artificial Intelligence and Statistics, 2011.
 Wilder et al. (2019) B. Wilder, B. Dilkina, and M. Tambe. Melding the Data-Decisions Pipeline: Decision-Focused Learning for Combinatorial Optimization. In AAAI Conference on Artificial Intelligence, page 8, 2019.
 Karkus et al. (2017) P. Karkus, D. Hsu, and W. S. Lee. QMDP-net: Deep learning for planning under partial observability. In NIPS, 2017.
 Igl et al. (2018) M. Igl, L. Zintgraf, T. A. Le, F. Wood, and S. Whiteson. Deep variational reinforcement learning for POMDPs. In ICML, 2018.
 Farahmand (2018) A. Farahmand. Iterative value-aware model learning. In Advances in Neural Information Processing Systems, pages 9072–9083, 2018.
 Rabiner (1989) L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of the IEEE, 77(2):257–286, 1989.
 Bengio and Frasconi (1995) Y. Bengio and P. Frasconi. An input output HMM architecture. In Advances in neural information processing systems, 1995.
 Chrisman (1992) L. Chrisman. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In AAAI, 1992. URL https://www.aaai.org/Papers/AAAI/1992/AAAI92029.pdf.
 Doshi-Velez (2012) F. Doshi-Velez. Bayesian Nonparametric Approaches for Reinforcement Learning in Partially Observable Domains. PhD thesis, Massachusetts Institute of Technology, 2012.
 Sondik (1978) E. J. Sondik. The optimal control of partially observable Markov processes over the infinite horizon: discounted costs. Operations Research, 26(2), 1978.
 Pineau et al. (2003) J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. IJCAI, 2003.
 Hoey and Poupart (2005) J. Hoey and P. Poupart. Solving POMDPs with continuous or large discrete observation spaces. IJCAI, 2005.
 Raghu et al. (2018) A. Raghu, O. Gottesman, Y. Liu, M. Komorowski, A. A. Faisal, F. Doshi-Velez, and E. Brunskill. Behaviour policy estimation in off-policy policy evaluation: Calibration matters. arXiv preprint arXiv:1807.01066, 2018.
 Chow et al. (2015) Y. Chow, M. Petrik, and M. Ghavamzadeh. Robust policy optimization with baseline guarantees. arXiv preprint arXiv:1506.04514, 2015.
 Hughes et al. (2018) M. C. Hughes, G. Hope, L. Weiner, T. H. Mccoy, R. H. Perlis, E. Sudderth, and F. Doshi-Velez. Semi-supervised prediction-constrained topic models. In AISTATS, 2018.
 Igel and Hüsken (2003) C. Igel and M. Hüsken. Empirical evaluation of the improved rprop learning algorithms. Neurocomputing, 50:105–123, 2003.
 Metelli et al. (2018) A. M. Metelli, M. Papini, F. Faccio, and M. Restelli. Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, pages 5442–5454, 2018.
 Kong (1992) A. Kong. A Note on Importance Sampling using Standardized Weights. Technical Report 348, University of Chicago Department of Statistics, 1992. URL https://galton.uchicago.edu/techreports/tr348.pdf.
 Levine and Koltun (2013) S. Levine and V. Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
 Jones et al. (2006) A. E. Jones, V. Yiannibas, C. Johnson, and J. A. Kline. Emergency department hypotension predicts sudden unexpected in-hospital mortality: a prospective cohort study. Chest, 130(4):941–946, 2006.
 Girkar et al. (2018) U. M. Girkar, R. Uchimido, L. H. Lehman, P. Szolovits, L. A. Celi, and W. Weng. Predicting blood pressure response to fluid bolus therapy using attention-based neural networks for clinical interpretability. arXiv preprint arXiv:1812.00699, 2018.
 Johnson et al. (2016) A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
 Asfar et al. (2014) P. Asfar, F. Meziani, J. Hamel, F. Grelon, B. Megarbane, N. Anguel, J. Mira, P. Dequin, S. Gergaud, and N. Weiss. High versus low blood-pressure target in patients with septic shock. New England Journal of Medicine, 370(17):1583–1593, 2014.
 Shani et al. (2013) G. Shani, J. Pineau, and R. Kaplow. A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51, 2013.
Appendix A Additional Details on Point-Based Value Iteration
Point-based value iteration (PBVI) is an algorithm for efficiently solving POMDPs (Pineau et al., 2003). See Shani et al. (2013) for a thorough survey of related algorithms and extensions in this area.
In PBVI, we do not perform full Bellman backups over the space of all possible belief points, as this is typically intractable. Instead, we perform backups only at a fixed set of belief points $B$. We first give an equation that computes the value at a belief $b$ after a Bellman backup over the current set of $\alpha$-vectors $\Gamma$, where we let $r_a$ denote the vector $[R(s_1, a), \ldots, R(s_{|S|}, a)]$:
(6) $V(b) = \max_a Q(b, a),$
(7) $Q(b, a) = b \cdot r_a + \gamma \sum_o \max_{\alpha \in \Gamma} b \cdot g^{\alpha}_{a,o},$
where $g^{\alpha}_{a,o}(s) = \sum_{s'} T(s' \mid s, a)\, \Omega(o \mid s', a)\, \alpha(s')$, and $b \cdot v = \sum_s b(s) v(s)$. Then, we can use the computation of this value to efficiently compute the new $\alpha$-vector that would have been optimal for $b$, had we run the complete Bellman backup:
(8) $a^* = \arg\max_a Q(b, a),$
(9) $\alpha^{\text{new}}_b = r_{a^*} + \gamma \sum_o \arg\max_{g \in \{g^{\alpha}_{a^*,o} :\, \alpha \in \Gamma\}} b \cdot g.$
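As a concrete illustration, one point-based backup of this form can be sketched in NumPy for a discrete observation space. The array names, shapes, and the dense-tensor representation below are illustrative conventions for exposition, not the paper's actual implementation:

```python
import numpy as np

def pbvi_backup(b, Gamma, T, O, R, gamma=0.95):
    """One point-based Bellman backup at belief b.

    b:     (S,)      belief over hidden states
    Gamma: (K, S)    current set of alpha vectors
    T:     (A, S, S) transition probabilities T[a, s, s']
    O:     (A, S, Z) observation probabilities O[a, s', o]
    R:     (S, A)    expected immediate rewards
    Returns (new alpha vector of shape (S,), its value at b).
    """
    A, _, Z = O.shape
    best_val, best_alpha = -np.inf, None
    for a in range(A):
        # g[k, o, s] = sum_{s'} T[a, s, s'] * O[a, s', o] * Gamma[k, s']
        g = np.einsum('st,to,kt->kos', T[a], O[a], Gamma)
        # for each observation, pick the alpha vector maximizing b . g
        scores = g @ b                       # (K, Z)
        winners = scores.argmax(axis=0)      # (Z,) indices of best vectors
        alpha_a = R[:, a] + gamma * g[winners, np.arange(Z), :].sum(axis=0)
        val = b @ alpha_a
        if val > best_val:                   # keep the action maximizing value
            best_val, best_alpha = val, alpha_a
    return best_alpha, best_val
```

Repeating this backup at every belief in a fixed set $B$, and collecting each returned vector into the next $\Gamma$, gives the standard PBVI iteration.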
PBVI: Sampling Approximation to Deal with Complex Observation Models
In standard PBVI, we are limited by the complexity of the observation space. The PBVI backup crucially depends on a summation over the observation space (or an integration, for continuous observations), which is generally intractable to compute exactly for multi-dimensional, non-discrete observations.
Instead, we will utilize ideas from Hoey and Poupart (2005) to circumvent this issue. The main idea is to learn a partition of the observation space, grouping together the observations that, conditional on a given belief $b$ and action $a$, share the same maximizing $\alpha$-vector. That is, writing $g^{\alpha}_{a,o}(s) = \sum_{s'} T(s' \mid s, a)\, \Omega(o \mid s', a)\, \alpha(s')$ as before, we want to learn the sets $O^{a,i}_b = \{ o : \alpha_i = \arg\max_{\alpha \in \Gamma} b \cdot g^{\alpha}_{a,o} \}$. We can then treat this collection $\{O^{a,i}_b\}_{i=1}^{|\Gamma|}$ as a set of “meta-observations”, which will allow us to replace the intractable sum/integral over the observation space with a sum over the number of $\alpha$-vectors, by swapping out the term $\Omega(o \mid s', a)$ from the PBVI backup with $\omega^{a}_{i}(s') = P(o \in O^{a,i}_b \mid s', a)$, the aggregate probability mass over all observations in the $i$-th “meta-observation”. In particular, we can now express the value of a belief as:
(10) $V(b) = \max_a \big[\, b \cdot r_a + \gamma \sum_o \max_i b \cdot g^{\alpha_i}_{a,o} \,\big]$
(11) $\phantom{V(b)} = \max_a \big[\, b \cdot r_a + \gamma \sum_i \sum_{o \in O^{a,i}_b} b \cdot g^{\alpha_i}_{a,o} \,\big]$
(12) $\phantom{V(b)} = \max_a \big[\, b \cdot r_a + \gamma \sum_i \sum_s b(s) \sum_{s'} T(s' \mid s, a) \sum_{o \in O^{a,i}_b} \Omega(o \mid s', a)\, \alpha_i(s') \,\big]$
(13) $\phantom{V(b)} = \max_a \big[\, b \cdot r_a + \gamma \sum_i \sum_s b(s) \sum_{s'} T(s' \mid s, a)\, \omega^{a}_{i}(s')\, \alpha_i(s') \,\big]$
We will make use of a sampling approximation that admits arbitrary observation functions in order to approximate these quantities $\omega^{a}_{i}(s')$, the aggregate probability of each “meta-observation”.
To do this, we first sample observations $o^{(1)}_{s',a}, \ldots, o^{(N)}_{s',a} \sim \Omega(\cdot \mid s', a)$ for each pair of states and actions. Then, we can approximate $\omega^{a}_{i}(s')$ by the fraction of sampled observations for which $\alpha_i$ was the optimal vector, i.e.
(14) $\hat{\omega}^{a}_{i}(s') = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\big[\, \alpha_i = \arg\max_{\alpha \in \Gamma} b \cdot g^{\alpha}_{a,\, o^{(n)}_{s',a}} \,\big],$
where ties are broken by favoring the vector with the lowest index. Using this approximate discrete observation function, we can perform point-based backups at a set of beliefs $B$ as before. Our backup operation is now:
(15) $\hat{g}^{i}_{a}(s) = \sum_{s'} T(s' \mid s, a)\, \hat{\omega}^{a}_{i}(s')\, \alpha_i(s'),$
(16) $a^* = \arg\max_a \big[\, b \cdot r_a + \gamma \sum_i b \cdot \hat{g}^{i}_{a} \,\big],$
(17) $\alpha^{\text{new}}_b = r_{a^*} + \gamma \sum_i \hat{g}^{i}_{a^*}.$
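The Monte Carlo estimate of the aggregate meta-observation probabilities might be sketched as follows. Here `sample_obs` and `obs_lik` are hypothetical callables standing in for an arbitrary (e.g. continuous) observation model; the names and shapes are illustrative, not the authors' actual code:

```python
import numpy as np

def meta_obs_probs(b, Gamma, T, a, sample_obs, obs_lik, n_samples=100, seed=0):
    """Estimate omega[i, s'] = P(alpha_i optimal | s', a) by sampling.

    b:          (S,)     belief over hidden states
    Gamma:      (K, S)   current set of alpha vectors
    T:          (A, S, S) transition probabilities T[a, s, s']
    sample_obs(s, a, rng) -> one observation drawn from Omega(. | s, a)
    obs_lik(o, a)         -> (S,) densities Omega(o | s', a) over states
    """
    rng = np.random.default_rng(seed)
    K, S = Gamma.shape
    omega = np.zeros((K, S))
    for sp in range(S):
        for _ in range(n_samples):
            o = sample_obs(sp, a, rng)
            lik = obs_lik(o, a)  # density of this sample at each successor state
            # g[k, s] = sum_{s'} T[a, s, s'] * Omega(o | s', a) * Gamma[k, s']
            g = np.einsum('st,t,kt->ks', T[a], lik, Gamma)
            # np.argmax returns the first maximizer, i.e. ties favor lowest index
            winner = np.argmax(g @ b)
            omega[winner, sp] += 1.0 / n_samples
    return omega
```

Each sample increments exactly one winning vector, so each column of `omega` is a proper distribution over the meta-observations for that successor state.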
Appendix B Additional Results from ICU Hypotension Domain
Reward function specification
Figure 7 shows the reward function used on this dataset. Note the inflection points at MAP values of 55 and 60. Also, for patients who had adequate urine output, the MAP threshold below which rewards start to worsen is lower.
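For concreteness, a reward of this piecewise shape could be sketched as below. Only the 55 and 60 mmHg inflection points come from the text; the 5 mmHg threshold shift for adequate urine output, the slopes, and the 0-to-1 reward scale are illustrative guesses, not the actual function plotted in Figure 7:

```python
def hypotension_reward(map_mmHg, adequate_urine=False):
    """Illustrative piecewise-linear reward over mean arterial pressure (MAP).

    Inflection points at 55 and 60 mmHg follow the text; adequate urine
    output shifts both thresholds down (the 5 mmHg shift, the slopes, and
    the reward scale are guesses for illustration only).
    """
    hi, lo = (55.0, 50.0) if adequate_urine else (60.0, 55.0)
    if map_mmHg >= hi:
        return 1.0                                        # healthy MAP: full reward
    if map_mmHg >= lo:
        return 1.0 - 0.5 * (hi - map_mmHg) / (hi - lo)    # mild penalty zone
    return max(0.0, 0.5 - 0.05 * (lo - map_mmHg))         # steeper penalty zone
```

The function is continuous and non-decreasing in MAP, so an agent is never rewarded for lowering blood pressure.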
Figures 8, 9, and 10 show additional qualitative results about the models learned by POPCORN, 2-stage EM, and the value-only baseline for the heart rate, lactate, and urine output variables. As in Figure 5 in the main text, we again see that the 2-stage approach largely learns states that exhibit very high overlap. Likewise, the value-only baseline learns states that are much more spread apart, and occasionally even learns bizarre distributions that are close to a point mass at a single value. As expected, POPCORN learns models between these two extremes, with states diverse enough to learn a good policy while also fitting the data decently well.