Introduction
A key goal in reinforcement learning (RL) is to quickly learn to make good decisions by interacting with an environment. In most cases the quality of the decision policy is evaluated with respect to its expected (discounted) sum of rewards. However, in many interesting cases it is important to consider the full distribution over the potential sum of rewards, and the desired objective may be a risk-sensitive measure of this distribution. For example, a patient undergoing knee-replacement surgery will (hopefully) only experience that procedure once or twice, and will likely be interested in the distribution of potential results for a single procedure, rather than what would happen on average if he or she were to undergo that procedure hundreds of times. Finance and (machine) control are other settings where interest in risk-sensitive outcomes is common.
A popular risk-sensitive measure of a distribution of outcomes is the Conditional Value at Risk (CVaR) [artzner1999coherent]. Intuitively, CVaR is the expected reward in the worst fraction of outcomes, and it has seen extensive use in financial portfolio optimization [financezhu2009worst], often under the name "expected shortfall". While there has been recent interest in the RL community in learning good CVaR decision policies in Markov decision processes [chow2014algorithms, chow2015risk, sampling, dabney2018implicit], we are unaware of prior work focused on how to quickly learn such CVaR MDP policies, even though sample-efficient RL for maximizing expected outcomes is a deep and well-studied topic, both theoretically [Jaksch10, dann2018policy] and empirically [bellemare2016unifying]. Sample-efficient exploration seems of equal or even greater importance when the goal is risk-averse outcomes.
In this paper we work towards sample-efficient reinforcement learning algorithms that can quickly identify a policy with an optimal CVaR. Our focus is on minimizing the amount of experience needed to find such a policy, similar in spirit to probably approximately correct (PAC) RL methods for expected reward. Note that this is different from another important topic in risk-sensitive RL, safe exploration: algorithms that focus on avoiding any potentially very poor outcomes during learning. These typically rely on local smoothness assumptions and do not focus on sample efficiency [berkenkamp2017safe, koller2018learning]; an interesting question for future work is whether one can do both safe and efficient learning of a CVaR policy. Our work is suitable for the many settings where some outcomes are undesirable but not catastrophic.
Our approach is inspired by the popular and effective principle of optimism in the face of uncertainty (OFU) in sample-efficient RL for maximizing expected outcomes [strehl2008analysis, brafman2002r]. Such work typically considers uncertainty over the MDP model parameters or state-action value function, and constructs an optimistic value function given that uncertainty which is then used to guide decision making. To apply a similar idea to rapidly learning the optimal CVaR policy, we consider the uncertainty in the distribution of possible outcomes and the resulting CVaR value. To do so, we use the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. While to our knowledge it has not previously been used in reinforcement learning settings, it is a very useful concentration inequality for our purposes, as it provides bounds on the true cumulative distribution function (CDF) given a set of sampled outcomes. We leverage these bounds to compute optimistic estimates of the optimal CVaR.
Our interest is in creating empirically efficient and scalable algorithms that have a theoretically sound grounding. To that end, we introduce a new algorithm for quickly learning a CVaR policy in MDPs and show that, at least in the evaluation case in tabular MDPs, this algorithm indeed produces optimistic estimates of the CVaR. We also show that it eventually converges. We accompany the theoretical evidence with an empirical evaluation. We provide encouraging empirical results on a machine replacement task [delage2010percentile], a classic MDP where risk-sensitive policies are critical, as well as on a well-validated simulator for type 1 diabetes [man2014uva] and a simulated treatment optimization task for HIV [ernst2006clinical]. In all cases we find a substantial benefit over simpler exploration strategies. To our knowledge this is the first algorithm that performs strategic exploration to learn good CVaR MDP policies.
Background and Notation
Let $X$ be a bounded random variable with cumulative distribution function $F(x) = \mathbb{P}(X \leq x)$. The conditional value at risk (CVaR) at level $\alpha \in (0, 1]$ of a random variable $X$ is then defined as [rockafellar2000optimization]:

(1) $\mathrm{CVaR}_\alpha(X) = \sup_{\nu \in \mathbb{R}} \left\{ \nu - \frac{1}{\alpha} \mathbb{E}\left[ (\nu - X)_+ \right] \right\},$

where $(u)_+ = \max(u, 0)$. We define the inverse CDF as $F^{-1}(u) = \inf\{x : F(x) \geq u\}$. It is well known that when $X$ has a continuous distribution, $\mathrm{CVaR}_\alpha(X) = \mathbb{E}[X \mid X \leq F^{-1}(\alpha)]$ [acerbi2002coherence]. For ease of notation we sometimes write CVaR as a function of the CDF, $\mathrm{CVaR}_\alpha(F)$.
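As a concrete illustration of the definition above, for a continuous distribution the CVaR can be estimated from samples by averaging the worst $\alpha$-fraction of outcomes. The sketch below (function name and tail-averaging estimator are ours, not from the paper) shows this:

```python
import numpy as np

def cvar(samples, alpha):
    """Empirical CVaR at level alpha: the mean of the worst
    alpha-fraction of sampled outcomes (lower tail)."""
    x = np.sort(np.asarray(samples, dtype=float))
    k = int(np.ceil(alpha * len(x)))  # number of tail samples to keep
    return x[:k].mean()

returns = [1.0, 2.0, 3.0, 4.0]
# alpha = 0.5 keeps the worst half {1.0, 2.0}
print(cvar(returns, 0.5))  # -> 1.5
```

Note that at $\alpha = 1$ this recovers the ordinary expectation, which is why CVaR interpolates between worst-case and risk-neutral objectives.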
We are interested in the CVaR of the discounted cumulative reward in a Markov decision process (MDP). An MDP is defined by a tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are finite state and action spaces, $R$ is the reward distribution, $P$ is the transition kernel and $\gamma \in [0, 1)$ is the discount factor. A stationary policy $\pi$ maps each state $s \in \mathcal{S}$ to a probability distribution over the action space $\mathcal{A}$. Let $\mathcal{Z}$ denote the space of distributions over returns (discounted cumulative rewards) from such an MDP, and assume that these returns lie in $[a, b]$ almost surely, where $b$ is the highest possible return. We define $Z^\pi(s, a)$ to be the distribution of the return of policy $\pi$ with initial state-action pair $(s, a)$, with CDF $F^\pi_{s,a}$. RL algorithms most commonly optimize policies for expected return and explicitly learn Q-values, $Q^\pi(s, a) = \mathbb{E}[Z^\pi(s, a)]$, by applying approximate versions of Bellman backups. Instead, we are interested in other properties of the return distribution, and we will build on several recently proposed algorithms that aim to learn a parametric model of the entire return distribution instead of only its expectation. Such approaches are known as distributional RL methods.

Distributional Reinforcement Learning
Distributional RL methods apply a sample-based approximation to distributional versions of the usual Bellman operators. For example, one can define a distributional Bellman operator $\mathcal{T}^\pi$ [bellemare2017distributional] as

(2) $\mathcal{T}^\pi Z(s, a) :\stackrel{D}{=} R(s, a) + \gamma P^\pi Z(s, a),$

where $\stackrel{D}{=}$ denotes equality in distribution, and the transition operator $P^\pi$ is defined as $P^\pi Z(s, a) :\stackrel{D}{=} Z(S', A')$ with $S' \sim P(\cdot \mid s, a)$, $A' \sim \pi(\cdot \mid S')$. The optimality version is similarly any $\mathcal{T} = \mathcal{T}^{\pi^*}$ where $\pi^*$ is an optimal policy w.r.t. expected return; note that this is not necessarily unique when there are multiple optimal policies. [rowland2018analysis] showed that $\mathcal{T}^\pi$ is a contraction in the Cramér metric $\ell_2$, where for CDFs $F$ and $G$

(3) $\ell_2(F, G) = \left( \int_{\mathbb{R}} \left( F(x) - G(x) \right)^2 \, dx \right)^{1/2}.$
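For intuition, the Cramér distance between two discrete distributions supported on a common grid can be computed directly from their CDFs. A small sketch (names and grid are our own illustration):

```python
import numpy as np

def cramer_l2(p, q, dz):
    """l2 (Cramer) distance between two discrete distributions with
    atom probabilities p, q on a common grid of spacing dz: the
    integral of the squared CDF difference, approximated on the grid."""
    F, G = np.cumsum(p), np.cumsum(q)
    return np.sqrt(dz * np.sum((F - G) ** 2))

# Two point masses one grid-cell apart have distance sqrt(dz) = 1 here.
p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print(cramer_l2(p, q, 1.0))  # -> 1.0
```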
One of the canonical algorithms in distributional RL is categorical distributional RL (CDRL or C51) [bellemare2017distributional], which represents the return distribution as a discrete distribution with fixed support on $N$ equally spaced atoms $z_1 \leq \dots \leq z_N$; the discrete distribution is parameterized as $\eta(s, a) = \sum_{i=1}^N p_i(s, a) \delta_{z_i}$, where $\delta_z$ denotes a Dirac measure at $z$. Essentially, C51 uses a sample transition $(s, a, r, s')$ to perform an approximate Bellman backup $\eta(s, a) \leftarrow \Pi \hat{\mathcal{T}} \eta(s, a)$, where $\hat{\mathcal{T}}$ is a sample-based Bellman operator and $\Pi$ is a projection back onto the support $\{z_1, \dots, z_N\}$ of the discrete distribution.
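A minimal sketch of the C51 projection step for a single sample transition may make this concrete (the helper below is our own illustrative implementation, not the paper's code):

```python
import numpy as np

def project_sample(z, p, r, gamma, v_min, v_max, dz):
    """Distribute the mass of each shifted atom r + gamma * z_j onto its
    two nearest neighbours in the fixed support z (the C51 projection)."""
    out = np.zeros_like(p)
    tz = np.clip(r + gamma * z, v_min, v_max)   # shifted, clipped atoms
    b = (tz - v_min) / dz                       # fractional grid index
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    for j in range(len(z)):
        if lo[j] == hi[j]:                      # lands exactly on an atom
            out[lo[j]] += p[j]
        else:                                   # split between neighbours
            out[lo[j]] += p[j] * (hi[j] - b[j])
            out[hi[j]] += p[j] * (b[j] - lo[j])
    return out

z = np.array([0.0, 1.0, 2.0])                   # support with dz = 1
p = np.array([0.0, 1.0, 0.0])                   # all mass on z = 1
q = project_sample(z, p, r=0.5, gamma=1.0, v_min=0.0, v_max=2.0, dz=1.0)
print(q)  # mass at the shifted atom 1.5 splits evenly -> [0. , 0.5, 0.5]
```

The projection preserves total probability mass, which is what makes the combined operator well defined on the space of discrete distributions.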
Optimistic Distributional Operator
In contrast to the typical RL setup where an agent tries to maximize its expected return, we seek to learn a stationary policy that maximizes the CVaR of the return at risk level $\alpha$.¹ To find such policies quickly, we follow the optimism-in-the-face-of-uncertainty (OFU) principle and introduce optimism into our CVaR estimates to guide exploration. While adding a bonus to rewards is a popular approach to optimism in the standard expected-return case [ostrovski2017count], we here follow a different approach and introduce optimism into our return estimates by shifting the empirical CDFs. Formally, consider a return distribution with CDF $F$. We define the optimism operator $\mathcal{O}$ as

(4) $(\mathcal{O} F)(x) = \begin{cases} \max\left( F(x) - \frac{c}{\sqrt{n}},\; 0 \right) & x < b, \\ 1 & x \geq b, \end{cases}$

where $c$ is a constant and $n$ is short for $n(s, a)$. In the definition above, $n(s, a)$ is the number of times the pair $(s, a)$ has been observed so far, or an approximation such as pseudo-counts [bellemare2016unifying]. By shifting the cumulative distribution function down, this operator essentially moves probability mass from the lower tail to the highest possible value $b$. An illustration is provided in Figure 1.

¹Note that the optimal policy at any state can be non-stationary [shapiro2009lectures], as it depends on the sum of rewards achieved up to that state. For simplicity, as in [dabney2018distributional], we instead seek a stationary policy, which can in general be suboptimal but typically still achieves high CVaR, as observed in our experiments.
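A minimal numerical sketch of this operator on a discrete empirical CDF (our own illustrative code; the shift constant and count below are placeholders):

```python
import numpy as np

def optimistic_cdf(F, n, c):
    """Shift an empirical CDF down by c / sqrt(n), clip at 0, and keep
    F = 1 at the top atom: the clipped lower-tail mass effectively
    moves to the highest possible outcome b."""
    G = np.maximum(F - c / np.sqrt(n), 0.0)
    G[-1] = 1.0  # a CDF must still reach 1 at the maximum return b
    return G

F = np.array([0.25, 0.5, 0.75, 1.0])   # empirical CDF on four atoms
print(optimistic_cdf(F, n=4, c=0.5))   # shift = 0.25 -> [0.  0.25 0.5  1. ]
```

As the count $n$ grows, the shift vanishes and the optimistic CDF converges back to the empirical one, which is exactly the behaviour OFU-style exploration requires.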
This approach to optimism is motivated by an application of the DKW inequality to the empirical CDF. As shown recently by [thomas2019concentration], this can yield tighter upper confidence bounds on the CVaR.
Theoretical Analysis
The optimistic operator introduced above operates on the entire return distribution, and the algorithm introduced in the next section applies this optimistic operator to estimated return-to-go distributions. As such, it belongs to the family of distributional RL methods [dabney2018distributional]. These methods are a recent development and come with strong asymptotic convergence guarantees when used for policy evaluation in tabular MDPs [rowland2018analysis]. Yet finite-sample guarantees such as regret or PAC bounds still remain elusive for distributional RL policy optimization algorithms.
A key technical challenge in proving performance bounds for risk-sensitive policy optimization in RL is that convergence of the distributional Bellman optimality operator cannot be guaranteed in general. Prior results show convergence of this operator only when the optimization process itself computes a policy that maximizes expected return, as in Q-learning [rowland2018analysis, Theorem 2]. If the goal is instead to leverage distributional information to compute a policy that maximizes something other than expected outcomes, such as the risk-sensitive policies we consider here, then to our knowledge no prior theoretical results are known in the reinforcement learning setting. However, there is promising empirical evidence that one can compute risk-sensitive policies using distributional Bellman operators [dabney2018implicit], which suggests that more theoretical results may be possible.
Here we take a first step towards this goal. Our primary aim in this work is to provide tools to introduce optimism into distributional returntogo estimates to guide sampleefficient exploration for CVaR. Therefore, our theoretical analysis focuses on showing that this form of optimism does not harm convergence and is indeed a principled way to obtain optimistic CVaR estimates.
First, we prove that the optimism operator is a non-expansion in the Cramér distance. This result shows that the operator can be composed with other contraction operators without negatively impacting convergence behaviour; in particular, we can guarantee convergence with the distributional Bellman backup. For any $c \geq 0$, the operator $\mathcal{O}$ is a non-expansion in the Cramér distance $\ell_2$. This implies that optimistic distributional Bellman backups $\mathcal{O} \mathcal{T}^\pi$ and their projected versions are contractions in $\ell_2$, and iterates of these operators converge in $\ell_2$ to a unique fixed point.
Next, we provide theoretical evidence that this operator indeed produces optimistic CVaR estimates. Consider batch policy evaluation in MDPs with finite state and action spaces. Assume that we have collected a fixed number of samples (which can vary across states and actions) and build an empirical model of the MDP. For any policy $\pi$, let $\hat{\mathcal{T}}^\pi$ denote the distributional Bellman operator in this empirical MDP. Then we indeed achieve optimistic estimates, by the following result:
Let the shift parameter $c$ in the optimistic operator be sufficiently large. Then with probability at least $1 - \delta$, the iterates of $\mathcal{O} \hat{\mathcal{T}}^\pi$ converge, for any risk level $\alpha$ and any initial $(s, a)$, to an optimistic estimate of the policy's conditional value at risk; that is, with probability at least $1 - \delta$, the limiting CVaR estimate is at least the policy's true $\mathrm{CVaR}_\alpha$.
This theorem uses the DKW inequality, which to the best of our knowledge has not previously been used for MDPs. Note that the statement guarantees optimism for all risk levels $\alpha$ simultaneously without paying a penalty for it. Since we estimate the transitions and rewards for each state and action separately, one generally does not expect to be able to use a substantially smaller shift parameter; thus the theorem above is unimprovable in that sense. Specifically, we avoid a polynomial dependency on the number of states in the shift parameter by combining two techniques: (1) concentration inequalities w.r.t. the optimal CVaR of the next state for a certain finite set of risk levels, and (2) a covering argument to obtain optimism for all of the infinitely many $\alpha \in (0, 1]$. This is substantially more involved than the expected-reward case.
These results are a key step towards finite-sample analyses. In future work it would be very interesting to obtain a convergence analysis for distributional Bellman optimality operators in general, though this is outside the scope of the current paper. Such a result could lead to sample-complexity guarantees when combined with our existing analysis.
Algorithm
In the policy evaluation case, where we would like to compute optimistic estimates of the CVaR of a given policy $\pi$, our algorithm essentially performs an approximate version of the optimistic Bellman update $Z \leftarrow \mathcal{O} \mathcal{T}^\pi Z$, where $\mathcal{T}^\pi$ is the distributional Bellman operator. For the control case, where we would like to learn a policy that maximizes CVaR, we instead define a distributional Bellman optimality operator. Analogous to prior work [bellemare2017distributional], this is any operator $\mathcal{T}$ that satisfies $\mathcal{T} Z = \mathcal{T}^\pi Z$ for some policy $\pi$ that is greedy w.r.t. CVaR at level $\alpha$. Our algorithm then performs an approximate version of the optimistic Bellman backup $Z \leftarrow \mathcal{O} \mathcal{T} Z$, shown in Algorithm 1.
The main structure of our algorithm resembles categorical distributional reinforcement learning (C51) [bellemare2017distributional]. In a similar vein, our algorithm also maintains a return distribution estimate for each state-action pair, represented as a set of weights $w_i(s, a)$ for $i = 1, \dots, N$. These weights represent a discrete distribution with outcomes at equally spaced locations $z_1 \leq \dots \leq z_N$, each $\Delta z$ apart. The current probability assigned to outcome $z_i$ at $(s, a)$ is denoted by $p_i(s, a)$, where the atom probabilities are given by a differentiable model such as a neural network, similar to C51. Note that other parameterized representations of the weights [bellemare2017distributional] are straightforward to incorporate.

The main differences between Algorithm 1 and existing distributional RL algorithms (e.g. C51) are highlighted in red. We first apply the optimism operator to our successor distribution to form an optimistic CDF for all actions. This operator should encourage exploring actions that might lead to higher-CVaR policies for our input risk level $\alpha$. These optimistic CDFs are also used to decide on the successor action in the control setting. Then, similar to C51, we apply the Bellman operator to each atom and distribute the probability of each shifted atom to its immediate neighbours, where we calculate the probability mass with the optimistic CDF (see the corresponding lines of Algorithm 1).
Following [bellemare2017distributional], we train this model using the cross-entropy loss, which for a particular state transition at time $t$ is

(5) $L(\theta) = -\sum_{i=1}^N \tilde{w}_i \log p_i(s_t, a_t; \theta),$

where $\tilde{w}$ are the weights of the target distribution computed in Algorithm 1. In the tabular setting we can directly update the probability mass by $p_i(s_t, a_t) \leftarrow (1 - \eta)\, p_i(s_t, a_t) + \eta\, \tilde{w}_i$, where $\eta$ is the learning rate.
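The tabular update is a simple mixture step toward the target distribution; a minimal sketch (our own notation, with the target weights assumed given):

```python
import numpy as np

def tabular_update(p, target, lr):
    """Mix the current atom probabilities toward the (optimistic) target
    distribution with step size lr. One natural tabular analogue of the
    gradient step on the cross-entropy loss."""
    return (1.0 - lr) * p + lr * target

p = np.array([0.5, 0.5])            # current atom probabilities
target = np.array([0.0, 1.0])       # target weights from the backup
p = tabular_update(p, target, lr=0.2)
print(p)  # -> [0.4 0.6]
```

Because both inputs are probability vectors, the mixture stays a valid probability vector, so no renormalization is needed after the update.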
In tabular settings the counts $n(s, a)$ can be stored and used directly; however, this is not the case in continuous settings. For this reason, we adopt the pseudo-count estimation method proposed by [ostrovski2017count] and replace $n(s, a)$ with a pseudo-count $\hat{n}(s, a)$ in the optimistic distributional operator (Equation 4). Let $\rho$ be a density model and $\rho_t(s, a)$ the probability assigned to the state-action pair $(s, a)$ by the model after $t$ training steps. The prediction gain of $\rho$ is defined as

(6) $\mathrm{PG}_t(s, a) = \log \rho'_t(s, a) - \log \rho_t(s, a),$

where $\rho'_t(s, a)$ is the probability assigned to $(s, a)$ if the model were trained on that same pair one more time. Now we define the pseudo-count of $(s, a)$ as

(7) $\hat{n}(s, a) = \left( e^{\,c' \cdot \mathrm{PG}_t(s, a)_+} - 1 \right)^{-1},$

where $c'$ is a constant hyperparameter, and $(\cdot)_+$ thresholds the value of the prediction gain at 0.
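A small sketch of this pseudo-count computation (the exact placement of the scaling constant is our reading of the formula above, so treat it as an assumption):

```python
import numpy as np

def pseudo_count(pg, c=0.1):
    """Pseudo-count derived from the prediction gain PG, following the
    form used by Ostrovski et al.; c is a scaling hyperparameter and
    the gain is thresholded at zero so counts stay positive."""
    g = max(pg, 0.0)
    if g == 0.0:
        # No prediction gain: the pair is treated as fully known.
        return float('inf')
    return 1.0 / np.expm1(c * g)

# With c = 1 and PG = ln 2, the pseudo-count is 1 / (2 - 1) = 1.
print(pseudo_count(np.log(2.0), c=1.0))  # approximately 1.0
```

Intuitively, a large prediction gain means the pair is still surprising to the density model, so the pseudo-count is small and the optimistic shift in Equation (4) stays large.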
Our setting differs from [ostrovski2017count] in that we have to compute the count before taking the action $a$. A naive approach would be to try all actions, training the model after each, to compute the counts, but this method is slow and requires the environment to support an undo action. Instead, we can estimate $\mathrm{PG}(s, a)$ for all actions as follows. Consider the density model parameterized by $\theta$, $\rho_\theta$. After observing $(s, a)$, the training step to maximize the log-likelihood will update the parameters by $\theta' = \theta + \eta \nabla_\theta \log \rho_\theta(s, a)$, where $\eta$ is the learning rate. So we can approximate the new log probability using a first-order Taylor expansion:

$\log \rho_{\theta'}(s, a) \approx \log \rho_\theta(s, a) + \eta\, \lVert \nabla_\theta \log \rho_\theta(s, a) \rVert^2.$

This calculation suggests that the prediction gain can be estimated just by computing the gradient of the log-likelihood at a state-action pair, i.e., $\mathrm{PG}(s, a) \approx \eta\, \lVert \nabla_\theta \log \rho_\theta(s, a) \rVert^2$. As discussed in [graves2017automated], this estimate of the prediction gain is biased, but empirically we have found it to perform well.
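The first-order approximation can be sanity-checked numerically on a toy density model. Here we use a simple softmax density over discrete symbols purely for illustration (an assumption for this sketch; the paper's model is a realNVP over continuous states):

```python
import numpy as np

# Toy check of PG ~= lr * ||grad log rho||^2 for a softmax density model
# over K discrete "state-action" symbols with logits theta.
def log_rho(theta, i):
    """Log-probability of symbol i under softmax(theta)."""
    return theta[i] - np.log(np.sum(np.exp(theta)))

def grad_log_rho(theta, i):
    """Gradient of log_rho w.r.t. theta: one-hot(i) - softmax(theta)."""
    g = -np.exp(theta) / np.sum(np.exp(theta))
    g[i] += 1.0
    return g

theta = np.zeros(4)
i, lr = 2, 0.01
g = grad_log_rho(theta, i)
pg_approx = lr * (g @ g)                                     # first-order estimate
pg_exact = log_rho(theta + lr * g, i) - log_rho(theta, i)    # actual one-step gain
assert abs(pg_approx - pg_exact) < 1e-3                      # close for small lr
```

The agreement degrades as the learning rate grows, which is one source of the bias noted by [graves2017automated].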
Experimental Evaluation
We validate our algorithm empirically in three simulated environments against baseline approaches. Finance, health and operations are common areas where risksensitive strategies are important, and we focus on two health domains and one operations domain. Details, where omitted, are provided in the supplemental material.
Machine Replacement
Machine repair and replacement is a classic example in the risk-sensitive literature, though to our knowledge no prior work has considered how to quickly learn a good risk-sensitive policy for such domains. Here we consider a minor variant of a prior setting [delage2010percentile]. Specifically, as shown in Figure 2, the environment consists of a chain of states (25 in our experiments). There are two actions: replace and don't replace the machine. Choosing replace at any state terminates the episode, while choosing don't replace moves the agent to the next state in the chain. At the end of the chain, choosing don't replace terminates the episode with a high-variance cost, and choosing replace terminates the episode with a higher-mean but lower-variance cost. This environment is an especially challenging exploration task due to the chain structure of the MDP, as well as the high variance of the reward distributions when taking actions in the last state. Additionally, in this MDP it is feasible to compute the optimal policy exactly, which allows us to compare the learned policy to the true optimal CVaR policy. Note that the optimal policy for maximizing $\mathrm{CVaR}_\alpha$ is to replace at the final state in the chain to avoid the high-variance alternative; in contrast, the optimal policy for expected return always chooses don't replace.

HIV Treatment
In order to test our algorithm on a larger continuous state space, we leverage an HIV treatment simulator. The environment is based on the implementation by [geramifard2015rlpy] of the physical model described in [ernst2006clinical]. The patient state is represented as a continuous vector, and the reward is a function of the number of free HIV viruses, the immune response of the body to HIV, and side effects. There are four actions, each determining which drugs are administered for the next treatment period: Reverse Transcriptase Inhibitors (RTI), Protease Inhibitors (PI), neither, or both. We chose a larger number of days per time step compared to the typical setup to facilitate faster experimentation. This design choice also makes the exploration task harder, since taking one wrong action can drastically destabilize a patient's trajectory. The originally proposed model was deterministic, which makes the CVaR policy identical to the policy optimizing the expected value. Such simulators are rarely a perfect proxy for real systems, and in our setting we add Gaussian noise to the efficacy of each drug (the RTI and PI efficacy parameters in [ernst2006clinical]). This change necessitates risk-sensitive policies in this environment.

Table 6: Fraction of episodes with a severely poor outcome during learning, for the ε-greedy baseline and our optimistic method (mean with 95% confidence interval).

Patient      ε-greedy         Ours (optimistic)
Adult#003    11.2% ± 3.6%     4.2% ± 2.3%
Adult#004    2.3% ± 0.3%      1.4% ± 0.6%
Adult#005    3.3% ± 0.3%      1.7% ± 0.6%
Diabetes 1 Treatment
Patients with type 1 diabetes regulate their blood glucose level with insulin in order to avoid hypoglycemia or hyperglycemia (very low or very high blood glucose levels, respectively). A simulator has been created [man2014uva] that is an open-source version of a simulator approved by the FDA as a substitute for certain pre-clinical trials. The state is a continuous-valued vector of the current blood glucose level and the amount of carbohydrate intake (through food). The action space is discretized into 6 levels of a bolus insulin injection. The reward function is defined similarly to prior work [bastani2014model] as a risk function of the estimate of bg (blood glucose) in mmol/L.
Additionally, we inject two sources of stochasticity into the taken action: first, we add Gaussian noise to the action; second, we delay the time of the injection by at most 5 steps, where the probability of injection at time $t$ is higher than at time $t + 1$, following a power law. Each simulation lasts for 200 steps, during which a patient eats five meals. The agent chooses an action after each meal, and after the 200 steps each patient resets to its initial state.
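The delayed-injection mechanism can be sketched as sampling a delay from a truncated power law (the exponent below is an assumed placeholder; the paper only states a power-law decay over at most 5 steps):

```python
import numpy as np

def sample_delay(max_delay=5, exponent=2.0, rng=None):
    """Sample an injection delay in {0, ..., max_delay} with probabilities
    decaying as a power law, so shorter delays are more likely."""
    rng = rng or np.random.default_rng()
    w = 1.0 / (np.arange(max_delay + 1) + 1.0) ** exponent  # 1/(d+1)^k
    return int(rng.choice(max_delay + 1, p=w / w.sum()))

d = sample_delay(rng=np.random.default_rng(0))
assert 0 <= d <= 5  # the action executes within 5 steps of the decision
```

Since each simulation step is 3 minutes, a delay of up to 5 steps corresponds to the insulin arriving up to 15 minutes late.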
This domain also readily offers a suite of related tasks, since the environment simulates 30 patients with slightly different dynamics. Tuning hyperparameters on the same task can be misleading [henderson2018deep], as is the case in our two previous benchmarks. In this setting we tune baselines and our method on one patient, and test the performance on different patients.
Baselines and Experimental Setup
The majority of prior risk-sensitive RL work has not focused on efficient exploration, and there has been very little deep distributional RL work focused on risk sensitivity. Our key contribution is to evaluate the impact of more strategic exploration on the efficiency with which a risk-sensitive policy can be learned. We compare to the following approaches:

ε-greedy CVaR: In this benchmark we use the same algorithm, except we do not introduce the optimism operator, instead using an ε-greedy approach for exploration. This benchmark can be viewed as analogous to the distributional RL method C51 [bellemare2017distributional] if the computed policy optimized for CVaR instead of expected reward.

IQN ε-greedy CVaR: In this benchmark we use an implicit quantile network (IQN) that also uses the ε-greedy method for exploration [dabney2018implicit]. We adopted the Dopamine implementation of IQN [castro18dopamine].
CVaR-AC: An actor-critic method proposed by [chow2014algorithms] that maximizes the expected return while satisfying an inequality constraint on the CVaR. This method relies on the stochasticity of the policy for exploration.
Note that a comparison to an expectation maximizing algorithm is uninformative since such approaches are maximizing different (nonrisksensitive) objectives.
All of these algorithms use hyperparameters, and it is well recognized that ε-greedy algorithms can often perform quite well if their hyperparameters are well tuned. To provide a fair comparison, we evaluated a number of schedules for reducing the ε parameter for both ε-greedy and IQN, and a small set (4 to 7) of values for the optimism parameter for our method. We used the specification described in Appendix C of [chow2014algorithms] for CVaR-AC.

The system architectures used in the continuous settings are identical for Baseline 1 (ε-greedy) and our method. They consist of 2 hidden layers of size 32 with ReLU activation for Diabetes 1 Treatment, and 4 hidden layers of size 128 with ReLU activation for HIV Treatment, both followed by a softmax layer for each action. Similarly, for IQN we used the same architecture, followed by a cosine embedding function and a fully connected layer of size 128 for HIV Treatment (32 for Diabetes 1 Treatment) with ReLU activation, followed by a softmax layer. The density model is a realNVP [dinh2016density] with 3 hidden layers, each of size 64.

All results are averaged over 10 runs and we report 95% confidence intervals. We report the performance of ε-greedy at evaluation time (setting ε = 0), which is the best performance of ε-greedy.
For the Diabetes Treatment domain, hyperparameters are optimized only on adult#001. We then report results of the methods using those hyperparameters on adult#003, adult#004 and adult#005.
Results and Discussion
Results on the machine replacement environment (Figure 3), HIV Treatment (Figure 4) and Diabetes 1 Treatment (Figure 5) all show that our optimistic algorithm achieves better performance much faster than the baselines.
In Machine Replacement (Figure 3) we see that our method quickly converges to the optimal CVaR performance. Unfortunately, despite our best efforts, our implementation of CVaR-AC did not perform well even on this simplest environment, so we do not show the performance of this method on the other environments. One challenge here is that CVaR-AC has a significant number of hyperparameters, including three different learning-rate schedules for the optimization process, the initial Lagrange multipliers and the kernel functions.
In HIV Treatment we also see a clear and substantial benefit of our optimistic approach over the baseline ε-greedy approach and IQN (Figure 4).
Figure 5 is particularly encouraging, as it shows the results for the diabetes simulator across 3 patients, where the hyperparameters were fixed after optimizing for a separate patient. Since in real settings it would commonly be necessary to fix the hyperparameters in advance, this result provides a nice demonstration that the optimistic approach can consistently equal or significantly improve over an ε-greedy policy in related settings, similar to the well-known results on Atari in which hyperparameters are optimized for one game and then used for multiple others.
”Safer” Exploration.
Our primary contribution is a new algorithm to learn risk-sensitive policies quickly, with less data. However, an interesting side benefit of such a method may be that the number of extremely poor outcomes experienced over time is also reduced, not because we explicitly prioritize a form of safe exploration, but because our algorithm may enable faster convergence to a safe policy. To evaluate this, we consider a risk measure proposed by [clarke2009statistical], which quantifies the risk of a severe medical condition based on how close a patient's glucose level is to hypoglycemia (blood glucose < 3.9 mmol/L) or hyperglycemia (blood glucose > 10 mmol/L).
Table 6 shows the fraction of episodes in which each patient experienced a severely poor outcome under each algorithm while learning. Optimism-based exploration approximately halves the number of episodes with severely poor outcomes, highlighting a side benefit of our optimistic approach: more quickly learning a good safe policy.
Related Work
Optimizing policies for risk sensitivity in MDPs has long been studied, with policy gradient [sampling, tamar2015policy], actor-critic [tamar2013variance] and TD methods [tamar2013variance, sato2001td]. While most of this work considers mean-variance trade-off objectives, [chow2015risk] establish a connection between optimizing CVaR and robustness to modeling errors, presenting a value iteration algorithm; in contrast, we do not assume access to transition and reward models. [chow2014algorithms] present policy gradient and actor-critic algorithms for an expectation-maximizing objective with a CVaR constraint. None of these works considers systematic exploration; they rely on heuristics such as ε-greedy or on the stochasticity of the policy. Instead, we focus on how to explore systematically to find a good CVaR policy.

Our work builds upon recent advances in distributional RL [bellemare2017distributional, rowland2018analysis, dabney2018distributional], which are still concerned with optimizing expected return. Notably, [dabney2018implicit] trains risk-averse and risk-seeking agents, but does not address the exploration problem or attempt to find optimal policies quickly.
[dilokthanakul2018deep] use risk-averse objectives to guide exploration for good performance w.r.t. expected return. [moerland2018potential] leverage the return distribution learned in distributional RL as a means for optimism in deterministic environments. [mavrin2019distributional] follow a similar pattern but can handle stochastic environments by disentangling intrinsic and parametric uncertainty. While they also evaluate the policy that picks the VaR-greedy action in one experiment, their algorithm still optimizes expected return during learning. In general, these approaches are fundamentally different from ours, which efficiently learns CVaR policies in stochastic environments by introducing optimism into the learned return distribution.
Conclusion
We present a new algorithm for quickly learning CVaR-optimal policies in Markov decision processes. This algorithm is the first to leverage optimism in combination with distributional reinforcement learning to learn risk-averse policies in a sample-efficient manner. Unlike existing work on expected-return criteria, which relies on reward bonuses for optimism, we introduce optimism by directly modifying the target return distribution, and we provide a theoretical justification that in the evaluation case for finite MDPs this indeed yields optimistic estimates. We further empirically observe significantly faster learning of CVaR-optimal policies by our algorithm compared to existing baselines on several benchmark tasks. These include simulated healthcare tasks where risk-averse policies are of particular interest: HIV medication treatment and insulin pump control for type 1 diabetes patients.
References
Appendix A Appendix
Experimental Details
Machine Replacement
The machine replacement environment consists of a chain of states (we use 25 in the experiment). Action replace transitions to the terminal state with a Gaussian cost, while action don't replace has a small cost and transitions to the next state in the chain. For the last state, however, action don't replace terminates the episode with a high-variance Gaussian cost. For the C51 algorithm we use 51 atoms and a constant learning rate.
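A minimal sketch of this chain environment (the cost means and variances below are placeholders, not the paper's exact values):

```python
import numpy as np

class MachineReplacement:
    """Chain MDP sketch: 'replace' (action 1) ends the episode with a
    low-variance cost; 'don't replace' (action 0) advances the chain,
    and at the last state ends the episode with a high-variance cost."""

    def __init__(self, n_states=25, rng=None):
        self.n = n_states
        self.rng = rng or np.random.default_rng()
        self.s = 0

    def reset(self):
        self.s = 0
        return self.s

    def step(self, action):
        if action == 1:  # replace: higher-mean, low-variance terminal cost
            return None, -self.rng.normal(5.0, 0.1), True
        if self.s == self.n - 1:  # end of chain: high-variance terminal cost
            return None, -self.rng.normal(3.0, 5.0), True
        self.s += 1  # otherwise walk down the chain at zero cost
        return self.s, 0.0, False
```

A CVaR-maximizing agent should learn to replace at the final state to avoid the high-variance outcome, while an expected-return agent never replaces; the gap between the two is what makes this a useful diagnostic task.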
Tuning: We use ε-greedy with a schedule that decays ε linearly over the initial time steps and stays constant afterwards. This schedule achieved the best performance in our experiments when compared to other linear schedules and to exponential decay schedules {(0.9, 0.99, 5), (0.9, 0.99, 20), (0.9, 0.99, 2), (0.9, 0.99, 30), (0.5, 0.99, 5)}. We also ran our algorithm with several optimism values.
For the actor-critic method we use a CVaR limit of 10, radial basis functions as kernels, and the other hyperparameters as described in the appendix of [chow2014algorithms].

Additional Experiments: In addition to the risk level used in the main experiments, we observe the same performance gain for other risk levels. As shown in Figure 7, optimism-based exploration shows a significant gain over ε-greedy exploration at two additional risk levels.
Type 1 Diabetes Simulator
An open-source implementation of the type 1 diabetes simulator [simglucose] simulates 30 different virtual patients: 10 children, 10 adolescents and 10 adults. For our experiments in this paper we used adult#003, adult#004 and adult#005. Additionally, we used the "Dexcom" sensor for CGM (to measure the blood glucose level) and "Insulet" as the choice of insulin pump. All simulations last 10 hours for each patient, and after 10 hours the patient resets to the initial state. Each step of the simulation is 3 minutes.
The state space is a continuous vector of size 2 (glucose level, meal size), where glucose level is the amount of glucose measured by the "Dexcom" sensor and meal size is the amount of carbohydrate in each meal.
Action space is defined as (bolus, basal=0) where amount of bolus injection discretized by 6 bins between 30 (max bolus, a property of the "Insulet" insulin pump) and 0 (no injection). Additionally we inject two source of stochasticity to the taken action, assume action at time is the agent’s decision, then we take the action at time where:
where is drawn from the power-law distribution and . Note that this delays the action by at most 5 steps, where the probability of taking the action at time is higher than at time , following the power law. Since each simulation step is 3 minutes, the patient might take the insulin up to 15 minutes after the time prescribed by the agent.
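The delayed-action stochasticity above can be sketched as follows. The power-law exponent and the delay cap of 5 steps come from the description; the exponent value and the wrapper interface are assumptions for illustration:

```python
import numpy as np

def sample_delay(rng, max_delay=5, alpha=2.0):
    """Sample a delay in {0, ..., max_delay} with power-law weights
    p(d) proportional to (d + 1)**(-alpha), so shorter delays are more
    likely. alpha is a placeholder value."""
    weights = (np.arange(max_delay + 1) + 1.0) ** (-alpha)
    probs = weights / weights.sum()
    return int(rng.choice(max_delay + 1, p=probs))

class DelayedActionWrapper:
    """Buffer each submitted action and release it `delay` steps later,
    mimicking a patient taking insulin after the prescribed time."""
    def __init__(self, max_delay=5, seed=0):
        self.rng = np.random.default_rng(seed)
        self.max_delay = max_delay
        self.pending = {}  # release time step -> total action released then
        self.t = 0

    def submit(self, action):
        d = sample_delay(self.rng, self.max_delay)
        self.pending[self.t + d] = self.pending.get(self.t + d, 0.0) + action

    def pop_executed(self):
        """Return the action actually executed at the current step."""
        a = self.pending.pop(self.t, 0.0)
        self.t += 1
        return a
```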
The reward structure is defined similarly to prior work [bastani2014model] as follows:
where is the estimate of bg (blood glucose) in mmol/L rather than mg/dL. Additionally, if the amount of glucose is less than 39 mg/dL, the agent incurs a penalty of .
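A common realization of such a risk-based reward is the Kovatchev-style blood glucose risk index; the constants below are the standard ones from that literature and may differ from those in [bastani2014model], and the hypoglycemia penalty magnitude is a placeholder for the elided constant:

```python
import math

def glucose_risk(bg_mgdl):
    """Kovatchev-style blood glucose risk index, with bg in mg/dL.
    Risk is near zero around ~112 mg/dL and grows toward both
    hypo- and hyperglycemia. Constants are the standard published ones."""
    f = 1.509 * (math.log(bg_mgdl) ** 1.084 - 5.381)
    return 10.0 * f * f

def reward(bg_mgdl, hypo_threshold=39.0, hypo_penalty=-1e5):
    """Negative risk as reward, plus a large penalty below 39 mg/dL.
    The penalty magnitude is a hypothetical placeholder."""
    if bg_mgdl < hypo_threshold:
        return hypo_penalty
    return -glucose_risk(bg_mgdl)
```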
We generated a meal-plan scenario for all patients with meals of size 60, 20, 60 and 20 CHO, scheduled 1 hour, 3 hours, 5 hours and 7 hours after the start of the simulation. Note that this makes the simulation horizon 200 steps, with 5 actionable steps (the initial state and after each meal).
Categorical Distributional RL: The C51 model consists of 2 hidden layers, each of size 32 with a ReLU activation function, followed by output heads of 51 neurons each with a softmax activation, representing the return distribution of each action.
We used the Adam optimizer with learning rate , , and . We set , , 51 probability atoms, and a batch size of 32. For computing the CVaR we use 50 samples of the return.
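The sample-based CVaR estimate can be sketched as follows: draw return samples from the learned categorical distribution, then average the worst α-fraction of them. The function names are illustrative, not the paper's code:

```python
import numpy as np

def sample_returns_c51(probs, atoms, n_samples=50, seed=0):
    """Draw return samples from a C51 categorical distribution, given
    probabilities `probs` over the fixed support `atoms`."""
    rng = np.random.default_rng(seed)
    return rng.choice(atoms, size=n_samples, p=probs)

def cvar_from_samples(returns, alpha=0.25):
    """Empirical CVaR_alpha: the mean of the worst alpha-fraction of
    sampled returns (higher return = better, so 'worst' = lowest)."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()
```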
Density Model: For the log-likelihood density model we used realNVP [dinh2016density] with 3 layers, each of size 64. The input to the model is the concatenated vector . We used the same optimizer hyperparameters as in the C51 model, and a constant for computing the pseudo-count.
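One common way to turn a density model's log-likelihoods into a pseudo-count, which may be what is intended here, is the prediction-gain approximation of Bellemare et al.; the formula and the bonus shape below are that standard construction, labeled as an assumption rather than the paper's exact recipe:

```python
import math

def pseudocount(log_prob_before, log_prob_after):
    """Pseudo-count from the density model's prediction gain
    PG = log rho'(x) - log rho(x) on an observed x:
    n_hat ~= 1 / (exp(PG) - 1). PG is clipped below to keep the
    count finite when the model barely updates."""
    pg = max(log_prob_after - log_prob_before, 1e-8)
    return 1.0 / math.expm1(pg)

def exploration_bonus(n_hat, c=1.0):
    """Count-based optimism bonus c / sqrt(n_hat); the constant c is the
    paper's (elided) hyperparameter, so c=1.0 is a placeholder."""
    return c / math.sqrt(n_hat)
```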
Tuning: We tuned our method and ε-greedy on patient adult#001 and used the same parameters for the other patients. We tried 5 different linear schedules for ε-greedy, {(0.9, 0.1, 2), (0.9, 0.05, 4), (0.9, 0.05, 6), (0.9, 0.3, 4), (0.9, 0.05, 10)}, where the first element is the initial , the second element is the final and the third element is the episode ratio (i.e., ε decays linearly from its initial to its final value over that fraction of the total number of episodes). Additionally, we tried 5 exponential decay schedules for ε-greedy of the form (, , ): {(0.9, 0.99, 5), (0.9, 0.99, 20), (0.9, 0.99, 2), (0.9, 0.99, 30), (0.5, 0.99, 5)}, where . The first of the exponential decay set performed the best. We also tested our algorithm with constant optimism values of , of which performed the best.
HIV Simulator
The environment is an implementation of the physical model described in [ernst2006clinical]. The state space has dimension 6 with , and the action space has size 4, indicating the efficacy of being on the treatment. As described in [ernst2006clinical], takes values and . We also add stochasticity to the action via random Gaussian noise, so the efficacy of a drug is computed as .
The reward structure is defined as in prior work [ernst2006clinical]. We simulate for 1000 time steps, where the agent can take an action at 50 steps (every 20 simulation steps) and actions remain constant within each interval. During training we normalize the rewards by dividing them by .
Categorical Distributional RL: The C51 model consists of 4 hidden layers, each of size 128 with a ReLU activation function, followed by output heads of 151 neurons each with a softmax activation, representing the return distribution of each action.
We used the Adam optimizer with a learning rate schedule decaying from to over half the number of episodes, , and . We set , , 151 probability atoms, and a batch size of 32. For computing the CVaR we use 50 samples of the return.
Implicit Quantile Network: The IQN model consists of 4 hidden layers of size 128 with ReLU activations. An embedding of size 64 is computed with a ReLU; we then take the element-wise product of this embedding and the output of the 4 hidden layers, followed by a fully connected layer of size 128 with ReLU activation and a softmax layer. We used 8 samples for and , and 32 quantiles.
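The quantile embedding in IQN is conventionally the cosine-basis embedding of Dabney et al.; a minimal NumPy sketch is below, assuming that standard form. The random weights stand in for parameters that are learned in the actual network, and the basis size is a placeholder:

```python
import numpy as np

def quantile_embedding(tau, dim=64, n_basis=64, seed=0):
    """IQN-style cosine embedding of a quantile level tau in [0, 1]:
    phi(tau) = ReLU(W @ cos(pi * i * tau) + b), i = 0..n_basis-1.
    W and b are random placeholders here; in the network they are learned,
    and phi(tau) is multiplied element-wise with the state features."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((dim, n_basis)) / np.sqrt(n_basis)
    b = rng.standard_normal(dim)
    basis = np.cos(np.pi * np.arange(n_basis) * tau)
    return np.maximum(W @ basis + b, 0.0)
```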
Density Model: For the log-likelihood density model we used realNVP [dinh2016density] with 3 layers, each of size 64. The input to the model is the concatenated vector . We used the same optimizer hyperparameters as in the C51 model, and a constant for computing the pseudo-count.
Tuning: We tuned our method, ε-greedy and IQN. For ε-greedy we tried 5 different linear schedules, {(0.9, 0.05, 10), (0.9, 0.05, 8), (0.9, 0.05, 5), (0.9, 0.05, 4), (0.9, 0.05, 2)}, where the first element is the initial , the second element is the final and the third element is the episode ratio (i.e., ε decays linearly from its initial to its final value over that fraction of the total number of episodes). Additionally, we tried 5 exponential decay schedules for ε-greedy and IQN of the form (, , ): {(1.0, 0.9, 10), (1.0, 0.9, 100), (1.0, 0.9, 500), (1.0, 0.99, 10), (1.0, 0.99, 100)}, where . The first of the linear decay set performed the best. We also tested our algorithm with constant optimism values of , of which performed the best.
Theoretical Analysis
[Restatement of the Proposition in the Theoretical Analysis section] For any , the operator is a non-expansion in the Cramér distance . This implies that optimistic distributional Bellman backups and their projected version are contractions in and that iterates of these operators converge in to a unique fixed point.
Proof.
Consider \(Z_1, Z_2 \in \mathcal{Z}\) and any state \(s\) and action \(a\) with CDFs \(F_1\) and \(F_2\), and consider the application of the optimism operator \(\mathcal{O}_c\), which shifts each CDF down by \(c\) and clips at zero:
\(\ell_2\big(\mathcal{O}_c Z_1(s,a), \mathcal{O}_c Z_2(s,a)\big)^2 = \int \big((F_1(x) - c)_+ - (F_2(x) - c)_+\big)^2 \, dx.\)  (8)
Generally, for any \(u, v, c \in \mathbb{R}\) we have
\(\big((u - c)_+ - (v - c)_+\big)^2 \le (u - v)^2,\)  (9)
which can be verified case by case depending on whether \(u\) and \(v\) lie above or below \(c\). Applying this case-by-case bound to the quantity in the integral above, we get
\(\ell_2\big(\mathcal{O}_c Z_1(s,a), \mathcal{O}_c Z_2(s,a)\big)^2 \le \int \big(F_1(x) - F_2(x)\big)^2 \, dx = \ell_2\big(Z_1(s,a), Z_2(s,a)\big)^2.\)  (10)
By taking the square root on both sides as well as a max over states and actions, we get that is a non-expansion in . The rest of the statement follows from the fact that is a contraction and a non-expansion [rowland2018analysis], together with the Banach fixed-point theorem. ∎
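The non-expansion property can be checked numerically. The sketch below assumes the optimism operator acts on a discretized CDF by shifting it down by c and clipping at zero (with the removed mass implicitly moved to the top of the support); this reading of the operator is an assumption, not a transcription of the paper:

```python
import numpy as np

def cramer_sq(F, G, dx=1.0):
    """Squared Cramér (ell_2) distance between two CDFs on a shared grid."""
    return float(np.sum((F - G) ** 2) * dx)

def optimism(F, c):
    """Assumed optimism operator: shift the CDF down by c, clip at 0."""
    return np.maximum(F - c, 0.0)

# Check the non-expansion on random nondecreasing CDF-like vectors.
rng = np.random.default_rng(0)
for _ in range(100):
    F = np.sort(rng.random(50))
    G = np.sort(rng.random(50))
    c = rng.random()
    assert cramer_sq(optimism(F, c), optimism(G, c)) <= cramer_sq(F, G) + 1e-12
```

The assertion holds pointwise because |(u − c)₊ − (v − c)₊| ≤ |u − v| for every grid point.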
[Restatement of the Theorem in the Theoretical Analysis section] Let the shift parameter in the optimistic operator be sufficiently large, i.e., . Then with probability at least , the iterates converge for any risk level and initial to an optimistic estimate of the policy's conditional value at risk. That is, with probability at least ,
Proof.
By Lemma A and Lemma 3 of [bellemare2017distributional], we know that converges to a unique fixed point , independent of the initial . Hence, without loss of generality, we can choose .
We proceed by first showing how our result follows under a particular choice of , and then derive that choice. Assume that we have obtained a value for that satisfies the assumption of Lemma A (see below in this appendix), and let and . Then Lemma A implies that if for all , then also for all . Thus, for all . Finally, we can use Lemma A to obtain the desired result of the theorem statement, for all .
Going back, we use concentration inequalities to determine the value of that ensures the required condition of Lemma A (expressed in Eq. (19)). The DKW inequality gives us that, for the empirical CDF \(\hat F_{s,a}\) built from \(n(s,a)\) samples, with probability at least \(1 - \delta'\),
\(\sup_{x} \big|\hat F_{s,a}(x) - F_{s,a}(x)\big| \le \sqrt{\frac{\ln(2/\delta')}{2\, n(s,a)}}.\)  (11)
Further, the inequality by [weissman2003inequalities] gives that, with probability at least \(1 - \delta'\),
\(\big\|\hat P(\cdot \mid s,a) - P(\cdot \mid s,a)\big\|_1 \le \sqrt{\frac{2 S \ln(2/\delta')}{n(s,a)}}.\)  (12)
Combining both with a union bound over all state-action pairs, we get that it is sufficient to choose to ensure that , allowing us to apply Lemma A.
However, we can improve this result by removing the polynomial dependency of on the number of states, as follows. Consider a fixed and denote , where we set . Our goal is to derive a concentration bound on that is tighter than the bound derived from . Note that is not a random quantity, and hence is a normalized sum of independent random variables for any . To deal with the continuous variable , which prevents us from applying a union bound over directly, we use a covering argument. Let be arbitrary and consider the discretization set
(13) 
Define as the discretization of at the discretization points in . This construction ensures that the discretization error is uniformly bounded by , that is, holds for all and . Hence, we can bound for all
(14)  
(15)  
(16)  
(17) 
where in we applied Hoeffding's inequality to the first term, in combination with a union bound over , as can take only values in . The second term was bounded with Hölder's inequality, and in the concentration inequality from Eq. (12) was used. Combining this bound with Eq. (11) by applying a union bound over all states and actions, we get that picking
(18) 
is sufficient to apply Lemma A. Since is non-decreasing, the size of the discretization set is at most , and by picking , we see that is sufficient.
∎
[Lemma A] Let be such that for all . Assume finitely many states and actions . Let and be the empirical reward distributions and transition probabilities. Assume further that
(19) 
where is the shift in the optimism operator . Then