Being Optimistic to Be Conservative: Quickly Learning a CVaR Policy

11/05/2019 · by Ramtin Keramati, et al.

While maximizing expected return is the goal in most reinforcement learning approaches, risk-sensitive objectives such as conditional value at risk (CVaR) are more suitable for many high-stakes applications. However, relatively little is known about how to explore to quickly learn policies with good CVaR. In this paper, we present the first algorithm for sample-efficient learning of CVaR-optimal policies in Markov decision processes based on the optimism in the face of uncertainty principle. This method relies on a novel optimistic version of the distributional Bellman operator that moves probability mass from the lower to the upper tail of the return distribution. We prove asymptotic convergence and optimism of this operator for the tabular policy evaluation case. We further demonstrate that our algorithm finds CVaR-optimal policies substantially faster than existing baselines in several simulated environments with discrete and continuous state spaces.








A key goal in reinforcement learning (RL) is to quickly learn to make good decisions by interacting with an environment. In most cases the quality of the decision policy is evaluated with respect to its expected (discounted) sum of rewards. However, in many interesting cases, it is important to consider the full distribution over the potential sum of rewards, and the desired objective may be a risk-sensitive measure of this distribution. For example, a patient undergoing surgery for a knee replacement will (hopefully) only experience that procedure once or twice, and will likely be more interested in the distribution of potential results for a single procedure than in what would happen on average if he or she were to undertake that procedure hundreds of times. Finance and (machine) control are other areas where interest in risk-sensitive outcomes is common.

A popular risk-sensitive measure of a distribution of outcomes is the Conditional Value at Risk (CVaR) [artzner1999coherent]. Intuitively, CVaR is the expected reward in the worst α-fraction of outcomes, and has seen extensive use in financial portfolio optimization [finance-zhu2009worst], often under the name "expected shortfall". While there has been recent interest in the RL community in learning to converge to or identify good CVaR decision policies in Markov decision processes [chow2014algorithms, chow2015risk, sampling, dabney2018implicit], interestingly we are unaware of prior work focused on how to quickly learn such CVaR MDP policies, even though sample efficient RL for maximizing expected outcomes is a deep and well studied theoretical [Jaksch10, dann2018policy] and empirical [bellemare2016unifying] topic. Sample efficient exploration seems of equal or even greater importance when the goal is risk-averse outcomes.

In this paper we work towards sample efficient reinforcement learning algorithms that can quickly identify a policy with an optimal CVaR. Our focus is on minimizing the amount of experience needed to find such a policy, similar in spirit to probably approximately correct RL methods for expected reward. Note that this is different from another important topic in risk-sensitive RL, safe exploration: algorithms that focus on avoiding any potentially very poor outcomes during learning. Those methods typically rely on local smoothness assumptions and do not focus on sample efficiency [berkenkamp2017safe, koller2018learning]; an interesting question for future work is whether one can do both safe and efficient learning of a CVaR policy. Our work is suitable for the many settings where some outcomes are undesirable but not catastrophic.

Our approach is inspired by the popular and effective principle of optimism in the face of uncertainty (OFU) in sample efficient RL for maximizing expected outcomes [strehl2008analysis, brafman2002r]. Such work typically considers uncertainty over the MDP model parameters or state-action value function, and constructs an optimistic value function given that uncertainty, which is then used to guide decision making. To apply a similar idea to rapidly learning the optimal CVaR policy, we consider the uncertainty in the distribution of possible outcomes and the resulting CVaR value. To do so, we use the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. While to our knowledge this inequality has not previously been used in reinforcement learning settings, it is well suited to our purposes, as it provides bounds on the true cumulative distribution function (CDF) given a set of sampled outcomes. We leverage these bounds to compute optimistic estimates of the optimal CVaR.
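To make the DKW bound concrete, here is a minimal Python sketch (our own helper, not the paper's implementation) that computes a confidence band of half-width ε = sqrt(ln(2/δ)/(2n)) around an empirical CDF:

```python
import numpy as np

def dkw_band(samples, delta):
    """DKW confidence band: with probability >= 1 - delta, the true CDF
    lies within eps = sqrt(ln(2/delta) / (2n)) of the empirical CDF
    everywhere on the real line."""
    n = len(samples)
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    xs = np.sort(np.asarray(samples, dtype=float))
    F_hat = np.arange(1, n + 1) / n          # empirical CDF at sorted points
    lower = np.clip(F_hat - eps, 0.0, 1.0)
    upper = np.clip(F_hat + eps, 0.0, 1.0)
    return xs, lower, upper

rng = np.random.default_rng(0)
xs, lo, hi = dkw_band(rng.normal(size=100), delta=0.05)
```

For n = 100 and δ = 0.05 the band half-width is about 0.136, independent of the underlying distribution; this distribution-free property is what makes the bound convenient here.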

Our interest is in creating empirically efficient and scalable algorithms that have a theoretically sound grounding. To that end, we introduce a new algorithm for quickly learning a CVaR policy in MDPs and show that, at least for policy evaluation in tabular MDPs, this algorithm indeed produces optimistic estimates of the CVaR. We also show that it converges asymptotically. We accompany the theoretical evidence with an empirical evaluation. We provide encouraging empirical results on a machine replacement task [delage2010percentile], a classic MDP where risk-sensitive policies are critical, as well as a well-validated simulator for type 1 diabetes [man2014uva] and a simulated treatment optimization task for HIV [ernst2006clinical]. In all cases we find a substantial benefit over simpler exploration strategies. To our knowledge this is the first algorithm that performs strategic exploration to learn good CVaR MDP policies.

Background and Notation


be a bounded random variable with cumulative distribution function

. The conditional value at risk (CVaR) at level of a random variable is then defined as [rockafellar2000optimization]:


We define the inverse CDF as . It is well known that when has a continuous distribution, [acerbi2002coherence]. For ease of notation we sometimes write as a function of the CDF , .
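For the discrete distributions used throughout this paper, CVaR_α can be computed directly from the sorted atoms: average the worst α-fraction of the probability mass. A minimal Python sketch (our own helper, not code from the paper):

```python
import numpy as np

def cvar(atoms, probs, alpha):
    """CVaR_alpha of a discrete distribution: the mean of the worst
    alpha-fraction of outcomes, read off the (inverse) CDF."""
    assert 0 < alpha <= 1
    order = np.argsort(atoms)
    atoms = np.asarray(atoms, dtype=float)[order]
    probs = np.asarray(probs, dtype=float)[order]
    mass_left = alpha          # probability mass still to account for
    total = 0.0
    for z, p in zip(atoms, probs):
        take = min(p, mass_left)
        total += take * z
        mass_left -= take
        if mass_left <= 0:
            break
    return total / alpha

# Worst 50% of a fair coin over {0, 10} is the outcome 0.
print(cvar([0.0, 10.0], [0.5, 0.5], 0.5))   # -> 0.0
# alpha = 1 recovers the ordinary mean.
print(cvar([0.0, 10.0], [0.5, 0.5], 1.0))   # -> 5.0
```

Note that α = 1 recovers the expectation, so expected-return RL is the special case CVaR_1.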

We are interested in the CVaR of the discounted cumulative reward in a Markov Decision Process (MDP). An MDP is defined by a tuple (S, A, R, P, γ), where S and A are finite state and action spaces, R is the reward distribution, P is the transition kernel and γ is the discount factor. A stationary policy π maps each state s ∈ S to a probability distribution over the action space A.

Let 𝒵 denote the space of distributions over returns (discounted cumulative rewards) from such an MDP, and assume that these returns are bounded in [V_min, V_max] almost surely. We define Z^π(s, a) ∈ 𝒵 to be the distribution of the return of policy π with CDF F^π_{s,a} and initial state-action pair (s, a). RL algorithms most commonly optimize policies for expected return and explicitly learn Q-values, Q^π(s, a) = E[Z^π(s, a)], by applying approximate versions of Bellman backups. Instead, we are interested in other properties of the return distribution, and we will build on several recently proposed algorithms that aim to learn a parametric model of the entire return distribution instead of only its expectation. Such approaches are known as distributional RL methods.

Distributional Reinforcement Learning

Distributional RL methods apply a sample-based approximation to distributional versions of the usual Bellman operators. For example, one can define a distributional Bellman operator T^π [bellemare2017distributional] as

T^π Z(s, a) :D= R(s, a) + γ P^π Z(s, a),

where :D= denotes equality in distribution, and the transition operator P^π is defined as P^π Z(s, a) :D= Z(S′, A′) with S′ ∼ P(·|s, a), A′ ∼ π(·|S′). The optimality version T* is similarly any T^{π*} where π* is an optimal policy w.r.t. expected return. Note that this is not necessarily unique when there are multiple optimal policies. [rowland2018analysis] showed that T^π is a √γ-contraction in the maximal Cramér metric ℓ̄_2.

One of the canonical algorithms in distributional RL is CDRL, or C51 [bellemare2017distributional], which represents the return distribution as a discrete distribution with fixed support on K atoms z_1 ≤ … ≤ z_K; the discrete distribution is parameterized as η_{s,a} = Σ_k p_k(s, a) δ_{z_k}. Essentially, C51 uses a sample transition (s, a, r, s′) to perform an approximate Bellman backup η ← Π_C T̂ η, where T̂ is a sample-based Bellman operator and Π_C is a projection back onto the support {z_1, …, z_K} of the discrete distribution.
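The projected backup Π_C T̂ can be sketched in a few lines of Python. The helper below is illustrative (our own names), implementing the shift of each atom by r + γ z and the standard two-neighbour projection onto an equally spaced support:

```python
import numpy as np

def c51_backup(probs_next, r, gamma, atoms):
    """One sample-based categorical backup: shift the next-state atoms to
    r + gamma * z, clip to the support, then split each shifted atom's
    mass between its two nearest neighbours on `atoms` (equally spaced)."""
    K = len(atoms)
    dz = atoms[1] - atoms[0]
    m = np.zeros(K)
    for p, z in zip(probs_next, atoms):
        tz = np.clip(r + gamma * z, atoms[0], atoms[-1])
        b = (tz - atoms[0]) / dz           # fractional index of the target
        l, u = int(np.floor(b)), int(np.ceil(b))
        if l == u:
            m[l] += p                      # lands exactly on an atom
        else:
            m[l] += p * (u - b)            # closer neighbour gets more mass
            m[u] += p * (b - l)
    return m

atoms = np.linspace(0.0, 10.0, 11)         # support z_1, ..., z_K
probs = np.zeros(11); probs[10] = 1.0      # next-state return is 10 w.p. 1
target = c51_backup(probs, r=1.0, gamma=0.5, atoms=atoms)
print(target[6])                           # 1 + 0.5 * 10 = 6.0 -> mass 1.0
```

The projection conserves probability mass, so the output is again a valid distribution on the same support.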

Optimistic Distributional Operator

In contrast to the typical RL setup where an agent tries to maximize its expected return, we seek to learn a stationary policy that maximizes the CVaR of the return at risk level α. (Note that the CVaR-optimal policy at any state can be non-stationary [shapiro2009lectures], as it depends on the sum of rewards achieved up to that state. For simplicity, as in [dabney2018distributional], we instead seek a stationary policy, which can be suboptimal but typically still achieves high CVaR, as observed in our experiments.) To find such policies quickly, we follow the optimism-in-the-face-of-uncertainty (OFU) principle and introduce optimism into our CVaR estimates to guide exploration. While adding a bonus to rewards is a popular approach for optimism in the standard expected return case [ostrovski2017count], we here follow a different approach and introduce optimism into our return estimates by shifting the empirical CDFs. Formally, consider a return distribution with CDF F_{s,a}. We define the optimism operator O as


[O F_{s,a}](z) = ( F_{s,a}(z) − c / √n(s, a) )_+ for z < V_max, and [O F_{s,a}](V_max) = 1,

where c is a constant and (x)_+ is short for max(x, 0). In the definition above, n(s, a) is the number of times the pair (s, a) has been observed so far, or an approximation such as pseudo-counts [bellemare2016unifying]. By shifting the cumulative distribution function down, this operator essentially moves probability mass from the lower tail to the highest possible value V_max. An illustration is provided in Figure 1.
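The operator's effect is easy to see numerically. In the sketch below (our own helper names, assuming a shift of the form c/√n clipped at zero with the final CDF value pinned to 1), shifting a three-atom empirical CDF down by 0.2 removes 0.2 of probability mass from the lowest atom and reassigns it to V_max:

```python
import numpy as np

def optimistic_cdf(F, c, n):
    """Shift the CDF down by c / sqrt(n) and clip at 0; pinning the last
    value back to 1 implicitly moves the clipped lower-tail mass to V_max."""
    G = np.maximum(np.asarray(F, dtype=float) - c / np.sqrt(n), 0.0)
    G[-1] = 1.0
    return G

atoms = np.array([0.0, 5.0, 10.0])
F = np.array([0.2, 0.6, 1.0])           # empirical CDF at the atoms
G = optimistic_cdf(F, c=1.0, n=25)      # shift of 1 / sqrt(25) = 0.2
pmf_old = np.diff(F, prepend=0.0)       # [0.2, 0.4, 0.4]
pmf_new = np.diff(G, prepend=0.0)       # [0.0, 0.4, 0.6]
print(pmf_new)
```

The shifted distribution stochastically dominates the empirical one, so every CVaR_α (and the mean) of the optimistic distribution is at least as large.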

Figure 1: Top-left: Empirical CDF. Top-right: The lower DKW confidence band (a shifted-down version of the empirical CDF). Bottom-left: Empirical PDF. Bottom-right: Optimistic PDF.

This approach to optimism is motivated by an application of the DKW-inequality to the empirical CDF. As shown recently by [thomas2019concentration], this can yield tighter upper confidence bounds on the CVaR.

Theoretical Analysis

The optimistic operator introduced above operates on the entire return distribution, and the algorithm introduced in the next section applies this optimistic operator to estimated return-to-go distributions. As such, it belongs to the family of distributional RL methods [dabney2018distributional]. These methods are a recent development and come with strong asymptotic convergence guarantees when used for policy evaluation in tabular MDPs [rowland2018analysis]. Yet finite-sample guarantees such as regret or PAC bounds still remain elusive for distributional RL policy optimization algorithms.

A key technical challenge in proving performance bounds for risk-sensitive policy optimization in RL is that convergence of the distributional Bellman optimality operator cannot be guaranteed in general. Prior results only show that if the optimization process itself computes a policy which maximizes expected returns, as Q-learning does, then the distributional Bellman optimality operator is guaranteed to converge [rowland2018analysis, Theorem 2]. If the goal is instead to leverage distributional information to compute a policy that maximizes something other than expected outcomes, such as the risk-sensitive policies we consider here, then no prior theoretical results are known in the reinforcement learning setting, to our knowledge. However, there is promising empirical evidence that one can compute risk-sensitive policies using distributional Bellman operators [dabney2018implicit], which suggests that more theoretical results may be possible.

Here we take a first step towards this goal. Our primary aim in this work is to provide tools to introduce optimism into distributional return-to-go estimates to guide sample-efficient exploration for CVaR. Therefore, our theoretical analysis focuses on showing that this form of optimism does not harm convergence and is indeed a principled way to obtain optimistic CVaR estimates.

First, we prove that the optimism operator is a non-expansion in the Cramér distance. This result shows that the operator can be used with other contraction operators without negatively impacting their convergence behaviour; in particular, we can guarantee convergence when it is combined with distributional Bellman backups. For any c ≥ 0, the operator O is a non-expansion in the Cramér distance ℓ_2. This implies that optimistic distributional Bellman backups O T^π and the projected version Π_C O T^π are √γ-contractions in ℓ̄_2, and iterates of these operators converge in ℓ̄_2 to a unique fixed point.
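The non-expansion claim can be sanity-checked numerically: clipping at zero is a pointwise contraction, and pinning the final CDF value to 1 only shrinks pointwise differences, so applying the same shift to two CDFs never increases their Cramér distance. A small randomized check (our own sketch, approximating the ℓ_2 integral on a uniform grid):

```python
import numpy as np

def cramer(F, G, dz):
    # Cramér (l2) distance between two CDFs sampled on a shared grid
    return np.sqrt(np.sum((F - G) ** 2) * dz)

def shift_op(F, s):
    # Same form as the optimism operator: shift down, clip, pin last value.
    H = np.maximum(F - s, 0.0)
    H[-1] = 1.0
    return H

rng = np.random.default_rng(0)
dz = 0.1
for _ in range(1000):
    # two random CDFs on the same 20-point support
    F = np.sort(rng.random(20)); F[-1] = 1.0
    G = np.sort(rng.random(20)); G[-1] = 1.0
    s = rng.random()
    assert cramer(shift_op(F, s), shift_op(G, s), dz) <= cramer(F, G, dz) + 1e-12
print("non-expansion held on 1000 random CDF pairs")
```

This is only a numerical illustration, of course; the paper's claim is the corresponding formal statement.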

Next, we provide theoretical evidence that this operator indeed produces optimistic CVaR estimates. Consider here batch policy evaluation in MDPs with finite state and action spaces. Assume that we have collected a fixed number of samples (which can vary across states and actions) and build an empirical model of the MDP. For any policy π, let T̂^π denote the distributional Bellman operator in this empirical MDP. Then we indeed achieve optimistic estimates, by the following result:

Let the shift parameter c in the optimistic operator be sufficiently large. Then, with probability at least 1 − δ, the iterates (Π_C O T̂^π)^m η converge for any risk level α and initial distribution η to an optimistic estimate of the policy's conditional value at risk. That is, denoting by F̃^π_{s,a} the CDF of the fixed point, with probability at least 1 − δ,

CVaR_α(F̃^π_{s,a}) ≥ CVaR_α(F^π_{s,a}) for all α ∈ (0, 1] and all (s, a).

This theorem uses the DKW inequality, which to the best of our knowledge has not previously been used for MDPs. Note that the statement guarantees optimism for all risk levels α simultaneously, without paying a penalty for it. Since we estimate the transitions and rewards for each state and action separately, one generally does not expect to be able to use a smaller shift parameter; thus, the theorem above is unimprovable in that sense. Specifically, we avoid a polynomial dependency on the number of states in the shift parameter by combining two techniques: (1) concentration inequalities w.r.t. the optimal CVaR of the next state for a certain finite set of risk levels, and (2) a covering argument to obtain optimism for all (infinitely many) α. This is substantially more involved than the expected reward case.

These results are a key step towards finite-sample analyses. In future work it would be very interesting to obtain a convergence analysis for distributional Bellman optimality operators in general, though this is outside the scope of this current paper. Such a result could lead to sample-complexity guarantees when combined with our existing analysis.


Algorithm

In the policy evaluation case, where we would like to compute optimistic estimates of the CVaR of a given policy π, our algorithm essentially performs an approximate version of the optimistic Bellman update η ← Π_C O T^π η, where T^π is the distributional Bellman operator. For the control case, where we would like to learn a policy that maximizes CVaR, we instead define a distributional Bellman optimality operator T*_α. Analogous to prior work [bellemare2017distributional], T*_α is any operator that satisfies T*_α η = T^π′ η for some policy π′ that is greedy w.r.t. CVaR at level α. Our algorithm then performs an approximate version of the optimistic Bellman backup η ← Π_C O T*_α η, shown in Algorithm 1.

The main structure of our algorithm resembles categorical distributional reinforcement learning (C51) [bellemare2017distributional]. In a similar vein, our algorithm also maintains a return distribution estimate for each state-action pair, represented as a set of weights p_k(s, a) for k = 1, …, K. These weights represent a discrete distribution with outcomes at K equally spaced locations z_1 < … < z_K, each Δz apart. The current probability assigned to outcome z_k at (s, a) is denoted by p_k(s, a), where the atom probabilities are given by a differentiable model such as a neural network, similar to C51. Note that other parameterized representations of the weights [bellemare2017distributional] are straightforward to incorporate.

The main differences between Algorithm 1 and existing distributional RL algorithms (e.g., C51) are highlighted. We first apply the optimism operator to our successor return distributions to form an optimistic CDF for every action a′ available in the successor state. This operator should encourage exploring actions that might lead to higher-CVaR policies for the input risk level α. These optimistic CDFs are also used to decide on the successor action a′ in the control setting. Then, similar to C51, we apply the Bellman operator to each atom z_k and distribute the probability of r + γ z_k to its immediate neighbours on the support, where we calculate the probability mass with the optimistic CDF.

Input: atom locations z_1, …, z_K, risk level α, shift constant c, density model ρ, policy π (evaluation only)
1  for t = 1, 2, … do
2      Observe transition s, a, r, s′;
3      for a′ ∈ A do
           /* emp. CDF of return for (s′, a′) */
4          F_{s′,a′}(z_k) ← Σ_{j ≤ k} p_j(s′, a′) for all k;
           /* Pseudo-counts ñ(s′, a′) using density model ρ */
           /* Optimistic CDF */
5          F̃_{s′,a′}(z_k) ← ( F_{s′,a′}(z_k) − c/√ñ(s′, a′) )_+ for k < K;  F̃_{s′,a′}(z_K) ← 1;
6      if Control then a′ ← argmax_{a″} CVaR_α(F̃_{s′,a″});
7      if Evaluation then a′ ∼ π(·|s′);
       /* optimistic PDF from opt. CDF */
8      p̃_k ← F̃_{s′,a′}(z_k) − F̃_{s′,a′}(z_{k−1}) for all k;
9      m_k ← 0 for all k;
10     for k = 1, …, K do
           /* Project on support of z */
11         τ ← clip(r + γ z_k, z_1, z_K);
           /* Distribute prob. of τ to neighbouring atoms */
12         b ← (τ − z_1)/Δz;  l ← ⌊b⌋;  u ← ⌈b⌉;
13         m_l ← m_l + p̃_k (u − b);
14         m_u ← m_u + p̃_k (b − l);
15     Update return weights p(s, a) by an optimization step on the cross-entropy loss with target m;
       /* Take next action */
16     Choose a_{t+1} greedily w.r.t. the optimistic CVaR (control) or sample it from π (evaluation);
17     Update density model ρ with an additional observation of (s′, a′);
Algorithm 1 CVaR-MDP

Following [bellemare2017distributional], we train this model using the cross-entropy loss, which for a particular state transition (s, a, r, s′) at time t is

ℓ(s, a) = − Σ_k m_k log p_k(s, a),

where m_k are the weights of the target distribution computed in Algorithm 1. In the tabular setting we can directly update the probability mass by

p_k(s, a) ← (1 − λ) p_k(s, a) + λ m_k,

where λ is the learning rate.
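In the tabular case the update is a simple convex combination of the stored probability mass and the target weights, which keeps the result a valid distribution. A sketch with hypothetical values:

```python
import numpy as np

def tabular_update(p, m, lr):
    """Move the stored pmf a step of size lr toward the target weights m.
    A convex combination of two pmfs is again a pmf."""
    return (1.0 - lr) * np.asarray(p, dtype=float) + lr * np.asarray(m, dtype=float)

p = np.array([0.5, 0.5, 0.0])
m = np.array([0.0, 0.5, 0.5])   # target weights from the optimistic backup
p = tabular_update(p, m, lr=0.1)
print(p)                         # -> [0.45 0.5  0.05]
```

With function approximation, the gradient step on the cross-entropy loss plays the role of this mixture update.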

In tabular settings, the counts n(s, a) can be directly stored and used; however, this is not the case in continuous settings. For this reason, we adopt the pseudo-count estimation method proposed by [ostrovski2017count] and replace n(s, a) by a pseudo-count ñ(s, a) in the optimistic distributional operator. Let ρ be a density model and ρ_t(s, a) the probability assigned to the state-action pair (s, a) by the model after t training steps. The prediction gain of ρ is defined as

PG_t(s, a) = log ρ′_t(s, a) − log ρ_t(s, a),

where ρ′_t(s, a) is the probability assigned to (s, a) if the model were trained on that same (s, a) one more time. We then define the pseudo-count of (s, a) as

ñ_t(s, a) = ( exp( β (PG_t(s, a))_+ ) − 1 )^{-1},

where β is a constant hyper-parameter and (·)_+ thresholds the value of the prediction gain at 0.
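Assuming the thresholded form ñ = (exp(β · (PG)_+) − 1)^{-1} described above (β is our name for the constant hyper-parameter), the mapping from prediction gain to pseudo-count can be sketched as:

```python
import math

def pseudo_count(pg, beta=1.0):
    """Pseudo-count from a prediction gain: a high gain (novel state-action
    pair) gives a small count, zero gain an (effectively) infinite count,
    which drives the exploration shift c / sqrt(n) to zero."""
    pg = max(pg, 0.0)                  # threshold the prediction gain at 0
    if pg == 0.0:
        return math.inf
    return 1.0 / math.expm1(beta * pg)

print(pseudo_count(math.log(2)))       # expm1(ln 2) = 1  ->  count 1.0
```

A tiny gain such as 10^-3 already yields a pseudo-count near 1000, so the optimism bonus fades quickly for familiar state-action pairs.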

Our setting differs from [ostrovski2017count] in that we have to compute the count before taking the action a′. A naive way would be to try all actions and train the model on each to compute the counts, but this is slow and requires the environment to support an undo action. Instead, we can estimate the prediction gain for all actions as follows. Consider the density model ρ_θ parametrized by θ. After observing (s, a), the training step to maximize the log likelihood will update the parameters by θ′ = θ + λ ∇_θ log ρ_θ(s, a), where λ is the learning rate. So we can approximate the new log probability using a first-order Taylor expansion:

log ρ_{θ′}(s, a) ≈ log ρ_θ(s, a) + λ ‖∇_θ log ρ_θ(s, a)‖².

This calculation suggests that the prediction gain can be estimated just by computing the gradient of the log likelihood given a state-action pair, i.e., PG(s, a) ≈ λ ‖∇_θ log ρ_θ(s, a)‖². As discussed in [graves2017automated], this estimate of the prediction gain is biased, but empirically we have found this method to perform well.
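The first-order estimate can be verified on a tiny softmax density model, where one exact gradient step on the log likelihood is easy to take. The model and numbers below are illustrative, not from the paper:

```python
import numpy as np

def log_softmax(theta):
    return theta - np.log(np.sum(np.exp(theta)))

# Tiny softmax density model over 4 discrete (s, a) pairs.
theta = np.array([0.3, -0.1, 0.8, 0.0])
i, lr = 2, 1e-3                          # observed pair and learning rate

grad = np.zeros_like(theta)              # gradient of log rho_theta(i)
grad[i] = 1.0
grad -= np.exp(log_softmax(theta))       # e_i - softmax(theta)

# Exact prediction gain after one gradient step vs. its Taylor estimate.
pg_true = log_softmax(theta + lr * grad)[i] - log_softmax(theta)[i]
pg_est = lr * np.dot(grad, grad)         # lambda * ||grad||^2
print(abs(pg_true - pg_est) < 1e-5)      # -> True
```

For small learning rates the two agree up to second-order terms, which is exactly the bias discussed above.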

Experimental Evaluation

We validate our algorithm empirically in three simulated environments against baseline approaches. Finance, health and operations are common areas where risk-sensitive strategies are important, and we focus on two health domains and one operations domain. Details, where omitted, are provided in the supplemental material.

Machine Replacement

Machine repair and replacement is a classic example in the risk-sensitive literature, though to our knowledge no prior work has considered how to quickly learn a good risk-sensitive policy for such domains. Here we consider a minor variant of a prior setting [delage2010percentile]. Specifically, as shown in Figure 2, the environment consists of a chain of states (25 in our experiments). There are two actions: replace and don't replace the machine. Choosing replace at any state terminates the episode, while choosing don't replace moves the agent to the next state in the chain. At the end of the chain, choosing don't replace terminates the episode with a high-variance cost, and choosing replace terminates the episode with a higher cost but lower variance. This environment is an especially challenging exploration task due to the chain structure of the MDP, as well as the high variance of the reward distributions when taking actions in the last state. Additionally, in this MDP it is feasible to exactly compute the CVaR-optimal policy, which allows us to compare the learned policy to the true optimal CVaR policy. Note here that the optimal policy for maximizing CVaR is to replace at the final state in the chain to avoid the high-variance alternative; in contrast, the optimal policy for expected return always chooses don't replace.
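A minimal sketch of such a chain environment is below. The cost values are illustrative placeholders, not the paper's parameters; only the structure (chain of states, terminating replace action, high-variance final payoff) follows the description above:

```python
import numpy as np

class MachineReplacement:
    """Chain MDP sketch with illustrative costs. Action 1 (don't replace)
    advances along the chain; action 0 (replace) ends the episode. The
    last state offers a high-variance payoff for 'don't replace' versus
    a fixed cost for 'replace'."""
    def __init__(self, n_states=25, rng=None):
        self.n = n_states
        self.rng = rng or np.random.default_rng()
        self.s = 0

    def step(self, action):
        """Return (next_state, reward, done); next_state is None at the end."""
        if action == 0:                        # replace: fixed cost
            return None, -2.0, True
        if self.s == self.n - 1:               # end of chain, don't replace:
            return None, self.rng.normal(0.0, 10.0), True   # high variance
        self.s += 1
        return self.s, -0.1, False             # keep running the machine
```

With costs like these, an expectation-maximizing agent keeps running the machine, while a CVaR agent prefers the safe replacement at the end, mirroring the dynamics described above.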

Figure 2: Machine Replacement: This environment consists of a chain of states, each affording two actions: replace and don’t replace.
HIV Treatment

In order to test our algorithm on a larger continuous state space, we leverage an HIV treatment simulator. The environment is based on the implementation by [geramifard2015rlpy] of the physical model described in [ernst2006clinical]. The patient state is represented as a 6-dimensional continuous vector, and the reward is a function of the number of free HIV viruses, the immune response of the body to HIV, and side effects. There are four actions, each determining which drugs are administered for the next 20-day period: Reverse Transcriptase Inhibitors (RTI), Protease Inhibitors (PI), neither, or both. There are 50 time steps in total per episode, for a total of 1000 days. We chose here a larger number of days per time step compared to the typical setup (200 steps of 5 days each) to facilitate faster experimentation. This design choice also makes the exploration task harder, since taking one wrong action can drastically destabilize a patient's trajectory. The originally proposed model is deterministic, which makes the CVaR policy identical to the policy optimizing the expected value. Such simulators are rarely a perfect proxy for real systems, and in our setting we add Gaussian noise to the efficacy of each drug (ε₁ = 0.7 for RTI and ε₂ = 0.3 for PI in [ernst2006clinical]). This change necessitates risk-sensitive policies in this environment.

Figure 3: Machine Replacement: The thick grey dashed line is the CVaR-optimal policy. The thin dashed line labeled as the suboptimal policy is the optimal expectation-maximizing policy. The shaded area shows the 95% confidence intervals.

Figure 4: Comparison of our approach against ε-greedy and IQN baselines. All models were trained to optimize the CVaR of the return on a stochastic version of the HIV simulator [ernst2006clinical]. Top: objective CVaR; Bottom: discounted expected return of the same policies as in the top plot.
Figure 5: Type 1 diabetes simulator: CVaR for three different adults. Plots are averaged over 10 runs with 95% CI.
            ε-greedy         CVaR-MDP (ours)
Adult#003   11.2% ± 3.6%     4.2% ± 2.3%
Adult#004   2.3% ± 0.3%      1.4% ± 0.6%
Adult#005   3.3% ± 0.3%      1.7% ± 0.6%
Figure 6: Type 1 Diabetes simulator: percent of episodes in which patients experienced a severe medical condition (hypoglycemia or hyperglycemia), averaged across 10 runs (± 95% CI)
Diabetes 1 Treatment

Patients with type 1 diabetes regulate their blood glucose level with insulin in order to avoid hypoglycemia or hyperglycemia (very low or very high blood glucose levels, respectively). A simulator has been created [man2014uva] that is an open-source version of a simulator approved by the FDA as a substitute for certain pre-clinical trials. The state is a continuous-valued vector of the current blood glucose level and the amount of carbohydrate intake (through food). The action space is discretized into 6 levels of a bolus insulin injection. The reward function is defined similarly to prior work [bastani2014model], as a risk-based penalty on the estimate of bg (blood glucose) in mmol/L.

Additionally, we inject two sources of stochasticity into the taken action: First, we add Gaussian noise to the action. Second, we delay the time of the injection by at most 5 steps, where the probability of injection at time t is higher than at time t + 1, following a power law. Each simulation lasts for 200 steps, during which a patient eats five meals. The agent chooses an action after each meal, and after the 200 steps each patient resets to its initial state.
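The delayed-injection noise can be sketched as sampling a delay whose probability decays as a power law in the delay; the exponent below is an illustrative assumption, since the text does not specify it:

```python
import numpy as np

def sample_delay(max_delay=5, power=2.0, rng=None):
    """Sample an injection delay in {0, ..., max_delay}: shorter delays are
    more likely, with probabilities proportional to (d + 1)^(-power).
    `power` is an illustrative exponent, not a value from the paper."""
    rng = rng or np.random.default_rng()
    w = (np.arange(max_delay + 1) + 1.0) ** (-power)
    return rng.choice(max_delay + 1, p=w / w.sum())

rng = np.random.default_rng(0)
delays = [sample_delay(rng=rng) for _ in range(10000)]
```

With this choice, most injections happen on time (delay 0) and long delays are rare, matching the monotone-decay property described above.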

This domain also readily offers a suite of related tasks, since the environment simulates 30 patients with slightly different dynamics. Tuning hyper-parameters on the same task can be misleading [henderson2018deep], as is the case in our two previous benchmarks. In this setting we tune baselines and our method on one patient, and test the performance on different patients.

Baselines and Experimental Setup

The majority of prior risk-sensitive RL work has not focused on efficient exploration, and there has been very little deep distributional RL work focused on risk sensitivity. Our key contribution is to evaluate the impact of more strategic exploration on the efficiency with which a risk-sensitive policy can be learned. We compare to the following approaches:

  1. ε-greedy CVaR: In this benchmark we use the same algorithm, except we do not introduce an optimism operator, instead using an ε-greedy approach for exploration. This benchmark can be viewed as analogous to the distributional RL method C51 [bellemare2017distributional] if the computed policy optimized for CVaR instead of expected reward.

  2. IQN-ε-greedy CVaR: In this benchmark we use an implicit quantile network (IQN), which also uses an ε-greedy method for exploration [dabney2018implicit]. We adopted the Dopamine implementation of IQN [castro18dopamine].

  3. CVaR-AC: An actor-critic method proposed by [chow2014algorithms] that maximizes the expected return while satisfying an inequality constraint on the CVaR. This method relies on the stochasticity of the policy for exploration.

Note that a comparison to an expectation maximizing algorithm is uninformative since such approaches are maximizing different (non-risk-sensitive) objectives.

All of these algorithms use hyperparameters, and it is well recognized that ε-greedy algorithms can often perform quite well if their hyperparameters are well tuned. To provide a fair comparison, we evaluated across a number of schedules for reducing the ε parameter for both ε-greedy and IQN, and a small set of values (4-7) for the optimism constant c for our method. We used the specification described in Appendix C of [chow2014algorithms] for CVaR-AC.

The system architectures used in the continuous settings are identical for Baseline 1 (ε-greedy) and our method. This consists of 2 hidden layers of size 32 with ReLU activations for Diabetes 1 Treatment, and 4 hidden layers of size 128 with ReLU activations for HIV Treatment, both followed by a softmax layer for each action. Similarly, for IQN we used the same architecture, followed by a cosine embedding function and a fully connected layer of size 128 for HIV Treatment (32 for Diabetes 1 Treatment) with ReLU activation, followed by a softmax layer. The density model is a realNVP [dinh2016density] with 3 hidden layers, each of size 64.

All results are averaged over 10 runs and we report 95% confidence intervals. We report the performance of ε-greedy at evaluation time (setting ε = 0), which is the best performance of ε-greedy.

For the Diabetes Treatment domain, hyperparameters are optimized only on adult#001. We then report results of the methods using those hyperparameters on adult#003, adult#004 and adult#005.

Results and Discussion

Results on the machine replacement environment (Figure 3), HIV Treatment (Figure 4) and Diabetes 1 Treatment (Figure 5) all show that our optimistic algorithm achieves better performance much faster than the baselines.

In Machine Replacement (Figure 3) we see that our method quickly converges to the optimal CVaR performance. Unfortunately, despite our best efforts, our implementation of CVaR-AC did not perform well even on the simplest environment, so we do not show the performance of this method on the other environments. One challenge here is that CVaR-AC has a significant number of hyper-parameters, including three different learning-rate schedules for the optimization process, the initial Lagrange multipliers and the kernel functions.

In HIV Treatment we also see a clear and substantial benefit of our optimistic approach over the baseline ε-greedy approach and IQN (Figure 4).

Figure 5 is particularly encouraging, as it shows the results for the diabetes simulator across 3 patients, where the hyperparameters were fixed after optimizing for a separate patient. Since in real settings it would commonly be necessary to fix the hyperparameters in advance, this result provides a nice demonstration that the optimistic approach can consistently equal or significantly improve over an ε-greedy policy in related settings, similar to the well-known results in Atari in which hyperparameters are optimized for one game and then used for multiple others.

"Safer" Exploration.

Our primary contribution is a new algorithm to learn risk-sensitive policies quickly, with less data. However, an interesting side benefit of such a method might be that the number of extremely poor outcomes experienced over time is also reduced, not due to explicitly prioritizing a form of safe exploration, but because our algorithm may enable faster convergence to a safe policy. To evaluate this, we consider a risk measure proposed by [clarke2009statistical], which quantifies the risk of a severe medical condition based on how close the glucose level is to hypoglycemia (blood glucose below 3.9 mmol/l) or hyperglycemia (blood glucose above 10 mmol/l).

Table 6 shows the fraction of episodes in which each patient experienced a severely poor outcome for each algorithm while learning. Optimism-based exploration approximately halves the number of episodes with severely poor outcomes, highlighting a side benefit of our optimistic approach of more quickly learning a good safe policy.

Related Work

Optimizing policies for risk sensitivity in MDPs has long been studied, with policy gradient [sampling, tamar2015policy], actor-critic [tamar2013variance] and TD methods [tamar2013variance, sato2001td]. While most of this work considers mean-variance trade-off objectives, [chow2015risk] establish a connection between optimizing CVaR and robustness to modeling errors, presenting a value iteration algorithm. In contrast, we do not assume access to transition and reward models. [chow2014algorithms] present a policy gradient and an actor-critic algorithm for an expectation-maximizing objective with a CVaR constraint. None of these works considers systematic exploration; they rely on heuristics such as ε-greedy or on the stochasticity of the policy for exploration. Instead, we focus on how to explore systematically to find a good CVaR policy.

Our work builds upon recent advances in distributional RL [bellemare2017distributional, rowland2018analysis, dabney2018distributional], which are still concerned with optimizing expected return. Notably, [dabney2018implicit] aims to train risk-averse and risk-seeking agents, but does not address the exploration problem or attempt to find optimal policies quickly.

[dilokthanakul2018deep] uses risk-averse objectives to guide exploration for good performance w.r.t. expected return. [moerland2018potential] leverages the return distribution learned in distributional RL as a means for optimism in deterministic environments. [mavrin2019distributional] follow a similar pattern but can handle stochastic environments by disentangling intrinsic and parametric uncertainty. While they also evaluate the policy that picks the VaR-greedy action in one experiment, their algorithm still optimizes expected return during learning. In general, these approaches are fundamentally different from ours, which efficiently learns CVaR policies in stochastic environments by introducing optimism into the learned return distribution.


We present a new algorithm for quickly learning CVaR-optimal policies in Markov decision processes. This algorithm is the first to leverage optimism in combination with distributional reinforcement learning to learn risk-averse policies in a sample-efficient manner. Unlike existing work on expected return criteria, which relies on reward bonuses for optimism, we introduce optimism by directly modifying the target return distribution, and we provide a theoretical justification that, in the policy evaluation case for finite MDPs, this indeed yields optimistic estimates. We further empirically observe that our algorithm learns CVaR-optimal policies significantly faster than existing baselines on several benchmark tasks. These include simulated healthcare tasks where risk-averse policies are of particular interest: HIV medication treatment and insulin pump control for type 1 diabetes patients.


Appendix A Appendix

Experimental Details

Machine Replacement

The machine replacement environment consists of states (we use in the experiment), where the action replace transitions to the terminal state with cost at state , where and . The action don't replace has cost and transitions to state ; in our experiment we used . For the last state, however, the action don't replace has cost , where we used , and transitions to the terminal state. For the C51 algorithm we use , , learning rate and 51 atoms.
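The chain structure described above can be sketched as follows. This is a minimal illustration, not the paper's exact environment: the cost parameters were lost in extraction, so the values used here (`n_states`, the Gaussian cost means and standard deviations) are placeholders.

```python
import random


class MachineReplacementMDP:
    """Sketch of the machine-replacement chain: 'replace' jumps to the
    terminal state for a stochastic cost, 'keep' advances one state for
    no cost, and the last state forces a costly terminal repair.
    All cost parameters are hypothetical placeholders."""

    def __init__(self, n_states=10, seed=0):
        self.n = n_states
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        """action: 'replace' or 'keep'. Returns (next_state, cost, done)."""
        if action == "replace":
            cost = self.rng.gauss(0.7, 0.1)   # hypothetical replacement cost
            self.state = None                  # terminal
            return None, cost, True
        if self.state == self.n - 1:           # last state: forced repair
            cost = self.rng.gauss(2.0, 0.5)    # hypothetical heavy cost
            self.state = None
            return None, cost, True
        self.state += 1                        # 'keep' is free before the end
        return self.state, 0.0, False
```

The risk trade-off comes from the terminal costs: replacing early pays a modest stochastic cost, while riding the chain to the end risks a much larger one, which is exactly what separates a CVaR-optimal policy from an expectation-optimal one.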

Tuning: We use ε-greedy with a schedule that starts at and decays linearly to over time steps, remaining constant afterwards. This schedule achieved the best performance in our experiments when compared to other linear schedules {, , , } and to exponential decays of the form (, , ): {(0.9, 0.99, 5), (0.9, 0.99, 20), (0.9, 0.99, 2), (0.9, 0.99, 30), (0.5, 0.99, 5)}, where . We also tried our algorithm with optimism values of .
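A linear-then-constant ε schedule of the kind described above can be written as a small helper. The endpoint values used in the paper were lost in extraction, so the arguments below are hypothetical:

```python
def epsilon_schedule(t, eps_start, eps_end, decay_steps):
    """Linearly anneal epsilon from eps_start to eps_end over
    decay_steps time steps, holding it constant afterwards."""
    if t >= decay_steps:
        return eps_end
    frac = t / decay_steps
    return eps_start + frac * (eps_end - eps_start)
```

For example, `epsilon_schedule(t, 0.9, 0.05, 100_000)` starts at 0.9 and reaches 0.05 after 100,000 steps.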

For the actor-critic method we use a CVaR limit of -10 and a radial basis function kernel; the other hyper-parameters are as described in the appendix of


Additional Experiments: In addition to the risk level , we observe the same performance gain for other risk levels. As shown in Figure 7, optimism-based exploration achieves a significant gain over ε-greedy exploration for risk levels and .

Figure 7: Machine Replacement with different risk levels. Left: risk level , Right: risk level

Type 1 Diabetes Simulator

An open-source implementation of a type 1 diabetes simulator [simglucose] provides 30 different virtual patients: 10 children, 10 adolescents and 10 adults. For the experiments in this paper we used adult#003, adult#004 and adult#005. Additionally, we used the "Dexcom" sensor as the CGM (to measure blood glucose level) and "Insulet" as the insulin pump. Each simulation lasts 10 hours per patient; after 10 hours, the patient resets to the initial state. Each simulation step is 3 minutes.

The state space is a continuous vector of size 2 (glucose level, meal size), where glucose level is the amount of glucose measured by the "Dexcom" sensor and meal size is the amount of carbohydrate in each meal.

The action space is defined as (bolus, basal=0), where the amount of bolus injection is discretized into 6 bins between 30 (the maximum bolus, a property of the "Insulet" insulin pump) and 0 (no injection). Additionally, we inject two sources of stochasticity into the taken action: assume the action at time is the agent's decision; then we take the action at time , where:

where is drawn from the power law distribution and . Note that this delays the action by at most 5 steps, where the probability of taking the action at time is higher than at time , following the power law. Since each simulation step is 3 minutes, the patient might take the insulin up to 15 minutes after the time prescribed by the agent.
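A delay sampler of this shape can be sketched as follows. The paper's exact power-law exponent was lost in extraction, so the `alpha` below is a hypothetical placeholder; only the qualitative behavior (delays of at most 5 steps, with shorter delays more likely) is taken from the text.

```python
import random


def sample_delay(max_delay=5, alpha=2.0, rng=None):
    """Sample an action delay d in {0, ..., max_delay} with probability
    proportional to (d + 1) ** -alpha, so shorter delays are more likely.
    alpha is a hypothetical exponent standing in for the paper's
    unspecified power-law parameter."""
    rng = rng or random
    weights = [(d + 1) ** -alpha for d in range(max_delay + 1)]
    total = sum(weights)
    u = rng.random() * total
    acc = 0.0
    for d, w in enumerate(weights):
        acc += w
        if u <= acc:
            return d
    return max_delay
```

With 3-minute simulation steps, a sampled delay of 5 corresponds to the 15-minute worst case mentioned above.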

The reward structure is defined similarly to prior work [bastani2014model] as follows:

where , which is the estimate of bg (blood glucose) in mmol/L rather than mg/dL. Additionally, if the amount of glucose drops below 39 mg/dL, the agent incurs a penalty of .

We generated a meal plan scenario for all patients with meals of size 60, 20, 60, 20 CHO scheduled 1 hour, 3 hours, 5 hours and 7 hours after the start of the simulation. Note that this makes the simulation horizon 200 steps with 5 actionable steps (the initial state, and after each meal).

Categorical Distributional RL:

The C51 model consists of 2 hidden layers, each of size 32 with ReLU activations, followed by one output head per action, each with 51 neurons and a softmax activation, representing the return distribution of each action.

We used the Adam optimizer with learning rate , , and . We set , , 51 probability atoms, and a batch size of 32. For computing the CVaR we use 50 samples of the return.

Density Model: For the log-likelihood density model we used realNVP [dinh2016density] with 3 layers, each of size 64. The input to the model is the concatenated vector of . We used the same optimizer hyper-parameters as for the C51 model. We used a constant for computing the pseudo-count.
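One standard way to turn a density model into a pseudo-count is the construction of Bellemare et al. (2016), shown below as a sketch. The paper additionally uses a constant in its pseudo-count whose exact role was lost in extraction, so it is omitted here.

```python
def pseudo_count(rho, rho_prime, eps=1e-12):
    """Pseudo-count of a state from a density model's probability of
    that state before (rho) and after (rho_prime) a single update on it,
    following Bellemare et al. (2016): N = rho * (1 - rho') / (rho' - rho).
    eps guards against a non-positive prediction gain."""
    gain = max(rho_prime - rho, eps)
    return rho * (1.0 - rho_prime) / gain
```

Intuitively, the smaller the density model's improvement from one more observation of a state, the more often that state has effectively been seen, and the larger the pseudo-count.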

Tuning: We tuned our method and ε-greedy on patient adult#001 and used the same parameters for the other patients. We tried 5 different linear schedules for ε-greedy, {(0.9, 0.1, 2), (0.9, 0.05, 4), (0.9, 0.05, 6), (0.9, 0.3, 4), (0.9, 0.05, 10)}, where the first element is the initial ε, the second is the final ε, and the third is the episode ratio (i.e., ε decays linearly from the initial to the final value over that fraction of the total number of episodes). Additionally, we tried 5 different exponential decay schedules for ε-greedy of the form (, , ): {(0.9, 0.99, 5), (0.9, 0.99, 20), (0.9, 0.99, 2), (0.9, 0.99, 30), (0.5, 0.99, 5)}, where . The first of the exponential decay set performed best. We also tested our algorithm with constant optimism values of and picked the best value .

HIV Simulator

The environment is an implementation of the physical model described in [ernst2006clinical]. The state space has dimension 6 with , and the action space has size 4, indicating the efficacy of being on treatment. As described in [ernst2006clinical], takes the values and . We also added stochasticity to the action via random Gaussian noise, so the efficacy of a drug is computed as .

The reward structure is defined similarly to prior work [ernst2006clinical]. We simulate for 1000 time steps, where the agent can act at 50 of these steps (every 20 simulation steps) and actions remain constant within each interval. While training, we normalize the rewards by dividing them by .

Categorical Distributional RL: The C51 model consists of 4 hidden layers, each of size 128 with ReLU activations, followed by one output head per action, each with 151 neurons and a softmax activation, representing the return distribution of each action.

We used the Adam optimizer with a learning rate decayed from to over half the number of episodes, , and . We set , , 151 probability atoms, and a batch size of 32. For computing the CVaR we use 50 samples of the return.

Implicit Quantile Network: The IQN model consists of 4 hidden layers of size 128 with ReLU activations. An embedding of size 64 is then computed with a ReLU. We take the element-wise multiplication of this embedding and the output of the 4 hidden layers, followed by a fully connected layer of size 128 with ReLU activation and a softmax layer. We used 8 samples for and , and 32 quantiles.
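The quantile embedding in IQN is conventionally a cosine basis of the sampled quantile level τ, passed through a learned linear layer with ReLU and merged with the state features by element-wise multiplication. The sketch below illustrates that shape with a fixed random projection standing in for the learned weights; it is not the paper's trained network.

```python
import numpy as np


def quantile_embedding(tau, state_features, embed_dim=64, seed=0):
    """IQN-style quantile embedding: cosine features cos(pi * j * tau)
    for j = 0..embed_dim-1, mapped through ReLU(linear(.)) and merged
    with the state features via element-wise (Hadamard) product.
    The random projection is a stand-in for learned weights."""
    j = np.arange(embed_dim)
    cos_feats = np.cos(np.pi * j * tau)          # cosine basis of tau
    rng = np.random.default_rng(seed)            # stand-in for learned layer
    w = rng.standard_normal((embed_dim, state_features.shape[-1]))
    phi = np.maximum(cos_feats @ w, 0.0)         # ReLU(linear(cos feats))
    return state_features * phi                  # element-wise merge
```

Sampling many τ values and passing each through this embedding is what lets a single network represent the whole quantile function of the return.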

Density Model: For the log-likelihood density model we used realNVP [dinh2016density] with 3 layers, each of size 64. The input to the model is the concatenated vector of . We used the same optimizer hyper-parameters as for the C51 model. We used a constant for computing the pseudo-count.

Tuning: We tuned our method, ε-greedy and IQN. For ε-greedy we tried 5 different linear schedules, {(0.9, 0.05, 10), (0.9, 0.05, 8), (0.9, 0.05, 5), (0.9, 0.05, 4), (0.9, 0.05, 2)}, where the first element is the initial ε, the second is the final ε, and the third is the episode ratio (i.e., ε decays linearly from the initial to the final value over that fraction of the total number of episodes). Additionally, we tried 5 different exponential decay schedules for ε-greedy and IQN of the form (, , ): {(1.0, 0.9, 10), (1.0, 0.9, 100), (1.0, 0.9, 500), (1.0, 0.99, 10), (1, 0.99, 100)}, where . The first of the linear decay set performed best. We also tested our algorithm with constant optimism values of and picked the best value .

Theoretical Analysis

[Restatement of Proposition from the Theoretical Analysis section] For any , the operator is a non-expansion in the Cramér distance . This implies that optimistic distributional Bellman backups and their projected version are -contractions in , and that iterates of these operators converge in to a unique fixed point.


Consider , any state and action with CDFs and , and consider the application of the optimism operator :


Generally, for any we have


and applying this case-by-case bound to the quantity in the integral above, we get


Taking the square root of both sides as well as a max over states and actions, we get that is a non-expansion in . The rest of the statement follows from the fact that is a -contraction and a non-expansion [rowland2018analysis], together with the Banach fixed-point theorem. ∎

[Restatement of Theorem from the Theoretical Analysis section] Let the shift parameter in the optimistic operator be sufficiently large, that is, . Then, with probability at least , the iterates converge for any risk level and initial to an optimistic estimate of the policy's conditional value at risk. That is, with probability at least ,


By Lemma A and Lemma 3 of [bellemare2017distributional], we know that converges to a unique fixed point , independent of the initial . Hence, without loss of generality, we can choose .

We proceed by first showing how our result follows under a particular definition of , and then show what that definition is. Assume that we have obtained a value for that satisfies the assumption of Lemma A (see other parts of this appendix), and let and . Then Lemma A implies that if for all , then also for all . Thus, for all . Finally, we can use Lemma A to obtain the desired result of our proof statement: for all .

Returning to the definition of , we use concentration inequalities to determine the value of that ensures the required condition of Lemma A (expressed in Eq. (19)). The DKW inequality gives us that for any , with probability at least ,
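The Dvoretzky–Kiefer–Wolfowitz (DKW) inequality referenced here has the following standard form; this is a reconstruction from the cited result, since the displayed equation was lost in extraction, and the paper's exact constants may differ. For the empirical CDF $\hat{F}_n$ built from $n$ i.i.d. samples of a distribution with CDF $F$, with probability at least $1-\delta$:

```latex
\sup_{x \in \mathbb{R}} \left| \hat{F}_n(x) - F(x) \right|
  \le \sqrt{\frac{\ln(2/\delta)}{2n}} .
```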


Further, the inequality of [weissman2003inequalities] gives that
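The inequality of [weissman2003inequalities] bounds the $\ell_1$ distance between a discrete distribution $p$ over $S$ outcomes and its empirical estimate $\hat{p}$ from $n$ samples. Its standard form, again a reconstruction since the displayed equation was lost in extraction, is: with probability at least $1-\delta$,

```latex
\left\| \hat{p} - p \right\|_1
  \le \sqrt{\frac{2 \left( S \ln 2 + \ln(1/\delta) \right)}{n}} .
```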


Combining both with a union bound over all state-action pairs, we find that it is sufficient to choose to ensure that , allowing us to apply Lemma A.

However, we can improve this result by removing the polynomial dependency of on the number of states, as follows. Consider a fixed and denote , where we set . Our goal is to derive a concentration bound on that is tighter than the bound derived from . Note that is not a random quantity, and hence is a normalized sum of independent random variables for any . To deal with the continuous variable , which prevents us from applying a union bound over directly, we use a covering argument. Let be arbitrary and consider the discretization set


Define as the discretization of at the discretization points in . This construction ensures that the discretization error is uniformly bounded by , that is, holds for all and . Hence, we can bound, for all ,


where in we applied Hoeffding's inequality to the first term, in combination with a union bound over , as can only take values in ; the second term was bounded with Hölder's inequality; and in the concentration inequality from Eq. (12) was used. Combining this bound with Eq. (11) by applying a union bound over all states and actions, we get that picking


is sufficient to apply Lemma A. Since is non-decreasing, the size of the discretization set is at most and by picking , we see that is sufficient.

Let be such that for all . Assume finitely many states and actions . Let and be the empirical reward distributions and transition probabilities. Assume further that


where is the shift in the optimism operator . Then