Reinforcement learning (RL) is a framework in which an agent interacts with an unknown environment [39, 6] and optimizes its performance based on scalar feedback from that environment. The objective in this framework is typically to maximize (or minimize) a cumulative reward (or cost). However, in many cases the agent is tasked with solving the problem at hand while also minimizing some form of risk, that is, a measure that quantifies the potential worst-case scenarios that may result from the agent’s solution (a policy).
There are many examples of risk-aware decision-making problems. Important cases include process control (where one seeks to optimize a process without endangering the deliverables), finance (where one seeks to avoid catastrophic financial events), motion control (where one seeks, for example, safety in a space shared by humans and robots), and automotive applications (where one is interested in a safe plan for a lane change in a self-driving car).
The main vehicle we employ in this work to incorporate risk into the optimization of an agent is the policy gradient method (PG; [14, 5]). Extending the PG method to account for risk has been shown to be straightforward. In this work we develop the PG method further so that any risk function can be plugged in.
In this work we show that even the undoubtedly rich class of coherent risk functions (see Footnote 1) does not fully capture the desired practical properties of risk functions (for example, see the non-standard risk functions suggested by [22, 23]). Also, the state-of-the-art methods for optimizing coherent risk measures involve complicated machinery, whereas our proposed method provides a significant improvement in computational efficiency and is relatively simple.
Footnote 1: A risk function $\rho$ is said to be coherent if, for random variables $Z$ and $W$, it satisfies (A1) Convexity: $\rho(\lambda Z + (1-\lambda) W) \le \lambda \rho(Z) + (1-\lambda) \rho(W)$ for all $\lambda \in [0, 1]$; (A2) Monotonicity: if $Z \le W$ then $\rho(Z) \le \rho(W)$; (A3) Translation Invariance: $\rho(Z + a) = \rho(Z) + a$ for all $a \in \mathbb{R}$; (A4) Positive Homogeneity: if $\lambda \ge 0$, then $\rho(\lambda Z) = \lambda \rho(Z)$.
The main contributions of this work are the following:
1. In our proposed architecture, we allow general risk measures. Although coherent risk measures are quite general, some use cases may call for other risk shapes.
2. We show how to implement an RL agent that considers a risk measure with a deep neural architecture [21, 15]. Since deep networks are powerful function approximators, almost any risk function that one can think of can be plugged into our architecture. Our proposed architecture is based on the Actor Critic method ([39, 19]). Deep Q-Networks (DQN) are used in a bootstrapping fashion to estimate the value function by minimizing the squared Temporal Difference (TD) error. Such methods are broadly ineffective for estimating risk functions, apart from a few notable exceptions (e.g., variance). Therefore, our proposed method is based mainly on Monte-Carlo simulations.
3. We demonstrate how to shape the risk function. Most previous works assume an arbitrary risk function whose fitness to the problem at hand is unknown. Similarly to reward-shaping, we suggest risk-shaping, together with a simple process for extracting the risk function from observed data. The inspiration for this approach comes from the reward-shaping literature, where it was demonstrated that a better shaped reward function can improve an algorithm’s performance.
We note that risk shaping and generalized utility functions are complementary to each other. As we will show, when a risk function is extracted from data, it is not guaranteed to be coherent.
The paper is organized as follows. In Section 2 we review related work and in Section 3 we formulate the problem. In Section 4 we describe the Grid World setup. In Section 5 we discuss risk shaping. In Section 6 we provide neural architectures for solving the risk problem. In Section 7 we demonstrate our findings. We present our conclusions and future work in Section 8.
2 Related Work
The literature on risk in MDPs is quite rich and has a long history. It was shown early on that risk measures on the reward-to-go can be written in closed form, but solving MDPs (for either planning or RL) while incorporating risk measures based on these closed forms is practically impossible due to their high nonlinearity. Another analytical direction is the exponential utility function risk measure. Although this form is analytically tractable, it hardly captures real-world problems.
In recent years, interest in risk in MDPs has been renewed. A very basic form of risk is constraining the instantaneous variance of a state, as investigated in [2, 13, 36]. This problem is of polynomial complexity and can be solved easily. It has been shown that optimizing an MDP under a constraint on the variance of the reward-to-go may be NP-hard, so that only a locally optimal solution can be obtained.
In the context of RL and planning where the risk criterion is the variance of the reward-to-go, such locally optimal solutions have been given explicitly for the policy gradient method. In the context of policy evaluation, the variance of the reward-to-go has been incorporated into TD and LSTD methods, and an actor-critic algorithm has also been suggested.
Another direction of research concerns the Value at Risk (VaR) criterion and limiting the percentile [10, 35, 28, 8]. In this case, we want to limit the worst-case trajectories starting from a specific state. A related method is the Conditional Value at Risk (CVaR; [44, 8]). In this method the objective is to estimate the average over some lower percentile of the distribution, in contrast to the percentile itself. For both methods, it seems that one cannot escape estimating the probability distribution of the trajectories starting from a specific state, which makes our proposed architecture simpler by comparison.
A generalization of both the variance and the CVaR in the context of RL is the coherent risk. In this generalization, some “good traits” of a risk function are considered (among them convexity, insensitivity to constants, scaling of risk, etc.). A closed-form formula has been developed for the policy gradient with a generalized risk measure in the context of MDPs and RL. Finding the solution for coherent risk measures is not an easy task: one needs to solve a constrained optimization problem which may differ for different risk functions. Our approach, on the other hand, is a “plug-and-play” approach, where each desired risk function can be used in our architecture in a straightforward manner.
Another body of work that relates to risk is the Constrained MDPs literature (CMDP). The CMDP case can be viewed as a risk case in which the risk function is the identity function. A variant of constrained MDPs is “constrained policy iteration”, in which constraints are applied to the policy computation itself.
3 Setup and Basic Formulae
We consider a Markov Decision Process (MDP) where $\mathcal{S}$ and $\mathcal{A}$ are the state space and action space, respectively. The probability $P(s'|s,a)$ is the transition probability from state $s$ to state $s'$ when applying action $a$. For this transition matrix $P$, under a specific policy, we let $\mu$ denote the stationary distribution. The reward function is denoted by $r(s)$, which we assume to be bounded. We consider a probabilistic policy $\pi_\theta(a|s)$, parameterized by $\theta$, which expresses the probability that the agent chooses action $a$ given that it is in state $s$.
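$B(s)$ is a random variable that expresses the accumulated discounted rewards that the agent receives during its interaction with the environment, starting from state $s$:
$$B(s) = \sum_{t=0}^{\tau} \gamma^t r(s_t), \qquad s_0 = s, \tag{1}$$
where $\gamma$ is the discount factor and $\tau$ is the reward-to-go horizon. The horizon can be either finite, infinite, or stochastic. The goal of the agent is to find a policy that maximizes the so-called value function
$$V_\theta(s) = \mathbb{E}_\theta\left[ B(s) \right], \tag{2}$$
where $\mathbb{E}_\theta$ is the expectation w.r.t. the MDP and the policy, which depends on $\theta$. For ease of exposition, we omit the subscript $\theta$ whenever it is clear from the context. Based on the reward-to-go, we define the risk measure to be
$$\rho_\theta(s) = \mathbb{E}_\theta\left[ f\big( B(s) - V_\theta(s) \big) \right], \tag{3}$$
where $f$ is a function such as the square, absolute value, square root, etc. Our objective in this case is
$$\max_\theta V_\theta(s) \quad \text{s.t.} \quad \rho_\theta(s) \le b, \tag{4}$$
where $b$ is the risk constraint value.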
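Following prior work, we suggest approximating this optimization problem with a soft constraint. We define
$$J_\lambda(\theta) = V_\theta(s) - \lambda\, g\big( \rho_\theta(s) - b \big), \tag{5}$$
where $g$ is a penalty function that is typically taken to be $g(x) = (\max(0, x))^2$, and $\lambda$ is the penalty coefficient. Based on Eq. (5), we have the optimization problem
$$\max_\theta J_\lambda(\theta), \tag{6}$$
where, as $\lambda$ increases, the solution of Eq. (6) converges to the solution of Eq. (4). We propose a solution to this problem that follows the gradient approach by way of iterative updates to the value of $\theta$:
$$\theta_{k+1} = \theta_k + \alpha_k \nabla_\theta J_\lambda(\theta_k), \tag{7}$$
where $\alpha_k$ is a step size.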
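As a minimal sketch of the soft-constrained objective, the following Python snippet evaluates it from a batch of sampled reward sequences. It assumes that the risk is the sample average of a function applied to the deviation of each reward-to-go from the batch mean, and that the penalty is a squared hinge. The names `discounted_return`, `penalty`, `penalized_objective`, `risk_fn`, `b`, and `lam` are illustrative and are not taken from the original implementation.

```python
from statistics import mean


def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence (the reward-to-go)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def penalty(x):
    """Squared hinge penalty: zero whenever the constraint holds (x <= 0)."""
    return max(0.0, x) ** 2


def penalized_objective(trajectories, gamma, risk_fn, b, lam):
    """Return (objective, value estimate, risk estimate) for a batch.

    Assumption: risk = average of risk_fn(return - mean return).
    """
    returns = [discounted_return(tr, gamma) for tr in trajectories]
    value = mean(returns)
    risk = mean(risk_fn(ret - value) for ret in returns)
    return value - lam * penalty(risk - b), value, risk
```

With a large penalty coefficient `lam`, any batch whose estimated risk exceeds `b` is scored far below a batch that satisfies the constraint, which is the behavior the soft constraint is meant to induce.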
3.1 Policy Gradient Methods
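Next, we recall the equations of the policy gradient method for the case of a finite time trajectory. We can further generalize this method as follows.
Lemma. Let $f$ be a differentiable function. Then,
$$\nabla \mathbb{E}_\theta\left[ f(B) \right] = \mathbb{E}_\theta\left[ f(B) \sum_{t=0}^{\tau-1} \nabla \log \pi_\theta(a_t|s_t) \right],$$
where the gradient is taken w.r.t. $\theta$.
Proof. Let $\xi = (s_0, a_0, \ldots, s_\tau)$ denote a trajectory, $P_\theta(\xi)$ its probability, and $B(\xi)$ its reward-to-go. Then
$$\nabla \mathbb{E}_\theta\left[ f(B) \right] = \nabla \sum_\xi P_\theta(\xi) f(B(\xi)) \overset{(a)}{=} \sum_\xi \nabla P_\theta(\xi) f(B(\xi)) \overset{(b)}{=} \sum_\xi P_\theta(\xi) \frac{\nabla P_\theta(\xi)}{P_\theta(\xi)} f(B(\xi)) = \mathbb{E}_\theta\left[ f(B) \nabla \log P_\theta(\xi) \right],$$
where in (a) we changed the order of the summation and the gradient, and in (b) we multiplied and divided by the same factor.
Next, for $P_\theta(\xi) = P(s_0) \prod_{t=0}^{\tau-1} \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t)$ it is easy to show that
$$\nabla \log P_\theta(\xi) = \sum_{t=0}^{\tau-1} \nabla \log \pi_\theta(a_t|s_t).$$
Plugging this expression into the chain of equalities above we get the desired result. ∎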
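The likelihood-ratio identity behind this result can be checked numerically. The sketch below is a toy illustration, not the setting of the paper: it assumes a single-step problem with two actions, a sigmoid policy with a scalar parameter `theta`, and a deterministic reward per action. The function names are hypothetical. It compares the exact gradient of the expectation of `f` with the Monte Carlo estimate obtained by weighting `f` with the score of the policy.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def exact_gradient(theta, f, rewards):
    """d/dtheta of E[f(B)] where B = rewards[a] and P(a = 1) = sigmoid(theta)."""
    p1 = sigmoid(theta)
    return p1 * (1.0 - p1) * (f(rewards[1]) - f(rewards[0]))


def likelihood_ratio_gradient(theta, f, rewards, n_samples, seed=0):
    """Monte Carlo estimate: average of f(B) * d/dtheta log pi_theta(a)."""
    rng = random.Random(seed)
    p1 = sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        a = 1 if rng.random() < p1 else 0
        # Score of the sigmoid policy: (1 - p1) for a = 1, -p1 for a = 0.
        score = (1.0 - p1) if a == 1 else -p1
        total += f(rewards[a]) * score
    return total / n_samples
```

For example, with `f = lambda x: x * x`, `rewards = (1.0, 2.0)` and `theta = 0.3`, the two functions should agree up to sampling noise when `n_samples` is large.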
Now, for general risk measures, we need to calculate the gradient of Eq. (3). Here, the inner part of the expectation (specifically, the term $V_\theta(s)$) also depends on $\theta$, which adds an additional step to our derivation.
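The additional step can be sketched as follows, under the assumption that the risk in Eq. (3) is the expectation of a differentiable function $f$ applied to the deviation $B - V_\theta$ of the reward-to-go from its mean. Here $\xi$ denotes a trajectory and $P_\theta(\xi)$ its probability; this notation is used only for the sketch. Applying the product rule to $\sum_\xi P_\theta(\xi) f(B(\xi) - V_\theta)$ gives the usual likelihood-ratio term plus a correction term that passes through the reference value:

```latex
\begin{align*}
\nabla_\theta \mathbb{E}_\theta\!\left[ f\big(B - V_\theta\big) \right]
  &= \mathbb{E}_\theta\!\left[ f\big(B - V_\theta\big)\, \nabla_\theta \log P_\theta(\xi) \right]
   - \mathbb{E}_\theta\!\left[ f'\big(B - V_\theta\big) \right] \nabla_\theta V_\theta, \\
\nabla_\theta V_\theta
  &= \mathbb{E}_\theta\!\left[ B\, \nabla_\theta \log P_\theta(\xi) \right].
\end{align*}
```

The second line is the standard policy gradient of the value function, so the correction term needs no machinery beyond an estimate of $\mathbb{E}_\theta[f'(B - V_\theta)]$.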
4 The Grid World Setup
Although our framework is quite general, we present our risk-shaping approach (in the following section) and focus our experiments on the well-studied Grid World setting. There, the objective is to find a path from a prescribed starting point to a designated target point. Each square represents a location on a two-dimensional grid. Within this grid, between the start and termination points, there are two special types of entries. The first type consists of mines, which carry a negative reward (see Footnote 2) that is given to the agent whenever it steps on them. The probability that a location contains a mine decreases linearly and monotonically with one of the grid coordinates.
Footnote 2: For simplicity of discussion we consider uniform reward values, but this restriction can be easily lifted.
The second special type of entry is an entry with its own associated reward. All other entries are associated with zero reward. At every time step, the agent can move up, down, right, or left to any of its adjacent squares; these four directions form the action set. An illustration of this grid world is given in Figure 1. Additionally, the agent’s movement is subject to control noise that, with a fixed probability, causes the agent to move in a uniformly random direction from the action set.
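The environment dynamics described above can be sketched in a few lines of Python. Several details here are assumptions made only for illustration: the mine probability is taken to decrease linearly in the `y` coordinate from a hypothetical maximum `p_max` down to zero, moves that would leave the grid are clipped at the border, and all function and parameter names are ours.

```python
import random

# The four directions the agent can move in.
ACTIONS = {"up": (0, 1), "down": (0, -1), "right": (1, 0), "left": (-1, 0)}


def mine_probability(y, height, p_max):
    """Assumed form: linear and monotonically decreasing in y."""
    return p_max * (1.0 - y / (height - 1))


def sample_mines(width, height, p_max, rng):
    """Draw a random mine layout, one independent draw per cell."""
    return {
        (x, y)
        for x in range(width)
        for y in range(height)
        if rng.random() < mine_probability(y, height, p_max)
    }


def noisy_step(pos, action, noise_prob, width, height, rng):
    """Move one cell; with probability noise_prob the direction is random."""
    if rng.random() < noise_prob:
        action = rng.choice(sorted(ACTIONS))
    dx, dy = ACTIONS[action]
    x = min(max(pos[0] + dx, 0), width - 1)
    y = min(max(pos[1] + dy, 0), height - 1)
    return (x, y)
```

A layout sampled with `sample_mines` may place a mine on the start or target cell; a fuller implementation would presumably exclude those cells.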
5 Risk-Shaping and General Risk Functions
In this section we demonstrate that risk functions do not necessarily need to be coherent and, in particular, can be non-convex. We illustrate this with a particular instance of the gambler’s ruin problem. Many real-life problems, in particular problems in economics and finance, lie well within this domain of problems [9, 45].
Our specific variant of the gambler’s ruin problem relies on a Markov Reward Process (MRP), which is a more powerful setup than the Markov chain model. The state space is infinite and the states are denoted by $x \in \{0, 1, 2, \ldots\}$. These state values denote the amount of money (in dollars) currently in the possession of the agent. The process terminates when the agent reaches the state $x = 0$, which indicates the bankruptcy of the agent. At each time step, the agent gambles and wins a dollar with probability $p$; otherwise it loses a dollar (with probability $1 - p$). The agent can only gamble as long as its fortune is strictly positive (we exclude borrowing money). We define the risk as follows: $\rho_T(x)$ is the probability that the agent will go bankrupt within $T$ steps, given an initial fortune of $x$. Therefore, the risk in this case is a probability, i.e., $\rho_T(x) \in [0, 1]$, where $T$ is the risk-look-ahead parameter. In other words, $T$ dictates the interval over which we optimize our risk.
We want to solve Eq. (3) according to this model. We know the risk for each state, we can calculate the value function using Monte Carlo simulation or any other Policy Evaluation (PE) method, and each simulated trajectory provides a realization of the argument of the risk function. Based on that, we get samples of the risk as a function of the state. Now, we are interested in learning a function that best fits these samples. We refer to this process of learning the function from data as Risk Shaping.
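To make the risk in this model concrete, the sketch below computes the probability of bankruptcy within a finite number of steps in two ways: exactly, by a forward recursion over the look-ahead horizon, and approximately, by Monte Carlo simulation. The function names and the parameter names `horizon`, `p`, and `n_runs` are illustrative; the recursion is a standard dynamic-programming computation and is not claimed to be the procedure used in the experiments.

```python
import random


def ruin_probability(x, horizon, p):
    """Exact probability of reaching fortune 0 within `horizon` steps from x."""
    size = x + horizon + 1
    # probs[k] = P(ruin within t steps | fortune k); at t = 0 only k = 0 is ruined.
    probs = [1.0] + [0.0] * size
    for _ in range(horizon):
        new = [1.0] + [0.0] * size
        for k in range(1, size):
            new[k] = p * probs[k + 1] + (1.0 - p) * probs[k - 1]
        probs = new
    return probs[x]


def simulate_ruin(x, horizon, p, n_runs, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(n_runs):
        fortune = x
        for _ in range(horizon):
            fortune += 1 if rng.random() < p else -1
            if fortune == 0:
                ruined += 1
                break
    return ruined / n_runs
```

Evaluating `ruin_probability` over a range of initial fortunes yields risk samples of the kind to which a risk function can then be fitted.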
We ran the process described above and obtained samples that describe the correspondence between and , as described in Eq. (3). This correspondence is depicted in Figure 2 (red dots). We selected as a model for these samples the equation , where and .
We remark that, as opposed to common practice, in this particular example limiting the variance (as in [42, 43] and follow-up work) or optimizing the VaR or CVaR is arguably disadvantageous. The high-risk regime in this example is characterized by low variance (due to the vicinity to the bankruptcy state). Moreover, far from bankruptcy the variance is maximal and the risk is zero. This is the reason for the non-convexity of the risk function, as illustrated in Figure 2.
6 Neural Architectures
In this section we describe a neural architecture for estimating the different components. We propose the Actor Risk-Critic Value-Critic Architecture (ARCVC), which consists of three main components: (1) an actor for the policy, (2) a value-function critic, and (3) a risk-function critic. The value function critic component is a standard (and not necessarily linear) function approximation for the value function. We focus in this section on the other two components. We recall that, in order to estimate the value function, most techniques (excluding Monte-Carlo simulations) use some form of the Bellman equation. For some concrete risk functions, a similar approach may be applied, such as in the variance case. However, as noted by the authors, such closed-form equations exist only in a limited number of cases. To address this difficulty, we employ Monte Carlo simulations in the proposed architectures.
We propose the use of a finite time buffer, denoted by FTB, to collect the recent samples of the reward, and, once collected, we use them to compute an estimate of the risk according to Eq. (3).
Such a buffer can be implemented using a queue. With a queue, computing a risk sample takes time and space that are at most linear in the buffer length. The Risk-Critic is then trained by minimizing a loss between its output and these risk samples.
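A queue-based buffer of this kind could look as follows. This is a minimal sketch under stated assumptions: the buffer keeps the most recent rewards up to a fixed length, the risk sample is a function of the difference between the buffered discounted sum and a reference value (for example a value estimate), and the names `FiniteTimeBuffer`, `risk_sample`, `reference`, and `risk_fn` are ours.

```python
from collections import deque


class FiniteTimeBuffer:
    """Fixed-length queue holding the most recent rewards."""

    def __init__(self, horizon, gamma):
        self.gamma = gamma
        # A deque with maxlen drops the oldest reward automatically.
        self.rewards = deque(maxlen=horizon)

    def push(self, reward):
        self.rewards.append(reward)

    def is_full(self):
        return len(self.rewards) == self.rewards.maxlen

    def discounted_sum(self):
        """Discounted sum of the buffered rewards, oldest reward first."""
        total = 0.0
        for r in reversed(self.rewards):
            total = r + self.gamma * total
        return total


def risk_sample(buffer, reference, risk_fn):
    """Single-sample risk estimate: risk_fn(buffered return - reference)."""
    return risk_fn(buffer.discounted_sum() - reference)
```

In this sketch the discounted sum is recomputed on demand, which is linear in the buffer length; an incremental update would be possible but is omitted for clarity.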
A schematic illustration of this architecture is provided in Figure 3. Three networks are involved: one for the actor, one for the value function critic, and one for the risk critic. The value function is needed by the risk network, where it is used as a reference value for computing the risk. The role of the risk network w.r.t. the policy network is twofold. First, it indicates whether or not the risk constraint is violated. Second, whenever the constraint is violated, it adds a penalty to the general objective function and pulls the policy gradient in the direction that reduces the risk value.
6.1 A Compact Architecture
As was shown in the previous section, a naive architecture involves three networks. In this section, we show how to reduce the network size by making several modifications. This reduces the computational overhead in terms of both running time and space complexity.
6.1.1 Changing the Reference
We suggest a more compact architecture that does not involve the value function network. First, consider a slightly more general version of the objective function:
Setting admits the original setup presented in the previous section (see Eq. (3)). We substitute the soft constraint with the following
where the argument of the function and the constraint are identical and where we denoted this by . We consider several ways to set .
Changing reference to . We propose to replace with in Eq. (5). Therefore, the objective is
The meaning of Eq. (13) is the following: instead of just measuring the average distance from the mean trajectory, as captured by the subtraction of from the trajectory, we also include the soft constraint in the distance function. This modified version has an interesting property: It is easy to show that when the constraint is satisfied, the function converges to . In other words, we get a similar objective for satisfying the constraints, i.e.,
The downside of this replacement is that it does not reduce the complexity. Now, instead of having to estimate we only need to estimate .
Changing reference to a global value. We propose to replace the state-dependent reference with its average under the stationary distribution $\mu$. In order to estimate this average we propose a Stochastic Approximation iteration.
The step size of the iteration may be a “small” constant or a decreasing sequence. The advantage of this iteration is clear: instead of maintaining a network for the reference, we only need to perform scalar updates. The downside of this approach is the potential loss in accuracy of the risk estimate. However, in many cases this approximation proves to be relatively good.
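A scalar update of this kind can be sketched as a running average. The snippet below is an assumption-laden illustration: it treats each observed reward-to-go as a sample of the quantity being averaged, and uses a step size of the form `1 / t ** kappa`, where the exponent `kappa` and all function names are hypothetical.

```python
def step_size(t, kappa=1.0):
    """Decreasing step size for iteration t = 1, 2, ... (assumed form)."""
    return 1.0 / (t ** kappa)


def update_reference(estimate, sample, step):
    """One stochastic approximation step toward the new sample."""
    return estimate + step * (sample - estimate)


def estimate_reference(samples, kappa=1.0, initial=0.0):
    """Run the iteration over a sequence of samples."""
    estimate = initial
    for t, sample in enumerate(samples, start=1):
        estimate = update_reference(estimate, sample, step_size(t, kappa))
    return estimate
```

With `kappa=1.0` the iteration reduces to the ordinary sample mean, while a constant step would instead track a slowly changing reference as the policy is updated.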
Changing reference to constant. Another possibility would be to replace with some constant, based on some prior information we obtain from a domain expert. We do not study this approach in the present work, although it might prove to be stable and eventually beneficial.
6.1.2 Sample Based Penalty Function
Another simplification is to remove the dependency of the penalty function on the risk network. The role of the risk network is to map a state to the risk associated with it. We propose to base the penalty function solely on a sample that represents the current risk. Similarly to Section 6.1.1, this shrinks the architecture by a whole estimation network. Later, in the experiments, we demonstrate that such a change does not reduce the agent’s performance, neither in the accumulated reward nor in the risk quality. The basic algorithm for the ARCVC architecture (in episodic form) is summarized in Algorithm 1.
7 Experiments
In this section we empirically investigate the behavior of different risk measures and strengthen our understanding of the specific trade-offs of such scenarios.
The policy gradient network has two layers with a ReLU activation between them and a soft-max output layer. Both critics, the value estimation network and the risk estimation network, have three layers, where the first two activation functions are ReLU and the output layer is linear. For optimization we use the Adam optimizer, which gave us the best and most stable results.
7.1 The Risk Violation Rate - A Measure for Examining the Risk
In order to assess the efficacy of our risk measure in practical terms, we need to quantify how well the algorithm manages to reach its objective while minimizing the likelihood of violating its prescribed risk constraint. We propose the following method for grading.
The Risk Violation Rate is the fraction of times in which the risk constraint of the constrained risk problem was violated.
In the experiments we describe below, we use this measure to estimate how good a method is at risk estimation and at satisfying the risk constraints. Additionally, we couple the risk violation rate with the algorithm’s success rate, as otherwise the algorithm could trivially uphold the constraint without actually reaching the original objective (see Footnote 3).
Footnote 3: This is analogous to the precision-recall trade-off.
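Both quantities are simple fractions and can be computed as sketched below. The sketch assumes that one risk value and one success flag are recorded per evaluation (for example per episode) and that a violation means the recorded risk exceeds the constraint value; the function and argument names are illustrative.

```python
def risk_violation_rate(risk_values, constraint):
    """Fraction of recorded risk values that exceed the constraint."""
    violations = sum(1 for risk in risk_values if risk > constraint)
    return violations / len(risk_values)


def success_rate(reached_goal):
    """Fraction of evaluations in which the agent reached its objective."""
    return sum(1 for ok in reached_goal if ok) / len(reached_goal)
```

Reporting the two numbers side by side guards against a policy that never violates the constraint simply because it never attempts the task.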
7.2 Comparison of Different Risks Measures
We examined our algorithm on three representative risk functions: (1) “One Sided Variance”, (2) “One Sided Absolute Value”, and (3) “One Sided Square Root”, which apply the square, the absolute value, and the square root, respectively, to one side of their argument only. The reason for taking one-sided functions is that we are only concerned about negative rewards.
Now, suppose we want to compare these risk functions. For an appropriate comparison, a constraint value should scale differently for different risk functions, as dictated by the function at hand. For example, a constraint value chosen for the One Sided Absolute Value function must be transformed to obtain its analogous constraint value for the One Sided Square Root function, and similarly for the One Sided Variance function.
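One plausible reading of these three functions and of the constraint scaling is sketched below. Both parts are assumptions made for illustration: the one-sided functions are taken to act on the negative part of their argument, and the analogous constraint is obtained by passing the absolute-value constraint through the same transformation (square root or square). The names are hypothetical.

```python
import math


def negative_part(x):
    """Keep only shortfalls: zero for non-negative arguments."""
    return min(0.0, x)


def one_sided_variance(x):
    return negative_part(x) ** 2


def one_sided_abs(x):
    return abs(negative_part(x))


def one_sided_sqrt(x):
    return math.sqrt(abs(negative_part(x)))


def analogous_constraint(b_abs, target):
    """Map a constraint set for the absolute value to another risk function."""
    if target == "abs":
        return b_abs
    if target == "sqrt":
        return math.sqrt(b_abs)
    if target == "variance":
        return b_abs ** 2
    raise ValueError(f"unknown risk function: {target}")
```

Under this reading, an absolute-value constraint of 4 would correspond to a constraint of 2 for the square-root risk and 16 for the one-sided variance.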
In our experiments, we set the constraint parameter to a value that turned out to push the algorithm to display quite interesting behavior: it caused the algorithm to have difficulty balancing the risk constraint satisfaction and the reward-to-go maximization. The results are depicted in Figure 4. We can see that non-coherent risk measures may be beneficial in some cases. First, they are aggressive in forcing the algorithm not to violate the constraint. Second, since the risk constraint is enforced during training, it can be regarded as safe exploration. On the other hand, as is evident from the results, overly aggressive risk functions can deteriorate the success rate. As has been pointed out in prior work, dealing with risk in the context of RL can sometimes pose a trade-off between performance and risk constraint satisfaction (see Footnote 4).
Footnote 4: The rest of the parameters for this experiment include D=0.1, a batch size of 100, 1000 episodes per run, 50 repetitions for each risk function, the ADAM optimizer, PyTorch ver. 1.0.0, and a single learning rate for all networks.
7.3 Changing the Reference
In the next experiment we examine the effect of changing the reference of the risk, as described in Section 6.1.1. We compared the original risk definition (Eq. 3) to the global reference . We define a score function that measures how much the risk based on a global reference (i.e., ) deviates from the original risk weighted by the stationary distribution (i.e., ).
In order to express the deviation as a score, we define the distance between and to be
Figure 5 depicts the empirical distribution of as we vary the discount factor between and . We see that the difference between the global reference and the original reference diminishes as the discount factor approaches .
7.4 Sample Based Penalty Function Estimation
The main objective of the risk network is to faithfully capture the risk for the penalty function . In this experiment we show that we do not lose much when we replace the risk network signal in the penalty function of Eq. 5 with a single sample based estimation. We conducted an experiment on the grid world environment, with different mine layouts, to estimate the penalty function based on the risk network output. We estimate the penalty function for the same 50 mine layouts, based on a single sample at each time step. More specifically, if we examine the penalty function in Eq. 5, we replace the term with the term .
The results are depicted in Figure 6. On one hand, there is a clear deterioration in terms of both the average accumulated reward, as well as the risk violation. On the other hand, this modification to the architecture eliminated the need for training an additional network (the risk network), and therefore results in massive savings in both memory and running time.
To summarize, one can obtain a compact architecture that is based only on the policy gradient network if (1) the value function network is replaced with a global reference and (2) the risk function is replaced with a single sample estimator.
8 Conclusions and Future Work
We have shown that natural risk measures extracted from some simple domains do not necessarily satisfy some of the requirements of coherent risk. In addition, we described a procedure for extracting an appropriate risk function from data, enabling the domain expert to understand and then approximate risk functions. Being able to shape the risk of a given problem and tailor it to the problem specifics is important because risk is more difficult to understand (and design) than the reward and, consequently, the value function.
We believe that investigating methods and best practices in risk shaping for different domains is paramount for the applicability of risk awareness in planning, RL problems, and MDPs in general.
In this work we did not apply a holistic approach; i.e., for given realizations of the agent interacting with an environment, we provided a method for extracting the risk function (i.e., risk shaping), and afterward we showed a method that can apply this general risk function in the MDP (i.e., applying generalized risk functions). As a future direction, we propose to interleave these two methods into a single holistic algorithm where, using light supervision, we learn the risk while solving the MDP. Also, in the context of safe exploration, we suggest incorporating a constraint on the violation rate as well.
References
[1] (2017) Constrained policy optimization. arXiv preprint arXiv:1705.10528.
[2] (1999) Constrained Markov decision processes. Vol. 7, CRC Press.
[3] Safe policy search for lifelong reinforcement learning with sublinear regret. International Conference on Machine Learning, pp. 2361–2369.
[4] (2016) Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
[5] (2001) Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research 15, pp. 319–350.
[6] (2005) Dynamic programming and optimal control. Vol. 1, Athena Scientific, Belmont, MA.
[7] (2013) Model predictive control. Springer Science & Business Media.
[8] (2017) Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research 18, pp. 167–1.
[9] (1974) Gambler’s ruin and investment analysis. In Proceedings, Annual Meeting (Western Agricultural Economics Association).
[10] (1995) Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control 40 (1), pp. 2–10.
[11] (2013) Monte Carlo: concepts, algorithms, and applications. Springer Science & Business Media.
[12] (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16 (1), pp. 1437–1480.
[13] (2005) Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research 24, pp. 81–108.
[14] (1990) Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM 33 (10), pp. 75–84.
[15] (2016) Deep learning. Vol. 1, MIT Press, Cambridge.
[16] (1972) Risk-sensitive Markov decision processes. Management Science 18 (7), pp. 356–369.
[17] Dynamic probabilistic systems: Markov models. Vol. 1, Courier Corporation.
[18] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[19] (2000) Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014.
[20] (2003) Stochastic approximation and recursive algorithms and applications. Vol. 35, Springer Science & Business Media.
[21] (2015) Deep learning. Nature 521 (7553), pp. 436.
[22] (2006) Functional value iteration for decision-theoretic planning with general utility functions. In Proceedings of the National Conference on Artificial Intelligence, Vol. 21, pp. 1186.
[23] (2012) Existence and finiteness conditions for risk-sensitive planning: results and conjectures. arXiv preprint arXiv:1207.1391.
[24] (2011) Mean-variance optimization in Markov decision processes. arXiv preprint arXiv:1104.5601.
[25] (2002) Risk-sensitive reinforcement learning. Machine Learning 49 (2-3), pp. 267–290.
[26] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
[27] (2012) Safe exploration in Markov decision processes. arXiv preprint arXiv:1205.4810.
[28] (2012) Parametric return density estimation for reinforcement learning. arXiv preprint arXiv:1203.3497.
[29] (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287.
[30] (1998) Markov chains. Cambridge University Press.
[31] (2013) Safe policy iteration. In International Conference on Machine Learning, pp. 307–315.
[32] (2011) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation.
[33] (2013) Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, pp. 252–260.
[34] (1994) Markov decision processes. J. Wiley and Sons.
[35] (2000) Optimization of conditional value-at-risk. Journal of Risk 2, pp. 21–42.
[36] (2001) TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence 16 (3), pp. 353–362.
[37] (2009) Lectures on stochastic programming: modeling and theory. SIAM.
[38] (1982) The variance of discounted Markov decision processes. Journal of Applied Probability 19 (4), pp. 794–802.
[39] (2018) Reinforcement learning: an introduction. MIT Press.
[40] (1988) Learning to predict by the methods of temporal differences. Machine Learning 3 (1), pp. 9–44.
[41] (2015) Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, pp. 1468–1476.
[42] (2012) Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, pp. 387–396.
[43] (2013) Temporal difference methods for the variance of the reward to go. In International Conference on Machine Learning, pp. 495–503.
[44] (2015) Optimizing the CVaR via sampling. In AAAI, pp. 2993–2999.
[45] (1976) The gambler’s ruin approach to business risk. Sloan Management Review (pre-1986) 18 (1), pp. 33.