1 Introduction
A standard optimization criterion for an infinite horizon Markov decision process (MDP) is the
expected sum of (discounted) costs, i.e., finding a policy that minimizes the value function of the initial state of the system. However, in many applications we may prefer to minimize some measure of risk in addition to this standard optimization criterion. In such cases, we would like to use a criterion that incorporates a penalty for the variability (due to the stochastic nature of the system) induced by a given policy. In risk-sensitive MDPs [18], the objective is to minimize a risk-sensitive criterion such as the expected exponential utility [18], a variance-related measure [32, 16], or the percentile performance [17]. The issue of how to construct such criteria in a manner that is both conceptually meaningful and mathematically tractable is still an open question.
Although most losses (returns) are not normally distributed, the typical Markowitz mean-variance optimization [22], which relies on the first two moments of the loss (return) distribution, has dominated risk management for decades. Numerous alternatives to mean-variance optimization have emerged in the literature, but there is no clear leader among these alternative risk-sensitive objective functions. Value-at-risk (VaR) and conditional value-at-risk (CVaR) are two promising such alternatives that quantify the losses that might be encountered in the tail of the loss distribution, and thus have received high status in risk management. For (continuous) loss distributions, while VaR measures risk as the maximum loss that might be incurred w.r.t. a given confidence level α, CVaR measures it as the expected loss given that the loss is greater than or equal to VaR. Although VaR is a popular risk measure, CVaR's computational advantages over VaR have boosted the development of CVaR optimization techniques. We provide the exact definitions of these two risk measures and briefly discuss some of VaR's shortcomings in Section 2. CVaR minimization was first developed by Rockafellar and Uryasev [29], and its numerical effectiveness was demonstrated in portfolio optimization and option hedging problems. Their work was then extended to objective functions consisting of different combinations of the expected loss and the CVaR, such as the minimization of the expected loss subject to a constraint on CVaR. This is the objective function that we study in this paper, although we believe that our proposed algorithms can be easily extended to several other CVaR-related objective functions. Boda and Filar [10] and Bäuerle and Ott [25, 4] extended the results of [29] to MDPs (sequential decision-making).
While the former proposed to use dynamic programming (DP) to optimize CVaR, an approach that is limited to small problems, the latter showed that in both finite and infinite horizon MDPs there exists a deterministic history-dependent optimal policy for CVaR optimization (see Section 3 for more details).
Most of the work in risk-sensitive sequential decision-making has been in the context of MDPs (when the model is known), and much less work has been done within the reinforcement learning (RL) framework. In risk-sensitive RL, we can mention the work by Borkar [11, 12], who considered the expected exponential utility, and those by Tamar et al. [34] and Prashanth and Ghavamzadeh [21] on several variance-related risk measures. CVaR optimization in RL is a rather novel subject. Morimura et al. [24] estimate the return distribution while exploring using a CVaR-based risk-sensitive policy; their algorithm does not scale to large problems. Petrik and Subramanian [27] propose a method based on stochastic dual DP to optimize CVaR in large-scale MDPs; however, their method is limited to linearly controllable problems. Borkar and Jain [15] consider a finite-horizon MDP with a CVaR constraint and sketch a stochastic approximation algorithm to solve it. Finally, Tamar et al. [35] have recently proposed a policy gradient algorithm for CVaR optimization.
In this paper, we develop policy gradient (PG) and actor-critic (AC) algorithms for mean-CVaR optimization in MDPs. We first derive a formula for computing the gradient of this risk-sensitive objective function. We then propose several methods to estimate this gradient both incrementally and using system trajectories (update at each time-step vs. update after observing one or more trajectories). We then use these gradient estimates to devise PG and AC algorithms that update the policy parameters in the descent direction. Using the ordinary differential equation (ODE) approach, we establish the asymptotic convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in an optimal stopping problem. In comparison to [35]: while they develop a PG algorithm for CVaR optimization in stochastic shortest path problems that considers only continuous loss distributions, uses a biased estimator for VaR, is not incremental, and has no convergence proof, here we study mean-CVaR optimization, consider both discrete and continuous loss distributions, devise both PG and (several) AC algorithms (trajectory-based and incremental; moreover, AC helps in reducing the variance of PG algorithms), and establish convergence proofs for our algorithms.
2 Preliminaries
We consider problems in which the agent's interaction with the environment is modeled as an MDP. An MDP is a tuple M = (X, A, C, P, P0), where X and A are the state and action spaces; C(x, a) is the bounded cost random variable whose expectation is denoted by c(x, a) = E[C(x, a)]; P(·|x, a) is the transition probability distribution; and P0(·) is the initial state distribution. For simplicity, we assume that the system has a single initial state x0, i.e., P0(x0) = 1. All the results of the paper can be easily extended to the case in which the system has more than one initial state. We also need to specify the rule according to which the agent selects actions at each state. A stationary policy μ(·|x) is a probability distribution over actions, conditioned on the current state. In policy gradient and actor-critic methods, we define a class of parameterized stochastic policies {μ(·|x; θ) : x ∈ X, θ ∈ Θ}, estimate the gradient of a performance measure w.r.t. the policy parameters θ from the observed system trajectories, and then improve the policy by adjusting its parameters in the direction of the gradient. Since in this setting a policy μ is represented by its finite-dimensional parameter vector θ, policy-dependent functions can be written as functions of θ in place of μ, and so we use μ and θ interchangeably in the paper. We denote by d^θ_γ(x|x0) and π^θ_γ(x, a|x0) the γ-discounted visiting distributions of state x and state-action pair (x, a) under policy μ, respectively.
Let Z be a bounded-mean random variable, i.e., E[|Z|] < ∞, with cumulative distribution function F(z) = P(Z ≤ z) (e.g., one may think of Z as the loss of an investment strategy). We define the value-at-risk of Z at confidence level α ∈ (0, 1) as VaR_α(Z) = min{z | F(z) ≥ α}. Here the minimum is attained because F is non-decreasing and right-continuous in z. When F is continuous and strictly increasing, VaR_α(Z) is the unique z satisfying F(z) = α; otherwise, the VaR equation can have no solution or a whole range of solutions. Although VaR is a popular risk measure, it is unstable and difficult to work with numerically when Z is not normally distributed, which is often the case, as loss distributions tend to exhibit fat tails or empirical discreteness. Moreover, VaR is not a coherent risk measure [2] and, more importantly, does not quantify the losses that might be suffered beyond its value at the tail of the distribution [28]. An alternative measure that addresses most of VaR's shortcomings is conditional value-at-risk, which is the mean of the tail distribution of Z. If there is no probability atom at VaR_α(Z), CVaR_α(Z) has the unique value CVaR_α(Z) = E[Z | Z ≥ VaR_α(Z)]. Rockafellar and Uryasev [29] showed that

CVaR_α(Z) = min_{ν ∈ R} H_α(Z, ν),  where  H_α(Z, ν) := ν + (1/(1−α)) E[(Z − ν)^+]  and  (z)^+ = max(z, 0).   (1)
Note that, as a function of ν, H_α(Z, ν) is finite-valued and convex (hence continuous).
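The equivalence in (1) is easy to verify numerically. The sketch below (the sample size, the normal stand-in distribution, and the grid are arbitrary illustrative choices) estimates VaR and CVaR from loss samples and checks that minimizing H_α over ν recovers CVaR, with the minimizer near VaR:

```python
import numpy as np

def var_cvar(losses, alpha):
    """Empirical VaR/CVaR at confidence level alpha from loss samples."""
    z = np.sort(losses)
    var = z[int(np.ceil(alpha * len(z))) - 1]   # smallest z with F(z) >= alpha
    cvar = z[z >= var].mean()                   # mean of the alpha-tail
    return var, cvar

def h_alpha(losses, nu, alpha):
    """Sampled Rockafellar-Uryasev objective H_alpha(Z, nu) = nu + E[(Z - nu)^+] / (1 - alpha)."""
    return nu + np.maximum(losses - nu, 0.0).mean() / (1.0 - alpha)

rng = np.random.default_rng(0)
losses = rng.normal(loc=1.0, scale=0.5, size=50_000)   # stand-in loss distribution
alpha = 0.95

var, cvar = var_cvar(losses, alpha)
grid = np.linspace(losses.min(), losses.max(), 401)
h_vals = np.array([h_alpha(losses, nu, alpha) for nu in grid])
# min over nu of H_alpha recovers CVaR, and the minimizer is (approximately) VaR
print(var, cvar, grid[h_vals.argmin()], h_vals.min())
```

For the normal stand-in above, the estimates land near the closed-form values VaR_0.95 ≈ 1.82 and CVaR_0.95 ≈ 2.03.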
3 CVaR Optimization in MDPs
For a policy μ, we define the loss of a state x (state-action pair (x, a)) as the sum of (discounted) costs encountered by the agent when it starts at state x (state-action pair (x, a)) and then follows policy μ, i.e., D^θ(x) = Σ_{k=0}^∞ γ^k C(x_k, a_k) with x_0 = x, and D^θ(x, a) defined analogously with x_0 = x and a_0 = a. The expected values of these two random variables are the value and action-value functions of policy μ, i.e., V^θ(x) = E[D^θ(x)] and Q^θ(x, a) = E[D^θ(x, a)]. The goal in the standard discounted formulation is to find an optimal policy θ* = argmin_θ V^θ(x0).
For CVaR optimization in MDPs, we consider the following optimization problem: for a given confidence level α ∈ (0, 1) and loss tolerance β ∈ R,

min_θ V^θ(x0)   subject to   CVaR_α( D^θ(x0) ) ≤ β.   (2)

By (1), this problem is equivalent to

min_{θ, ν} V^θ(x0)   subject to   H_α( D^θ(x0), ν ) ≤ β.   (3)
To solve (3), we employ the Lagrangian relaxation procedure [5] to convert it to the following unconstrained problem:

max_{λ ≥ 0} min_{θ, ν} L(ν, θ, λ) := V^θ(x0) + λ ( H_α( D^θ(x0), ν ) − β ),   (4)

where λ is the Lagrange multiplier. The goal here is to find the saddle point of L(ν, θ, λ), i.e., a point (ν*, θ*, λ*) that satisfies L(ν, θ, λ*) ≥ L(ν*, θ*, λ*) ≥ L(ν*, θ*, λ) for all (ν, θ) and all λ ≥ 0. This is achieved by descending in (ν, θ) and ascending in λ using the gradients of L w.r.t. θ, ν, and λ, i.e.,¹

¹The notation ∋ in (6) means that the right-most term is a member of the sub-gradient set ∂_ν L(ν, θ, λ).
∇_θ L(ν, θ, λ) = ∇_θ V^θ(x0) + (λ/(1−α)) ∇_θ E[ ( D^θ(x0) − ν )^+ ],   (5)

∂_ν L(ν, θ, λ) = λ ( 1 + (1/(1−α)) ∂_ν E[ ( D^θ(x0) − ν )^+ ] ) ∋ λ ( 1 − (1/(1−α)) P( D^θ(x0) ≥ ν ) ),   (6)

∇_λ L(ν, θ, λ) = ν + (1/(1−α)) E[ ( D^θ(x0) − ν )^+ ] − β.   (7)
We assume that there exists a policy μ(·|·; θ) such that CVaR_α(D^θ(x0)) ≤ β (feasibility assumption). As discussed in Section 1, Bäuerle and Ott [25, 4] showed that there exists a deterministic history-dependent optimal policy for CVaR optimization. The important point is that this policy does not depend on the complete history, but only on the current time step k, the current state of the system x_k, and the accumulated discounted cost.
In the following, we present a policy gradient (PG) algorithm (Sec. 4) and several actor-critic (AC) algorithms (Sec. 5) to optimize (4). While the PG algorithm updates its parameters after observing several trajectories, the AC algorithms are incremental and update their parameters at each time-step.
4 A Trajectory-based Policy Gradient Algorithm
In this section, we present a policy gradient algorithm to solve the optimization problem (4). The unit of observation in this algorithm is a system trajectory generated by following the current policy. At each iteration, the algorithm generates N trajectories by following the current policy, uses them to estimate the gradients in (5)–(7), and then uses these estimates to update the parameters ν, θ, λ.
Let ξ = {x_0, a_0, c_0, x_1, a_1, c_1, …, x_T} be a trajectory generated by following the policy θ, where x_0 is the initial state and x_T is usually a terminal state of the system. After ξ visits the terminal state, it enters a recurring sink state x_S at the next time step, incurring zero cost, i.e., C(x_S, a) = 0 for all a ∈ A. The time index T is referred to as the stopping time of the MDP. Since the transitions are stochastic, T is a non-deterministic quantity. Here we assume that the policy μ is proper, i.e., the terminal state is reached with probability 1 from every state x ∈ X. This further means that, with probability 1, the MDP exits its transient states and hits the terminal state (and then stays in x_S) in finite time T. For simplicity, we assume that the agent incurs zero cost in the terminal state; analogous results for the general case with a non-zero terminal cost can be derived using identical arguments. The loss and probability of ξ are defined as D(ξ) = Σ_{k=0}^{T−1} γ^k c(x_k, a_k) and P_θ(ξ) = P_0(x_0) Π_{k=0}^{T−1} μ(a_k|x_k; θ) P(x_{k+1}|x_k, a_k), respectively. It can be easily shown that ∇_θ log P_θ(ξ) = Σ_{k=0}^{T−1} ∇_θ log μ(a_k|x_k; θ).
Algorithm 1 contains the pseudo-code of our proposed policy gradient algorithm. What appears inside the parentheses on the right-hand side of the update equations are the estimates of the gradients of L w.r.t. θ, ν, λ (estimates of (5)–(7)); see Appendix A.2. Γ_Θ is an operator that projects a vector θ to the closest point in a compact and convex set Θ, and Γ_N and Γ_Λ are projection operators onto compact intervals for ν and λ, respectively. These projection operators are necessary to ensure the convergence of the algorithm. The step-size schedules satisfy the standard conditions for stochastic approximation algorithms and ensure that the VaR parameter update is on the fastest time-scale {ζ_3(i)}, the policy parameter update is on the intermediate time-scale {ζ_2(i)}, and the Lagrange multiplier update is on the slowest time-scale {ζ_1(i)} (see Appendix A.1 for the conditions on the step-size schedules). This results in a three time-scale stochastic approximation algorithm. We prove that our policy gradient algorithm converges to a (local) saddle point of the risk-sensitive objective function (see Appendix A.3).
ν Update:  ν_{i+1} = Γ_N [ ν_i − ζ_3(i) ( λ_i − (λ_i / ((1−α) N)) Σ_{j=1}^N 1{ D(ξ_{j,i}) ≥ ν_i } ) ]

θ Update:  θ_{i+1} = Γ_Θ [ θ_i − ζ_2(i) ( (1/N) Σ_{j=1}^N ∇_θ log P_θ(ξ_{j,i}) D(ξ_{j,i}) + (λ_i / ((1−α) N)) Σ_{j=1}^N ∇_θ log P_θ(ξ_{j,i}) ( D(ξ_{j,i}) − ν_i ) 1{ D(ξ_{j,i}) ≥ ν_i } ) ]

λ Update:  λ_{i+1} = Γ_Λ [ λ_i + ζ_1(i) ( ν_i − β + (1 / ((1−α) N)) Σ_{j=1}^N ( D(ξ_{j,i}) − ν_i ) 1{ D(ξ_{j,i}) ≥ ν_i } ) ]
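To make the mechanics of these coupled three time-scale updates concrete, here is a toy instantiation on a hypothetical one-step MDP with two actions: one cheap on average but heavy-tailed, the other slightly costlier but nearly deterministic. All constants, projection bounds, and step-size exponents are illustrative choices, not prescribed by the algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 0.95, 1.5              # confidence level and loss tolerance
lam_max, nu_max = 100.0, 5.0         # projection bounds (illustrative)

def sample_cost(a):
    # Hypothetical one-step MDP: action 0 is slightly cheaper on average but
    # heavy-tailed; action 1 costs a bit more but is nearly deterministic.
    return rng.normal(1.0, 2.0) if a == 0 else rng.normal(1.1, 0.1)

def policy(theta):
    e = np.exp(theta - theta.max())  # softmax policy over the two actions
    return e / e.sum()

theta, nu, lam = np.zeros(2), 1.0, 1.0
N = 100                              # trajectories per iteration
for i in range(1, 3001):
    z1, z2, z3 = 0.2 / i, 0.5 / i**0.8, 2.0 / i**0.6   # slow / mid / fast step-sizes
    mu = policy(theta)
    acts = rng.choice(2, size=N, p=mu)
    D = np.array([sample_cost(a) for a in acts])        # per-trajectory losses
    tail = (D >= nu).astype(float)                      # indicator of the VaR tail
    score = np.eye(2)[acts] - mu                        # grad_theta log mu(a)
    grad_theta = (score * (D + lam / (1 - alpha) * (D - nu) * tail)[:, None]).mean(0)
    grad_nu = lam * (1.0 - tail.mean() / (1 - alpha))
    grad_lam = nu - beta + ((D - nu) * tail).mean() / (1 - alpha)
    nu = np.clip(nu - z3 * grad_nu, 0.0, nu_max)            # descend in nu (projected)
    theta = np.clip(theta - z2 * grad_theta, -10.0, 10.0)   # descend in theta (projected)
    lam = np.clip(lam + z1 * grad_lam, 0.0, lam_max)        # ascend in lambda (projected)
print(policy(theta), nu, lam)
```

Since the heavy-tailed action violates the CVaR constraint, we would expect the learned policy to shift mass toward the safe action even though its expected cost is slightly higher.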
5 Incremental Actor-Critic Algorithms
As mentioned in Section 4, the unit of observation in our policy gradient algorithm (Algorithm 1) is a system trajectory. This may result in high variance for the gradient estimates, especially when the trajectories are long. To address this issue, in this section we propose actor-critic algorithms that use linear approximation for some quantities in the gradient estimates and update their parameters incrementally (after each state-action transition). To develop our actor-critic algorithms, we must show how the gradients in (5)–(7) can be estimated in an incremental fashion. We show this in the next four subsections, followed by a subsection that contains the algorithms.
5.1 Gradient w.r.t. the Policy Parameters
The gradient of our objective function w.r.t. the policy parameters θ in (5) may be rewritten as

∇_θ L(ν, θ, λ) = ∇_θ ( E[ D^θ(x0) + (λ/(1−α)) ( D^θ(x0) − ν )^+ ] ).   (8)
Given the original MDP M = (X, A, C, P, P0) and the parameter ν, we define the augmented MDP M̄ = (X̄, Ā, C̄, P̄, P̄0) as X̄ = X × R, Ā = A, P̄0(x, s) = P0(x) 1{s = ν}, and

C̄((x, s), a) = λ (−s)^+ / (1−α)  if x = x_Tar,  and  C(x, a)  otherwise;
P̄((x', s') | (x, s), a) = P(x'|x, a) 1{ s' = (s − C(x, a)) / γ },

where x_Tar is any terminal state of the original MDP and s_T is the value of the s part of the state when a policy θ reaches a terminal state after T steps, i.e., s_T = ( ν − Σ_{k=0}^{T−1} γ^k c(x_k, a_k) ) / γ^T. We define a class of parameterized stochastic policies {μ(·|x, s; θ) : (x, s) ∈ X̄, θ ∈ Θ} for this augmented MDP. Thus, the total (discounted) loss of a trajectory ξ can be written as

Σ_{k=0}^{T−1} γ^k C(x_k, a_k) + γ^T C̄((x_T, s_T), a) = D(ξ) + (λ/(1−α)) ( D(ξ) − ν )^+.   (9)
From (9), it is clear that the quantity in the parenthesis of (8) is the value function of the policy θ at state (x0, ν) in the augmented MDP M̄, i.e., V^θ(x0, ν). Thus, it is easy to show that (the proof of the second equality can be found in the literature, e.g., [26])

∇_θ L(ν, θ, λ) = ∇_θ V^θ(x0, ν) = Σ_{x, s, a} π^θ_γ(x, s, a | x0, ν) ∇_θ log μ(a | x, s; θ) Q^θ(x, s, a),   (10)
where π^θ_γ is the discounted visiting distribution (defined in Section 2) and Q^θ is the action-value function of policy θ in the augmented MDP M̄. We can show that δ_k ∇_θ log μ(a_k | x_k, s_k; θ) is an unbiased estimate of ∇_θ L(ν, θ, λ), where

δ_k = C̄((x_k, s_k), a_k) + γ V̂(x_{k+1}, s_{k+1}) − V̂(x_k, s_k)

is the temporal-difference (TD) error in M̄, and V̂ is an unbiased estimator of V^θ (see e.g. [8]). In our actor-critic algorithms, the critic uses a linear approximation for the value function, V^θ(x, s) ≈ v⊤φ(x, s), where the feature vector φ(x, s) is from a low-dimensional space.
5.2 Gradient w.r.t. the Lagrangian Parameter
We may rewrite the gradient of our objective function w.r.t. the Lagrangian parameter λ in (7) as

∇_λ L(ν, θ, λ) = ν − β + (1/(1−α)) E[ ( D^θ(x0) − ν )^+ ]  (a)=  ν − β + ∇_λ V^θ(x0, ν).   (11)

Similar to Section 5.1, equality (a) comes from the fact that the expectation in (11) is, up to the factor λ, the tail part of V^θ(x0, ν), the value function of the policy θ at state (x0, ν) in the augmented MDP M̄. Note that the dependence of V^θ(x0, ν) on λ comes from the definition of the cost function C̄ in M̄. We now derive an expression for ∇_λ V^θ, which in turn gives us an expression for ∇_λ L.

Lemma 1. The gradient of V^θ(x0, ν) w.r.t. the Lagrangian parameter λ may be written as

∇_λ V^θ(x0, ν) = (1/(1−α)) E[ γ^T (−s_T)^+ | x_0 = x0, s_0 = ν; θ ].   (12)

Proof. See Appendix B.2.
From the lemma above and (11), it is easy to see that ν − β + (1/(1−α)) γ^T (−s_T)^+ is an unbiased estimate of ∇_λ L(ν, θ, λ). An issue with this estimator is that its value is fixed to ν − β all along a system trajectory and only changes at the end to ν − β + (1/(1−α)) γ^T (−s_T)^+. This may affect the incremental nature of our actor-critic algorithm. To address this issue, we propose a different approach to estimate the gradients w.r.t. θ and λ in Sec. 5.4 (of course, this does not come for free).
Another important issue is that the above estimator is unbiased only if the samples are generated from the discounted visiting distribution π^θ_γ(·|x0, ν). If we just follow the policy, then we may use a γ^k-discounted version of this estimate (see (20) and (22) in Algorithm 2). Note that this issue arises for all discounted actor-critic algorithms whose (likelihood-ratio-based) gradient estimate is unbiased only when the samples are generated from d^θ_γ, and not when we simply follow the policy. Although this issue was known in the community, a recent paper investigates it in detail [36]. Moreover, this might be a main reason that we have no convergence analysis (to the best of our knowledge) for (likelihood-ratio-based) discounted actor-critic algorithms.²

²Note that the discounted actor-critic algorithm with convergence proof in [6] is based on SPSA.
5.3 Sub-Gradient w.r.t. the VaR Parameter
We may rewrite the sub-gradient of our objective function w.r.t. the VaR parameter ν in (6) as

∂_ν L(ν, θ, λ) ∋ λ ( 1 − (1/(1−α)) P( D^θ(x0) ≥ ν ) ).   (13)

From the definition of the augmented MDP M̄, the probability in (13) may be written as P(s_T ≤ 0 | x_0 = x0, s_0 = ν; θ), where s_T is the s part of the state in M̄ when we reach a terminal state, i.e., s_T = ( ν − Σ_{k=0}^{T−1} γ^k c(x_k, a_k) ) / γ^T (see Section 5.1). Thus, we may rewrite (13) as

∂_ν L(ν, θ, λ) ∋ λ ( 1 − (1/(1−α)) P( s_T ≤ 0 | x_0 = x0, s_0 = ν; θ ) ).   (14)

From (14), it is easy to see that λ − (λ/(1−α)) 1{s_T ≤ 0} is an unbiased estimate of the sub-gradient of L w.r.t. ν. An issue with this (unbiased) estimator is that it can only be applied at the end of a system trajectory (i.e., when we reach the terminal state x_Tar), and thus using it prevents us from having a fully incremental algorithm. In fact, this is the estimator that we use in our semi trajectory-based actor-critic algorithm (see (21) in Algorithm 2).
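The reduction from the tail term (D(ξ) − ν)^+ to the augmented state s_T used above can be checked numerically: unrolling s_{k+1} = (s_k − c_k)/γ from s_0 = ν gives s_T = (ν − D(ξ))/γ^T, so γ^T (−s_T)^+ = (D(ξ) − ν)^+, and s_T ≤ 0 iff D(ξ) ≥ ν. A minimal sketch (the per-step costs are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, nu = 0.95, 2.0
costs = rng.uniform(0.0, 1.0, size=7)        # made-up per-step costs of one trajectory

# Augmented-state recursion: s_{k+1} = (s_k - c_k) / gamma, starting from s_0 = nu
s = nu
for c in costs:
    s = (s - c) / gamma

D = sum(gamma**k * c for k, c in enumerate(costs))   # discounted trajectory loss
T = len(costs)
# Key identity behind the augmented MDP: gamma^T * (-s_T)^+ equals (D - nu)^+
print(gamma**T * max(-s, 0.0), max(D - nu, 0.0))
```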
One approach to estimating this sub-gradient incrementally, and hence obtaining a fully incremental algorithm, is the simultaneous perturbation stochastic approximation (SPSA) method [9]. The idea of SPSA is to estimate the sub-gradient using two values of L at ν− = ν − Δ and ν+ = ν + Δ, where Δ > 0 is a positive perturbation (see Sec. 5.5 for the detailed description of Δ).³ In order to see how SPSA can help us estimate our sub-gradient incrementally, note that

∂_ν L(ν, θ, λ) = λ + ∂_ν E[ Σ_{k=0}^{T} γ^k C̄((x_k, s_k), a_k) | x_0 = x0, s_0 = ν; θ ]  (a)=  λ + ∂_ν V^θ(x0, ν).   (15)

Similar to Sections 5.1 and 5.2, equality (a) comes from the fact that the quantity inside the expectation in (15) is the total loss of the augmented MDP M̄, whose expectation is V^θ(x0, ν), the value function of the policy θ at state (x0, ν). Since the critic uses a linear approximation for the value function, i.e., V^θ(x, s) ≈ v⊤φ(x, s), in our actor-critic algorithms (see Section 5.1 and Algorithm 2), the SPSA estimate of the sub-gradient takes the form λ + v⊤[ φ(x0, ν + Δ) − φ(x0, ν − Δ) ] / (2Δ) (see (18) in Algorithm 2).

³The SPSA-based gradient estimate was first proposed in [33] and has been widely used in various settings, especially those involving a high-dimensional parameter vector. The SPSA estimate described above is two-sided; it can also be implemented single-sided, using the values of the function at ν and ν+. We refer the reader to [9] for more details on SPSA and to [21] for its application to learning in risk-sensitive MDPs.
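As a sanity check of the SPSA idea, the sketch below estimates the ν-(sub)gradient of the sampled objective H_α by two function evaluations and compares it with the closed form 1 − P(Z ≥ ν)/(1 − α) from (6), taking λ = 1. The stand-in loss distribution and the perturbation Δ = 0.05 are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.95
losses = rng.normal(1.0, 0.5, size=100_000)   # stand-in loss samples (illustrative)

def h_alpha(nu):
    # Sampled Rockafellar-Uryasev objective H_alpha(Z, nu)
    return nu + np.maximum(losses - nu, 0.0).mean() / (1.0 - alpha)

def spsa_subgrad(nu, delta=0.05):
    # Two-sided SPSA estimate of the nu-(sub)gradient from two evaluations of H
    return (h_alpha(nu + delta) - h_alpha(nu - delta)) / (2.0 * delta)

nu = 1.5
g = spsa_subgrad(nu)
exact = 1.0 - (losses >= nu).mean() / (1.0 - alpha)   # closed form with lambda = 1
print(g, exact)
```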
5.4 An Alternative Approach to Compute the Gradients
In this section, we present an alternative way to compute the gradients, especially those w.r.t. θ and λ. This allows us to estimate the gradient w.r.t. λ in a (more) incremental fashion (compared to the method of Section 5.2), at the cost of requiring two different linear function approximators (instead of the one used in Algorithm 2). In this approach, we define the augmented MDP slightly differently than in Section 5.1. The only difference is in the definition of the cost function, which is defined here as (note that the per-step cost C(x, a) has been replaced by zero and the factor λ has been removed)

C̄((x, s), a) = (−s)^+ / (1−α)  if x = x_Tar,  and  0  otherwise,

where x_Tar is any terminal state of the original MDP. It is easy to see that the term (1/(1−α)) E[ ( D^θ(x0) − ν )^+ ] appearing in the gradients (5)–(7) is the value function of the policy θ at state (x0, ν) in this augmented MDP. As a result, we have:
Gradient w.r.t. θ: It is easy to see that this gradient (5) is the gradient of the value function of the original MDP, ∇_θ V^θ(x0), plus λ times the gradient of the value function of the augmented MDP, ∇_θ V^θ(x0, ν), both at the initial states of these MDPs (with abuse of notation, we use V for the value function of both MDPs). Thus, using linear approximators u⊤f(x) and v⊤φ(x, s) for the value functions of the original and augmented MDPs, ∇_θ L(ν, θ, λ) can be estimated as (ε_k + λ δ_k) ∇_θ log μ(a_k | x_k, s_k; θ), where ε_k and δ_k are the TD errors of these two MDPs.
Gradient w.r.t. λ: Similar to the case for θ, it is easy to see that this gradient (7) is ν − β plus the value function of the augmented MDP, V^θ(x0, ν), and thus can be estimated incrementally as ν − β + v⊤φ(x_k, s_k).
Sub-Gradient w.r.t. ν: This sub-gradient (6) is λ times one plus the gradient w.r.t. ν of the value function of the augmented MDP, ∂_ν V^θ(x0, ν), and thus, using SPSA, can be estimated incrementally as λ ( 1 + v⊤[ φ(x0, ν + Δ) − φ(x0, ν − Δ) ] / (2Δ) ).
5.5 Actor-Critic Algorithms
In this section, we present two actor-critic algorithms for optimizing the risk-sensitive measure (4). These algorithms are based on the gradient estimates of Sections 5.1–5.3. While the first algorithm (SPSA-based) is fully incremental and updates all the parameters θ, ν, λ at each time-step, the second one updates θ at each time-step but updates ν and λ only at the end of each trajectory, hence the name semi trajectory-based. Algorithm 2 contains the pseudo-code of these algorithms. The projection operators Γ_Θ, Γ_N, and Γ_Λ are defined as in Section 4 and are necessary to ensure the convergence of the algorithms. The step-size schedules satisfy the standard conditions for stochastic approximation algorithms and ensure that the critic update is on the fastest time-scale {ζ_4(k)}, the policy and VaR parameter updates are on the intermediate time-scales, with the ν-update {ζ_3(k)} being faster than the θ-update {ζ_2(k)}, and finally the Lagrange multiplier update is on the slowest time-scale {ζ_1(k)} (see Appendix B.1 for the conditions on these step-size schedules). This results in four time-scale stochastic approximation algorithms. We prove that these actor-critic algorithms converge to a (local) saddle point of the risk-sensitive objective function (see Appendix B.4).
TD Error:  δ_k = C̄((x_k, s_k), a_k) + γ v_k⊤ φ(x_{k+1}, s_{k+1}) − v_k⊤ φ(x_k, s_k)   (16)

Critic Update:  v_{k+1} = v_k + ζ_4(k) δ_k φ(x_k, s_k)   (17)

Actor Updates:  ν_{k+1} = Γ_N ( ν_k − ζ_3(k) ( λ_k + v_k⊤ [ φ(x0, ν_k + Δ_k) − φ(x0, ν_k − Δ_k) ] / (2 Δ_k) ) )   (18)

θ_{k+1} = Γ_Θ ( θ_k − ζ_2(k) γ^k δ_k ∇_θ log μ(a_k | x_k, s_k; θ_k) )   (19)

λ_{k+1} = Γ_Λ ( λ_k + ζ_1(k) ( ν_k − β + (γ^{k+1} / (1−α)) 1{ x_{k+1} = x_Tar } (−s_{k+1})^+ ) )   (20)

Semi trajectory-based updates (applied when x_{k+1} = x_Tar):

ν Update:  ν_{k+1} = Γ_N ( ν_k − ζ_3(k) λ_k ( 1 − (1/(1−α)) 1{ s_{k+1} ≤ 0 } ) )   (21)

λ Update:  λ_{k+1} = Γ_Λ ( λ_k + ζ_1(k) ( ν_k − β + (γ^{k+1} / (1−α)) (−s_{k+1})^+ ) )   (22)
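The critic recursion in (16)–(17) is standard linear TD(0). As a minimal, self-contained sanity check, here is a tabular instance with one-hot features on a deterministic two-step chain (states, costs, and step-sizes are illustrative; the augmented s-component is omitted for brevity):

```python
import numpy as np

# Deterministic chain: x0 -> x1 -> terminal, costs 1 and 2, discount 0.9.
# With one-hot features, v converges to the true values V = [1 + 0.9*2, 2] = [2.8, 2.0].
gamma = 0.9
costs = [1.0, 2.0]
phi = np.eye(2)          # one-hot feature vectors phi(x)
v = np.zeros(2)          # critic weights
for k in range(1, 2001):
    step = 1.0 / k**0.7  # square-summable-but-not-summable step-size
    for x in (0, 1):
        nxt = v[1] if x == 0 else 0.0          # successor value (terminal has value 0)
        delta = costs[x] + gamma * nxt - v[x]  # TD error, as in (16)
        v = v + step * delta * phi[x]          # critic update, as in (17)
print(v)
```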
6 Experimental Results
We consider an optimal stopping problem in which the state at each time step k ≤ T consists of the current cost c_k and the time k, i.e., x = (c_k, k), where T is the stopping time. The agent (buyer) should decide either to accept the present cost or wait. If she accepts, or when k = T, the system reaches a terminal state and the cost c_k is received; otherwise, she receives a holding cost p_h and the new state is (c_{k+1}, k + 1), where c_{k+1} is f_u c_k w.p. p and f_d c_k w.p. 1 − p (f_u > 1 and f_d < 1 are constants). Moreover, there is a discount factor γ ∈ (0, 1) to account for the increase in the buyer's affordability. The problem is described in more detail in Appendix C. Note that if we change cost to reward and minimization to maximization, this is exactly the American option pricing problem, a standard testbed for evaluating risk-sensitive algorithms (e.g., [34]). Since the state space is continuous, solving for an exact solution via DP is infeasible, and thus approximation and sampling techniques are required.
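A minimal simulator for this optimal stopping problem, under a simple threshold policy, might look as follows (all numeric parameters are illustrative assumptions, not the values used in Appendix C):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical parameters for the optimal stopping problem described above
f_u, f_d, p = 2.0, 0.5, 0.65   # up/down cost factors and up-move probability
cost0, p_h = 1.0, 0.1          # initial cost and per-step holding cost
gamma, T = 0.95, 20            # discount factor and horizon

def rollout(accept_threshold):
    """Roll out a threshold policy: accept as soon as the cost drops below a threshold."""
    c, total, k = cost0, 0.0, 0
    while k < T:
        if c <= accept_threshold:          # buyer accepts the present cost
            return total + gamma**k * c
        total += gamma**k * p_h            # buyer waits and pays the holding cost
        c = f_u * c if rng.random() < p else f_d * c
        k += 1
    return total + gamma**T * c            # forced to accept at the horizon

losses = np.array([rollout(0.8) for _ in range(20_000)])
alpha = 0.95
var = np.sort(losses)[int(np.ceil(alpha * len(losses))) - 1]
cvar = losses[losses >= var].mean()
print(losses.mean(), var, cvar)
```

Evaluating the empirical mean, VaR, and CVaR of the loss under different policies is exactly the comparison reported in Fig. 1 and Table 1.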
We compare the performance of our risk-sensitive policy gradient algorithm (Alg. 1, PG-CVaR) and our two actor-critic algorithms (Alg. 2, AC-CVaR-SPSA and AC-CVaR-SemiTraj) with their risk-neutral counterparts (PG and AC); see Appendix C for the details of these experiments. Fig. 1 shows the distribution of the discounted cumulative cost for the policy learned by each of these algorithms. From left to right, the columns display the first two moments, the whole distribution, and a zoom on the right tail of these distributions. The results indicate that the risk-sensitive algorithms yield a higher expected loss but lower variance compared to the risk-neutral methods. More precisely, the loss distributions of the risk-sensitive algorithms have a thinner right tail than their risk-neutral counterparts. Table 1 summarizes the performance of these algorithms; the numbers reiterate what we concluded from Fig. 1.
Algorithm  |  E(D)  |  σ(D)  |  CVaR(D)  |  P(D ≥ β)
PG  |  0.8780  |  0.2647  |  2.0855  |  0.058
PG-CVaR  |  1.1128  |  0.1109  |  1.7620  |  0.012
AC  |  1.1963  |  0.6399  |  2.6479  |  0.029
AC-CVaR-SPSA  |  1.2031  |  0.2942  |  2.3865  |  0.031
AC-CVaR-SemiTraj  |  1.2169  |  0.3747  |  2.3889  |  0.026
7 Conclusions and Future Work
We proposed novel policy gradient and actor-critic (AC) algorithms for CVaR optimization in MDPs. We provided proofs of convergence (in the appendix) to locally risk-sensitive optimal policies for the proposed algorithms. Further, using an optimal stopping problem, we observed that our algorithms yield policies whose loss distributions have a thinner right tail than their risk-neutral counterparts. This is extremely important for a risk-averse decision-maker, especially if the right tail contains catastrophic losses. Future work includes: 1) providing convergence proofs for our AC algorithms when the samples are generated by following the policy rather than its discounted visiting distribution (sampling from the latter can be wasteful in terms of samples); 2) obtaining finite-time bounds on the quality of the solutions found by these algorithms; here we established only asymptotic limits, and to the best of our knowledge there are no convergence-rate results for multi-timescale stochastic approximation schemes, and hence for AC algorithms, even those that do not incorporate any risk criterion; 3) generating more samples in the right tail of the loss distribution (events observed with very low probability) in order to compute more accurate gradient estimates, since the interesting losses in CVaR optimization are those that exceed the VaR; although importance sampling methods have been used to address this problem [3, 35], several issues, particularly the choice of the sampling distribution, remain unsolved and need to be investigated; and 4) evaluating our algorithms on more challenging problems.
References

Altman et al. [2004] E. Altman, K. Avrachenkov, and R. Núñez-Queija. Perturbation analysis for denumerable Markov chains with application to queueing models. Advances in Applied Probability, pages 839–853, 2004.
 Artzner et al. [1999] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.
 Bardou et al. [2009] O. Bardou, N. Frikha, and G. Pagès. Computing VaR and CVaR using stochastic approximation and adaptive unconstrained importance sampling. Monte Carlo Methods and Applications, 15(3):173–210, 2009.
 Bäuerle and Ott [2011] N. Bäuerle and J. Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
 Bertsekas [1999] D. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
 Bhatnagar [2010] S. Bhatnagar. An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes. Systems & Control Letters, 59(12):760–766, 2010.
 Bhatnagar and Lakshmanan [2012] S. Bhatnagar and K. Lakshmanan. An online actor-critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, pages 1–21, 2012.
 Bhatnagar et al. [2009] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
 Bhatnagar et al. [2013] S. Bhatnagar, H. Prasad, and L.A. Prashanth. Stochastic Recursive Algorithms for Optimization, volume 434. Springer, 2013.
 Boda and Filar [2006] K. Boda and J. Filar. Time consistent dynamic risk measures. Mathematical Methods of Operations Research, 63(1):169–186, 2006.
 Borkar [2001] V. Borkar. A sensitivity formula for the risk-sensitive cost and the actor-critic algorithm. Systems & Control Letters, 44:339–346, 2001.
 Borkar [2002] V. Borkar. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27:294–311, 2002.
 Borkar [2005] V. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.
 Borkar [2008] V. Borkar. Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press, 2008.
 Borkar and Jain [2014] V. Borkar and R. Jain. Risk-constrained Markov decision processes. IEEE Transactions on Automatic Control, 2014.
 Filar et al. [1989] J. Filar, L. Kallenberg, and H. Lee. Variancepenalized Markov decision processes. Mathematics of Operations Research, 14(1):147–161, 1989.
 Filar et al. [1995] J. Filar, D. Krass, and K. Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10, 1995.
 Howard and Matheson [1972] R. Howard and J. Matheson. Risk sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.
 Khalil and Grizzle [2002] H. Khalil and J. Grizzle. Nonlinear Systems, volume 3. Prentice Hall, 2002.
 Kushner and Yin [1997] Harold J Kushner and G George Yin. Stochastic approximation algorithms and applications. Springer, 1997.
 L.A. and Ghavamzadeh [2013] Prashanth L.A. and M. Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Proceedings of Advances in Neural Information Processing Systems 26, pages 252–260, 2013.
 Markowitz [1959] H. Markowitz. Portfolio Selection: Efficient Diversification of Investment. John Wiley and Sons, 1959.
 Milgrom and Segal [2002] Paul Milgrom and Ilya Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.

Morimura et al. [2010] T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning, pages 799–806, 2010.
 Ott [2010] J. Ott. A Markov Decision Model for a Surveillance Application and Risk-Sensitive Markov Decision Processes. PhD thesis, Karlsruhe Institute of Technology, 2010.
 Peters et al. [2005] J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In Proceedings of the Sixteenth European Conference on Machine Learning, pages 280–291, 2005.

Petrik and Subramanian [2012] M. Petrik and D. Subramanian. An approximate solution method for large risk-averse Markov decision processes. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, 2012.
 Rockafellar and Uryasev [2000] R. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–41, 2000.
 Rockafellar and Uryasev [2002] R. Rockafellar and S. Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking and Finance, 26:1443–1471, 2002.
 Ryan [1998] EP Ryan. An integral invariance principle for differential inclusions with applications in adaptive control. SIAM Journal on Control and Optimization, 36(3):960–980, 1998.
 Shardlow and Stuart [2000] T. Shardlow and A. Stuart. A perturbation theory for ergodic Markov chains and application to numerical approximations. SIAM Journal on Numerical Analysis, 37(4):1120–1137, 2000.
 Sobel [1982] M. Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19:794–802, 1982.
 Spall [1992] J. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.
 Tamar et al. [2012] A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, pages 387–396, 2012.
 Tamar et al. [2014] A. Tamar, Y. Glassner, and S. Mannor. Policy gradients beyond expectations: Conditional value-at-risk. arXiv:1404.3862v1, 2014.
 Thomas [2014] P. Thomas. Bias in natural actor-critic algorithms. In Proceedings of the Thirty-First International Conference on Machine Learning, 2014.
Appendix A Technical Details of the Trajectory-based Policy Gradient Algorithm
A.1 Assumptions
We make the following assumptions:

(A1) For any state-action pair (x, a), μ(a|x; θ) is continuously differentiable in θ, and ∇_θ μ(a|x; θ) is a Lipschitz function in θ for every a ∈ A and x ∈ X.

(A2) The Markov chain induced by any policy θ is irreducible and aperiodic.

(A3) The step-size schedules {ζ_1(i)}, {ζ_2(i)}, and {ζ_3(i)} satisfy

Σ_i ζ_1(i) = Σ_i ζ_2(i) = Σ_i ζ_3(i) = ∞,   (23)

Σ_i ζ_1(i)², Σ_i ζ_2(i)², Σ_i ζ_3(i)² < ∞,   (24)

ζ_1(i) = o(ζ_2(i)),   ζ_2(i) = o(ζ_3(i)).   (25)
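For instance, polynomially decaying schedules with suitable exponents satisfy (23)–(25); the concrete exponents below are illustrative assumptions:

```python
import numpy as np

# Illustrative three time-scale schedules: each is square-summable but not
# summable, with zeta1 = o(zeta2) and zeta2 = o(zeta3), as required by (A3).
i = np.arange(1, 100_001, dtype=float)
zeta1 = 1.0 / i           # slowest time-scale: lambda update
zeta2 = 1.0 / i**0.75     # intermediate time-scale: theta update
zeta3 = 1.0 / i**0.55     # fastest time-scale: nu update

# Time-scale separation: the ratio of a slower to a faster step-size vanishes.
print(zeta1[-1] / zeta2[-1], zeta2[-1] / zeta3[-1])
```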
A.2 Computing the Gradients

i) ∇_θ L(ν, θ, λ): gradient of L w.r.t. θ

By expanding the expectations in the definition of the objective function L(ν, θ, λ) in (4), we obtain

L(ν, θ, λ) = Σ_ξ P_θ(ξ) D(ξ) + λ ( ν + (1/(1−α)) Σ_ξ P_θ(ξ) ( D(ξ) − ν )^+ − β ).

By taking the gradient with respect to θ, we have

∇_θ L(ν, θ, λ) = Σ_ξ ∇_θ P_θ(ξ) D(ξ) + (λ/(1−α)) Σ_ξ ∇_θ P_θ(ξ) ( D(ξ) − ν )^+.

This gradient can be rewritten as

∇_θ L(ν, θ, λ) = Σ_ξ P_θ(ξ) ∇_θ log P_θ(ξ) [ D(ξ) + (λ/(1−α)) ( D(ξ) − ν ) 1{ D(ξ) ≥ ν } ],   (26)

where

∇_θ log P_θ(ξ) = Σ_{k=0}^{T−1} ∇_θ log μ(a_k | x_k; θ).
ii) ∂_ν L(ν, θ, λ): sub-differential of L w.r.t. ν

From the definition of L(ν, θ, λ), we can easily see that L is a convex function in ν for any fixed θ ∈ Θ. Note that for every fixed ν and any ν', we have

( D − ν' )^+ − ( D − ν )^+ ≥ g · ( ν' − ν ),

where g is any element in the set of sub-derivatives

∂_ν ( D − ν )^+ := { −1  if ν < D;  −q, q ∈ [0, 1]  if ν = D;  0  otherwise }.

Since L is finite-valued for any ν ∈ R, by the additive rule of sub-derivatives, we have

∂_ν L(ν, θ, λ) = λ + (λ/(1−α)) Σ_ξ P_θ(ξ) ∂_ν ( D(ξ) − ν )^+.   (27)

In particular, choosing the sub-derivative −1{ D(ξ) ≥ ν } for each trajectory ξ, we may write the sub-gradient of L w.r.t. ν as λ ( 1 − (1/(1−α)) P( D^θ(x0) ≥ ν ) ) ∈ ∂_ν L(ν, θ, λ).