A standard optimization criterion for an infinite-horizon Markov decision process (MDP) is the expected sum of (discounted) costs, i.e., finding a policy that minimizes the value function of the initial state of the system. However, in many applications we may prefer to minimize some measure of risk in addition to this standard optimization criterion. In such cases, we would like to use a criterion that incorporates a penalty for the variability (due to the stochastic nature of the system) induced by a given policy. In risk-sensitive MDPs, the objective is to minimize a risk-sensitive criterion such as the expected exponential utility, a variance-related measure [32, 16], or the percentile performance. The issue of how to construct such criteria in a manner that is both conceptually meaningful and mathematically tractable is still an open question.
Although most losses (returns) are not normally distributed, the classical Markowitz mean-variance optimization, which relies on the first two moments of the loss (return) distribution, has dominated risk management for many years. Numerous alternatives to mean-variance optimization have emerged in the literature, but there is no clear leader among these alternative risk-sensitive objective functions. Value-at-risk (VaR) and conditional value-at-risk (CVaR) are two promising such alternatives that quantify the losses that might be encountered in the tail of the loss distribution, and thus have gained prominence in risk management. For (continuous) loss distributions, VaR measures risk as the maximum loss that might be incurred w.r.t. a given confidence level, while CVaR measures it as the expected loss given that the loss is greater than or equal to VaR. Although VaR is a popular risk measure, CVaR's computational advantages over VaR have boosted the development of CVaR optimization techniques. We provide the exact definitions of these two risk measures and briefly discuss some of VaR's shortcomings in Section 2. CVaR minimization was first developed by Rockafellar and Uryasev, and its numerical effectiveness was demonstrated in portfolio optimization and option hedging problems. Their work was then extended to objective functions consisting of different combinations of the expected loss and the CVaR, such as the minimization of the expected loss subject to a constraint on CVaR. This is the objective function that we study in this paper, although we believe that our proposed algorithms can be easily extended to several other CVaR-related objective functions. Boda and Filar and Bäuerle and Ott [25, 4] extended these results to MDPs (sequential decision-making). While the former proposed to use dynamic programming (DP) to optimize CVaR, an approach that is limited to small problems, the latter showed that in both finite and infinite horizon MDPs, there exists a deterministic history-dependent optimal policy for CVaR optimization (see Section 3 for more details).
Most of the work in risk-sensitive sequential decision-making has been in the context of MDPs (when the model is known), and much less work has been done within the reinforcement learning (RL) framework. In risk-sensitive RL, we can mention the work by Borkar [11, 12], who considered the expected exponential utility, and the work by Tamar et al. and Prashanth and Ghavamzadeh on several variance-related risk measures. CVaR optimization in RL is a rather novel subject. Morimura et al. estimate the return distribution while exploring using a CVaR-based risk-sensitive policy; their algorithm does not scale to large problems. Petrik and Subramanian propose a method based on stochastic dual DP to optimize CVaR in large-scale MDPs; however, their method is limited to linearly controllable problems. Borkar and Jain consider a finite-horizon MDP with a CVaR constraint and sketch a stochastic approximation algorithm to solve it. Finally, Tamar et al. have recently proposed a policy gradient algorithm for CVaR optimization.
In this paper, we develop policy gradient (PG) and actor-critic (AC) algorithms for mean-CVaR optimization in MDPs. We first derive a formula for computing the gradient of this risk-sensitive objective function. We then propose several methods to estimate this gradient both incrementally and using system trajectories (update at each time-step vs. update after observing one or more trajectories). We then use these gradient estimates to devise PG and AC algorithms that update the policy parameters in the descent direction. Using the ordinary differential equation (ODE) approach, we establish the asymptotic convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in an optimal stopping problem. In comparison to the recent work of Tamar et al., which develops a PG algorithm for CVaR optimization in stochastic shortest path problems that only considers continuous loss distributions, uses a biased estimator for VaR, is not incremental, and has no convergence proof, here we study mean-CVaR optimization, consider both discrete and continuous loss distributions, devise both PG and (several) AC algorithms (trajectory-based and incremental; AC also helps reduce the variance of PG algorithms), and establish convergence proofs for our algorithms.
We consider problems in which the agent's interaction with the environment is modeled as an MDP. An MDP is a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, C, P, P_0)$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces; $C(x, a)$ is the bounded cost random variable whose expectation is denoted by $c(x, a) = \mathbb{E}[C(x, a)]$; $P(\cdot \mid x, a)$ is the transition probability distribution; and $P_0(\cdot)$ is the initial state distribution. For simplicity, we assume that the system has a single initial state $x^0$, i.e., $P_0(x) = \mathbf{1}\{x = x^0\}$. All the results of the paper can be easily extended to the case where the system has more than one initial state. We also need to specify the rule according to which the agent selects actions at each state. A stationary policy $\mu(\cdot \mid x)$ is a probability distribution over actions, conditioned on the current state. In policy gradient and actor-critic methods, we define a class of parameterized stochastic policies $\{\mu(\cdot \mid x; \theta),\ \theta \in \Theta\}$, estimate the gradient of a performance measure w.r.t. the policy parameters $\theta$ from the observed system trajectories, and then improve the policy by adjusting its parameters in the direction of the gradient. Since in this setting a policy is represented by its parameter vector $\theta$, policy-dependent functions can be written as a function of $\theta$ in place of $\mu$, and so we use $\mu$ and $\theta$ interchangeably in the paper. We denote by $d^\mu_\gamma(x \mid x^0)$ and $\pi^\mu_\gamma(x, a \mid x^0)$ the $\gamma$-discounted visiting distributions of state $x$ and state-action pair $(x, a)$ under policy $\mu$, respectively.
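As an illustration of the likelihood-ratio machinery used throughout the paper, the following sketch (ours, not the paper's; all sizes and names are illustrative) implements a softmax policy over a small discrete action set and checks its score function $\nabla_\theta \log \mu(a \mid x; \theta)$ against a finite-difference approximation:

```python
import numpy as np

def softmax_policy(theta, feats):
    """pi(a|x) proportional to exp(theta . phi(x, a)) over a small discrete action set."""
    logits = feats @ theta
    z = np.exp(logits - logits.max())      # subtract max for numerical stability
    return z / z.sum()

def score(theta, feats, a):
    """grad_theta log pi(a|x) = phi(x, a) - sum_b pi(b|x) phi(x, b)."""
    pi = softmax_policy(theta, feats)
    return feats[a] - pi @ feats

# Finite-difference check of the score function (illustrative sizes: 4 actions, 3 features).
rng = np.random.default_rng(4)
theta = rng.standard_normal(3)
phi = rng.standard_normal((4, 3))
a, eps = 2, 1e-6
num = np.array([(np.log(softmax_policy(theta + eps * e, phi)[a])
                 - np.log(softmax_policy(theta - eps * e, phi)[a])) / (2 * eps)
                for e in np.eye(3)])
```

The analytic score matches the numerical gradient, which is the identity that likelihood-ratio gradient estimators rely on.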
Let $Z$ be a bounded-mean random variable, i.e., $\mathbb{E}[|Z|] < \infty$, with cumulative distribution function $F(z) = \mathbb{P}(Z \le z)$ (e.g., one may think of $Z$ as the loss of an investment strategy). We define the value-at-risk at confidence level $\alpha \in (0, 1)$ as $\text{VaR}_\alpha(Z) = \min\{z \mid F(z) \ge \alpha\}$. Here the minimum is attained because $F$ is non-decreasing and right-continuous in $z$. When $F$ is continuous and strictly increasing, $\text{VaR}_\alpha(Z)$ is the unique $z$ satisfying $F(z) = \alpha$; otherwise, the VaR equation can have no solution or a whole range of solutions. Although VaR is a popular risk measure, it suffers from being unstable and difficult to work with numerically when $Z$ is not normally distributed, which is often the case, as loss distributions tend to exhibit fat tails or empirical discreteness. Moreover, VaR is not a coherent risk measure and, more importantly, does not quantify the losses that might be suffered beyond its value at the $\alpha$-tail of the distribution. An alternative measure that addresses most of VaR's shortcomings is conditional value-at-risk, $\text{CVaR}_\alpha(Z)$, which is the mean of the $\alpha$-tail distribution of $Z$. If there is no probability atom at $\text{VaR}_\alpha(Z)$, CVaR has a unique value defined as $\text{CVaR}_\alpha(Z) = \mathbb{E}[Z \mid Z \ge \text{VaR}_\alpha(Z)]$. Rockafellar and Uryasev showed that
$$\text{CVaR}_\alpha(Z) = \min_{\nu \in \mathbb{R}} H_\alpha(Z, \nu), \qquad H_\alpha(Z, \nu) := \nu + \frac{1}{1 - \alpha}\,\mathbb{E}\big[(Z - \nu)^+\big],$$
where $(x)^+ = \max(x, 0)$. Note that, as a function of $\nu$, $H_\alpha(Z, \nu)$ is finite and convex (hence continuous).
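To make the two definitions concrete, here is a small numerical sketch (ours, not the paper's): it estimates VaR and CVaR from loss samples, and checks the Rockafellar-Uryasev representation by minimizing $H_\alpha(Z, \nu)$ over a grid of $\nu$ for exponentially distributed losses, where the minimizer should approach $\text{VaR}_{0.9}(Z) = \ln 10 \approx 2.30$ and the minimum value $\text{CVaR}_{0.9}(Z) \approx 3.30$:

```python
import numpy as np

def empirical_var_cvar(losses, alpha):
    """VaR as the empirical alpha-quantile; CVaR as the mean of the alpha-tail."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)
    cvar = losses[losses >= var].mean()
    return var, cvar

def ru_objective(nu, losses, alpha):
    """Sample average of the Rockafellar-Uryasev objective nu + E[(Z - nu)^+] / (1 - alpha)."""
    return nu + np.maximum(losses - nu, 0.0).mean() / (1.0 - alpha)

rng = np.random.default_rng(0)
z = rng.exponential(scale=1.0, size=200_000)   # losses with an exponential right tail
alpha = 0.9

var_hat, cvar_hat = empirical_var_cvar(z, alpha)

# Minimizing the RU objective over nu recovers VaR (argmin) and CVaR (minimum value).
grid = np.linspace(0.0, 10.0, 2001)
vals = np.array([ru_objective(nu, z, alpha) for nu in grid])
nu_star, h_min = grid[vals.argmin()], vals.min()
```

The grid search is used only to keep the sketch dependency-free; since $H_\alpha$ is convex in $\nu$, any one-dimensional convex minimizer would do.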
3 CVaR Optimization in MDPs
For a policy $\mu$, we define the loss of a state $x$ (state-action pair $(x, a)$) as the sum of (discounted) costs encountered by the agent when it starts at state $x$ (state-action pair $(x, a)$) and then follows policy $\mu$, i.e., $D^\theta(x) = \sum_{k=0}^{\infty} \gamma^k C(x_k, a_k) \mid x_0 = x,\ \mu$ and $D^\theta(x, a) = \sum_{k=0}^{\infty} \gamma^k C(x_k, a_k) \mid x_0 = x,\ a_0 = a,\ \mu$. The expected values of these two random variables are the value and action-value functions of policy $\mu$, i.e., $V^\theta(x) = \mathbb{E}[D^\theta(x)]$ and $Q^\theta(x, a) = \mathbb{E}[D^\theta(x, a)]$. The goal in the standard discounted formulation is to find an optimal policy $\theta^* = \arg\min_\theta V^\theta(x^0)$.
For CVaR optimization in MDPs, we consider the following optimization problem: for a given confidence level $\alpha \in (0, 1)$ and loss tolerance $\beta \in \mathbb{R}$,
$$\min_\theta V^\theta(x^0) \quad \text{subject to} \quad \text{CVaR}_\alpha\big(D^\theta(x^0)\big) \le \beta.$$
By the result of Rockafellar and Uryasev stated in Section 2, this leads to the Lagrangian
$$L(\theta, \nu, \lambda) = V^\theta(x^0) + \lambda\Big(\nu + \frac{1}{1 - \alpha}\,\mathbb{E}\big[(D^\theta(x^0) - \nu)^+\big] - \beta\Big),$$
where $\lambda$ is the Lagrange multiplier. The goal here is to find the saddle point of $L(\theta, \nu, \lambda)$, i.e., a point $(\theta^*, \nu^*, \lambda^*)$ that satisfies $L(\theta, \nu, \lambda^*) \ge L(\theta^*, \nu^*, \lambda^*) \ge L(\theta^*, \nu^*, \lambda)$ for all $(\theta, \nu)$ and all $\lambda \ge 0$. This is achieved by descending in $(\theta, \nu)$ and ascending in $\lambda$ using the gradients of $L$ w.r.t. $\theta$, $\nu$, and $\lambda$, i.e., (5)-(7).¹ (¹The notation in (6) means that the right-most term is a member of the sub-gradient set $\partial_\nu L(\theta, \nu, \lambda)$.)
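The descend-ascend scheme can be illustrated on a toy constrained problem (ours, purely illustrative): minimize $x^2$ subject to $(x-2)^2 \le 1$, whose saddle point is $x^* = 1$, $\lambda^* = 1$. Gradient descent in the primal variable on a faster time-scale, combined with projected gradient ascent in the multiplier on a slower time-scale, converges to this saddle point:

```python
# Toy saddle-point problem: min_x x^2  s.t.  (x - 2)^2 <= 1.
# Lagrangian: L(x, lam) = x^2 + lam * ((x - 2)^2 - 1); saddle point at x* = 1, lam* = 1.
x, lam = 0.0, 0.0
for _ in range(20_000):
    x -= 0.05 * (2 * x + 2 * lam * (x - 2))            # descend in x (fast time-scale)
    lam = max(0.0, lam + 0.005 * ((x - 2) ** 2 - 1))   # ascend in lam, projected onto lam >= 0
```

The time-scale separation (primal step 0.05 vs. multiplier step 0.005) plays the same role as the step-size schedules in our algorithms: the fast variable tracks its optimum for the current value of the slow variable.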
We assume that there exists a policy $\mu$ such that $\text{CVaR}_\alpha(D^\mu(x^0)) \le \beta$ (feasibility assumption). As discussed in Section 1, Bäuerle and Ott [25, 4] showed that there exists a deterministic history-dependent optimal policy for CVaR optimization. The important point is that this policy does not depend on the complete history, but only on the current time step, the current state of the system, and the accumulated discounted cost.
In the following, we present a policy gradient (PG) algorithm (Sec. 4) and several actor-critic (AC) algorithms (Sec. 5.5) to optimize (4). While the PG algorithm updates its parameters after observing several trajectories, the AC algorithms are incremental and update their parameters at each time-step.
4 A Trajectory-based Policy Gradient Algorithm
In this section, we present a policy gradient algorithm to solve the optimization problem (4). The unit of observation in this algorithm is a system trajectory generated by following the current policy. At each iteration, the algorithm generates a set of trajectories by following the current policy, uses them to estimate the gradients in (5)-(7), and then uses these estimates to update the parameters $\theta$, $\nu$, and $\lambda$.
Let $\xi$ be a trajectory generated by following the policy $\theta$, terminating at time $T$ in a terminal state of the system. After the trajectory visits the terminal state, it enters a recurring sink state at the next time step and incurs zero cost from then on. The time index $T$ is referred to as the stopping time of the MDP. Since the transitions are stochastic, $T$ is a non-deterministic quantity. Here we assume that the policy is proper, i.e., every policy reaches the terminal state with probability 1. This further means that, with probability 1, the MDP exits the transient states and hits (and stays in) the sink state in finite time $T$. For simplicity, we assume that the agent incurs zero cost in the terminal state; analogous results for the general case with a non-zero terminal cost can be derived using identical arguments. The loss and probability of a trajectory $\xi$ are defined as $D(\xi) = \sum_{k=0}^{T-1} \gamma^k c(x_k, a_k)$ and $P_\theta(\xi) = P_0(x_0) \prod_{k=0}^{T-1} \mu(a_k \mid x_k; \theta)\, P(x_{k+1} \mid x_k, a_k)$, respectively. It can be easily shown that $\nabla_\theta \log P_\theta(\xi) = \sum_{k=0}^{T-1} \nabla_\theta \log \mu(a_k \mid x_k; \theta)$.
Algorithm 1 contains the pseudo-code of our proposed policy gradient algorithm. What appears inside the parentheses on the right-hand side of the update equations are the estimates of the gradients of $L$ w.r.t. $\theta$, $\nu$, and $\lambda$ (estimates of (5)-(7)) (see Appendix A.2). $\Gamma$ is an operator that projects a vector to the closest point in a compact and convex set, and the analogous operators for $\nu$ and $\lambda$ project onto their respective feasible sets. These projection operators are necessary to ensure the convergence of the algorithm. The step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the VaR parameter $\nu$ update is on the fastest time-scale, the policy parameter $\theta$ update is on the intermediate time-scale, and the Lagrange multiplier $\lambda$ update is on the slowest time-scale (see Appendix A.1 for the conditions on the step-size schedules). This results in a three time-scale stochastic approximation algorithm. We prove that our policy gradient algorithm converges to a (local) saddle point of the risk-sensitive objective function (see Appendix A.3).
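To illustrate the flavor of the likelihood-ratio gradient estimates that appear in the update equations, consider a one-step toy problem (ours, not the paper's setting): a softmax choice between a safe action with cost 1 and a risky action with cost 0 or 4 (each w.p. 1/2). The sampled quantity below averages the score function times the per-trajectory loss plus its CVaR penalty; its positive sign tells a descent update to lower the probability of the risky action:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, nu, lam, alpha = 0.0, 1.0, 1.0, 0.95   # illustrative parameter values

def sampled_gradient(n):
    """Likelihood-ratio estimate of the theta-gradient of
    E[D] + lam / (1 - alpha) * E[(D - nu)^+] in a one-step bandit."""
    p1 = 1.0 / (1.0 + np.exp(-theta))                  # prob. of the risky action
    a = (rng.random(n) < p1).astype(float)
    risky = rng.choice([0.0, 4.0], n)                  # risky action: cost 0 or 4
    d = np.where(a == 1.0, risky, 1.0)                 # safe action: cost 1
    score = np.where(a == 1.0, 1.0 - p1, -p1)          # d/dtheta log pi(a)
    weight = d + lam / (1.0 - alpha) * np.maximum(d - nu, 0.0)
    return (score * weight).mean()

g = sampled_gradient(200_000)   # positive, so a descent step shifts mass to the safe action
```

The heavy weight on the rare cost-4 outcome (amplified by the $1/(1-\alpha)$ factor) is exactly why these trajectory-based estimates can have high variance, which motivates the actor-critic variants in Section 5.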
5 Incremental Actor-Critic Algorithms
As mentioned in Section 4, the unit of observation in our policy gradient algorithm (Algorithm 1) is a system trajectory. This may result in high variance for the gradient estimates, especially when the trajectories are long. To address this issue, in this section we propose actor-critic algorithms that use linear approximation for some quantities in the gradient estimates and update the parameters incrementally (after each state-action transition). To develop our actor-critic algorithms, we first show how the gradients (5)-(7) can be estimated in an incremental fashion. We do this in the next four subsections, followed by a subsection that contains the algorithms.
5.1 Gradient w.r.t. the Policy Parameters
The gradient of our objective function w.r.t. the policy parameters in (5) may be rewritten as
Given the original MDP and the VaR parameter $\nu$, we define the augmented MDP by extending the state with a scalar component that tracks the remaining loss budget, keeping the action space unchanged, and modifying the cost function so that the CVaR penalty term is incurred at the terminal states:
where $x_T$ is any terminal state of the original MDP and $s_T$ is the value of the $s$-part of the augmented state when a policy reaches a terminal state after $T$ steps. We define a class of parameterized stochastic policies for this augmented MDP. Thus, the total (discounted) loss of a trajectory can be written as
From (9), it is clear that the quantity in the parentheses of (8) is the value function of the policy at the initial state of the augmented MDP. Thus, it is easy to show that (the proof of the second equality can be found in the literature)
where is the discounted visiting distribution (defined in Section 2) and is the action-value function of policy in the augmented MDP . We can show that
is an unbiased estimate of the gradient, where the temporal-difference (TD) error of the augmented MDP serves as an unbiased estimator of the advantage function. In our actor-critic algorithms, the critic uses a linear approximation for the value function, where the feature vector comes from a low-dimensional space.
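As a minimal sketch of the critic (ours; tabular one-hot features stand in for a generic low-dimensional feature map), TD(0) with a linear value approximation on a two-transition chain recovers the discounted values exactly:

```python
import numpy as np

# TD(0) with linear features on the deterministic chain s0 -(cost 1)-> s1 -(cost 1)-> s2,
# where s2 is terminal and gamma = 0.9, so V(s1) = 1 and V(s0) = 1 + 0.9 * 1 = 1.9.
gamma = 0.9
phi = np.eye(3)                 # one-hot features (a special case of linear approximation)
w = np.zeros(3)                 # critic weights, V(s) ~ phi(s) . w
for _ in range(2000):
    for s, s_next, cost in [(0, 1, 1.0), (1, 2, 1.0)]:
        v_next = phi[s_next] @ w if s_next != 2 else 0.0   # terminal state has value 0
        delta = cost + gamma * v_next - phi[s] @ w          # temporal-difference error
        w += 0.1 * delta * phi[s]                           # semi-gradient critic update
```

With non-trivial (non-one-hot) features the same update rule applies, but the fixed point is then the best linear approximation rather than the exact value function.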
5.2 Gradient w.r.t. the Lagrangian Parameter
We may rewrite the gradient of our objective function w.r.t. the Lagrangian parameters in (7) as
Similar to Section 5.1, (a) comes from the fact that the quantity in the parentheses in (11) is the value function of the policy at the initial state of the augmented MDP. Note that the dependence of this value function on $\lambda$ comes from the definition of the cost function in the augmented MDP. We now derive an expression for its gradient w.r.t. $\lambda$, which in turn will give us an expression for the gradient of the objective. The gradient of the value function w.r.t. the Lagrangian parameter may be written as
See Appendix B.2.
From Lemma 5.2 and (11), it is easy to see that the resulting quantity is an unbiased estimate of the gradient of the objective w.r.t. $\lambda$. An issue with this estimator is that its value is fixed all along a system trajectory and only changes at the end. This may affect the incremental nature of our actor-critic algorithm. To address this issue, we propose a different approach to estimating the gradients w.r.t. $\theta$ and $\lambda$ in Sec. 5.4 (of course, this does not come for free).
Another important issue is that the above estimator is unbiased only if the samples are generated from the discounted visiting distribution. If we just follow the policy, then we may still use the resulting quantity as an estimate (see (20) and (22) in Algorithm 2). Note that this is an issue for all discounted actor-critic algorithms whose (likelihood-ratio-based) estimate of the gradient is unbiased only if the samples are generated from the discounted visiting distribution, and not when we simply follow the policy. Although this issue was known in the community, a recent paper investigates it in detail. Moreover, this might be a main reason why, to the best of our knowledge, there is no convergence analysis for (likelihood-ratio-based) discounted actor-critic algorithms. (Note that the discounted actor-critic algorithm with a convergence proof in the literature is based on SPSA.)
5.3 Sub-Gradient w.r.t. the VaR Parameter
We may rewrite the sub-gradient of our objective function w.r.t. the VaR parameters in (6) as
From the definition of the augmented MDP, the probability in (13) may be written in terms of $s_T$, the $s$-part of the augmented state when we reach a terminal state (see Section 5.1). Thus, we may rewrite (13) as
From (14), it is easy to see that we obtain an unbiased estimate of the sub-gradient of the objective w.r.t. $\nu$. An issue with this (unbiased) estimator is that it can only be applied at the end of a system trajectory (i.e., when we reach the terminal state), and thus, using it prevents us from having a fully incremental algorithm. In fact, this is the estimator that we use in our semi trajectory-based actor-critic algorithm (see (21) in Algorithm 2).
One approach to estimating this sub-gradient incrementally, and hence having a fully incremental algorithm, is to use the simultaneous perturbation stochastic approximation (SPSA) method. The idea of SPSA is to estimate the sub-gradient using two values of the function at $\nu - \Delta$ and $\nu + \Delta$, where $\Delta > 0$ is a positive perturbation (see Sec. 5.5 for the detailed description of $\Delta$). (SPSA-based gradient estimation was first proposed by Spall and has been widely used in various settings, especially those involving high-dimensional parameters. The SPSA estimate described above is two-sided; it can also be implemented single-sided, where we use the values of the function at $\nu$ and $\nu + \Delta$. We refer the reader to the literature for more details on SPSA and for its application to learning in risk-sensitive MDPs.) In order to see how SPSA can help us estimate our sub-gradient incrementally, note that
Similar to Sections 5.1 and 5.2, (a) comes from the fact that the quantity in the parentheses in (15) is the value function of the policy at the initial state of the augmented MDP. Since the critic uses a linear approximation for the value function in our actor-critic algorithms (see Section 5.1 and Algorithm 2), the SPSA estimate of the sub-gradient is computed from the two perturbed critic evaluations (see (18) in Algorithm 2).
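The two-sided perturbation estimate described above can be sketched in a few lines (ours; the smooth surrogate $f$ below is purely illustrative and stands in for the critic's value estimate as a function of $\nu$):

```python
import numpy as np

def spsa_two_sided(f, nu, delta):
    """Two-sided SPSA-style estimate of df/dnu from two function evaluations."""
    return (f(nu + delta) - f(nu - delta)) / (2.0 * delta)

# Illustrative smooth surrogate: f(nu) = nu + log(1 + exp(-nu)), so f'(nu) = sigmoid(nu).
f = lambda nu: nu + np.log1p(np.exp(-nu))
g = spsa_two_sided(f, 1.0, 1e-3)   # ~ sigmoid(1.0) ~ 0.7311
```

Only two function evaluations are needed per update, which is what makes the estimator usable at every time-step rather than once per trajectory.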
5.4 An Alternative Approach to Compute the Gradients
In this section, we present an alternative way to compute the gradients, especially those w.r.t. $\theta$ and $\lambda$. This allows us to estimate the gradient w.r.t. $\lambda$ in a (more) incremental fashion (compared to the method of Section 5.2), at the cost of using two different linear function approximators (instead of the one used in Algorithm 2). In this approach, we define the augmented MDP slightly differently from the one in Section 5.2. The only difference is in the definition of the cost function, which is defined here as (note that $\lambda$ has been replaced by 1 and the original cost has been removed)
where $x_T$ is any terminal state of the original MDP. It is easy to see that the CVaR-related term appearing in the gradients (5)-(7) is the value function of the policy at the initial state of this augmented MDP. As a result, we have
Gradient w.r.t. $\theta$: It is easy to see that this gradient (5) is now the gradient of the value function of the original MDP plus $\lambda$ times the gradient of the value function of the augmented MDP, both at the initial states of these MDPs (with an abuse of notation, we use $V$ for the value function of both MDPs). Thus, using linear approximators for the value functions of the original and augmented MDPs, this gradient can be estimated from a combination of the TD errors of these two MDPs.
Gradient w.r.t. $\lambda$: Similar to the case for $\theta$, it is easy to see that this gradient (7) is $\nu - \beta$ plus the value function of the augmented MDP, and thus can be estimated incrementally using the critic of the augmented MDP.
Sub-gradient w.r.t. $\nu$: This sub-gradient (6) is $\lambda$ times one plus the gradient w.r.t. $\nu$ of the value function of the augmented MDP, and thus, using SPSA, it can be estimated incrementally from two perturbed critic evaluations.
5.5 Actor-Critic Algorithms
In this section, we present two actor-critic algorithms for optimizing the risk-sensitive measure (4). These algorithms are based on the gradient estimates of Sections 5.1-5.3. While the first algorithm (SPSA-based) is fully incremental and updates all the parameters at each time-step, the second one updates $\theta$ at each time-step and updates $\nu$ and $\lambda$ only at the end of each trajectory, hence the name semi trajectory-based. Algorithm 2 contains the pseudo-code of these algorithms. The projection operators are defined as in Section 4 and are necessary to ensure the convergence of the algorithms. The step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the critic update is on the fastest time-scale, the policy and VaR parameter updates are on the intermediate time-scale (with the $\nu$-update being faster than the $\theta$-update), and finally the Lagrange multiplier update is on the slowest time-scale (see Appendix B.1 for the conditions on these step-size schedules). This results in four time-scale stochastic approximation algorithms. We prove that these actor-critic algorithms converge to a (local) saddle point of the risk-sensitive objective function (see Appendix B.4).
6 Experimental Results
We consider an optimal stopping problem in which the state at each time step consists of the current cost $x$ and time $t$, with a fixed stopping horizon $T$. The agent (buyer) should decide either to accept the present cost or wait. If she accepts, or when $t = T$, the system reaches a terminal state and the current cost is received; otherwise, she receives a holding cost and the cost evolves multiplicatively, increasing by a factor $f_u$ w.p. $p$ and decreasing by a factor $f_d$ w.p. $1 - p$ ($f_u$, $f_d$, and $p$ are constants). Moreover, there is a discount factor to account for the increase in the buyer's affordability. The problem is described in more detail in Appendix C. Note that if we change cost to reward and minimization to maximization, this is exactly the American option pricing problem, a standard testbed for evaluating risk-sensitive algorithms. Since the state space is continuous, solving for an exact solution via DP is infeasible, and thus the problem requires approximation and sampling techniques.
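The stopping problem can be simulated with a few lines; the sketch below (ours, with illustrative constants, not the settings of Appendix C) rolls out a fixed threshold policy and reports the empirical mean, VaR, and CVaR of the discounted loss:

```python
import numpy as np

rng = np.random.default_rng(3)

def episode(x0=1.0, fu=2.0, fd=0.5, p=0.65, gamma=0.95, T=20, p_h=0.1, threshold=0.8):
    """One rollout of a threshold stopping policy: accept the current cost x when
    x <= threshold (or at the horizon T); otherwise pay the holding cost p_h and
    let x move up by factor fu w.p. p or down by factor fd w.p. 1 - p.
    All constants here are illustrative."""
    x, loss = x0, 0.0
    for t in range(T):
        if x <= threshold or t == T - 1:
            return loss + gamma ** t * x          # terminal cost received on accepting
        loss += gamma ** t * p_h                  # holding cost for waiting
        x *= fu if rng.random() < p else fd

losses = np.array([episode() for _ in range(5000)])
mean_loss = losses.mean()
var95 = np.quantile(losses, 0.95)
cvar95 = losses[losses >= var95].mean()           # mean of the right tail of the losses
```

Sweeping the (here fixed) threshold, or replacing it with a learned parameterized policy, is precisely where the algorithms of Sections 4 and 5 come in: they trade a small increase in `mean_loss` for a lighter right tail, i.e., a smaller `cvar95`.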
We compare the performance of our risk-sensitive policy gradient Alg. 1 (PG-CVaR) and the two actor-critic Algs. 2 (AC-CVaR-SPSA, AC-CVaR-Semi-Traj) with their risk-neutral counterparts (PG and AC) (see Appendix C for the details of these experiments). Fig. 1 shows the distribution of the discounted cumulative cost for the policy learned by each of these algorithms. From left to right, the columns display the first two moments, the whole distribution, and a zoom on the right tail of these distributions. The results indicate that the risk-sensitive algorithms yield a higher expected loss, but less variance, compared to the risk-neutral methods. More precisely, the loss distributions of the risk-sensitive algorithms have a lighter right tail than those of their risk-neutral counterparts. Table 1 summarizes the performance of these algorithms; the numbers reiterate what we concluded from Fig. 1.
7 Conclusions and Future Work
We proposed novel policy gradient and actor-critic (AC) algorithms for CVaR optimization in MDPs. We provided proofs of convergence (in the appendix) of the proposed algorithms to locally risk-sensitive optimal policies. Further, using an optimal stopping problem, we observed that our algorithms yield policies whose loss distributions have a lighter right tail than those of their risk-neutral counterparts. This is extremely important for a risk-averse decision-maker, especially if the right tail contains catastrophic losses. Future work includes: 1) providing convergence proofs for our AC algorithms when the samples are generated by following the policy rather than by sampling from its discounted visiting distribution (the latter can be wasteful in terms of samples); 2) obtaining finite-time bounds on the quality of the solutions found by these algorithms; here we established only asymptotic limits, and to the best of our knowledge there are no convergence rate results available for multi-timescale stochastic approximation schemes, and hence for AC algorithms, even for those that do not incorporate any risk criterion; 3) generating more samples in the right tail of the loss distribution (events that are observed with very low probability), since the interesting losses in CVaR optimization are those that exceed the VaR, and more such samples yield more accurate gradient estimates; although importance sampling methods have been used to address this problem [3, 35], several issues, particularly related to the choice of the sampling distribution, remain unsolved and need further investigation; and finally, 4) evaluating our algorithms on more challenging problems.
- Altman et al.  Eitan Altman, Konstantin E. Avrachenkov, and Rudesindo Núñez-Queija. Perturbation analysis for denumerable Markov chains with application to queueing models. Advances in Applied Probability, pages 839–853, 2004.
- Artzner et al.  P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Journal of Mathematical Finance, 9(3):203–228, 1999.
- Bardou et al.  O. Bardou, N. Frikha, and G. Pagès. Computing VaR and CVaR using stochastic approximation and adaptive unconstrained importance sampling. Monte Carlo Methods and Applications, 15(3):173–210, 2009.
- Bäuerle and Ott  N. Bäuerle and J. Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
- Bertsekas  D. Bertsekas. Nonlinear programming. Athena Scientific, 1999.
- Bhatnagar  S. Bhatnagar. An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes. Systems & Control Letters, 59(12):760–766, 2010.
- Bhatnagar and Lakshmanan  S. Bhatnagar and K. Lakshmanan. An online actor-critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, pages 1–21, 2012.
- Bhatnagar et al.  S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
- Bhatnagar et al.  S. Bhatnagar, H. Prasad, and L.A. Prashanth. Stochastic Recursive Algorithms for Optimization, volume 434. Springer, 2013.
- Boda and Filar  K. Boda and J. Filar. Time consistent dynamic risk measures. Mathematical Methods of Operations Research, 63(1):169–186, 2006.
- Borkar  V. Borkar. A sensitivity formula for the risk-sensitive cost and the actor-critic algorithm. Systems & Control Letters, 44:339–346, 2001.
- Borkar  V. Borkar. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27:294–311, 2002.
- Borkar  V. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3):207–213, 2005.
- Borkar  V. Borkar. Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press, 2008.
- Borkar and Jain  V. Borkar and R. Jain. Risk-constrained Markov decision processes. IEEE Transactions on Automatic Control, 2014.
- Filar et al.  J. Filar, L. Kallenberg, and H. Lee. Variance-penalized Markov decision processes. Mathematics of Operations Research, 14(1):147–161, 1989.
- Filar et al.  J. Filar, D. Krass, and K. Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10, 1995.
- Howard and Matheson  R. Howard and J. Matheson. Risk sensitive Markov decision processes. Management Science, 18(7):356–369, 1972.
- Khalil and Grizzle  Hassan K Khalil and JW Grizzle. Nonlinear systems, volume 3. Prentice hall Upper Saddle River, 2002.
- Kushner and Yin  Harold J Kushner and G George Yin. Stochastic approximation algorithms and applications. Springer, 1997.
- L.A. and Ghavamzadeh  Prashanth L.A. and M. Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Proceedings of Advances in Neural Information Processing Systems 26, pages 252–260, 2013.
- Markowitz  H. Markowitz. Portfolio Selection: Efficient Diversification of Investment. John Wiley and Sons, 1959.
- Milgrom and Segal  Paul Milgrom and Ilya Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.
- Morimura et al.  T. Morimura, M. Sugiyama, H. Kashima, H. Hachiya, and T. Tanaka. Nonparametric return distribution approximation for reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning, pages 799–806, 2010.
- Ott  J. Ott. A Markov Decision Model for a Surveillance Application and Risk-Sensitive Markov Decision Processes. PhD thesis, Karlsruhe Institute of Technology, 2010.
- Peters et al.  J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In Proceedings of the Sixteenth European Conference on Machine Learning, pages 280–291, 2005.
- Petrik and Subramanian  M. Petrik and D. Subramanian. An approximate solution method for large risk-averse Markov decision processes. In Proceedings of the 28th International Conference on Uncertainty in Artificial Intelligence, 2012.
- Rockafellar and Uryasev  R. Rockafellar and S. Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking and Finance, 26:1443–1471, 2002.
- Rockafellar and Uryasev  R. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–41, 2000.
- Ryan  EP Ryan. An integral invariance principle for differential inclusions with applications in adaptive control. SIAM Journal on Control and Optimization, 36(3):960–980, 1998.
- Shardlow and Stuart  Tony Shardlow and Andrew M. Stuart. A perturbation theory for ergodic Markov chains and application to numerical approximations. SIAM Journal on Numerical Analysis, 37(4):1120–1137, 2000.
- Sobel  M. Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, pages 794–802, 1982.
- Spall  J. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.
- Tamar et al.  A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, pages 387–396, 2012.
- Tamar et al.  A. Tamar, Y. Glassner, and S. Mannor. Policy gradients beyond expectations: Conditional value-at-risk. arXiv:1404.3862v1, 2014.
- Thomas  P. Thomas. Bias in natural actor-critic algorithms. In Proceedings of the Thirty-First International Conference on Machine Learning, 2014.
Appendix A Technical Details of the Trajectory-based Policy Gradient Algorithm
We make the following assumptions for our algorithms:
(A1) For any state-action pair , is continuously differentiable in and is a Lipschitz function in for every and .
(A2) The Markov chain induced by any policy is irreducible and aperiodic.
(A3) The step size schedules , , and satisfy
a.2 Computing the Gradients
i) : Gradient of w.r.t.
By expanding the expectations in the definition of the objective function in (4), we obtain
By taking gradient with respect to , we have
This gradient can be rewritten as
ii) : Sub-differential of w.r.t.
From the definition of , we can easily see that is a convex function in for any fixed . Note that for every fixed and any , we have
where is any element in the set of sub-derivatives:
Since is finite-valued for any , by the additive rule of sub-derivatives, we have
In particular for , we may write the sub-gradient of w.r.t.