1 Introduction
Risk-sensitive optimization considers problems in which the objective involves a risk measure of the random cost, in contrast to the typical expected-cost objective. Such problems are important when the decision-maker wishes to manage the variability of the cost, in addition to its expected outcome, and are standard in various applications of finance and operations research. In reinforcement learning (RL) [33], risk-sensitive objectives have gained popularity as a means to regularize the variability of the total (discounted) cost/reward in a Markov decision process (MDP).
Many risk objectives have been investigated in the literature and applied to RL, such as the celebrated Markowitz mean-variance model [19], Value-at-Risk (VaR), and Conditional Value-at-Risk (CVaR) [22, 35, 26, 12, 10, 36]. The view taken in this paper is that the preference of one risk measure over another is problem-dependent and depends on factors such as the cost distribution, sensitivity to rare events, ease of estimation from data, and computational tractability of the optimization problem. However, the highly influential paper of Artzner et al. [2] identified a set of natural properties that are desirable for a risk measure to satisfy. Risk measures that satisfy these properties are termed coherent and have obtained widespread acceptance in financial applications, among others. We focus on such coherent measures of risk in this work.

For sequential decision problems, such as MDPs, another desirable property of a risk measure is time consistency. A time-consistent risk measure satisfies a "dynamic programming" style property: if a strategy is risk-optimal for an $N$-stage problem, then the component of the policy from the $t$-th time until the end (where $t < N$) is also risk-optimal (see the principle of optimality in [5]). The recently proposed class of dynamic Markov coherent risk measures [30] satisfies both the coherence and time consistency properties.
In this work, we present policy gradient algorithms for RL with a coherent risk objective. Our approach applies to the whole class of coherent risk measures, thereby generalizing and unifying previous approaches that have focused on individual risk measures. We consider both static coherent risk of the total discounted return from an MDP and time-consistent dynamic Markov coherent risk. Our main contribution is formulating the risk-sensitive policy gradient under the coherent-risk framework. More specifically, we provide:

A new formula for the gradient of static coherent risk that is convenient for approximation using sampling.

An algorithm for the gradient of general static coherent risk that involves sampling with convex programming and a corresponding consistency result.

A new policy gradient theorem for Markov coherent risk, relating the gradient to a suitable value function, and a corresponding actor-critic algorithm.
Several previous results are special cases of the results presented here; our approach allows us to rederive them in greater generality and simplicity.
Related Work
Risk-sensitive optimization in RL for specific risk functions has been studied recently by several authors. [8] studied exponential utility functions, [22], [35], [26] studied mean-variance models, [10], [36] studied CVaR in the static setting, and [25], [11] studied dynamic coherent risk for systems with linear dynamics. Our paper presents a general method for the whole class of coherent risk measures (both static and dynamic) and is not limited to a specific choice within that class, nor to particular system dynamics.
Reference [24] showed that an MDP with a dynamic coherent risk objective is essentially a robust MDP. Planning for large-scale MDPs was considered in [37], using an approximation of the value function. For many problems, approximation in the policy space is more suitable (see, e.g., [18]). Our sampling-based RL-style approach is suitable for approximations both in the policy and in the value function, and scales up to large or continuous MDPs. We do, however, make use of a technique of [37] in a part of our method.
Optimization of coherent risk measures was thoroughly investigated by Ruszczyński and Shapiro [31] (see also [32]) for the stochastic programming case, in which the policy parameters do not affect the distribution of the stochastic system (i.e., the MDP trajectory) but only the reward function; thus, this approach is not suitable for most RL problems. For the case of MDPs and dynamic risk, [30] proposed a dynamic programming approach. This approach does not scale up to large MDPs, due to the "curse of dimensionality". For further motivation of risk-sensitive policy gradient methods, we refer the reader to [22, 35, 26, 10, 36].

2 Preliminaries
Consider a probability space $(\Omega, \mathcal{F}, P_\theta)$, where $\Omega$ is the set of outcomes (sample space), $\mathcal{F}$ is a $\sigma$-algebra over $\Omega$ representing the set of events we are interested in, and $P_\theta \in \mathcal{B}$, where $\mathcal{B}$ is the set of probability distributions, is a probability measure over $\mathcal{F}$ parameterized by some tunable parameter $\theta$. In the following, we suppress the notation of $\theta$ in $\theta$-dependent quantities.

To ease the technical exposition, in this paper we restrict our attention to finite probability spaces, i.e., $\Omega$ has a finite number of elements. Our results can be extended to general normed spaces without loss of generality, but the details are omitted for brevity.
Denote by $\mathcal{Z}$ the space of random variables $Z : \Omega \to (-\infty, \infty)$ defined over the probability space $(\Omega, \mathcal{F}, P_\theta)$. In this paper, a random variable $Z \in \mathcal{Z}$ is interpreted as a cost, i.e., the smaller the realization of $Z$, the better. For $Z, W \in \mathcal{Z}$, we denote by $Z \leq W$ the pointwise partial order, i.e., $Z(\omega) \leq W(\omega)$ for all $\omega \in \Omega$. We denote by $\mathbb{E}_\xi[Z] \doteq \sum_{\omega \in \Omega} P_\theta(\omega)\xi(\omega)Z(\omega)$ a $\xi$-weighted expectation of $Z$.

An MDP is a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, C, P, \gamma, x_0)$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces; $C(x)$ is a bounded, deterministic, and state-dependent cost; $P(\cdot|x,a)$ is the transition probability distribution; $\gamma \in [0,1)$ is a discount factor; and $x_0$ is the initial state. (Our results may easily be extended to random costs, state-action dependent costs, and random initial states.) Actions are chosen according to a parameterized stationary Markov policy $\mu_\theta(\cdot|x)$. (For the dynamic Markov risk we study, an optimal policy is stationary Markov, while this is not necessarily the case for the static risk. Our results can be extended to history-dependent policies or stationary Markov policies on a state space augmented with the accumulated cost. The latter has been shown to be sufficient for optimizing the CVaR risk [4].) We denote by $x_0, a_0, x_1, a_1, \ldots$ a trajectory drawn by following the policy $\mu_\theta$ in the MDP.
2.1 Coherent Risk Measures
A risk measure is a function $\rho : \mathcal{Z} \to \mathbb{R}$ that maps an uncertain outcome $Z$ to the extended real line $\mathbb{R} \cup \{+\infty, -\infty\}$; examples include the expectation $\mathbb{E}[Z]$ and the conditional value-at-risk (CVaR) $\min_\nu \{\nu + \frac{1}{\alpha}\mathbb{E}[(Z-\nu)^+]\}$. A risk measure is called coherent if it satisfies the following conditions for all $Z, W \in \mathcal{Z}$ [2]:
 A1 Convexity: $\forall \lambda \in [0,1]$, $\rho(\lambda Z + (1-\lambda)W) \leq \lambda\rho(Z) + (1-\lambda)\rho(W)$;
 A2 Monotonicity: if $Z \leq W$, then $\rho(Z) \leq \rho(W)$;
 A3 Translation invariance: $\forall a \in \mathbb{R}$, $\rho(Z + a) = \rho(Z) + a$;
 A4 Positive homogeneity: if $\lambda \geq 0$, then $\rho(\lambda Z) = \lambda\rho(Z)$.
Intuitively, these conditions ensure the "rationality" of single-period risk assessments: A1 ensures that diversifying an investment will reduce its risk; A2 guarantees that an asset with a higher cost in every possible scenario is indeed riskier; A3, also known as 'cash invariance', means that the deterministic part of an investment portfolio does not contribute to its risk; the intuition behind A4 is that doubling a position in an asset doubles its risk. We further refer the reader to [2] for a more detailed motivation of coherent risk.
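As a quick empirical illustration (ours, not part of the original analysis), axioms A1-A4 can be checked numerically for the empirical CVaR of a finite cost sample. The helper `cvar` below is a hypothetical utility that interprets CVaR as the mean of the worst $\alpha$-fraction of sampled costs:

```python
import numpy as np

def cvar(z, alpha):
    """Empirical CVaR_alpha of a cost sample: mean of the worst alpha-fraction."""
    z = np.sort(np.asarray(z, dtype=float))[::-1]  # descending: worst costs first
    k = max(1, int(round(alpha * len(z))))
    return z[:k].mean()

rng = np.random.default_rng(0)
Z = rng.normal(size=1000)
W = rng.exponential(size=1000)
alpha, lam, a = 0.1, 0.3, 2.5

# A1 convexity (on the common sample space)
assert cvar(lam * Z + (1 - lam) * W, alpha) <= lam * cvar(Z, alpha) + (1 - lam) * cvar(W, alpha) + 1e-9
# A2 monotonicity: Z <= Z + |W| pointwise
assert cvar(Z, alpha) <= cvar(Z + np.abs(W), alpha) + 1e-9
# A3 translation invariance
assert abs(cvar(Z + a, alpha) - (cvar(Z, alpha) + a)) < 1e-9
# A4 positive homogeneity (lam > 0)
assert abs(cvar(lam * Z, alpha) - lam * cvar(Z, alpha)) < 1e-9
```

All four checks pass because the top-$k$ sample average is itself a coherent risk measure (a CVaR at level $k/N$) on the empirical distribution.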
The following representation theorem [32] shows an important property of coherent risk measures that is fundamental to our gradientbased approach.
Theorem 2.1.
A risk measure $\rho : \mathcal{Z} \to \mathbb{R}$ is coherent if and only if there exists a convex, bounded, and closed set $\mathcal{U} \subset \mathcal{B}$ such that (when we study risk in MDPs, the risk envelope $\mathcal{U}$ in Eq. 1 also depends on the state $x$)
(1) $\rho(Z) = \max_{\xi \in \mathcal{U}} \mathbb{E}_\xi[Z].$
The result essentially states that any coherent risk measure is an expectation w.r.t. a worst-case density function $\xi$, chosen adversarially from a suitable set of test density functions $\mathcal{U}$, referred to as the risk envelope. Moreover, it means that any coherent risk measure is uniquely represented by its risk envelope. Thus, in the sequel, we shall interchangeably refer to coherent risk measures either by their explicit functional representation or by their corresponding risk envelope.
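To make the duality concrete, the following sketch (our illustration, using CVaR as the coherent risk measure; the helper names are ours) computes a discrete CVaR both directly, as the mean of the upper $\alpha$-tail, and through the dual representation of Eq. 1, by greedily constructing the adversarial density $\xi^*$ within the risk envelope. The two computations coincide:

```python
import numpy as np

def cvar_primal(z, p, alpha):
    """CVaR_alpha as the mean of the upper alpha-tail of the discrete cost (z, p)."""
    order = np.argsort(z)[::-1]          # worst (largest) costs first
    z, p = z[order], p[order]
    mass, out = alpha, 0.0
    for i in range(len(z)):              # accumulate alpha units of tail mass
        take = min(p[i], mass)
        out += take * z[i]
        mass -= take
        if mass <= 0:
            break
    return out / alpha

def cvar_dual(z, p, alpha):
    """max_{xi in U} E_p[xi * Z] with U = {0 <= xi <= 1/alpha, E_p[xi] = 1}.
    The adversary puts xi = 1/alpha on the largest costs until the mass is spent."""
    order = np.argsort(z)[::-1]
    z, p = z[order], p[order]
    xi, mass = np.zeros_like(p), 1.0
    for i in range(len(z)):
        xi[i] = min(1.0 / alpha, mass / p[i]) if p[i] > 0 else 0.0
        mass -= xi[i] * p[i]
        if mass <= 1e-12:
            break
    return float(np.sum(p * xi * z))

z = np.array([10.0, 2.0, 0.0, -1.0])
p = np.array([0.05, 0.15, 0.5, 0.3])
for alpha in (0.05, 0.1, 0.25, 1.0):
    assert abs(cvar_primal(z, p, alpha) - cvar_dual(z, p, alpha)) < 1e-9
```

Note that at $\alpha = 1$ the envelope collapses to $\{\xi \equiv 1\}$ and both quantities reduce to the plain expectation.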
In this paper, we assume that the risk envelope $\mathcal{U}$ is given in a canonical convex programming formulation and satisfies the following conditions.
Assumption 2.2 (The General Form of Risk Envelope).
For each given policy parameter $\theta$, the risk envelope $\mathcal{U}$ of a coherent risk measure can be written as
(2) $\mathcal{U}(P_\theta) = \Big\{ \xi : g_e(\xi, P_\theta) = 0 \;\forall e \in \mathcal{E},\;\; f_i(\xi, P_\theta) \leq 0 \;\forall i \in \mathcal{I},\;\; \sum_{\omega \in \Omega} \xi(\omega)P_\theta(\omega) = 1,\;\; \xi(\omega) \geq 0 \Big\},$
where each constraint $g_e(\xi, P_\theta)$ is an affine function in $\xi$, each constraint $f_i(\xi, P_\theta)$ is a convex function in $\xi$, and there exists a strictly feasible point $\bar\xi$. $\mathcal{E}$ and $\mathcal{I}$ here denote the sets of equality and inequality constraints, respectively. Furthermore, for any given $\xi \in \mathcal{B}$, $f_i(\xi, p)$ and $g_e(\xi, p)$ are twice differentiable in $p$, and there exists an $M > 0$ bounding the gradients of these constraints w.r.t. $p(\omega)$ for all $\omega \in \Omega$.
Assumption 2.2 implies that the risk envelope $\mathcal{U}$ is known in an explicit form. From Theorem 6.6 of [32], in the case of a finite probability space, $\rho$ is a coherent risk measure if and only if $\mathcal{U}$ is a convex and compact set. This justifies the affine assumption on $g_e$ and the convexity assumption on $f_i$. Moreover, the additional assumption on the smoothness of the constraints holds for many popular coherent risk measures, such as the CVaR, the mean-semideviation, and spectral risk measures [1].
2.2 Dynamic Risk Measures
The risk measures defined above do not take into account any temporal structure that the random variable might have, such as when it is associated with the return of a trajectory in the case of MDPs. In this sense, such risk measures are called static. Dynamic risk measures, on the other hand, explicitly take into account the temporal nature of the stochastic outcome. A primary motivation for considering such measures is the issue of time consistency, usually defined as follows [30]: if a certain outcome is considered less risky in all states of the world at stage $t+1$, then it should also be considered less risky at stage $t$. Example 2.1 in [16] shows the importance of time consistency in the evaluation of risk in a dynamic setting. It illustrates that for multi-period decision-making, optimizing a static measure can lead to "time-inconsistent" behavior. Similar paradoxical results could be obtained with other risk metrics; we refer the reader to [30] and [16] for further insights.
Markov Coherent Risk Measures.
Markov risk measures were introduced in [30] and are a useful class of dynamic time-consistent risk measures that are particularly important for our study of risk in MDPs. For a length-$T$ horizon and MDP $\mathcal{M}$, the Markov coherent risk measure is
(3) $\rho_T(\mathcal{M}) = C(x_0) + \rho\Big(\gamma C(x_1) + \rho\big(\gamma^2 C(x_2) + \cdots + \rho(\gamma^T C(x_T))\big)\Big),$
where $\rho$ is a static coherent risk measure that satisfies Assumption 2.2 and $x_0, a_0, x_1, \ldots, x_T$ is a trajectory drawn from the MDP under policy $\mu_\theta$. It is important to note that in (3), each static coherent risk at state $x_t$ is induced by the transition probability $P_\theta(\cdot|x_t)$. We also define $\rho_\infty(\mathcal{M}) \doteq \lim_{T \to \infty} \rho_T(\mathcal{M})$, which is well-defined since $\gamma < 1$ and the cost is bounded. We further assume that $\rho$ in (3) is a Markov risk measure, i.e., the evaluation of each static coherent risk measure is not allowed to depend on the whole past.
3 Problem Formulation
In this paper, we are interested in solving two risk-sensitive optimization problems. Given a random variable $Z$ and a static coherent risk measure $\rho$ as defined in Section 2, the static risk problem (SRP) is given by
(4) $\min_\theta \; \rho(Z).$
For example, in an RL setting, $Z$ may correspond to the cumulative discounted cost of a trajectory induced by an MDP with a policy parameterized by $\theta$.
For an MDP $\mathcal{M}$ and a dynamic Markov coherent risk measure $\rho_\infty$ as defined by Eq. 3, the dynamic risk problem (DRP) is given by
(5) $\min_\theta \; \rho_\infty(\mathcal{M}).$
Except for very limited cases, there is no reason to hope that either the SRP in (4) or the DRP in (5) is a tractable problem, since the dependence of the risk measure on $\theta$ may be complex and non-convex. In this work, we aim at a more modest goal and search for a locally optimal $\theta$. Thus, the main problem we address in this paper is how to calculate the gradients of the SRP's and DRP's objective functions, $\nabla_\theta \rho(Z)$ and $\nabla_\theta \rho_\infty(\mathcal{M})$.
We are interested in nontrivial cases in which the gradients cannot be calculated analytically. In the static case, this would correspond to a nontrivial dependence of $P_\theta$ on $\theta$. For dynamic risk, we also consider cases where the state space is too large for a tractable computation. Our approach for dealing with such difficult cases is through sampling. We assume that in the static case, we may obtain i.i.d. samples of the random variable $Z$. For the dynamic case, we assume that for each state and action of the MDP, we may obtain i.i.d. samples of the next state. We show that sampling may indeed be used in both cases to devise suitable estimators for the gradients.
To finally solve the SRP and DRP problems, a gradient estimate may be plugged into a standard stochastic gradient descent (SGD) algorithm for learning a locally optimal solution to (4) and (5). From the structure of the dynamic risk in Eq. 3, one may suspect that a gradient estimator for the static risk may help us estimate the gradient of the dynamic risk. Indeed, we follow this idea and begin with estimating the gradient in the static risk case.

4 Gradient Formula for Static Risk
In this section, we consider a static coherent risk measure $\rho$ and propose sampling-based estimators for $\nabla_\theta \rho(Z)$. We make the following assumption on the policy parametrization, which is standard in the policy gradient literature [18].
Assumption 4.1.
The likelihood ratio $\nabla_\theta \log P_\theta(\omega)$ is well-defined and bounded for all $\omega \in \Omega$.
Moreover, our approach implicitly assumes that given some $\omega \in \Omega$, $\nabla_\theta \log P_\theta(\omega)$ may be easily calculated. This is also a standard requirement for policy gradient algorithms [18] and is satisfied in various applications such as queueing systems, inventory management, and financial engineering (see, e.g., the survey by Fu [14]).
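For concreteness, here is a minimal sketch (ours; the tabular softmax parametrization is an assumption for illustration, not prescribed by the paper) of computing the per-step score function $\nabla_\theta \log \mu_\theta(a|x)$, verified against finite differences. For a full trajectory, $\nabla_\theta \log P_\theta$ decomposes into the sum of such per-step scores:

```python
import numpy as np

def softmax_policy(theta, x):
    """Action probabilities of a tabular softmax policy; theta has shape (n_states, n_actions)."""
    logits = theta[x]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def score(theta, x, a):
    """grad_theta log mu_theta(a|x) for the softmax policy: 1{a'=a} - pi(a'|x) at state x."""
    g = np.zeros_like(theta)
    g[x] = -softmax_policy(theta, x)
    g[x, a] += 1.0
    return g

# finite-difference check of the score function
rng = np.random.default_rng(1)
theta = rng.normal(size=(3, 2))
x, a, eps = 1, 0, 1e-6
g = score(theta, x, a)
for idx in np.ndindex(theta.shape):
    tp, tm = theta.copy(), theta.copy()
    tp[idx] += eps
    tm[idx] -= eps
    fd = (np.log(softmax_policy(tp, x)[a]) - np.log(softmax_policy(tm, x)[a])) / (2 * eps)
    assert abs(fd - g[idx]) < 1e-5
```

In a trajectory setting one would accumulate `score(theta, x_t, a_t)` over the visited state-action pairs to obtain the trajectory likelihood ratio.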
Using Theorem 2.1 and Assumption 2.2, for each $\theta$, we have that $\rho(Z)$ is the solution to the convex optimization problem (1) (for that value of $\theta$). The Lagrangian function of (1), denoted by $L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I)$, may be written as
(6) $L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I) = \sum_{\omega} \xi(\omega)P_\theta(\omega)Z(\omega) - \lambda^P\Big(\sum_{\omega} \xi(\omega)P_\theta(\omega) - 1\Big) - \sum_{e \in \mathcal{E}} \lambda^E(e)\,g_e(\xi, P_\theta) - \sum_{i \in \mathcal{I}} \lambda^I(i)\,f_i(\xi, P_\theta).$
The convexity of (1) and its strict feasibility due to Assumption 2.2 imply that $L_\theta$ has a non-empty set of saddle points $S$. The next theorem presents a formula for the gradient $\nabla_\theta \rho(Z)$. As we shall subsequently show, this formula is particularly convenient for devising sampling-based estimators for $\nabla_\theta \rho(Z)$.

Theorem 4.2.
Let Assumptions 2.2 and 4.1 hold, and let $(\xi^*, \lambda^{P,*}, \lambda^{E,*}, \lambda^{I,*}) \in S$ be a saddle point of (6). Then
$\nabla_\theta \rho(Z) = \mathbb{E}_{\xi^*}\big[\nabla_\theta \log P_\theta(\omega)\,(Z(\omega) - \lambda^{P,*})\big] - \sum_{e \in \mathcal{E}} \lambda^{E,*}(e)\,\nabla_\theta g_e(\xi^*, P_\theta) - \sum_{i \in \mathcal{I}} \lambda^{I,*}(i)\,\nabla_\theta f_i(\xi^*, P_\theta).$

The proof of this theorem, given in the supplementary material, involves an application of the envelope theorem [21] and a standard 'likelihood-ratio' trick. We now demonstrate the utility of Theorem 4.2 with several examples, in which we show that it generalizes previously known results and also enables deriving new useful gradient formulas.
4.1 Example 1: CVaR
The CVaR at level $\alpha \in (0,1]$ of a random variable $Z$, denoted by $\mathrm{CVaR}_\alpha(Z)$, is a very popular coherent risk measure [28], defined as
$\mathrm{CVaR}_\alpha(Z) \doteq \min_{\nu \in \mathbb{R}} \Big\{ \nu + \frac{1}{\alpha}\mathbb{E}\big[(Z - \nu)^+\big] \Big\}.$
When $Z$ is continuous, $\mathrm{CVaR}_\alpha(Z)$ is well-known to be the mean of the $\alpha$-tail distribution of $Z$, $\mathbb{E}[Z \mid Z \geq \nu_\alpha]$, where $\nu_\alpha$ is a $(1-\alpha)$-quantile of $Z$. Thus, selecting a small $\alpha$ makes CVaR particularly sensitive to rare, but very high costs.

The risk envelope for CVaR is known to be [32] $\mathcal{U} = \{\xi : \xi(\omega) \in [0, \frac{1}{\alpha}],\; \sum_\omega \xi(\omega)P_\theta(\omega) = 1\}$. Furthermore, [32] show that the saddle points of (6) satisfy $\xi^*(\omega) = \frac{1}{\alpha}$ when $Z(\omega) > \lambda^{P,*}$, and $\xi^*(\omega) = 0$ when $Z(\omega) < \lambda^{P,*}$, where $\lambda^{P,*}$ is any $(1-\alpha)$-quantile of $Z$. Plugging this result into Theorem 4.2, we can easily show that
$\nabla_\theta \mathrm{CVaR}_\alpha(Z) = \mathbb{E}\big[\nabla_\theta \log P_\theta(\omega)\,(Z - \nu_\alpha) \;\big|\; Z \geq \nu_\alpha\big].$
This formula was recently proved in [36] for the case of continuous distributions by an explicit calculation of the conditional expectation, and under several additional smoothness assumptions. Here we show that it holds regardless of these assumptions and in the discrete case as well. Our proof is also considerably simpler.
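The formula suggests a straightforward sampling estimator. The sketch below is ours (the Gaussian location family and the helper name are assumptions chosen so the answer is checkable): it averages score-times-excess-cost terms over the empirical $\alpha$-tail, and for $Z = \theta + \varepsilon$ translation invariance implies the true gradient is exactly 1:

```python
import numpy as np

def cvar_grad_estimate(z, score, alpha):
    """Likelihood-ratio estimator of grad_theta CVaR_alpha(Z):
    (1/(alpha*N)) * sum over tail samples of score_i * (z_i - q_hat),
    with q_hat the empirical (1-alpha)-quantile (the VaR)."""
    q = np.quantile(z, 1.0 - alpha)
    tail = z >= q
    return (score[tail] * (z[tail] - q)).sum() / (alpha * len(z))

# Sanity check on a Gaussian location family Z = theta + eps, eps ~ N(0,1):
# CVaR_alpha(Z) = theta + CVaR_alpha(eps), so the true gradient w.r.t. theta is 1.
rng = np.random.default_rng(2)
theta, n, alpha = 0.5, 200_000, 0.05
z = theta + rng.normal(size=n)
score = z - theta            # grad_theta log N(z; theta, 1)
g = cvar_grad_estimate(z, score, alpha)
assert abs(g - 1.0) < 0.1
```

Note the estimator only uses the worst $\alpha$-fraction of samples, which is exactly why small $\alpha$ makes the estimate noisy and motivates the importance-sampling extensions mentioned in the conclusion.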
4.2 Example 2: MeanSemideviation
The semi-deviation of a random variable $Z$ is defined as
$\mathrm{SD}[Z] \doteq \big(\mathbb{E}\big[(Z - \mathbb{E}[Z])_+^2\big]\big)^{1/2}.$
The semi-deviation captures the variation of the cost only above its mean, and is an appealing alternative to the standard deviation, which does not distinguish between the variability of upside and downside deviations. For some $c \in [0,1]$, the mean-semideviation risk measure is defined as $\rho(Z) \doteq \mathbb{E}[Z] + c\,\mathrm{SD}[Z]$, and is a coherent risk measure [32]. We have the following result:

Proposition 4.3.
Under Assumption 4.1, with $\rho(Z) = \mathbb{E}[Z] + c\,\mathrm{SD}[Z]$, we have
$\nabla_\theta \rho(Z) = \nabla_\theta \mathbb{E}[Z] + c\,\nabla_\theta \mathrm{SD}[Z],$
where $\nabla_\theta \mathbb{E}[Z] = \mathbb{E}[Z\,\nabla_\theta \log P_\theta]$ and
$\nabla_\theta \mathrm{SD}[Z] = \frac{1}{2\,\mathrm{SD}[Z]}\Big(\mathbb{E}\big[(Z - \mathbb{E}[Z])_+^2\,\nabla_\theta \log P_\theta\big] - 2\,\mathbb{E}\big[(Z - \mathbb{E}[Z])_+\big]\,\nabla_\theta \mathbb{E}[Z]\Big).$
This proposition can be used to devise a sampling-based estimator for $\nabla_\theta \rho(Z)$ by replacing all the expectations with sample averages. The algorithm, along with the proof of the proposition, is in the supplementary material. In Section 6 we provide a numerical illustration of optimization with a mean-semideviation objective.
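The following sketch of such a sampling estimator is our own construction following the likelihood-ratio reasoning above (it is not the paper's GMSD algorithm, which is given in the supplementary material). For a Gaussian location family the semideviation is shift-invariant, so the true gradient is exactly 1, which gives a check:

```python
import numpy as np

def msd_grad_estimate(z, score, c):
    """Sample-average estimator of grad_theta (E[Z] + c * SD[Z]),
    with SD[Z] = sqrt(E[(Z - E[Z])_+^2]) the upper semideviation."""
    mu = z.mean()
    g_mu = (z * score).mean()                 # likelihood-ratio gradient of E[Z]
    dev = np.maximum(z - mu, 0.0)
    sd = np.sqrt((dev ** 2).mean())
    # gradient of E[(Z - mu)_+^2]: LR term plus the chain-rule term through mu
    g_d = (dev ** 2 * score).mean() - 2.0 * g_mu * dev.mean()
    return g_mu + c * g_d / (2.0 * sd)

# Gaussian location family: SD is shift-invariant, so the true gradient is 1.
rng = np.random.default_rng(3)
theta, n, c = 0.5, 200_000, 0.7
z = theta + rng.normal(size=n)
score = z - theta                             # grad_theta log N(z; theta, 1)
g = msd_grad_estimate(z, score, c)
assert abs(g - 1.0) < 0.05
```

Unlike the CVaR estimator, every sample contributes here, so the estimate is comparatively low-variance.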
4.3 General Gradient Estimation Algorithm
In the two previous examples, we obtained a gradient formula by analytically calculating the Lagrangian saddle point of (6) and plugging it into the formula of Theorem 4.2. We now consider a general coherent risk $\rho(Z)$ for which, in contrast to the CVaR and mean-semideviation cases, the Lagrangian saddle point is not known analytically. We only assume that we know the structure of the risk envelope as given by (2). We show that in this case, $\nabla_\theta \rho(Z)$ may be estimated using a sample average approximation (SAA; [32]) of the formula in Theorem 4.2.
Assume that we are given $N$ i.i.d. samples $\omega_i \sim P_\theta$, $i = 1, \ldots, N$, and let $P_{\theta;N}$ denote the corresponding empirical distribution. Also, let the sample risk envelope $\mathcal{U}(P_{\theta;N})$ be defined according to Eq. 2 with $P_\theta$ replaced by $P_{\theta;N}$. Consider the following SAA version of the optimization in Eq. 1:
(7) $\rho_N(Z) = \max_{\xi \in \mathcal{U}(P_{\theta;N})} \; \frac{1}{N}\sum_{i=1}^N \xi(\omega_i)Z(\omega_i).$
Note that (7) defines a convex optimization problem with $O(N)$ variables and constraints. In the following, we assume that a solution to (7) may be computed efficiently using standard convex programming tools such as interior point methods [9]. Let $\xi_N^*$ denote a solution to (7) and $(\lambda_N^{P,*}, \lambda_N^{E,*}, \lambda_N^{I,*})$ denote the corresponding KKT multipliers, which can be obtained from the convex programming algorithm [9]. We propose the following estimator for the gradient, based on Theorem 4.2:
(8) $\nabla_\theta \rho_N(Z) = \frac{1}{N}\sum_{i=1}^N \xi_N^*(\omega_i)\,\nabla_\theta \log P_\theta(\omega_i)\big(Z(\omega_i) - \lambda_N^{P,*}\big) - \sum_{e \in \mathcal{E}} \lambda_N^{E,*}(e)\,\nabla_\theta g_e(\xi_N^*, P_{\theta;N}) - \sum_{j \in \mathcal{I}} \lambda_N^{I,*}(j)\,\nabla_\theta f_j(\xi_N^*, P_{\theta;N}).$
Thus, our gradient estimation algorithm is a two-step procedure involving both sampling and convex programming. In the following, we show that under some conditions on the risk envelope, $\nabla_\theta \rho_N(Z)$ is a consistent estimator of $\nabla_\theta \rho(Z)$. The proof is given in the supplementary material.
Proposition 4.4.
Let Assumptions 2.2 and 4.1 hold. Suppose there exists a compact set such that: (I) The set of Lagrangian saddle points is nonempty and bounded. (II) The functions for all and for all are finitevalued and continuous (in ) on . (III) For large enough, the set is nonempty and w.p. 1. Further assume that: (IV) If and converges w.p. 1 to a point , then . We then have that and w.p. 1.
The set of assumptions for Proposition 4.4 is large, but rather mild. Note that (I) is implied by the Slater condition of Assumption 2.2. For (III) to be satisfied, we need the risk to be well-defined for every empirical distribution, which is a natural requirement. Since the empirical distribution always converges to the true distribution uniformly on a finite $\Omega$, (IV) essentially requires smoothness of the constraints. We remark that, in particular, conditions (I) to (IV) are satisfied for the popular CVaR, mean-semideviation, and spectral risk measures.
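To illustrate the two-step procedure on a concrete instance (our sketch, not the paper's implementation: we use CVaR's polyhedral risk envelope so the SAA problem (7) becomes a linear program, and we assume SciPy is available), one can solve (7) with `scipy.optimize.linprog` and compare the optimum to the empirical tail average:

```python
import numpy as np
from scipy.optimize import linprog

def saa_risk_lp(z, alpha):
    """Solve the SAA problem (7) for the CVaR envelope:
    max (1/N) sum_i xi_i z_i   s.t.   0 <= xi_i <= 1/alpha,  (1/N) sum_i xi_i = 1."""
    n = len(z)
    res = linprog(
        c=-z / n,                               # linprog minimizes, so negate
        A_eq=np.ones((1, n)) / n, b_eq=[1.0],
        bounds=[(0.0, 1.0 / alpha)] * n,
        method="highs",
    )
    assert res.success
    return -res.fun, res.x                      # risk value and worst-case density xi*

rng = np.random.default_rng(4)
z = rng.normal(size=100)
alpha = 0.1                                     # alpha * N = 10 samples in the tail
rho, xi = saa_risk_lp(z, alpha)
tail_mean = np.sort(z)[-10:].mean()             # CVaR = average of the 10 worst costs
assert abs(rho - tail_mean) < 1e-8
```

The KKT multipliers needed for the estimator in Eq. 8 come from the same solve; with the HiGHS backend, for instance, the dual of the equality constraint is exposed by the solver and sits at the tail threshold (an empirical quantile of $Z$), playing the role of $\lambda_N^{P,*}$.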
To summarize this section, we have seen that by exploiting the special structure of coherent risk measures in Theorem 2.1 and the envelope-theorem style result of Theorem 4.2, we were able to derive sampling-based, likelihood-ratio style algorithms for estimating the policy gradient of coherent static risk measures. The gradient estimation algorithms developed here for static risk measures will be used as a subroutine in our subsequent treatment of dynamic risk measures.
5 Gradient Formula for Dynamic Risk
In this section, we derive a new formula for the gradient of the Markov coherent dynamic risk measure, $\nabla_\theta \rho_\infty(\mathcal{M})$. Our approach is based on combining the static gradient formula of Theorem 4.2 with a dynamic programming decomposition of $\rho_\infty(\mathcal{M})$.

The risk-sensitive value function for an MDP $\mathcal{M}$ under the policy $\theta$ is defined as $V_\theta(x) \doteq \rho_\infty(\mathcal{M} \mid x_0 = x)$, where, with a slight abuse of notation, $\rho_\infty(\mathcal{M} \mid x_0 = x)$ denotes the Markov coherent dynamic risk in (3) when the initial state is $x$. It is shown in [30] that due to the structure of the Markov dynamic risk $\rho_\infty$, the value function is the unique solution to the risk-sensitive Bellman equation
(9) $V_\theta(x) = C(x) + \gamma \max_{\xi \in \mathcal{U}(P_\theta(\cdot|x))} \mathbb{E}_\xi\big[V_\theta(x')\big],$
where the expectation is taken over the next state transition. Note that by definition, we have $V_\theta(x_0) = \rho_\infty(\mathcal{M})$, and thus $\nabla_\theta V_\theta(x_0) = \nabla_\theta \rho_\infty(\mathcal{M})$.
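Eq. 9 can be solved by fixed-point iteration on a small MDP. The sketch below is our illustration (a fixed policy, a one-step CVaR risk, and made-up transition data, none of which are taken from the paper): it iterates the risk-sensitive Bellman operator, and with $\alpha = 1$ the adversary is powerless, so the iteration reduces to standard policy evaluation, which provides a check:

```python
import numpy as np

def cvar_worst_case(v, p, alpha):
    """Worst-case expectation of v under the CVaR_alpha envelope around p:
    adversary reweights p by xi with 0 <= xi <= 1/alpha and E_p[xi] = 1."""
    order = np.argsort(v)[::-1]          # largest continuation values first
    mass, out = alpha, 0.0
    for i in order:
        take = min(p[i], mass)
        out += take * v[i]
        mass -= take
        if mass <= 0:
            break
    return out / alpha

def risk_value_iteration(C, P, gamma, alpha, iters=500):
    """Fixed point of V(x) = C(x) + gamma * rho(V(x')), cf. Eq. (9), by iteration."""
    V = np.zeros(len(C))
    for _ in range(iters):
        V = C + gamma * np.array([cvar_worst_case(V, P[x], alpha) for x in range(len(C))])
    return V

C = np.array([1.0, 0.0, 2.0])
P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.3, 0.3, 0.4]])           # P[x, x'] under the fixed policy
gamma = 0.9

V_risk = risk_value_iteration(C, P, gamma, alpha=0.3)
V_neutral = risk_value_iteration(C, P, gamma, alpha=1.0)
V_exact = np.linalg.solve(np.eye(3) - gamma * P, C)   # standard policy evaluation
assert np.allclose(V_neutral, V_exact, atol=1e-6)
assert np.all(V_risk >= V_neutral - 1e-9)             # risk-averse values dominate
```

The componentwise domination in the last assertion reflects the robust-MDP interpretation discussed below: the risk-sensitive value is the value against an adversarially perturbed transition model.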
We now develop a formula for $\nabla_\theta V_\theta(x_0)$; this formula extends the well-known "policy gradient theorem" [34, 17], developed for the expected return, to Markov coherent dynamic risk measures. We make a standard assumption, analogous to Assumption 4.1 of the static case.
Assumption 5.1.
The likelihood ratio $\nabla_\theta \log \mu_\theta(a|x)$ is well-defined and bounded for all $x \in \mathcal{X}$ and $a \in \mathcal{A}$.
For each state $x \in \mathcal{X}$, let $(\xi^*_x, \lambda^*_x)$ denote a saddle point of (6), corresponding to the state $x$, with $P_\theta(\cdot|x)$ replacing $P_\theta$ in (6) and $V_\theta$ replacing $Z$. The next theorem presents a formula for $\nabla_\theta \rho_\infty(\mathcal{M})$; the proof is in the supplementary material.
Theorem 5.2.
Under Assumptions 2.2 and 5.1, we have
$\nabla_\theta \rho_\infty(\mathcal{M}) = \mathbb{E}_{\xi^*}\Big[\sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \mu_\theta(a_t|x_t)\, h_\theta(x_t, a_t)\Big],$
where $\mathbb{E}_{\xi^*}$ denotes the expectation w.r.t. trajectories generated by the Markov chain with transition probabilities $P_\theta(\cdot|x_t)\,\xi^*_{x_t}(\cdot)$, and the stage-wise cost function $h_\theta(x, a)$ is defined in terms of $V_\theta$ and the saddle-point multipliers of (6).

Theorem 5.2 may be used to develop an actor-critic style [34, 17] sampling-based algorithm for solving the DRP problem (5), composed of two interleaved procedures:
Critic: For a given policy $\theta$, calculate the risk-sensitive value function $V_\theta$, and
Actor: Using the critic's $V_\theta$ and Theorem 5.2, estimate $\nabla_\theta \rho_\infty(\mathcal{M})$ and update $\theta$.
Space limitations prevent us from specifying the full details of our actor-critic algorithm and its analysis. In the following, we highlight only the key ideas and results. For the full details, we refer the reader to the full version of the paper, provided in the supplementary material.
For the critic, the main challenge is calculating the value function $V_\theta$ when the state space is large and dynamic programming cannot be applied due to the 'curse of dimensionality'. To overcome this, we exploit the fact that $V_\theta$ is equivalent to the value function in a robust MDP [24] and modify a recent algorithm of [37] to estimate it using function approximation.
For the actor, the main challenge is that in order to estimate the gradient using Thm. 5.2, we need to sample from an MDP with $\xi^*$-weighted transitions. In addition, $h_\theta$ involves an expectation for each state and action. Therefore, we propose a two-phase sampling procedure to estimate $\nabla_\theta \rho_\infty(\mathcal{M})$: we first use the critic's estimate of $V_\theta$ to derive $\xi^*$, and sample a trajectory from an MDP with $\xi^*$-weighted transitions. For each state in the trajectory, we then sample several next states to estimate $h_\theta$.
The convergence analysis of the actor-critic algorithm and the gradient error incurred from function approximation of $V_\theta$ are reported in the supplementary material.
6 Numerical Illustration
In this section, we illustrate our approach with a numerical example. The purpose of this illustration is to emphasize the importance of flexibility in designing the risk criterion: an appropriate risk measure should suit both the user's risk preference and the problem-specific properties.
We consider a trading agent that can invest in one of three assets (see Figure 1 for their distributions). The returns of the first two assets, $Z_1$ and $Z_2$, are normally distributed, while the return of the third asset, $Z_3$, has a heavy-tailed Pareto distribution. The mean of the return from $Z_3$ is 3 and its variance is infinite; such heavy-tailed distributions are widely used in financial modeling [27]. The agent selects an asset randomly, with probabilities determined by the policy parameter $\theta$. We trained three different policies: $\pi_1$, $\pi_2$, and $\pi_3$. Policy $\pi_1$ is risk-neutral, i.e., it optimizes the expected return, and was trained using standard policy gradient [18]. Policy $\pi_2$ is risk-averse, with a mean-semideviation objective, and was trained using the algorithm in Section 4. Policy $\pi_3$ is also risk-averse, with a mean-standard-deviation objective as proposed in [35, 26], and was trained using the algorithm of [35]. For each of these policies, Figure 1 shows the probability of selecting each asset vs. training iterations. Although one asset has the highest mean return, the risk-averse policy $\pi_2$ chooses an asset with a lower downside, as expected. However, because of the heavy upper tail of $Z_3$, policy $\pi_3$ opted to avoid it. This is counter-intuitive, as a rational investor should not avert high returns; in fact, in this case $Z_3$ stochastically dominates the asset chosen by $\pi_3$ [15].

7 Conclusion
We presented algorithms for estimating the gradient of both static and dynamic coherent risk measures using two new policy gradient style formulas that combine sampling with convex programming. Thereby, our approach extends risksensitive RL to the whole class of coherent risk measures, and generalizes several recent studies that focused on specific risk measures.
On the technical side, an important future direction is to improve the convergence rate of gradient estimates using importance sampling methods. This is especially important for risk criteria that are sensitive to rare events, such as the CVaR [3].
From a more conceptual point of view, the coherent-risk framework explored in this work provides the decision maker with flexibility in designing risk preference. As our numerical example shows, such flexibility is important for selecting appropriate problem-specific risk measures for managing cost variability. However, we believe that our approach has much more potential than that.
In almost every real-world application, uncertainty emanates from stochastic dynamics, but also, and perhaps more importantly, from modeling errors (model uncertainty). A prudent policy should protect against both types of uncertainty. The representation duality of coherent risk (Theorem 2.1) naturally relates risk to model uncertainty. In [24], a similar connection was made between model uncertainty in MDPs and dynamic Markov coherent risk. We believe that by carefully shaping the risk criterion, the decision maker may be able to take uncertainty into account in a broad sense. Designing a principled procedure for such risk shaping is not trivial and is beyond the scope of this paper. However, we believe there is much potential in risk shaping, as it may be the key to handling model misspecification in dynamic decision making.
References
 [1] C. Acerbi. Spectral measures of risk: a coherent representation of subjective risk aversion. Journal of Banking & Finance, 26(7):1505–1518, 2002.
 [2] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.
 [3] O. Bardou, N. Frikha, and G. Pagès. Computing VaR and CVaR using stochastic approximation and adaptive unconstrained importance sampling. Monte Carlo Methods and Applications, 15(3):173–210, 2009.
 [4] N. Bäuerle and J. Ott. Markov decision processes with averagevalueatrisk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
 [5] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 4th edition, 2012.
 [6] D. Bertsekas and J. Tsitsiklis. NeuroDynamic Programming. Athena Scientific, 1996.
 [7] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.
 [8] V. Borkar. A sensitivity formula for risksensitive cost and the actor–critic algorithm. Systems & Control Letters, 44(5):339–346, 2001.
 [9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2009.
 [10] Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In NIPS 27, 2014.
 [11] Y. Chow and M. Pavone. A unifying framework for timeconsistent, riskaverse model predictive control: theory and algorithms. In American Control Conference, 2014.
 [12] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.
 [13] A. Fiacco. Introduction to sensitivity and stability analysis in nonlinear programming. Elsevier, 1983.
 [14] M. Fu. Gradient estimation. In Simulation, volume 13 of Handbooks in Operations Research and Management Science, pages 575 – 616. Elsevier, 2006.
 [15] J. Hadar and W. R. Russell. Rules for ordering uncertain prospects. The American Economic Review, pages 25–34, 1969.
 [16] D. Iancu, M. Petrik, and D. Subramanian. Tight approximations of dynamic risk measures. arXiv:1106.6102, 2011.
 [17] V. Konda and J. Tsitsiklis. Actorcritic algorithms. In NIPS, 2000.
 [18] P. Marbach and J. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.
 [19] H. Markowitz. Portfolio Selection: Efficient Diversification of Investment. John Wiley and Sons, 1959.
 [20] F. Meng and H. Xu. A regularized sample average approximation method for stochastic mathematical programs with nonsmooth equality constraints. SIAM Journal on Optimization, 17(3):891–919, 2006.
 [21] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.
 [22] J. Moody and M. Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889, 2001.
 [23] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
 [24] T. Osogami. Robustness and risksensitivity in Markov decision processes. In NIPS, 2012.
 [25] M. Petrik and D. Subramanian. An approximate solution method for large riskaverse Markov decision processes. In UAI, 2012.
 [26] L. Prashanth and M. Ghavamzadeh. Actorcritic algorithms for risksensitive MDPs. In NIPS 26, 2013.
 [27] S. Rachev and S. Mittnik. Stable Paretian Models in Finance. John Wiley & Sons, New York, 2000.
 [28] R. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
 [29] R. Rockafellar, R. Wets, and M. Wets. Variational analysis, volume 317. Springer, 1998.
 [30] A. Ruszczyński. Riskaverse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235–261, 2010.
 [31] A. Ruszczyński and A. Shapiro. Optimization of convex risk functions. Math. OR, 31(3):433–452, 2006.
 [32] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming, chapter 6, pages 253–332. SIAM, 2009.
 [33] R. Sutton and A. Barto. Reinforcement learning: An introduction. Cambridge Univ Press, 1998.
 [34] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS 13, 2000.

 [35] A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In International Conference on Machine Learning, 2012.
 [36] A. Tamar, Y. Glassner, and S. Mannor. Optimizing the CVaR via sampling. In AAAI, 2015.
 [37] A. Tamar, S. Mannor, and H. Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, 2014.
Appendix A Proof of Theorem 4.2
First note from Assumption 2.2 that
 (i) Slater's condition holds in the primal optimization problem (1),
 (ii) the Lagrangian $L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I)$ is convex in $\lambda$ and concave in $\xi$.
Thus, by the duality result in convex optimization [9], the above conditions imply strong duality, i.e., the optimal values of the primal problem (1) and its dual coincide. From Assumption 2.2, one can also see that the family of functions is equi-differentiable in $\theta$ and Lipschitz, and as a result absolutely continuous in $\theta$; thus, the derivative is continuous and bounded at each $\theta$. Then, for every selection of a saddle point of (6), using the envelope theorem for saddle-point problems (see Theorem 4 of [21]), we have
(10) $\nabla_\theta \rho(Z) = \nabla_\theta L_\theta(\xi^*, \lambda^*).$
The result follows by writing the gradient in (10) explicitly and using the likelihood-ratio trick:
where the last equality is justified by Assumption 4.1.
Appendix B Gradient Results for Static MeanSemideviation
In this section, we consider the mean-semideviation risk measure, defined as follows:
(11) $\rho(Z) = \mathbb{E}[Z] + c\,\big(\mathbb{E}\big[(Z - \mathbb{E}[Z])_+^2\big]\big)^{1/2}.$
Following the derivation in [32], note that , where denotes the norm of the space . The norm may also be written as:
and hence
It follows that Eq. (1) holds with
For this case it will be more convenient to write Eq. (1) in the following form
(12) 
Let denote an optimal solution for (12). In [32] it is shown that is a contact point of , that is
and we have that
(13) 
Note that is not necessarily a probability distribution, but for , it can be shown [32] that always is.
In the following we show that may be used to write the gradient as an expectation, which will lead to a sampling algorithm for the gradient.
Proposition B.1.
Proof.
Note that in Eq. (12) the constraints do not depend on . Therefore, using the envelope theorem we obtain that
(14) 
We now write each of the terms in Eq. (14) as an expectation. We start with the following standard likelihood-ratio result:
Also, we have that
therefore, by the derivative of a product rule:
By the likelihood-ratio trick and Eq. (13), we have that
Proposition 4.3 naturally leads to a sampling-based gradient estimation algorithm, which we term GMSD (Gradient of the Mean-SemiDeviation). The algorithm is described in Algorithm 1.
1: Given:

Risk level

An i.i.d. sequence .
2: Set
3: Set
4: Set
5: Return:
Appendix C Consistency Proof
Let denote the probability space of the SAA functions (i.e., the randomness due to sampling).
Let denote the Lagrangian of the SAA problem
(15) 
Recall that denotes the set of saddle points of the true Lagrangian (6). Let denote the set of SAA Lagrangian (15) saddle points.
Suppose that there exists a compact set , where and such that:
 (i)

The set of Lagrangian saddle points is nonempty and bounded.
 (ii)

The functions for all and for all are finite valued and continuous (in ) on .
 (iii)

For large enough the set is nonempty and w.p. 1.
Recall from Assumption 2.2 that for each fixed , both and are continuous in . Furthermore, by the S.L.L.N. of Markov chains, for each policy parameter, we have w.p. 1. From the definition of the Lagrangian function and continuity of constraint functions, one can easily see that for each , w.p. 1. Denote with the deviation of set from set , i.e., . Further assume that:
 (iv)

If and converges w.p. 1 to a point , then .
According to the discussion on page 161 of [32], the Slater condition of Assumption 2.2 guarantees the following condition:
 (v)

For some point there exists a sequence such that w.p. 1,
and from Theorem 6.6 in [32], we know that both sets and are convex and compact. Furthermore, note that we have
 (vi)

The objective function on (1) is linear, finite valued and continuous in on (these conditions obviously hold for almost all in the integrand function ).
 (vii)

S.L.L.N. holds pointwise for any .
From (i,iv,v,vi,vii), and under the same lines of proof as in Theorem 5.5 of [32], we have that
(16) 
(17) 
In part 1 and part 2 of the following proof, we show, by following similar derivations as in Theorem 5.2, Theorem 5.3 and Theorem 5.4 of [32], that w.p. 1 and