1 Introduction
Decision making within the Markov decision process (MDP) framework typically involves the minimization of a risk-neutral performance objective, namely the expected total discounted cost [3]. This approach, while very popular, natural, and attractive from a computational standpoint, neither takes into account the variability of the cost (i.e., fluctuations around the mean), nor its sensitivity to modeling errors, which may significantly affect overall performance [12]. Risk-sensitive MDPs [9] address the first aspect by replacing the risk-neutral expectation with a risk measure of the total discounted cost, such as variance, Value-at-Risk (VaR), or Conditional-VaR (CVaR). Robust MDPs [15], on the other hand, address the second aspect by defining a set of plausible MDP parameters, and optimizing decisions with respect to the worst-case scenario.

In this work we consider risk-sensitive MDPs with a CVaR objective, referred to as CVaR MDPs. CVaR [1, 19] is a risk measure that is rapidly gaining popularity in various engineering applications, e.g., finance, due to its favorable computational properties [1] and superior ability to safeguard a decision maker from the "outcomes that hurt the most" [21]. In this paper, by relating risk to robustness, we derive a novel result that further motivates the usage of a CVaR objective in a decision-making context. Specifically, we show that the CVaR of a discounted cost in an MDP is equivalent to the expected value of the same discounted cost in the presence of worst-case perturbations of the MDP parameters (specifically, transition probabilities), provided that such perturbations are within a certain error budget. This result suggests CVaR MDPs as a method for decision making under both cost variability and model uncertainty, motivating them as a unified framework for planning under uncertainty.

Literature review: Risk-sensitive MDPs have been studied for over four decades, with earlier efforts focusing on exponential utility [9], mean-variance [23], and percentile risk criteria [7]. Recently, for the reasons explained above, several authors have investigated CVaR MDPs [19]. Specifically, in [4], the authors propose a dynamic programming algorithm for finite-horizon risk-constrained MDPs where risk is measured according to CVaR. The algorithm is proven to asymptotically converge to an optimal risk-constrained policy. However, the algorithm involves computing integrals over continuous variables (Algorithm 1 in [4]) and, in general, its implementation appears particularly difficult. In [2], the authors investigate the structure of CVaR-optimal policies and show that a Markov policy is optimal on an augmented state space, where the additional (continuous) state variable is represented by the running cost. In [8]
, the authors leverage this result to design an algorithm for CVaR MDPs that relies on discretizing occupation measures in the augmented-state MDP. This approach, however, involves solving a non-convex program via a sequence of linear-programming approximations, which can only be shown to converge asymptotically. A different approach is taken by [5] and [24], which consider a finite-dimensional parameterization of control policies, and show that a CVaR MDP can be optimized to a local optimum using stochastic gradient descent (policy gradient). A recent result by Pflug and Pichler [17] showed that CVaR MDPs admit a dynamic programming formulation by using a state-augmentation procedure different from the one in [2]. The augmented state is also continuous, making the design of a solution algorithm challenging.

Contributions: The contribution of this paper is twofold. First, as discussed above, we provide a novel interpretation for CVaR MDPs in terms of robustness to modeling errors. This result is of independent interest and further motivates the usage of CVaR MDPs for decision making under uncertainty. Second, we provide a new optimization algorithm for CVaR MDPs, which leverages the state-augmentation procedure introduced by Pflug and Pichler [17]. We overcome the aforementioned computational challenges (due to the continuous augmented state) by designing an algorithm that merges approximate value iteration [3] with linear interpolation. Remarkably, we are able to provide explicit error bounds and convergence rates based on contraction-style arguments. In comparison to the algorithms in [4, 8, 5, 24], our approach leads to finite-time error guarantees with respect to the globally optimal policy. In addition, our algorithm is significantly simpler than previous methods, and calculates the optimal policy for all CVaR confidence intervals and initial states simultaneously. The practicality of our approach is demonstrated in numerical experiments involving planning a path on a grid with thousands of states. To the best of our knowledge, this is the first algorithm to compute globally optimal policies for non-trivial CVaR MDPs.
Organization: This paper is structured as follows. In Section 2 we provide background on CVaR and MDPs, state the problem we wish to solve (i.e., CVaR MDPs), and motivate the CVaR MDP formulation by establishing a novel relation between CVaR and model perturbations. Section 3 provides the basis for our solution algorithm, based on a Bellman-style equation for the CVaR. Then, in Section 4 we present our algorithm and correctness analysis. In Section 5 we evaluate our approach via numerical experiments. Finally, in Section 6, we draw some conclusions and discuss directions for future work.
2 Preliminaries, Problem Formulation, and Motivation
2.1 Conditional Value-at-Risk
Let $Z$ be a bounded-mean random variable, i.e., $\mathbb{E}[|Z|] < \infty$, on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, with cumulative distribution function $F(z) = \mathbb{P}(Z \leq z)$. In this paper we interpret $Z$ as a cost. The value-at-risk (VaR) at confidence level $\alpha \in (0,1]$ is the $1-\alpha$ quantile of $Z$, i.e., $\mathrm{VaR}_\alpha(Z) = \min\{z \mid F(z) \geq 1-\alpha\}$. The conditional value-at-risk (CVaR) at confidence level $\alpha$ is defined as [19]:

$$\mathrm{CVaR}_\alpha(Z) = \min_{w \in \mathbb{R}} \left\{ w + \frac{1}{\alpha}\,\mathbb{E}\big[(Z - w)^+\big] \right\}, \quad (1)$$

where $(x)^+ = \max(x, 0)$ represents the positive part of $x$. If there is no probability atom at $\mathrm{VaR}_\alpha(Z)$, it is well known that $\mathrm{CVaR}_\alpha(Z) = \mathbb{E}[Z \mid Z \geq \mathrm{VaR}_\alpha(Z)]$. Therefore, CVaR may be interpreted as the worst-case expected value of $Z$, conditioned on the $\alpha$-portion of the tail distribution. It is well known that $\mathrm{CVaR}_\alpha(Z)$ is decreasing in $\alpha$, equals $\mathbb{E}[Z]$ for $\alpha = 1$, and tends to $\max Z$ as $\alpha \downarrow 0$. During the last decade, the CVaR risk measure has gained popularity in financial applications, among others. It is especially useful for controlling rare, but potentially disastrous events that occur in the tail of the distribution beyond the VaR, and are therefore neglected by the VaR measure [21]. Furthermore, CVaR enjoys desirable axiomatic properties, such as coherence [1]. We refer to [25] for further motivation about CVaR and a comparison with other risk measures such as VaR.
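Definition (1) is straightforward to evaluate on an empirical cost distribution. The following sketch is an illustration, not part of the paper's method; the function name and sample values are our own. It minimizes the Rockafellar-Uryasev objective in (1) over the sample points, where the minimum is attained for an empirical distribution:

```python
import numpy as np

def cvar(costs, alpha):
    """Empirical CVaR via formula (1): min_w { w + E[(Z - w)^+] / alpha }.

    For an empirical distribution the minimizer w* is attained at a
    sample point, so it suffices to evaluate the objective at every sample."""
    z = np.asarray(costs, dtype=float)
    objectives = [w + np.mean(np.maximum(z - w, 0.0)) / alpha for w in z]
    return min(objectives)

z = np.array([1.0, 2.0, 3.0, 10.0])
print(cvar(z, 1.0))   # mean = 4.0
print(cvar(z, 0.25))  # average of the worst 25% of samples = 10.0
```

Consistent with the properties stated above, at $\alpha = 1$ the result equals the mean, and as $\alpha$ shrinks it approaches the maximum cost.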
A useful property of CVaR, which we exploit in this paper, is its alternative dual representation [1]:

$$\mathrm{CVaR}_\alpha(Z) = \max_{\xi \in \mathcal{U}_{\mathrm{CVaR}}(\alpha, \mathbb{P})} \mathbb{E}_\xi[Z], \quad (2)$$

where $\mathbb{E}_\xi[Z]$ denotes the $\xi$-weighted expectation of $Z$, and the risk envelope $\mathcal{U}_{\mathrm{CVaR}}$ is given by

$$\mathcal{U}_{\mathrm{CVaR}}(\alpha, \mathbb{P}) = \left\{ \xi \,:\, \xi(\omega) \in \left[0, \tfrac{1}{\alpha}\right], \; \int_\omega \xi(\omega)\,\mathbb{P}(\omega)\,d\omega = 1 \right\}.$$

Thus, the CVaR of a random variable $Z$ may be interpreted as the worst-case expectation of $Z$ under a perturbed distribution $\xi\mathbb{P}$.
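For a discrete distribution, the maximization in (2) has a simple greedy solution: the optimal $\xi$ places the maximal weight $1/\alpha$ on the largest costs until the total perturbed mass reaches one. A minimal sketch (function name and example data are our own):

```python
import numpy as np

def cvar_dual(values, probs, alpha):
    """Worst-case expectation max_{xi} E[xi * Z] over the CVaR risk
    envelope {xi in [0, 1/alpha], E[xi] = 1} of Eq. (2).
    The optimum greedily loads xi = 1/alpha onto the largest costs."""
    order = np.argsort(values)[::-1]       # worst (largest) costs first
    xi = np.zeros(len(values))
    budget = 1.0                           # total xi-weighted mass must equal 1
    for i in order:
        w = min(probs[i] / alpha, budget)  # mass xi(i) * p(i) assigned here
        xi[i] = w / probs[i]
        budget -= w
        if budget <= 0:
            break
    return float(np.dot(xi * probs, values))

vals = np.array([0.0, 1.0, 4.0])
p = np.array([0.5, 0.25, 0.25])
print(cvar_dual(vals, p, 0.5))  # worst 50% of mass: {4 w.p. .25, 1 w.p. .25} -> 2.5
```

Note that at $\alpha = 1$ the envelope collapses to $\xi \equiv 1$ and the result is the plain expectation, matching the primal definition.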
In this paper, we are interested in the CVaR of the total discounted cost in a sequential decisionmaking setting, as discussed next.
2.2 Markov Decision Processes
An MDP is a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, C, P, \gamma, x_0)$, where $\mathcal{X}$ and $\mathcal{A}$ are finite state and action spaces; $C(x, a)$ is a bounded deterministic cost; $P(\cdot \mid x, a)$ is the transition probability distribution; $\gamma \in [0, 1)$ is the discounting factor, and $x_0$ is the initial state. (Our results easily generalize to random initial states and random costs.)

Let the space of admissible histories up to time $t$ be $H_t = H_{t-1} \times \mathcal{A} \times \mathcal{X}$, for $t \geq 1$, and $H_0 = \mathcal{X}$. A generic element $h_t \in H_t$ is of the form $h_t = (x_0, a_0, \ldots, x_{t-1}, a_{t-1}, x_t)$. Let $\Pi_{H,t}$ be the set of all deterministic history-dependent policies with the property that at each time $t$ the control is a function of $h_t$. In other words, $\Pi_{H,t} = \{\mu_0 : H_0 \to \mathcal{A},\; \mu_1 : H_1 \to \mathcal{A}, \ldots,\; \mu_t : H_t \to \mathcal{A}\}$. We also let $\Pi_H = \lim_{t \to \infty} \Pi_{H,t}$ be the set of all history-dependent policies.
2.3 Problem Formulation
Let $C(x_t, a_t)$ denote the stage-wise costs observed along a state/control trajectory in the MDP model, and let $C_{0,T} = \sum_{t=0}^{T} \gamma^t C(x_t, a_t)$ denote the total discounted cost up to time $T$. The risk-sensitive discounted-cost problem we wish to address is as follows:

$$\min_{\pi \in \Pi_H} \; \mathrm{CVaR}_\alpha\Big(\lim_{T \to \infty} C_{0,T} \,\Big|\, x_0, \pi\Big), \quad (3)$$

where $\pi = \{\mu_0, \mu_1, \ldots\}$ is the policy sequence with actions $a_t = \mu_t(h_t)$ for $t \in \{0, 1, \ldots\}$. We refer to problem (3) as a CVaR MDP. (One may also consider a related formulation combining mean and CVaR, the details of which are presented in the supplementary material.)
The problem formulation in (3) directly addresses the aspect of risk sensitivity, as demonstrated by the numerous applications of CVaR optimization in finance (see, e.g., [20, 11, 6]) and the recent approaches for CVaR optimization in MDPs [4, 8, 5, 24]. In the following, we show a new result providing additional motivation for CVaR MDPs, from the point of view of robustness to modeling errors.
2.4 Motivation - Robustness to Modeling Errors
We show a new result relating the CVaR objective in (3) to the expected discounted cost in the presence of worst-case perturbations of the MDP parameters, where the perturbations are budgeted according to the "number of things that can go wrong." Thus, by minimizing CVaR, the decision maker also guarantees robustness of the policy.
Consider a trajectory $h_T = (x_0, a_0, \ldots, x_{T-1}, a_{T-1}, x_T)$ in a finite-horizon MDP problem with transitions $P_t(x_{t+1} \mid x_t, a_t)$. We explicitly denote the time index of the transition matrices for reasons that will become clear shortly. The total probability of the trajectory is $\mathbb{P}(h_T) = \prod_{t=0}^{T-1} P_t(x_{t+1} \mid x_t, a_t)$, and we let $C(h_T) = \sum_{t=0}^{T} \gamma^t C(x_t, a_t)$ denote its discounted cost, as defined above.
We consider an adversarial setting, where an adversary is allowed to change the transition probabilities at each stage, under some budget constraints. We will show that, for a specific budget and perturbation structure, the expected cost under the worstcase perturbation is equivalent to the CVaR of the cost. Thus, we shall establish that, in this perspective, being risk sensitive is equivalent to being robust against model perturbations.
For each stage $0 \leq t \leq T-1$, consider a perturbed transition matrix $\hat{P}_t = P_t \circ \delta_t$, where $\delta_t$ is a multiplicative probability perturbation and $\circ$ is the Hadamard product, under the condition that $\hat{P}_t$ is a stochastic matrix. Let $\Delta_t$ denote the set of perturbation matrices that satisfy this condition, and let $\Delta = \Delta_0 \times \cdots \times \Delta_{T-1}$ denote the set of all possible perturbations to the trajectory distribution.

We now impose a budget constraint on the perturbations as follows. For some budget $\eta \geq 1$, we consider the constraint

$$\delta_0(x_1 \mid x_0, a_0) \cdot \delta_1(x_2 \mid x_1, a_1) \cdots \delta_{T-1}(x_T \mid x_{T-1}, a_{T-1}) \leq \eta, \quad \forall h_T. \quad (4)$$

Essentially, the product in Eq. (4) states that the worst cannot happen at each time: the perturbation budget has to be split (multiplicatively) along the trajectory. We note that Eq. (4) is in fact a constraint on the perturbation matrices, and we denote by $\Delta_\eta \subseteq \Delta$ the set of perturbations that satisfy this constraint with budget $\eta$. The following result shows an equivalence between the CVaR and the worst-case expected loss.
Proposition (Interpretation of CVaR as a Robustness Measure). For a perturbation budget $\eta \geq 1$, it holds that

$$\sup_{\delta \in \Delta_\eta} \mathbb{E}_{\hat{\mathbb{P}}}\big[C(h_T)\big] = \mathrm{CVaR}_{1/\eta}\big(C(h_T)\big),$$

where $\Delta_\eta$ is the set of budget-constrained perturbations defined above, and $\hat{\mathbb{P}}$ is the trajectory distribution induced by the perturbed transition matrices $\hat{P}_t = P_t \circ \delta_t$.
The proof of Proposition 2.4 is in the supplementary material. It is instructive to compare Proposition 2.4 with the dual representation of CVaR in (2). Note, in particular, that the perturbation budget in Proposition 2.4 has a temporal structure, which constrains the adversary from choosing the worst perturbation at each time step.
An equivalence between robustness and risk sensitivity was previously suggested by Osogami [16]. In that study, the iterated (dynamic) coherent risk was shown to be equivalent to a robust MDP [10] with a rectangular uncertainty set. The iterated risk (and, correspondingly, the rectangular uncertainty set) is very conservative [26], in the sense that the worst can happen at each time step. In contrast, the perturbations considered here are much less conservative. In general, solving robust MDPs without the rectangularity assumption is NP-hard. Nevertheless, Mannor et al. [13] showed that, for cases where the number of perturbations to the parameters along a trajectory is upper bounded (budget-constrained perturbation), the corresponding robust MDP problem is tractable. Analogous to the constraint set (1) in [13], the perturbation set in Proposition 2.4 limits the total number of log-perturbations along a trajectory. Accordingly, we shall later see that optimizing problem (3) with perturbation structure (4) is indeed also tractable.
The next section provides the fundamental theoretical ideas behind our approach to the solution of (3).
3 Bellman Equation for CVaR
In this section, by leveraging a recent result from [17], we present a dynamic programming (DP) formulation for the CVaR MDP problem in (3). As we shall see, the value function in this formulation depends on both the state and the CVaR confidence level. We then establish important properties of this DP formulation, which will later enable us to derive an efficient DP-based approximate solution algorithm and provide correctness guarantees on the approximation error. All proofs are presented in the supplementary material.
Our starting point is a recursive decomposition of CVaR, whose proof is detailed in Theorem 10 of [17].
[CVaR Decomposition Theorem, [17]] For any $t \geq 0$, denote by $Z_t = \sum_{j=t}^{\infty} \gamma^{j-t} C(x_j, a_j)$ the discounted cost from time $t$ onwards. The conditional CVaR under policy $\pi$, i.e., $\mathrm{CVaR}_\alpha(Z_t \mid h_t, \pi)$, obeys the following decomposition:

$$\mathrm{CVaR}_\alpha(Z_t \mid h_t, \pi) = C(x_t, a_t) + \gamma \max_{\xi \in \mathcal{U}_{\mathrm{CVaR}}(\alpha, P(\cdot \mid x_t, a_t))} \mathbb{E}\big[\xi(x_{t+1})\,\mathrm{CVaR}_{\alpha\xi(x_{t+1})}(Z_{t+1} \mid h_{t+1}, \pi) \,\big|\, h_t, \pi\big],$$

where $a_t$ is the action induced by policy $\pi$, and the expectation is with respect to $x_{t+1} \sim P(\cdot \mid x_t, a_t)$. Theorem 3 concerns a fixed policy $\pi$; we now extend it to a general DP formulation. Note that in the recursive decomposition in Theorem 3 the right-hand side involves CVaR terms with confidence levels $\alpha\xi(x_{t+1})$ that differ from the confidence level $\alpha$ on the left-hand side. Accordingly, we augment the state space $\mathcal{X}$ with an additional continuous state $y \in (0,1]$, which corresponds to the confidence level. For any $x \in \mathcal{X}$ and $y \in (0,1]$, the value function for the augmented state $(x, y)$ is defined as:

$$V(x, y) = \min_{\pi \in \Pi_H} \mathrm{CVaR}_y\Big(\lim_{T \to \infty} C_{0,T} \,\Big|\, x_0 = x, \pi\Big).$$

Similar to standard DP, it is convenient to work with operators defined on the space of value functions [3]. In our case, Theorem 3 leads to the following definition of the CVaR Bellman operator $\mathbf{T}$:

$$\mathbf{T}[V](x, y) = \min_{a \in \mathcal{A}} \bigg[ C(x, a) + \gamma \max_{\xi \in \mathcal{U}_{\mathrm{CVaR}}(y, P(\cdot \mid x, a))} \sum_{x' \in \mathcal{X}} \xi(x')\,P(x' \mid x, a)\,V\big(x', y\,\xi(x')\big) \bigg]. \quad (6)$$
We now establish several useful properties of the Bellman operator $\mathbf{T}$. [Properties of CVaR Bellman Operator] The Bellman operator $\mathbf{T}$ has the following properties:

(Contraction.) $\|\mathbf{T}[V_1] - \mathbf{T}[V_2]\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$, where $\|V\|_\infty = \sup_{x \in \mathcal{X},\, y \in (0,1]} |V(x, y)|$.

(Concavity preserving in $y$.) For any $x \in \mathcal{X}$, suppose $y V(x, y)$ is concave in $y \in (0,1]$. Then the maximization problem in (6) is concave. Furthermore, $y\,\mathbf{T}[V](x, y)$ is concave in $y$.
The first property in Lemma 6 is similar to standard DP [3], and is instrumental to the design of a converging value-iteration approach. The second property is nonstandard and specific to our approach. It will be used to show that the computation of value-iteration updates involves concave, and therefore tractable, optimization problems. Furthermore, it will be used to show that a linear interpolation of $y V(x, y)$ in the augmented state $y$ has a bounded error.
Equipped with the results in Theorem 3 and Lemma 6, we can now show that the fixed-point solution of $\mathbf{T}[V] = V$ is unique, and equals the solution of the CVaR MDP problem (3) with $x_0 = x$ and $\alpha = y$. [Optimality Condition] For any $x \in \mathcal{X}$ and $y \in (0,1]$, the solution to $\mathbf{T}[V](x, y) = V(x, y)$ is unique, and equals $V(x, y) = \min_{\pi \in \Pi_H} \mathrm{CVaR}_y(\lim_{T \to \infty} C_{0,T} \mid x_0 = x, \pi)$. Next, we show that the optimal value of the CVaR MDP problem (3) can be attained by a stationary Markov policy, defined as a greedy policy with respect to the value function $V(x, y)$. Thus, while the original problem is defined over the intractable space of history-dependent policies, a stationary Markov policy (over the augmented state space) is optimal, and can be readily derived from $V(x, y)$. Furthermore, an optimal history-dependent policy can be readily obtained from an (augmented) optimal Markov policy according to the following theorem. [Optimal Policies] Let $\pi^* = \{\mu_0, \mu_1, \ldots\} \in \Pi_H$ be a history-dependent policy recursively defined as:
$$\mu_k(h_k) = u^*(x_k, y_k), \quad \forall k \geq 0, \quad (7)$$

with initial conditions $x_0$ and $y_0 = \alpha$, and state transitions

$$x_k \sim P\big(\cdot \mid x_{k-1}, u^*(x_{k-1}, y_{k-1})\big), \qquad y_k = y_{k-1}\,\xi^*_{x_{k-1}, y_{k-1}}(x_k), \quad (8)$$

where the stationary Markovian policy $u^*(x, y)$ and risk factor $\xi^*_{x, y}(\cdot)$ are solutions to the min-max optimization problem in the CVaR Bellman operator (6), evaluated at the optimal value function. Then, $\pi^*$ is an optimal policy for problem (3) with initial state $x_0$ and CVaR confidence level $\alpha$.
The two theorems above suggest that a value-iteration DP method [3] can be used to solve the CVaR MDP problem (3). Let an initial value-function guess $V_0 : \mathcal{X} \times (0,1] \to \mathbb{R}$ be chosen arbitrarily. Value iteration proceeds recursively as follows:

$$V_{k+1}(x, y) = \mathbf{T}[V_k](x, y), \quad \forall (x, y) \in \mathcal{X} \times (0,1], \; k \in \{0, 1, \ldots\}. \quad (9)$$

Specifically, by combining the contraction property in Lemma 6 and the uniqueness result for fixed-point solutions from Theorem 3, one concludes that $\lim_{k \to \infty} V_k(x, y)$ equals the optimal value function $V(x, y)$. By selecting $x = x_0$ and $y = \alpha$, one immediately obtains the optimal value of problem (3). Furthermore, an optimal policy may be derived from $V(x, y)$ according to the policy construction procedure in Theorem 3.
Unfortunately, while value iteration is conceptually appealing, its direct implementation in our setting is generally impractical since, e.g., the augmented state $y$ is continuous. In the following, we pursue an approximation to the value-iteration algorithm (9), based on a linear interpolation scheme for $y V(x, y)$.
4 Value Iteration with Linear Interpolation
In this section we present an approximate DP algorithm for solving CVaR MDPs, based on the theoretical results of Section 3. The value-iteration algorithm in Eq. (9) presents two main implementation challenges. The first is due to the fact that the augmented state $y$ is continuous. We handle this challenge by using interpolation, and exploit the concavity of $y V(x, y)$ to bound the error introduced by this procedure. The second challenge stems from the fact that applying $\mathbf{T}$ involves maximizing over $\xi$. Our strategy is to exploit the concavity of the maximization problem to guarantee that such optimization can indeed be performed effectively.
As discussed, our approach relies on the fact that the Bellman operator $\mathbf{T}$ preserves concavity, as established in Lemma 6. Accordingly, we require the following assumption for the initial guess $V_0(x, y)$.

Assumption 1

The guess for the initial value function satisfies the following properties: 1) $y V_0(x, y)$ is concave in $y \in (0,1]$ and 2) $V_0(x, y)$ is continuous in $y \in (0,1]$ for any $x \in \mathcal{X}$.

Assumption 1 may easily be satisfied, for example, by choosing $V_0(x, y) = \mathrm{CVaR}_y(Z \mid x_0 = x)$, where $Z$ is any arbitrary bounded random variable.
1: Given:

- $N(x)$ interpolation points $\mathbf{Y}(x) = \{y_1, \ldots, y_{N(x)}\} \in [0, 1]^{N(x)}$ for every $x \in \mathcal{X}$, with $y_i < y_{i+1}$, $y_1 = 0$, and $y_{N(x)} = 1$.

- An initial value function $V_0(x, y)$ that satisfies Assumption 1.

2: For $t = 1, 2, \ldots$: for each $x \in \mathcal{X}$ and each $y_i \in \mathbf{Y}(x)$, update $V_t(x, y_i) = \mathbf{T}_I[V_{t-1}](x, y_i)$.

3: Set the converged value-iteration estimate as $\widehat{V}^*(x, y_i)$, for any $x \in \mathcal{X}$ and $y_i \in \mathbf{Y}(x)$.
As stated earlier, a key difficulty in applying value iteration (9) is that, for each state $x \in \mathcal{X}$, the Bellman operator has to be calculated for each $y \in (0,1]$, and $y$ is continuous. As an approximation, we propose to calculate the Bellman operator only for a finite set of values $y_i$, and interpolate the value function between such interpolation points.

Formally, let $N(x)$ denote the number of interpolation points at state $x$. For every $x \in \mathcal{X}$, denote by $\mathbf{Y}(x) = \{y_1, \ldots, y_{N(x)}\}$ the set of interpolation points. We denote by $\mathcal{I}_x[V](y)$ the linear interpolation of the function $y V(x, y)$ on these points, i.e.,

$$\mathcal{I}_x[V](y) = y_i V(x, y_i) + \frac{y_{i+1} V(x, y_{i+1}) - y_i V(x, y_i)}{y_{i+1} - y_i}\,(y - y_i),$$

where $y_i = \max\{y' \in \mathbf{Y}(x) : y' \leq y\}$. The interpolation of $y V(x, y)$ instead of $V(x, y)$ is key to our approach. The motivation is twofold: first, it can be shown [19] that for a discrete random variable $Z$, $y\,\mathrm{CVaR}_y(Z)$ is piecewise linear in $y$. Second, one can show that the Lipschitzness of $y V(x, y)$ is preserved during value iteration, and this fact can be exploited to bound the linear interpolation error. We now define the interpolated Bellman operator $\mathbf{T}_I$ as follows:

$$\mathbf{T}_I[V](x, y) = \min_{a \in \mathcal{A}} \bigg[ C(x, a) + \gamma \max_{\xi \in \mathcal{U}_{\mathrm{CVaR}}(y, P(\cdot \mid x, a))} \sum_{x' \in \mathcal{X}} P(x' \mid x, a)\,\frac{\mathcal{I}_{x'}[V]\big(y\,\xi(x')\big)}{y} \bigg]. \quad (10)$$

Notice that by L'Hospital's rule the ratio $\mathcal{I}_{x'}[V](y\,\xi(x'))/y$ has a well-defined limit as $y \to 0$. This implies that at $y = 0$ the interpolated Bellman operator is equivalent to the original Bellman operator, i.e., $\mathbf{T}_I[V](x, 0) = \mathbf{T}[V](x, 0)$.
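The interpolation $\mathcal{I}_x[V]$ is an ordinary piecewise-linear interpolation applied to $y \mapsto y V(x, y)$ rather than to $V$ itself. A minimal sketch, assuming a fixed grid of interpolation points (function names are our own):

```python
import numpy as np

def interp_yV(ys, Vs, y):
    """Linear interpolation I_x[V](y) of the function y * V(x, y),
    given interpolation points ys and values Vs = V(x, ys).
    Interpolating y*V rather than V preserves the piecewise-linear
    structure of y -> y * CVaR_y discussed in the text."""
    yv = np.asarray(ys, dtype=float) * np.asarray(Vs, dtype=float)
    return float(np.interp(y, ys, yv))  # ys must be increasing

def V_approx(ys, Vs, y):
    """Approximate value V(x, y) recovered as I_x[V](y) / y, for y > 0."""
    return interp_yV(ys, Vs, y) / y

# On a grid point the interpolation is exact:
ys = [0.0, 0.25, 0.5, 1.0]
Vs = [5.0, 4.0, 3.0, 2.0]
print(V_approx(ys, Vs, 0.5))  # 3.0
```

Between grid points the approximation is linear in $y V$, which (by the concavity of $y V$) yields the one-sided error bounded in the theorem below.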
Algorithm 1 presents CVaR value iteration with linear interpolation. The only difference between this algorithm and standard value iteration (9) is the linear interpolation procedure described above. In the following, we show that Algorithm 1 converges, and we bound the error due to interpolation. We begin by showing that the useful properties established in Lemma 6 for the Bellman operator $\mathbf{T}$ extend to the interpolated Bellman operator $\mathbf{T}_I$. [Properties of Interpolated Bellman Operator] $\mathbf{T}_I$ has the same properties as $\mathbf{T}$ in Lemma 6, namely 1) contraction and 2) concavity preservation.
Lemma 1 implies several important consequences for Algorithm 1. The first is that the maximization problem in (10) is concave, and thus may be solved efficiently at each step; this guarantees that the algorithm is tractable. Second, the contraction property in Lemma 1 guarantees that Algorithm 1 converges, i.e., there exists a value function $\widehat{V}^*$ such that $\lim_{n \to \infty} \mathbf{T}_I^n[V_0] = \widehat{V}^*$. In addition, the convergence rate is geometric and equal to $\gamma$.
The following theorem provides an error bound between approximate value iteration and exact value iteration (9) in terms of the interpolation resolution. [Convergence and Error Bound] Suppose the initial value function $V_0(x, y)$ satisfies Assumption 1 and let $\epsilon > 0$ be an error tolerance parameter. For any state $x \in \mathcal{X}$ and step $t \geq 0$, choose $y_2 > 0$ sufficiently small and update the interpolation points according to the logarithmic rule $y_{i+1} = \theta y_i$, $\forall i \geq 2$, with uniform constant $\theta > 1$. Then, Algorithm 1 has the following error bound:
and the following finite-time convergence error bound:
Theorem 1 shows that 1) the interpolation-based value function is a conservative estimate for the optimal solution to problem (3); 2) the interpolation procedure is consistent, i.e., as the number of interpolation points grows arbitrarily large (so that $\epsilon \to 0$ and $\theta \to 1$), the approximation error tends to zero; and 3) the approximation error bound is $O(\theta - 1)$, where $\log\theta$ is the (uniform) log-difference of the interpolation points, i.e., $\log\theta = \log y_{i+1} - \log y_i$, $\forall i \geq 2$.
For a prespecified , the condition may be satisfied by a simple adaptive procedure for selecting the interpolation points . At each iteration , after calculating in Algorithm 1, at each state in which the condition does not hold, add a new interpolation point , and additional points between and such that the condition is maintained. Since all the additional points belong to the segment , the linearly interpolated remains unchanged, and Algorithm 1 proceeds as is. For bounded costs and , the number of additional points required is bounded.
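A grid with a uniform ratio between consecutive positive points, as the logarithmic rule requires, is easy to construct. The sketch below assumes a smallest positive point `y_min` and a total of `N` points including $y_1 = 0$ (names and parameters are our own, for illustration):

```python
import numpy as np

def log_spaced_points(y_min, N):
    """Interpolation grid {0, y_min, theta*y_min, ..., 1} with a
    uniform ratio theta between consecutive positive points,
    i.e., y_{i+1} = theta * y_i for i >= 2."""
    theta = (1.0 / y_min) ** (1.0 / (N - 2))  # (N-2) multiplicative steps from y_min up to 1
    ys = y_min * theta ** np.arange(N - 1)    # the positive points y_2, ..., y_N
    return np.concatenate(([0.0], ys)), theta

ys, theta = log_spaced_points(1e-3, 12)
```

Shrinking `theta` toward 1 (by increasing `N`) tightens the $O(\theta - 1)$ term of the error bound, at the cost of more interpolation points per state.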
The full proof of Theorem 1 is detailed in the supplementary material; here we highlight the main ideas and challenges involved. In the first part of the proof we bound, for all $t$, the Lipschitz constant of $y V_t(x, y)$ in $y$. The key to this result is to show that the Bellman operator $\mathbf{T}$ preserves the Lipschitz property of $y V(x, y)$. Using the Lipschitz bound and the concavity of $y V(x, y)$, we then bound the linear interpolation error for all $y$. The condition on $y_2$ is required for this bound to hold as $y \to 0$. Finally, we use this result to bound the distance between $\mathbf{T}_I[V]$ and $\mathbf{T}[V]$. The results of Theorem 1 then follow from contraction arguments, similar to approximate dynamic programming [3].
5 Experiments
We validate Algorithm 1 on a rectangular grid world, where states represent grid points on a 2D terrain map. An agent (e.g., a robotic vehicle) starts in a safe region and its objective is to travel to a given destination. At each time step the agent can move to any of its four neighboring states. Due to sensing and control noise, however, with some probability the agent instead moves to a random neighboring state. The stage-wise cost of each move until reaching the destination is fixed, to account for fuel usage. Between the starting point and the destination there are a number of obstacles that the agent should avoid. Hitting an obstacle incurs a large penalty cost and terminates the mission. The objective is to compute a safe (i.e., obstacle-free) path that is fuel efficient.
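For concreteness, the transition kernel of such a grid world can be sketched as follows. The noise level `eps`, the grid size, and all names are our own assumptions; the paper's exact parameters are not specified in this excerpt:

```python
import numpy as np

def grid_transitions(rows, cols, eps):
    """Transition kernel P[a, s, s'] for a noisy grid world: the
    intended neighbor is reached w.p. 1 - eps, and a uniformly
    random neighbor w.p. eps. Moves off the grid leave the agent
    in place. (eps and the grid size are illustrative assumptions.)"""
    n = rows * cols
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def step(r, c, dr, dc):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            return nr * cols + nc
        return r * cols + c  # bounce back in place at the boundary

    P = np.zeros((4, n, n))
    for r in range(rows):
        for c in range(cols):
            s = r * cols + c
            for a, (dr, dc) in enumerate(moves):
                P[a, s, step(r, c, dr, dc)] += 1.0 - eps
                for (dr2, dc2) in moves:  # noise: uniformly random neighbor
                    P[a, s, step(r, c, dr2, dc2)] += eps / 4.0
    return P

P = grid_transitions(4, 5, 0.05)
assert np.allclose(P.sum(axis=2), 1.0)  # each row is a probability distribution
```

Obstacle and destination states would additionally be made absorbing, with the corresponding penalty or terminal cost attached.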
For our experiments, we choose a grid world with a total of 3,312 states (see Figure 1). The destination and the obstacles (plotted in yellow) are shown in the figure. By leveraging Theorem 1, we use log-spaced interpolation points for Algorithm 1 in order to achieve a small value-function error. We choose a discount factor corresponding to an effective horizon of 200 steps. Furthermore, we set the obstacle penalty cost to a large value: this choice trades off a high penalty for collisions against computational complexity (which increases as the penalty increases).
In Figure 1 we plot the value function for three different values of the CVaR confidence parameter $\alpha$, and the corresponding paths starting from the initial position. The first three panels of Figure 1 show that, as the confidence parameter decreases, the average travel distance (and hence fuel consumption) slightly increases, while the collision probability decreases, as expected. We next discuss robustness to modeling errors. We conducted simulations in which each obstacle position is perturbed, with some probability, in a random direction to one of the neighboring grid cells. This emulates, for example, measurement errors in the terrain map. We then trained both a risk-averse (small $\alpha$) and a risk-neutral ($\alpha = 1$) policy on the nominal (i.e., unperturbed) terrain map, and evaluated them on perturbed scenarios (multiple perturbed maps with repeated Monte Carlo evaluations each). While the risk-neutral policy finds a shorter route (with a lower average cost on successful runs), it is vulnerable to perturbations and fails more often. In contrast, the risk-averse policy chooses slightly longer routes, but is much more robust to model perturbations, failing far less often.
For the computation of Algorithm 1 we represented the concave piecewise linear maximization problem in (10) as a linear program, and concatenated several problems to reduce repeated overhead stemming from the initialization of the CPLEX linear programming solver. This resulted in a computation time on the order of two hours. We believe there is ample room for improvement, for example by leveraging parallelization and samplingbased methods. Overall, we believe our proposed approach is currently the most practical method available for solving CVaR MDPs (as a comparison, the recently proposed method in [8] involves infinite dimensional optimization). The Matlab code used for the experiments is provided in the supplementary material.
6 Conclusion
In this paper we presented an algorithm for CVaR MDPs, based on approximate value iteration on an augmented state space. We established convergence of our algorithm, and derived finite-time error bounds. These bounds are useful for stopping the algorithm at a desired error threshold.
In addition, we uncovered an interesting relationship between the CVaR of the total cost and the worst-case expected cost under adversarial model perturbations. In this formulation, the perturbations are correlated in time, and lead to a robustness framework significantly less conservative than the popular robust-MDP framework, where the uncertainty is temporally independent.
Collectively, our work suggests CVaR MDPs as a unifying and practical framework for computing control policies that are robust with respect to both stochasticity and model perturbations. Future work should address extensions to large state spaces. We conjecture that a sampling-based approximate DP approach [3] should be feasible since, as proven in this paper, the CVaR Bellman equation is contracting (as required by approximate DP methods).
References
 Artzner et al. [1999] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Mathematical finance, 9(3):203–228, 1999.
 Bäuerle and Ott [2011] N. Bäuerle and J. Ott. Markov decision processes with averagevalueatrisk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
 Bertsekas [2012] D. Bertsekas. Dynamic programming and optimal control, Vol II. Athena Scientific, 4th edition, 2012.
 Borkar and Jain [2014] V. Borkar and R. Jain. Risk-constrained Markov decision processes. IEEE Transactions on Automatic Control, 59(9):2574–2579, 2014.
 Chow and Ghavamzadeh [2014] Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems 27, pages 3509–3517, 2014.
 Dowd [2007] K. Dowd. Measuring market risk. John Wiley & Sons, 2007.
 Filar et al. [1995] J. Filar, D. Krass, and K. Ross. Percentile performance criteria for limiting average Markov decision processes. Automatic Control, IEEE Transactions on, 40(1):2–10, 1995.
 Haskell and Jain [2014] W. Haskell and R. Jain. A convex analytic approach to risk-aware Markov decision processes. SIAM Journal on Control and Optimization, 2014.
 Howard and Matheson [1972] R. A. Howard and J. E. Matheson. Risksensitive Markov decision processes. Management Science, 18(7):356–369, 1972.
 Iyengar [2005] G. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
 Iyengar and Ma [2013] G. Iyengar and A. Ma. Fast gradient descent method for meanCVaR optimization. Annals of Operations Research, 205(1):203–212, 2013.
 Mannor et al. [2007] S. Mannor, D. Simester, P. Sun, and J. Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.

 Mannor et al. [2012] S. Mannor, O. Mebel, and H. Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In International Conference on Machine Learning, pages 385–392, 2012.
 Milgrom and Segal [2002] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.
 Nilim and El Ghaoui [2005] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.
 Osogami [2012] T. Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, pages 233–241, 2012.
 Pflug and Pichler [2012] G. Pflug and A. Pichler. Time consistent decisions and temporal decomposition of coherent risk functionals. Optimization online, 2012.
 Phillips [2003] M. Phillips. Interpolation and approximation by polynomials, volume 14. Springer Science & Business Media, 2003.
 Rockafellar and Uryasev [2000] R. Rockafellar and S. Uryasev. Optimization of conditional valueatrisk. Journal of risk, 2:21–42, 2000.
 Rockafellar et al. [2006] R. Rockafellar, S. Uryasev, and M. Zabarankin. Master funds in portfolio analysis with general deviation measures. Journal of Banking & Finance, 30(2):743–778, 2006.
 Serraino and Uryasev [2013] G. Serraino and S. Uryasev. Conditional valueatrisk (CVaR). In Encyclopedia of Operations Research and Management Science, pages 258–266. Springer, 2013.
 Shapiro et al. [2009] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on stochastic programming. SIAM, 2009.
 Sobel [1982] M. Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, pages 794–802, 1982.
 Tamar et al. [2015] A. Tamar, Y. Glassner, and S. Mannor. Optimizing the CVaR via sampling. In AAAI, 2015.
 Uryasev et al. [2010] S. Uryasev, S. Sarykalin, G. Serraino, and K. Kalinchenko. VaR vs CVaR in risk management and optimization. In CARISMA conference, 2010.
 Xu and Mannor [2006] H. Xu and S. Mannor. The robustnessperformance tradeoff in Markov decision processes. In Advances in Neural Information Processing Systems, pages 1537–1544, 2006.
Appendix A Proofs of Theoretical Results
A.1 Proof of Proposition 2.4
By definition, we have that
Note that by definition of the set , for any we have that , and
Thus,
where the last equality is by the representation theorem for CVaR [22].
A.2 Proof of Lemma 6
The proofs of the monotonicity and constant-shift properties follow directly from the definitions of the Bellman operator, by noting that $\xi(x')\,P(x' \mid x, a)$ is nonnegative and $\sum_{x'} \xi(x')\,P(x' \mid x, a) = 1$ for any $\xi \in \mathcal{U}_{\mathrm{CVaR}}(y, P(\cdot \mid x, a))$. For the contraction property, denote $\epsilon = \|V_1 - V_2\|_\infty$. Since
by monotonicity and constant shift property,
This further implies that
and the contraction property follows.
Now, we prove the concavity preserving property. Assume that is concave in for any . Let , and , and define . We have
where the first inequality is by concavity of the , and the second is by the concavity assumption. Now, define . When and , we have that and . We thus have
Finally, to show that the inner problem in (6) is a concave maximization, we need to show that
is a concave function in for any given , and . Suppose is a concave function in . Immediately we can see that is concave in when . Also notice that when , since the transition probability is nonnegative, we have the result that is concave in . This further implies
is concave in . Furthermore by combining the result with the fact that the feasible set of is a polytope, we complete the proof of this claim.
A.3 Proof of Theorem 3
The first part of the proof is to show that for any ,
(11) 
by induction, where the initial condition is and control action is induced by . For , we have that from definition. By induction hypothesis, assume the above expression holds at . For ,
(12) 
where the initial state condition is given by . Thus, the equality in (11) is proved by induction.
The second part of the proof is to show that . Recall