I Introduction
Autonomous systems are increasingly being deployed in real-world settings, and the risk that stems from unknown and unforeseen circumstances is rising correspondingly. This calls for autonomous systems that can make appropriately conservative decisions when faced with uncertainty in their environment and behavior. Mathematically speaking, risk can be quantified in numerous ways, such as chance constraints [wang2020non] and distributional robustness [NIPS2010_19f3cd30]. However, applications in autonomy and robotics require more “nuanced assessments of risk” [majumdar2020should]. Artzner et al. [artzner1999coherent] characterized a set of natural properties that are desirable for a risk measure; measures satisfying them are called coherent risk measures and have obtained widespread acceptance in finance and operations research, among other fields.
A popular model for representing sequential decision making under uncertainty is a Markov decision process (MDP) [Puterman94]. MDPs with coherent risk objectives were studied in [tamar2016sequential, tamar2015policy], where the authors proposed a sampling-based algorithm for finding saddle point solutions using policy gradient methods. However, [tamar2016sequential] requires the risk envelope appearing in the dual representation of the coherent risk measure to be known with an explicit canonical convex programming formulation. While this may be the case for CVaR, mean-semideviation, and spectral risk measures [shapiro2014lectures], such an explicit form is not known for general coherent risk measures, such as EVaR. Furthermore, it is not clear whether the saddle point solutions are a lower bound or an upper bound to the optimal value. Also, policy-gradient-based methods require calculating the gradient of the coherent risk measure, which is not available in explicit form in general. For the CVaR measure, MDPs with risk constraints and total expected costs were studied in [prashanth2014policy, chow2014algorithms], and locally optimal solutions were found via policy gradients as well. However, this method also leads to saddle point solutions (which cannot be shown to be upper bounds or lower bounds of the optimal value) and cannot be applied to general coherent risk measures. In addition, because the objective and the constraints are in terms of different coherent risk measures, the authors assume there exists a policy that satisfies the CVaR constraint (feasibility assumption), which may not be the case in general. Following in the footsteps of [pflug2016time], a promising approach based on approximate value iteration was proposed for MDPs with CVaR objectives in [chow2015risk].
A policy iteration algorithm for finding policies that minimize total coherent risk measures for MDPs was studied in [ruszczynski2010risk], where a computational nonsmooth Newton method was also proposed.
When the states of the agent and/or the environment are not directly observable, a partially observable MDP (POMDP) can be used to study decision making under uncertainty introduced by the partial state observability [krishnamurthy2016partially, ahmadi2020control]. POMDPs with coherent risk measure objectives were studied in [fan2018process, fan2018risk]. Despite the elegance of the theory, no computational method was proposed to design policies for general coherent risk measures. In [ahmadi2020risk], we proposed a method for finding finite-state controllers for POMDPs with objectives defined in terms of coherent risk measures, which takes advantage of convex optimization techniques. However, the method can only be used if the risk transition mapping is affine in the policy.
Summary of Contributions: In this paper, we consider MDPs and POMDPs with both objectives and constraints in terms of coherent risk measures. Our contributions are fourfold:

For MDPs, we use the Lagrangian framework and reformulate the problem into an inf-sup problem. For Markov risk transition mappings, we propose an optimization-based method to design Markovian policies that lower-bound the constrained risk-averse problem;

For MDPs, we show that the optimization problems are in the special form of difference convex programs (DCPs) and can be solved by the disciplined convex-concave programming (DCCP) method. We also demonstrate that these results generalize linear programs for constrained MDPs with total discounted expected costs and constraints;

For POMDPs, we demonstrate that, if the coherent risk measures can be defined as a Markov risk transition mapping, an infinite-dimensional optimization can be used to design Markovian belief-based policies, which in theory requires infinite memory to synthesize (in accordance with classical POMDP complexity results);

For POMDPs with stochastic finite-state controllers (FSCs), we show that the latter optimization converts to a (finite-dimensional) DCP and can be solved by the DCCP framework. We incorporate these DCPs in a policy iteration algorithm to design risk-averse FSCs for POMDPs.
We assess the efficacy of the proposed method with numerical experiments involving conditional value-at-risk (CVaR) and entropic value-at-risk (EVaR) risk measures.
Preliminary results on risk-averse MDPs were presented in [ahmadi2021aaai]. This paper, in addition to providing detailed proofs and new numerical analysis in the MDP case, generalizes [ahmadi2021aaai] to partially observable systems (POMDPs) with dynamic coherent risk objectives and constraints.
The rest of the paper is organized as follows. In the next section, we briefly review some notions used in the paper. In Section III, we formulate the problem under study. In Section IV, we present the optimization-based method for designing risk-averse policies for MDPs. In Section V, we describe a policy iteration method for designing finite-memory controllers for risk-averse POMDPs. In Section VI, we illustrate the proposed methodology via numerical experiments. Finally, in Section VII, we conclude the paper and give directions for future research.
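Notation: We denote by $\mathbb{R}^n$ the $n$-dimensional Euclidean space and by $\mathbb{N}_{\ge 0}$ the set of non-negative integers. Throughout the paper, we use bold font to denote a vector and $(\cdot)^\top$ for its transpose, e.g., $\boldsymbol{a} = (a_1, \ldots, a_n)^\top$, with $n \in \{1, 2, \ldots\}$. For a vector $\boldsymbol{a}$, we use $\boldsymbol{a} \succeq (\preceq)\, \boldsymbol{0}$ to denote element-wise non-negativity (non-positivity) and $\boldsymbol{a} \equiv \boldsymbol{0}$ to show all elements of $\boldsymbol{a}$ are zero. For two vectors $\boldsymbol{a}, \boldsymbol{b} \in \mathbb{R}^n$, we denote their inner product by $\langle \boldsymbol{a}, \boldsymbol{b} \rangle$, i.e., $\langle \boldsymbol{a}, \boldsymbol{b} \rangle = \boldsymbol{a}^\top \boldsymbol{b}$. For a finite set $\mathcal{A}$, we denote its power set by $2^{\mathcal{A}}$, i.e., the set of all subsets of $\mathcal{A}$. For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a constant $p \in [1, \infty)$, $\mathcal{L}_p(\Omega, \mathcal{F}, \mathbb{P})$ denotes the vector space of real-valued random variables $c$ for which $\mathbb{E}|c|^p < \infty$.
II Preliminaries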
In this section, we briefly review some notions and definitions used throughout the paper.
II-A Markov Decision Processes
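An MDP is a tuple $\mathcal{M} = (\mathcal{S}, Act, T, \kappa_0)$ consisting of a set of states $\mathcal{S} = \{s_1, \ldots, s_{|\mathcal{S}|}\}$ of the autonomous agent(s) and world model, actions $Act = \{\alpha_1, \ldots, \alpha_{|Act|}\}$ available to the agent, a transition function $T(s_j \mid s_i, \alpha)$, and $\kappa_0$ describing the initial distribution over the states.
This paper considers finite Markov decision processes, where $\mathcal{S}$ and $Act$ are finite sets. For each action, the probability of making a transition from state $s_i \in \mathcal{S}$ to state $s_j \in \mathcal{S}$ under action $\alpha \in Act$ is given by $T(s_j \mid s_i, \alpha)$. The probabilistic components of an MDP must satisfy the following:
$$\sum_{s \in \mathcal{S}} T(s \mid s_i, \alpha) = 1, \quad \forall s_i \in \mathcal{S},\ \alpha \in Act, \qquad \sum_{s \in \mathcal{S}} \kappa_0(s) = 1.$$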
II-B Partially Observable MDPs
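A POMDP is a tuple $\mathcal{PM} = (\mathcal{M}, \mathcal{O}, O)$ consisting of an MDP $\mathcal{M} = (\mathcal{S}, Act, T, \kappa_0)$, observations $\mathcal{O} = \{o_1, \ldots, o_{|\mathcal{O}|}\}$, and an observation model $O(o \mid s)$. We consider finite POMDPs, where $\mathcal{O}$ is a finite set. Then, for each state $s_i$, an observation $o \in \mathcal{O}$ is generated independently with probability $O(o \mid s_i)$, which satisfies
$$\sum_{o \in \mathcal{O}} O(o \mid s) = 1, \quad \forall s \in \mathcal{S}.$$
In POMDPs, the states $s \in \mathcal{S}$ are not directly observable. The beliefs $b_t \in \Delta(\mathcal{S})$, $t = 0, 1, \ldots$, i.e., the probability of being in different states, with $\Delta(\mathcal{S})$ being the set of probability distributions over $\mathcal{S}$, for all $s \in \mathcal{S}$ can be computed using Bayes' law as follows:
$$b_0(s) = \frac{\kappa_0(s)\, O(o_0 \mid s)}{\sum_{s' \in \mathcal{S}} \kappa_0(s')\, O(o_0 \mid s')}, \qquad (1)$$
$$b_t(s) = \frac{O(o_t \mid s) \sum_{s' \in \mathcal{S}} T(s \mid s', \alpha_{t-1})\, b_{t-1}(s')}{\sum_{\bar{s} \in \mathcal{S}} O(o_t \mid \bar{s}) \sum_{s' \in \mathcal{S}} T(\bar{s} \mid s', \alpha_{t-1})\, b_{t-1}(s')}, \qquad (2)$$
for all $t \ge 1$.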
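The recursive Bayes update of the belief can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `belief_update`, the nested-list layout `T[a][s][s2]` for the transition function and `O[s][o]` for the observation model, and the two-state example model are all assumptions made here, not part of the paper.

```python
def belief_update(belief, action, obs, T, O):
    """One step of the Bayes filter for a finite POMDP.

    belief[s]    : current probability of being in state s
    T[a][s][s2]  : transition probability of reaching s2 from s under action a
    O[s][o]      : probability of observing o in state s
    """
    n = len(belief)
    # Prediction: push the current belief through the transition model.
    predicted = [sum(T[action][s][s2] * belief[s] for s in range(n))
                 for s2 in range(n)]
    # Correction: weight each state by the likelihood of the observation.
    unnormalized = [O[s2][obs] * predicted[s2] for s2 in range(n)]
    z = sum(unnormalized)
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / z for u in unnormalized]


# Hypothetical model: 2 states, 1 action, 2 observations.
T = [[[0.9, 0.1], [0.2, 0.8]]]
O = [[0.8, 0.2], [0.3, 0.7]]
b1 = belief_update([0.5, 0.5], action=0, obs=0, T=T, O=O)
```

The prediction and correction steps correspond to the inner sum and the observation weighting in the belief recursion; the normalization constant is the denominator.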
II-C Finite State Control of POMDPs
It is well established that designing optimal policies for POMDPs based on the (continuous) belief states requires uncountably infinite memory or internal states [CassandraKL94, MADANI20035]. This paper focuses on a particular class of POMDP controllers, namely, FSCs.
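A stochastic finite state controller for $\mathcal{PM}$ is given by the tuple $\mathcal{G} = (G, \omega, \kappa)$, where $G = \{g_1, g_2, \ldots, g_{|G|}\}$ is a finite set of internal states (I-states), and $\omega: G \times \mathcal{O} \to \Delta(G \times Act)$ is a function of the internal stochastic finite state controller state $g_k$ and observation $o$, such that $\omega(g_k, o)$ is a probability distribution over $G \times Act$. The next internal state and action pair $(g_l, \alpha)$ is chosen by independent sampling of $\omega(g_k, o)$. By abuse of notation, $\omega(g_l, \alpha \mid g_k, o)$ will denote the probability of transitioning to internal stochastic finite state controller state $g_l$ and taking action $\alpha$, when the current internal state is $g_k$ and observation $o$ is received. The function $\kappa: \Delta(\mathcal{S}) \to \Delta(G)$ chooses the starting internal FSC state $g_0$ by independent sampling of $\kappa(\kappa_0)$, given the initial distribution $\kappa_0$ of $\mathcal{PM}$, and $\kappa(g \mid \kappa_0)$ will denote the probability of starting the FSC in internal state $g$ when the initial POMDP distribution is $\kappa_0$.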
II-D Coherent Risk Measures
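Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a filtration $\mathcal{F}_0 \subset \cdots \subset \mathcal{F}_N \subset \mathcal{F}$, and an adapted sequence of random variables (stage-wise costs) $c_t$, $t = 0, \ldots, N$, where $N \in \mathbb{N}_{\ge 0} \cup \{\infty\}$. For $t = 0, \ldots, N$, we further define the spaces $\mathcal{C}_t = \mathcal{L}_p(\Omega, \mathcal{F}_t, \mathbb{P})$, $p \in [1, \infty)$, $\mathcal{C}_{t:N} = \mathcal{C}_t \times \cdots \times \mathcal{C}_N$, and $\mathcal{C} = \mathcal{C}_0 \times \mathcal{C}_1 \times \cdots$. We assume that the sequence $\boldsymbol{c} \in \mathcal{C}$ is almost surely bounded (with exceptions having probability zero), i.e.,
$$\max_t \operatorname{ess\,sup}\, |c_t(\omega)| < \infty.$$
In order to describe how one can evaluate the risk of the sub-sequence $c_t, \ldots, c_N$ from the perspective of stage $t$, we require the following definitions.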
Definition 1 (Conditional Risk Measure).
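A mapping $\rho_{t:N}: \mathcal{C}_{t:N} \to \mathcal{C}_t$, where $0 \le t \le N$, is called a conditional risk measure if it has the following monotonicity property:
$$\rho_{t:N}(\boldsymbol{c}) \le \rho_{t:N}(\boldsymbol{c}'), \quad \forall \boldsymbol{c}, \boldsymbol{c}' \in \mathcal{C}_{t:N} \text{ such that } \boldsymbol{c} \preceq \boldsymbol{c}'.$$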
Definition 2 (Dynamic Risk Measure).
A dynamic risk measure is a sequence of conditional risk measures $\rho_{t:N}: \mathcal{C}_{t:N} \to \mathcal{C}_t$, $t = 0, \ldots, N$.
One fundamental property of dynamic risk measures is their consistency over time [ruszczynski2010risk, Definition 3]. That is, if $\boldsymbol{c}$ will be as good as $\boldsymbol{c}'$ from the perspective of some future time $\theta$, and they are identical between time $\tau$ and $\theta$, then $\boldsymbol{c}$ should not be worse than $\boldsymbol{c}'$ from the perspective at time $\tau$.
In this paper, we focus on time-consistent, coherent risk measures, which satisfy the four mathematical properties defined below [shapiro2014lectures, p. 298].
Definition 3 (Coherent Risk Measure).
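We call the one-step conditional risk measures $\rho_t: \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, N-1$, a coherent risk measure if it satisfies the following conditions:

Convexity: $\rho_t(\lambda c + (1-\lambda) c') \le \lambda \rho_t(c) + (1-\lambda) \rho_t(c')$, for all $\lambda \in (0,1)$ and all $c, c' \in \mathcal{C}_{t+1}$;

Monotonicity: If $c \le c'$, then $\rho_t(c) \le \rho_t(c')$ for all $c, c' \in \mathcal{C}_{t+1}$;

Translational Invariance: $\rho_t(c + c') = c + \rho_t(c')$ for all $c \in \mathcal{C}_t$ and $c' \in \mathcal{C}_{t+1}$;

Positive Homogeneity: $\rho_t(\beta c) = \beta \rho_t(c)$ for all $c \in \mathcal{C}_{t+1}$ and $\beta \ge 0$.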
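We are interested in discounted infinite-horizon problems. Let $\gamma \in (0,1)$ be a given discount factor. For $N = 0, 1, \ldots$, we define the functional
$$\rho^\gamma_{0:N}(c_0, \ldots, c_N) = c_0 + \rho_0\Big(\gamma c_1 + \rho_1\big(\gamma^2 c_2 + \cdots + \rho_{N-1}(\gamma^N c_N) \cdots \big)\Big). \qquad (3)$$
Finally, we have the total discounted risk functional $\rho^\gamma: \mathcal{C} \to \mathbb{R}$ defined as
$$\rho^\gamma(\boldsymbol{c}) = \lim_{N \to \infty} \rho^\gamma_{0:N}(c_0, \ldots, c_N). \qquad (4)$$
From [ruszczynski2010risk, Theorem 3], we have that $\rho^\gamma$ is convex, monotone, and positive homogeneous.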
II-E Examples of Coherent Risk Measures
Next, we briefly review three examples of coherent risk measures that will be used in this paper.
Total Conditional Expectation: The simplest risk measure is the total conditional expectation given by
$$\rho_t(c_{t+1}) = \mathbb{E}\left[c_{t+1} \mid \mathcal{F}_t\right]. \qquad (5)$$
It is easy to see that total conditional expectation satisfies the properties of a coherent risk measure as outlined in Definition 3. Unfortunately, total conditional expectation is agnostic to realization fluctuations of the random variable $c$ and is only concerned with the mean value of $c$ over a large number of realizations. Thus, it is a risk-neutral measure of performance.
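Conditional Value-at-Risk: Let $c \in \mathcal{C}$ be a random variable. For a given confidence level $\varepsilon \in (0,1)$, value-at-risk ($\mathrm{VaR}_\varepsilon$) denotes the $(1-\varepsilon)$-quantile value of the random variable $c$. Unfortunately, working with VaR for non-normal random variables is numerically unstable and optimizing models involving VaR is intractable in high dimensions [rockafellar2000optimization].
In contrast, CVaR overcomes the shortcomings of VaR. CVaR with confidence level $\varepsilon \in (0,1)$, denoted $\mathrm{CVaR}_\varepsilon$, measures the expected loss in the $(1-\varepsilon)$-tail given that the particular threshold $\mathrm{VaR}_\varepsilon$ has been crossed, i.e., $\mathrm{CVaR}_\varepsilon(c) = \mathbb{E}\left[c \mid c \ge \mathrm{VaR}_\varepsilon(c)\right]$. An optimization formulation for CVaR was proposed in [rockafellar2000optimization]. That is, $\mathrm{CVaR}_\varepsilon$ is given by
$$\rho_t(c_{t+1}) = \inf_{\zeta \in \mathbb{R}} \mathbb{E}\left[\zeta + \frac{(c_{t+1} - \zeta)_+}{\varepsilon} \,\Big|\, \mathcal{F}_t\right], \qquad (6)$$
where $(\cdot)_+ = \max\{\cdot, 0\}$. A value of $\varepsilon \to 1$ corresponds to a risk-neutral case, i.e., $\mathrm{CVaR}_1(c) = \mathbb{E}[c]$; whereas a value of $\varepsilon \to 0$ is rather a risk-averse case, i.e., $\mathrm{CVaR}_0(c) = \mathrm{VaR}_0(c) = \operatorname{ess\,sup}(c)$ [rockafellar2002conditional]. Figure 1 illustrates these notions for an example variable $c$ with distribution $p(c)$.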
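For a random variable with finitely many outcomes, the infimum formulation of CVaR can be evaluated exactly, because the objective is piecewise linear and convex in the auxiliary variable and therefore attains its minimum at one of the outcomes. The sketch below relies on that observation; the function name `cvar`, its arguments, and the three-outcome example are illustrative assumptions rather than anything taken from the paper.

```python
def cvar(values, probs, eps):
    """CVaR of a finite discrete cost distribution at confidence level eps.

    Evaluates inf over zeta of  zeta + E[(c - zeta)_+] / eps  by checking
    only the outcomes themselves, which is enough for a piecewise-linear
    convex objective whose breakpoints are the outcomes.
    """
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")

    def objective(zeta):
        tail = sum(p * max(x - zeta, 0.0) for x, p in zip(values, probs))
        return zeta + tail / eps

    return min(objective(z) for z in values)


# Hypothetical cost distribution: a rare but large loss of 10.
costs = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]
risk_neutral = cvar(costs, probs, 1.0)   # coincides with the mean
tail_half = cvar(costs, probs, 0.5)      # mean of the worst 50% of outcomes
tail_tenth = cvar(costs, probs, 0.1)     # mean of the worst 10% of outcomes
```

As the confidence level decreases, the measure moves from the mean toward the worst-case outcome, which is the behavior described in the text.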
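Entropic Value-at-Risk: Unfortunately, CVaR ignores the losses below the VaR threshold. EVaR is the tightest upper bound in the sense of the Chernoff inequality for VaR and CVaR, and its dual representation is associated with the relative entropy. In fact, it was shown in [ahmadi2017analytical] that $\mathrm{EVaR}_\varepsilon$ and $\mathrm{CVaR}_\varepsilon$ are equal only if there are no losses ($c \to -\infty$) below the $\mathrm{VaR}_\varepsilon$ threshold. In addition, EVaR is a strictly monotone risk measure, whereas CVaR is only monotone [ahmadi2019portfolio]. $\mathrm{EVaR}_\varepsilon$ is given by
$$\rho_t(c_{t+1}) = \inf_{\zeta > 0} \left( \frac{1}{\zeta} \log \frac{\mathbb{E}\left[e^{\zeta c_{t+1}} \mid \mathcal{F}_t\right]}{\varepsilon} \right). \qquad (7)$$
Similar to $\mathrm{CVaR}_\varepsilon$, for $\mathrm{EVaR}_\varepsilon$, $\varepsilon \to 1$ corresponds to a risk-neutral case, whereas $\varepsilon \to 0$ corresponds to a risk-averse case. In fact, it was demonstrated in [ahmadi2012entropic, Proposition 3.2] that $\lim_{\varepsilon \to 0} \mathrm{EVaR}_\varepsilon(c) = \operatorname{ess\,sup}(c)$.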
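The EVaR infimum has no closed form for a general discrete distribution, so a rough numerical sketch is given below. It swaps in a coarse logarithmic grid search over the auxiliary variable instead of a proper one-dimensional solver, and it therefore returns a slight overestimate of the true infimum. The name `evar`, the grid range, and the example distribution are assumptions made for illustration, and all outcome probabilities are assumed to be strictly positive.

```python
import math


def evar(values, probs, eps, grid_points=601):
    """Approximate EVaR of a finite discrete cost distribution.

    Minimizes (log E[exp(zeta * c)] - log eps) / zeta over a logarithmic
    grid of zeta in [1e-6, 1e6]. A log-sum-exp shift keeps the moment
    generating function from overflowing for large zeta.
    """
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")
    best = math.inf
    for k in range(grid_points):
        zeta = 10.0 ** (-6.0 + 12.0 * k / (grid_points - 1))
        shift = max(zeta * x for x in values)
        log_mgf = shift + math.log(
            sum(p * math.exp(zeta * x - shift) for x, p in zip(values, probs))
        )
        best = min(best, (log_mgf - math.log(eps)) / zeta)
    return best


# Same hypothetical cost distribution as a rare large loss of 10.
costs = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]
evar_neutral = evar(costs, probs, 1.0)
evar_half = evar(costs, probs, 0.5)
evar_tenth = evar(costs, probs, 0.1)
```

On this example the approximation stays between the mean and the largest outcome and grows as the confidence level shrinks, consistent with the limiting cases stated above.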
III Problem Formulation
In the past two decades, coherent risk and dynamic risk measures have been developed and used in microeconomics and mathematical finance [vose2008risk]. Generally speaking, risk-averse decision making is concerned with the behavior of agents, e.g., consumers and investors, who, when exposed to uncertainty, attempt to lower that uncertainty. The agents may avoid situations with unknown payoffs in favor of situations with payoffs that are more predictable.
The core idea in risk-averse planning is to replace the conventional risk-neutral conditional expectation of the cumulative cost objectives with the more general coherent risk measures. In path planning scenarios, in particular, we will show in our numerical experiments that considering coherent risk measures leads to significantly more robustness to environment uncertainty and to collisions that cause mission failures.
In addition to total cost risk aversion, an agent is often subject to constraints, e.g., fuel, communication, or energy budgets [7452536]. These constraints can also represent mission objectives, e.g., explore an area or reach a goal.
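Consider a stationary controlled Markov process $\{q_t\}$, $t = 0, 1, \ldots$ (an MDP or a POMDP) with initial probability distribution $\kappa_0$, wherein policies, transition probabilities, and cost functions do not depend explicitly on time. Each policy $\pi = \{\pi_t\}_{t=0}^{\infty}$ leads to cost sequences $c_t = c(q_t, \alpha_t)$, $t = 0, 1, \ldots$, and $d^i_t = d^i(q_t, \alpha_t)$, $t = 0, 1, \ldots$, $i = 1, 2, \ldots, n_c$. We define the dynamic risk of evaluating the $\gamma$-discounted cost of a policy $\pi$ as
$$J_\gamma(\kappa_0, \pi) = \rho^\gamma\big(c(q_0, \alpha_0), c(q_1, \alpha_1), \ldots\big), \qquad (8)$$
and the $\gamma$-discounted dynamic risk constraints of executing policy $\pi$ as
$$D^i_\gamma(\kappa_0, \pi) = \rho^\gamma\big(d^i(q_0, \alpha_0), d^i(q_1, \alpha_1), \ldots\big) \le \beta^i, \quad i = 1, 2, \ldots, n_c, \qquad (9)$$
where $\rho^\gamma$ is defined in equation (4), $q_0 \sim \kappa_0$, and $\beta^i > 0$, $i = 1, 2, \ldots, n_c$, are given constants. We assume that $c(\cdot, \cdot)$ and $d^i(\cdot, \cdot)$, $i = 1, 2, \ldots, n_c$, are non-negative and upper-bounded. For a discount factor $\gamma \in (0,1)$, an initial condition $\kappa_0$, and a policy $\pi$, we infer from [ruszczynski2010risk, Theorem 3] that both $J_\gamma(\kappa_0, \pi)$ and $D^i_\gamma(\kappa_0, \pi)$ are well-defined (bounded) if $c$ and $d^i$ are bounded.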
In this work, we are interested in addressing the following problem:
Problem 1.
We call a controlled Markov process with the “nested” objective (8) and constraints (9) a constrained risk-averse Markov process.
For MDPs, [chow2015risk, osogami2012robustness] show that such coherent risk measure objectives can account for modeling errors and parametric uncertainties. We can also interpret Problem 1 as designing policies that minimize the accrued costs in a risk-averse sense while ensuring that the system constraints, e.g., fuel constraints, are not violated even in rare but costly scenarios.
Note that in Problem 1 both the objective function and the constraints are in general non-differentiable and non-convex in the policy (with the exception of total expected cost as the coherent risk measure [altman1999constrained]). Therefore, finding optimal policies may in general be intractable. Instead, we find suboptimal policies by taking advantage of a Lagrangian formulation and then using an optimization form of Bellman’s equations.
Next, we show that the constrained risk-averse problem is equivalent to an unconstrained inf-sup risk-averse problem thanks to the Lagrangian method.
Proposition 1.
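Let $J_\gamma(\kappa_0)$ be the value of Problem 1 for a given initial distribution $\kappa_0$ and discount factor $\gamma$. Then, (i) the value function satisfies
$$J_\gamma(\kappa_0) = \inf_{\pi} \sup_{\boldsymbol{\lambda} \succeq \boldsymbol{0}} L_\gamma(\pi, \boldsymbol{\lambda}), \qquad (11)$$
where
$$L_\gamma(\pi, \boldsymbol{\lambda}) = J_\gamma(\kappa_0, \pi) + \left\langle \boldsymbol{\lambda}, \boldsymbol{D}_\gamma(\kappa_0, \pi) - \boldsymbol{\beta} \right\rangle \qquad (12)$$
is the Lagrangian function.
(ii) Furthermore, a policy $\pi^*$ is optimal for Problem 1 if and only if $J_\gamma(\kappa_0) = \sup_{\boldsymbol{\lambda} \succeq \boldsymbol{0}} L_\gamma(\pi^*, \boldsymbol{\lambda})$.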
Proof.
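(i) If for some $\pi$ Problem 1 is not feasible, then $\sup_{\boldsymbol{\lambda} \succeq \boldsymbol{0}} L_\gamma(\pi, \boldsymbol{\lambda}) = \infty$. In fact, if the $i$th constraint is not satisfied, i.e., $D^i_\gamma(\kappa_0, \pi) > \beta^i$, we can achieve the latter supremum by choosing $\lambda^i \to \infty$, while keeping the rest of the $\lambda^j$s constant or zero. If Problem 1 is feasible for some $\pi$, then the supremum is achieved by setting $\boldsymbol{\lambda} = \boldsymbol{0}$. Hence, $\sup_{\boldsymbol{\lambda} \succeq \boldsymbol{0}} L_\gamma(\pi, \boldsymbol{\lambda}) = J_\gamma(\kappa_0, \pi)$ and
$$\inf_{\pi} \sup_{\boldsymbol{\lambda} \succeq \boldsymbol{0}} L_\gamma(\pi, \boldsymbol{\lambda}) = \inf_{\pi:\, \boldsymbol{D}_\gamma(\kappa_0, \pi) \preceq \boldsymbol{\beta}} J_\gamma(\kappa_0, \pi) = J_\gamma(\kappa_0),$$
which implies (i).
(ii) If $\pi^*$ is optimal, then, from (11), we have
$$J_\gamma(\kappa_0) = \sup_{\boldsymbol{\lambda} \succeq \boldsymbol{0}} L_\gamma(\pi^*, \boldsymbol{\lambda}).$$
Conversely, if $J_\gamma(\kappa_0) = \sup_{\boldsymbol{\lambda} \succeq \boldsymbol{0}} L_\gamma(\pi', \boldsymbol{\lambda})$ for some $\pi'$, then from (11), we have $\inf_{\pi} \sup_{\boldsymbol{\lambda} \succeq \boldsymbol{0}} L_\gamma(\pi, \boldsymbol{\lambda}) = \sup_{\boldsymbol{\lambda} \succeq \boldsymbol{0}} L_\gamma(\pi', \boldsymbol{\lambda})$. Hence, $\pi'$ is the optimal policy. ∎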
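The argument in the proof can be checked on a toy example in which the policy set is finite and the risk values of each policy are simply tabulated. The sketch below is only an illustration of the inf-sup identity under that assumption; the function names and the numbers are hypothetical.

```python
import math


def sup_lagrangian(objective, constraints, budgets):
    """Supremum over non-negative multipliers of the Lagrangian.

    If any constraint value exceeds its budget, the corresponding multiplier
    can be sent to infinity, so the supremum is infinite. Otherwise every
    multiplier term is non-positive and zero multipliers attain the supremum.
    """
    if any(d > b for d, b in zip(constraints, budgets)):
        return math.inf
    return objective


def constrained_value(policies, budgets):
    """Infimum over policies of the supremum over multipliers."""
    return min(sup_lagrangian(j, d, budgets) for j, d in policies)


# Hypothetical table: (objective risk, [constraint risk]) for three policies.
policies = [(1.0, [5.0]), (2.0, [3.0]), (4.0, [1.0])]
value = constrained_value(policies, budgets=[3.5])
infeasible_value = constrained_value(policies, budgets=[0.5])
```

With a budget of 3.5 the cheapest policy is infeasible, so the inf-sup value equals the objective of the second policy; with a budget of 0.5 no policy is feasible and the value is infinite.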
IV Constrained Risk-Averse MDPs
At any time $t$, the value of $\rho_t$ is $\mathcal{F}_t$-measurable and is allowed to depend on the entire history of the process, so we cannot expect to obtain a Markov optimal policy [ott2010markov, bauerle2011markov]. In order to obtain Markov policies, we need the following property [ruszczynski2010risk].
Definition 4 (Markov Risk Measure).
Let such that and . A one-step conditional risk measure is a Markov risk measure with respect to the controlled Markov process , , if there exists a risk transition mapping such that for all and , we have
(13) 
where is called the controlled kernel.
In fact, if $\rho_t$ is a coherent risk measure, the risk transition mapping also satisfies the properties of a coherent risk measure (Definition 3). In this paper, since we are concerned with MDPs, the controlled kernel is simply the transition function $T$.
Assumption 1.
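The one-step coherent risk measure $\rho_t$ is a Markov risk measure.
The simplest case of the risk transition mapping, which we denote by $\sigma$, is in the conditional expectation case $\rho_t(v(s_{t+1})) = \mathbb{E}\left[v(s_{t+1}) \mid s_t, \alpha_t\right]$, i.e.,
$$\sigma\big(v(s_{t+1}), s_t, T(\cdot \mid s_t, \alpha_t)\big) = \sum_{s_{t+1} \in \mathcal{S}} v(s_{t+1})\, T(s_{t+1} \mid s_t, \alpha_t). \qquad (14)$$
Note that in the total discounted expectation case $\sigma$ is a linear function in $v$ rather than a convex function, which is the case for general coherent risk measures. For example, for the CVaR risk measure, the Markov risk transition mapping is given by
$$\sigma\big(v(s_{t+1}), s_t, T(\cdot \mid s_t, \alpha_t)\big) = \inf_{\zeta \in \mathbb{R}} \left\{ \zeta + \frac{1}{\varepsilon} \sum_{s_{t+1} \in \mathcal{S}} \big(v(s_{t+1}) - \zeta\big)_+\, T(s_{t+1} \mid s_t, \alpha_t) \right\},$$
where $(\cdot)_+ = \max\{\cdot, 0\}$ is a convex function in $v$.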
If $\rho_t$ is a coherent, Markov risk measure, then Markov policies are sufficient to ensure optimality [ruszczynski2010risk].
In the next result, we show that we can find a lower bound to the solution to Problem 1 via solving an optimization problem.
Theorem 1.
Proof.
From Proposition 1, we know that (11) holds. Hence, we have
(17)
where, in the fourth, fifth, and sixth lines above, we used the positive homogeneity property of $\rho^\gamma$, the subadditivity property of $\rho^\gamma$, and the minimax inequality, respectively. Since $\langle \boldsymbol{\lambda}, \boldsymbol{\beta} \rangle$ does not depend on $\pi$, to find the solution to the infimum it suffices to find the solution to
where . The value to the above optimization can be obtained by solving the following Bellman equation [ruszczynski2010risk, Theorem 4]
Next, we show that the solution to the above Bellman equation can be alternatively obtained by solving the convex optimization
subject to  
(18) 
Define
and for all . From [ruszczynski2010risk, Lemma 1], we infer that and are nondecreasing; i.e., for , we have and . Therefore, if , then . By repeated application of , we obtain
Any feasible solution to (18) must satisfy and hence must satisfy . Thus, given that all entries of are positive, is the optimal solution to (18). Substituting (18) back in the last inequality in (17) yields the result. ∎
Once the values of and are found by solving the optimization problem in Theorem 1, we can find the policy as
(19) 
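To illustrate the risk-averse Bellman recursion and the greedy policy extraction for fixed Lagrange multipliers, the sketch below runs plain value iteration with a CVaR risk transition mapping on a small hypothetical MDP. Note that this swaps in value iteration for the optimization-based method of the text, and that the cost table `c[s][a]` is assumed to already include any multiplier-weighted constraint costs. All names, the data layout `T[a][s][s2]`, and the numbers are illustrative assumptions.

```python
def cvar(values, probs, eps):
    """CVaR of a finite discrete distribution via the infimum formulation."""
    def objective(zeta):
        tail = sum(p * max(x - zeta, 0.0) for x, p in zip(values, probs))
        return zeta + tail / eps
    return min(objective(z) for z in values)


def risk_averse_value_iteration(T, c, gamma, eps, tol=1e-9, max_iter=10000):
    """Iterate V(s) = min_a [ c(s,a) + gamma * CVaR_eps(V(s') | s, a) ]."""
    n_states, n_actions = len(c), len(c[0])
    V = [0.0] * n_states
    for _ in range(max_iter):
        Q = [[c[s][a] + gamma * cvar(V, T[a][s], eps)
              for a in range(n_actions)] for s in range(n_states)]
        V_new = [min(row) for row in Q]
        converged = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the last computed action values.
    policy = [min(range(n_actions), key=lambda a: Q[s][a])
              for s in range(n_states)]
    return V, policy


# Hypothetical MDP: 2 states, 2 actions.
T = [[[1.0, 0.0], [0.5, 0.5]],   # action 0
     [[0.8, 0.2], [0.0, 1.0]]]   # action 1
c = [[1.0, 0.5], [2.0, 3.0]]     # c[s][a]
V_neutral, pi_neutral = risk_averse_value_iteration(T, c, gamma=0.9, eps=1.0)
V_averse, pi_averse = risk_averse_value_iteration(T, c, gamma=0.9, eps=0.2)
```

Setting the confidence level to one recovers ordinary expected-cost value iteration, and lowering it can only increase the computed values, since CVaR dominates the expectation.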
One interesting observation is that if the coherent risk measure is the total discounted expectation, then Theorem 1 is consistent with the classical result by [altman1999constrained] on constrained MDPs.
Corollary 1.
Proof.
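From the derivation in (17), we observe that the two inequalities are from the application of (a) the subadditivity property of $\rho^\gamma$ and (b) the max-min inequality. Next, we show that in the case of total expectation both of these properties lead to an equality.
(a) Subadditivity property of $\rho^\gamma$: for total expectation, linearity of the expectation gives
$$\mathbb{E}\Big[\sum_{t} \gamma^t c_t\Big] + \sum_{i} \lambda^i\, \mathbb{E}\Big[\sum_{t} \gamma^t d^i_t\Big] = \mathbb{E}\Big[\sum_{t} \gamma^t \Big(c_t + \sum_{i} \lambda^i d^i_t\Big)\Big].$$
Thus, equality holds.
(b) Max-min inequality: in the total expectation case, both the objective function and the constraints are linear in the decision variables $\pi$ and $\boldsymbol{\lambda}$. Therefore, the sixth line in (17) reads as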
(20) 
Since the expression inside the parentheses above is convex in $\pi$ (the risk transition mapping is linear in the policy) and concave (linear) in $\boldsymbol{\lambda}$, the Minimax Theorem [du2013minimax] implies that the following equality holds
In [ahmadi2021aaai], we presented a method based on difference convex programs to solve the optimization problem in Theorem 1 when $\rho$ is an arbitrary coherent risk measure, and we described the specific structure of the optimization problem for conditional expectation, CVaR, and EVaR. In fact, it was shown that this optimization problem can be written in a standard DCP format as
subject to  
(21) 
Optimization problem (21) is a standard DCP [horst1999dc]. DCPs arise in many applications, such as feature selection in machine learning [le2008dc] and inverse covariance estimation in statistics [thai2014inverse]. Although DCPs can be solved globally [horst1999dc], e.g., using branch and bound algorithms [lawler1966branch], a locally optimal solution can be obtained more efficiently based on techniques of nonlinear optimization [Bertsekas99]. In particular, in this work, we use a variant of the convex-concave procedure [lipp2016variations, shen2016disciplined], wherein the concave terms are replaced by a convex upper bound and the resulting convex problem is solved. In fact, the disciplined convex-concave programming (DCCP) [shen2016disciplined] technique linearizes DCP problems into a (disciplined) convex program (carried out automatically via the DCCP Python package [shen2016disciplined]), which is then converted into an equivalent cone program by replacing each function with its graph implementation. Then, the cone program can be solved readily by available convex programming tools, such as CVXPY [diamond2016cvxpy].
V Constrained Risk-Averse POMDPs
Next, we show that, in the case of POMDPs, we can find a lower bound to the solution to Problem 1 via solving an infinite-dimensional optimization problem. Note that a POMDP is equivalent to a belief MDP $\{b_t\}$, $t = 1, 2, \ldots$, where $b_t$ is defined in (2).
Theorem 2.
Proof.
Note that a POMDP can be represented as an MDP over the belief states (2) with initial distribution (1). Hence, a POMDP is a controlled Markov process with states , where the controlled belief transition probability is described as
with
The rest of the proof follows the same steps as the proof of Theorem 1, applied to the belief MDP with the belief transition probability defined above. ∎
Unfortunately, since the belief space is a continuum and the value function must therefore be defined over uncountably many beliefs, the optimization problem in Theorem 2 is infinite-dimensional and we cannot solve it efficiently.
If the one-step coherent risk measure is the total discounted expectation, we can show that the optimization problem in Theorem 2 simplifies to an infinite-dimensional linear program and equality holds in (23). This can be proved following the same lines as the proof of Corollary 1 but for the belief MDP. Hence, Theorem 2 also provides an optimization-based solution to the constrained POMDP problem.
V-A Risk-Averse FSC Synthesis via Policy Iteration
In order to synthesize risk-averse FSCs, we employ a policy iteration algorithm. Policy iteration incrementally improves a controller by alternating between two steps: Policy Evaluation (computing value functions by fixing the policy) and Policy Improvement (computing the policy by fixing the value functions), until convergence to a satisfactory policy [bertsekas76]. For a risk-averse POMDP, policy evaluation can be carried out by solving the optimization problem in Theorem 2. However, as mentioned earlier, this optimization is difficult to use directly, as it must be computed at each (continuous) belief state in the belief space, which is uncountably infinite.
In the following, we show that if, instead of considering policies with infinite memory, we search over finite-memory policies, then we can find suboptimal solutions to Problem 1 that lower-bound $J_\gamma(\kappa_0)$. To this end, we consider stochastic but finite-memory controllers as described in Section II-C.
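Closing the loop around a POMDP $\mathcal{PM}$ with an FSC $\mathcal{G}$ induces a Markov chain. The global Markov chain $\mathcal{MC}^{\mathcal{PM},\mathcal{G}}$ (or simply $\mathcal{MC}$, where the stochastic finite state controller and the POMDP are clear from the context) has executions of the form $\{[s_0, g_0], [s_1, g_1], \ldots\}$, $[s_t, g_t] \in \mathcal{S} \times G$. The probability of the initial global state $[s_0, g_0]$ is
$$\iota_{init}([s_0, g_0]) = \kappa_0(s_0)\, \kappa(g_0 \mid \kappa_0).$$
The state transition probability, $T^{\mathcal{PM},\mathcal{G}}$, is given by
$$T^{\mathcal{PM},\mathcal{G}}\big([s_{t+1}, g_{t+1}] \mid [s_t, g_t]\big) = \sum_{o \in \mathcal{O}} \sum_{\alpha \in Act} O(o \mid s_t)\, \omega(g_{t+1}, \alpha \mid g_t, o)\, T(s_{t+1} \mid s_t, \alpha).$$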
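The closed-loop transition probabilities can be assembled by marginalizing over observations and actions. The following sketch builds the transition matrix of the global Markov chain over pairs of POMDP state and controller state; the function name, the index convention `s * n_g + g` for global states, the layouts `T[a][s][s2]`, `O[s][o]`, `omega[g][o][g2][a]`, and the numerical model are all assumptions made for illustration.

```python
def closed_loop_chain(T, O, omega):
    """Transition matrix of the Markov chain induced by an FSC on a POMDP.

    T[a][s][s2]        : POMDP transition probability
    O[s][o]            : observation probability
    omega[g][o][g2][a] : probability that the FSC moves to internal state g2
                         and emits action a, given internal state g and
                         observation o
    """
    n_a, n_s = len(T), len(T[0])
    n_o, n_g = len(O[0]), len(omega)
    P = [[0.0] * (n_s * n_g) for _ in range(n_s * n_g)]
    for s in range(n_s):
        for g in range(n_g):
            for o in range(n_o):
                for g2 in range(n_g):
                    for a in range(n_a):
                        # Probability of observing o, then choosing (g2, a).
                        w = O[s][o] * omega[g][o][g2][a]
                        for s2 in range(n_s):
                            P[s * n_g + g][s2 * n_g + g2] += w * T[a][s][s2]
    return P


# Hypothetical POMDP (2 states, 2 actions, 2 observations) and 2-state FSC.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
O = [[0.8, 0.2], [0.3, 0.7]]
omega = [
    [[[0.7, 0.1], [0.1, 0.1]], [[0.25, 0.25], [0.25, 0.25]]],
    [[[0.0, 1.0], [0.0, 0.0]], [[0.4, 0.1], [0.3, 0.2]]],
]
P = closed_loop_chain(T, O, omega)
```

Each row of the resulting matrix is a probability distribution over the product state space, which is what allows the finite-dimensional dynamic program to be posed over global states.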
V-B Risk Value Function Computation
Under an FSC, the POMDP is transformed into a Markov chain with design probability distributions $\omega$ and $\kappa$. The closed-loop Markov chain is a controlled Markov process with $q_t = [s_t, g_t]$, $t = 0, 1, \ldots$. In this setting, the total risk functional (8) becomes a function of the initial distribution and the FSC $\mathcal{G}$, i.e.,
(24)
where the states $s_t$ and internal states $g_t$ are drawn from the closed-loop transition probability of the global Markov chain. The constraint functionals $D^i_\gamma$, $i = 1, 2, \ldots, n_c$, can also be defined similarly.
Let the value of Problem 1 under an FSC $\mathcal{G}$ be given. Then, it is evident that this value is no smaller than the value attained by belief-based policies, since FSCs restrict the search space of the policy $\pi$. That is, FSCs can only be as good as the (infinite-dimensional) belief-based policy as the number of internal states grows without bound (infinite memory).
Risk Value Function Optimization: For POMDPs controlled by stochastic finite state controllers, the dynamic program is developed in the global state space $\mathcal{S} \times G$. From Theorem 1, we see that for a given FSC, $\mathcal{G}$, and POMDP, $\mathcal{PM}$, the value function can be computed by solving the following finite-dimensional optimization
subject to  
(25) 
where and