Autonomous systems are increasingly deployed in real-world settings, and the risk stemming from unknown and unforeseen circumstances is rising correspondingly. This calls for autonomous systems that can make appropriately conservative decisions when faced with uncertainty in their environment and behavior. Mathematically speaking, risk can be quantified in numerous ways, such as chance constraints [wang2020non] and distributional robustness [NIPS2010_19f3cd30]. However, applications in autonomy and robotics require more “nuanced assessments of risk” [majumdar2020should]. Artzner et al. [artzner1999coherent] characterized a set of natural properties that are desirable for a risk measure; measures satisfying these properties are called coherent risk measures and have gained widespread acceptance in finance and operations research, among other fields.
A popular model for representing sequential decision making under uncertainty is the Markov decision process (MDP) [Puterman94]. MDPs with coherent risk objectives were studied in [tamar2016sequential, tamar2015policy], where the authors proposed a sampling-based algorithm for finding saddle point solutions using policy gradient methods. However, [tamar2016sequential] requires the risk envelope appearing in the dual representation of the coherent risk measure to be known with an explicit canonical convex programming formulation. While this may be the case for CVaR, mean-semi-deviation, and spectral risk measures [shapiro2014lectures], such an explicit form is not known for general coherent risk measures, such as EVaR. Furthermore, it is not clear whether the saddle point solutions are a lower bound or an upper bound on the optimal value. Also, policy-gradient-based methods require calculating the gradient of the coherent risk measure, which is not available in explicit form in general. For the CVaR measure, MDPs with risk constraints and total expected costs were studied in [prashanth2014policy, chow2014algorithms], where locally optimal solutions were likewise found via policy gradients. However, this method also leads to saddle point solutions (which cannot be shown to be upper bounds or lower bounds of the optimal value) and cannot be applied to general coherent risk measures. In addition, because the objective and the constraints are in terms of different coherent risk measures, the authors assume there exists a policy that satisfies the CVaR constraint (feasibility assumption), which may not be the case in general. Following in the footsteps of [pflug2016time], a promising approach based on approximate value iteration was proposed for MDPs with CVaR objectives in [chow2015risk].
A policy iteration algorithm for finding policies that minimize total coherent risk measures for MDPs was studied in [ruszczynski2010risk], where a computational non-smooth Newton method was also proposed.
When the states of the agent and/or the environment are not directly observable, a partially observable MDP (POMDP) can be used to study decision making under uncertainty introduced by the partial state observability [krishnamurthy2016partially, ahmadi2020control]. POMDPs with coherent risk measure objectives were studied in [fan2018process, fan2018risk]. Despite the elegance of the theory, no computational method was proposed to design policies for general coherent risk measures. In [ahmadi2020risk], we proposed a method for finding finite-state controllers for POMDPs with objectives defined in terms of coherent risk measures, which takes advantage of convex optimization techniques. However, the method can only be used if the risk transition mapping is affine in the policy.
Summary of Contributions: In this paper, we consider MDPs and POMDPs with both objectives and constraints in terms of coherent risk measures. Our contributions are fourfold:
For MDPs, we use the Lagrangian framework and reformulate the problem into an inf-sup problem. For Markov risk transition mappings, we propose an optimization-based method to design Markovian policies that lower-bound the constrained risk-averse problem;
For MDPs, we show that the optimization problems are in the special form of DCPs and can be solved by the DCCP method. We also demonstrate that these results generalize linear programs for constrained MDPs with total discounted expected costs and constraints;
For POMDPs, we demonstrate that, if the coherent risk measures can be defined as a Markov risk transition mapping, an infinite-dimensional optimization can be used to design Markovian belief-based policies, which in theory requires infinite memory to synthesize (in accordance with classical POMDP complexity results);
For POMDPs with stochastic finite-state controllers (FSCs), we show that the latter optimization converts to a (finite-dimensional) DCP and can be solved by the DCCP framework. We incorporate these DCPs in a policy iteration algorithm to design risk-averse FSCs for POMDPs.
We assess the efficacy of the proposed method with numerical experiments involving conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) risk measures.
Preliminary results on risk-averse MDPs were presented in [ahmadi2021aaai]. This paper, in addition to providing detailed proofs and new numerical analysis in the MDP case, generalizes [ahmadi2021aaai] to partially observable systems (POMDPs) with dynamic coherent risk objectives and constraints.
The rest of the paper is organized as follows. In the next section, we briefly review some notions used in the paper. In Section III, we formulate the problem under study. In Section IV, we present the optimization-based method for designing risk-averse policies for MDPs. In Section V, we describe a policy iteration method for designing finite-memory controllers for risk-averse POMDPs. In Section VI, we illustrate the proposed methodology via numerical experiments. Finally, in Section VII, we conclude the paper and give directions for future research.
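Notation: We denote by $\mathbb{R}^n$ the $n$-dimensional Euclidean space and $\mathbb{N}_{\ge 0}$ the set of non-negative integers. Throughout the paper, we use bold font to denote a vector and $(\cdot)^\top$ for its transpose, e.g., $\boldsymbol{a} = (a_1, \ldots, a_n)^\top$, with $n \in \{1, 2, \ldots\}$. For a vector $\boldsymbol{a}$, we use $\boldsymbol{a} \succeq (\preceq)\, \boldsymbol{0}$ to denote element-wise non-negativity (non-positivity) and $\boldsymbol{a} \equiv \boldsymbol{0}$ to show all elements of $\boldsymbol{a}$ are zero. For two vectors $a, b \in \mathbb{R}^n$, we denote their inner product by $\langle a, b \rangle$, i.e., $\langle a, b \rangle = a^\top b$. For a finite set $\mathcal{A}$, we denote its power set by $2^{\mathcal{A}}$, i.e., the set of all subsets of $\mathcal{A}$. For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a constant $p \in [1, \infty)$, $\mathcal{L}_p(\Omega, \mathcal{F}, \mathbb{P})$ denotes the vector space of real valued random variables $c$ for which $\mathbb{E}|c|^p < \infty$.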
In this section, we briefly review some notions and definitions used throughout the paper.
II-A Markov Decision Processes
An MDP is a tuple consisting of a set of states of the autonomous agent(s) and world model, actions available to the agent, a transition function , and describing the initial distribution over the states.
This paper considers finite Markov decision processes, where and are finite sets. For each action the probability of making a transition from state to state under action is given by . The probabilistic components of an MDP must satisfy the following:
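The conditions on the probabilistic components are presumably the usual normalization requirements: every row of the transition function and the initial distribution must be non-negative and sum to one. A minimal sketch of that check follows; the nested-list layouts `T[s][a][s2]` and `kappa0[s]`, as well as the function names, are assumptions made for illustration and do not come from the paper.

```python
def is_distribution(p, tol=1e-9):
    """True if p is a non-negative vector summing to one (up to tol)."""
    return all(x >= -tol for x in p) and abs(sum(p) - 1.0) <= tol


def is_valid_mdp(T, kappa0, tol=1e-9):
    """Check that every T[s][a] and kappa0 are probability distributions.

    T[s][a][s2] and kappa0[s] are hypothetical layouts chosen for this sketch.
    """
    rows_ok = all(
        is_distribution(T[s][a], tol)
        for s in range(len(T))
        for a in range(len(T[s]))
    )
    return rows_ok and is_distribution(kappa0, tol)


# Two states, two actions.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.0, 1.0], [0.5, 0.5]]]
kappa0 = [1.0, 0.0]
valid = is_valid_mdp(T, kappa0)
```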
II-B Partially Observable MDPs
A POMDP is a tuple consisting of an MDP , observations , and an observation model . We consider finite POMDPs, where is a finite set. Then, for each state , an observation is generated independently with probability , which satisfies
In POMDPs, the states are not directly observable. The beliefs , i.e., the probability of being in different states, with
being the set of probability distributions over, for all can be computed using the Bayes’ law as follows:
for all .
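The belief recursion can be sketched as a standard Bayes filter: predict with the transition function, then reweight by the observation likelihood and normalize. The sketch below assumes this standard form and the hypothetical layouts `T[s][a][s2]` and `O[s][o]`; it is an illustration, not the paper's implementation.

```python
def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step: returns the posterior belief over states.

    T[s][a][s2] = probability of moving from s to s2 under action a
    O[s][o]     = probability of observing o in state s
    (both layouts are hypothetical, chosen for this sketch)
    """
    n = len(belief)
    # Predict: push the prior belief through the transition model.
    predicted = [
        sum(T[s][action][s2] * belief[s] for s in range(n)) for s2 in range(n)
    ]
    # Correct: weight each predicted state by the observation likelihood.
    unnormalized = [O[s2][observation] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalized]


# Two states, one action that keeps the state with probability 0.9.
T = [[[0.9, 0.1]], [[0.1, 0.9]]]
O = [[0.8, 0.2], [0.3, 0.7]]
b1 = belief_update([0.5, 0.5], action=0, observation=0, T=T, O=O)
```

Starting from a uniform belief, observing `o = 0` (which is more likely in state 0) shifts the probability mass toward state 0.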
II-C Finite State Control of POMDPs
It is well established that designing optimal policies for POMDPs based on the (continuous) belief states requires uncountably infinite memory or internal states [CassandraKL94, MADANI20035]. This paper focuses on a particular class of POMDP controllers, namely, FSCs.
A stochastic finite state controller for is given by the tuple , where is a finite set of internal states (I-states), is a function of internal stochastic finite state controller states and observation , such that is a probability distribution over . The next internal state and action pair is chosen by independent sampling of . By abuse of notation, will denote the probability of transitioning to internal stochastic finite state controller state and taking action , when the current internal state is and observation is received. chooses the starting internal FSC state , by independent sampling of , given initial distribution of , and will denote the probability of starting the FSC in internal state when the initial POMDP distribution is .
II-D Coherent Risk Measures
Consider a probability space , a filtration , and an adapted sequence of random variables (stage-wise costs) , where . For , we further define the spaces , , and . We assume that the sequence is almost surely bounded (with exceptions having probability zero), i.e.,
In order to describe how one can evaluate the risk of sub-sequence from the perspective of stage , we require the following definitions.
Definition 1 (Conditional Risk Measure).
A mapping , where , is called a conditional risk measure, if it has the following monotonicity property:
Definition 2 (Dynamic Risk Measure).
A dynamic risk measure is a sequence of conditional risk measures , .
One fundamental property of dynamic risk measures is their consistency over time [ruszczynski2010risk, Definition 3]. That is, if will be as good as from the perspective of some future time , and they are identical between time and , then should not be worse than from the perspective at time .
In this paper, we focus on time consistent, coherent risk measures, which satisfy four nice mathematical properties, as defined below [shapiro2014lectures, p. 298].
Definition 3 (Coherent Risk Measure).
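We call the one-step conditional risk measure $\rho_t: \mathcal{C}_{t+1} \to \mathcal{C}_t$ a coherent risk measure if it satisfies the following conditions:
Convexity: $\rho_t(\lambda c + (1-\lambda) c') \le \lambda \rho_t(c) + (1-\lambda) \rho_t(c')$, for all $\lambda \in (0,1)$ and all $c, c' \in \mathcal{C}_{t+1}$;
Monotonicity: If $c \le c'$ then $\rho_t(c) \le \rho_t(c')$ for all $c, c' \in \mathcal{C}_{t+1}$;
Translational Invariance: $\rho_t(c + c') = c + \rho_t(c')$ for all $c \in \mathcal{C}_t$ and $c' \in \mathcal{C}_{t+1}$;
Positive Homogeneity: $\rho_t(\beta c) = \beta \rho_t(c)$ for all $c \in \mathcal{C}_{t+1}$ and $\beta \ge 0$.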
We are interested in the discounted infinite horizon problems. Let be a given discount factor. For , we define the functional
Finally, we have total discounted risk functional defined as
From [ruszczynski2010risk, Theorem 3], we have that is convex, monotone, and positive homogeneous.
II-E Examples of Coherent Risk Measures
Next, we briefly review three examples of coherent risk measures that will be used in this paper.
Total Conditional Expectation: The simplest risk measure is the total conditional expectation given by
It is easy to see that total conditional expectation satisfies the properties of a coherent risk measure as outlined in Definition 3. Unfortunately, total conditional expectation is agnostic to realization fluctuations of the random variable $c$ and is only concerned with the mean value of $c$ over a large number of realizations. Thus, it is a risk-neutral measure of performance.
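Conditional Value-at-Risk: Let $c$ be a random variable. For a given confidence level $\varepsilon \in (0,1)$, value-at-risk ($\mathrm{VaR}_\varepsilon$) denotes the $(1-\varepsilon)$-quantile value of the random variable $c$. Unfortunately, working with VaR for non-normal random variables is numerically unstable and optimizing models involving VaR is intractable in high dimensions [rockafellar2000optimization].
In contrast, CVaR overcomes the shortcomings of VaR. CVaR with confidence level $\varepsilon \in (0,1)$, denoted $\mathrm{CVaR}_\varepsilon$, measures the expected loss in the $(1-\varepsilon)$-tail given that the particular threshold $\mathrm{VaR}_\varepsilon$ has been crossed, i.e., $\mathrm{CVaR}_\varepsilon(c) = \mathbb{E}[c \mid c \ge \mathrm{VaR}_\varepsilon(c)]$. An optimization formulation for CVaR was proposed in [rockafellar2000optimization]. That is, $\mathrm{CVaR}_\varepsilon$ is given by
$$\mathrm{CVaR}_\varepsilon(c) = \inf_{\zeta \in \mathbb{R}} \mathbb{E}\left[\zeta + \frac{(c-\zeta)_+}{\varepsilon}\right],$$
where $(\cdot)_+ = \max\{\cdot, 0\}$. A value of $\varepsilon \to 1$ corresponds to a risk-neutral case, i.e., $\mathrm{CVaR}_1(c) = \mathbb{E}(c)$; whereas, a value of $\varepsilon \to 0$ is rather a risk-averse case, i.e., $\mathrm{CVaR}_0(c) = \mathrm{VaR}_0(c) = \operatorname{ess\,sup}(c)$ [rockafellar2002conditional]. Figure 1 illustrates these notions for an example variable with distribution .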
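For a cost with finitely many outcomes, the infimum representation of CVaR can be evaluated exactly, because the objective is convex and piecewise linear in the auxiliary variable with kinks only at the support points. The sketch below assumes the convention used here (a confidence level near 1 is risk-neutral, near 0 is worst-case); the function name and the example data are made up for illustration.

```python
def cvar(values, probs, eps):
    """CVaR_eps of a discrete cost via inf_z { z + E[(c - z)_+] / eps }.

    The objective is convex and piecewise linear in z, with kinks only at the
    support points, so it suffices to evaluate it there and take the minimum.
    """
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")

    def objective(z):
        tail = sum(p * max(c - z, 0.0) for c, p in zip(values, probs))
        return z + tail / eps

    return min(objective(z) for z in values)


costs = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]
mean = sum(c * p for c, p in zip(costs, probs))
risk_neutral = cvar(costs, probs, eps=1.0)   # coincides with the mean
tail_risk = cvar(costs, probs, eps=0.1)      # average of the worst 10% of outcomes
```

In this example the rare outcome of cost 10 barely moves the mean but fully determines the value at the 10% level, which is the behavior the risk-averse formulation is meant to capture.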
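Entropic Value-at-Risk: Unfortunately, CVaR ignores the losses below the VaR threshold. EVaR is the tightest upper bound in the sense of Chernoff inequality for VaR and CVaR and its dual representation is associated with the relative entropy. In fact, it was shown in [ahmadi2017analytical] that $\mathrm{EVaR}_\varepsilon$ and $\mathrm{CVaR}_\varepsilon$ are equal only if there are no losses below the $\mathrm{VaR}_\varepsilon$ threshold. In addition, EVaR is a strictly monotone risk measure; whereas, CVaR is only monotone [ahmadi2019portfolio]. $\mathrm{EVaR}_\varepsilon$ is given by
$$\mathrm{EVaR}_\varepsilon(c) = \inf_{\zeta > 0} \left[\frac{1}{\zeta} \log\left(\frac{\mathbb{E}[e^{\zeta c}]}{\varepsilon}\right)\right].$$
Similar to $\mathrm{CVaR}_\varepsilon$, for $\mathrm{EVaR}_\varepsilon$, $\varepsilon \to 1$ corresponds to a risk-neutral case; whereas, $\varepsilon \to 0$ corresponds to a risk-averse case. In fact, it was demonstrated in [ahmadi2012entropic, Proposition 3.2] that $\lim_{\varepsilon \to 0} \mathrm{EVaR}_\varepsilon(c) = \operatorname{ess\,sup}(c)$.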
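EVaR has no closed form for a general discrete cost, but its defining infimum is a one-dimensional problem that is unimodal in the auxiliary variable, so a simple line search is enough. The sketch below assumes the log-moment-generating-function form of EVaR and uses a golden-section search over the logarithm of the auxiliary variable; the search bounds, iteration count, and names are illustrative choices, not the paper's.

```python
import math


def evar(values, probs, eps, iters=200):
    """EVaR_eps of a discrete cost: inf over z > 0 of log(E[exp(z*c)] / eps) / z."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    c_max = max(values)

    def objective(log_z):
        z = math.exp(log_z)
        # Shift by c_max before exponentiating so that exp never overflows.
        mgf = sum(p * math.exp(z * (c - c_max)) for c, p in zip(values, probs))
        return c_max + (math.log(mgf) - math.log(eps)) / z

    # Golden-section search over log z; the bounds are an assumed search window.
    lo, hi = math.log(1e-6), math.log(1e3)
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if objective(a) <= objective(b):
            hi = b
        else:
            lo = a
    return objective(0.5 * (lo + hi))


costs = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]
evar_half = evar(costs, probs, eps=0.5)
evar_tenth = evar(costs, probs, eps=0.1)
```

For these data the 50% level gives a value strictly between the corresponding CVaR (2.8) and the worst-case cost (10), consistent with EVaR being an upper bound on CVaR at the same level.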
III Problem Formulation
In the past two decades, coherent risk and dynamic risk measures have been developed and used in microeconomics and mathematical finance fields [vose2008risk]. Generally speaking, risk-averse decision making is concerned with the behavior of agents, e.g. consumers and investors, who, when exposed to uncertainty, attempt to lower that uncertainty. The agents may avoid situations with unknown payoffs, in favor of situations with payoffs that are more predictable.
The core idea in risk-averse planning is to replace the conventional risk-neutral conditional expectation of the cumulative cost objectives with the more general coherent risk measures. In path planning scenarios, in particular, we will show in our numerical experiments that considering coherent risk measures leads to significantly greater robustness to environment uncertainty and to collisions that cause mission failures.
In addition to total cost risk aversion, an agent is often subject to constraints, e.g., fuel, communication, or energy budgets. These constraints can also represent mission objectives, e.g., explore an area or reach a goal.
Consider a stationary controlled Markov process , (an MDP or a POMDP) with initial probability distribution , wherein policies, transition probabilities, and cost functions do not depend explicitly on time. Each policy leads to cost sequences , and , , . We define the dynamic risk of evaluating the -discounted cost of a policy as
and the -discounted dynamic risk constraints of executing policy as
where is defined in equation (4), , and , , are given constants. We assume that and , , are non-negative and upper-bounded. For a discount factor , an initial condition , and a policy , we infer from [ruszczynski2010risk, Theorem 3] that both and are well-defined (bounded), if and are bounded.
In this work, we are interested in addressing the following problem:
For MDPs, [chow2015risk, osogami2012robustness] show that such coherent risk measure objectives can account for modeling errors and parametric uncertainties. We can also interpret Problem 1 as designing policies that minimize the accrued costs in a risk-averse sense and at the same time ensuring that the system constraints, e.g., fuel constraints, are not violated even in the rare but costly scenarios.
Note that in Problem 1 both the objective function and the constraints are in general non-differentiable and non-convex in the policy (with the exception of total expected cost as the coherent risk measure [altman1999constrained]). Therefore, finding optimal policies is in general intractable. Instead, we find sub-optimal policies by taking advantage of a Lagrangian formulation and then using an optimization form of Bellman’s equations.
Next, we show that the constrained risk-averse problem is equivalent to an unconstrained inf-sup risk-averse problem thanks to the Lagrangian method.
Let be the value of Problem 1 for a given initial distribution and discount factor . Then, (i) the value function satisfies
is the Lagrangian function.
(ii) Furthermore, a policy is optimal for Problem 1, if and only if .
(i) If for some Problem 1 is not feasible, then . In fact, if the th constraint is not satisfied, i.e., , we can achieve the latter supremum by choosing , while keeping the rest of s constant or zero. If Problem 1 is feasible for some , then the supremum is achieved by setting . Hence, and
which implies (i).
(ii) If is optimal, then, from (11), we have
Conversely, if for some , then from (11), we have . Hence, is the optimal policy. ∎
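The argument above can be summarized in one display. The symbols are assumed for illustration, since the original equations did not survive extraction: $J(\pi)$ stands for the risk objective, $D(\pi)$ for the vector of risk constraint functionals, $\beta$ for the vector of constraint bounds, and $\lambda$ for the multipliers.

```latex
J^\ast \;=\; \inf_{\pi}\, \sup_{\lambda \succeq 0}\, L(\pi, \lambda),
\qquad
L(\pi, \lambda) \;=\; J(\pi) + \langle \lambda,\, D(\pi) - \beta \rangle,
\qquad
\sup_{\lambda \succeq 0} L(\pi, \lambda) \;=\;
\begin{cases}
J(\pi), & D(\pi) \preceq \beta \quad (\text{attained at } \lambda = 0),\\[2pt]
+\infty, & \text{otherwise}.
\end{cases}
```

The inner supremum therefore acts as an exact penalty: infeasible policies are assigned infinite cost, and feasible policies are scored by the original objective alone.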
IV Constrained Risk-Averse MDPs
At any time , the value of is -measurable and is allowed to depend on the entire history of the process; hence, we cannot expect to obtain a Markov optimal policy [ott2010markov, bauerle2011markov]. In order to obtain Markov policies, we need the following property [ruszczynski2010risk].
Definition 4 (Markov Risk Measure).
Let such that and . A one-step conditional risk measure is a Markov risk measure with respect to the controlled Markov process , , if there exists a risk transition mapping such that for all and , we have
where is called the controlled kernel.
In fact, if is a coherent risk measure, also satisfies the properties of a coherent risk measure (Definition 3). In this paper, since we are concerned with MDPs, the controlled kernel is simply the transition function .
The one-step coherent risk measure is a Markov risk measure.
The simplest case of the risk transition mapping is in the conditional expectation case , i.e.,
Note that in the total discounted expectation case is a linear function in rather than a convex function, which is the case for a general coherent risk measure. For example, for the CVaR risk measure, the Markov risk transition mapping is given by
where is a convex function in .
If is a coherent, Markov risk measure, then the Markov policies are sufficient to ensure optimality [ruszczynski2010risk].
In the next result, we show that we can find a lower bound to the solution to Problem 1 via solving an optimization problem.
From Proposition 1, we know that (11) holds. Hence, we have
where, in the fourth, fifth, and sixth inequalities above, we used the positive homogeneity property of , the sub-additivity property of , and the minimax inequality, respectively. Since does not depend on , to find the solution to the infimum it suffices to find the solution to
where . The value of the above optimization can be obtained by solving the following Bellman equation [ruszczynski2010risk, Theorem 4]
Next, we show that the solution to the above Bellman equation can be alternatively obtained by solving the convex optimization
and for all . From [ruszczynski2010risk, Lemma 1], we infer that and are non-decreasing; i.e., for , we have and . Therefore, if , then . By repeated application of , we obtain
Any feasible solution to (IV) must satisfy and hence must satisfy . Thus, given that all entries of are positive, is the optimal solution to (IV). Substituting (IV) back in the last inequality in (IV) yields the result. ∎
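The Bellman recursion used in the proof can also be iterated directly, which gives a quick numerical check on small problems. The sketch below is plain value iteration, not the optimization-based method of this section: it assumes CVaR as the one-step risk measure, treats the stage cost as already including any multiplier-weighted constraint costs, and uses the hypothetical layouts `T[s][a][s2]` and `cost[s][a]`.

```python
def cvar(values, probs, eps):
    """CVaR_eps of a discrete cost via inf_z { z + E[(c - z)_+] / eps }."""
    def objective(z):
        return z + sum(p * max(c - z, 0.0) for c, p in zip(values, probs)) / eps
    return min(objective(z) for z in values)


def risk_averse_value_iteration(T, cost, gamma, eps, tol=1e-10, max_iter=10_000):
    """Iterate V(s) <- min_a { cost[s][a] + gamma * CVaR_eps(V(s') | s, a) }.

    Returns the value vector and a greedy Markov policy (one action per state).
    """
    n_states, n_actions = len(T), len(T[0])
    V = [0.0] * n_states

    def q(s, a, V):
        return cost[s][a] + gamma * cvar(V, T[s][a], eps)

    for _ in range(max_iter):
        V_new = [min(q(s, a, V) for a in range(n_actions)) for s in range(n_states)]
        done = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if done:
            break
    policy = [min(range(n_actions), key=lambda a: q(s, a, V)) for s in range(n_states)]
    return V, policy


# State 0: start, state 1: goal (absorbing, free), state 2: crash (absorbing, costly).
# Action 0 is safe but expensive; action 1 is cheap but crashes with probability 0.05.
T = [[[0.0, 1.0, 0.0], [0.0, 0.95, 0.05]],
     [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
     [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]]
cost = [[2.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
V_neutral, pi_neutral = risk_averse_value_iteration(T, cost, gamma=0.9, eps=1.0)
V_averse, pi_averse = risk_averse_value_iteration(T, cost, gamma=0.9, eps=0.1)
```

With the risk-neutral level the cheap but risky action is preferred in the start state, whereas at the 10% level the rare crash dominates the tail and the safe action is selected.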
Once the values of and are found by solving optimization problem (1), we can find the policy as
One interesting observation is that if the coherent risk measure is the total discounted expectation, then Theorem 1 is consistent with the classical result by [altman1999constrained] on constrained MDPs.
From the derivation in (IV), we observe the two inequalities are from the application of (a) the sub-additivity property of and (b) the max-min inequality. Next, we show that in the case of total expectation both of these properties lead to an equality.
(a) Sub-additivity property of : for total expectation, we have
Thus, equality holds.
(b) Max-min inequality: in the case, both the objective function and the constraints are linear in the decision variables and . Therefore, the sixth line in (IV) reads as
Since the expression inside the parentheses above is convex in ( is linear in the policy) and concave (linear) in , from the Minimax Theorem [du2013minimax], we have that the following equality holds
In [ahmadi2021aaai], we presented a method based on difference convex programs to solve (1), when is an arbitrary coherent risk measure and we described the specific structure of the optimization problem for conditional expectation, CVaR, and EVaR. In fact, it was shown that (1) can be written in a standard DCP format as
Optimization problem (IV) is a standard DCP [horst1999dc]. DCPs arise in many applications, such as feature selection in machine learning [le2008dc] and inverse covariance estimation in statistics [thai2014inverse]. Although DCPs can be solved globally [horst1999dc], e.g., using branch and bound algorithms [lawler1966branch], a locally optimal solution can be obtained more efficiently using techniques of nonlinear optimization [Bertsekas99]. In particular, in this work, we use a variant of the convex-concave procedure [lipp2016variations, shen2016disciplined], wherein the concave terms are replaced by a convex upper bound and the resulting problem is solved. In fact, the disciplined convex-concave programming (DCCP) [shen2016disciplined] technique linearizes DCP problems into a (disciplined) convex program (carried out automatically via the DCCP Python package [shen2016disciplined]), which is then converted into an equivalent cone program by replacing each function with its graph implementation. Then, the cone program can be solved readily by available convex programming solvers, accessed through modeling tools such as CVXPY [diamond2016cvxpy].
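The convex-concave procedure itself is easy to illustrate on a toy problem. The sketch below is a hand-rolled one-dimensional example and does not use the DCCP package: it minimizes the difference of two convex functions, $x^4 - x^2$, by repeatedly replacing the subtracted term with its tangent at the current iterate, which yields an affine upper bound on the concave part and a convex subproblem with a closed-form minimizer. The test functions and names are chosen only for illustration.

```python
def objective(x):
    """Difference of convex functions f(x) - g(x) with f = x**4 and g = x**2."""
    return x**4 - x**2


def convex_concave_procedure(x0, tol=1e-12, max_iter=100):
    """Minimize x**4 - x**2 by linearizing g(x) = x**2 at each iterate.

    The surrogate x**4 - (g(x_k) + 2*x_k*(x - x_k)) is convex and upper-bounds
    the objective; its minimizer solves 4*x**3 = 2*x_k in closed form.
    """
    x, history = x0, [objective(x0)]
    for _ in range(max_iter):
        slope = 2.0 * x  # derivative of g at the current iterate
        # Real cube root of slope / 4 (also valid for negative iterates).
        magnitude = (abs(slope) / 4.0) ** (1.0 / 3.0)
        x_next = magnitude if slope >= 0 else -magnitude
        history.append(objective(x_next))
        converged = abs(x_next - x) < tol
        x = x_next
        if converged:
            break
    return x, history


x_star, history = convex_concave_procedure(x0=2.0)
```

Starting from a positive point, the iterates approach the local minimizer $1/\sqrt{2}$ and the objective values never increase, which is the descent property that makes the procedure a practical local method.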
V Constrained Risk-Averse POMDPs
Next, we show that, in the case of POMDPs, we can find a lower bound to the solution to Problem 1 via solving an infinite-dimensional optimization problem. Note that a POMDP is equivalent to a belief MDP , , where is defined in (2).
Note that a POMDP can be represented as an MDP over the belief states (2) with initial distribution (1). Hence, a POMDP is a controlled Markov process with states , where the controlled belief transition probability is described as
The rest of the proof follows the same steps as the proof of Theorem 1, applied to the belief MDP with as defined above. ∎
Unfortunately, since and hence , optimization (2) is infinite-dimensional and we cannot solve it efficiently.
If the one-step coherent risk measure is the total discounted expectation, we can show that optimization problem (2) simplifies to an infinite-dimensional linear program and equality holds in (23). This can be proved following the same lines as the proof of Corollary 1 but for the belief MDP. Hence, Theorem 2 also provides an optimization based solution to the constrained POMDP problem.
V-A Risk-Averse FSC Synthesis via Policy Iteration
In order to synthesize risk-averse FSCs, we employ a policy iteration algorithm. Policy iteration incrementally improves a controller by alternating between two steps: Policy Evaluation (computing value functions by fixing the policy) and Policy Improvement (computing the policy by fixing the value functions), until convergence to a satisfactory policy [bertsekas76]. For a risk-averse POMDP, policy evaluation can be carried out by solving (2). However, as mentioned earlier, (2) is difficult to use directly as it must be computed at each (continuous) belief state in the belief space, which is uncountably infinite.
In the following, we show that if, instead of considering infinite-memory policies, we search over finite-memory policies, then we can find suboptimal solutions to Problem 1 that lower-bound . To this end, we consider stochastic but finite-memory controllers as described in Section II-C.
Closing the loop around a POMDP with an FSC induces a Markov chain. The global Markov chain (or simply , where the stochastic finite state controller and the POMDP are clear from the context) has execution . The probability of initial global state is
The state transition probability, , is given by
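The closed-loop chain can be assembled explicitly for small models. The sketch below assumes the usual composition in which an observation is generated from the current state, the FSC then samples the next internal state and action jointly, and the POMDP transitions under that action. The layouts `T[s][a][s2]`, `O[s][o]`, and `omega[g][o][g2][a]` are hypothetical and chosen only for this illustration.

```python
from itertools import product


def closed_loop_chain(T, O, omega):
    """Transition matrix of the Markov chain over global states (s, g).

    T[s][a][s2]        : POMDP transition probability under action a
    O[s][o]            : probability of observing o in state s
    omega[g][o][g2][a] : FSC probability of moving to I-state g2 and taking
                         action a, given I-state g and observation o
    (all layouts are hypothetical, chosen for this sketch)
    """
    n_s, n_a = len(T), len(T[0])
    n_o, n_g = len(O[0]), len(omega)
    states = list(product(range(n_s), range(n_g)))
    P = [[0.0] * len(states) for _ in states]
    for i, (s, g) in enumerate(states):
        for j, (s2, g2) in enumerate(states):
            P[i][j] = sum(
                O[s][o] * omega[g][o][g2][a] * T[s][a][s2]
                for o in range(n_o)
                for a in range(n_a)
            )
    return states, P


T = [[[0.9, 0.1], [0.5, 0.5]],
     [[0.1, 0.9], [0.5, 0.5]]]
O = [[0.8, 0.2], [0.3, 0.7]]
omega = [
    [[[0.7, 0.1], [0.1, 0.1]], [[0.25, 0.25], [0.25, 0.25]]],
    [[[0.0, 1.0], [0.0, 0.0]], [[0.4, 0.1], [0.3, 0.2]]],
]
states, P = closed_loop_chain(T, O, omega)
```

Each row of `P` sums to one whenever `T`, `O`, and `omega` are properly normalized, which is a convenient sanity check before evaluating any risk functional on the chain.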
V-B Risk Value Function Computation
Under an FSC, the POMDP is transformed into a Markov chain with design probability distributions and . The closed-loop Markov chain is a controlled Markov process with , . In this setting, the total risk functional (8) becomes a function of and FSC , i.e.,
where s and s are drawn from the probability distribution . The constraint functionals , can also be defined similarly.
Let be the value of Problem 1 under an FSC . Then, it is evident that , since FSCs restrict the search space of the policy . That is, they can only be as good as the (infinite-dimensional) belief-based policy as the number of internal states tends to infinity (infinite memory).
Risk Value Function Optimization: For POMDPs controlled by stochastic finite state controllers, the dynamic program is developed in the global state space . From Theorem 1, we see that for a given FSC, , and POMDP , the value function can be computed by solving the following finite dimensional optimization