I Introduction
With the rise of autonomous systems being deployed in real-world settings, the associated risk that stems from unknown and unforeseen circumstances is correspondingly on the rise. In particular, in safety-critical scenarios, such as aerospace applications, decision making should account for risk. For example, spacecraft control technology relies heavily on a relatively large and highly skilled mission operations team that generates detailed time-ordered and event-driven sequences of commands. This approach will not be viable in the future, given the increasing number of missions and a desire to limit operations team and Deep Space Network (DSN) costs. Future spaceflight missions will operate at large distances and light-time delays from Earth, requiring novel capabilities for astronaut crews and ground operators to manage spacecraft consumables, such as power, water, propellant, and life support systems, to prevent mission failure. In order to maximize the science return under these conditions, the abilities to deal with emergencies and to safely explore remote regions are becoming more and more important [mcghan2016resilient]. Even in Mars rover navigation problems, finding planning policies that minimize risk is of utmost importance due to the uncertainties present in Mars surface data [ono2018mars], as illustrated in Figure 1.
Risk can be quantified in numerous ways. For example, mission risks can be mathematically characterized in terms of chance constraints [ono2013probabilistic, ono2012closed, ono2015chance]. The preference of one risk measure over another depends on factors such as sensitivity to rare events, ease of estimation from data, and computational tractability. Artzner et al. [artzner1999coherent] characterized a set of natural properties that are desirable for a risk measure, called a coherent risk measure, which has since gained widespread acceptance in finance and operations research, among other fields. An important example of a coherent risk measure is the conditional value-at-risk (CVaR), which has received significant attention in decision making problems such as Markov decision processes (MDPs) [chow2015risk, chow2014algorithms, prashanth2014policy, bauerle2011markov]. General coherent risk measures for MDPs were studied in [ruszczynski2010risk], wherein it was further assumed that the risk measure is time consistent, a property analogous to the dynamic programming principle. Following the footsteps of the latter contribution, [tamar2016sequential] proposed a sampling-based algorithm for MDPs with static and dynamic coherent risk measures using policy gradient and actor-critic methods, respectively (see also a model predictive control technique for linear dynamical systems with coherent risk objectives [singh2018framework]). However, in many aerospace applications, sensing constraints do not allow for full-state observation, and decision making involves partial observation [ahmadi2019safe, nilsson2018toward]. These problems can be represented as a partially observable Markov decision process (POMDP), where decision making is subject to uncertainty stemming from stochastic outcomes as well as partial observation [krishnamurthy2016partially]. In this paper, we propose a method based on bounded policy iteration to design suboptimal risk-averse policies for POMDPs. To this end, we first discuss that the problem of designing risk-averse optimal policies is undecidable in general. Then, we show that a stochastic but finite-memory controller can be synthesized to upper-bound the dynamic risk.
Given a memory budget, we propose a policy iteration method to synthesize these finite-state controllers, which can increase the number of memory states to improve risk-aversion. We illustrate our proposed method with a numerical example of path planning under uncertainty.
The rest of the paper is organized as follows. The next section reviews some preliminary notions and definitions used in the sequel. In Section III, we discuss POMDPs with coherent risk measures. In Section IV, we propose suboptimal stochastic finite state controllers that minimize the upper bound on the coherent risk. In Section V, a bounded policy iteration algorithm is formulated to design risk-averse stochastic finite state controllers. In Section VI, we elucidate our results with a numerical example. Finally, in Section VII, we conclude the paper and give directions for future research.
II Preliminaries
In this section, we briefly review some notions and definitions used throughout the paper.
II-A Markov Chains
A (discrete-time) Markov chain is a tuple MC = (S, P, ι), where S is a finite set of states, P is the transition probability defined as the conditional distribution P(s' | s) such that Σ_{s'∈S} P(s' | s) = 1 for all s ∈ S, and ι is the initial distribution such that Σ_{s∈S} ι(s) = 1. An infinite path of the Markov chain, denoted ω = s_0 s_1 s_2 ⋯ ∈ S^ω, is a sequence of states such that P(s_{i+1} | s_i) > 0 for all i ∈ N and ι(s_0) > 0. The probability space over such paths is defined as follows. The sample space Ω is the set of infinite paths with initial state s_0, i.e., Ω = {ω ∈ S^ω : ω(0) = s_0}. The σ-algebra F is the least σ-algebra on Ω containing the cylinder sets, where the cylinder set Cyl(s_0 s_1 ⋯ s_n) is the set of all infinite paths with prefix s_0 s_1 ⋯ s_n. Finally, in order to specify the probability measure Pr over all sets of events in F, it is sufficient to provide the probability of each cylinder set, which can be computed as

Pr(Cyl(s_0 s_1 ⋯ s_n)) = ι(s_0) ∏_{i=0}^{n−1} P(s_{i+1} | s_i).

Once the probability measure is defined over the cylinder sets, the expectation operator E is also uniquely defined. In the sequel, we remove the subscript MC whenever the Markov chain is clear from the context.

II-B Partially Observable Markov Decision Process
Definition 1 (POMDP)
A POMDP, PM, consists of:

S, the states of the autonomous agent(s) and world model;

A, the actions available to the robot;

O, the observations;

T(s' | s, a), a transition function;

c(s, a), a cost for each state s ∈ S and action a ∈ A.
This paper considers finite POMDPs, where S, A, and O are finite sets. For each action a ∈ A, the probability of making a transition from state s to state s' under action a is given by T(s' | s, a). For each state s, an observation o ∈ O is generated independently with probability O(o | s). The starting world state is given by the distribution p_0(s). The probabilistic components of a POMDP model must satisfy

Σ_{s'∈S} T(s' | s, a) = 1, ∀s ∈ S, a ∈ A,   Σ_{o∈O} O(o | s) = 1, ∀s ∈ S,   Σ_{s∈S} p_0(s) = 1.
Given a POMDP, we can define beliefs, or distributions over states, at each time step to keep track of sufficient statistics with a finite description [astrom65]. The beliefs b_t ∈ Δ(S), with Δ(S) being the set of probability distributions over S, for all t ∈ {0, 1, 2, …} can be computed using Bayes' law as follows:

b_0(s) = p_0(s) O(o_0 | s) / Σ_{s'∈S} p_0(s') O(o_0 | s'),   (1)

b_t(s) = O(o_t | s) Σ_{s'∈S} T(s | s', a_{t−1}) b_{t−1}(s') / ( Σ_{s''∈S} O(o_t | s'') Σ_{s'∈S} T(s'' | s', a_{t−1}) b_{t−1}(s') ),   (2)

for all t ∈ {1, 2, …}. It is also worth mentioning that (2) is referred to as the belief update equation.
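As an illustration, the belief update equation can be sketched in a few lines. The dictionary-based representation below (the names `T`, `Obs`, `belief_update`, and the toy model in the usage note are our own, not from the paper) is a minimal Bayes-filter step under the assumption that the observation probability of the received observation is positive:

```python
def belief_update(b, a, o, T, Obs, states):
    """One step of the belief update, Eq. (2) (illustrative sketch).

    b:    dict state -> probability (current belief b_{t-1})
    T:    T[s][a][s2] = transition probability T(s2 | s, a)
    Obs:  Obs[s][o]   = observation likelihood O(o | s)
    """
    # Predict: push the belief through the transition model.
    pred = {s2: sum(T[s][a].get(s2, 0.0) * b[s] for s in states) for s2 in states}
    # Correct: weight by the observation likelihood and renormalize.
    unnorm = {s2: Obs[s2][o] * pred[s2] for s2 in states}
    z = sum(unnorm.values())  # Pr(o | b, a); assumed positive here
    return {s2: unnorm[s2] / z for s2 in states}
```

The returned dictionary is again a probability distribution over S, so the update can be iterated as new actions and observations arrive.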
II-C Stochastic Finite State Control of POMDPs
It is well established that designing optimal policies for POMDPs based on the (continuous) belief states requires uncountably infinite memory or internal states [CassandraKL94, KLC98, MADANI20035]. This paper focuses on a particular class of POMDP controllers, namely, stochastic finite state controllers. These controllers lead to a finite-state-space Markov chain for the closed-loop controlled system.
Definition 2 (Stochastic Finite State Controller)
Let PM be a POMDP with observations O, actions A, and initial distribution p_0. A stochastic finite state controller for PM is given by the tuple G = (G, ω, κ), where

G is a finite set of internal states (I-states);

ω is a function of the internal stochastic finite state controller state g ∈ G and observation o ∈ O, such that ω(g, o) is a probability distribution over G × A. The next internal state and action pair (g', a) is chosen by independent sampling of ω(g, o). By abuse of notation, ω(g', a | g, o) will denote the probability of transitioning to internal stochastic finite state controller state g' and taking action a, when the current internal state is g and observation o is received;

κ chooses the starting internal FSC state g_0 by independent sampling of κ(p_0), given the initial distribution p_0 of PM. κ(g | p_0) will denote the probability of starting the FSC in internal state g when the initial POMDP distribution is p_0.
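For intuition, executing one step of such a controller amounts to jointly sampling an (I-state, action) pair from ω. A minimal sketch, assuming a dictionary representation of ω keyed by (g, o) (our own convention, not from the paper):

```python
import random

def fsc_step(g, o, omega, rng=random):
    """Sample the next (I-state, action) pair from omega(g', a | g, o).

    omega[(g, o)] is a dict mapping (g', a) pairs to probabilities.
    """
    pairs = list(omega[(g, o)].keys())
    probs = [omega[(g, o)][p] for p in pairs]
    # random.choices performs the independent sampling of omega(g, o)
    return rng.choices(pairs, weights=probs, k=1)[0]
```

A deterministic controller is recovered as the special case where each ω(g, o) puts all its mass on a single (g', a) pair.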
Closing the loop around a POMDP with a stochastic finite state controller yields the following transition system.
Definition 3 (Global Markov Chain)
Let POMDP PM have state space S and let G be the I-states of stochastic finite state controller G. The global Markov chain MC (where the stochastic finite state controller and the POMDP are clear from the context) with execution (s_0, g_0), (s_1, g_1), … evolves as follows:

The probability of the initial global state (s_0, g_0) is

ι(s_0, g_0) = p_0(s_0) κ(g_0 | p_0).

The state transition probability, P((s', g') | (s, g)), is given by

P((s', g') | (s, g)) = Σ_{o∈O} O(o | s) Σ_{a∈A} ω(g', a | g, o) T(s' | s, a).
Note that the global Markov chain arising from a finite state space POMDP also has a finite state space.
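To make the construction concrete, the global transition probabilities can be assembled by summing over observations and actions as in the definition above. The sketch below reuses the dictionary conventions from the earlier snippets (all names are ours, not the paper's):

```python
from itertools import product

def global_chain(states, istates, actions, obs, T, Obs, omega):
    """Build P((s', g') | (s, g)) = sum_o O(o|s) sum_a omega(g', a|g, o) T(s'|s, a)."""
    P = {}
    for s, g in product(states, istates):
        for s2, g2 in product(states, istates):
            P[(s, g), (s2, g2)] = sum(
                Obs[s][o] * sum(omega[(g, o)].get((g2, a), 0.0) * T[s][a][s2]
                                for a in actions)
                for o in obs)
    return P
```

Each row of the resulting matrix sums to one, so standard finite-state Markov chain machinery applies directly to the closed-loop system.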
II-D Coherent Risk Measures
Consider a probability space (Ω, F, Pr), a filtration F_0 ⊂ F_1 ⊂ ⋯ ⊂ F_T ⊂ F, and an adapted sequence of random variables (stage-wise costs) Z_t, t = 0, 1, …, T, where T ∈ N ∪ {∞}. We further define the spaces Z_t, t = 0, …, T, of F_t-measurable random variables, and let Z_{t,T} = Z_t × ⋯ × Z_T and Z = Z_0 × Z_1 × ⋯. We further assume that the sequence {Z_t} is almost surely bounded, i.e.,

max_t |Z_t| ≤ c̄ almost surely, for some constant c̄.

In order to describe how one can evaluate the risk of the subsequence Z_t, …, Z_T from the perspective of stage t, we require the following definitions.
Definition 4 (Conditional Risk Measure)
A mapping ρ_{t,T} : Z_{t,T} → Z_t, where 0 ≤ t ≤ T, is called a conditional risk measure, if it has the following monotonicity property:

ρ_{t,T}(Z) ≤ ρ_{t,T}(W), ∀Z, W ∈ Z_{t,T} such that Z ≤ W,   (3)

where the inequalities should be understood componentwise.
Definition 5 (Dynamic Risk Measure)
A dynamic risk measure is a sequence of conditional risk measures ρ_{t,T} : Z_{t,T} → Z_t, t = 0, …, T.
One fundamental property of dynamic risk measures is their consistency over time. That is, if Z will be as good as W from the perspective of some future time θ, and they are identical between time τ and θ, then Z should not be worse than W from the perspective of time τ.
Definition 6 (TimeConsistent Risk Measure)
A dynamic risk measure {ρ_{t,T}}_{t=0}^{T} is called time-consistent if, for all 0 ≤ τ < θ ≤ T and all sequences Z, W ∈ Z_{τ,T}, the conditions

Z_k = W_k, k = τ, …, θ − 1, and ρ_{θ,T}(Z_θ, …, Z_T) ≤ ρ_{θ,T}(W_θ, …, W_T)

imply

ρ_{τ,T}(Z_τ, …, Z_T) ≤ ρ_{τ,T}(W_τ, …, W_T).   (4)
If a risk measure is time-consistent, we can define the one-step conditional risk measure ρ_t : Z_{t+1} → Z_t as follows:

ρ_t(Z_{t+1}) = ρ_{t,t+1}(0, Z_{t+1}),   (5)

and for all t = 0, …, T − 1, we obtain:

ρ_{t,T}(Z_t, …, Z_T) = Z_t + ρ_t( Z_{t+1} + ρ_{t+1}( Z_{t+2} + ⋯ + ρ_{T−1}(Z_T) ⋯ ) ).   (6)

Note that a time-consistent risk measure is completely defined by its one-step conditional risk measures ρ_t, t = 0, …, T − 1, and, in particular, for t = 0, (6) defines a risk measure of the entire sequence Z ∈ Z_{0,T}.
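The nesting in (6) can be evaluated by a backward recursion. The sketch below does this on a finite scenario tree (our own toy representation, not from the paper), plugging in an arbitrary one-step measure over the children of each node; with the conditional expectation as the one-step measure, the recursion reduces to the ordinary expected total cost:

```python
def nested_risk(node, one_step_risk):
    """Evaluate Z_t + rho_t(Z_{t+1} + rho_{t+1}(...)) on a finite scenario tree.

    node = (cost, children), where children is a list of (prob, child_node).
    one_step_risk(probs, values) plays the role of the measure rho_t.
    """
    cost, children = node
    if not children:
        return cost
    probs = [p for p, _ in children]
    vals = [nested_risk(ch, one_step_risk) for _, ch in children]
    return cost + one_step_risk(probs, vals)

# Two admissible one-step measures: expectation (risk-neutral) and
# worst case (an extreme coherent measure).
expectation = lambda p, v: sum(pi * vi for pi, vi in zip(p, v))
worst_case = lambda p, v: max(v)
```

Swapping `one_step_risk` changes the risk attitude of the whole evaluation without touching the recursion itself, which is precisely what time consistency buys.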
At this point, we are ready to define a coherent risk measure.
Definition 7 (Coherent Risk Measure)
We call the one-step conditional risk measures ρ_t : Z_{t+1} → Z_t, t = 0, …, T − 1, as in (6), coherent risk measures if they satisfy the following conditions:

Convexity: ρ_t(λZ + (1 − λ)W) ≤ λρ_t(Z) + (1 − λ)ρ_t(W), for all λ ∈ (0, 1) and all Z, W ∈ Z_{t+1};

Monotonicity: if Z ≤ W, then ρ_t(Z) ≤ ρ_t(W), for all Z, W ∈ Z_{t+1};

Translation invariance: ρ_t(Z + W) = Z + ρ_t(W), for all Z ∈ Z_t and W ∈ Z_{t+1};

Positive homogeneity: ρ_t(βZ) = βρ_t(Z), for all Z ∈ Z_{t+1} and β ≥ 0.
Henceforth, all the risk measures considered are assumed to be coherent. In this paper, we are interested in discounted infinite-horizon problems. Let γ ∈ (0, 1) be a given discount factor. For T = 0, 1, …, we define the functionals

ρ_{0,T}^γ(Z_0, …, Z_T) = ρ_{0,T}(Z_0, γZ_1, …, γ^T Z_T),   (7)

which are the same as (6) for t = 0, but with discounting γ^t applied to each Z_t. Finally, we have the total discounted risk functional defined as

ρ^γ(Z) = lim_{T→∞} ρ_{0,T}^γ(Z_0, …, Z_T).   (8)

From [ruszczynski2010risk, Theorem 3], we have that ρ^γ is convex, monotone, and positive homogeneous.
III Risk-Averse POMDPs
Notions of coherent risk and dynamic risk measures discussed in the previous section have been developed and applied in microeconomics and mathematical finance in the past two decades [vose2008risk]. Generally speaking, risk-averse decision making is concerned with the behavior of agents, e.g., consumers and investors, who, when exposed to uncertainty, attempt to lower that uncertainty. A risk-averse agent is reluctant to accept a situation with an unknown payoff over another situation with a more predictable, but possibly lower, expected payoff. In a Markov decision making setting, the main idea in risk-averse control is to replace the conventional conditional expectation of the cumulative reward or cost objectives with more general risk measures.
Consider a stationary (policies, transition probabilities, and cost functions do not depend explicitly on time) controlled Markov process {x_t}, t = 0, 1, …. Each policy π leads to a cost sequence Z_t = c(x_t, a_t), t = 0, 1, …. We define the dynamic risk of evaluating the discounted cost of a policy π as

J(π) = ρ^γ( c(x_0, a_0), c(x_1, a_1), … ),   (9)

where ρ^γ is defined in (8). In this work, we are interested in addressing the following problem:
Problem 1
For a given POMDP PM, a discount factor γ ∈ (0, 1), and a total risk functional J as in (9), with ρ_t being coherent risk measures, compute

inf_π J(π).
We refer to a controlled Markov process with the "nested" objective (9) as a risk-averse Markov process. Many applications, such as portfolio allocation problems [gonzalo2019differences] and organ transplant decisions [heilman2017potential], require a risk-averse Markov model. It was also previously demonstrated in [chow2015risk, osogami2012robustness] that coherent risk measure objectives can account for modeling errors and parametric uncertainty in MDPs. The main challenge is that, at any time t, the value of ρ_t is F_t-measurable and is allowed to depend on the entire history of the process, so we cannot expect to obtain a Markov optimal policy [ott2010markov].
In order to obtain Markov optimal policies, we need to make the following assumption (see [ruszczynski2010risk, Section 4] for more details):
Assumption 1
For any function V : S → R, we have

ρ_t( V(x_{t+1}) ) = σ( V, x_t, P(· | x_t) ),   (10)

where σ is a function of V, x_t, and P(· | x_t). The function σ is called a Markov risk transition mapping.

Note that the Markov risk transition mapping depends on the function V, the state x_t, and the probability vector P(· | x_t). The dot in P(· | x_t) on the right-hand side of (10) represents a dummy variable that is integrated/summed out with respect to the x_t-th row of the transition probability matrix P. The simplest case of the Markov risk transition mapping is the conditional expectation, i.e.,

σ( V, x, P(· | x) ) = E[ V(x_{t+1}) | x_t = x ].

If ρ_t is a coherent risk measure as described in Definition 7, then Markov policies are sufficient to ensure optimality [ruszczynski2010risk]. In particular, for the CVaR risk measure, the Markov risk transition mapping is given by

σ( V, x, P(· | x) ) = inf_{ζ∈R} { ζ + (1/α) E[ max( V(x_{t+1}) − ζ, 0 ) | x_t = x ] }.   (11)
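Because the CVaR transition mapping minimizes an objective that is piecewise linear in ζ, with breakpoints at the support of the value function, the minimum is attained at one of the support points and can be found by direct enumeration. A small self-contained sketch (function and variable names are our own):

```python
def cvar_transition(V, p, alpha):
    """CVaR Markov risk transition mapping (illustrative sketch):

        sigma(V, x, p) = min over zeta of  zeta + (1/alpha) * E_p[max(V - zeta, 0)].

    V: dict state -> value; p: dict state -> probability; 0 < alpha <= 1.
    The piecewise-linear objective attains its minimum at a support point of V,
    so it suffices to evaluate it at each such point.
    """
    def objective(zeta):
        return zeta + sum(p[s] * max(V[s] - zeta, 0.0) for s in p) / alpha
    return min(objective(V[s]) for s in p)
```

For alpha = 1 the mapping reduces to the conditional expectation (risk-neutral case), and as alpha shrinks it approaches the worst-case value over the support of p.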
The riskaverse formulation can be extended to POMDPs as follows.
Theorem 1
Note that a POMDP can be represented as an MDP over the belief states (2). Hence, a POMDP is a controlled Markov process with states b ∈ Δ(S), where the controlled belief transition probability is described as

P(b_t | b_{t−1}, a_{t−1}) = Σ_{o∈O} P(b_t | b_{t−1}, a_{t−1}, o_t = o) P(o_t = o | b_{t−1}, a_{t−1}),   (13)

with

P(o_t = o | b_{t−1}, a_{t−1}) = Σ_{s'∈S} O(o | s') Σ_{s∈S} T(s' | s, a_{t−1}) b_{t−1}(s),

and P(b_t | b_{t−1}, a_{t−1}, o) equal to 1 if the belief update (2) applied to (b_{t−1}, a_{t−1}, o) yields b_t, and 0 otherwise. Then, given that c is non-negative and upper-bounded, from [krishnamurthy2016partially, Theorem 8.6.2] and [ruszczynski2010risk, Theorem 4], we infer that from the Bellman equations (12) we can obtain the optimal policies.
We can use a method based on policy iteration to solve the dynamic programming equations (12) to design risk-averse optimal policies. To this end, for k = 0, 1, …, given a stationary Markov policy π_k, we calculate the corresponding value function as

V_{π_k}(b) = c(b, π_k(b)) + γ σ( V_{π_k}, b, P(· | b, π_k(b)) ).   (14a)

Then, we compute the next policy as

π_{k+1}(b) ∈ argmin_{a∈A} { c(b, a) + γ σ( V_{π_k}, b, P(· | b, a) ) }.   (14b)
Unfortunately, the problem of designing risk-averse optimal Markovian policies for POMDPs is undecidable in general. This follows from [MADANI20035, Theorem 4.4] by noting that the conditional expectation is itself a coherent risk measure.
In the subsequent section, we demonstrate that if, instead of considering policies with infinite memory, we search over finite-memory policies, then we can minimize upper bounds on the total risk cost functional (9).
IV Risk-Averse Stochastic Finite State Controllers
Under a stochastic finite state controller, the POMDP is transformed into a Markov chain with design probability distributions ω and κ. We define the total risk functional of this parametric Markov chain as

J(ω, κ) = ρ^γ( c(s_0, a_0), c(s_1, a_1), … ),   (15)

where the states s_t and actions a_t are drawn from the probability distribution of the global Markov chain induced by ω and κ. In this setting, Problem 1 can be expressed as
Problem 2
For a given POMDP PM, a stochastic finite state controller G, a discount factor γ ∈ (0, 1), and a total risk functional J as in (15), with ρ_t being coherent risk measures, compute

inf_{ω, κ} J(ω, κ).
The optimal value of Problem 2 provides an upper bound to that of Problem 1, since a stochastic finite state controller only contains finitely many memory states and can be at best as good as the belief-based optimal policy (with infinite memory). The latter claim can also be shown using [hansen1998solving, Theorem 1], which indicates that any improvement in the parameters of a stochastic finite state controller (in the sense of optimizing the value functions) is at most as good as the belief value function.
For POMDPs controlled by stochastic finite state controllers, the dynamic program is developed in the global state space S × G. The value function is defined over this global state space, and policy iteration techniques must also be carried out in the global state space. For a given stochastic finite state controller, G, and POMDP, PM, the value function V(g, s) is the discounted dynamic risk measure under G, and can be computed by solving the set of equations

V(g, s) = Σ_{o∈O} O(o | s) Σ_{(g',a)∈G×A} ω(g', a | g, o) [ c(s, a) + γ σ( V(g', ·), s, T(· | s, a) ) ],   (16)

for all (g, s) ∈ G × S. Then, for each s ∈ S, the optimal value function over the induced Markov chain can be computed by taking the minimum of the above equation over all I-states:

V(s) = min_{g∈G} V(g, s).   (17)

Since σ is convex (because ρ_t is a coherent risk measure), (16) can be solved via convex optimization.
We end this section by demonstrating that the optimal values obtained using stochastic finite state controllers upper-bound those of the belief-based (infinite-memory) policy.
Proposition 1
Consider the POMDP PM and the global Markov chain induced by the stochastic finite state controller G. Then, for all beliefs b ∈ Δ(S), the value function of the induced Markov chain upper-bounds the optimal belief-based value function.
The value function of the induced Markov chain satisfies (16) for all (g, s) ∈ G × S. For each I-state g, the value function in beliefs can be computed as

V(g, b) = Σ_{s∈S} b(s) V(g, s),

and the optimal value function is given by

V(b) = min_{g∈G} Σ_{s∈S} b(s) V(g, s).

Applying Hölder's inequality to the right-hand side of the above equality, we obtain

V(b) ≤ min_{g∈G} ‖b‖_1 ‖V(g, ·)‖_∞ = min_{g∈G} max_{s∈S} V(g, s),

where in the last equality we used the fact that ‖b‖_1 = Σ_{s∈S} b(s) = 1, since b ∈ Δ(S), and the fact that V is non-negative (since c is non-negative). From (17), we infer the claim of the proposition.
V A Bounded Policy Iteration Algorithm for Risk-Averse Stochastic Finite State Controllers
So far, we have shown that synthesizing an optimal infinite-memory controller for POMDPs with coherent risk objectives is undecidable. On the other hand, a stochastic finite state controller can upper-bound the coherent risk for a POMDP. In this section, we provide a computational method based on bounded policy iteration to design risk-averse stochastic finite state controllers. Furthermore, we propose techniques for minimizing the upper bound on the total coherent risk by adding I-states to the controller in order to escape local minima.
Policy iteration incrementally improves a controller by alternating between two steps, policy evaluation and policy improvement, until convergence to an optimal policy [bertsekas76]. During policy improvement, a dynamic programming update using the so-called dynamic programming backup equation (DP backup) is used. For a risk-averse POMDP, the DP backup is given by

V(b) = min_{a∈A} [ c(b, a) + γ σ( V, b, P(· | b, a) ) ].

The right-hand side of the DP backup can be applied to any risk value function. The effect is a risk reduction (if possible) at every belief state. However, the DP backup is difficult to use directly, as it must be computed at each belief state in the belief space, which is uncountably infinite.
In [PoupartB03, hansen08], a methodology called bounded policy iteration is proposed for stochastic finite state controllers, which allows stochastic finite state controllers with fewer I-states to achieve performance comparable to deterministic finite state controllers, while allowing the stochastic finite state controller to grow in a bounded fashion: only one (or a few) I-state(s) need to be added at a time to escape a local minimum.
Before presenting our proposed bounded policy iteration method for riskaverse stochastic finite state controllers, we recall the following important definition.
Definition 8 (Tangent Belief State)
A belief state b is called a tangent belief state if V(b) touches the DP backup of V from above. Since V(b) must equal V(g, b) for some I-state g, we also say that the I-state g is tangent to the backed-up value function at b.
Equipped with this definition, the two steps involved in our algorithm are described next.
V-A I-State Improvement via Convex Optimization
Let V_g denote the vector of values V(g, s) over s ∈ S. We say that an I-state g is improved if the tunable stochastic finite state controller parameters associated with that I-state can be adjusted so that V_g decreases.
As a first step, we point out that the search over κ can be dropped. This is simply because the initial I-state is chosen by computing the best-valued I-state for the given initial belief, i.e., κ(g* | p_0) = 1, where g* ∈ argmin_{g∈G} Σ_{s∈S} p_0(s) V(g, s).
After initialization, we pose the improvement as a convex optimization as follows:
I-State Improvement Convex Optimization: For the I-state g, the following convex optimization is constructed over the variables ε and ω(g', a | g, o), (g', a) ∈ G × A, o ∈ O:

maximize ε

subject to

Improvement constraints:

V(g, s) − ε ≥ Σ_{o∈O} O(o | s) Σ_{(g',a)∈G×A} ω(g', a | g, o) [ c(s, a) + γ σ( V(g', ·), s, T(· | s, a) ) ], ∀s ∈ S,

Probability constraints:

Σ_{(g',a)∈G×A} ω(g', a | g, o) = 1, ω(g', a | g, o) ≥ 0, ∀(g', a) ∈ G × A, ∀o ∈ O.   (18)

The above convex optimization searches for ω values that improve the I-state value vector V_g by maximizing the decision variable ε. If an improvement is found, i.e., ε > 0, the parameters of the I-state are updated with the corresponding optimal ω.
Algorithm 1 outlines the main steps in the bounded policy iteration for risk-averse stochastic finite state controllers. The algorithm has two distinct parts. First, for fixed parameters (ω, κ) of the stochastic finite state controller, policy evaluation is carried out, in which the risk value function V(g, s) is computed, for each I-state g ∈ G, using a convex optimization of the form (19) (Steps 2, 10, and 18). In fact, this optimization solves (16) for the fixed controller parameters. Second, after evaluating the current coherent risk value function, an improvement is carried out, either by changing the parameters of existing I-states or, if no new parameters can improve any I-state, by adding a fixed number of I-states to escape the local minimum (Steps 14-17). This is described in Section V-B.
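For the special case where the one-step measure is the conditional expectation (itself a coherent risk measure), the evaluation equations become linear and can be solved by simple fixed-point iteration instead of convex programming. The sketch below illustrates this special case under our earlier dictionary conventions; general coherent measures such as CVaR require the convex programs described in the text:

```python
def evaluate_fsc(states, istates, obs, T, Obs, omega, cost, gamma, iters=200):
    """Fixed-point iteration for the FSC evaluation equations with the
    conditional expectation as the one-step risk measure (illustrative only)."""
    V = {(g, s): 0.0 for g in istates for s in states}
    for _ in range(iters):
        # One application of the evaluation backup at every global state (g, s).
        V = {(g, s): sum(
                 Obs[s][o] * sum(
                     w * (cost[s][a]
                          + gamma * sum(T[s][a][s2] * V[(g2, s2)] for s2 in states))
                     for (g2, a), w in omega[(g, o)].items())
                 for o in obs)
             for g in istates for s in states}
    return V
```

Since the backup is a γ-contraction, the iterates converge geometrically to the unique evaluation of the controller.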
V-B Escaping Local Minima by Adding I-States
At some point in running the algorithm, it may happen that no I-state can be improved by further iterations, i.e., for every g ∈ G, the corresponding convex optimization (18) yields an optimal value of ε = 0. Then, policy iteration has reached a local minimum if and only if V is tangent to the backed-up value function for all g ∈ G [PoupartB03]. The dual variables corresponding to the improvement constraints in (18) provide those belief states that are tangent to the risk value function. The process for adding I-states involves forwarding the tangent beliefs one step and then checking whether the value of those forwarded beliefs can be improved. The procedure for adding I-states is provided in Algorithm 2.
Algorithm 2 can be understood as follows. Assume that a tangent belief exists for some I-state. Instead of directly improving the value of the tangent belief, the algorithm tries to improve the value of the forwarded beliefs reachable in one step from the tangent beliefs. First, the forwarded beliefs are computed (Steps 4-8). Then, the corresponding risk value functions are applied to a DP backup (Steps 9-11). If some action and successor I-state can in fact reduce the risk value (Step 12), then a new I-state is added that deterministically leads to this action and successor I-state (Steps 13-14). Note that, at the end of the algorithm, the newly added I-states have no incoming edges, i.e., no pre-existing I-state transitions to them. However, when the other I-states are improved in subsequent policy improvement steps, they generate transitions to the newly added I-states. These new I-states then improve the value of the original tangent belief.
VI Numerical Example
An agent (e.g., a robot) has to autonomously navigate a two-dimensional terrain map (e.g., the Mars surface) represented by a grid world with obstacles of different shapes. At each time step, the agent can move to any of its eight neighboring states (diagonal moves are allowed). Due to sensing and control noise, however, with some probability a move to a random neighboring state occurs instead. The stage-wise cost of each move until reaching the destination accounts for fuel usage. In between the starting point and the destination, there are a number of obstacles that the agent should avoid. Hitting an obstacle incurs a large cost and leads to termination, while the goal grid region carries a reward. The discount factor is γ. After a move is chosen, the observation of the agent is assumed to be binary, i.e., either an obstacle is detected in the next cell that the robot is moving to or not. Similar to [chow2015risk], in our simulations, we included an obstacle and target position perturbation in a random direction to one of the neighboring grid cells with some probability, to represent uncertainty in the terrain map (recall the uncertainty in Mars terrain maps as shown in Figure 1).
The objective is to compute a safe (i.e., obstacle-free) path that is fuel efficient. To this end, we consider CVaR as the coherent risk measure. CVaR is given by

CVaR_α(Z) = inf_{ζ∈R} { ζ + (1/α) E[ max(Z − ζ, 0) ] },   (20)

where the infimum should be understood pointwise. In general, the confidence level α may be a measurable function with values in the interval (0, 1]. Here, we assume a constant α. A value of α close to one corresponds to a risk-neutral policy, whereas a small value of α yields a rather risk-averse policy. For the CVaR risk measure, (16) can be computed using the Markov risk transition mapping (11), where the infimum on the right-hand side can either be solved by line search techniques or by representation as an elementary linear programming problem, since the objective is convex in ζ [rockafellar2000optimization, Theorem 1] (the function is increasing and convex [ott2010markov, Lemma A.1, p. 117]). Figure 2 depicts the policies and the value functions computed for the grid world based on the bounded policy iteration technique in Section V. For these experiments, we used a fixed number of internal states for the stochastic finite state controller, and the corresponding convex optimizations were solved using the CVX toolbox [cvx] in MATLAB.
As can be observed from Figure 2, the risk-neutral policy leads to shorter paths from different cells to the target. However, in the perturbed scenarios, it performed poorly, with a larger number of failures. On the other hand, the risk-averse policy leads to longer routes from the cells to the target, but it resulted in only a small number of failed scenarios. These results parallel those obtained in [chow2015risk], wherein risk-averse policies in terms of CVaR for MDPs were studied.
VII Conclusions
We proposed a method based on bounded policy iteration and convex optimization to design risk-averse stochastic finite state controllers for POMDPs. Future research will explore risk-averse policies for POMDPs that maximize the probability of satisfying a set of high-level mission specifications given in terms of temporal logic formulae [sharan14, MSB19]. Furthermore, the risk-averse policy synthesis technique will be applied to designing risk-averse planning policies for traversing uncertain Mars terrain (as depicted in Figure 1).