With the rise of autonomous systems being deployed in real-world settings, the associated risk that stems from unknown and unforeseen circumstances is correspondingly on the rise. In particular, in safety-critical scenarios, such as aerospace applications, decision making should account for risk. For example, spacecraft control technology relies heavily on a relatively large and highly skilled mission operations team that generates detailed time-ordered and event-driven sequences of commands. This approach will not be viable in the future with an increasing number of missions and a desire to limit the operations team and Deep Space Network (DSN) costs. Future spaceflight missions will be at large distances and light-time delays from Earth, requiring novel capabilities for astronaut crews and ground operators to manage spacecraft consumables such as power, water, propellant, and life support systems to prevent mission failure. In order to maximize the science returns under these conditions, the ability to deal with emergencies and safely explore remote regions is becoming more and more important [mcghan2016resilient]. Even in Mars rover navigation problems, finding planning policies that minimize risk is of utmost importance due to the uncertainties present in Mars surface data [ono2018mars], as illustrated in Figure 1.
Risk can be quantified in numerous ways. For example, mission risks can be mathematically characterized in terms of chance constraints [ono2013probabilistic, ono2012closed, ono2015chance]. The preference of one risk measure over another depends on factors such as sensitivity to rare events, ease of estimation from data, and computational tractability. Artzner et al. [artzner1999coherent] characterized a set of natural properties that are desirable for a risk measure, called a coherent risk measure, which have since obtained widespread acceptance in finance and operations research, among others. An important example of a coherent risk measure is the conditional value-at-risk (CVaR), which has received significant attention in decision making problems such as Markov decision processes (MDPs) [chow2015risk, chow2014algorithms, prashanth2014policy, bauerle2011markov]. General coherent risk measures for MDPs were studied in [ruszczynski2010risk], wherein it was further assumed that the risk measure is time consistent, similar to the dynamic programming property. Following in the footsteps of the latter contribution, [tamar2016sequential] proposed a sampling-based algorithm for MDPs with static and dynamic coherent risk measures using policy gradient and actor-critic methods, respectively (see also a model predictive control technique for linear dynamical systems with coherent risk objectives [singh2018framework]).
However, in many aerospace applications, sensing constraints do not allow for full-state observation, and decision making involves partial observation [ahmadi2019safe, nilsson2018toward]. These problems can be represented as a partially observable Markov decision process (POMDP), where decision making is subject to uncertainty stemming from stochastic outcomes as well as partial observation [krishnamurthy2016partially]. In this paper, we propose a method based on bounded policy iteration to design sub-optimal risk-averse policies for POMDPs. To this end, we first discuss that the problem of designing risk-averse optimal policies is undecidable in general. Then, we show that a stochastic but finite-memory controller can be synthesized to upper-bound the dynamic risk. Given a memory budget, we propose a policy iteration method to synthesize these finite-state controllers that can increase the number of memory states to improve risk aversion. We illustrate our proposed method with a numerical example of path planning under uncertainty.
The rest of the paper is organized as follows. The next section reviews some preliminary notions and definitions used in the sequel. In Section III, we discuss POMDPs with coherent risk measures. In Section IV, we propose sub-optimal stochastic finite state controllers that minimize the upper-bound on the coherent risk. In Section V, a bounded policy iteration algorithm is formulated to design risk-averse stochastic finite state controllers. In Section VI, we elucidate our results with a numerical example. Finally, in Section VII, we conclude the paper and give directions for future research.
In this section, we briefly review some notions and definitions used throughout the paper.
II-A Markov Chains
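A Markov chain $\mathcal{MC}$ consists of a state space $Q$, the transition probability $T(q' \mid q)$ defined as the conditional distribution such that $\sum_{q' \in Q} T(q' \mid q) = 1$ for all $q \in Q$, and the initial distribution $\iota_{init}$ such that $\sum_{q \in Q} \iota_{init}(q) = 1$. An infinite path, denoted by the superscript $\omega$, of the Markov chain is a sequence of states $\pi = q_0 q_1 q_2 \ldots \in Q^{\omega}$ such that $T(q_{t+1} \mid q_t) > 0$ for all $t$ and $\iota_{init}(q_0) > 0$. The probability space over such paths is defined as follows. The sample space $\Omega$ is the set of infinite paths with initial state $q_0$, i.e., $\Omega = \{\pi \in Q^{\omega} : \pi(0) = q_0\}$. $\mathcal{F}$ is the least $\sigma$-algebra on $\Omega$ containing $Cyl(\hat{\pi})$ for every finite path $\hat{\pi}$, where $Cyl(\hat{\pi}) = \{\pi \in \Omega : \hat{\pi} \text{ is a prefix of } \pi\}$ is the cylinder set. Finally, in order to specify the probability measure $\Pr_{\mathcal{MC}}$ over all sets of events in $\mathcal{F}$, it is sufficient to provide the probability of each cylinder set, which can be computed as $\Pr_{\mathcal{MC}}(Cyl(q_0 q_1 \ldots q_n)) = \iota_{init}(q_0) \prod_{t=0}^{n-1} T(q_{t+1} \mid q_t)$. Once the probability measure is defined over the cylinder sets, the expectation operator $\mathbb{E}_{\mathcal{MC}}$ is also uniquely defined. In the sequel, we remove the subscript $\mathcal{MC}$ whenever the Markov chain is clear from the context.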
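As a small illustration of the cylinder-set probability described above, the following sketch multiplies the initial probability of the first state by the one-step transition probabilities along a finite prefix. It assumes the usual product form for Markov chains; the function name `cylinder_probability` and the two-state chain are hypothetical and chosen only for illustration.

```python
from itertools import product


def cylinder_probability(prefix, init, trans):
    """Probability of all infinite paths that begin with the non-empty `prefix`.

    init[q]      : initial probability of state q
    trans[q][q2] : probability of moving from q to q2
    """
    prob = init[prefix[0]]
    for q, q_next in zip(prefix, prefix[1:]):
        prob *= trans[q][q_next]
    return prob


# Hypothetical two-state chain (illustrative numbers only).
init = [1.0, 0.0]
trans = [[0.9, 0.1],
         [0.5, 0.5]]

p_001 = cylinder_probability([0, 0, 1], init, trans)

# The cylinder sets of all prefixes of a fixed length partition the sample
# space, so their probabilities should sum to one.
total = sum(cylinder_probability(list(w), init, trans)
            for w in product(range(2), repeat=3))
```

Here `p_001` is the product of the initial probability of state 0 and the two transition probabilities along the prefix, and `total` is a sanity check that the measure is normalized.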
II-B Partially Observable Markov Decision Process
Definition 1 (POMDP)
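A POMDP, $\mathcal{PM}$, consists of:
States $S$ of the autonomous agent(s) and world model,
Actions $Act$ available to the robot,
A transition function $T(s' \mid s, \alpha)$,
A cost, $c(s, \alpha)$, for each state $s \in S$ and action $\alpha \in Act$.
This paper considers finite POMDPs where the state set $S$, the action set $Act$, and the observation set $\mathcal{O}$ are finite sets. For each action $\alpha$, the probability of making a transition from state $s$ to state $s'$ under action $\alpha$ is given by $T(s' \mid s, \alpha)$. For each state $s$, an observation $o \in \mathcal{O}$ is generated independently with probability $O(o \mid s)$. The starting world state is given by the distribution $\iota_{init}(s)$. The probabilistic components of a POMDP model must satisfy the following:
$$\sum_{s' \in S} T(s' \mid s, \alpha) = 1 \ \ \forall s \in S, \alpha \in Act, \qquad \sum_{o \in \mathcal{O}} O(o \mid s) = 1 \ \ \forall s \in S, \qquad \sum_{s \in S} \iota_{init}(s) = 1. \tag{1}$$
Given a POMDP, we can define beliefs or distributions over states at each time step to keep track of sufficient statistics with finite description [astrom65]. The beliefs $b_t \in \Delta(S)$, with $\Delta(S)$ being the set of probability distributions over $S$, for all $t \ge 0$ can be computed using Bayes' law as follows:
$$b_0(s) = \frac{\iota_{init}(s)\, O(o_0 \mid s)}{\sum_{\bar{s} \in S} \iota_{init}(\bar{s})\, O(o_0 \mid \bar{s})}, \qquad b_t(s') = \frac{O(o_t \mid s') \sum_{s \in S} T(s' \mid s, \alpha_{t-1})\, b_{t-1}(s)}{\sum_{\bar{s} \in S} O(o_t \mid \bar{s}) \sum_{s \in S} T(\bar{s} \mid s, \alpha_{t-1})\, b_{t-1}(s)}, \tag{2}$$
for all $t \ge 1$. It is also worth mentioning that (2) is referred to as the belief update equation.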
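The belief update can be sketched in a few lines. The example below assumes the standard Bayes filter for POMDPs (predict with the transition function, weight by the observation likelihood, then normalize); the array layouts `T[a][s][s2]` and `O[s][o]`, the function name, and the numbers are assumptions made for illustration and are not taken from the paper.

```python
def belief_update(belief, action, observation, T, O):
    """One step of the Bayes filter over a finite state set.

    belief[s]   : current probability of state s
    T[a][s][s2] : probability of moving from s to s2 under action a
    O[s][o]     : probability of observing o in state s
    """
    n = len(belief)
    unnormalized = [
        O[s2][observation] * sum(T[action][s][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    z = sum(unnormalized)
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / z for u in unnormalized]


# Hypothetical two-state, one-action, two-observation model.
T = [[[0.7, 0.3],
      [0.2, 0.8]]]
O = [[0.9, 0.1],
     [0.2, 0.8]]

b_next = belief_update([0.5, 0.5], action=0, observation=1, T=T, O=O)
```

In this example the predicted distribution is `[0.45, 0.55]`; weighting by the likelihood of observation 1 and normalizing shifts most of the mass to the second state.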
II-C Stochastic Finite State Control of POMDPs
It is well established that designing optimal policies for POMDPs based on the (continuous) belief states requires uncountably infinite memory or internal states [CassandraKL94, KLC98, MADANI20035]. This paper focuses on a particular class of POMDP controllers, namely, stochastic finite state controllers. These controllers lead to a finite state space Markov chain for the closed loop controlled system.
Definition 2 (Stochastic Finite State Controller)
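Let $\mathcal{PM}$ be a POMDP with observations $\mathcal{O}$, actions $Act$, and initial distribution $\iota_{init}$. A stochastic finite state controller for $\mathcal{PM}$ is given by the tuple $\mathcal{G} = (G, \omega, \kappa)$ where
$G$ is a finite set of internal states (I-states).
$\omega : G \times \mathcal{O} \to \Delta(G \times Act)$ is a function of internal stochastic finite state controller states $g$ and observation $o$, such that $\omega(g, o)$ is a probability distribution over $G \times Act$. The next internal state and action pair $(g', \alpha)$ is chosen by independent sampling of $\omega(g, o)$. By abuse of notation, $\omega(g', \alpha \mid g, o)$ will denote the probability of transitioning to internal stochastic finite state controller state $g'$ and taking action $\alpha$, when the current internal state is $g$ and observation $o$ is received.
$\kappa : \Delta(S) \to \Delta(G)$ chooses the starting internal FSC state $g_0$, by independent sampling of $\kappa(\iota_{init})$, given initial distribution $\iota_{init}$ of $\mathcal{PM}$. $\kappa(g \mid \iota_{init})$ will denote the probability of starting the FSC in internal state $g$ when the initial POMDP distribution is $\iota_{init}$.
Closing the loop around a POMDP with a stochastic finite state controller yields the following transition system.
Definition 3 (Global Markov Chain)
Let POMDP $\mathcal{PM}$ have state space $S$ and let $G$ be the I-states of stochastic finite state controller $\mathcal{G}$. The global Markov chain $\mathcal{MC}^{\mathcal{PM},\mathcal{G}}$ (or simply $\mathcal{MC}$, where the stochastic finite state controller and the POMDP are clear from the context) with execution $[s_0, g_0], [s_1, g_1], \ldots$ evolves as follows:
The probability of initial global state $[s_0, g_0]$ is $\iota_{init}(s_0)\, \kappa(g_0 \mid \iota_{init})$.
The state transition probability, $T^{\mathcal{MC}}$, is given by $T^{\mathcal{MC}}([s', g'] \mid [s, g]) = \sum_{o \in \mathcal{O}} \sum_{\alpha \in Act} O(o \mid s)\, \omega(g', \alpha \mid g, o)\, T(s' \mid s, \alpha)$, where $O(o \mid s)$ and $T(s' \mid s, \alpha)$ are the observation and transition probabilities of the POMDP.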
Note that the global Markov chain arising from a finite state space POMDP also has a finite state space.
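To make the closed-loop construction concrete, the sketch below builds the transition probabilities of the global Markov chain over pairs (POMDP state, I-state). It assumes the usual closed-loop form: from global state (s, g), an observation o is drawn with the observation probability at s, the controller samples a successor I-state and action given (g, o), and the POMDP state then moves under that action. The data layout, the name `global_markov_chain`, and all numbers are hypothetical.

```python
from itertools import product


def global_markov_chain(T, O, omega):
    """Transition probabilities P[(s, g)][(s2, g2)] of the closed-loop chain.

    T[a][s][s2]          : POMDP transition probability
    O[s][o]              : observation probability
    omega[g][o][(g2, a)] : controller probability of moving to I-state g2
                           and taking action a, given I-state g and observation o
    """
    n_states, n_obs, n_istates = len(O), len(O[0]), len(omega)
    nodes = list(product(range(n_states), range(n_istates)))
    P = {x: {y: 0.0 for y in nodes} for x in nodes}
    for s, g in nodes:
        for o in range(n_obs):
            for (g2, a), w in omega[g][o].items():
                for s2 in range(n_states):
                    P[(s, g)][(s2, g2)] += O[s][o] * w * T[a][s][s2]
    return P


# Hypothetical model: 2 states, 2 actions, 2 observations, 2 I-states.
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.0, 1.0]]]
O = [[0.8, 0.2],
     [0.3, 0.7]]
omega = [
    [{(0, 0): 0.6, (1, 1): 0.4}, {(1, 0): 1.0}],
    [{(0, 1): 0.5, (1, 1): 0.5}, {(1, 0): 0.3, (0, 0): 0.7}],
]

P = global_markov_chain(T, O, omega)
row_sums = [sum(P[x].values()) for x in P]
```

Each row of `P` sums to one whenever the POMDP and controller distributions are properly normalized, and the chain has finitely many states, as noted in the text.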
II-D Coherent Risk Measures
Consider a probability space , a filtration , and an adapted sequence of random variables (stage-wise costs), where . We further define the spaces , , and let and . We further assume that the sequence is almost surely bounded, i.e.,
In order to describe how one can evaluate the risk of subsequence from the perspective of stage , we require the following definitions.
Definition 4 (Conditional Risk Measure)
A mapping , where , is called a conditional risk measure, if it has the following monotonicity property:
where the inequalities should be understood componentwise.
Definition 5 (Dynamic Risk Measure)
A dynamic risk measure is a sequence of conditional risk measures , .
One fundamental property of dynamic risk measures is their consistency over time. That is, if one cost sequence will be as good as another from the perspective of some future time, and the two sequences are identical between the current time and that future time, then the first sequence should not be worse than the second from the current time’s perspective.
Definition 6 (Time-Consistent Risk Measure)
A dynamic risk measure is called time-consistent if for all and all sequences the conditions
If a risk measure is time-consistent, we can define the one-step conditional risk measure , as follows:
and for all , we obtain:
Note that the time-consistent risk measure is completely defined by one-step conditional risk measures , and, in particular, for , (6) defines a risk measure of the entire sequence .
At this point, we are ready to define a coherent risk measure.
Definition 7 (Coherent Risk Measure)
We call the one-step conditional risk measures , as in (6) a coherent risk measure if it satisfies the following conditions
Convexity: , for all and all ;
Monotonicity: If then for all ;
Time Consistency: for all and ;
Positive Homogeneity: for all and .
Henceforth, all the risk measures considered are assumed to be coherent. In this paper, we are interested in the discounted infinite horizon problems. Let be a given discount factor. For , we define the functionals
which are the same as (6) for , but with discounting applied to each . Finally, we have total discounted risk functional defined as
From [ruszczynski2010risk, Theorem 3], we have that is convex, monotone, and positive homogeneous.
III Risk-Averse POMDPs
Notions of coherent risk and dynamic risk measures discussed in the previous section have been developed and applied in the fields of microeconomics and mathematical finance over the past two decades [vose2008risk]. Generally speaking, risk-averse decision making is concerned with the behavior of agents, e.g., consumers and investors, who, when exposed to uncertainty, attempt to lower that uncertainty. Such an agent is reluctant to accept a situation with an unknown payoff over another situation with a more predictable payoff but possibly lower expected payoff. In a Markov decision making setting, the main idea in risk-averse control is to replace the conventional conditional expectation of the cumulative reward or cost objectives with more general risk measures.
Consider a stationary (policies, transition probabilities, and cost functions do not depend explicitly on time) controlled Markov process , . Each policy leads to a cost sequence , . We define the dynamic risk of evaluating the -discounted cost of the policy as
where is defined in (8). In this work, we are interested in addressing the following problem:
For a given POMDP , a discount factor , and a total risk functional as in (9) with being coherent risk measures, compute
We refer to a controlled Markov process with the “nested” objective (9) as a risk-averse Markov process. Many applications such as portfolio allocation problems [gonzalo2019differences] and organ transplant decisions [heilman2017potential] require a risk-averse Markov model. It was also previously demonstrated in [chow2015risk, osogami2012robustness] that coherent risk measure objectives can account for modeling errors and parametric uncertainty in MDPs.
The main challenge is that, at any time , the value of is -measurable and is allowed to depend on the entire history of the process, so we cannot expect to obtain a Markov optimal policy [ott2010markov].
In order to obtain Markov optimal policies, we need to make the following assumption (see [ruszczynski2010risk, Section 4] for more details):
For any function , we have
where . The function is called a Markov risk transition mapping.
Note that the Markov risk transition mapping depends on the function , the states , and probability vector. The dot in on the right hand side of (10) represents a dummy variable that is integrated/summed out with respect to the -th row of the transition probability matrix . The simplest case of the Markov risk transition mapping is the conditional expectation , i.e.,
If is a coherent risk measure as described in Definition 7, then the Markov policies are sufficient to ensure optimality [ruszczynski2010risk]. In particular, for the CVaR risk measure, the Markov risk transition mapping is given by
The risk-averse formulation can be extended to POMDPs as follows.
Note that a POMDP can be represented as an MDP over the belief states (2). Hence, a POMDP is a controlled Markov process with states , where the controlled belief transition probability is described as
Then, given that is non-negative and upper-bounded, from [krishnamurthy2016partially, Theorem 8.6.2] and [ruszczynski2010risk, Theorem 4], we infer that from the Bellman equations (12) we can obtain the optimal policies.
We can use a method based on policy iteration to solve the dynamic programming equations (12) to design risk-averse optimal policies. To this end, for , given a stationary Markov policy , we calculate the corresponding value function as
Then, we compute the next policy as
Unfortunately, the problem of designing risk-averse optimal Markovian policies for POMDPs is undecidable in general. This follows from [MADANI20035, Theorem 4.4] by noting that .
In the subsequent section, we demonstrate that, if instead of considering policies with infinite-memory we search over finite-memory policies, then we can minimize upper-bounds on the total risk cost functional (9).
IV Risk-Averse Stochastic Finite State Controllers
Under a stochastic finite state controller, the POMDP is transformed into a Markov chain with design probability distributions and . We define the total risk functional of this parametric Markov chain as
where s and s are drawn from the probability distribution . In this setting, Problem 1 can be expressed as
For a given POMDP , a stochastic finite state controller , a discount factor , and a total risk functional as in (15) with being coherent risk measures, compute
The optimal value of Problem 2 provides an upper-bound to that of Problem 1, since a stochastic finite state controller only contains finite memory states and it can be at best as good as the belief-based optimal policy (with infinite memory). The latter claim can also be shown using [hansen1998solving, Theorem 1], which indicates that any improvement in the parameters of a stochastic finite state controller (in the sense of optimizing the value functions) is at most as good as the belief value function.
For POMDPs controlled by stochastic finite state controllers, the dynamic program is developed in the global state space . The value function is defined over this global state space, and policy iteration techniques must also be carried out in the global state space. For a given stochastic finite state controller, , and the POMDP , the value function is the discounted dynamic risk measure under , and can be computed by solving a set of equations:
Then, for each , the optimal value function over the induced Markov Chain can be computed by taking the minimum of the above equation over all I-states
Since is convex (because is a coherent risk measure), (16) can be solved by a convex optimization.
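The paper evaluates a fixed controller by solving a convex optimization. As a simpler stand-in, the sketch below evaluates the value function of a finite Markov chain by fixed-point iteration of the form "stage cost plus discount times a one-step risk mapping of the next-state values". This swapped-in technique relies on the one-step mapping being monotone and translation invariant, which makes the update a contraction for a discount factor below one. The per-state cost vector, the function names, and the two example mappings (expectation and worst case, both coherent) are assumptions for illustration.

```python
def evaluate_chain(cost, P, gamma, risk_map, tol=1e-10, max_iter=10000):
    """Iterate v(x) <- cost[x] + gamma * risk_map(v, P[x]) until it settles.

    cost[x]  : stage cost of state x (hypothetical per-state cost)
    P[x][y]  : transition probability from x to y
    risk_map : one-step risk mapping taking (next-state values, probabilities)
    """
    v = [0.0] * len(cost)
    for _ in range(max_iter):
        v_new = [cost[x] + gamma * risk_map(v, P[x]) for x in range(len(cost))]
        if max(abs(a - b) for a, b in zip(v, v_new)) < tol:
            return v_new
        v = v_new
    return v


def expectation(values, probs):
    """Risk-neutral mapping: conditional expectation."""
    return sum(p * x for p, x in zip(probs, values))


def worst_case(values, probs):
    """Most conservative coherent mapping: largest value with positive probability."""
    return max(x for p, x in zip(probs, values) if p > 0.0)


# Hypothetical chain: state 0 costs 1 per step, state 1 is absorbing and free.
cost = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]

v_exp = evaluate_chain(cost, P, 0.9, expectation)
v_worst = evaluate_chain(cost, P, 0.9, worst_case)
```

In this example the risk-neutral value of state 0 solves v = 1 + 0.45 v, while the worst-case value solves v = 1 + 0.9 v, so the risk-averse evaluation is noticeably larger, as one would expect.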
We end this section by demonstrating that the optimal values obtained using the stochastic finite state controllers upper-bound those of the belief-based (infinite-memory) policy.
Consider the POMDP and the Markov chain induced by the stochastic finite state controller . Then, for all , we have .
The value function of the induced Markov chain satisfies (16) for all . For each I-state , the value function in beliefs can be computed as
and the optimal value function given by
Applying Hölder's inequality to the right-hand side of the above equality, we obtain
where in the last inequality we used the fact that since and the fact that is non-negative (since is non-negative). From (17), we infer .
V A Bounded Policy Iteration Algorithm for Risk-Averse Stochastic Finite State Controllers
So far, we showed that synthesizing an infinite memory controller for POMDPs with coherent risk objectives is undecidable. On the other hand, a stochastic finite state controller can upper-bound the coherent risk for a POMDP. In this section, we provide a computational method based on bounded policy iteration to design risk-averse stochastic finite state controllers. Furthermore, we propose techniques for minimizing the upper bound on the total coherent risk by adding I-states to the algorithm in order to escape local minima.
Policy iteration incrementally improves a controller by alternating between two steps: Policy Evaluation and Policy Improvement, until convergence to an optimal policy [bertsekas76]. During policy improvement, a dynamic programming update using the so-called dynamic programming backup equation (DP Backup) is used. For a risk-averse POMDP, the DP Backup is given by
The r.h.s. of the DP Backup can be applied to any risk value function. The effect is a risk reduction (if possible) at every belief state. However, DP Backup is difficult to use directly as it must be computed at each belief state in the belief space, which is uncountably infinite.
In [PoupartB03, hansen08], a methodology called Bounded Policy Iteration is proposed for stochastic finite state controllers, which allows stochastic finite state controllers with fewer I-states to achieve performance comparable to that of deterministic finite state controllers, while allowing the stochastic finite state controller to grow in a bounded fashion: only one (or a few) I-state(s) need to be added at a time to escape a local minimum.
Before presenting our proposed bounded policy iteration method for risk-averse stochastic finite state controllers, we recall the following important definition.
Definition 8 (Tangent Belief State)
A belief state is called a tangent belief state, if touches the DP Backup of from above. Since must equal for some , we also say that the I-state is tangent to the backed up value function at .
Equipped with this definition, the two steps involved in our algorithm are described next.
V-A I-States Improvement via Convex Optimization
Let denote the vectorized in . We say that an I-state is improved, if the tunable stochastic finite state controller parameters associated with that I-state can be adjusted so that decreases.
As a first step, we point out that the search over can be dropped. This is simply because the initial I-state is chosen by computing the best valued I-state for the given initial belief, i.e., , where
After initialization, we pose the improvement as a convex optimization as follows:
I-state Improvement Convex Optimization: For the I-state , the following convex optimization is constructed over the variables , ,
The above convex optimization searches for values that improve the I-state value vector by maximizing the decision variable . If an improvement is found, i.e., , the parameters of the I-state are updated by the corresponding minimizing .
Algorithm 1 outlines the main steps in the bounded policy iteration for risk-averse stochastic finite state controllers. The algorithm has two distinct parts. First, for fixed parameters of the stochastic finite state controller (), policy evaluation is carried out, in which is computed using the following convex optimization (Steps 2, 10 and 18): For each I-state , we have the following:
In fact, the above optimization solves (16) for . Second, after evaluating the current coherent risk function, an improvement is carried out either by changing the parameters of existing I-states or, if no new parameters can improve any I-state, by adding a fixed number of I-states to escape the local minimum (Steps 14-17). This is described in Section V-B.
V-B Escaping Local Minima by Adding I-States
At some point of running the algorithm, no I-state may be improved with further iterations, i.e., , the corresponding convex optimization (V-A) yields an optimal value of . Then, policy iteration has reached a local minimum if and only if is tangent to the backed up value function for all [PoupartB03]. The dual variables corresponding to the Improvement Constraints in (V-A) provide those belief states that are tangent to the risk function. The process for adding I-states involves forwarding the tangent beliefs one step and then checking if the value of those forwarded beliefs can be improved. The procedure for adding I-states is provided in Algorithm 2.
Algorithm 2 can be understood as follows. Assume that a tangent belief exists for some I-state . Instead of directly improving the value of the tangent belief, the algorithm tries to improve the value of forwarded beliefs reachable in one step from the tangent beliefs. First, the forwarded beliefs are computed (Steps 4-8). Then, the corresponding risk value functions are applied to a DP Backup (Steps 9-11). If some action and successor I-state can in fact reduce the risk value (Step 12), then a new I-state is added which deterministically leads to this action and successor I-state (Steps 13-14). Note that at the end of the algorithm, the newly added I-states have no incoming edges, i.e., no pre-existing I-states transition to them. However, when the other I-states are improved in subsequent policy improvement steps, they can generate transitions to the newly added I-states. Such a new I-state then improves the value of the original tangent belief.
VI Numerical Example
An agent (e.g. a robot) has to autonomously navigate a two dimensional terrain map (e.g. Mars surface) represented by a grid world ( states) with obstacles of different shapes. At each time step the agent can move to any of its eight neighboring states (diagonal moves are allowed). Due to sensing and control noise, however, with probability a move to a random neighboring state occurs. The stage-wise cost of each move until reaching the destination is , to account for fuel usage. In between the starting point and the destination, there are a number of obstacles that the agent should avoid. Hitting an obstacle incurs the cost of leading to termination, while the goal grid region has reward . The discount factor is . After a move is chosen, the observation of the agent is assumed to be binary, i.e., either an obstacle is detected in the next cell that the robot is moving to or not. Similar to [chow2015risk], in our simulations, we included an obstacle and target position perturbation in a random direction to one of the neighboring grid cells with probability to represent uncertainty in the terrain map (recall the uncertainty in Mars terrain maps as shown in Figure 1).
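The noisy eight-neighbor motion model described above can be sketched as follows. The numerical values of the noise probability and grid size are not recoverable from the text, so `noise`, `rows`, and `cols` are left as parameters. Two further assumptions are made purely for illustration: the random move is drawn uniformly from all eight directions (including the intended one), and a move that would leave the grid keeps the agent in place.

```python
MOVES = [(-1, -1), (-1, 0), (-1, 1),
         (0, -1),           (0, 1),
         (1, -1),  (1, 0),  (1, 1)]


def step_distribution(cell, move, rows, cols, noise):
    """Distribution over next cells when `move` is intended from `cell`.

    With probability 1 - noise the intended move is executed; with
    probability noise one of the eight moves is chosen uniformly at random.
    Moves that would leave the grid keep the agent in place (assumption).
    """
    def apply(c, m):
        r, k = c[0] + m[0], c[1] + m[1]
        return (r, k) if 0 <= r < rows and 0 <= k < cols else c

    dist = {}
    target = apply(cell, move)
    dist[target] = dist.get(target, 0.0) + (1.0 - noise)
    for m in MOVES:
        nxt = apply(cell, m)
        dist[nxt] = dist.get(nxt, 0.0) + noise / len(MOVES)
    return dist


# Hypothetical 3x3 grid with 20% motion noise.
center = step_distribution((1, 1), (0, 1), rows=3, cols=3, noise=0.2)
corner = step_distribution((0, 0), (1, 1), rows=3, cols=3, noise=0.2)
```

From the center cell all eight neighbors are reachable, while from a corner five of the eight random moves are blocked and their probability mass stays on the corner cell.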
The objective is to compute a safe (i.e., obstacle-free) path that is fuel efficient. To this end, we consider CVaR as the coherent risk measure. CVaR is given by
where and the infimum should be understood point-wise. In general, the confidence level may be -measurable function with values in the interval . Here, we assume . A value of corresponds to a risk-neutral policy; whereas, a value of is rather a risk-averse policy. For CVaR risk measure, (16) can be computed as
where the infimum on the right hand side of the above equation can either be solved by line search techniques or by representation in terms of an elementary linear programming problem, since it is convex in the minimization variable [rockafellar2000optimization, Theorem 1] (the function is increasing and convex [ott2010markov, Lemma A.1., p. 117]).
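For a finite cost distribution, the infimum can be computed without a general-purpose solver. The sketch below assumes the standard Rockafellar-Uryasev form, CVaR = min over z of z + E[(X - z)^+] / eps, with confidence level eps in (0, 1]. Because this objective is convex and piecewise linear in z with kinks only at the cost values, it suffices to evaluate it at those values. The function name `cvar` and the example distribution are hypothetical.

```python
def cvar(values, probs, eps):
    """CVaR of a discrete cost via min_z { z + E[(X - z)^+] / eps }.

    values : possible costs
    probs  : their probabilities (summing to one)
    eps    : confidence level in (0, 1]; eps = 1 recovers the expectation,
             smaller eps puts more weight on the high-cost tail
    """
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")

    def objective(z):
        excess = sum(p * max(x - z, 0.0) for x, p in zip(values, probs))
        return z + excess / eps

    # The objective is convex and piecewise linear with kinks at the cost
    # values, so its minimum is attained at one of them.
    return min(objective(z) for z in values)


# Hypothetical cost distribution with a rare, expensive outcome.
values = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]

risk_neutral = cvar(values, probs, 1.0)
moderate = cvar(values, probs, 0.5)
averse = cvar(values, probs, 0.1)
```

With these numbers the value grows from the plain expectation at `eps = 1.0` to the worst outcome at `eps = 0.1`, which matches the interpretation of small confidence levels as more risk-averse.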
Figure 2 depicts the policies and the value functions computed for the grid world based on the bounded policy iteration technique in Section V. For these experiments, we used internal states for the stochastic finite state controller and the corresponding convex optimizations were solved using CVX toolbox [cvx] in MATLAB.
As can be observed from Figure 2, the risk-neutral policy leads to shorter paths from different cells to the target. However, on perturbed scenarios, it performed poorly with failures. On the other hand, the risk-averse policy leads to longer routes from cells to the target, but it resulted only in failed scenarios. These results parallel those obtained in [chow2015risk], wherein risk-averse policies in terms of CVaR for MDPs were studied.
We proposed a method based on bounded policy iteration and convex optimization to design risk-averse stochastic finite state controllers for POMDPs. Future research will explore risk-averse policies for POMDPs that maximize the satisfaction probability of a set of high-level mission specifications in terms of temporal logic formulae [sharan14, MSB19]. Furthermore, the risk-averse policy synthesis technique will be applied to designing risk-averse planning policies for traversing the uncertain Mars surface (as depicted in Figure 1).