In many applications of decision-making problems modeled by Markov decision processes (MDPs), it is reasonable to incorporate some measure of risk in order to rule out policies that achieve a high expected reward at the cost of risky and error-prone actions. Think, for example, of an expensive manufacturing machine with two running modes: one where the machine runs at peak level and produces the maximum number of products most of the time, at the cost of a high chance of serious damage, and one where the machine runs slightly slower to avoid damage. Most companies would agree that the second option is more reasonable. Yet a company making decisions with the help of classical MDPs would pick the first option and go for the risky strategy.
Most decision-making models, including MDPs, consist of two descriptions of the environment's mechanism: immediate outcomes (rewards or costs), obtained at a state by performing an action, and transitions, the transition probabilities between states under given actions. Both descriptions are objective in the sense that both outcomes and transition probabilities can be estimated by repeatedly experiencing the environment sufficiently many times. The "risk" depends, however, on the subjective perception of the agent, since different agents might have different risk-preferences when facing the same environment. For instance, $100 is more valuable to the poor than to the rich. Behavioral experiments  show that people tend to overreact to small-probability events, but underreact to medium and large probabilities.
Due to the apparent usefulness of risk-sensitive objectives, the topic is of major importance in finance and economics. In economics, the utility function is widely used to model the subjective perception of rewards. The renowned prospect theory (PT)  introduces the probability weighting function to model the subjective perception of probabilities. However, PT can only model single decision problems, whereas in an MDP a sequence of decisions has to be made. In mathematical finance, Ruszczyński (2010)  applies coherent/convex risk measures (CRMs) [2, 11] to incorporate risk into a sequential decision-making structure. However, there are two major drawbacks in his work: 1) he assumes that the risk measures must be coherent or convex, which does not hold for some of the most important instances of risk measures, and 2) he discusses merely the finite-stage or discounted risk problem for coherent risk measures. The theory of discounted and average risk for arbitrary measures, as in the classical MDP, has not been considered yet.
In the community of MDPs (mainly operations research and control theory), despite the apparent usefulness of risk-sensitive measures, few works address the issue, since many risk-sensitive objectives cannot be optimized efficiently. The mean-variance trade-off is a popular risk criterion, where the variance takes the part of the risk measure, as it penalizes highly varying returns. However, this objective is difficult to optimize, especially when a discount factor is included. Recently, the problem was proved in  to be NP-hard even for finite-horizon MDPs. Another popular approach is to apply the exponential utility function. Although an efficient solution (see e.g. ) exists for the average infinite-horizon MDP, it is proved in  that the objective for the discounted MDP is difficult and the optimal policy might not be stationary.
The question is now whether all risk-sensitive objectives are difficult to optimize for MDPs, or whether measures like the mean-variance trade-off are simply not the "right" measures for MDPs. Inspired by the discoveries in mathematical finance and economics, our intuition is therefore to adapt the CRM theory to the MDP structure, where two concerns must be balanced: 1) the axioms should be as general as possible, so as to model all kinds of risk-preferences including mixed risk-preferences, and 2) the underlying optimization problem should be solvable by a computationally feasible algorithm.
The main contributions of this paper are: 1) To incorporate risk into MDPs, we set up a general framework via prospect maps, which are a generalization of CRMs. The framework contains most of the existing risk-sensitive approaches in economics, mathematical finance and optimal control theory as special cases (cf. Sec. 5). 2) Within the framework, we define a novel temporal discount scheme, which includes the conventional temporal discount scheme as a special case. We prove that the optimization problem for the new discounted objective function can be solved by a value iteration algorithm. 3) We investigate the optimization problem of the average prospect. Under one additional assumption, the solution to this optimization problem exists, and a value iteration is designed to compute it. 4) For the case where the reward and transition models of the MDP are unknown, we state an algorithm that estimates them and simultaneously learns the optimal policy. For one specific prospect map (the entropic map), a Q-learning-like algorithm is proposed to obtain the optimal policy without any knowledge of the MDP.
In order to avoid tedious mathematical details in general state-action spaces, we currently consider merely MDPs with finite state-action spaces. However, the extensions to general spaces are straightforward.
This paper is organized as follows. In Sec. 2, we briefly introduce the setups of MDPs and prospect maps, which are adapted in Sec. 3 to the MDP structure. Sec. 4 states the major theory of this paper, the discounted prospect and the average prospect, whose optimal control problems are solved by value iterations under different assumptions. In Sec. 5 we discuss the existing risk-sensitive approaches and show how to represent them with specific prospect maps. Two on-line algorithms, which might be of interest for an engineering-oriented audience, are stated in Sec. 6, followed by experiments with simple MDPs in the final section.
2.1 Markov Decision Processes
A Markov decision process  is composed of a state space , an action space , a transition model and a reward model . Both the state and the action space are assumed to be finite. The transition model denotes the probability of arriving at state given the current state and the chosen action at time . We assume the transition model is time-homogeneous. The reward function represents the reward obtained at state if action is chosen.
The policy at time is defined as the probability of choosing action given state . Let be the sequential policy where at time the policy is used, at the policy , etc. Let be the set of all policies. A policy is called Markov if for all , depends merely on and is independent of all states and actions before time . Let denote the set of all Markov policies and the set of all one-step Markov policies. Thus . A one-step policy is called Markov deterministic if for some and . With slight abuse of notation, we also write as a deterministic function . Denote the set of all one-step Markov deterministic policies by . For any , we define
There are usually three types of objective functions used in the literature of MDPs: finite-stage, discounted and average reward. We summarize them as follows,
where denotes the discount factor. Suppose we start from one given state . The optimization problem is to maximize the expected objective by selecting a policy :
where can be replaced by , or . Note that since the limit in the definition of the average reward might not exist (see e.g. Example 8.1.1, ), the strict definition of the optimization problem for the average reward should be
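For concreteness, the classical discounted objective above can be optimized by standard value iteration. The following is a minimal sketch for a finite MDP; the data layout (one transition matrix and one reward vector per action) and the function name are our own illustrative choices, not the paper's notation.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8):
    """Classical value iteration for the discounted expected-reward objective.

    P[a] is the |S| x |S| transition matrix under action a,
    r[a] is the |S| reward vector under action a, and gamma is in (0, 1).
    Returns the (approximate) optimal value and a greedy deterministic policy.
    """
    v = np.zeros(r[0].shape[0])
    while True:
        # Bellman optimality update: v <- max_a [ r_a + gamma * P_a v ]
        q = np.array([r[a] + gamma * P[a] @ v for a in range(len(P))])
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)
        v = v_new
```

Since the Bellman operator is a gamma-contraction in sup-norm, the iteration converges geometrically from any starting point.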
2.2 Dynamic Prospect Maps
In the setup of MDPs, we apply "rewards" instead of "costs" (which are common in the literature of Markov control processes ) to model immediate outcomes, and therefore in the optimization problems of MDPs (Eq. 3) the objectives are to be maximized rather than minimized. To be consistent with maximizing objectives, we use the name "prospect maps" for the nonlinear structures analogous to risk measures in the finance literature. Similar nomenclature can also be found in , where risk is replaced by valuation.
Let us consider a discrete-time stochastic process and . The capital letters and denote random variables, whereas the realizations of the random variables are denoted by lowercase letters, and , respectively. Let denote the set of all real-valued bounded functions on , for . We consider a map such that is a real-valued bounded function on for fixed . can also be viewed as a map from to . In the following, (in-)equalities between two functions are understood elementwise, i.e., we say if for all .
In the following, we first introduce conditional prospect maps and then construct a dynamic prospect map from to , , by a series of conditional prospect maps .
A map , , is called a conditional prospect map, if
Monotonicity. , if , then .
Time-consistency. For any and , . In particular, for each and , .
Centralization. The prospect of the zero outcome is zero, i.e., .
Remarks. The monotonicity axiom reflects the intuition that if the rewards of one choice are higher than the rewards of another, the prospect of that choice must also be higher. The time-consistency axiom is obviously a generalization of the conditional expectation. This axiom allows the temporal decomposition (see Proposition 2.1), and together with the monotonicity axiom makes dynamic programming  a feasible solution to the optimization problems (see Sec. 4). The centralization axiom sets the reference point to 0, i.e., there is no risk if there is no cost. Nevertheless, it is possible to use other reference points.
A map , , is called a dynamic prospect map, if there exists a series of conditional prospect maps such that
Let , , , , and , we have
Trivial using Axiom II. ∎
Remarks. In the finance literature, there exist various ways to extend CRMs to a temporal structure (e.g., [9, 20, 1, 28] and references therein). The definition is usually selected based on the application to which the dynamic risk measures are applied. A comparison of their subtle differences is beyond the scope of this paper. Nevertheless, two points are worth noting: 1) in all kinds of definitions, the time-consistency axiom is the most important component, as it allows the temporal decomposition shown in Prop. 2.1; and 2) these definitions require either coherence  or convexity [9, 20, 1], which means that the agent has to be economically rational, i.e., risk-aversive (for more discussion see Sec. 3.3). However, in some problems (especially in modeling real human behaviors), a mixed risk-preference (risk-aversive at some states while risk-seeking at others) is also a possible strategy. For instance, in gambling, some people are risk-aversive when losing money but risk-seeking when winning money. Therefore, we require neither coherence nor convexity. In this sense, our axioms are even more general than the axioms used in the finance literature. Finally, in the literature of coherent risk measures, non-additive measures can be defined due to the coherence. However, in this paper we do not assume coherence in the axioms. Instead, we build the theory on the functional spaces . Therefore, it is more accurate to use the term "map" than "measure".
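As a concrete (non-coherent in general, yet monotone and centered) instance, the entropic map mentioned in the contributions can be sketched numerically. The function name, signature and test values below are our own illustrative choices; the sign convention (negative parameter is risk-aversive for a reward-maximizing agent) follows the second-order Taylor expansion of the map, which penalizes variance for negative parameters.

```python
import numpy as np

def entropic_prospect(x, p, beta):
    """Entropic map rho(X) = (1/beta) * log E_p[exp(beta * X)].

    For a reward-maximizing agent, beta < 0 is risk-aversive and
    beta > 0 is risk-seeking; beta -> 0 recovers the plain expectation.
    `x` holds the outcome values and `p` their probabilities.
    """
    x, p = np.asarray(x, float), np.asarray(p, float)
    return np.log(p @ np.exp(beta * x)) / beta

# Monotonicity: raising every outcome raises the prospect.
x_lo, x_hi, p = [0.0, 1.0], [0.5, 2.0], [0.5, 0.5]
assert entropic_prospect(x_lo, p, -1.0) <= entropic_prospect(x_hi, p, -1.0)
# Centralization: the zero outcome has zero prospect.
assert abs(entropic_prospect([0.0, 0.0], p, -1.0)) < 1e-12
```

With a negative parameter the map lies below the expectation, matching the risk-aversive reading of concavity discussed in Sec. 3.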
3 Applying Prospect Maps in MDPs
The dynamic prospect maps introduced in Sec. 2.2 can be adapted to arbitrary temporal structures. To adapt them to the structure of MDPs, we assume the prospect maps work on the state sequence . On the other hand, since the probability of is controlled by the policies (together with the transition model of the MDP, , defined in Sec. 2.1), we further assume that the prospect maps depend on . Thus, the conditional prospect maps working on the MDP given a policy are written as .
3.1 Markov Prospect Maps for MDPs
The conditional prospect maps defined in Def. 2.1 might depend on the whole history, which could cause computational problems in real applications. Therefore, the prospect maps are additionally assumed to possess the Markov property. Let denote the space of all bounded functions that map from to .
Definition 3.1 (Markov Prospect Map for MDPs).
Let be a series of conditional prospect maps defined on the MDP given the policy . is called Markov, if there exists a series of maps such that
Remark. It is noticeable that the prospect map depends also on .
From now on, we consider only Markov prospect maps. Thus we can write as . Furthermore, we consider only the Markov policies . For a Markov random policy , depends only on . Hence, we can write as . For each -pair, there exists a corresponding deterministic policy satisfying . Therefore, we can define for each ,
We assume that the Markov prospect map is linear in , i.e.,
To simplify the problem, we consider only time-homogeneous Markov prospect maps, i.e., for all . Hence, can be abbreviated by , , and further by . Similar abbreviations are used for , which is a special case of . By Assumption 3.1, analogous to in Eq. 1, we obtain
Then , which is defined by , is a function in the space . can be viewed as a map from to itself. Since we assume the state space is finite, can be viewed as a -dimensional vector, where denotes the number of states. Thus can be understood as a map from to itself.
For any time-homogeneous Markov map , by its definition, satisfies the axioms of monotonicity and time-consistency for each . Thus is a topical map (see ), which satisfies i) whenever ; and ii) for all and . For each , we define the Hilbert semi-norm (here we follow the terminology in [12, 24]; the same semi-norm is called the span semi-norm in [26, 15]) and the sup-norm as follows,
Since we consider only the finite state space, is simply an -dimensional vector.
Suppose is a topical map. Then it can be shown that is nonexpansive under both the Hilbert semi-norm and the sup-norm (see Eq. 17 and 18, ), i.e., for all ,
Suppose is a time-homogeneous Markov prospect map for some one-step Markov policy . Assume furthermore is concave with respect to at , i.e. for any and and any , we have
Note that the objective is to maximize the prospect (which will be defined in Sec. 4). Suppose we have two policies and at the successive time-step, which generate two outcomes and , respectively. The concavity of implies that the outcome of the mixture of the two policies, , is always preferred (due to maximization) to the mixture of the outcomes of the two single policies, . In other words, given the policy we choose at the current time-step, we prefer a mixture of the two policies at the successive time-step. This shows that the corresponding risk-preference of the prospect map is risk-aversive. A similar result can be inferred for convex prospect maps. This categorization coincides with the categorization of risk-preferences by the concavity of the utility function in expected utility theory . In order to obtain a time-homogeneous risk-preference (risk-aversive or risk-seeking), an everywhere risk-preference is required. We define these as follows,
A time-homogeneous Markov prospect map is said to be
risk-aversive at , if it is concave w.r.t. at , and everywhere risk-aversive, if is concave w.r.t. at all and for all .
risk-seeking at , if is convex w.r.t. at , and everywhere risk-seeking, if is convex w.r.t. at all and for all .
Remarks. The categorization depends on the objective. In the CRM theory, the objective is to minimize the risk. Therefore, the categorization is the opposite: concavity means risk-seeking and convexity suggests risk-aversive behavior. Apparently, under Assumption 3.1, if is convex (concave) w.r.t. at all -pairs, then is everywhere convex (everywhere concave). Several existing risk maps in the literature (see Sec. 5) also confirm the above-defined categorization.
One widely used family of prospect maps, the coherent prospect maps, is worth mentioning.
A time-homogeneous Markov prospect map is said to be coherent if for all , for all , and .
4 Discounted and Average Prospect
4.1 Finite-stage Prospect
According to the definition of dynamic prospect maps (Def. 2.2), we define the -stage total prospect as follows,
Suppose the prospect map under consideration is time-homogeneous and Markov. By Prop. 2.1, we have the following decomposition
where the short notation is used. The optimization problem of this objective function is to maximize the -stage total prospect among all Markov random policies, i.e.,
Suppose Assumption 3.1 holds true. Obviously, the optimization problem can be solved by dynamic programming, i.e., we start from
Then we calculate backwards, for ,
It is easy to verify that .
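The backward recursion above can be sketched in code. We pass the one-step Markov prospect map in as a function of the successor-value vector, so that the plain expectation (the classical case) or any other map can be plugged in; the data layout and all names are our own illustrative choices.

```python
import numpy as np

def finite_stage_prospect(P, r, prospect, N):
    """Backward DP for the N-stage total prospect.

    P[a]: |S| x |S| transition matrix, r[a]: |S| reward vector.
    prospect(a, P_a, v): one-step Markov prospect of the continuation
    value v under action a (for the classical case it is just P_a @ v).
    Starts from v_N = 0 and applies the decomposition of Prop. 2.1 backwards.
    """
    v = np.zeros(r[0].shape[0])
    for _ in range(N):
        q = np.array([r[a] + prospect(a, P[a], v) for a in range(len(P))])
        v = q.max(axis=0)
    return v

# With the plain expectation this reduces to classical finite-horizon DP.
expectation = lambda a, P_a, v: P_a @ v
```

The maximization inside each sweep is justified by the monotonicity and time-consistency axioms, which make dynamic programming applicable.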
4.2 Discounted Total Prospect
Let denote the discount factor. Suppose Assumption 3.1 holds true. We use the discounted -stage prospect as follows,
and the discounted total prospect as
Thus, the optimization problem for discounted total prospect is
We first prove that the limit in Eq. 7 exists. Given , we define the map as , and . For any , define
For any , i) the limit in Eq. 7 exists; ii) , .
(i) Since is bounded for finite state and action spaces, there exists a number such that for all . Hence, by the monotonicity and the additive property of ,
which implies as .
(ii) Since , is also bounded for all . Let be the upper bound such that . Hence,
Using the conclusion of (i), we have , . ∎
The trivial extension of the classical discounted MDP (cf. in Eq. 2) is as follows,
Using the time-consistency property of prospect maps, we have the following decomposition
We have the following observations:
Analogously to Prop. 4.1(i), we can prove that is well-defined.
If the prospect map is coherent, then is equivalent to (cf. Eq. 7), the discounted total prospect under our definition. Therefore, defined for any coherent prospect map is merely a special case of our definition. In particular, the discounted total reward in classical MDPs is a special case of the discounted total prospect, since the expectation is coherent.
Ruszczyński (2010)  uses as the objective function, which he solves by a value iteration algorithm. However, in the proof of the value iteration algorithm, he uses the representation theorem, which is valid merely for coherent prospect maps. In contrast, we will see later that the objective allows a value iteration algorithm for arbitrary prospect maps.
Given a function and , consider the following map
Now we prove the key property: is a contracting map.
Suppose Assumption 3.1 holds true. Then is a contracting map under sup-norm, i.e., , for all and .
Under Assumption 3.1, there exist deterministic policies and satisfying
By definition, we have for all ,
where the last inequality is due to the nonexpansiveness of . Exchanging and , we have
Thus, for all , which implies ∎
We state the following algorithm:
select one , ;
if , stop; otherwise, set and go to step 2.
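The steps above can be sketched as follows, with an entropic one-step map standing in for a concrete prospect map; the data layout, names and stopping rule are our own illustrative choices, not the paper's.

```python
import numpy as np

def discounted_prospect_vi(P, r, prospect, gamma, eps=1e-9):
    """Value iteration for the discounted total prospect.

    Iterates v -> max_a [ r_a + gamma * prospect(a, P_a, v) ]. Whenever each
    one-step prospect map is nonexpansive in sup-norm, this operator is a
    gamma-contraction, so the iteration converges to its unique fixed point.
    """
    v = np.zeros(r[0].shape[0])
    while True:
        q = np.array([r[a] + gamma * prospect(a, P[a], v) for a in range(len(P))])
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < eps:
            return v_new, q.argmax(axis=0)
        v = v_new

def entropic(beta):
    """One-step entropic map applied state-wise: (1/beta) log(P_a exp(beta v))."""
    return lambda a, P_a, v: np.log(P_a @ np.exp(beta * v)) / beta
```

Setting `prospect` to the plain expectation `lambda a, P_a, v: P_a @ v` recovers classical discounted value iteration.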
Since is a contracting map, due to the Banach contraction mapping principle, we conclude that for all , and , as , where is the fixed point of such that and denotes the corresponding policy. The final step is to prove with the following theorem.
Suppose Assumption 3.1 holds true. For any , i) if , then ; ii) If , then ; iii) if , then .
(i) Consider a Markov policy . implies that for any ,
We apply the above inequality recursively,
Since is arbitrary, the above inequality implies .
4.3 Average Prospect
Analogous to the average reward defined in Eq. 2, we consider the following average prospect,
Suppose there is a pair , which satisfies the following equation
This equation is called the average prospect optimality equation (APOE). Under Assumption 3.1, there exists a deterministic function such that
Define operator as
Let be an arbitrary random Markov policy. Define
Suppose Assumption 3.1 holds true and the APOE has a solution , and . Let be the deterministic policy found in the APOE. Then for all .
We prove first. Define an operator as follows,
and , . Hence, due to the nonexpansiveness of , we have
On the other hand, by APOE, we have
Hence, . Together with Eq. 10, we obtain .
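When the APOE has a solution, it can be searched for with a relative value iteration. The sketch below is our own illustrative formulation (names and the choice of reference state are assumptions): each sweep subtracts the value at a reference state so the iterates stay bounded, and the subtracted offset estimates the optimal average prospect. Its convergence is not automatic; it relies on extra conditions such as the span-contraction assumption introduced below.

```python
import numpy as np

def average_prospect_vi(P, r, prospect, n_sweeps=500):
    """Relative value iteration for the average prospect (illustrative sketch).

    Each sweep applies h -> max_a [ r_a + prospect(a, P_a, h) ] and then
    subtracts the value at reference state 0; that offset estimates the
    optimal average prospect, and h estimates the relative value function.
    """
    h = np.zeros(r[0].shape[0])
    rho = 0.0
    for _ in range(n_sweeps):
        q = np.array([r[a] + prospect(a, P[a], h) for a in range(len(P))])
        h_new = q.max(axis=0)
        rho = h_new[0]        # offset at the reference state
        h = h_new - rho       # keep the iterates bounded
    return rho, h
```

With the plain expectation and a single action, this recovers the classical average reward of the induced Markov chain.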
Now the question is to find proper assumptions that guarantee the existence of solutions of the APOE. Assumption 3.1 alone is not sufficient for this purpose. Recall that denotes the Hilbert semi-norm defined in Sec. 3.2. We further assume:
There exists an integer and a real number such that for all deterministic policies
Define the operator, ,
Let be as defined in Eq. 9. There must be two policies satisfying and respectively.
Exchanging and , we have . Thus,