1 Introduction
In many applications of decision-making problems modeled by Markov decision processes (MDPs), it is reasonable to incorporate some measure of risk in order to rule out policies that achieve a high expected reward at the cost of risky and error-prone actions. Consider, for example, an expensive manufacturing machine with two running modes: one where the machine runs at peak level and produces the maximum number of products most of the time, at the cost of a high chance of serious damage, and one where the machine runs slightly slower to avoid damage. Most companies would agree that the second option is more reasonable. Yet, if a company made its decision with the help of classical MDPs, it would pick option one and follow the risky strategy.
Most decision-making models, MDPs included, consist of two descriptions of the environment's mechanism: immediate outcomes (rewards or costs) received at a state by performing an action, and transitions, the transition probabilities between states under the chosen actions. Both descriptions are objective in the sense that both outcomes and transition probabilities can be estimated by experiencing the environment sufficiently many times. The “risk”, however, depends on the subjective perception of the agent, since different agents might have different risk-preferences when facing the same environment. For instance, $100 is more valuable to the poor than to the rich. Behavioral experiments [21] show that people tend to overreact to small-probability events, but underreact to medium and large probabilities. Due to the apparent usefulness of risk-sensitive objectives, the topic is of major importance in finance and economics. In economics, the utility function is widely used to model the subjective perception of rewards. The renowned prospect theory (PT) [21] introduces the probability weighting function to model the subjective perception of probabilities. PT, however, can only model single decision problems, whereas in an MDP a sequence of decisions has to be made. In mathematical finance, Ruszczyński (2010) [28] applies coherent/convex risk measures (CRMs) [2, 11] to incorporate risk into a sequential decision-making structure. However, there are two major drawbacks in his work: 1) he assumes that the risk measures must be coherent or convex, which is not true for some of the most important instances of risk measures, and 2) he discusses merely the finite-stage and discounted risk problems for coherent risk measures. A theory of discounted and average risk for arbitrary measures, as in the classical MDP, has not been developed yet.
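To make the probability weighting idea from PT concrete, the sketch below uses the common one-parameter weighting function from the behavioral-economics literature; both the functional form and the parameter value are illustrative assumptions, not notation taken from this paper.

```python
# A common one-parameter probability weighting function associated with
# prospect theory [21]: w(p) = p^g / (p^g + (1-p)^g)^(1/g).  The value
# g ~ 0.61 is an empirical estimate from the literature, not this paper.

def weight(p: float, g: float = 0.61) -> float:
    """Subjective probability weight for an objective probability p."""
    return p ** g / (p ** g + (1.0 - p) ** g) ** (1.0 / g)

assert weight(0.01) > 0.01   # small probabilities are overweighted
assert weight(0.90) < 0.90   # large probabilities are underweighted
```

The two assertions reproduce exactly the over/underreaction pattern described above.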
In the MDP community (mainly operations research and control theory), despite the apparent usefulness of risk-sensitive measures, few works address the issue, since many risk-sensitive objectives cannot be optimized efficiently. The mean-variance trade-off is a popular risk criterion, where the variance takes the part of the risk measure as it penalizes highly varying returns. However, this objective is difficult to optimize, especially when a discount factor is included [10]. Recently, in [23], the problem was proved to be NP-hard even for finite-horizon MDPs. Another popular approach is to apply the exponential utility function. Although an efficient solution exists for the average infinite-horizon MDP (see e.g. [5]), it is proved in [7] that the objective for the discounted MDP is difficult and the optimal policy might not be stationary. The question is now whether all risk-sensitive objectives are difficult to optimize for MDPs, or whether measures like the mean-variance trade-off are just not the “right” measures for MDPs. Inspired by the discoveries in mathematical finance and economics, our intuition is therefore to adapt the CRM theory to the MDP structure, where two concerns must be balanced: 1) the axioms should be as general as possible, so as to model all kinds of risk-preferences including mixed risk-preference, and 2) the underlying optimization problem should be solvable by a computationally feasible algorithm.
The main contributions of this paper are: 1) To incorporate risk into MDPs, we set up a general framework via prospect maps, which are a generalization of CRMs. The framework contains most of the existing risk-sensitive approaches in economics, mathematical finance and optimal control theory as special cases (cf. Sec. 5). 2) Within the framework, we define a novel temporal discount scheme, which includes the conventional temporal discount scheme as a special case. The optimization problem for the new discounted objective function is proved to be solvable by a value iteration algorithm. 3) We investigate the optimization problem of the average prospect. Under one additional assumption, the solution to this optimization problem exists and a value iteration is designed to find it. 4) For the case where the reward and transition models of the MDP are unknown, we state an algorithm that estimates them and simultaneously learns the optimal policy. For one specific prospect map (the entropic map), a Q-learning-like algorithm is proposed that obtains the optimal policy without any knowledge of the MDP.
In order to avoid tedious mathematical details in general state-action spaces, we currently consider only MDPs with finite state-action spaces. However, the extension to general spaces is straightforward.
This paper is organized as follows. In Sec. 2, we briefly introduce the setup of MDPs and prospect maps, which are adapted in Sec. 3 to the MDP structure. Sec. 4 states the main theory of this paper, the discounted prospect and the average prospect, whose optimal control problems are solved by value iterations under different assumptions. In Sec. 5 we discuss the existing risk-sensitive approaches and show how to represent them with specific prospect maps. Two online algorithms, which might be of interest to an engineering-oriented audience, are stated in Sec. 6, followed by experiments with simple MDPs in the final section.
2 Setup
2.1 Markov Decision Processes
A Markov decision process [26] is composed of a state space , an action space , a transition model and a reward model . Both the state and the action space are assumed to be finite. The transition model denotes the probability of arriving at state given the current state and the chosen action at time . We assume the transition model is time-homogeneous. The reward function represents the reward obtained at state if action is chosen.
The policy at time is defined as the probability of choosing action given state . Let be the sequential policy where the policy is used at time , the policy at , etc. Let be the set of all policies. A policy is called Markov if, for all , it depends merely on and is independent of all the states and actions before time . Let denote the set of all Markov policies and the set of all one-step Markov policies. Thus . A one-step policy is called Markov deterministic if for some and . With a slight abuse of notation, we also write it as a deterministic function . Denote the set of all one-step Markov deterministic policies by . For any , we define
(1) 
There are usually three types of objective functions used in the MDP literature: the finite-stage, discounted and average reward. We summarize them as follows,
(2) 
where denotes the discount factor. Suppose we start from one given state . The optimization problem is to maximize the expected objective by selecting a policy :
(3) 
where can be replaced by , or . Note that since the limit defining the average reward might not exist (see e.g. Example 8.1.1, [26]), the strict definition of the optimization problem of the average reward should be
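As a concrete (if simplified) illustration of the three objectives, the sketch below evaluates them on a single fixed reward trajectory; in the MDP they are of course expectations (or a lim inf, for the average reward) over random trajectories. All names here are assumptions for the sketch.

```python
# Illustrative evaluation of the three classical objectives on one
# reward trajectory r_0, ..., r_{T-1}.

def finite_stage(rewards):
    """Finite-stage total reward: sum of the first T rewards."""
    return sum(rewards)

def discounted(rewards, beta):
    """Discounted total reward with discount factor beta in (0, 1)."""
    return sum(beta ** t * r for t, r in enumerate(rewards))

def average(rewards):
    """Average reward over the trajectory."""
    return sum(rewards) / len(rewards)

rewards = [1.0, 0.0, 2.0, 1.0]
assert finite_stage(rewards) == 4.0
assert discounted(rewards, 0.5) == 1.0 + 0.0 + 2.0 * 0.25 + 1.0 * 0.125
assert average(rewards) == 1.0
```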
2.2 Dynamic Prospect Maps
In the setup of MDPs, we apply “rewards” instead of “costs” (which are common in the literature on Markov control processes [16]) to model immediate outcomes, and therefore in the optimization problems of MDPs (Eq. 3), objectives are to be maximized rather than minimized. To be consistent with maximizing objectives, we use “prospect maps” to name the structures analogous to the nonlinear risk measures in the finance literature. Similar nomenclature can also be found in [20], where risk is replaced by valuation.
Let us consider a discrete-time stochastic process and . The capital letters and denote random variables, whereas the realizations of the random variables are denoted by normal letters, and , respectively. Let denote the set of all real-valued bounded functions on , for . We consider a map such that is a real-valued bounded function on for fixed . It can also be viewed as a map from to . In the following, (in)equalities between two functions are understood elementwise, i.e., we say if for all . We first introduce conditional prospect maps and then construct a dynamic prospect map from to , , by a series of conditional prospect maps .
Definition 2.1.
A map , , is called a conditional prospect map, if

Monotonicity. , if , then .

Time-consistency. For any and , . In particular, for each and , .

Centralization. .
Remarks. The monotonicity axiom reflects the intuition that if the rewards of one choice are higher than those of another choice, then the prospect of the former must be higher than that of the latter. The time-consistency axiom is obviously a generalization of the conditional expectation. This axiom allows the temporal decomposition (see Proposition 2.1) and, together with the monotonicity axiom, makes dynamic programming [3] a feasible solution method for the optimization problems (see Sec. 4). The centralization axiom sets the reference point to 0, i.e., there is no risk if there is no cost. Nevertheless, it is possible to use other reference points.
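The conditional expectation is the canonical example of a conditional prospect map. The sketch below checks numerically, on a finite sample space with a partition standing in for the information available at one time step, that the time-consistency axiom specializes to the tower property and that centralization holds; the probabilities, variable, and partition are illustrative.

```python
# The conditional expectation as a conditional prospect map on a finite
# sample space.  E[X | G] is constant on each block of the partition G.

probs  = [0.1, 0.2, 0.3, 0.4]   # probabilities of four elementary outcomes
x      = [4.0, -1.0, 0.5, 2.0]  # a bounded random variable X
blocks = [[0, 1], [2, 3]]       # the partition G ("information at time t")

def expect(p, v):
    return sum(pi * vi for pi, vi in zip(p, v))

def cond_expect(p, v, partition):
    """E[X | G] as a random variable, constant on each block of G."""
    out = [0.0] * len(v)
    for block in partition:
        mass = sum(p[i] for i in block)
        mean = sum(p[i] * v[i] for i in block) / mass
        for i in block:
            out[i] = mean
    return out

y = cond_expect(probs, x, blocks)
assert abs(expect(probs, y) - expect(probs, x)) < 1e-12   # tower property
assert cond_expect(probs, [0.0] * 4, blocks) == [0.0] * 4  # centralization
```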
Definition 2.2.
A map , , is called a dynamic prospect map, if there exists a series of conditional prospect maps such that
Proposition 2.1.
Let , , , , and , we have
Proof.
Trivial using Axiom II. ∎
Remarks. In the finance literature, there exist various ways to extend CRMs to a temporal structure (e.g., [9, 20, 1, 28] and references therein). The definition is usually selected based on the application to which the dynamic risk measures are applied. Comparing their subtle differences is beyond the scope of this paper. Nevertheless, two points are worth noting: 1) in all these definitions, the time-consistency axiom is the key component that allows the temporal decomposition shown in Prop. 2.1; and 2) these definitions require either coherence [28] or convexity [9, 20, 1], which means that the agent has to be economically rational, i.e., risk-averse (see Sec. 3.3 for more discussion). However, in some problems (especially in modeling real human behavior), a mixed risk-preference (risk-averse at some states while risk-seeking at others) is also a possible strategy. For instance, when gambling, some people are risk-averse when losing money but risk-seeking when winning money. Therefore, we require neither coherence nor convexity. In this sense, our axioms are even more general than those used in the finance literature. Finally, in the literature on coherent risk measures, non-additive measures can be defined thanks to coherence. However, in this paper we do not assume coherence in the axioms. Instead, we build the theory on the function spaces . It is therefore more accurate to use the term “map” than “measure”.
3 Applying Prospect Maps in MDPs
The dynamic prospect maps introduced in Sec. 2.2 can be adapted to arbitrary temporal structures. To adapt them to the structure of MDPs, we assume the prospect maps operate on the state sequence . On the other hand, since the probability of is controlled by the policy (together with the transition model of the MDP, , defined in Sec. 2.1), we assume further that the prospect maps depend on . Thus, the conditional prospect maps operating on the MDP under a given policy are written as .
3.1 Markov Prospect Maps for MDPs
The conditional prospect maps defined in Def. 2.1 might depend on the whole history, which could cause computational problems in real applications. Therefore, the prospect maps are additionally assumed to possess the Markov property. Let denote the space of all bounded functions mapping from to .
Definition 3.1 (Markov Prospect Map for MDPs).
Let be a series of conditional prospect maps defined on the MDP given the policy . is called Markov, if there exists a series of maps such that
Remark. It is noticeable that the prospect map depends also on .
From now on, we consider only Markov prospect maps; thus we can write as . Furthermore, we consider only Markov policies . For a Markov random policy , depends only on . Hence, we can write as . For each pair, there exists a corresponding deterministic policy satisfying . Therefore, we can define, for each ,
(4) 
Assumption 3.1.
We assume that the Markov prospect map is linear in , i.e.,
To simplify the problem, we consider only time-homogeneous Markov prospect maps, i.e., for all . Hence, can be abbreviated to , , and further to . Similar abbreviations are used for , which is a special case of . By Assumption 3.1, analogously to Eq. 1, we obtain
Then , which is defined by , is a function in the space , and can be viewed as a map from to itself. Since we assume the state space is finite, can be viewed as a dimensional vector, where denotes the number of states. Thus can be understood as a map from to itself.

3.2 Nonexpansiveness
For any time-homogeneous Markov map , by definition satisfies the axioms of monotonicity and time-consistency for each . Thus is a topical map (see [12]), which satisfies i) whenever , and ii) for all and . For each , we define the Hilbert seminorm (here we follow the terminology of [12, 24]; the same seminorm is called the span seminorm in [26, 15]) and the sup-norm as follows,
Since we consider only the finite state space, is simply an dimensional vector.
Suppose is a topical map. Then it can be shown that is nonexpansive under both the Hilbert seminorm and the sup-norm (see Eqs. 17 and 18, [12]), i.e., for all ,
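The nonexpansiveness of topical maps can be checked numerically on a familiar instance: an undiscounted Bellman "max" operator is monotone and additively homogeneous, hence topical. The toy model, names, and random inputs below are all illustrative assumptions.

```python
# An undiscounted Bellman operator T(v)(s) = max_a [ r(s,a) + E_{P(.|s,a)} v ]
# is topical, hence nonexpansive in the sup-norm.

import random

random.seed(0)
S, A = 3, 2
r = [[random.uniform(-1, 1) for _ in range(A)] for _ in range(S)]
P = [[[1.0 / S] * S for _ in range(A)] for _ in range(S)]  # uniform transitions

def T(v):
    return [max(r[s][a] + sum(P[s][a][t] * v[t] for t in range(S))
                for a in range(A)) for s in range(S)]

def sup(u):
    return max(abs(x) for x in u)

v = [random.uniform(-5, 5) for _ in range(S)]
w = [random.uniform(-5, 5) for _ in range(S)]
gap = [a - b for a, b in zip(T(v), T(w))]
assert sup(gap) <= sup([a - b for a, b in zip(v, w)]) + 1e-12  # nonexpansive
```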
3.3 Categorization
Suppose is a time-homogeneous Markov prospect map for some one-step Markov policy . Assume furthermore that is concave with respect to at , i.e., for any and and any , we have
Note that the objective is to maximize the prospect (which will be defined in Sec. 4). Suppose we have two policies and at the successive time-step, which generate two outcomes and , respectively. The concavity of implies that the outcome of a mixture of the two policies, , is always preferred (due to maximization) to the mixed outcome of the two single policies, . In other words, given the policy we choose at the current time step, we prefer a mixture of two policies at the successive time-step. This shows that the corresponding risk-preference of the prospect map is risk-averse. A similar result can be inferred for convex prospect maps. This categorization coincides with the categorization of risk-preferences by the concavity of the utility function in expected utility theory [13]. In order to obtain a time-homogeneous risk-preference (risk-averse or risk-seeking), an everywhere risk-preference is required. We define these notions as follows,
Definition 3.2.
A timehomogeneous Markov prospect map is said to be

risk-averse at , if it is concave w.r.t. at , and everywhere risk-averse, if is concave w.r.t. at all and for all .

risk-seeking at , if is convex w.r.t. at , and everywhere risk-seeking, if is convex w.r.t. at all and for all .
Remarks. The categorization depends on the objective. In CRM theory, the objective is to minimize the risk; therefore, the categorization is opposite: concavity means risk-seeking and convexity means risk-averse. Apparently, under Assumption 3.1, if is convex (concave) w.r.t. at all pairs, then is everywhere convex (everywhere concave). Several existing risk maps in the literature (see Sec. 5) also confirm the categorization defined above.
One widely used family of prospect maps, the coherent prospect maps, is worth mentioning.
Definition 3.3.
A timehomogeneous Markov prospect map is said to be coherent if for all , for all , and .
4 Discounted and Average Prospect
4.1 Finitestage Prospect
According to the definition of dynamic prospect maps (Def. 2.2), we define the stage total prospect as follows,
(5) 
Suppose the prospect map under consideration is time-homogeneous and Markov. By Prop. 2.1, we have the following decomposition
where the shorthand notation is used. The optimization problem for this objective function is to maximize the stage total prospect over all Markov random policies, i.e.,
Suppose Assumption 3.1 holds true. Obviously, the optimization problem can be solved by dynamic programming, i.e., we start from
Then we calculate backwards, for ,
It is easy to verify that .
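The backward recursion above can be sketched with a pluggable one-step prospect map `rho(s, a, v)`; plugging in the conditional expectation recovers classical finite-horizon value iteration. The model, the function names, and the interface are illustrative assumptions, not the paper's notation.

```python
# Backward dynamic programming for the finite-stage prospect with a
# generic one-step prospect map rho(s, a, v) mapping a next-state value
# function v to a scalar.

def expectation_rho(P):
    """The risk-neutral special case: rho is the expectation under P."""
    def rho(s, a, v):
        return sum(P[s][a][t] * v[t] for t in range(len(v)))
    return rho

def finite_stage_dp(S, A, r, rho, T):
    """v_T = 0 (centralization), then v_t(s) = max_a [ r(s,a) + rho(s,a,v_{t+1}) ]."""
    v = [0.0] * S
    policy = []
    for _ in range(T):                  # stages t = T-1 down to 0
        q = [[r[s][a] + rho(s, a, v) for a in range(A)] for s in range(S)]
        policy.append([max(range(A), key=lambda a, s=s: q[s][a]) for s in range(S)])
        v = [max(q[s]) for s in range(S)]
    policy.reverse()                    # policy[t][s] = greedy action at stage t
    return v, policy

# Deterministic 2-state toy model: every action keeps the current state.
P = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
r = [[0.0, 1.0], [2.0, 0.0]]
v0, pi = finite_stage_dp(2, 2, r, expectation_rho(P), T=3)
assert v0 == [3.0, 6.0]                 # three stages of the best stay-reward
```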
4.2 Discounted Total Prospect
Let denote the discount factor. Suppose Assumption 3.1 holds true. We use the discounted stage prospect as follows,
(6) 
and the discounted total prospect as
(7) 
Thus, the optimization problem for discounted total prospect is
We first prove that the limit exists in Eq. 7. Given , we define the map as , and . For any , define
Proposition 4.1.
For any , i) the limit in Eq. 7 exists; ii) , .
Proof.
(i) Since is bounded for finite state and action spaces, there exists a number such that for all . Hence, by monotonicity and additive property of ,
which implies as .
(ii) Since , is also bounded for all . Let be the upper bound such that . Hence,
Using the conclusion of (i), we have , . ∎
Discussion
The trivial extension of the classical discounted MDP (cf. Eq. 2) is as follows,
Using the timeconsistency property of prospect maps, we have the following decomposition
We have the following observations:

We can prove, analogously to Prop. 4.1(i), that is well-defined.

If the prospect map is coherent, then is equivalent to (cf. Eq. 7), the discounted total prospect under our definition. Therefore, defined for any coherent prospect map is merely a special case of our definition. In particular, the discounted total reward in classical MDPs is a special case of the discounted total prospect, since it is coherent.

Ruszczyński (2010) [28] uses as the objective function, which he solves by a value iteration algorithm. However, in the proof of the value iteration algorithm, he uses the representation theorem, which is valid only for coherent prospect maps. In contrast, we will see later that the objective allows a value iteration algorithm for arbitrary prospect maps.
Contracting Map
Given a function and , consider the following map
Now we prove the key property: is a contracting map.
Lemma 4.1.
Suppose Assumption 3.1 holds true. Then is a contracting map under the sup-norm, i.e., , for all and .
Proof.
Under Assumption 3.1, there exist deterministic policies and satisfying
By definition, we have for all ,
where the last inequality is due to the nonexpansiveness of . Exchanging and , we have
Thus, for all , which implies ∎
Value iteration
We state the following algorithm:

select one , ;

calculate ,

if , stop; otherwise, set and go to step 2.
Since is a contracting map, by the Banach contraction mapping principle we conclude that for all , and as , where is the fixed point of such that , and denotes the corresponding policy. The final step is to prove , via the following theorem.
Theorem 4.1.
Suppose Assumption 3.1 holds true. For any , i) if , then ; ii) If , then ; iii) if , then .
Proof.
(i) Consider a Markov policy . implies that, for any ,
Applying the above inequality recursively,
Since is arbitrary, the above inequality implies .
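The value iteration of this subsection can be sketched concretely with the expectation as the (coherent) prospect map, in which case the contracting map reduces to the classical discounted Bellman operator. The model, names, and stopping rule below are illustrative assumptions.

```python
# Value iteration for the discounted objective with the expectation as
# prospect map: iterate T_beta(v)(s) = max_a [ r(s,a) + beta * E_{P(.|s,a)} v ]
# until the sup-norm change falls below a tolerance.

def value_iteration(S, A, r, P, beta, eps=1e-10):
    v = [0.0] * S
    while True:
        v_new = [max(r[s][a] + beta * sum(P[s][a][t] * v[t] for t in range(S))
                     for a in range(A)) for s in range(S)]
        if max(abs(a - b) for a, b in zip(v_new, v)) < eps:
            return v_new
        v = v_new

# Sanity check on a 1-state model: the fixed point of v = 1 + beta*v
# is 1 / (1 - beta).
v = value_iteration(1, 1, [[1.0]], [[[1.0]]], beta=0.5)
assert abs(v[0] - 2.0) < 1e-6
```

By the Banach contraction principle the loop terminates for any `beta` in (0, 1), since each sweep shrinks the sup-norm distance to the fixed point by the factor `beta`.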
4.3 Average Prospect
Analogous to the average reward defined in Eq. 2, we consider the following average prospect,
where is defined in Eq. 5. Here “” is used to avoid the case where the limit of does not exist (see e.g., Example 8.1.1, [26]). The optimization problem of average prospect is therefore,
Suppose there is a pair , which satisfies the following equation
(8) 
This equation is called the average prospect optimality equation (APOE). Under Assumption 3.1, there exists a deterministic function such that
Define operator as
Let be an arbitrary random Markov policy. Define
(9) 
Lemma 4.2.
Suppose Assumption 3.1 holds true, the APOE has a solution , and . Let be the deterministic policy found from the APOE. Then , for all .
Proof.
We first prove . Define an operator as follows,
and , . Hence, due to the nonexpansiveness of , we have
(10) 
On the other hand, by APOE, we have
Hence, . Together with Eq. 10, we obtain .
The question now is to find appropriate assumptions that guarantee the existence of solutions of the APOE. Assumption 3.1 alone is not sufficient. Recall that denotes the Hilbert seminorm defined in Sec. 3.2. We further assume
Assumption 4.1.
There exist an integer and a real number such that for all deterministic policies
where .
Define the operator, ,
(13) 
Proof.
Let be as defined in Eq. 9. There must be two policies satisfying and respectively.
Exchanging and , we have . Thus,
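A value iteration for the average objective can be sketched, again with the expectation as the prospect map, as a relative value iteration: the value at a reference state is subtracted each sweep so the iterates stay bounded, and under a span-contraction condition in the spirit of Assumption 4.1 the span (Hilbert seminorm) of successive changes shrinks. The model, the reference-state choice, and the stopping rule are all illustrative assumptions.

```python
# Relative value iteration for the average objective (expectation as
# prospect map): iterate the undiscounted Bellman operator, renormalize
# by the reference state's value, and stop when the span of the change
# is small.

def relative_value_iteration(S, A, r, P, eps=1e-10, max_iter=100_000):
    v = [0.0] * S
    g = 0.0
    for _ in range(max_iter):
        Tv = [max(r[s][a] + sum(P[s][a][t] * v[t] for t in range(S))
                  for a in range(A)) for s in range(S)]
        g = Tv[0] - v[0]                 # gain estimate at reference state 0
        v_new = [x - Tv[0] for x in Tv]  # renormalize against state 0
        d = [a - b for a, b in zip(v_new, v)]
        if max(d) - min(d) < eps:        # span (Hilbert seminorm) of the change
            return g, v_new
        v = v_new
    return g, v

# 2-state chain with reward 1 at state 0 and 3 at state 1 and uniform
# transitions: the optimal average reward is 2.
r = [[1.0], [3.0]]
P = [[[0.5, 0.5]], [[0.5, 0.5]]]
g, v = relative_value_iteration(2, 1, r, P)
assert abs(g - 2.0) < 1e-8
```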