
# Risk-sensitive Markov control processes

We introduce a general framework for measuring risk in the context of Markov control processes with risk maps on general Borel spaces that generalize known concepts of risk measures in mathematical finance, operations research and behavioral economics. Within the framework, applying weighted norm spaces to incorporate also unbounded costs, we study two types of infinite-horizon risk-sensitive criteria, discounted total risk and average risk, and solve the associated optimization problems by dynamic programming. For the discounted case, we propose a new discount scheme, which is different from the conventional form but consistent with the existing literature, while for the average risk criterion, we state Lyapunov-like stability conditions that generalize known conditions for Markov chains to ensure the existence of solutions to the optimality equation.


## 1 Introduction

In many applications of decision-making problems modeled by Markov decision processes (MDPs), it is reasonable to incorporate some measure of risk in order to rule out policies that achieve a high expected reward at the cost of risky and error-prone actions. Consider, for example, an expensive manufacturing machine with two running modes: in the first, the machine runs at peak level and produces the maximum number of products most of the time, at the cost of a high chance of serious damage; in the second, the machine runs slightly slower to avoid damage. Most companies would agree that the second option is more reasonable. Yet a company making decisions with the help of classical MDPs would pick the first option and go for the risky strategy.

Most decision-making models, MDPs included, consist of two descriptions of the mechanism of the environment: immediate outcomes (rewards or costs) obtained at a state by performing an action, and transitions, the transition probabilities between states under given actions. Both descriptions are objective in the sense that both outcomes and transition probabilities can be estimated by experiencing the environment sufficiently many times. "Risk", however, depends on the subjective perception of the agent, since different agents might have different risk preferences facing the same environment. For instance, $100 is more valuable to the poor than to the rich. Behavioral experiments [21] show that people tend to overreact to small-probability events, but underreact to medium and large probabilities.

Due to the apparent usefulness of risk-sensitive objectives, the topic is of major importance in finance and economics. In economics, the utility function is widely used to model the subjective perception of rewards. The renowned prospect theory (PT) [21] introduces the probability weighting function to model the subjective perception of probabilities. PT, however, can only model a single decision problem, whereas in an MDP a sequence of decisions has to be made. In mathematical finance, Ruszczyński (2010) [28] applies coherent/convex risk measures (CRMs) [2, 11] to incorporate risk in a sequential decision-making structure. However, there are two major drawbacks in this work: 1) it assumes that the risk measures are coherent or convex, which fails for some of the most important instances of risk measures; and 2) it discusses only the finite-stage and discounted risk problems for coherent risk measures. A theory of discounted and average risk for arbitrary risk measures, as in classical MDPs, has not been developed yet.

In the MDP community (mainly operations research and control theory), despite the apparent usefulness of risk-sensitive measures, few works address the issue, since many risk-sensitive objectives cannot be optimized efficiently. The mean-variance trade-off is a popular risk criterion, where the variance plays the part of the risk measure as it penalizes highly varying returns. However, this objective is difficult to optimize, especially when a discount factor is included [10]. Recently, [23] proved the problem to be NP-hard even for finite-horizon MDPs. Another popular approach is to apply the exponential utility function. Although an efficient solution exists for the average infinite-horizon MDP (see e.g. [5]), it is proved in [7] that the discounted objective is difficult and the optimal policy might not be stationary.

The question is now whether all risk-sensitive objectives are difficult to optimize for MDPs, or whether measures like the mean-variance trade-off are simply not the "right" measures for MDPs. Inspired by the developments in mathematical finance and economics, our intuition is to adapt the CRM theory to the MDP structure, where two concerns must be balanced: 1) the axioms should be as general as possible, so as to model all kinds of risk preferences, including mixed risk preferences; and 2) the underlying optimization problem should be solvable by a computationally feasible algorithm.

The main contributions of this paper are: 1) To incorporate risk into MDPs, we set up a general framework via prospect maps, a generalization of CRMs. The framework contains most existing risk-sensitive approaches in economics, mathematical finance and optimal control theory as special cases (cf. Sec. 5). 2) Within the framework, we define a novel temporal discount scheme, which includes the conventional temporal discount scheme as a special case. We prove that the optimization problem for the new discounted objective can be solved by a value iteration algorithm. 3) We investigate the optimization problem of the average prospect. Under one additional assumption, a solution to this optimization problem exists and a value iteration is designed to find it. 4) For the case where the knowledge of the MDP (reward and transition models) is unavailable, we state an algorithm that estimates the reward and transition models of the underlying MDP while simultaneously learning an optimal policy. For one specific prospect map (the entropic map), a Q-learning-like algorithm is proposed that obtains an optimal policy without knowledge of the MDP.

In order to avoid tedious mathematical details of general state-action spaces, we currently consider only MDPs with finite state-action spaces. However, the extensions to general spaces are straightforward.

This paper is organized as follows. In Sec. 2, we briefly introduce the setup of MDPs and prospect maps, which are adapted to the MDP structure in Sec. 3. Sec. 4 states the main theory of this paper, the discounted prospect and the average prospect, whose optimal control problems are solved by value iterations under different assumptions. In Sec. 5 we discuss existing risk-sensitive approaches and show how to represent them with specific prospect maps. Two online algorithms, which might be of interest to an engineering-oriented audience, are stated in Sec. 6, followed by experiments with simple MDPs in the final section.

## 2 Setup

### 2.1 Markov Decision Processes

A Markov decision process [26] is composed of a state space X, an action space A, a transition model Q and a reward model r. Both state and action spaces are assumed to be finite. The transition model Q(y|x,a) denotes the probability of arriving at state y given the current state x with chosen action a at time t. We assume the transitions are time-homogeneous. The reward function r(x,a) represents the reward obtained at state x if action a is chosen.

The policy πt at time t is defined as πt(a|x), the probability of choosing action a given state x. Let π = (π0, π1, …) be the sequential policy where at time 0 the policy π0 is used, at time 1 the policy π1, etc. Let Π be the set of all policies. A policy is called Markov if for all t, πt depends merely on xt and is independent of all states and actions before time t. Let ΠM denote the set of all Markov policies and Δ be the set of all one-step Markov policies. Thus a Markov policy is a sequence of one-step Markov policies. A one-step policy π∈Δ is called Markov deterministic, if for each x, π(a|x) = 1 for some a∈A. With slight abuse of notation, we also write such a π as a deterministic function f: X → A. Denote the set of all one-step Markov deterministic policies by ΔD. For any π∈Δ, we define

 rπ(x) := ∑a π(a|x) r(x,a),   Pπ(y|x) := ∑a π(a|x) Q(y|x,a)   (1)

There are usually three types of objective functions used in the MDP literature: finite-stage, discounted and average reward. We summarize them as follows,

 ST := ∑_{t=0}^{T} r(Xt,At),   Sα := ∑_{t=0}^{∞} α^t r(Xt,At),   and   S := lim_{T→∞} (1/T) ST   (2)

where α∈(0,1) denotes the discount factor. Suppose we start from a given state x. The optimization problem is to maximize the expected objective by selecting a policy π∈Π:

 max_{π∈Π} Eπ[S | X0 = x]   (3)

where S can be replaced by ST or Sα. Note that since the limit defining the average reward might not exist (see e.g. Example 8.1.1, [26]), the strict definition of the optimization problem for the average reward uses lim inf_{T→∞} (1/T) ST instead of the limit.
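As a concrete illustration of the three objectives in Eq. 2, the following sketch simulates a small hypothetical two-state, two-action MDP (all numbers are invented for illustration) and estimates the finite-stage, discounted and average returns of a fixed policy from one long trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP, for illustration only.
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],   # Q[x, a, y] = P(X_{t+1}=y | x, a)
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 2.0], [0.0, 3.0]])    # r[x, a]

def rollout(policy, x0, T):
    """Sample one trajectory and collect r(X_t, A_t) for t = 0, ..., T."""
    x, rewards = x0, []
    for _ in range(T + 1):
        a = policy[x]
        rewards.append(r[x, a])
        x = rng.choice(2, p=Q[x, a])
    return np.array(rewards)

policy = np.array([1, 1])                 # deterministic: always action 1
rew = rollout(policy, x0=0, T=9999)

S_T = rew[:11].sum()                      # finite-stage return, T = 10
alpha = 0.9                               # discount factor
S_alpha = (alpha ** np.arange(len(rew))) @ rew   # discounted return
S_bar = rew.mean()                        # single-trajectory average reward
print(S_T, S_alpha, S_bar)
```

Note that S_bar is only a single-trajectory estimate; the expectation in Eq. 3 would average such rollouts over many trajectories.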

### 2.2 Dynamic Prospect Maps

In the setup of MDPs, we use "rewards" instead of "costs" (which are common in the literature of Markov control processes [16]) to model immediate outcomes, and therefore in the optimization problems of MDPs (Eq. 3) objectives are maximized rather than minimized. To be consistent with maximizing objectives, we use the name "prospect maps" for the nonlinear structures analogous to risk measures in the finance literature. Similar nomenclature can be found in [20], where risk is replaced by valuation.

Let us consider a discrete-time stochastic process {Xt, t∈ℕ} taking values in a finite space X. The capital letters Xt denote random variables, whereas realizations of the random variables are denoted by lowercase letters xt. Let Ft denote the set of all real-valued bounded functions of (x0,…,xt), for t∈ℕ. We consider a map Rt such that Rt(v) is a real-valued bounded function of (x0,…,xt) for fixed v∈Ft+1; Rt can thus be viewed as a map from Ft+1 to Ft. In the following, (in-)equalities between two functions are understood elementwise, i.e., we say v ≤ w if v(x) ≤ w(x) for all x.

In the following, we first introduce conditional prospect maps and then construct a dynamic prospect map Rt,T from FT to Ft, t ≤ T, by a series of conditional prospect maps (Rt, …, RT−1).

###### Definition 2.1.

A map Rt: Ft+1 → Ft, t∈ℕ, is called a conditional prospect map, if

1. Monotonicity. For v, w ∈ Ft+1, if v ≤ w, then Rt(v) ≤ Rt(w).

2. Time-consistency. For any u ∈ Ft and v ∈ Ft+1, Rt(u + v) = u + Rt(v). Especially, for each c ∈ ℝ and v ∈ Ft+1, Rt(v + c) = Rt(v) + c.

3. Centralization. Rt(0) = 0.

Remarks. The monotonicity axiom reflects the intuition that if the rewards of one choice are higher than the rewards of another choice, the prospect of the first choice must be higher than that of the other. The time-consistency axiom is obviously a generalization of a property of the conditional expectation. This axiom allows the temporal decomposition (see Proposition 2.1) and, together with the monotonicity axiom, makes dynamic programming [3] a feasible solution to the optimization problems (see Sec. 4). The centralization axiom sets the reference point to 0, i.e., there is no prospect if there is no reward. Nevertheless, it is possible to use other reference points.

###### Definition 2.2.

A map Rt,T: FT → Ft, t ≤ T, is called a dynamic prospect map, if there exists a series of conditional prospect maps (Rt, …, RT−1) such that Rt,T(v) = Rt(Rt+1(… RT−1(v) …)) for all v ∈ FT.

###### Proposition 2.1.

Let t ≤ T, vs ∈ Fs for s = t, …, T, and v = vt + vt+1 + … + vT. Then we have

 Rt,T(v)=vt+Rt(vt+1+…+RT−1(vT)…)
###### Proof.

Trivial using Axiom II. ∎

Remarks. In the finance literature, there exist various ways to extend CRMs to a temporal structure (e.g., [9, 20, 1, 28] and references therein). The definition is usually selected based on the application to which the dynamic risk measures are applied. A comparison of their subtle differences is beyond the scope of this paper. Nevertheless, two points are remarkable: 1) in all of these definitions, the axiom of time-consistency is the most important component, since it allows the temporal decomposition shown in Prop. 2.1; and 2) the definitions require either coherence [28] or convexity [9, 20, 1], which means that the agent has to be economically rational, i.e., risk-aversive (for more discussion see Sec. 3.3). However, in some problems (especially in modeling real human behavior), a mixed risk-preference (risk-aversive at some states while risk-seeking at others) is also a possible strategy. For instance, in gambling, some people are risk-aversive when losing money but risk-seeking when winning money. Therefore, we require neither coherence nor convexity. In this sense, our axioms are even more general than the axioms used in the finance literature. Finally, in the literature of coherent risk measures, non-additive measures can be defined due to coherence. In this paper, however, we do not assume coherence in the axioms; instead, we build the theory on the function spaces Ft. Therefore, it is more accurate to use the term "map" than "measure".

## 3 Applying Prospect Maps in MDPs

The dynamic prospect maps introduced in Sec. 2.2 can be adapted to arbitrary temporal structures. To adapt them to the structure of MDPs, we assume the prospect maps work on the state sequence (X0, X1, …). On the other hand, since the distribution of (X0, X1, …) is controlled by the policy (together with the transition model Q of the MDP defined in Sec. 2.1), we assume further that the prospect maps depend on π. Thus, the conditional prospect maps working on the MDP given one policy π are written as Rπt.

### 3.1 Markov Prospect Maps for MDPs

The conditional prospect maps defined in Def. 2.1 might depend on the whole history, which could cause computational problems in real applications. Therefore, the prospect maps are additionally assumed to possess the Markov property. Let FB denote the space of all bounded functions mapping X to ℝ.

###### Definition 3.1 (Markov Prospect Map for MDPs).

Let (Rπt) be a series of conditional prospect maps defined on the MDP given the policy π. (Rπt) is called Markov, if there exists a series of maps ϱt such that

 Rπt(v(Xt+1)|xt,xt−1,…,x0)=ϱt(v|xt),∀t∈N,v∈FB

Remark. It is noticeable that the prospect map ϱt depends also on the policy π, although this is suppressed in the notation.

From now on, we consider merely the Markov prospect maps; thus we can write Rπt(v(Xt+1)|xt,…,x0) as Rπt(v(Xt+1)|xt). Furthermore, we consider merely the Markov policies π∈ΠM. For a Markov random policy π, Rπt depends only on πt; hence, we can write Rπt as Rπtt. For each (x,a)-pair, there exists a corresponding deterministic policy ft satisfying ft(x) = a. Therefore, we can define for each (x,a),

 Rt(v(Xt+1)|xt,at):=Rft(v(Xt+1)|xt) (4)
###### Assumption 3.1.

We assume that the Markov prospect map is linear in the policy, i.e.,

 Rπtt(v(Xt+1)|xt)=∑a∈Aπt(a|xt)Rt(v(Xt+1)|xt,a),∀t∈N.

To simplify the problem, we consider merely the time-homogeneous Markov prospect maps, i.e., Rt = R for all t. Hence, Rπtt(v(Xt+1)|xt) can be abbreviated by Rπ(v|x), and Rt(v(Xt+1)|xt,at) by R(v|x,a). A similar abbreviation is used for Rf, which is a special case of Rπ. By Assumption 3.1, analogous to the rπ and Pπ in Eq. 1, we obtain

 Rπ(v|x)=∑a∈Aπ(a|x)R(v|x,a)

Then Rπ(v), which is defined by (Rπ(v))(x) := Rπ(v|x), is a function in the space FB. Rπ can be viewed as a map from FB to itself. Since we assume the state space is finite, any v∈FB can be viewed as an N-dimensional vector, where N denotes the number of states. Thus Rπ can be understood as a map from ℝN to itself.

Remark. For a time-homogeneous Markov map R, Assumption 3.1 enables Rπ to play a role similar to that of the transition model Pπ in MDPs. Another consequence of Assumption 3.1 is that for all v∈ℝN and α, there exists a deterministic policy f such that for any x,

 rf(x) + αRf(v|x) = r(x,f(x)) + αR(v|x,f(x)) = max_{π∈Δ} {rπ(x) + αRπ(v|x)}.

### 3.2 Nonexpansiveness

For any time-homogeneous Markov map Rπ, by its definition, Rπ satisfies the axioms of monotonicity and time-consistency for each π. Thus F := Rπ is a topical map (see [12]), which satisfies: i) F(v) ≤ F(w) whenever v ≤ w; and ii) F(v + c) = F(v) + c, for all v∈ℝN and c∈ℝ. For v∈ℝN, we define the Hilbert semi-norm (here we follow the terminology in [12, 24]; the same semi-norm is called the span semi-norm in [26, 15]) and the sup-norm as follows,

 ∥v∥H:=supx,y∈X(v(x)−v(y)),∥v∥∞:=supx∈X|v(x)|.

Since we consider only finite state spaces, v is simply an N-dimensional vector.

Suppose F is a topical map. Then it can be shown that F is nonexpansive under both the Hilbert semi-norm and the sup-norm (see Eqs. 17 and 18 in [12]), i.e., for all v, w ∈ ℝN,

 ∥F(v)−F(w)∥H≤∥v−w∥H,∥F(v)−F(w)∥∞≤∥v−w∥∞
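Both nonexpansiveness properties are easy to check numerically. The sketch below does so for a hypothetical entropic-type topical map built from an invented 4-state transition kernel; the kernel, dimension and risk parameter are all assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 4-state transition kernel P[x, y] for a fixed policy.
P = rng.dirichlet(np.ones(4), size=4)
lam = -2.0   # assumed risk parameter of an entropic-type map

def F(v):
    """Entropic map: monotone and additively homogeneous, hence topical."""
    return np.log(P @ np.exp(lam * v)) / lam

def span(v):
    return v.max() - v.min()   # Hilbert (span) semi-norm

# Check both nonexpansiveness inequalities on random pairs.
for _ in range(100):
    v, w = rng.normal(size=4), rng.normal(size=4)
    assert span(F(v) - F(w)) <= span(v - w) + 1e-9
    assert np.abs(F(v) - F(w)).max() <= np.abs(v - w).max() + 1e-9
print("nonexpansive under both norms on all sampled pairs")
```

The checks only confirm the inequalities on sampled pairs, of course; the general statement follows from the topical-map result cited above.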

### 3.3 Categorization

Suppose Rπ is a time-homogeneous Markov prospect map for some one-step Markov policy π∈Δ. Assume furthermore that Rπ is concave with respect to v at x, i.e., for any v, w ∈ ℝN and any β ∈ [0,1], we have

 Rπ(βv+(1−β)w|x)≥βRπ(v|x)+(1−β)Rπ(w|x)

Note that the objective is to maximize the prospect (which will be defined in Sec. 4). Suppose we have two policies at the successive time step which generate two outcomes v and w, respectively. The concavity of Rπ implies that the outcome of a mixture of the two policies, Rπ(βv + (1−β)w | x), is always preferred (due to maximization) to the mixture of the outcomes of the two single policies, βRπ(v|x) + (1−β)Rπ(w|x). In other words, given the policy we choose at the current time step, we prefer a mixture of two policies at the successive time step. This shows that the risk-preference corresponding to such a prospect map is risk-aversive. A similar result can be inferred for convex prospect maps. This categorization coincides with the categorization of risk-preferences by the concavity of utility functions in expected utility theory [13]. In order to obtain a time-homogeneous risk-preference (risk-aversive or risk-seeking), everywhere risk-preference is required. We define these as follows.

###### Definition 3.2.

A time-homogeneous Markov prospect map is said to be

1. risk-aversive at x, if Rπ is concave w.r.t. v at x, and everywhere risk-aversive, if Rπ is concave w.r.t. v at all x∈X and for all π∈Δ.

2. risk-seeking at x, if Rπ is convex w.r.t. v at x, and everywhere risk-seeking, if Rπ is convex w.r.t. v at all x∈X and for all π∈Δ.

Remarks. The categorization depends on the objective. In the CRM theory, the objective is to minimize the risk; therefore, the categorization is opposite: concavity means risk-seeking and convexity suggests risk-aversive. Apparently, under Assumption 3.1, if R is convex (concave) w.r.t. v at all (x,a)-pairs, then Rπ is everywhere convex (everywhere concave). Several existing risk maps in the literature (see Sec. 5) also confirm the categorization defined above.

One widely used family of prospect maps, the coherent prospect maps, is worth mentioning.

###### Definition 3.3.

A time-homogeneous Markov prospect map Rπ is said to be coherent if Rπ(βv|x) = βRπ(v|x) for all β ≥ 0, all v∈ℝN, and all x∈X.

## 4 Discounted and Average Prospect

### 4.1 Finite-stage Prospect

According to the definition of dynamic prospect maps (Def. 2.2), we define the T-stage total prospect as follows,

 JT(x,π) := Rπ0,T( ∑_{t=0}^{T} r(Xt,At) | X0 = x )   (5)

Suppose the prospect map under consideration is time-homogeneous and Markov. By Prop. 2.1, we have the following decomposition

 JT(x,π) = rπ0(x) + Rπ0,X0=x[ rπ1(X1) + Rπ1,X1[ rπ2(X2) + … + RπT−1,XT−1[ rπT(XT) ] … ] ]

where the short notation Rπt,Xt[·] := Rπt(·|Xt) is used. The optimization problem for this objective function is to maximize the T-stage total prospect among all Markov random policies, i.e.,

 J∗T(x)=maxπ∈ΠMJT(x,π)

Suppose Assumption 3.1 holds true. Obviously, the optimization problem can be solved by dynamic programming, i.e., we start from

 VT(x)=maxπ∈Δrπ(x)=maxa∈Ar(x,a)

Then we calculate backwards, for t = T−1, T−2, …, 0,

 Vt(x)=maxπ∈Δ{rπ(x)+Rπ(Vt+1|x)}=maxa∈A{r(x,a)+R(Vt+1|x,a)}

It is easy to verify that V0 = J∗T.
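The backward recursion above can be sketched in a few lines. The MDP below is a hypothetical two-state, two-action example, and the entropic map is used as one concrete (assumed) instance of a prospect map R:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, for illustration only.
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],   # Q[x, a, y]
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 2.0], [0.0, 3.0]])    # r[x, a]
lam = -1.0                                 # assumed entropic risk parameter

def R(v):
    """R(v | x, a) for the entropic prospect map; returns an |X| x |A| array."""
    return np.log(Q @ np.exp(lam * v)) / lam

def finite_stage(T):
    V = r.max(axis=1)                     # V_T(x) = max_a r(x, a)
    for _ in range(T):                    # backwards for t = T-1, ..., 0
        V = (r + R(V)).max(axis=1)
    return V                              # V_0 = J*_T

print(finite_stage(10))
```

Any other map satisfying the axioms of Def. 2.1 could be substituted for `R` without changing the recursion.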

### 4.2 Discounted Total Prospect

Let α∈(0,1) denote the discount factor. Suppose Assumption 3.1 holds true. We define the discounted T-stage prospect as follows,

 Jα,T(x,π) := rπ0(x) + αRπ0,X0=x[ rπ1(X1) + αRπ1,X1[ rπ2(X2) + … + αRπT−1,XT−1[ rπT(XT) ] … ] ]   (6)

and the discounted total prospect as

 Jα(x,π):=limT→∞Jα,T(x,π) (7)

Thus, the optimization problem for discounted total prospect is

 J∗α(x):=supπ∈ΠMJα(x,π)

We first prove that the limit in Eq. 7 exists. Given π∈Δ, we define the map Fπα as Fπα(v) := rπ + αRπ(v), v∈ℝN. For any π = (π0, π1, …)∈ΠM, define

 Fπα,T(v):=Fπ0α(Fπ1α(…FπT−1α(v)…)).
###### Proposition 4.1.

For any π∈ΠM: i) the limit in Eq. 7 exists; ii) lim_{T→∞} Fπα,T(v) = Jα(·,π), for any v∈ℝN.

###### Proof.

(i) Since r is bounded on finite state and action spaces, there exists a number M such that |rπ(x)| ≤ M for all π and x. Hence, by the monotonicity and additive property of the Rπt,

 −α^{T+1}M ≤ Jα,T+1(x,π) − Jα,T(x,π) ≤ α^{T+1}M

which implies that (Jα,T(x,π))T is a Cauchy sequence and hence converges as T → ∞.

(ii) Since r is bounded, v − rπ is also bounded for all π. Let M′ be an upper bound such that rπ − M′ ≤ v ≤ rπ + M′. Hence,

 rπ − M′ ≤ v ≤ rπ + M′ ⇒ −M′α^T ≤ Fπα,T(v) − Jα,T(x,π) ≤ M′α^T

Using the conclusion of (i), we have lim_{T→∞} Fπα,T(v) = Jα(·,π) for any v∈ℝN. ∎

#### Discussion

The trivial extension of the classical discounted MDP (cf. Sα in Eq. 2) is as follows,

 Dα(x,π) := Rπ0,∞( ∑_{t=0}^{∞} α^t r(Xt,At) | X0 = x )

Using the time-consistency property of prospect maps, we have the following decomposition

 Dα(x,π) = rπ0(x) + Rπ0,X0=x[ αrπ1(X1) + Rπ1,X1[ α²rπ2(X2) + … + RπT−1,XT−1[ α^T rπT(XT) + … ] … ] ]

We have the following observations:

• We can prove, analogously to Prop. 4.1(i), that Dα is well-defined.

• If the prospect map is coherent, then Dα is equivalent to Jα (cf. Eq. 7), the discounted total prospect under our definition. Therefore, Dα defined for any coherent prospect map is merely a special case of our definition. In particular, the discounted total reward in classical MDPs is a special case of the discounted total prospect, since the expectation is coherent.

• For general prospect maps, there might not exist a stationary policy maximizing Dα, as proved by Chung & Sobel (1987) [7] for entropic prospect maps, which are not coherent. Statements analogous to Theorem 4 in [7] can be proved for arbitrary non-coherent prospect maps.

• Ruszczyński (2010) [28] uses Dα as the objective function and solves it by a value iteration algorithm. However, the proof of that value iteration algorithm uses the representation theorem, which is valid merely for coherent prospect maps. In contrast, we will see later that the objective Jα allows a value iteration algorithm for arbitrary prospect maps.

#### Contracting Map

Given a function u∈ℝN and α∈(0,1), consider the following map (under Assumption 3.1),

 Fα(u|x) := max_{π∈Δ} Fπα(u|x) = max_{a∈A} [ r(x,a) + αR(u|x,a) ]

Now we prove the key property: Fα is a contracting map.

###### Lemma 4.1.

Suppose Assumption 3.1 holds true. Then Fα is a contracting map under the sup-norm, i.e., ∥Fα(u) − Fα(v)∥∞ ≤ α∥u − v∥∞ for all u, v ∈ ℝN.

###### Proof.

Under Assumption 3.1, there exist deterministic policies f and g satisfying

 Fα(u|x) = r(x,f(x)) + αR(u|x,f(x)),   Fα(v|x) = r(x,g(x)) + αR(v|x,g(x))

By definition, we have for all x,

 Fα(u|x) − Fα(v|x) ≤ r(x,f(x)) + αR(u|x,f(x)) − r(x,f(x)) − αR(v|x,f(x))
  = α[ R(u|x,f(x)) − R(v|x,f(x)) ] ≤ α∥u − v∥∞

where the last inequality is due to the nonexpansiveness of R. Exchanging u and v, we have

 Fα(v|x) − Fα(u|x) ≤ α∥v − u∥∞

Thus, |Fα(u|x) − Fα(v|x)| ≤ α∥u − v∥∞ for all x, which implies the claim. ∎

#### Value iteration

We state the following algorithm:

1. select one v0∈ℝN and a tolerance ε > 0; set n := 0;

2. calculate vn+1 := Fα(vn);

3. if ∥vn+1 − vn∥∞ < ε, stop; otherwise, set n := n+1 and go to step 2.

Since Fα is a contracting map, by the Banach contraction mapping principle we conclude that for all v0∈ℝN, vn → v∗ as n → ∞, where v∗ is the fixed point of Fα such that Fα(v∗) = v∗, and f∗ denotes the corresponding greedy policy. The final step is to prove v∗ = J∗α with the following theorem.
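The three steps above can be sketched directly. The MDP and the entropic map below are invented for illustration, and the greedy policy is read off at the fixed point:

```python
import numpy as np

# Hypothetical MDP and entropic prospect map, for illustration only.
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # Q[x, a, y]
r = np.array([[1.0, 2.0], [0.0, 3.0]])     # r[x, a]
alpha, lam = 0.9, -1.0                     # discount and risk parameters

def R(v):                                   # entropic instance of R(v | x, a)
    return np.log(Q @ np.exp(lam * v)) / lam

def F_alpha(v):                             # F_alpha(v|x) = max_a [r + alpha R]
    return (r + alpha * R(v)).max(axis=1)

v, eps = np.zeros(2), 1e-10                 # step 1: v_0 and tolerance
while True:
    v_new = F_alpha(v)                      # step 2
    if np.abs(v_new - v).max() < eps:       # step 3: sup-norm stopping rule
        break
    v = v_new

f_star = (r + alpha * R(v)).argmax(axis=1)  # greedy policy at the fixed point
print(v, f_star)
```

The loop terminates because F_alpha contracts the sup-norm by the factor alpha, as shown in Lemma 4.1.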

###### Theorem 4.1.

Suppose Assumption 3.1 holds true. For any v∈ℝN: i) if v ≥ Fα(v), then v ≥ J∗α; ii) if v ≤ Fα(v), then v ≤ J∗α; iii) if v = Fα(v), then v = J∗α.

###### Proof.

(i) Consider a Markov policy π = (π0, π1, …)∈ΠM. v ≥ Fα(v) implies that for any πt∈Δ,

 v≥Fα(v)≥rπ+αRπ(v)

We apply above inequality recursively,

 v≥rπ0+αRπ0(v)≥rπ0+αRπ0(rπ1+αRπ1(v))≥…≥Jα(π)

Since π is arbitrary, the above inequality implies v ≥ J∗α.

(ii) Under Assumption 3.1, there exists an f∈ΔD such that Fα(v) = Ffα(v). Write f∞ := (f, f, …). Since v ≤ Fα(v), we have

 v≤Ffα(v)=rf+αRf(v)≤rf+αRf(rf+αRf(v))≤…≤Jα(f∞)≤J∗α

where we apply the monotonicity of Rf recursively. Due to Prop. 4.1(ii), Jα(f∞) exists. (i) and (ii) together imply (iii). ∎

### 4.3 Average Prospect

Analogous to the average reward defined in Eq. 2, we consider the following average prospect,

 J(x,π) := lim inf_{T→∞} (1/T) JT(x,π),   π∈ΠM, x∈X

where JT is defined in Eq. 5. Here "lim inf" is used to avoid the case where the limit of (1/T)JT does not exist (see e.g., Example 8.1.1, [26]). The optimization problem for the average prospect is therefore,

 J∗(x)=supπ∈ΠMJ(x,π)

Suppose there is a pair (ρ, h)∈ℝ×ℝN which satisfies the following equation

 ρ+h(x)=maxπ∈Δ[rπ(x)+Rπ(h|x)] (8)

This equation is called the average prospect optimality equation (APOE). Under Assumption 3.1, there exists a deterministic policy f such that

 ρ+h(x)=maxa∈A[r(x,a)+R(h|x,a)]=r(x,f(x))+R(h|x,f(x))

Define the operator Fπ as

 Fπ(v):=rπ+Rπ(v),v∈RN

Let π = (π0, π1, …)∈ΠM be an arbitrary random Markov policy. Define

 FπT(v):=Fπ0(Fπ1(…FπT−1(v)…)) (9)
###### Lemma 4.2.

Suppose Assumption 3.1 holds true and the APOE has a solution (ρ, h). Let f be the deterministic policy attaining the maximum in the APOE. Then J(x, f∞) = ρ = J∗(x) for all x∈X.

###### Proof.

We first prove J(x, f∞) = ρ. Define an operator F as follows,

 F(v):=rf+Rf(v),v∈RN

and FT(v) := F(FT−1(v)), so that JT(f∞) = FT(rf). Hence, due to the nonexpansiveness of F, we have

 ∥JT(f∞) − FT(h)∥∞ ≤ ∥rf − h∥∞ ⇒ lim_{T→∞} ( (1/T)JT(f∞) − (1/T)FT(h) ) = 0   (10)

On the other hand, by APOE, we have

 FT(h)=FT−1(rf+Rf(h))=FT−1(h)+ρ=…=h+T⋅ρ

Hence, lim_{T→∞} (1/T)FT(h) = ρ. Together with Eq. 10, we obtain J(x, f∞) = ρ.

Now we prove that J(x,π) ≤ ρ for any π∈ΠM and all x∈X. By the APOE, we have for all π∈Δ,

 ρ+h≥rπ+Rπ(h) (11)

Let π∈ΠM be any Markov random policy. Then FπT defined in Eq. 9 satisfies,

 ∥JT(π) − FπT(v)∥∞ ≤ ∥rf − v∥∞ ⟹ lim_{T→∞} ( (1/T)JT(π) − (1/T)FπT(v) ) = 0   (12)

By Eq. 11, we have

 FπT(h) = FπT−1(rπT−1 + RπT−1(h)) ≤ FπT−1(ρ + h) = FπT−1(h) + ρ ≤ … ≤ h + T·ρ

which implies

 lim inf_{T→∞} (1/T)FπT(h) ≤ ρ, which by Eq. 12 implies lim inf_{T→∞} (1/T)JT(π) ≤ ρ. ∎
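When the APOE admits a solution, a relative value iteration recovers (ρ, h) numerically. The sketch below again uses an invented two-state MDP with the entropic map as an assumed instance of the prospect map; the reference state x = 0 is an arbitrary normalization choice:

```python
import numpy as np

# Hypothetical MDP and entropic prospect map, for illustration only.
Q = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # Q[x, a, y]
r = np.array([[1.0, 2.0], [0.0, 3.0]])     # r[x, a]
lam = -1.0

def F(v):                                   # F(v|x) = max_a [r(x,a) + R(v|x,a)]
    return (r + np.log(Q @ np.exp(lam * v)) / lam).max(axis=1)

h = np.zeros(2)
for _ in range(2000):                       # relative value iteration
    Fh = F(h)
    rho, h = Fh[0], Fh - Fh[0]              # normalize at reference state 0

# (rho, h) approximately satisfies rho + h(x) = max_a [r(x,a) + R(h|x,a)]
print(rho, h)
```

Convergence of this normalized iteration is exactly what the Hilbert semi-norm contraction condition introduced below (Assumption 4.1) is meant to guarantee.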

Now the question is to find proper assumptions that can guarantee the existence of solutions to the APOE. Assumption 3.1 is not sufficient to carry this burden. Recall that ∥·∥H denotes the Hilbert semi-norm defined in Sec. 3.2. We further assume:

###### Assumption 4.1.

There exists an integer K and a real number β∈(0,1) such that for all deterministic policies π = (f0, …, fK−1)∈ΔKD,

 ∥Rπ(u) − Rπ(v)∥H ≤ β∥u − v∥H,   ∀u, v ∈ ℝN

where Rπ := Rf0 ∘ Rf1 ∘ … ∘ RfK−1.

Define the operator, ,

 F(v|x):=maxa∈A{r(x,a)+R(v|x,a)},Ft(v):=F(Ft−1(v)),t=1,2,… (13)
###### Proposition 4.2.

If Assumptions 3.1 and 4.1 hold true, then ∥FK(u) − FK(v)∥H ≤ β∥u − v∥H for all u, v ∈ ℝN.

###### Proof.

Let FπK be as defined in Eq. 9. There must be two deterministic policies πu = (f0, …, fK−1) and πv in ΔKD satisfying FK(u) = FπuK(u) and FK(v) = FπvK(v), respectively. Then

 FK(u) − FK(v) ≤ FπuK(u) − FπuK(v)
  = Rf0(rf1 + Rf1(… + RfK−1(u) …)) − Rf0(rf1 + Rf1(… + RfK−1(v) …))
  (by Prop. 2.1) = Rf0(Rf1(… RfK−1(∑_{t=1}^{K−1} rft + u) …)) − Rf0(Rf1(… RfK−1(∑_{t=1}^{K−1} rft + v) …))

Exchanging u and v, we have FK(v) − FK(u) ≤ FπvK(v) − FπvK(u). Thus,

 ∥FK(u) − FK(v)∥H ≤ max_{π∈ΔKD} ∥FπK(u) − FπK(v)∥H
  = max_{π∈ΔKD} ∥Rπ(∑_{t=1}^{K−1} rπt + u) − Rπ(∑_{t=1}^{K−1} rπt + v)∥H ≤ β∥u − v∥H