# Maximum Expected Hitting Cost of a Markov Decision Process and Informativeness of Rewards

We propose a new complexity measure for Markov decision processes (MDP), the maximum expected hitting cost (MEHC). This measure tightens the closely related notion of diameter [JOA10] by accounting for the reward structure. We show that this parameter replaces diameter in the upper bound on the optimal value span of an extended MDP, thus refining the associated upper bounds on the regret of several UCRL2-like algorithms. Furthermore, we show that potential-based reward shaping [NHR99] can induce equivalent reward functions with varying informativeness, as measured by MEHC. We further establish that shaping can reduce or increase MEHC by at most a factor of two in a large class of MDPs with finite MEHC and unsaturated optimal average rewards.


## 1 Introduction

In the average reward setting of reinforcement learning (RL) puterman1994markov; sutton1998reinforcement, an algorithm learns to maximize its average rewards by interacting with an unknown Markov decision process (MDP). Similar to analysis in multi-armed bandits and other online machine learning problems, (cumulative) regret provides a natural model to evaluate the efficiency of a learning algorithm. With UCRL2, jaksch2010near show a problem-dependent bound of $\tilde{O}(DS\sqrt{AT})$ on regret and an associated logarithmic bound on the expected regret, where $D$ is the diameter of the actual MDP (Definition 1), $S$ the size of the state space, and $A$ the size of the action space. Many subsequent algorithms fruit2019exploitation enjoy similar diameter-dependent bounds. This establishes diameter as an important measure of complexity for an MDP. However, strikingly, this measure is independent of rewards and is a function of only the transitions. This is peculiar, as two MDPs differing only in their rewards would have the same regret bounds even if one of them gives the maximum reward for all transitions. We review the related key observation in jaksch2010near and refine it with a new lemma (Lemma 1), establishing a reward-sensitive complexity measure that we refer to as the maximum expected hitting cost (MEHC, Definition 2), which tightens the regret bounds of UCRL2 and similar algorithms by replacing diameter (Theorem 1).

Next, with respect to this new complexity measure, we investigate a notion of reward informativeness (Section 2.4). Intuitively speaking, in an environment, the same desired policies can be motivated by different (immediate) rewards. These differing definitions of rewards can be more or less informative of useful actions, i.e. actions yielding high long-term rewards. To formalize this intuition, we study a way to reparametrize rewards via potential-based reward shaping (PBRS) ng1999policy that can produce different rewards with the same near-optimal policies (Section 2.5). We show that the MEHC changes under reparametrization by PBRS, and thus so do the regret and sample complexity bounds, substantiating this notion of informativeness. Lastly, we study the extent of its impact. In particular, we show that there is a factor-of-two limit on its impact on MEHC in a large class of MDPs (Theorem 2). This result and the concept of reward informativeness may be useful for a task designer crafting a reward function (Section 3).

The main contributions of this work are two-fold.

• We propose a new MDP structural parameter, maximum expected hitting cost, that accounts for both the transitions and rewards. This parameter replaces diameter in the regret bounds of several model-based RL algorithms.

• We show that potential-based reward shaping can change the maximum expected hitting cost of an MDP and thus the regret bound. This results in a set of equivalent MDPs with different learning difficulties as measured by regret. Moreover, we show that their MEHCs differ by a factor of at most two in a large class of MDPs.

### 1.1 Related work

This work is most closely related to jaksch2010near, which establishes diameter as a complexity measure for MDPs that is prevalent in regret bounds of RL algorithms in the average reward setting fruit2019exploitation. As noted by jaksch2010near, unlike some previous measures of MDP complexity such as the return mixing times kearns2002near; brafman2002r, diameter depends only on the transitions and not rewards. The core reason for its presence in the regret analysis is that it upper bounds the optimal value span of the extended MDP which summarizes the observations (Section 2.3 and (8)). We review and update this observation with a reward-dependent parameter we call maximum expected hitting cost (Lemma 1). Interestingly, the gap between diameter and MEHC can be arbitrarily large; there are MDPs with finite MEHC and infinite diameter. These MDPs are non-communicating but have saturated optimal average rewards $\rho^*(M) = r_{\max}$. Intuitively, in these MDPs, at some state $s$, the learner cannot visit some other state $s'$ but can nonetheless achieve the maximum possible average reward, thus allowing for good regret guarantees; the unreachable states will not seem better than the reachable ones under the principle of optimism in the face of uncertainty (OFU). We will also use UCRL2 jaksch2010near as an example algorithm in some discussion, though the main results do not depend on it. In particular, with MEHC, its regret bounds are updated (Theorem 1).

Another important comparison is with optimal bias span puterman1994markov; bartlett2009regal; fruit2018efficient, a reward-dependent parameter of MDPs. Here, we again find that the optimal bias span is upper bounded by MEHC and that the gap can be arbitrarily large. (Footnote 1: This inequality can be derived as a consequence of Lemma 1: as the number of observations grows, the extended MDP $M^+$ has very tight confidence intervals around the actual transitions and mean rewards of $M$. Observe that the span of $u_i$ is equal to the optimal bias span at the limit of $i \to \infty$ (jaksch2010near, remark 8).) These non-communicating MDPs would have unsaturated optimal average reward $\rho^*(M) < r_{\max}$. But as shown in fruit2018near; fruit2018efficient, the extra knowledge of some upper bound on the optimal bias span is necessary for an algorithm to enjoy a regret that scales with this smaller parameter. In contrast, UCRL2, which scales with MEHC, does not need to know the diameter or MEHC of the actual MDP.

Potential-based reward shaping ng1999policy was originally proposed as a solution technique for a programmer to influence the sample complexity of their reinforcement learning algorithm without changing the near-optimal policies in episodic and discounted settings. Prior theoretical analysis involving PBRS ng1999policy; wiewiora2003potential; wiewiora2003principled; asmuth2008potential; grzes2017reward mostly focuses on the consistency of RL against the shaped rewards, i.e. the resulting learned behavior is also (near-)optimal in the original MDP, while suggesting empirically that the sample complexity can be changed by a well-specified potential. In this work, we use PBRS to construct the so-called $\Pi$-equivalent reward functions in the average reward setting (Section 2.4) and show that two reward functions related by a shaping potential can have different MEHC, and thus different regrets and sample complexities (Section 2.5). However, a subtle but important technical requirement of $[0, r_{\max}]$-boundedness of MDPs makes it difficult to immediately generalize our results (Section 2.5 and Theorem 2) to the treatment of PBRS as a solution technique, because an arbitrary potential function picked without knowledge of the original MDP may not preserve the $[0, r_{\max}]$-boundedness. Nevertheless, we think our work may bring some new perspectives to this topic.

## 2 Results

### 2.1 Markov decision process

A Markov decision process is defined by the tuple $M = (\mathcal{S}, \mathcal{A}, p, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p$ is the transition function $p : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$, and $r$ is the reward function with mean $\bar{r}(s, a)$. Let $S \coloneqq |\mathcal{S}|$ and $A \coloneqq |\mathcal{A}|$; we will restrict our attention to settings in which the state and action spaces are finite. At each time step $t$, an algorithm $L$ chooses an action $a_t$ based on the observations up to that point. The state transitions to $s_{t+1}$ according to the probability distribution $p(\cdot|s_t, a_t)$ and a reward $r_t$ is drawn according to the distribution $r(s_t, a_t)$. (Footnote 2: It is important to assume that the support of rewards lies in a known bounded interval $[0, r_{\max}]$, often $[0, 1]$ by convention. This is sometimes referred to as a bounded MDP in the literature.) Analogous to bandits, the details of the reward distribution often play no important role and it suffices to specify an MDP with the mean rewards $\bar{r}$.

The transition probabilities and reward function of the MDP are unknown to the learner. The sequence of random variables $(s_t, a_t, r_t)_{t \ge 0}$ forms a stochastic process. Note that a stationary deterministic policy $\pi : \mathcal{S} \to \mathcal{A}$ is a restrictive type of algorithm whose action $a_t$ depends only on $s_t$. We will refer to stationary deterministic policies as policies in the rest of the paper.

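Recall that in a Markov chain, the hitting time of state $s'$ starting at state $s$ is a random variable $h_{s \to s'} \coloneqq \inf\{t \ge 1 : s_t = s'\}$ with $s_0 = s$ levin2008markov. (Footnote 3: 1-indexing ensures that $h_{s \to s} \ge 1$. Note also that by convention $\inf \emptyset = \infty$.)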

###### Definition 1 (Diameter, jaksch2010near).

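Suppose in the stochastic process induced by following a policy $\pi$ in MDP $M$, the time to hit state $s'$ starting at state $s$ is $h_{s \to s'}(M, \pi)$. We define the diameter of $M$ to be

$$D(M) \coloneqq \max_{s, s' \in \mathcal{S}} \min_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[h_{s \to s'}(M, \pi)\right].$$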

We incorporate rewards into diameter, and introduce a novel MDP parameter.

###### Definition 2 (Maximum expected hitting cost).

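We define the maximum expected hitting cost of $M$ to be

$$\kappa(M) \coloneqq \max_{s, s' \in \mathcal{S}} \min_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_{\max} - r_t)\right].$$

Observe that MEHC is a smaller parameter, that is, $\kappa(M) \le r_{\max} D(M)$, since for any time step $t$, we have $r_{\max} - r_t \le r_{\max}$.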
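To make Definitions 1 and 2 concrete, the following is a minimal sketch (not the authors' code) that computes both quantities on a small hypothetical two-state chain by value iteration on the minimal expected cost to reach each target state. The helper names `max_expected_hitting`, `two_state_chain` and `hitting_costs` are illustrative, and the sketch assumes that hitting times are counted from $t \ge 1$ and that all optimal hitting costs are finite.

```python
def max_expected_hitting(P, cost, tol=1e-12, max_iter=100_000):
    """Return max over (s, s') of the minimal expected cost to hit s' from s.

    P[s][a][x] is the probability of moving from s to x under action a,
    and cost[s][a] >= 0 is the expected one-step cost of (s, a).
    Assumption: hitting times are counted from t >= 1, so s -> s takes
    at least one step, and every optimal hitting cost is finite.
    """
    n = len(P)

    def backup(s, target, J):
        # One Bellman backup; mass landing on `target` incurs no further cost.
        return min(
            cost[s][a] + sum(P[s][a][x] * J[x] for x in range(n) if x != target)
            for a in range(len(P[s]))
        )

    worst = 0.0
    for target in range(n):
        J = [0.0] * n  # cost-to-go towards `target`, iterated up from zero
        for _ in range(max_iter):
            new_J = [0.0 if s == target else backup(s, target, J) for s in range(n)]
            delta = max(abs(x - y) for x, y in zip(new_J, J))
            J = new_J
            if delta < tol:
                break
        worst = max(worst, max(backup(s, target, J) for s in range(n)))
    return worst


def two_state_chain(q):
    """Action 0 stays put; action 1 moves to the other state with probability q."""
    return [
        [[1.0, 0.0], [1.0 - q, q]],
        [[0.0, 1.0], [q, 1.0 - q]],
    ]


def hitting_costs(mean_reward, r_max=1.0):
    """Per-step costs r_max - mean reward, as in the MEHC definition."""
    return [[r_max - r for r in row] for row in mean_reward]


P = two_state_chain(0.25)
diameter = max_expected_hitting(P, [[1.0, 1.0], [1.0, 1.0]])  # unit cost per step
kappa_half = max_expected_hitting(P, hitting_costs([[0.5, 0.5], [0.5, 0.5]]))
kappa_saturated = max_expected_hitting(P, hitting_costs([[1.0, 1.0], [1.0, 1.0]]))
print(diameter, kappa_half, kappa_saturated)
```

On this chain the diameter is about 4 regardless of rewards, while the MEHC is about 2 when every mean reward is 0.5 and exactly 0 when every reward is maximal, which illustrates how the two parameters can diverge.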

### 2.2 Average reward criterion, and regret

The accumulated reward of an algorithm $L$ after $T$ time steps in MDP $M$ starting at state $s$ is a random variable

$$R(M, L, s, T) \coloneqq \sum_{t=0}^{T-1} r_t.$$

We define the average reward or gain puterman1994markov as

$$\rho(M, L, s) \coloneqq \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\left[R(M, L, s, T)\right]. \tag{1}$$

We will evaluate policies by their average reward. This can be maximized by a policy and we define the optimal average reward of $M$ starting at state $s$ as

$$\rho^*(M, s) \coloneqq \max_{\pi : \mathcal{S} \to \mathcal{A}} \rho(M, \pi, s). \tag{2}$$

Furthermore, we will demand that the optimal average reward starting at any state be the same, i.e. $\rho^*(M, s) = \rho^*(M, s')$ for any states $s, s' \in \mathcal{S}$. This is a natural requirement on the MDP in the online setting to allow for any hope for a vanishing regret. Otherwise the learner may, due to ignorance, take actions leading to states with a lower optimal average reward and incur linear regret when compared with the optimal policy starting at the initial state. In particular, this condition is true for communicating MDPs puterman1994markov by virtue of their transitions, but this is also possible for non-communicating MDPs with appropriate rewards. We will write $\rho^*(M)$.

We will compete with the expected cumulative reward of an optimal policy on its trajectory, and define the regret of a learning algorithm $L$ starting at state $s$ after $T$ time steps as

$$\Delta(M, L, s, T) \coloneqq T \rho^*(M) - R(M, L, s, T). \tag{3}$$

### 2.3 Optimism in the face of uncertainty, extended MDP, and UCRL2

The principle of optimism in the face of uncertainty (OFU) sutton1998reinforcement states that for uncertain state-actions (those we have not visited enough up to this point) we should be optimistic about their outcome. The intuition is that by taking reward-maximizing actions with respect to this optimistic model (in terms of both transitions and immediate rewards for these uncertain state-actions), we either incur no regret if the optimism is well placed, or quickly learn more about these suboptimal state-actions and avoid them in the future. This fruitful idea has been the basis for many model-based RL algorithms fruit2019exploitation and in particular UCRL2 jaksch2010near, which keeps track of the statistical uncertainty via upper confidence bounds.

Suppose we have visited a particular state-action $(s, a)$ for $N(s, a)$-many times, then with confidence of at least $1 - \delta$, we can establish a confidence interval for both its mean reward and its transition from the Chernoff-Hoeffding inequality (or Bernstein, fruit2018near). Let $b(\delta, n)$ be the $\delta$-confidence bound after observing $n$ i.i.d. samples of a $[0, 1]$-bounded random variable, $\hat{r}(s, a)$ the empirical mean of $r(s, a)$, and $\hat{p}(\cdot|s, a)$ the empirical transition of $p(\cdot|s, a)$. The statistically plausible mean rewards are

$$B_\delta(s, a) \coloneqq \left\{r' \in \mathbb{R} : |r' - \hat{r}(s, a)| \le r_{\max}\, b(\delta, N(s, a))\right\} \cap [0, r_{\max}]$$

and the statistically plausible transitions are

$$C_\delta(s, a) \coloneqq \left\{p' \in \mathcal{P}(\mathcal{S}) : \|p'(\cdot) - \hat{p}(\cdot|s, a)\|_1 \le b(\delta, N(s, a))\right\}.$$

We define an extended MDP $M^+ \coloneqq (\mathcal{S}, \mathcal{A}^+, p^+, r^+)$ to summarize these statistics givan2000bounded; strehl2005theoretical; tewari2007bounded; jaksch2010near, where $\mathcal{S}$ is the same state space as in $M$, the action space $\mathcal{A}^+$ is a union over state-specific actions

$$\mathcal{A}^+_s \coloneqq \left\{(a, p', r') : a \in \mathcal{A},\ p' \in C_\delta(s, a),\ r' \in B_\delta(s, a)\right\}, \tag{4}$$

where $\mathcal{A}$ is the same action space as in $M$, transitions according to the selected distribution

$$p^+(\cdot|s, (a, p', r')) \coloneqq p'(\cdot), \tag{5}$$

and rewards according to the selected mean reward

$$r^+(s, (a, p', r')) \coloneqq r'. \tag{6}$$

It is not hard to see that $M^+$ is indeed an MDP, with an infinite but compact action space.

By OFU, we want to find an optimal policy on an optimistic MDP within the set of statistically plausible MDPs. As observed in jaksch2010near, this is equivalent to finding an optimal policy $\pi^+$ in the extended MDP $M^+$, which specifies a policy $\pi$ in $M$ via $\pi(s) \coloneqq \sigma_1(\pi^+(s))$, where $\sigma_i$ is the projection map onto the $i$-th coordinate (and an optimistic MDP via transitions $\sigma_2(\pi^+(s))$ and mean rewards $\sigma_3(\pi^+(s))$ over actions selected by $\pi$; for transitions and mean rewards over the remaining actions, we can set them to the empirical estimates $\hat{p}$ and $\hat{r}$).

By construction of the extended MDP $M^+$, with high confidence, $M$ is in $M^+$, i.e. $\bar{r}(s, a) \in B_\delta(s, a)$ and $p(\cdot|s, a) \in C_\delta(s, a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}$. At the heart of UCRL2-type regret analysis, there is a key observation (jaksch2010near, equation (11)) that we can bound the span of optimal values in the extended MDP $M^+$ by the diameter of the actual MDP $M$ under the condition that $M$ is in $M^+$. This observation is needed to characterize how good following the "optimistic" policy in the actual MDP is. For $i \ge 0$, the $i$-step optimal value $u_i(s)$ of $M^+$ is the expected total reward obtained by following an optimal non-stationary $i$-step policy starting at state $s$. We can also define them recursively (via dynamic programming; in fact, the exact maximization of (7) can be found via extended value iteration, jaksch2010near, section 3.1)

$$
\begin{aligned}
u_0(s) &\coloneqq 0 \\
u_{i+1}(s) &\coloneqq \max_{(a, p', r') \in \mathcal{A}^+_s} \left[r^+(s, (a, p', r')) + \sum_{s'} p^+(s'|s, (a, p', r'))\, u_i(s')\right] \\
&= \max_{(a, p', r') \in \mathcal{A}^+_s} \left[r' + \sum_{s'} p'(s')\, u_i(s')\right] && \text{by (5) and (6)} \\
&= \max_{a \in \mathcal{A}} \left[\max_{r' \in B_\delta(s, a)} r' + \max_{p' \in C_\delta(s, a)} \sum_{s'} p'(s')\, u_i(s')\right] && \text{by (4)}
\end{aligned}
\tag{7}
$$
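The last line of (7) separates into two inner maximizations, which can be sketched as follows. This is a hedged illustration rather than the reference implementation: the reward maximization clips the upper confidence bound at $r_{\max}$, and the transition maximization uses the standard greedy mass-shifting step for an $L_1$ ball, in the spirit of the extended value iteration of jaksch2010near. The function names and the treatment of the confidence radius `b` as a given number are assumptions.

```python
def optimistic_transition(p_hat, u, b):
    """Pick p' with ||p' - p_hat||_1 <= b that maximizes sum_x p'(x) u(x)."""
    order = sorted(range(len(p_hat)), key=lambda x: u[x], reverse=True)
    p = list(p_hat)
    best = order[0]
    # Shift up to b/2 of probability mass onto the highest-value state.
    p[best] = min(1.0, p_hat[best] + b / 2.0)
    # Remove the surplus mass from the lowest-value states first.
    for x in reversed(order):
        surplus = sum(p) - 1.0
        if surplus <= 0.0:
            break
        p[x] = max(0.0, p[x] - surplus)
    return p


def optimistic_backup(actions, u, r_max=1.0):
    """One application of (7) at a single state.

    `actions` lists (r_hat, p_hat, b) per action, where b stands for the
    confidence radius b(delta, N(s, a)), taken here as a given number.
    """
    values = []
    for r_hat, p_hat, b in actions:
        r_opt = min(r_max, r_hat + r_max * b)  # largest r' in B_delta(s, a)
        p_opt = optimistic_transition(p_hat, u, b)
        values.append(r_opt + sum(px * ux for px, ux in zip(p_opt, u)))
    return max(values)


u = [0.0, 1.0]
p_opt = optimistic_transition([0.5, 0.5], u, 0.2)
value = optimistic_backup([(0.3, [0.5, 0.5], 0.2)], u)
print(p_opt, value)
```

In this small usage example, a radius of 0.2 lets the optimistic model move 0.1 of probability mass from the low-value state to the high-value one, so the total $L_1$ change stays within the radius.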

We are now ready to restate the observation. If $M$ is in $M^+$, which happens with high probability, jaksch2010near observe that

$$\max_s u_i(s) - \min_{s'} u_i(s') \le r_{\max} D(M). \tag{8}$$

However, this bound is too conservative because it fails to account for the rewards collected. By patching this, we tighten the upper bound with MEHC.

###### Lemma 1 (MEHC upper bounds the span of values).

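Assuming that the actual MDP $M$ is in the extended MDP $M^+$, i.e. $\bar{r}(s, a) \in B_\delta(s, a)$ and $p(\cdot|s, a) \in C_\delta(s, a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}$, we have

$$\max_s u_i(s) - \min_{s'} u_i(s') \le \kappa(M)$$

where $u_i(s)$ is the $i$-step optimal undiscounted value of state $s$.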

This refined upper bound immediately plugs into the main theorems of (jaksch2010near, equations 19 and 22, theorem 2).

###### Theorem 1 (Reward-sensitive regret bound of UCRL2).

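With probability of at least $1 - \delta$, for any initial state $s$ and any $T > 1$, and with $\kappa \coloneqq \kappa(M)$, the regret of UCRL2 is bounded by

$$
\begin{aligned}
\Delta(M, \text{UCRL2}, s, T) &\le \sqrt{\tfrac{5}{8} T \log\left(\tfrac{8T}{\delta}\right)} + \sqrt{T} + \kappa \sqrt{\tfrac{5}{2} T \log\left(\tfrac{8T}{\delta}\right)} + \kappa S A \log_2\left(\tfrac{8T}{SA}\right) \\
&\quad + \left(\kappa \sqrt{14 S \log\left(\tfrac{2AT}{\delta}\right)} + \sqrt{14 \log\left(\tfrac{2SAT}{\delta}\right)} + 2\right) \left(\sqrt{2} + 1\right) \sqrt{SAT} \\
&\le 34 \max\{1, \kappa\}\, S \sqrt{A T \log\left(\tfrac{T}{\delta}\right)}.
\end{aligned}
$$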

As a corollary, in terms of sample complexity kakade2003sample, Theorem 1 implies that UCRL2 offers a sample complexity of $\tilde{O}\left(\max\{1, \kappa\}^2 S^2 A / \epsilon^2\right)$, obtained by inverting the regret bound and demanding that the per-step regret is at most $\epsilon$ with probability of at least $1 - \delta$ (jaksch2010near, corollary 3). Similarly, we have an updated logarithmic bound on the expected regret (jaksch2010near, theorem 4).
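The inversion of the regret bound can be illustrated numerically. The sketch below, which is only an illustration under stated assumptions, uses the simplified final bound of Theorem 1 and searches for the smallest horizon at which the per-step bound drops below a target accuracy; the function names `regret_bound` and `steps_needed` and the example parameter values are hypothetical.

```python
import math


def regret_bound(T, kappa, S, A, delta):
    """Simplified bound of Theorem 1: 34 max{1, kappa} S sqrt(A T log(T / delta))."""
    return 34.0 * max(1.0, kappa) * S * math.sqrt(A * T * math.log(T / delta))


def steps_needed(eps, kappa, S, A, delta):
    """Smallest T >= 3 whose per-step regret bound is at most eps.

    For 0 < delta < 1 and T >= 3 the per-step bound decreases in T,
    so doubling followed by bisection finds the threshold.
    """
    def ok(T):
        return regret_bound(T, kappa, S, A, delta) / T <= eps

    lo = 3
    if ok(lo):
        return lo
    hi = 2 * lo
    while not ok(hi):
        lo, hi = hi, 2 * hi
    while hi - lo > 1:  # invariant: ok(hi) holds and ok(lo) fails
        mid = (lo + hi) // 2
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi


T_small = steps_needed(eps=0.1, kappa=1.0, S=5, A=2, delta=0.05)
T_large = steps_needed(eps=0.1, kappa=2.0, S=5, A=2, delta=0.05)
print(T_small, T_large)
```

Doubling $\kappa$ roughly quadruples the required horizon, up to the logarithmic factor, in line with the quadratic dependence obtained by inverting the bound.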

### 2.4 Informativeness of rewards

Informally, it is not hard to appreciate the challenge imposed by delayed feedback inherent in MDPs, as actions with high immediate rewards do not necessarily lead to a high optimal value. Are there different but "equivalent" reward functions that differ in their informativeness, with the more informative ones being easier to learn from? Suppose we have two MDPs differing only in their rewards, $M = (\mathcal{S}, \mathcal{A}, p, r)$ and $M' = (\mathcal{S}, \mathcal{A}, p, r')$, then they will have the same diameters and thus the same diameter-dependent regret bounds from previous works. With MEHC, however, we may get a more meaningful answer.

Firstly, let us make precise a notion of equivalence. We say that $r$ and $r'$ are $\Pi$-equivalent if for any policy $\pi$, its average rewards are the same under the two reward functions, $\rho(M, \pi, s) = \rho(M', \pi, s)$. Formally, we will study the MEHC of a class of $\Pi$-equivalent reward functions related via a potential.

### 2.5 Potential-based reward shaping

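Originally introduced by ng1999policy, potential-based reward shaping (PBRS) takes a potential $\varphi : \mathcal{S} \to \mathbb{R}$ and defines shaped rewards

$$r^\varphi_t \coloneqq r_t - \varphi(s_t) + \varphi(s_{t+1}). \tag{9}$$

We can think of the stochastic process $(s_t, a_t, r^\varphi_t)_{t \ge 0}$ as being generated from an MDP $M^\varphi$ with reward function $r^\varphi$ (Footnote 6: One needs to ensure that $r^\varphi$ respects the $[0, r_{\max}]$-boundedness of $r$.) whose mean rewards are

$$\bar{r}^\varphi(s, a) = \bar{r}(s, a) - \varphi(s) + \mathbb{E}_{s' \sim p(\cdot|s, a)}\left[\varphi(s')\right].$$

It is easy to check that $r$ and $r^\varphi$ are indeed $\Pi$-equivalent. For any policy $\pi$,

$$
\begin{aligned}
\rho(M^\varphi, \pi, s) &= \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\left[R(M^\varphi, \pi, s, T)\right] \\
&= \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\left[\sum_{t=0}^{T-1} r^\varphi_t\right] \\
&= \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\left[\sum_{t=0}^{T-1} \left(r_t - \varphi(s_t) + \varphi(s_{t+1})\right)\right] \\
&= \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\left[-\varphi(s_0) + \varphi(s_T) + \sum_{t=0}^{T-1} r_t\right] && \text{by telescoping sums of potential terms over consecutive } t \\
&= \lim_{T \to \infty} \frac{1}{T} \left(-\varphi(s) + \mathbb{E}\left[\varphi(s_T)\right] + \mathbb{E}\left[R(M, \pi, s, T)\right]\right) \\
&= \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\left[R(M, \pi, s, T)\right] && \text{the first two terms vanish in the limit} \\
&= \rho(M, \pi, s).
\end{aligned}
\tag{10}
$$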
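The telescoping step in (10) can be checked on a single simulated trajectory. The following sketch is illustrative only: the two-state chain, its rewards, the potential values and the uniformly random policy are all assumptions chosen to keep the example small.

```python
import random


def rollout(T, phi, seed=0):
    """Simulate T steps of a 2-state chain under a uniformly random policy.

    Returns (sum of original rewards, sum of shaped rewards, s_0, s_T).
    The chain, its rewards and the potential are illustrative assumptions.
    """
    rng = random.Random(seed)
    mean_reward = [0.5, 0.8]   # reward depends only on the current state here
    move_prob = 0.1            # action 1 moves to the other state w.p. 0.1
    s0 = s = 0
    total = shaped_total = 0.0
    for _ in range(T):
        a = rng.randrange(2)
        moved = a == 1 and rng.random() < move_prob
        s_next = 1 - s if moved else s
        r = mean_reward[s]
        total += r
        shaped_total += r - phi[s] + phi[s_next]  # shaped reward, equation (9)
        s = s_next
    return total, shaped_total, s0, s


phi = [0.0, 1.5]
T = 10_000
total, shaped_total, s0, sT = rollout(T, phi)
gap = shaped_total - total   # telescopes to phi[sT] - phi[s0]
per_step_gap = gap / T       # bounded by the range of phi divided by T
print(gap, per_step_gap)
```

Whatever the trajectory, the two cumulative rewards differ only by the potential difference between the final and initial states, so the per-step difference shrinks as $1/T$.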

To get some intuition, it is instructive to consider a toy example (Figure 1). Suppose $0 < \beta < \alpha$ and $\epsilon \in (0, 1)$, then the optimal average reward in this MDP is $1 - \beta$, and the optimal stationary deterministic policy is $\pi(s_1) = a_2$ and $\pi(s_2) = a_1$, as staying in state $s_2$ yields the highest average reward. As the expected number of steps needed to transition from state $s_1$ to $s_2$ and vice versa are both $1/\epsilon$ via action $a_2$, we conclude that $\kappa(M) = \alpha/\epsilon$. Furthermore, notice that taking action $a_2$ in either state transitions to the other state with probability of $\epsilon$; however, the immediate rewards are the same as taking the alternative action $a_1$ to stay in the current state, so the immediate rewards are not informative. We can differentiate the actions better by shaping with a potential satisfying $\varphi(s_2) - \varphi(s_1) = \frac{\alpha - \beta}{2\epsilon}$. The shaped mean rewards become, at $s_1$,

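$$\bar{r}^\varphi(s_1, a_2) = 1 - \alpha - \varphi(s_1) + \epsilon \varphi(s_2) + (1 - \epsilon) \varphi(s_1) = 1 - \frac{\alpha + \beta}{2} > 1 - \alpha = \bar{r}^\varphi(s_1, a_1)$$

and at $s_2$,

$$\bar{r}^\varphi(s_2, a_2) = 1 - \beta - \varphi(s_2) + \epsilon \varphi(s_1) + (1 - \epsilon) \varphi(s_2) = 1 - \frac{\alpha + \beta}{2} < 1 - \beta = \bar{r}^\varphi(s_2, a_1).$$

This encourages taking action $a_2$ at state $s_1$ and discourages taking action $a_2$ at state $s_2$ simultaneously. The maximum expected hitting cost becomes smaller

$$
\begin{aligned}
\kappa(M^\varphi) &= \max\left\{\alpha,\ \beta,\ \varphi(s_1) - \varphi(s_2) + \frac{\alpha}{\epsilon},\ \varphi(s_2) - \varphi(s_1) + \frac{\beta}{\epsilon}\right\} \\
&= \max\left\{\alpha,\ \beta,\ \frac{\alpha + \beta}{2\epsilon},\ \frac{\alpha + \beta}{2\epsilon}\right\} \\
&= \frac{\alpha + \beta}{2\epsilon} < \frac{\alpha}{\epsilon} = \kappa(M).
\end{aligned}
$$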

In this example, MEHC is halved at best when $\beta$ is made arbitrarily close to zero. Noting that the original MDP $M$ is equivalent to $M^\varphi$ shaped with potential $-\varphi$, i.e. $(M^\varphi)^{-\varphi} = M$ from (9), we see that MEHC can be almost doubled. It turns out that halving or doubling the MEHC is the most PBRS can do in a large class of MDPs.
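The toy example can be checked numerically with a short sketch. This is an illustration under assumptions rather than the authors' code: the values of `alpha`, `beta` and `eps` are arbitrary, the potential is one choice whose difference is $(\alpha - \beta)/(2\epsilon)$, hitting times are counted from $t \ge 1$, and the helper names `mehc` and `shape` are hypothetical.

```python
def mehc(P, mean_reward, r_max=1.0, tol=1e-12, max_iter=100_000):
    """Maximum expected hitting cost via value iteration on mean costs.

    Assumes hitting times are counted from t >= 1 and are finite under
    the minimizing policies; P[s][a][x] are transition probabilities.
    """
    n = len(P)
    cost = [[r_max - r for r in row] for row in mean_reward]

    def backup(s, target, J):
        # Mass landing on `target` incurs no further cost.
        return min(
            cost[s][a] + sum(P[s][a][x] * J[x] for x in range(n) if x != target)
            for a in range(len(P[s]))
        )

    worst = 0.0
    for target in range(n):
        J = [0.0] * n
        for _ in range(max_iter):
            new_J = [0.0 if s == target else backup(s, target, J) for s in range(n)]
            delta = max(abs(x - y) for x, y in zip(new_J, J))
            J = new_J
            if delta < tol:
                break
        worst = max(worst, max(backup(s, target, J) for s in range(n)))
    return worst


def shape(P, mean_reward, phi):
    """Shaped mean rewards: r(s, a) - phi(s) + E[phi(s')]."""
    return [
        [
            mean_reward[s][a] - phi[s] + sum(p * phi[x] for x, p in enumerate(P[s][a]))
            for a in range(len(P[s]))
        ]
        for s in range(len(P))
    ]


alpha, beta, eps = 0.5, 0.2, 0.1
# Action index 0 stays put (a_1); index 1 moves w.p. eps (a_2).
P = [[[1.0, 0.0], [1.0 - eps, eps]], [[0.0, 1.0], [eps, 1.0 - eps]]]
r_bar = [[1.0 - alpha] * 2, [1.0 - beta] * 2]
phi = [0.0, (alpha - beta) / (2.0 * eps)]

kappa = mehc(P, r_bar)
kappa_shaped = mehc(P, shape(P, r_bar, phi))
print(kappa, kappa_shaped)
```

With these values the unshaped MEHC is about $\alpha/\epsilon = 5$ and the shaped one about $(\alpha + \beta)/(2\epsilon) = 3.5$. Only the mean rewards enter the computation, so the sketch does not check whether realized shaped rewards stay within $[0, r_{\max}]$.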

###### Theorem 2 (MEHC under PBRS).

Given an MDP $M$ with finite maximum expected hitting cost $\kappa(M) < \infty$ and an unsaturated optimal average reward $\rho^*(M) < r_{\max}$, the maximum expected hitting cost of any PBRS-parameterized MDP $M^\varphi$ is bounded by a multiplicative factor of two

$$\frac{1}{2} \kappa(M) \le \kappa(M^\varphi) \le 2 \kappa(M).$$

## 3 Discussion

If we view RL as an engineering tool that “compiles” an arbitrary reward function into a behavior (as represented by a policy) in an environment, then a programmer’s primary responsibility would be to craft a reward function that faithfully expresses the intended goal. However, this problem of reward design is complicated by practical concerns for the difficulty of learning. As recognized by kober2013reinforcement,

“[t]here is also a trade-off between the complexity of the reward function and the complexity of the learning problem.”

Accurate rewards are often easy to specify in a sparse manner (reaching a position, capturing the king, etc.) and thus hard to learn from, whereas dense rewards, providing more feedback, are harder to specify accurately, leading to incorrect trained behaviors. The recent rise of deep RL also exposes "bugs" in some of these designed rewards. Our results show that the informativeness of rewards, an aspect of "the complexity of the learning problem", can be controlled by a well-specified potential without inadvertently changing the intended behaviors of the original reward. Therefore, we propose to separate the definitional concern from the training concern. Rewards should be first defined to faithfully express the intended task, and then any extra knowledge can be incorporated via a shaping potential to reduce the sample complexity of training to obtain the same desired behaviors. That is not to say that it is generally easy to find a helpful potential that makes the rewards more informative.

#### Acknowledgments

We thank Avrim Blum for many insightful comments. In particular, his challenge to find a better example has led to Theorem 2. We also thank Ronan Fruit for a discussion on a concept similar to the proposed maximum expected hitting cost that he independently developed in his thesis draft.

## Appendix A Detailed proofs

### A.1 Proof of Lemma 1

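Assuming that the actual MDP $M$ is in the extended MDP $M^+$, i.e. $\bar{r}(s, a) \in B_\delta(s, a)$ and $p(\cdot|s, a) \in C_\delta(s, a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}$, we have

$$\max_s u_i(s) - \min_{s'} u_i(s') \le \kappa(M)$$

where $u_i(s)$ is the $i$-step optimal undiscounted value of state $s$.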

###### Proof.

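By assumption, the actual mean rewards and transitions are contained in the extended MDP $M^+$, i.e. for any $s$ and $a$, $\bar{r}(s, a) \in B_\delta(s, a)$ and $p(\cdot|s, a) \in C_\delta(s, a)$. Thus for any policy $\pi$ in the actual MDP $M$, we can construct a corresponding policy $\pi^+$ in the extended MDP $M^+$

$$\pi^+(s) \coloneqq \left(\pi(s),\ p(\cdot|s, \pi(s)),\ \bar{r}(s, \pi(s))\right).$$

Following $\pi^+$ in $M^+$ induces the same stochastic process as following $\pi$ in $M$. In particular they have the same expected hitting times and expected rewards. By definition $u_i(s)$ is the value of following an optimal $i$-step non-stationary policy starting at $s$ in the extended MDP $M^+$. For any $s'$, by optimality, $u_i(s)$ must be no worse than first following $\pi^+$ from $s$ to $s'$ and then following the optimal $i$-step non-stationary policy from $s'$ onward. Along the path from $s$ to $s'$, we receive rewards according to $\bar{r}$, and after arriving at $s'$, we have missed at most $h_{s \to s'}(M^+, \pi^+)$-many rewards of $r_{\max}$, so in expectation

$$
\begin{aligned}
u_i(s) &\ge \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M^+, \pi^+) - 1} r_t\right] + u_i(s') - \mathbb{E}\left[r_{\max}\, h_{s \to s'}(M^+, \pi^+)\right] \\
&= \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M^+, \pi^+) - 1} (r_t - r_{\max})\right] + u_i(s') \\
&= \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_t - r_{\max})\right] + u_i(s') && \text{by definition of } \pi^+,\ h_{s \to s'}(M, \pi) = h_{s \to s'}(M^+, \pi^+).
\end{aligned}
$$

Rearranging the terms, we get

$$u_i(s') - u_i(s) \le \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_{\max} - r_t)\right].$$

Since this holds for any $\pi$ by optimality, we can choose one with the smallest expected hitting cost

$$u_i(s') - u_i(s) \le \min_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_{\max} - r_t)\right].$$

Since $s, s'$ are arbitrary, we can maximize over pairs of states on both sides and get

$$\max_{s'} u_i(s') - \min_s u_i(s) \le \max_{s, s'} \min_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_{\max} - r_t)\right] = \kappa(M).$$

It should be noted that even in some cases where the hitting time is infinite (in a non-communicating MDP, for example), $\kappa(M)$ can still be finite and this inequality is still true. In these cases, $r_t = r_{\max}$ except for finitely many terms, implying $\rho^*(M) = r_{\max}$. ∎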

### A.2 Proof of Theorem 2

Given an MDP $M$ with finite maximum expected hitting cost $\kappa(M) < \infty$ and an unsaturated optimal average reward $\rho^*(M) < r_{\max}$, the maximum expected hitting cost of any PBRS-parameterized MDP $M^\varphi$ is bounded by a multiplicative factor of two

$$\frac{1}{2} \kappa(M) \le \kappa(M^\varphi) \le 2 \kappa(M).$$

###### Proof.

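We denote the expected hitting cost between two states $s, s'$ as

$$c(s, s') \coloneqq \min_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_{\max} - r_t)\right].$$

Suppose that the pair of states $(s, s')$ maximizes the expected hitting cost in $M$, which is assumed to be finite

$$\kappa(M) = c(s, s') < \infty.$$

Furthermore, the condition that $\rho^*(M) < r_{\max}$ implies that the hitting times are finite for the minimizing policies. This ensures that the destination state is actually hit in the stochastic process.

Considering the expected hitting cost of the reverse pair, $(s', s)$,

$$\kappa(M) = \max\{c(s, s'), c(s', s)\} \le c(s, s') + c(s', s) \tag{11}$$

since hitting costs are nonnegative.

With $\varphi$-shaping,

$$
\begin{aligned}
c_\varphi(s, s') &= \min_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_{\max} - r^\varphi_t)\right] \\
&= \min_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} \left(r_{\max} - (r_t - \varphi(s_t) + \varphi(s_{t+1}))\right)\right] \\
&= \min_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[\varphi(s_0) - \varphi(s_{h_{s \to s'}(M, \pi)}) + \sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_{\max} - r_t)\right] && \text{by telescoping sums} \\
&= \varphi(s) - \varphi(s') + \min_{\pi : \mathcal{S} \to \mathcal{A}} \mathbb{E}\left[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_{\max} - r_t)\right] && \text{by definition of a finite hitting time, } s_{h_{s \to s'}(M, \pi)} = s' \\
&= \varphi(s) - \varphi(s') + c(s, s').
\end{aligned}
\tag{12}
$$

Note in particular that the minimizing policy for a state pair does not change under shaping. Therefore,

$$
\begin{aligned}
\kappa(M^\varphi) &\ge \max\{c_\varphi(s, s'), c_\varphi(s', s)\} && \text{by definition of MEHC} \\
&= \max\{c(s, s') + \varphi(s) - \varphi(s'),\ c(s', s) + \varphi(s') - \varphi(s)\} && \text{by (12)} \\
&\ge \frac{1}{2}\left[c(s, s') + c(s', s)\right] && \text{the maximum is no smaller than half of the sum} \\
&\ge \frac{1}{2} \kappa(M) && \text{by (11)}.
\end{aligned}
$$

We obtain the other half of the inequality by observing $(M^\varphi)^{-\varphi} = M$. ∎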
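For completeness, the last step can be spelled out as a short worked derivation. This is a sketch of one way to fill it in, assuming that shaping $M^\varphi$ with the potential $-\varphi$ recovers $M$ (which follows from (9)), that $M^\varphi$ has the same unsaturated optimal average reward by (10), and that its hitting costs are finite by (12).

```latex
\begin{align*}
\kappa(M)
  &= \kappa\big((M^{\varphi})^{-\varphi}\big)
  && \text{shaping } M^{\varphi} \text{ with } -\varphi \text{ recovers } M \\
  &\ge \tfrac{1}{2}\,\kappa(M^{\varphi})
  && \text{lower bound applied to } M^{\varphi} \text{ with potential } -\varphi \\
\Longrightarrow\quad
\kappa(M^{\varphi}) &\le 2\,\kappa(M).
\end{align*}
```

Applying the already proven lower bound to the shaped MDP, rather than repeating the argument, is what makes the factor of two symmetric.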