1 Introduction
In the average reward setting of reinforcement learning (RL) (puterman1994markov; sutton1998reinforcement), an algorithm learns to maximize its average reward by interacting with an unknown Markov decision process (MDP). As in the analysis of multi-armed bandits and other online machine learning problems, (cumulative) regret provides a natural measure of the efficiency of a learning algorithm. With UCRL2, jaksch2010near show a problem-dependent bound of $\tilde{O}(DS\sqrt{AT})$ on regret after $T$ steps and an associated logarithmic bound on the expected regret, where $D$ is the diameter of the actual MDP (Definition 1), $S$ the size of the state space, and $A$ the size of the action space. Many subsequent algorithms (fruit2019exploitation) enjoy similar diameter-dependent bounds. This establishes diameter as an important measure of complexity for an MDP. Strikingly, however, this measure is independent of rewards and is a function of the transitions only. This is peculiar, as two MDPs differing only in their rewards would have the same regret bounds even if one of them gives the maximum reward for all transitions. We review the related key observation in jaksch2010near and refine it with a new lemma (Lemma 1), establishing a reward-sensitive complexity measure that we refer to as the maximum expected hitting cost (MEHC, Definition 2), which tightens the regret bounds of UCRL2 and similar algorithms by replacing diameter (Theorem 1).

Next, with respect to this new complexity measure, we investigate a notion of reward informativeness (Section 2.4). Intuitively, in a given environment, the same desired policies can be motivated by different (immediate) rewards. These differing definitions of rewards can be more or less informative of useful actions, i.e., actions yielding high long-term rewards. To formalize this intuition, we study a way to reparametrize rewards via potential-based reward shaping (PBRS) (ng1999policy) that can produce different rewards with the same near-optimal policies (Section 2.5). We show that the MEHC changes under reparametrization by PBRS, and with it the regret and sample complexity bounds, substantiating this notion of informativeness. Lastly, we study the extent of its impact. In particular, we show that there is a factor-of-two limit on its impact on MEHC in a large class of MDPs (Theorem 2).
This result and the concept of reward informativeness may be useful for a task designer crafting a reward function (Section 3).
The main contributions of this work are twofold.

We propose a new MDP structural parameter, maximum expected hitting cost, that accounts for both the transitions and rewards. This parameter replaces diameter in the regret bounds of several model-based RL algorithms.

We show that potential-based reward shaping can change the maximum expected hitting cost of an MDP and thus the regret bound. This results in a set of equivalent MDPs with different learning difficulties as measured by regret. Moreover, we show that their MEHCs differ by a factor of at most two in a large class of MDPs.
1.1 Related work
This work is most closely related to jaksch2010near, which establishes diameter as a complexity measure for MDPs that is prevalent in regret bounds of RL algorithms in the average reward setting (fruit2019exploitation). As noted by jaksch2010near, unlike some previous measures of MDP complexity such as the return mixing time (kearns2002near; brafman2002r), diameter depends only on the transitions and not on the rewards. The core reason for its presence in the regret analysis is that it upper bounds the optimal value span of the extended MDP, which summarizes the observations (Section 2.3 and (8)). We review and update this observation with a reward-dependent parameter we call the maximum expected hitting cost (Lemma 1). Interestingly, the gap between diameter and MEHC can be arbitrarily large: there are MDPs with finite MEHC and infinite diameter. These MDPs are non-communicating but have saturated optimal average rewards, i.e., equal to the maximum possible reward. Intuitively, in these MDPs, at some state $s$, the learner cannot visit some other state $s'$ but can nonetheless achieve the maximum possible average reward, thus allowing for good regret guarantees; the unreachable states will not seem better than the reachable ones under the principle of optimism in the face of uncertainty (OFU). We will also use UCRL2 (jaksch2010near) as an example algorithm in some of the discussion, though the main results do not depend on it. In particular, with MEHC, its regret bounds are updated (Theorem 1).
Another important comparison is with the optimal bias span (puterman1994markov; bartlett2009regal; fruit2018efficient), a reward-dependent parameter of MDPs. Here, we again find that the gap can be arbitrarily large, with the optimal bias span being the smaller of the two. (Footnote 1: This inequality can be derived as a consequence of Lemma 1 when the extended MDP has very tight confidence intervals around the actual transitions and mean rewards of the actual MDP. Observe that the span of the $i$-step optimal values is equal to the optimal bias span in the limit $i \to \infty$ (jaksch2010near, remark 8).) These non-communicating MDPs would have an unsaturated optimal average reward, i.e., one strictly below the maximum possible reward. But as shown in fruit2018near; fruit2018efficient, the extra knowledge of some upper bound on the optimal bias span is necessary for an algorithm to enjoy a regret that scales with this smaller parameter. In contrast, UCRL2, whose regret scales with MEHC, does not need to know the diameter or MEHC of the actual MDP.

Potential-based reward shaping (ng1999policy) was originally proposed as a solution technique for a programmer to influence the sample complexity of their reinforcement learning algorithm without changing the near-optimal policies in the episodic and discounted settings. Prior theoretical analysis involving PBRS (ng1999policy; wiewiora2003potential; wiewiora2003principled; asmuth2008potential; grzes2017reward) mostly focuses on the consistency of RL against the shaped rewards, i.e., the resulting learned behavior is also (near-)optimal in the original MDP, while suggesting empirically that the sample complexity can be changed by a well-specified potential. In this work, we use PBRS to construct so-called equivalent reward functions in the average reward setting (Section 2.4) and show that two reward functions related by a shaping potential can have different MEHCs, and thus different regrets and sample complexities (Section 2.5). However, a subtle but important technical requirement of boundedness of MDPs makes it difficult to immediately generalize our results (Section 2.5 and Theorem 2) to the treatment of PBRS as a solution technique, because an arbitrary potential function picked without knowledge of the original MDP may not preserve the boundedness. Nevertheless, we think our work may bring some new perspectives to this topic.
2 Results
2.1 Markov decision process
A Markov decision process is defined by the tuple $M = (\mathcal{S}, \mathcal{A}, p, r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p$ is the transition function $p : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, and $r$ is the reward function with mean $\bar{r}(s, a)$. Let $S := |\mathcal{S}|$ and $A := |\mathcal{A}|$; we restrict our attention to settings in which the state and action spaces are finite. At each time step $t$, an algorithm chooses an action $a_t$ based on the observations up to that point. The state transitions to $s_{t+1}$ according to the probability distribution $p(\cdot \mid s_t, a_t)$, and a reward $r_t$ is drawn according to the distribution $r(s_t, a_t)$. (Footnote 2: It is important to assume that the support of rewards lies in a known bounded interval, often $[0, 1]$ by convention. This is sometimes referred to as a bounded MDP in the literature.) Analogous to bandits, the details of the reward distribution often play no important role, and it suffices to specify an MDP with the mean rewards $\bar{r}$.

The transition probabilities and reward function of the MDP are unknown to the learner. The sequence of random variables $(s_t, a_t, r_t)_{t \ge 0}$ forms a stochastic process. Note that a stationary deterministic policy $\pi : \mathcal{S} \to \mathcal{A}$ is a restrictive type of algorithm whose action $a_t$ depends only on $s_t$. We will refer to stationary deterministic policies as policies in the rest of the paper.

Recall that in a Markov chain, the hitting time of state $s'$ starting at state $s$ is a random variable $h_{s \to s'} := \inf\{t \ge 0 : s_t = s'\}$ (levin2008markov). (Footnote 3: Indexing from $t = 0$ ensures that $h_{s \to s} = 0$. Note also that by convention $\inf \emptyset = \infty$.)

Definition 1 (Diameter, jaksch2010near).
Suppose in the stochastic process induced by following a policy $\pi$ in MDP $M$, the time to hit state $s'$ starting at state $s$ is $h_{s \to s'}(M, \pi)$. We define the diameter of $M$ to be
$$D(M) := \max_{s, s'} \min_{\pi} \mathbb{E}\big[h_{s \to s'}(M, \pi)\big].$$
We incorporate rewards into diameter, and introduce a novel MDP parameter.
Definition 2 (Maximum expected hitting cost).
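We define the maximum expected hitting cost of $M$ to be
$$\kappa(M) := \max_{s, s'} \min_{\pi} \mathbb{E}\Big[\sum_{t=0}^{h_{s \to s'}(M, \pi) - 1} (r_{\max} - r_t)\Big],$$
where $r_{\max}$ is the maximum possible reward.
Observe that MEHC is a smaller parameter, that is, $\kappa(M) \le r_{\max} D(M)$, since for any $t$, we have $r_{\max} - r_t \le r_{\max}$.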
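As a rough illustration of Definitions 1 and 2, the sketch below computes both quantities for a small tabular MDP. It assumes that the hitting cost accumulates, at every step before the target state is hit, the gap between the maximum reward and the mean reward received. The two-state MDP, the names `P`, `R`, `R_MAX`, `MOVE_PROB`, and the helper functions are all hypothetical and not taken from the paper. The inner minimization over policies is approximated by value iteration started from zero, a standard way to solve shortest-path problems with nonnegative costs.

```python
# Hypothetical two-state MDP (illustrative only, not the paper's Figure 1).
# P[s][a] lists (next_state, probability); R[s][a] is a mean reward in [0, R_MAX].
R_MAX = 1.0
MOVE_PROB = 0.1  # assumed probability that "move" actually switches state

P = {
    0: {"stay": [(0, 1.0)], "move": [(1, MOVE_PROB), (0, 1.0 - MOVE_PROB)]},
    1: {"stay": [(1, 1.0)], "move": [(0, MOVE_PROB), (1, 1.0 - MOVE_PROB)]},
}
R = {
    0: {"stay": 0.5, "move": 0.5},
    1: {"stay": 1.0, "move": 1.0},
}


def min_hitting_costs(P, cost, target, tol=1e-12, max_iters=100_000):
    """Minimum expected cost accumulated before first hitting `target`.

    Runs value iteration from zero on
        c(s) = min_a [cost(s, a) + sum_s' p(s'|s, a) c(s')],  c(target) = 0.
    If the iteration has not converged after `max_iters` sweeps, every
    non-target state is crudely reported as infinitely costly.
    """
    c = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_c = {
            s: 0.0 if s == target else min(
                cost[s][a] + sum(p * c[s2] for s2, p in P[s][a]) for a in P[s]
            )
            for s in P
        }
        if max(abs(new_c[s] - c[s]) for s in P) < tol:
            return new_c
        c = new_c
    return {s: 0.0 if s == target else float("inf") for s in P}


def max_over_state_pairs(P, cost):
    """Outer maximization over ordered pairs (s, s') with s != s'."""
    return max(
        min_hitting_costs(P, cost, target)[s]
        for target in P for s in P if s != target
    )


unit_cost = {s: {a: 1.0 for a in P[s]} for s in P}               # one unit per step
reward_gap = {s: {a: R_MAX - R[s][a] for a in P[s]} for s in P}  # r_max - r(s, a)

diameter = max_over_state_pairs(P, unit_cost)  # Definition 1
mehc = max_over_state_pairs(P, reward_gap)     # Definition 2
print(f"D = {diameter:.4f}, kappa = {mehc:.4f}")
```

In this toy MDP, reaching state 1 from state 0 takes 1/`MOVE_PROB` = 10 expected steps at a per-step reward gap of 0.5, while the reverse direction costs nothing because state 1 already pays the maximum reward. The sketch should therefore report a diameter of about 10 and an MEHC of about 5.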
2.2 Average reward criterion, and regret
The accumulated reward of an algorithm $\mathfrak{A}$ after $T$ time steps in MDP $M$ starting at state $s$ is a random variable
$$R(M, \mathfrak{A}, s, T) := \sum_{t=0}^{T-1} r_t.$$
We define the average reward or gain (puterman1994markov) as
$$\rho(M, \mathfrak{A}, s) := \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\big[R(M, \mathfrak{A}, s, T)\big]. \tag{1}$$
We will evaluate policies by their average reward. This can be maximized by a policy, and we define the optimal average reward of $M$ starting at state $s$ as
$$\rho^*(M, s) := \max_{\pi} \rho(M, \pi, s). \tag{2}$$
Furthermore, we will demand that the optimal average reward starting at any state be the same, i.e., $\rho^*(M, s) = \rho^*(M, s')$ for any states $s, s'$. This is a natural requirement on the MDP in the online setting to allow for any hope of a vanishing regret. Otherwise, the learner may, out of ignorance, take actions leading to states with a lower optimal average reward and incur linear regret when compared with the optimal policy starting at the initial state. In particular, this condition is true for communicating MDPs (puterman1994markov) by virtue of their transitions, but it is also possible for non-communicating MDPs with appropriate rewards. We will write $\rho^*(M) := \rho^*(M, s)$.
We will compete with the expected cumulative reward of an optimal policy on its trajectory, and define the regret of a learning algorithm $\mathfrak{A}$ starting at state $s$ after $T$ time steps as
$$\Delta(M, \mathfrak{A}, s, T) := T \rho^*(M) - R(M, \mathfrak{A}, s, T). \tag{3}$$
2.3 Optimism in the face of uncertainty, extended MDP, and UCRL2
The principle of optimism in the face of uncertainty (OFU) (sutton1998reinforcement) states that for uncertain state-actions (those we have not visited enough up to this point), we should be optimistic about their outcome. The intuition is that by taking reward-maximizing actions with respect to this optimistic model (in terms of both transitions and immediate rewards for these uncertain state-actions), we will have no regret if the optimism is well placed; otherwise, we will quickly learn more about these suboptimal state-actions and avoid them in the future. This fruitful idea has been the basis for many model-based RL algorithms (fruit2019exploitation) and in particular UCRL2 (jaksch2010near), which keeps track of the statistical uncertainty via upper confidence bounds.
Suppose we have visited a particular state-action pair $(s, a)$ many times. Then, with confidence at least $1 - \delta$, we can establish a confidence interval for both its mean reward and its transition distribution from the Chernoff-Hoeffding inequality (or Bernstein's, fruit2018near). Given the confidence bound after observing $n$ i.i.d. samples of a bounded random variable, the empirical mean reward of $(s, a)$, and the empirical transition distribution of $(s, a)$, the statistically plausible mean rewards are
and the statistically plausible transitions are
We define an extended MDP $M^+$ to summarize these statistics (givan2000bounded; strehl2005theoretical; tewari2007bounded; jaksch2010near), where the state space is the same as in $M$, and the action space is a union over state-specific actions
(4) 
where is the same action space in , transitions according to the selected distribution
(5) 
and rewards according to the selected mean reward
(6) 
It is not hard to see that the extended MDP is indeed an MDP, with an infinite but compact action space.
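The confidence sets that make up the extended MDP can be illustrated with a minimal sketch for the mean rewards. It uses the textbook two-sided Hoeffding bound, which is an assumption for illustration: the exact constants used by UCRL2 differ, and the function names `hoeffding_radius` and `plausible_mean_rewards` are hypothetical.

```python
import math


def hoeffding_radius(n, delta, width=1.0):
    """Two-sided Hoeffding radius for the mean of n i.i.d. samples
    supported on an interval of length `width`, at confidence 1 - delta."""
    if n <= 0:
        return width  # no data yet: every value in the interval is plausible
    return width * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def plausible_mean_rewards(empirical_mean, n, delta, r_max=1.0):
    """Interval of mean rewards consistent with the data, clipped to [0, r_max]."""
    radius = hoeffding_radius(n, delta, width=r_max)
    return max(0.0, empirical_mean - radius), min(r_max, empirical_mean + radius)


# The interval around an empirical mean of 0.6 shrinks as the visit count grows.
for visits in (1, 10, 100, 1000):
    low, high = plausible_mean_rewards(0.6, visits, delta=0.05)
    print(f"n = {visits:4d}: [{low:.3f}, {high:.3f}]")
```

An optimistic learner would act as if the mean reward sat at the upper end of this interval, which is why rarely visited state-actions look attractive until enough samples tighten the bound.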
By OFU, we want to find an optimal policy on an optimistic MDP within the set of statistically plausible MDPs. As observed in jaksch2010near, this is equivalent to finding an optimal policy in the extended MDP , which specifies a policy in via , where is the projection map onto the th coordinate (and an optimistic MDP via transitions and mean rewards over actions selected by ^{4}^{4}4For transitions and mean rewards over actions we can set them to and .).
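By construction of the extended MDP $M^+$, with high confidence, $M$ is in $M^+$, i.e., the actual mean rewards and transitions are statistically plausible for all state-action pairs. At the heart of UCRL2-type regret analysis, there is a key observation (jaksch2010near, equation (11)) that we can bound the span of optimal values in the extended MDP $M^+$ by the diameter of the actual MDP $M$ under the condition that $M$ is in $M^+$. This observation is needed to characterize how good following the “optimistic” policy in the actual MDP is. For $i \ge 0$, the $i$-step optimal value $u_i(s)$ of $M^+$ is the expected total reward obtained by following an optimal non-stationary $i$-step policy starting at state $s$. We can also define these values recursively via dynamic programming, maximizing over the extended actions $a^+$ available at $s$, with $p^+$ and $\bar{r}^+$ denoting the transitions and mean rewards of $M^+$:
$$u_0(s) := 0, \qquad u_{i+1}(s) := \max_{a^+} \Big\{ \bar{r}^+(s, a^+) + \sum_{s'} p^+(s' \mid s, a^+)\, u_i(s') \Big\}. \tag{7}$$
(Footnote 5: In fact, the exact maximization of (7) can be found via extended value iteration (jaksch2010near, section 3.1).)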
We are now ready to restate the observation. If $M$ is in $M^+$, which happens with high probability, jaksch2010near observe that
$$\max_{s} u_i(s) - \min_{s'} u_i(s') \le r_{\max} D(M). \tag{8}$$
However, this bound is too conservative because it fails to account for the rewards collected. By patching this, we tighten the upper bound with MEHC.
Lemma 1 (MEHC upper bounds the span of values).
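Assuming that the actual MDP $M$ is in the extended MDP $M^+$, i.e., the actual mean rewards and transitions are statistically plausible for all state-action pairs $(s, a)$, we have
$$\max_{s} u_i(s) - \min_{s'} u_i(s') \le \kappa(M),$$
where $u_i(s)$ is the $i$-step optimal undiscounted value of state $s$.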
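Lemma 1 can be sanity-checked numerically in a degenerate case. The sketch below assumes the confidence sets have collapsed so that the extended MDP coincides with the actual MDP, in which case the $i$-step optimal values come from ordinary finite-horizon value iteration. The two-state MDP and all names are hypothetical, and the MEHC value `KAPPA` is worked out by hand for this particular MDP rather than taken from the paper.

```python
# Hypothetical two-state MDP (illustrative only): state 1 always pays the
# maximum reward, state 0 pays 0.5, and "move" switches state w.p. MOVE_PROB.
R_MAX = 1.0
MOVE_PROB = 0.1
P = {
    0: {"stay": [(0, 1.0)], "move": [(1, MOVE_PROB), (0, 1.0 - MOVE_PROB)]},
    1: {"stay": [(1, 1.0)], "move": [(0, MOVE_PROB), (1, 1.0 - MOVE_PROB)]},
}
R = {
    0: {"stay": 0.5, "move": 0.5},
    1: {"stay": 1.0, "move": 1.0},
}

# Hand-computed MEHC for this MDP: going from state 0 to state 1 costs
# (R_MAX - 0.5) per step for 1 / MOVE_PROB expected steps; the reverse
# direction costs nothing because state 1 already pays R_MAX.
KAPPA = (R_MAX - 0.5) / MOVE_PROB


def optimal_values(P, R, horizon):
    """Return [u_0, u_1, ..., u_horizon], the i-step optimal undiscounted values."""
    u = {s: 0.0 for s in P}
    values = [u]
    for _ in range(horizon):
        u = {
            s: max(R[s][a] + sum(p * u[s2] for s2, p in P[s][a]) for a in P[s])
            for s in P
        }
        values.append(u)
    return values


def span(u):
    """Difference between the largest and smallest value across states."""
    return max(u.values()) - min(u.values())


spans = [span(u) for u in optimal_values(P, R, horizon=200)]
print(f"largest span = {max(spans):.4f}, kappa = {KAPPA:.4f}")
```

Here the span grows with the horizon and approaches the hand-computed MEHC of 5 from below, while the diameter-based bound $r_{\max} D(M) = 10$ would be twice as loose.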
This refined upper bound immediately plugs into the main theorems of (jaksch2010near, equations 19 and 22, theorem 2).
Theorem 1 (Reward-sensitive regret bound of UCRL2).
With probability of at least , for any initial state and any , and , the regret of UCRL2 is bounded by
As a corollary, in terms of sample complexity (kakade2003sample), Theorem 1 implies a corresponding sample complexity bound for UCRL2, obtained by inverting the regret bound and demanding that the per-step regret is at most $\epsilon$ with probability of at least $1 - \delta$ (jaksch2010near, corollary 3). Similarly, we have an updated logarithmic bound on the expected regret (jaksch2010near, theorem 4).
2.4 Informativeness of rewards
Informally, it is not hard to appreciate the challenge imposed by delayed feedback inherent in MDPs, as actions with high immediate rewards do not necessarily lead to a high optimal value. Are there different but “equivalent” reward functions that differ in their informativeness, with the more informative ones being easier to learn from? Suppose we have two MDPs differing only in their rewards; then they will have the same diameter and thus the same diameter-dependent regret bounds from previous works. With MEHC, however, we may get a more meaningful answer.
Firstly, let us make precise a notion of equivalence. We say that two reward functions are equivalent if, for any policy $\pi$, its average rewards are the same under the two reward functions. Formally, we will study the MEHC of a class of equivalent reward functions related via a potential.
2.5 Potentialbased reward shaping
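Originally introduced by ng1999policy, potential-based reward shaping (PBRS) takes a potential $\varphi : \mathcal{S} \to \mathbb{R}$ and defines shaped rewards
$$r^{\varphi}_t := r_t - \varphi(s_t) + \varphi(s_{t+1}). \tag{9}$$
We can think of the stochastic process $(s_t, a_t, r^{\varphi}_t)_{t \ge 0}$ as being generated from an MDP $M^{\varphi}$ with reward function $r^{\varphi}$ (Footnote 6: One needs to ensure that $r^{\varphi}$ respects the boundedness of $M$.) whose mean rewards are
$$\bar{r}^{\varphi}(s, a) = \bar{r}(s, a) - \varphi(s) + \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[\varphi(s')\big].$$
It is easy to check that $r$ and $r^{\varphi}$ are indeed equivalent. For any policy $\pi$ and starting state $s$, the average reward of $\pi$ in $M^{\varphi}$ is
$$\rho(M^{\varphi}, \pi, s) = \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\Big[\sum_{t=0}^{T-1} \big(r_t - \varphi(s_t) + \varphi(s_{t+1})\big)\Big].$$
By telescoping sums of potential terms over consecutive time steps,
$$\rho(M^{\varphi}, \pi, s) = \lim_{T \to \infty} \frac{1}{T} \mathbb{E}\Big[-\varphi(s_0) + \varphi(s_T) + \sum_{t=0}^{T-1} r_t\Big].$$
The first two terms vanish in the limit, so
$$\rho(M^{\varphi}, \pi, s) = \rho(M, \pi, s). \tag{10}$$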
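The telescoping argument can be checked on a sampled trajectory. The sketch below assumes the shaping convention in which each reward has the potential of the current state subtracted and the potential of the next state added; the two-state MDP, the policy, the potential `phi`, and the function `rollout` are hypothetical, and mean rewards stand in for realized rewards to keep the example short.

```python
import random

# Hypothetical two-state MDP and policy (illustrative only).
P = {
    0: {"stay": [(0, 1.0)], "move": [(1, 0.1), (0, 0.9)]},
    1: {"stay": [(1, 1.0)], "move": [(0, 0.1), (1, 0.9)]},
}
R = {0: {"stay": 0.9, "move": 0.9}, 1: {"stay": 0.5, "move": 0.5}}
policy = {0: "move", 1: "move"}
phi = {0: 2.0, 1: 0.0}  # assumed shaping potential


def rollout(P, R, policy, phi, start, steps, rng):
    """Sample a trajectory; return (total reward, total shaped reward, final state)."""
    state, total, shaped_total = start, 0.0, 0.0
    for _ in range(steps):
        action = policy[state]
        next_states, probs = zip(*P[state][action])
        next_state = rng.choices(next_states, weights=probs)[0]
        reward = R[state][action]  # mean reward used as the realized reward
        total += reward
        shaped_total += reward - phi[state] + phi[next_state]
        state = next_state
    return total, shaped_total, state


T = 10_000
total, shaped_total, final_state = rollout(P, R, policy, phi, 0, T, random.Random(0))
print(f"average reward: {total / T:.4f}, shaped average reward: {shaped_total / T:.4f}")
```

The two totals differ only by the potential of the final state minus the potential of the initial state, so the two averages differ by at most the range of the potential divided by the number of steps, which vanishes as the horizon grows.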
To get some intuition, it is instructive to consider a toy example (Figure 1). Suppose and , then the optimal average reward in this MDP is , and the optimal stationary deterministic policy is and , as staying in state yields the highest average reward. As the expected number of steps needed to transition from state to and vice versa are both via action , we conclude that . Furthermore, notice that taking action in either state transitions to the other state with probability of , however the immediate rewards are the same as taking the alternative action to stay in the current state—the immediate rewards are not informative. We can differentiate the actions better by shaping with a potential of and . The shaped mean rewards become, at ,
and at ,
This encourages taking actions at state and discourages taking actions at state simultaneously. The maximum expected hitting cost becomes smaller
In this example, MEHC is halved at best when is made arbitrarily close to zero. Noting that the original MDP is equivalent to $M^{\varphi}$ shaped with potential $-\varphi$, i.e., $M = (M^{\varphi})^{-\varphi}$ from (9), we see that MEHC can be almost doubled. It turns out that halving or doubling the MEHC is the most PBRS can do in a large class of MDPs.
Theorem 2 (MEHC under PBRS).
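Given an MDP $M$ with finite maximum expected hitting cost $\kappa(M) < \infty$ and an unsaturated optimal average reward $\rho^*(M) < r_{\max}$, the maximum expected hitting cost of any PBRS-parameterized MDP $M^{\varphi}$ is bounded by a multiplicative factor of two:
$$\tfrac{1}{2} \kappa(M) \le \kappa(M^{\varphi}) \le 2 \kappa(M).$$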
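The factor-of-two statement can be probed on a small instance. The sketch below assumes the hitting cost accumulates the gap between the maximum reward and the mean reward until the target is hit, and that shaping subtracts the potential of the current state and adds the expected potential of the next state. The MDP, the potential `phi`, and the helpers `shape`, `hitting_cost`, and `mehc` are hypothetical, and one instance is of course an illustration rather than evidence for the theorem.

```python
# Hypothetical two-state MDP with an unsaturated optimal average reward (0.9 < R_MAX).
R_MAX = 1.0
P = {
    0: {"stay": [(0, 1.0)], "move": [(1, 0.1), (0, 0.9)]},
    1: {"stay": [(1, 1.0)], "move": [(0, 0.1), (1, 0.9)]},
}
R = {0: {"stay": 0.9, "move": 0.9}, 1: {"stay": 0.5, "move": 0.5}}
phi = {0: 2.0, 1: 0.0}  # assumed shaping potential


def shape(P, R, phi):
    """Mean shaped rewards r(s, a) - phi(s) + E[phi(s')]."""
    return {
        s: {a: R[s][a] - phi[s] + sum(p * phi[s2] for s2, p in P[s][a]) for a in P[s]}
        for s in P
    }


def hitting_cost(P, R, source, target, tol=1e-12, max_iters=100_000):
    """Minimum expected sum of (R_MAX - r) before hitting `target` from `source`,
    approximated by value iteration from zero; infinity if it fails to converge."""
    c = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_c = {
            s: 0.0 if s == target else min(
                R_MAX - R[s][a] + sum(p * c[s2] for s2, p in P[s][a]) for a in P[s]
            )
            for s in P
        }
        if max(abs(new_c[s] - c[s]) for s in P) < tol:
            return new_c[source]
        c = new_c
    return float("inf")


def mehc(P, R):
    """Maximum over ordered state pairs of the minimum expected hitting cost."""
    return max(hitting_cost(P, R, s, t) for s in P for t in P if s != t)


R_shaped = shape(P, R, phi)
# Boundedness: the shaped mean rewards must stay within [0, R_MAX].
assert all(0.0 <= R_shaped[s][a] <= R_MAX for s in P for a in P[s])

kappa, kappa_shaped = mehc(P, R), mehc(P, R_shaped)
print(f"kappa = {kappa:.4f}, shaped kappa = {kappa_shaped:.4f}")
```

By hand, the unshaped hitting costs are 1 (from state 0 to state 1) and 5 (from state 1 to state 0), so the MEHC is 5. Shaping shifts these costs by the potential difference of 2 in opposite directions, giving 3 and 3, so the shaped MEHC of 3 lies between one half and twice the original value.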
3 Discussion
If we view RL as an engineering tool that “compiles” an arbitrary reward function into a behavior (as represented by a policy) in an environment, then a programmer’s primary responsibility would be to craft a reward function that faithfully expresses the intended goal. However, this problem of reward design is complicated by practical concerns for the difficulty of learning. As recognized by kober2013reinforcement,
“[t]here is also a tradeoff between the complexity of the reward function and the complexity of the learning problem.”
Accurate rewards are often easy to specify in a sparse manner (reaching a position, capturing the king, etc.) and are thus hard to learn from, whereas dense rewards, which provide more feedback, are harder to specify accurately, leading to incorrect trained behaviors. The recent rise of deep RL also exposes “bugs” in some of these designed rewards. Our results show that the informativeness of rewards, an aspect of “the complexity of the learning problem,” can be controlled by a well-specified potential without inadvertently changing the intended behaviors of the original reward. Therefore, we propose to separate the definitional concern from the training concern. Rewards should first be defined to faithfully express the intended task, and then any extra knowledge can be incorporated via a shaping potential to reduce the sample complexity of training to obtain the same desired behaviors. That is not to say that it is generally easy to find a helpful potential that makes the rewards more informative.
Acknowledgments
We thank Avrim Blum for many insightful comments. In particular, his challenge to find a better example led to Theorem 2. We also thank Ronan Fruit for a discussion on a concept similar to the proposed maximum expected hitting cost that he independently developed in his thesis draft.
Appendix A Detailed proofs
A.1 Proof of Lemma 1
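Assuming that the actual MDP $M$ is in the extended MDP $M^+$, i.e., the actual mean rewards and transitions are statistically plausible for all state-action pairs $(s, a)$, we have
$$\max_{s} u_i(s) - \min_{s'} u_i(s') \le \kappa(M),$$
where $u_i(s)$ is the $i$-step optimal undiscounted value of state $s$.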
Proof.
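By assumption, the actual mean rewards and transitions are contained in the extended MDP $M^+$, i.e., they are statistically plausible for every state $s$ and action $a$. Thus for any policy $\pi$ in the actual MDP $M$, we can construct a corresponding policy $\pi^+$ in the extended MDP $M^+$ that, at each state $s$, selects the action $\pi(s)$ together with the actual transition distribution and mean reward of $(s, \pi(s))$.
Following $\pi^+$ in $M^+$ induces the same stochastic process as following $\pi$ in $M$. In particular, they have the same expected hitting times and expected rewards. By definition, $u_i(s)$ is the value of following an optimal $i$-step non-stationary policy starting at $s$ in the extended MDP $M^+$. For any $s'$, by optimality, $u_i(s)$ must be no worse than first following $\pi^+$ from $s$ to $s'$ and then following the optimal non-stationary policy from $s'$ onward for the remaining steps. Along the path from $s$ to $s'$, we receive rewards according to $\pi$, and after arriving at $s'$, we have missed at most $h := h_{s \to s'}(M, \pi)$ many rewards of $r_{\max}$, so in expectation
$$u_i(s) \ge \mathbb{E}\Big[\sum_{t=0}^{h-1} r_t + u_i(s') - h\, r_{\max}\Big].$$
By definition of $h$, the hitting time,
$$u_i(s) \ge u_i(s') - \mathbb{E}\Big[\sum_{t=0}^{h-1} (r_{\max} - r_t)\Big].$$
Moving the terms around, we get
$$u_i(s') - u_i(s) \le \mathbb{E}\Big[\sum_{t=0}^{h-1} (r_{\max} - r_t)\Big].$$
Since this holds for any $\pi$, by optimality, we can choose one with the smallest expected hitting cost:
$$u_i(s') - u_i(s) \le \min_{\pi} \mathbb{E}\Big[\sum_{t=0}^{h_{s \to s'}(M, \pi)-1} (r_{\max} - r_t)\Big].$$
Since $s, s'$ are arbitrary, we can maximize over pairs of states on both sides and get
$$\max_{s'} u_i(s') - \min_{s} u_i(s) \le \max_{s, s'} \min_{\pi} \mathbb{E}\Big[\sum_{t=0}^{h_{s \to s'}(M, \pi)-1} (r_{\max} - r_t)\Big] = \kappa(M).$$
It should be noted that even in some cases where the hitting time is infinite, in a non-communicating MDP for example, $\kappa(M)$ can still be finite and this inequality is still true. In these cases, $r_t = r_{\max}$ except for finitely many terms, implying $\rho^*(M) = r_{\max}$. ∎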
A.2 Proof of Theorem 2
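Given an MDP $M$ with finite maximum expected hitting cost $\kappa(M) < \infty$ and an unsaturated optimal average reward $\rho^*(M) < r_{\max}$, the maximum expected hitting cost of any PBRS-parameterized MDP $M^{\varphi}$ is bounded by a multiplicative factor of two:
$$\tfrac{1}{2} \kappa(M) \le \kappa(M^{\varphi}) \le 2 \kappa(M).$$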
Proof.
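We denote the expected hitting cost between two states $s, s'$ in $M$ as
$$c(s, s') := \min_{\pi} \mathbb{E}\Big[\sum_{t=0}^{h_{s \to s'}(M, \pi)-1} (r_{\max} - r_t)\Big],$$
and write $c^{\varphi}(s, s')$ for the same quantity in the shaped MDP $M^{\varphi}$.
Suppose that the pair of states $(s, s')$ maximizes the expected hitting cost in $M$, which is assumed to be finite:
$$\kappa(M) = c(s, s') < \infty.$$
Furthermore, the condition that $\rho^*(M) < r_{\max}$ implies that the hitting times are finite for the minimizing policies. This ensures that the destination state is actually hit in the stochastic process.
Considering the expected hitting cost of the reverse pair, $(s', s)$,
$$c(s, s') + c(s', s) \ge c(s, s') = \kappa(M), \tag{11}$$
since hitting costs are nonnegative.
With shaping, writing $h := h_{s \to s'}(M, \pi)$,
$$c^{\varphi}(s, s') = \min_{\pi} \mathbb{E}\Big[\sum_{t=0}^{h-1} \big(r_{\max} - r_t + \varphi(s_t) - \varphi(s_{t+1})\big)\Big].$$
By telescoping sums,
$$c^{\varphi}(s, s') = \min_{\pi} \mathbb{E}\Big[\varphi(s_0) - \varphi(s_h) + \sum_{t=0}^{h-1} (r_{\max} - r_t)\Big].$$
By definition of a finite hitting time, $s_0 = s$ and $s_h = s'$, so
$$c^{\varphi}(s, s') = \varphi(s) - \varphi(s') + c(s, s'), \tag{12}$$
and the minimizing policy for a state pair will not change, because the potential terms do not depend on the policy. Therefore,
$$\kappa(M^{\varphi}) \ge \max\{c^{\varphi}(s, s'), c^{\varphi}(s', s)\} \ge \tfrac{1}{2}\big(c^{\varphi}(s, s') + c^{\varphi}(s', s)\big) = \tfrac{1}{2}\big(c(s, s') + c(s', s)\big) \ge \tfrac{1}{2} \kappa(M),$$
where the equality uses (12) for both pairs, whose potential terms cancel, and the last step uses (11).
We obtain the other half of the inequality by observing $M = (M^{\varphi})^{-\varphi}$. ∎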