# A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret

Recently, model-free reinforcement learning has attracted research attention due to its simplicity, memory and computation efficiency, and its flexibility to combine with function approximation. In this paper, we propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs) that achieves a regret bound of O(√T) for the general class of weakly communicating MDPs, where T is the number of interactions. EE-QL assumes that an online concentrating approximation of the optimal average reward is available. This is the first model-free learning algorithm that achieves O(√T) regret without the ergodicity assumption, and it matches the lower bound in terms of T up to logarithmic factors. Experiments show that the proposed algorithm performs as well as the best known model-based algorithms.



## 1 Introduction

Reinforcement learning (RL) studies the problem of an agent interacting with an unknown environment while trying to maximize its cumulative reward. The agent faces a fundamental exploration-exploitation trade-off: should it explore the environment to gain more information for future decisions, or should it exploit the available information to maximize the reward? Efficient exploration is a crucial property of learning algorithms, and it is evaluated with the notion of regret: the difference between the cumulative reward of the optimal policy and that of the algorithm. Regret quantifies the speed of learning; low-regret algorithms learn more efficiently.

RL algorithms can broadly be classified as model-based and model-free. Model-based algorithms maintain an estimate of the environment dynamics and plan based on the estimated model. Model-free algorithms, on the other hand, directly estimate the value function or the policy without explicitly estimating the environment model. Model-free algorithms are simpler, memory and computation efficient, and more amenable to large-scale problems through function approximation. Indeed, most of the recent advances in RL such as DQN (Mnih et al., 2013), TRPO (Schulman et al., 2015), A3C (Mnih et al., 2016), and PPO (Schulman et al., 2017) are in the model-free paradigm.

It was believed that model-based algorithms can better manage the trade-off between exploration and exploitation. Several model-based algorithms with low regret guarantees have been proposed in the past decade, including UCRL2 (Jaksch et al., 2010), REGAL (Bartlett and Tewari, 2009), PSRL (Ouyang et al., 2017), UCBVI (Azar et al., 2017), SCAL (Fruit et al., 2018), EBF (Zhang and Ji, 2019), and EULER (Zanette and Brunskill, 2019). However, the recent practical success of model-free algorithms raised the theoretical question of whether it is possible to design model-free algorithms with low regret guarantees. In Jin et al. (2018), it was shown for the first time that (model-free) Q-learning (QL) with UCB exploration can achieve a near-optimal Õ(√T) regret bound in episodic finite-horizon Markov Decision Processes (MDPs), where Õ(·) hides constants and logarithmic factors. This result was extended by Dong et al. (2019) to the infinite-horizon discounted setting.

However, designing model-free algorithms with near-optimal regret in the infinite-horizon average-reward setting has been rather challenging. The main difficulty in this setting is that the estimate of the Q-value function may grow unbounded over time due to the infinite-horizon nature of the problem and the lack of a discount factor. Moreover, the contraction property of the discounted setting does not hold, and the backward induction technique of the finite-horizon setting cannot be applied here.

This paper presents Exploration Enhanced Q-learning (EE-QL), the first model-free algorithm that achieves Õ(√T) regret for infinite-horizon average-reward MDPs without the strong ergodicity assumption. We consider the general class of weakly communicating MDPs with finite state and action spaces. In prior work (Wei et al., 2020), the Optimistic QL algorithm does not need the strong ergodicity assumption but achieves only Õ(T^{2/3}) regret, while the MDP-OOMD algorithm in the same paper achieves Õ(√T) regret but needs the strong ergodicity assumption. Our result matches the lower bound of Jaksch et al. (2010) in terms of T up to logarithmic factors. For a comparison to other model-based and model-free algorithms, see Table 1.

EE-QL (read "equal") uses stochastic approximation to estimate the Q-value function, assuming that a concentrating estimate of the optimal gain J∗ is available. The key idea of this algorithm is the careful design of the learning rate, which efficiently balances the effect of new and old observations and controls the magnitude of the Q-value function. In contrast to the typical learning rate of 1/n (where n is the number of visits to the corresponding state-action pair) in standard Q-learning type algorithms, the proposed EE-QL algorithm uses a learning rate of 1/√n. This learning rate provides nice properties (listed in Lemma 4) that are central to our analysis. In addition, experiments show that EE-QL significantly outperforms the existing model-free algorithms and performs similarly to the best model-based algorithms. This is due to the fact that, unlike previous model-free algorithms in the tabular setting that optimistically estimate each entry of the optimal Q-value function (Jin et al., 2018; Dong et al., 2019; Wei et al., 2020), EE-QL estimates a single scalar (the optimal gain) optimistically, avoiding unnecessary optimism.

## 2 Preliminaries

We consider infinite-horizon average-reward MDPs described by the tuple (S, A, r, p), where S is the state space, A is the action space, r is the deterministic reward function, and p is the transition kernel. Here S and A are finite sets with cardinalities S and A, respectively. The gain of a stationary deterministic policy π with initial state s is defined as

$$J^\pi(s) := \liminf_{T \to \infty} \frac{1}{T}\,\mathbb{E}\left[\sum_{t=1}^{T} r(s_t, \pi(s_t)) \,\Big|\, s_1 = s\right],$$

where $s_{t+1} \sim p(\cdot \,|\, s_t, \pi(s_t))$ for $t \geq 1$. Let J∗ := sup_π J^π(s) be the optimal gain. The optimal gain is independent of the initial state for the standard class of weakly communicating MDPs considered in this paper. An MDP is weakly communicating if its state space can be divided into two subsets: in the first subset, all the states are transient under any stationary policy; in the second subset, every state is accessible from any other state under some stationary policy. It is known that the weakly communicating assumption is required to achieve low regret (Bartlett and Tewari, 2009). From standard MDP theory (Puterman, 2014), we know that for weakly communicating MDPs, there exists a function q∗ (unique up to an additive constant) such that the following Bellman equation holds:

$$J^* + q^*(s,a) = r(s,a) + \mathbb{E}_{s' \sim p(\cdot|s,a)}\left[\max_b q^*(s', b)\right], \tag{1}$$

for all s ∈ S and a ∈ A. The optimal gain is achieved by the corresponding optimal policy π∗ with π∗(s) ∈ argmax_a q∗(s, a) (note that such a policy may not be unique).
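As a side illustration (not part of the paper), the Bellman equation (1) can be solved numerically on a small MDP by relative value iteration; the two-state MDP below and all of its numbers are made up for this sketch.

```python
import numpy as np

def relative_value_iteration(r, p, iters=5000):
    """Solve J + q(s,a) = r(s,a) + E_{s'~p(.|s,a)}[max_b q(s',b)]
    by relative value iteration; returns (gain J, bias q).
    r: (S, A) reward table, p: (S, A, S) transition kernel."""
    S, A = r.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        v = q.max(axis=1)              # v(s) = max_b q(s, b)
        backup = r + p @ v             # one Bellman backup, shape (S, A)
        J = backup[0, 0] - q[0, 0]     # gain estimate at a reference pair
        q = backup - backup[0, 0]      # re-center: q is unique up to a constant
    return J, q

# illustrative 2-state, 2-action MDP (numbers are made up)
r = np.array([[0.0, 0.2], [1.0, 0.5]])
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
J, q = relative_value_iteration(r, p)
# residual of the Bellman equation (1); should be ~0 at convergence
residual = np.abs(J + q - (r + p @ q.max(axis=1))).max()
```

At the fixed point, the re-centered iterate satisfies the Bellman equation exactly, so the residual vanishes and J converges to the optimal gain J∗.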

In this paper, we consider the reinforcement learning problem of an agent interacting with a weakly communicating MDP whose transition kernel and reward function are unknown (thus, the Bellman equation cannot be solved directly). At each time t, the agent observes the state s_t, takes action a_t, and receives the reward r(s_t, a_t). The next state s_{t+1} is then determined according to the probability distribution p(·|s_t, a_t). The performance of the learning algorithm is quantified by the notion of cumulative regret, defined as

$$R_T := \sum_{t=1}^{T}\big(J^* - r(s_t, a_t)\big).$$

Regret evaluates the transient performance of the learning algorithm by measuring the difference between the total gain of the optimal policy and the cumulative reward obtained by the learning algorithm up to time T. The goal of the agent is to maximize the total reward (or, equivalently, minimize the regret). If a learning algorithm achieves sub-linear regret, its average reward converges to the optimal gain. Zhang and Ji (2019) proposed a model-based algorithm with a regret bound of Õ(√(DSAT)) (where D is the diameter of the MDP), which matches the lower bound of Jaksch et al. (2010). The best existing regret bound of a model-free algorithm for weakly communicating MDPs is Õ(T^{2/3}), by Wei et al. (2020).
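As a quick illustration of the definition above, the cumulative regret of a trajectory can be computed directly from its reward sequence; the Bernoulli rewards and the value J∗ = 0.5 below are made up for this sketch.

```python
import numpy as np

def cumulative_regret(rewards, J_star):
    """R_T = sum_{t<=T} (J* - r(s_t, a_t)), returned for every prefix T."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(J_star - rewards)

# a hypothetical policy earning 0.4 on average against an optimal gain of 0.5
rng = np.random.default_rng(0)
rewards = rng.binomial(1, 0.4, size=1000)
R = cumulative_regret(rewards, J_star=0.5)
```

For this suboptimal policy, R grows roughly linearly at rate 0.5 − 0.4 = 0.1 per step, whereas a low-regret learner's curve would flatten over time.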

## 3 The Exploration Enhanced Q-learning Algorithm

In this section, we introduce the Exploration Enhanced Q-learning (EE-QL) algorithm (see Algorithm 1). The algorithm works for the broad class of weakly communicating MDPs. It is well-known that the weakly communicating condition is necessary to achieve sublinear regret (Bartlett and Tewari, 2009).

EE-QL approximates the Q-value function for the infinite-horizon average-reward setting using stochastic approximation with carefully chosen learning rates. The algorithm takes greedy actions with respect to the current estimate Q_t. After visiting the next state, a stochastic update of Q is made based on the Bellman equation. The quantity J̄_t in the algorithm is an estimate of J∗ that satisfies the following assumption.
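To make the loop above concrete, here is a speculative Python sketch of the described update; Algorithm 1 is not reproduced in this extract, so the exact pseudocode may differ. The learning rate 1/√n, the `gain_estimate` interface, and the toy `TwoStateEnv` environment are all assumptions made for illustration.

```python
import numpy as np

def ee_ql_sketch(env, S, A, T, gain_estimate):
    """Hypothetical sketch of the EE-QL update loop (not the paper's exact
    pseudocode). gain_estimate(t, avg_reward) returns the estimate J-bar_t."""
    Q = np.zeros((S, A))
    n = np.zeros((S, A), dtype=int)       # visit counts per state-action pair
    s = env.reset()
    total_reward = 0.0
    for t in range(1, T + 1):
        a = int(np.argmax(Q[s]))          # greedy action w.r.t. current Q
        s_next, r = env.step(a)
        total_reward += r
        n[s, a] += 1
        alpha = 1.0 / np.sqrt(n[s, a])    # assumed learning rate 1/sqrt(n)
        J_bar = gain_estimate(t, total_reward / t)
        # stochastic version of the Bellman update: r - J-bar + max_b Q(s', b)
        target = r - J_bar + Q[s_next].max()
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next
    return Q

class TwoStateEnv:
    """Tiny deterministic environment, for illustration only."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        r = 1.0 if (self.s == 1 and a == 1) else 0.0
        self.s = 1 if a == 1 else 0       # action 1 moves to / stays in state 1
        return self.s, r

# optimistic gain estimate: running average plus a decaying bonus
Q = ee_ql_sketch(TwoStateEnv(), S=2, A=2, T=2000,
                 gain_estimate=lambda t, avg: avg + 0.5 / np.sqrt(t))
```

Note how the optimistic J̄_t drives exploration here: with a zero-initialized Q, the decaying bonus pushes down the value of actions that keep earning less than the estimated gain, nudging the greedy policy toward untried actions.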

###### Assumption 1.

(Concentrating J̄_t) There exists a constant c > 0 such that |J̄_t − J∗| ≤ c/√t for all t ≥ 1.

In some applications, J∗ is known a priori. For example, in the infinite-horizon version of Cartpole described in Hao et al. (2020), the optimal policy keeps the pole upright throughout the horizon, which leads to a known J∗. In such cases, one can simply set J̄_t = J∗. In applications where J∗ is not known, one can set J̄_t = Ĵ_t + C/√t for some constant C > 0, where Ĵ_t is a stochastic estimate of the average reward updated with some decaying learning rate. In particular, the learning rate 1/t yields

$$\bar{J}_t = \frac{1}{t}\sum_{t'=1}^{t} r(s_{t'}, a_{t'}) + \frac{C}{\sqrt{t}}. \tag{2}$$

We have numerically verified that this choice of J̄_t satisfies Assumption 1 in the RiverSwim and RandomMDP environments (see Section 5 for more details). The choice of the learning rate α_τ is particularly important. Choosing α_τ = 1/√τ (rather than 1/τ) efficiently combines the new and old observations and provides the nice properties listed in Lemma 4 that play a central role in the analysis. The widely used learning rate of 1/τ in the standard Q-learning algorithm (Abounadi et al., 2001) may not satisfy these properties.
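The optimistic gain estimate of (2) is cheap to compute for an entire reward trajectory at once; a minimal sketch, where the constant C and the reward sequence are illustrative choices, not values from the paper:

```python
import numpy as np

def optimistic_gain(rewards, C=1.0):
    """J-bar_t = (1/t) * sum_{t'<=t} r_{t'} + C / sqrt(t), as in Eq. (2).
    Returns the vector of estimates for t = 1..T."""
    rewards = np.asarray(rewards, dtype=float)
    t = np.arange(1, len(rewards) + 1)
    return np.cumsum(rewards) / t + C / np.sqrt(t)

J_bar = optimistic_gain([1, 0, 1, 0], C=1.0)
```

The running-average term concentrates around the achieved average reward, while the C/√t bonus keeps the estimate on the optimistic side at a rate compatible with Assumption 1.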

In addition, unlike the Q-learning algorithms with UCB exploration (Jin et al., 2018; Dong et al., 2019; Wei et al., 2020), EE-QL does not optimistically estimate the Q-value function. In the case that J∗ is known, the algorithm need not follow the optimism-in-the-face-of-uncertainty principle as in (Jin et al., 2018; Dong et al., 2019; Wei et al., 2020; Jaksch et al., 2010). However, our numerical experiments show that if J∗ is not known, J̄_t has to be an optimistic estimate of the average reward as in (2). Thus, EE-QL is economical in its use of optimism: instead of maintaining optimistic confidence intervals around each entry of the Q function, our algorithm is optimistic around a single scalar, J∗. This leads to a significant improvement in numerical performance compared to the literature (see Section 5). We now state the main regret guarantee of Algorithm 1.

###### Theorem 1.

Under Assumption 1, the EE-QL algorithm ensures that, with probability at least 1 − δ, R_T = Õ(√(SAT)), where the hidden factors depend on sp(v∗) and the constant c defined in Assumption 1.

This result improves over the previous best known regret bound of Õ(T^{2/3}) by Wei et al. (2020) and matches the lower bound of Ω(√(DSAT)) (Jaksch et al., 2010) in terms of T up to logarithmic factors. To the best of our knowledge, this is the first model-free algorithm that achieves an Õ(√T) regret bound for the general class of weakly communicating MDPs in the infinite-horizon average-reward setting.

## 4 Analysis

In this section, we provide the proof of Theorem 1. Before starting the analysis, define

$$\alpha^i_\tau := \alpha_i \prod_{j=i+1}^{\tau}(1 - \alpha_j) \tag{3}$$

for i ≤ τ, where α_j is the learning rate used in Algorithm 1. The quantity α^i_τ determines the effect of the i-th step on the τ-th update. It has nice properties that are listed in Lemma 4 and are central to our analysis. In particular, the Õ(√T) regret bound is due primarily to properties 2 and 4 in Lemma 4.
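The weights in (3) are easy to compute and inspect numerically. The sketch below assumes the learning rate α_j = 1/√j discussed in Section 3 and checks that the weights form a convex combination (they sum to one because α_1 = 1):

```python
import numpy as np

def weights(tau):
    """alpha^i_tau = alpha_i * prod_{j=i+1..tau} (1 - alpha_j), per Eq. (3),
    with the assumed learning rate alpha_j = 1 / sqrt(j)."""
    alpha = 1.0 / np.sqrt(np.arange(1, tau + 1))   # alpha_1, ..., alpha_tau
    w = np.empty(tau)
    for i in range(1, tau + 1):
        # empty product (i = tau) is 1, so w[tau-1] = alpha_tau
        w[i - 1] = alpha[i - 1] * np.prod(1.0 - alpha[i:tau])
    return w

w = weights(50)
```

With this rate, the most recent step carries weight α_τ = 1/√τ, so new observations keep a substantial influence, while older steps decay only polynomially rather than geometrically; this is the balance between new and old data discussed in Section 3.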

### 4.1 Proof of Theorem 1

###### Proof.

We start by decomposing the regret using Lemma 7. With probability at least 1 − δ, the regret of any algorithm can be bounded by

$$R_T \leq \mathrm{sp}(v^*) + \mathrm{sp}(v^*)\sqrt{\tfrac{1}{2}T\ln\tfrac{1}{\delta}} + \sum_{t=1}^{T}\Delta(s_t, a_t), \tag{4}$$

where Δ(s, a) := v∗(s) − q∗(s, a). It thus suffices to bound Σ_{t=1}^T Δ(s_t, a_t). Let n_t(s, a) denote the number of visits to the state-action pair (s, a) before time t. For notational simplicity, let n_t := n_t(s_t, a_t), and let t_i(s, a) be the time step at which (s, a) is visited for the i-th time. We can write:

$$\sum_{t=1}^{T}\big[v_t(s_t) - v^*(s_t) + \Delta(s_t, a_t)\big] = \sum_{t=1}^{T}\big[Q_t(s_t, a_t) - v^*(s_t) + \Delta(s_t, a_t)\big] = \sum_{t=1}^{T}\big[Q_t(s_t, a_t) - q^*(s_t, a_t)\big], \tag{5}$$

where the first equality is by the fact that v_t(s_t) = Q_t(s_t, a_t) (the algorithm acts greedily) and the second equality is by the definition of Δ. The second term on the right-hand side can be bounded using the update rule of the algorithm (see Lemma 3). The rest of the proof proceeds by writing the first term on the right-hand side in terms of the left-hand side (to telescope) plus some sublinear additive terms. We can write:

$$= \sum_{s,a}\sum_{t=1}^{T}\mathbb{1}(s_t = s, a_t = a)\big[Q_{t+1}(s,a) - q^*(s,a)\big]\,\mathbb{1}\big(n_{t+1}(s,a) \geq 1\big) =: R_1.$$

By Lemma 6, the term R_1 can be written as:

$$\begin{aligned} R_1 &= \sum_{s,a}\sum_{t=1}^{T}\mathbb{1}(s_t = s, a_t = a)\Bigg\{\sum_{i=1}^{n_{t+1}(s,a)}\alpha^i_{n_{t+1}(s,a)}\big[J^* - \bar{J}_{t_i(s,a)}\big] \\ &\qquad + \sum_{i=1}^{n_{t+1}(s,a)}\alpha^i_{n_{t+1}(s,a)}\big[v_{t_i(s,a)}(s_{t_i(s,a)+1}) - v^*(s_{t_i(s,a)+1})\big] \\ &\qquad + \sum_{i=1}^{n_{t+1}(s,a)}\alpha^i_{n_{t+1}(s,a)}\big[v^*(s_{t_i(s,a)+1}) - \mathbb{E}_{s' \sim p(\cdot|s,a)}v^*(s')\big]\Bigg\} \\ &= \sum_{s,a}\sum_{j=1}^{n_{T+1}(s,a)}\Bigg\{\sum_{i=1}^{j}\alpha^i_j\big[J^* - \bar{J}_{t_i(s,a)}\big] + \sum_{i=1}^{j}\alpha^i_j\big[v_{t_i(s,a)}(s_{t_i(s,a)+1}) - v^*(s_{t_i(s,a)+1})\big] \\ &\qquad + \sum_{i=1}^{j}\alpha^i_j\big[v^*(s_{t_i(s,a)+1}) - \mathbb{E}_{s' \sim p(\cdot|s,a)}v^*(s')\big]\Bigg\}. \end{aligned}$$

By changing the order of summation over i and j, we can write:

$$\begin{aligned} R_1 &= \sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\Bigg\{\big[J^* - \bar{J}_{t_i(s,a)}\big]\sum_{j=i}^{n_{T+1}(s,a)}\alpha^i_j + \big[v_{t_i(s,a)}(s_{t_i(s,a)+1}) - v^*(s_{t_i(s,a)+1})\big]\sum_{j=i}^{n_{T+1}(s,a)}\alpha^i_j \\ &\qquad + \big[v^*(s_{t_i(s,a)+1}) - \mathbb{E}_{s' \sim p(\cdot|s,a)}v^*(s')\big]\sum_{j=i}^{n_{T+1}(s,a)}\alpha^i_j\Bigg\}. \end{aligned}$$

We proceed by upper bounding each term on the right-hand side using Lemma 4(3). Note that

$$\big|J^* - \bar{J}_{t_i(s,a)}\big| \leq \frac{c}{\sqrt{t_i(s,a)}} \leq \frac{c}{\sqrt{i}}, \qquad -\mathrm{sp}(v^*) \leq v^*(s_{t_i(s,a)+1}) - \mathbb{E}_{s' \sim p(\cdot|s,a)}v^*(s') \leq \mathrm{sp}(v^*).$$

Moreover, note that q∗ is unique up to an additive constant. So, without loss of generality, we choose q∗ such that ‖q∗‖∞ ≤ sp(q∗) ≤ B, where B is the uniform bound on ‖Q_t‖∞ (and ‖v_t‖∞) as in Lemma 2. This choice of q∗ implies that |v_t(s) − v∗(s)| ≤ 2B for all t and s. Replacing these bounds in Lemma 4(3) implies

$$\begin{aligned} R_1 &\leq \sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\Bigg\{J^* - \bar{J}_{t_i(s,a)} + \frac{5c}{2i} + \frac{c}{\sqrt{i}}\Big(1 - \frac{1}{\sqrt{i+1}}\Big)^{n_{T+1}(s,a)-i+1} \\ &\qquad + v_{t_i(s,a)}(s_{t_i(s,a)+1}) - v^*(s_{t_i(s,a)+1}) + \big(2B + \mathrm{sp}(v^*)\big)\frac{5}{2\sqrt{i}} \\ &\qquad + v^*(s_{t_i(s,a)+1}) - \mathbb{E}_{s' \sim p(\cdot|s,a)}v^*(s') + \mathrm{sp}(v^*)\Big(\frac{5}{2\sqrt{i}} + \Big(1 - \frac{1}{\sqrt{i+1}}\Big)^{n_{T+1}(s,a)-i+1}\Big)\Bigg\}. \end{aligned} \tag{6}$$

To simplify the right hand side of the above inequality, observe that

$$\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\big(J^* - \bar{J}_{t_i(s,a)}\big) = \sum_{s,a}\sum_{t=1}^{T}\mathbb{1}(s_t = s, a_t = a)\big(J^* - \bar{J}_t\big) = \sum_{t=1}^{T}\big(J^* - \bar{J}_t\big). \tag{7}$$

Similarly,

$$\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\big(v_{t_i(s,a)}(s_{t_i(s,a)+1}) - v^*(s_{t_i(s,a)+1})\big) = \sum_{t=1}^{T}\big(v_t(s_{t+1}) - v^*(s_{t+1})\big), \tag{8}$$

$$\sum_{s,a}\sum_{i=1}^{n_{T+1}(s,a)}\big(v^*(s_{t_i(s,a)+1}) - \mathbb{E}_{s' \sim p(\cdot|s,a)}v^*(s')\big) = \sum_{t=1}^{T}\big(v^*(s_{t+1}) - \mathbb{E}_{s' \sim p(\cdot|s_t,a_t)}v^*(s')\big). \tag{9}$$

Using the inequalities in Lemma 5 and Lemma 4(5), substituting the equalities (7), (8), and (9) into the right-hand side of (6), and adding and subtracting Σ_{t=1}^T v_{t+1}(s_{t+1}) implies

$$\begin{aligned} R_1 &\leq \sum_{t=1}^{T}\big(J^* - \bar{J}_t\big) + \frac{5cSA}{2}(1 + \ln T) + 2\sqrt{2}\,cSA \\ &\quad + \sum_{t=1}^{T}\big(v_{t+1}(s_{t+1}) - v^*(s_{t+1})\big) + \sum_{t=1}^{T}\big(v_t(s_{t+1}) - v_{t+1}(s_{t+1})\big) + 5\big(2B + \mathrm{sp}(v^*)\big)\sqrt{SAT} \\ &\quad + \sum_{t=1}^{T}\big(v^*(s_{t+1}) - \mathbb{E}_{s' \sim p(\cdot|s_t,a_t)}[v^*(s')]\big) + 6\,\mathrm{sp}(v^*)\sqrt{SAT}. \end{aligned} \tag{10}$$

Note that by Assumption 1, Σ_{t=1}^T (J∗ − J̄_t) ≤ Σ_{t=1}^T c/√t ≤ 2c√T. Furthermore, v∗(s_{t+1}) − E_{s'∼p(·|s_t,a_t)}[v∗(s')] is a martingale difference sequence, and its sum can be bounded by sp(v∗)√(½T ln(1/δ)) with probability at least 1 − δ, using Azuma's inequality. Moreover, Σ_{t=1}^T (v_t(s_{t+1}) − v_{t+1}(s_{t+1})) can be bounded by Lemma 8. Replacing these bounds on the right-hand side of the above inequality, simplifying the result, and plugging back into (5) implies

$$\begin{aligned} \sum_{t=1}^{T}\big[v_t(s_t) - v^*(s_t) + \Delta(s_t, a_t)\big] &\leq \sum_{t=1}^{T}\big(v_{t+1}(s_{t+1}) - v^*(s_{t+1})\big) + \big(14B + 11\,\mathrm{sp}(v^*) + 4\big)\sqrt{SAT} \\ &\quad + 2c\sqrt{T} + \mathrm{sp}(v^*)\sqrt{\tfrac{1}{2}T\ln\tfrac{1}{\delta}} + \frac{9cSA}{2}\ln T + \Big(\frac{9}{2} + 2\sqrt{2}\Big)cSA, \end{aligned}$$

with probability at least 1 − δ. Telescoping the left-hand side against the right-hand side, and noting that the leftover terms after telescoping are at most 2B + sp(v∗) (by Lemma 2), implies that

$$\begin{aligned} \sum_{t=1}^{T}\Delta(s_t, a_t) &\leq \big(14B + 11\,\mathrm{sp}(v^*) + 4\big)\sqrt{SAT} + 2c\sqrt{T} + \mathrm{sp}(v^*)\sqrt{\tfrac{1}{2}T\ln\tfrac{1}{\delta}} \\ &\quad + \frac{9cSA}{2}\ln T + \Big(\frac{9}{2} + 2\sqrt{2}\Big)cSA + 2B + \mathrm{sp}(v^*), \end{aligned}$$

with probability at least 1 − δ. Replacing this bound into (4) implies that

$$\begin{aligned} R_T &\leq \big(14B + 11\,\mathrm{sp}(v^*) + 4\big)\sqrt{SAT} + 2c\sqrt{T} + 2\,\mathrm{sp}(v^*)\sqrt{\tfrac{1}{2}T\ln\tfrac{2}{\delta}} \\ &\quad + \frac{9cSA}{2}\ln T + \Big(\frac{9}{2} + 2\sqrt{2}\Big)cSA + 2B + 2\,\mathrm{sp}(v^*), \end{aligned}$$

with probability at least 1 − δ, which completes the proof. ∎

### 4.2 Auxiliary Lemmas

In this section, we provide some auxiliary lemmas that are used in the proof of Theorem 1. The proof for these lemmas can be found in the appendix.

###### Lemma 2.

The iterates Q_t in Algorithm 1 are bounded by ‖Q_t‖∞ ≤ sp(q∗) + cSA(1 + ln(t − 1)).

###### Lemma 3.

The second term of (4.1) can be bounded by

###### Lemma 4.

The following properties hold:

1. for any .

2. For any , and any , we have .

3. Let be a scalar and define and . Then, for any , and any , we have .

4. For any , we have .

5. For any , we have .

###### Lemma 5 (Frequently used inequalities).

The following inequalities hold:

1. .

2. .

3. .

###### Lemma 6.

For a fixed pair (s, a), let τ := n_t(s, a), and let t_i be the time step at which (s, a) is taken for the i-th time. Then,

$$\begin{aligned} \big(Q_t(s,a) - q^*(s,a)\big)\mathbb{1}(\tau \geq 1) &= \sum_{i=1}^{\tau}\alpha^i_\tau\big[J^* - \bar{J}_{t_i}\big] + \sum_{i=1}^{\tau}\alpha^i_\tau\big[v_{t_i}(s_{t_i+1}) - v^*(s_{t_i+1})\big] \\ &\quad + \sum_{i=1}^{\tau}\alpha^i_\tau\big[v^*(s_{t_i+1}) - \mathbb{E}_{s' \sim p(\cdot|s,a)}v^*(s')\big]. \end{aligned}$$
###### Lemma 7.

With probability at least , the regret of any algorithm is bounded as

$$R_T \leq \mathrm{sp}(v^*) + \mathrm{sp}(v^*)\sqrt{\tfrac{1}{2}T\ln\tfrac{1}{\delta}} + \sum_{t=1}^{T}\big[v^*(s_t) - q^*(s_t, a_t)\big].$$

## 5 Experiments

In this section, we numerically evaluate the performance of our proposed EE-QL algorithm. Two environments are considered: RandomMDP and RiverSwim. The RandomMDP environment is an ergodic MDP in which the transition kernel and the rewards are chosen uniformly at random. The RiverSwim environment is a weakly communicating MDP with states arranged in a chain and two actions (left and right) that simulates an agent swimming in a river. If the agent swims left (i.e., in the direction of the river current), it is always successful. If it decides to swim right, it may fail with some probability. The reward is zero everywhere except for a small reward for swimming left in the leftmost state and a large reward for swimming right in the rightmost state. The agent starts from the leftmost state. The optimal policy is to always swim right to reach the high-reward rightmost state.
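A minimal RiverSwim-style simulator matching the description above can be sketched as follows; the chain length, success probability, and reward values are illustrative placeholders, since the exact parameters are not given in this extract.

```python
import numpy as np

class RiverSwim:
    """RiverSwim-style chain MDP (parameters are illustrative, not the
    paper's exact values). Actions: 0 = swim left (with the current),
    1 = swim right (against the current, may fail)."""
    def __init__(self, n_states=6, p_success=0.6, seed=0):
        self.n = n_states
        self.p_success = p_success
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.s = 0                       # start in the leftmost state
        return self.s

    def step(self, a):
        if a == 0:                       # swimming left always succeeds
            self.s = max(self.s - 1, 0)
        elif self.rng.random() < self.p_success:
            self.s = min(self.s + 1, self.n - 1)
        # small reward at the leftmost state, large at the rightmost
        if self.s == 0 and a == 0:
            r = 0.005
        elif self.s == self.n - 1 and a == 1:
            r = 1.0
        else:
            r = 0.0
        return self.s, r

# the optimal behavior: always swim right toward the high-reward state
env = RiverSwim()
s = env.reset()
total = 0.0
for _ in range(2000):
    s, r = env.step(1)
    total += r
```

The sparse, distant reward on the right end is exactly what makes RiverSwim a hard exploration benchmark: a greedy agent without optimism keeps collecting the small left-end reward and never discovers the rightmost state.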

We compare our algorithm against Optimistic QL (Wei et al., 2020), MDP-OOMD (Wei et al., 2020), and Politex (Abbasi-Yadkori et al., 2019a) as model-free algorithms, and UCRL2 (Jaksch et al., 2010) and PSRL (Ouyang et al., 2017) as model-based benchmarks. The hyperparameters for these algorithms are tuned to obtain the best performance (see Table 2 for more details). J̄_t is chosen as in (2) with an appropriate constant C (see Table 2). We numerically verified that this choice of J̄_t satisfies Assumption 1. Figure 1 shows that in the RiverSwim environment, our algorithm significantly outperforms Optimistic QL, the only existing model-free algorithm with low regret for weakly communicating MDPs. The reason is that the proposed algorithm does not spend optimism on the entire Q function; rather, the optimism-in-the-face-of-uncertainty principle is applied to a single scalar, J∗. Note that other model-free algorithms such as Politex and MDP-OOMD did not yield sub-linear regret in RiverSwim and are thus removed from the figure; RiverSwim does not satisfy the ergodicity assumption required by these algorithms. Moreover, in both the RiverSwim and RandomMDP environments, our algorithm performs as well as the best existing model-based algorithms in practice, while using less memory.

## Conclusions

We proposed EE-QL, the first model-free algorithm with an Õ(√T) regret bound for weakly communicating MDPs in the infinite-horizon average-reward setting. Our algorithm exhibits strong numerical performance, significantly better than the existing model-free algorithms and similar to the best model-based algorithms, yet with less memory. The key to this performance is avoiding optimistic estimation of each entry of the Q function; instead, EE-QL uses optimism for a single scalar, the gain of the optimal policy. Our algorithm assumes that a concentrating estimate of J∗ is available. This assumption is verified numerically for an optimistic empirical average-reward estimator; its theoretical verification is left for future work.

## Appendix A Proof of Lemma 2

Lemma 2 (Restated). The iterates Q_t in Algorithm 1 are bounded by

$$\|Q_t\|_\infty \leq \mathrm{sp}(q^*) + cSA\big(1 + \ln(t-1)\big).$$
###### Proof.

We first prove the claim for the case where J̄_t = J∗ for all t, and then extend the proof to the general case. Let G_{sas'} be an operator on the space of Q-functions defined by

$$[G_{sas'}Q](x,u) = \begin{cases} (1-\alpha_\tau)\,Q(s,a) + \alpha_\tau\big(r(s,a) - J^* + \max_b Q(s',b)\big), & \text{if } (x,u) = (s,a), \\ Q(x,u), & \text{otherwise,} \end{cases}$$

where τ is arbitrary. Note that G_{sas'} is a non-expansive operator, because

$$\begin{aligned} G_{sas'}Q_1(s,a) - G_{sas'}Q_2(s,a) &= (1-\alpha_\tau)\big(Q_1(s,a) - Q_2(s,a)\big) + \alpha_\tau\big(\max_b Q_1(s',b) - \max_b Q_2(s',b)\big) \\ &\leq (1-\alpha_\tau)\big(Q_1(s,a) - Q_2(s,a)\big) + \alpha_\tau\big(Q_1(s', b_1^*) - Q_2(s', b_1^*)\big) \\ &\leq (1-\alpha_\tau)\|Q_1 - Q_2\|_\infty + \alpha_\tau\|Q_1 - Q_2\|_\infty \\ &= \|Q_1 - Q_2\|_\infty, \end{aligned}$$

where b_1^* := argmax_b Q_1(s', b). Thus, ‖G_{sas'}Q_1 − G_{sas'}Q_2‖∞ ≤ ‖Q_1 − Q_2‖∞. Moreover, note that q∗ is a fixed point of G_{sas'} by the Bellman equation, i.e., G_{sas'}q∗ = q∗. For the case that J̄_t = J∗, the iterate Q_t of the algorithm can be obtained by applying a sequence of these non-expansive operators. Let . We have