1 Introduction
Reinforcement learning (RL) studies the problem of an agent interacting with an unknown environment while trying to maximize its cumulative reward. The agent faces a fundamental exploration-exploitation tradeoff: should it explore the environment to gain more information for future decisions, or should it exploit the available information to maximize the reward? Efficient exploration is a crucial property of learning algorithms and is evaluated with the notion of regret: the difference between the cumulative reward of the optimal policy and that of the algorithm. Regret quantifies the speed of learning; i.e., low-regret algorithms learn more efficiently.
RL algorithms can broadly be classified as model-based and model-free. Model-based algorithms maintain an estimate of the environment dynamics and plan based on the estimated model. Model-free algorithms, on the other hand, directly estimate the value function or the policy without explicitly estimating the environment model. Model-free algorithms are simpler, memory and computation efficient, and more amenable to extension to large-scale problems by incorporating function approximation. Indeed, most of the recent advances in RL such as DQN (Mnih et al., 2013), TRPO (Schulman et al., 2015), A3C (Mnih et al., 2016), and PPO (Schulman et al., 2017) are in the model-free paradigm.

It was long believed that model-based algorithms can better manage the tradeoff between exploration and exploitation. Several model-based algorithms with low regret guarantees have been proposed in the past decade, including UCRL2 (Jaksch et al., 2010), REGAL (Bartlett and Tewari, 2009), PSRL (Ouyang et al., 2017), UCBVI (Azar et al., 2017), SCAL (Fruit et al., 2018), EBF (Zhang and Ji, 2019), and EULER (Zanette and Brunskill, 2019). However, the recent success of model-free algorithms in practice raised the theoretical question of whether it is possible to design model-free algorithms with low regret guarantees. In Jin et al. (2018), it was shown for the first time that (model-free) Q-learning (QL) with UCB exploration can achieve a near-optimal regret bound in episodic finite-horizon Markov decision processes (MDPs), where $\tilde{O}(\cdot)$ hides constants and logarithmic factors. This result was extended by Dong et al. (2019) to the infinite-horizon discounted setting.
However, designing model-free algorithms with near-optimal regret in the infinite-horizon average-reward setting has been rather challenging. The main difficulty in this setting is that the estimate of the value function may grow unbounded over time due to the infinite-horizon nature of the problem and the lack of a discount factor. Moreover, the contraction property of the discounted setting does not hold, and the backward induction technique of the finite-horizon setting cannot be applied here.
This paper presents Exploration Enhanced Q-learning (EE-QL), the first model-free algorithm that achieves $\tilde{O}(\sqrt{T})$ regret for infinite-horizon average-reward MDPs without the strong ergodicity assumption. We consider the general class of weakly communicating MDPs with finite states and actions. In prior work (Wei et al., 2020), the Optimistic QL algorithm does not need the strong ergodicity assumption but achieves only $\tilde{O}(T^{2/3})$ regret, while the MDP-OOMD algorithm in the same paper achieves $\tilde{O}(\sqrt{T})$ regret but needs the strong ergodicity assumption. Our result matches the lower bound of Jaksch et al. (2010) in terms of $T$ except for logarithmic factors. For a comparison to other model-based and model-free algorithms, see Table 1.
EE-QL (read "equal") uses stochastic approximation to estimate the value function, assuming that a concentrating estimate of the optimal gain is available. The key idea of this algorithm is the careful design of the learning rate to efficiently balance the effect of new and old observations as well as to control the magnitude of the value function. Instead of the typical learning rate of $\frac{1}{n}$ (where $n$ is the number of visits to the corresponding state-action pair) used in standard Q-learning type algorithms, the proposed EE-QL algorithm uses the learning rate $\frac{1}{\sqrt{n}}$. This learning rate provides nice properties (listed in Lemma 4) that are central to our analysis. In addition, experiments show that EE-QL significantly outperforms the existing model-free algorithms and has similar performance to the best model-based algorithms. This is due to the fact that, unlike previous model-free algorithms in the tabular setting that optimistically estimate each entry of the optimal value function (Jin et al., 2018; Dong et al., 2019; Wei et al., 2020), EE-QL estimates a single scalar (the optimal gain) optimistically, avoiding unnecessary optimism.
2 Preliminaries
We consider infinite-horizon average-reward MDPs described by the tuple $(\mathcal{S}, \mathcal{A}, r, p)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r$ is the deterministic reward function, and $p$ is the transition kernel. Here $\mathcal{S}$ and $\mathcal{A}$ are finite sets with cardinalities $S$ and $A$, respectively. The gain of a stationary deterministic policy $\pi$ with initial state $s$ is defined as
$$J^{\pi}(s) := \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} r(s_t, a_t) \,\Big|\, s_1 = s\Big],$$
where $a_t = \pi(s_t)$ and $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ for $t \ge 1$. Let $J^* := \sup_{\pi} J^{\pi}(s)$ be the optimal gain. The optimal gain is independent of the initial state for the standard class of weakly communicating MDPs considered in this paper. An MDP is weakly communicating if its state space can be divided into two subsets: in the first subset, all states are transient under any stationary policy; in the second subset, every state is accessible from any other state under some stationary policy. It is known that the weakly communicating condition is required to achieve low regret (Bartlett and Tewari, 2009). From standard MDP theory (Puterman, 2014), we know that for weakly communicating MDPs, there exists a function $v^*$ (unique up to an additive constant) such that the following Bellman equation holds:
$$J^* + v^*(s) = \max_{a \in \mathcal{A}} \Big\{ r(s, a) + \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v^*(s') \Big\} \qquad (1)$$
for all $s \in \mathcal{S}$. The optimal gain $J^*$ is achieved by the corresponding greedy policy with respect to $v^*$ (note that such a policy may not be unique).
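When the MDP is known, the Bellman equation (1) can be solved numerically by relative value iteration, a standard method from Puterman (2014). The following sketch is illustrative only (function and variable names are ours, not the paper's) and shows how the pair $(J^*, v^*)$ arises:

```python
import numpy as np

def relative_value_iteration(P, r, iters=2000):
    """Solve the average-reward Bellman equation (1) for a known MDP.

    P: transition kernel, shape (S, A, S); r: rewards, shape (S, A).
    Returns the optimal gain J* and a bias function v (unique up to
    an additive constant, here normalized so v[0] = 0).
    """
    S, A = r.shape
    v = np.zeros(S)
    for _ in range(iters):
        # One Bellman backup: Q(s, a) = r(s, a) + sum_{s'} p(s'|s,a) v(s')
        Q = r + P @ v                  # shape (S, A)
        v_new = Q.max(axis=1)
        # Subtract the value of a reference state so v stays bounded;
        # this is what makes the iteration "relative".
        v = v_new - v_new[0]
    # At a fixed point, J* + v(s0) = max_a Q(s0, a); with v[0] = 0 the
    # gain is read off at the reference state.
    J = (r + P @ v).max(axis=1)[0] - v[0]
    return J, v
```

On a toy two-state chain in which one state is absorbing with reward 1, the routine recovers a gain of 1, matching the fact that the optimal gain is independent of the initial state for weakly communicating MDPs.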
In this paper, we consider the reinforcement learning problem of an agent interacting with a weakly communicating MDP with unknown transition kernel and reward function (thus, the Bellman equation cannot be solved directly). At each time $t$, the agent observes the state $s_t$, takes action $a_t$, and receives the reward $r(s_t, a_t)$. The next state $s_{t+1}$ is then determined according to the probability distribution $p(\cdot \mid s_t, a_t)$. The performance of the learning algorithm is quantified by the notion of cumulative regret defined as
$$R_T := \sum_{t=1}^{T} \big( J^* - r(s_t, a_t) \big).$$
Regret evaluates the transient performance of the learning algorithm by measuring the difference between the total gain of the optimal policy and the cumulative reward obtained by the learning algorithm up to time $T$. The goal of the agent is to maximize the total reward (or, equivalently, to minimize the regret). If a learning algorithm achieves sublinear regret, its average reward converges to the optimal gain. Zhang and Ji (2019) proposed a model-based algorithm with a regret bound of $\tilde{O}(\sqrt{DSAT})$ (where $D$ is the diameter of the MDP), which matches the lower bound of Jaksch et al. (2010). The best existing regret bound of a model-free algorithm for weakly communicating MDPs is $\tilde{O}(T^{2/3})$ by Wei et al. (2020).
3 The Exploration Enhanced Q-learning Algorithm
In this section, we introduce the Exploration Enhanced Q-learning (EE-QL) algorithm (see Algorithm 1). The algorithm works for the broad class of weakly communicating MDPs. It is well known that the weakly communicating condition is necessary to achieve sublinear regret (Bartlett and Tewari, 2009).

EE-QL approximates the value function of the infinite-horizon average-reward setting using stochastic approximation with carefully chosen learning rates. The algorithm takes greedy actions with respect to the current estimate of the $Q$ function. After visiting the next state, a stochastic update of $Q$ is made based on the Bellman equation. The quantity $\bar{J}_t$ in the algorithm is an estimate of the optimal gain $J^*$ that satisfies the following assumption.
Assumption 1.
(Concentrating $\bar{J}$) There exists a constant $c > 0$ such that $|\bar{J}_t - J^*| \le c/\sqrt{t}$ for all $t \ge 1$.
In some applications, $J^*$ is known a priori. For example, in the infinite-horizon version of CartPole described in Hao et al. (2020), the optimal policy keeps the pole upright throughout the horizon, which leads to a known $J^*$. In such cases, one can simply set $\bar{J}_t = J^*$. In applications where $J^*$ is not known, one can set $\bar{J}_t = \tilde{J}_t + c\sqrt{\log t / t}$ for some constant $c$, where $\tilde{J}_t$ is stochastically updated as $\tilde{J}_{t+1} = (1 - \eta_t)\tilde{J}_t + \eta_t\, r(s_t, a_t)$ for some decaying learning rate $\eta_t$. In particular, $\eta_t = 1/t$ yields
$$\bar{J}_t = \frac{1}{t}\sum_{\tau=1}^{t} r(s_\tau, a_\tau) + c\sqrt{\frac{\log t}{t}}. \qquad (2)$$
We have numerically verified that this choice of $\bar{J}_t$, with an appropriate constant $c$, satisfies Assumption 1 in the RiverSwim and RandomMDP environments (see Section 5 for more details). The choice of the learning rate is particularly important. Choosing $\frac{1}{\sqrt{n}}$ (rather than $\frac{1}{n}$) efficiently combines the new and old observations and provides the nice properties listed in Lemma 4 that play a central role in the analysis. The widely used learning rate of $\frac{1}{n}$ in the standard Q-learning algorithm (Abounadi et al., 2001) may not satisfy these properties.
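An estimate of the form (2) can be maintained online with constant memory. The sketch below is our own illustration; the bonus form $c\sqrt{\log t / t}$ and the value of $c$ are assumptions standing in for the tuned constants referenced in Table 2.

```python
import math

class OptimisticGainEstimate:
    """Running average of observed rewards plus a decaying optimism bonus,
    in the spirit of Eq. (2). The bonus c*sqrt(log(t+1)/t) is an
    illustrative assumption, not the paper's exact tuned form."""

    def __init__(self, c=1.0):
        self.c = c
        self.t = 0
        self.avg = 0.0

    def update(self, reward):
        self.t += 1
        # Stochastic update with learning rate 1/t reproduces the
        # empirical average of all rewards seen so far.
        self.avg += (reward - self.avg) / self.t

    def value(self):
        if self.t == 0:
            return self.c  # optimistic default before any observation
        return self.avg + self.c * math.sqrt(math.log(self.t + 1) / self.t)
```

The bonus shrinks at the $\tilde{O}(1/\sqrt{t})$ rate required by Assumption 1 while keeping the estimate optimistic, which Section 3 argues is needed when $J^*$ is unknown.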
Initialization: $Q_1(s, a) = 0$ and visit counts $N(s, a) = 0$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.
Define: $V_t(s) := \max_{a} Q_t(s, a)$ and the learning rate $\alpha_n := 1/\sqrt{n}$.
for $t = 1, 2, \ldots, T$ do: take the greedy action $a_t \in \arg\max_{a} Q_t(s_t, a)$; observe $r(s_t, a_t)$ and $s_{t+1}$; set $n = N(s_t, a_t) + 1$ and $N(s_t, a_t) \leftarrow n$.
Update: $Q_{t+1}(s_t, a_t) = (1 - \alpha_n)\, Q_t(s_t, a_t) + \alpha_n \big( r(s_t, a_t) - \bar{J}_t + V_t(s_{t+1}) \big)$.
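A single iteration of the loop in Algorithm 1 can be sketched as follows, assuming the $\frac{1}{\sqrt{n}}$ learning rate described in the text and greedy tie-breaking by lowest action index (an implementation detail the paper leaves open):

```python
import numpy as np

def eeql_step(Q, counts, s, J_bar, env_step):
    """One EE-QL iteration (sketch).

    Q: (S, A) array of Q estimates; counts: (S, A) visit counts;
    s: current state; J_bar: current estimate of the optimal gain J*;
    env_step(s, a) -> (reward, next_state) simulates the unknown MDP.
    Returns the next state.
    """
    a = int(np.argmax(Q[s]))                # greedy action w.r.t. current Q
    reward, s_next = env_step(s, a)
    counts[s, a] += 1
    alpha = 1.0 / np.sqrt(counts[s, a])     # learning rate 1/sqrt(n)
    # Stochastic Bellman update with target r - J_bar + max_a' Q(s', a'),
    # mirroring the average-reward Bellman equation (1).
    target = reward - J_bar + Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return s_next
```

On a two-state chain with optimal gain 1 and $\bar{J}_t = J^* = 1$, repeated calls drive the $Q$ estimates toward a solution of the Bellman equation whose bias difference between the two states is 1.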
In addition, unlike the Q-learning algorithms with UCB exploration (Jin et al., 2018; Dong et al., 2019; Wei et al., 2020), EE-QL does not optimistically estimate the value function. In the case that $J^*$ is known, the algorithm need not follow the optimism in the face of uncertainty principle as in (Jin et al., 2018; Dong et al., 2019; Wei et al., 2020; Jaksch et al., 2010). However, our numerical experiments show that if $J^*$ is not known, $\bar{J}_t$ has to be an optimistic estimate of the average reward as in (2). Thus, EE-QL is economical in using optimism: instead of spending optimistic confidence intervals on each entry of the $Q$ function, our algorithm is optimistic around a single scalar $J^*$. This leads to significant improvement in numerical performance compared to the literature (see Section 5). We now state the main regret guarantee of Algorithm 1.

Theorem 1. Under Assumption 1, with high probability, the regret of Algorithm 1 is bounded by $\tilde{O}(\sqrt{T})$.
This result improves over the previous best known regret bound of $\tilde{O}(T^{2/3})$ by Wei et al. (2020) and matches the lower bound of Jaksch et al. (2010) in terms of $T$ up to logarithmic factors. To the best of our knowledge, this is the first model-free algorithm that achieves an $\tilde{O}(\sqrt{T})$ regret bound for the general class of weakly communicating MDPs in the infinite-horizon average-reward setting.
4 Analysis
In this section, we provide the proof of Theorem 1. Before starting the analysis, let us define
$$\alpha_t^i := \alpha_i \prod_{j=i+1}^{t} (1 - \alpha_j) \qquad (3)$$
for $1 \le i \le t$, where $\alpha_n = 1/\sqrt{n}$ is the learning rate used in Algorithm 1. The coefficient $\alpha_t^i$ determines the effect of the $i$th step on the $t$th update. This quantity has nice properties that are listed in Lemma 4 and are central to our analysis. In particular, the $\tilde{O}(\sqrt{T})$ regret bound stems mainly from properties 2 and 4 in Lemma 4.
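The coefficients in (3) are easy to compute directly. The snippet below (our own illustration, not part of the proof) verifies numerically that with $\alpha_1 = 1$ the weights $\alpha_t^1, \ldots, \alpha_t^t$ sum to one, so each $Q$ estimate is a convex combination of past Bellman targets, and that recent observations receive larger weight than old ones.

```python
import numpy as np

def step_weights(t):
    """Compute alpha_t^i = alpha_i * prod_{j=i+1}^{t} (1 - alpha_j)
    from Eq. (3), with the EE-QL learning rate alpha_n = 1/sqrt(n)."""
    alpha = 1.0 / np.sqrt(np.arange(1, t + 1))
    w = np.empty(t)
    for i in range(t):
        # np.prod over an empty slice is 1, so w[t-1] = alpha[t-1].
        w[i] = alpha[i] * np.prod(1.0 - alpha[i + 1:])
    return w
```

For instance, `step_weights(50)` sums to 1 (up to floating-point error), its last entry equals $1/\sqrt{50}$, and the weight of the first observation is far smaller than that of the most recent one.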
4.1 Proof of Theorem 1
Proof.
We start by decomposing the regret using Lemma 7. With probability at least $1 - \delta$, the regret of any algorithm can be bounded by
(4) 
It suffices to bound the remaining term. Let $N_t(s, a)$ denote the number of visits to the state-action pair $(s, a)$ before time $t$ (excluding time $t$ itself). For notational simplicity, let $n = N_t(s_t, a_t)$ and let $t_i$ denote the time step at which $(s_t, a_t)$ is visited for the $i$th time. We can write:
(5) 
where the first equality is by the fact that the sum can be reorganized over visit times and the second equality is by the definition of $N_t$. The second term on the right hand side can be bounded using the update rule of Algorithm 1 (see Lemma 3). The rest of the proof writes the first term on the right hand side in terms of the left hand side (to telescope) plus some sublinear additive terms. We can write:
By Lemma 6, the term can be written as:
By changing the order of summation on and , we can write:
We proceed by upper bounding each term in the latter by using Lemma 4(3). Note that
Moreover, note that $v^*$ is unique up to an additive constant. So, without loss of generality, we choose $v^*$ such that its norm is controlled by the uniform bound on the $Q$ (and $V$) estimates as in Lemma 2. This choice of $v^*$ implies the required bound for all $t$. Replacing these bounds in Lemma 4(3) implies
(6) 
To simplify the right hand side of the above inequality, observe that
(7) 
Similarly,
(8)  
(9) 
Using the inequalities in Lemma 5 and Lemma 4(5), replacing the equalities (7), (8), (9) into the right hand side of (4.1), and adding and subtracting implies
(10) 
Note that the first term is controlled by Assumption 1. Furthermore, the second term is a martingale difference sequence and can be bounded with probability at least $1 - \delta$ using Azuma's inequality. Moreover, the last term is bounded by Lemma 8. Replacing these bounds on the right hand side of the above inequality, simplifying the result, and plugging back into (4.1) implies
with probability at least $1 - \delta$. Telescoping the left hand side with the right hand side and using the uniform bounds of Lemma 2 implies that
with probability at least $1 - \delta$. Replacing this bound into (4) implies that
with probability at least $1 - \delta$, which completes the proof. ∎
4.2 Auxiliary Lemmas
In this section, we provide some auxiliary lemmas that are used in the proof of Theorem 1. The proofs of these lemmas can be found in the appendix.
Lemma 2.
The $Q$ function maintained by Algorithm 1 is uniformly bounded.
Lemma 3.
The second term of (4.1) can be bounded by
Lemma 4.
The following properties hold:

for any .

For any , and any , we have .

Let be a scalar and define and . Then, for any , and any , we have .

For any , we have .

For any , we have .
Lemma 5 (Frequently used inequalities).
The following inequalities hold:

.

.

.
Lemma 6.
For a fixed state-action pair $(s, a)$, let $n = N_T(s, a)$, and let $t_i$ be the time step at which $(s, a)$ is taken for the $i$th time. Then,
Lemma 7.
With probability at least $1 - \delta$, the regret of any algorithm is bounded as
Lemma 8.
.
5 Experiments
In this section, we numerically evaluate the performance of the proposed EE-QL algorithm. Two environments are considered: RandomMDP and RiverSwim. The RandomMDP environment is an ergodic MDP in which the transition kernel and the rewards are chosen uniformly at random. The RiverSwim environment is a weakly communicating MDP with states arranged in a chain and two actions (left and right) that simulates an agent swimming in a river. If the agent swims left (i.e., in the direction of the river current), it is always successful. If it decides to swim right, it may fail with some probability. The agent receives a small reward in the leftmost state, a large reward in the rightmost state, and zero reward for all other states and actions. The agent starts from the leftmost state. The optimal policy is to always swim right to reach the high-reward rightmost state.
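For concreteness, a minimal RiverSwim simulator consistent with the description above can be written as follows. The number of states, success probability, and reward magnitudes are illustrative assumptions (the paper's exact parameters are in Table 2), and for simplicity a failed right swim leaves the agent in place rather than pushing it left.

```python
import numpy as np

class RiverSwim:
    """Minimal RiverSwim simulator (illustrative parameters). The chain
    structure, always-successful left swims, possibly-failing right swims,
    and small-left/large-right rewards follow the description in the text."""

    LEFT, RIGHT = 0, 1

    def __init__(self, n_states=6, p_success=0.6, seed=0):
        self.n = n_states
        self.p = p_success                 # chance a right swim succeeds
        self.rng = np.random.default_rng(seed)
        self.state = 0                     # start at the leftmost state

    def step(self, action):
        s = self.state
        if action == self.LEFT:
            s_next = max(s - 1, 0)         # left always succeeds
        elif self.rng.random() < self.p:
            s_next = min(s + 1, self.n - 1)  # swim against the current
        else:
            s_next = s                     # fail and stay (simplification)
        # Small reward at the leftmost state, large at the rightmost.
        reward = 0.0
        if s == 0 and action == self.LEFT:
            reward = 0.05
        elif s == self.n - 1 and action == self.RIGHT:
            reward = 1.0
        self.state = s_next
        return reward, s_next
```

An agent that always swims right eventually reaches the rightmost state and starts collecting the large reward, illustrating why exploration toward the right end of the chain is essential in this environment.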
We compare our algorithm against Optimistic QL (Wei et al., 2020), MDP-OOMD (Wei et al., 2020), and Politex (Abbasi-Yadkori et al., 2019a) as model-free algorithms, and UCRL2 (Jaksch et al., 2010) and PSRL (Ouyang et al., 2017) as model-based benchmarks. The hyperparameters for these algorithms are tuned to obtain the best performance (see Table 2 for more details). The gain estimate $\bar{J}_t$ is chosen as in (2) with an appropriate constant (see Table 2). We numerically verified that this choice of $\bar{J}_t$ satisfies Assumption 1. Figure 1 shows that in the RiverSwim environment, our algorithm significantly outperforms Optimistic QL, the only existing model-free algorithm with low regret for weakly communicating MDPs. The reason is that the proposed algorithm does not spend optimism on the entire $Q$ function; rather, the optimism in the face of uncertainty principle is applied to a single scalar, $J^*$. Note that other model-free algorithms such as Politex and MDP-OOMD did not yield sublinear regret in RiverSwim and are thus removed from the figure. This is because RiverSwim does not satisfy the ergodicity assumption required by these algorithms. Moreover, in both the RiverSwim and RandomMDP environments, our algorithm performs as well as the best existing model-based algorithms in practice, though with less memory.
6 Conclusions
We proposed EE-QL, the first model-free algorithm with an $\tilde{O}(\sqrt{T})$ regret bound for weakly communicating MDPs in the infinite-horizon average-reward setting. Our algorithm shows strong numerical performance: significantly better than the existing model-free algorithms and similar to the best model-based algorithms, yet with less memory. The key to obtaining such performance is to avoid optimistic estimation of each entry of the $Q$ function; instead, EE-QL uses optimism for a single scalar (the gain of the optimal policy). Our algorithm assumes that a concentrating estimate of $J^*$ is available. This assumption is verified numerically for an optimistic empirical average-reward estimator. The theoretical verification of this assumption is left for future work.
References

Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702, 2019a.
Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced Politex. arXiv preprint arXiv:1908.10479, 2019b.
Jinane Abounadi, D. Bertsekas, and Vivek S. Borkar. Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40(3):681–698, 2001.
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 263–272. JMLR.org, 2017.
Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv preprint arXiv:1901.09311, 2019.
Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1573–1581, 2018.
Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of UCRL2B, 2019. Available at rlgammazero.github.io/docs/ucrl2b_improved.pdf.
Botao Hao, Nevena Lazic, Yasin Abbasi-Yadkori, Pooria Joulani, and Csaba Szepesvari. Provably efficient adaptive approximate policy iteration. arXiv preprint arXiv:2002.03069, 2020.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
Ronald Ortner. Regret bounds for reinforcement learning via Markov chain concentration. arXiv preprint arXiv:1808.01813, 2018.
Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342, 2017.
Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In Algorithmic Learning Theory, pages 770–805, 2018.
Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. In International Conference on Machine Learning, 2020.
Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, 2019.
Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, 2019.
Appendix
Appendix A Proof of Lemma 2
Proof.
We first prove the claim for the case where $\bar{J}_t = J^*$ and then extend the proof to the general case. Let $\mathcal{T}$ be an operator on the space of $Q$ functions defined by
where the argument is arbitrary. Note that $\mathcal{T}$ is a nonexpansive operator because
where the maximum is taken over all state-action pairs. Thus, $\mathcal{T}$ does not expand sup-norm distances. Moreover, note that the optimal $Q$ function is a fixed point of $\mathcal{T}$ by the Bellman equation. For the case that $\bar{J}_t = J^*$, the $Q$ estimate of the algorithm can be obtained by applying a sequence of these nonexpansive operators. We have