Reinforcement learning (RL) studies the problem of an agent interacting with an unknown environment while trying to maximize its cumulative reward. The agent faces a fundamental exploration-exploitation trade-off: should it explore the environment to gain more information for future decisions, or should it exploit the available information to maximize the reward? Efficient exploration is a crucial property of learning algorithms and is evaluated with the notion of regret: the difference between the cumulative reward of the optimal policy and that of the algorithm. Regret quantifies the speed of learning: low-regret algorithms learn more efficiently.
RL algorithms can broadly be classified as model-based and model-free. Model-based algorithms maintain an estimate of the environment dynamics and plan based on the estimated model. Model-free algorithms, on the other hand, directly estimate the value function or the policy without explicitly estimating the environment model. Model-free algorithms are simpler, memory- and computation-efficient, and more amenable to large-scale problems through function approximation. Indeed, most of the recent advances in RL, such as DQN (Mnih et al., 2013), TRPO (Schulman et al., 2015), A3C (Mnih et al., 2016), and PPO (Schulman et al., 2017), are in the model-free paradigm.
It was long believed that model-based algorithms can better manage the trade-off between exploration and exploitation. Several model-based algorithms with low regret guarantees have been proposed in the past decade, including UCRL2 (Jaksch et al., 2010), REGAL (Bartlett and Tewari, 2009), PSRL (Ouyang et al., 2017), UCBVI (Azar et al., 2017), SCAL (Fruit et al., 2018), EBF (Zhang and Ji, 2019), and EULER (Zanette and Brunskill, 2019). However, the recent success of model-free algorithms in practice raised the theoretical question of whether it is possible to design model-free algorithms with low regret guarantees. Jin et al. (2018) showed for the first time that (model-free) Q-learning (QL) with UCB exploration can achieve a near-optimal regret bound of $\tilde{O}(\sqrt{H^3 SAT})$ in episodic finite-horizon Markov Decision Processes (MDPs), where $\tilde{O}(\cdot)$ hides constants and logarithmic factors. This result was extended by Dong et al. (2019) to the infinite-horizon discounted setting.
However, designing model-free algorithms with near-optimal regret in the infinite-horizon average-reward setting has been rather challenging. The main difficulty is that the estimate of the $Q$-value function may grow unbounded over time due to the infinite-horizon nature of the problem and the lack of a discount factor. Moreover, the contraction property of the discounted setting does not hold, and the backward-induction technique of the finite-horizon setting cannot be applied here.
This paper presents Exploration Enhanced Q-learning (EE-QL), the first model-free algorithm that achieves $\tilde{O}(\sqrt{T})$ regret for infinite-horizon average-reward MDPs without the strong ergodicity assumption. We consider the general class of weakly communicating MDPs with finite states and actions. In prior work (Wei et al., 2020), the Optimistic QL algorithm does not need the strong ergodicity assumption but achieves only $\tilde{O}(T^{2/3})$ regret, while the MDP-OOMD algorithm in the same paper achieves $\tilde{O}(\sqrt{T})$ regret but needs the strong ergodicity assumption. Our result matches the lower bound of Jaksch et al. (2010) in terms of $T$ up to logarithmic factors. For a comparison with other model-based and model-free algorithms, see Table 1.
EE-QL (read "equal") uses stochastic approximation to estimate the $Q$-value function, assuming that a concentrating estimate of the optimal gain $J^*$ is available. The key idea of this algorithm is the careful design of the learning rate, which efficiently balances the effect of new and old observations and controls the magnitude of the $Q$-value function. In contrast to the typical learning rate of $1/n$ (where $n$ is the number of visits to the corresponding state-action pair) in standard Q-learning-type algorithms, EE-QL uses a carefully designed learning rate, specified in Algorithm 1, that provides nice properties (listed in Lemma 4) central to our analysis. In addition, experiments show that EE-QL significantly outperforms the existing model-free algorithms and has performance similar to the best model-based algorithms. This is because, unlike previous model-free algorithms in the tabular setting that optimistically estimate each entry of the optimal $Q$-value function (Jin et al., 2018; Dong et al., 2019; Wei et al., 2020), EE-QL estimates only a single scalar (the optimal gain) optimistically, avoiding unnecessary optimism.
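To make the update concrete, the stochastic-approximation step described above can be sketched as follows. This is a minimal sketch, not the paper's Algorithm 1: the learning-rate schedule `gamma`, the function name `ee_ql_step`, and its interface are our own illustrative assumptions.

```python
import numpy as np

def ee_ql_step(Q, N, s, a, r, s_next, J_bar):
    """One stochastic-approximation update of the Q estimate, driven by the
    average-reward Bellman equation J* + q*(s,a) = r(s,a) + E[max_a' q*(s',a')].
    The schedule gamma below is a hypothetical stand-in, NOT the paper's rate."""
    N[s, a] += 1
    gamma = 1.0 / np.sqrt(N[s, a])        # illustrative learning rate
    target = r - J_bar + Q[s_next].max()  # sampled Bellman backup minus gain
    Q[s, a] += gamma * (target - Q[s, a])
    return int(Q[s_next].argmax())        # greedy action at the next state
```

The greedy action at the next state is what the algorithm would play; in EE-QL, the gain estimate `J_bar` is the concentrating estimate of $J^*$ discussed in Section 3.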
We consider infinite-horizon average-reward MDPs described by $(\mathcal{S}, \mathcal{A}, r, p)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the deterministic reward function, and $p(\cdot \mid s, a)$ is the transition kernel. Here $\mathcal{S}$ and $\mathcal{A}$ are finite sets with cardinalities $S$ and $A$, respectively. The gain of a stationary deterministic policy $\pi : \mathcal{S} \to \mathcal{A}$ with the initial state $s$ is defined as
$$J^\pi(s) := \liminf_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} r(s_t, a_t) \,\Big|\, s_1 = s\Big],$$
where $a_t = \pi(s_t)$ for $t \geq 1$. Let $J^* := \max_\pi J^\pi(s)$ be the optimal gain. The optimal gain is independent of the initial state for the standard class of weakly communicating MDPs considered in this paper. An MDP is weakly communicating if its state space can be divided into two subsets: in the first subset, all the states are transient under any stationary policy; in the second subset, every state is accessible from any other state under some stationary policy. It is known that the weakly communicating assumption is required to achieve low regret (Bartlett and Tewari, 2009). From the standard MDP theory (Puterman, 2014), we know that for weakly communicating MDPs, there exists a function $q^*$ (unique up to an additive constant) such that the following Bellman equation holds:
$$J^* + q^*(s, a) = r(s, a) + \sum_{s'} p(s' \mid s, a) \max_{a'} q^*(s', a')$$
for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. The optimal gain $J^*$ is achieved by the corresponding optimal policy $\pi^*(s) \in \arg\max_a q^*(s, a)$ (note that such a policy may not be unique).
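When the model is known, the Bellman equation above can be solved numerically by relative value iteration. The sketch below, assuming the MDP is given as reward and transition arrays, illustrates how the gain and the relative value function arise from the fixed-point equation; it is an illustration, not part of the algorithm of this paper.

```python
import numpy as np

def relative_value_iteration(r, P, tol=1e-8, max_iter=100_000):
    """Solve J* + v(s) = max_a [ r(s,a) + sum_s' P(s'|s,a) v(s') ] by
    relative value iteration. r has shape (S, A); P has shape (S, A, S)."""
    S, A = r.shape
    v = np.zeros(S)
    for _ in range(max_iter):
        Q = r + P @ v        # Bellman backup; P @ v sums over next states
        v_new = Q.max(axis=1)
        v_new -= v_new[0]    # pin a reference state: v is unique up to a constant
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    Q = r + P @ v
    return float((Q.max(axis=1) - v).mean()), v   # (gain J*, relative values)
```

For example, on a two-state MDP in which one action leads to a state yielding reward 1 per step, the routine recovers $J^* = 1$ and a relative-value gap of 1 between the two states.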
In this paper, we consider the reinforcement learning problem of an agent interacting with a weakly communicating MDP with unknown transition kernel and reward function (thus, the Bellman equation cannot be solved directly). At each time $t$, the agent observes the state $s_t$, takes action $a_t$, and receives the reward $r(s_t, a_t)$. The next state $s_{t+1}$ is then determined according to the probability distribution $p(\cdot \mid s_t, a_t)$. The performance of the learning algorithm is quantified by the notion of cumulative regret, defined as
$$R_T := \sum_{t=1}^{T} \big(J^* - r(s_t, a_t)\big).$$
Regret evaluates the transient performance of the learning algorithm by measuring the difference between the total gain of the optimal policy and the cumulative reward obtained by the learning algorithm up to time $T$. The goal of the agent is to maximize the total reward (or, equivalently, minimize the regret). If a learning algorithm achieves sub-linear regret, its average reward converges to the optimal gain. Zhang and Ji (2019) proposed a model-based algorithm with a regret bound of $\tilde{O}(\sqrt{DSAT})$ (where $D$ is the diameter of the MDP), which matches the lower bound of Jaksch et al. (2010). The best existing regret bound of a model-free algorithm for weakly communicating MDPs is $\tilde{O}(T^{2/3})$ by Wei et al. (2020).
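Numerically, regret is computed exactly as in this definition: a running sum of the per-step gap between $J^*$ and the collected reward. A minimal sketch (the function name is ours):

```python
import numpy as np

def cumulative_regret(rewards, J_star):
    """R_t = sum_{tau <= t} (J* - r_tau) for t = 1..T, as in the definition."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(J_star - rewards)
```

A sub-linear regret curve means the average collected reward approaches $J^*$.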
3 The Exploration Enhanced Q-learning Algorithm
In this section, we introduce the Exploration Enhanced Q-learning (EE-QL) algorithm (see Algorithm 1). The algorithm works for the broad class of weakly communicating MDPs. It is well-known that the weakly communicating condition is necessary to achieve sublinear regret (Bartlett and Tewari, 2009).
EE-QL approximates the $Q$-value function for the infinite-horizon average-reward setting using stochastic approximation with carefully chosen learning rates. The algorithm takes greedy actions with respect to the current estimate $Q_t$. After visiting the next state, a stochastic update of $Q_t$ is made based on the Bellman equation. The quantity $\bar{J}_t$ in the algorithm is an estimate of $J^*$ that satisfies the following assumption.
Assumption 1 (Concentrating $\bar{J}_t$). There exists a constant $c > 0$ such that $|\bar{J}_t - J^*| \leq c/\sqrt{t}$ for all $t \geq 1$.
In some applications, $J^*$ is known a priori. For example, in the infinite-horizon version of Cartpole described in Hao et al. (2020), the optimal policy keeps the pole upright throughout the horizon, which leads to a known $J^*$. In such cases, one can simply set $\bar{J}_t = J^*$. In applications where $J^*$ is not known, one can set
$$\bar{J}_t = \hat{J}_t + \frac{c'}{\sqrt{t}} \qquad (2)$$
for some constant $c' > 0$, where $\hat{J}_t$ is stochastically updated as $\hat{J}_{t+1} = (1 - \beta_t)\hat{J}_t + \beta_t r(s_t, a_t)$ for some decaying learning rate $\beta_t$. In particular, $\beta_t = 1/t$ yields the empirical average $\hat{J}_{t+1} = \frac{1}{t}\sum_{\tau=1}^{t} r(s_\tau, a_\tau)$.
We have numerically verified that this choice of $\bar{J}_t$ with $\beta_t = 1/t$ satisfies Assumption 1 in the RiverSwim and RandomMDP environments (see Section 5 for more details). The choice of the learning rate is particularly important: it efficiently combines the new and old observations and provides the nice properties listed in Lemma 4 that play a central role in the analysis. The widely used learning rate of $1/n$ in the standard Q-learning algorithm (Abounadi et al., 2001) may not satisfy these properties.
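The optimistic gain estimate of (2) with $\beta_t = 1/t$ can be sketched as a running average plus a decaying bonus. The class name and the default $c = 1$ are illustrative assumptions; in the experiments, the constant is tuned.

```python
import numpy as np

class OptimisticGainEstimate:
    """Maintains J_bar_t = J_hat_t + c / sqrt(t), where J_hat_t is the
    running average of observed rewards (beta_t = 1/t)."""
    def __init__(self, c=1.0):
        self.c = c        # optimism constant (tuned in practice)
        self.t = 0
        self.avg = 0.0    # empirical average reward J_hat_t

    def update(self, reward):
        self.t += 1
        beta = 1.0 / self.t               # beta_t = 1/t gives the exact mean
        self.avg += beta * (reward - self.avg)
        return self.avg + self.c / np.sqrt(self.t)
```

The returned value is an optimistic estimate of the average reward, as required by the discussion below.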
In addition, unlike the Q-learning algorithms with UCB exploration (Jin et al., 2018; Dong et al., 2019; Wei et al., 2020), EE-QL does not optimistically estimate the $Q$-value function. In the case that $J^*$ is known, the algorithm need not follow the optimism in the face of uncertainty principle as in (Jin et al., 2018; Dong et al., 2019; Wei et al., 2020; Jaksch et al., 2010). However, our numerical experiments show that if $J^*$ is not known, $\bar{J}_t$ has to be an optimistic estimate of the average reward as in (2). Thus, EE-QL is economical in its use of optimism: instead of maintaining optimistic confidence intervals around each entry of the $Q$-value function, our algorithm is optimistic about a single scalar, $J^*$. This leads to a significant improvement in numerical performance compared to the literature (see Section 5). We now state the main regret guarantee of Algorithm 1.
Theorem 1. Under Assumption 1, with high probability, the regret of Algorithm 1 is bounded by $\tilde{O}(\sqrt{T})$.

This result improves on the previous best known regret bound of $\tilde{O}(T^{2/3})$ by Wei et al. (2020) and matches the lower bound of Jaksch et al. (2010) in terms of $T$ up to logarithmic factors. To the best of our knowledge, this is the first model-free algorithm that achieves an $\tilde{O}(\sqrt{T})$ regret bound for the general class of weakly communicating MDPs in the infinite-horizon average-reward setting.
In this section, we provide the proof of Theorem 1. Before we start the analysis, let us define
$$b_n^t := \gamma_n \prod_{i=n+1}^{t} (1 - \gamma_i)$$
for $n \leq t$, where $\gamma_n$ is the learning rate used in Algorithm 1. The quantity $b_n^t$ determines the effect of the $n$-th step on the $t$-th update. It has nice properties that are listed in Lemma 4 and are central to our analysis. In particular, the $\tilde{O}(\sqrt{T})$ regret bound is essentially due to properties 2 and 4 in Lemma 4.
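A useful sanity check on these weights: for any learning-rate schedule with $\gamma_1 = 1$, the weights telescope and sum to one, so the $t$-th update is a convex combination of the first $t$ observations. The schedule below is a hypothetical stand-in, not the one in Algorithm 1.

```python
import math

def b(n, t, gamma):
    """b_n^t = gamma(n) * prod_{i = n+1..t} (1 - gamma(i)): the weight of
    the n-th observation in the t-th update."""
    w = gamma(n)
    for i in range(n + 1, t + 1):
        w *= 1.0 - gamma(i)
    return w

# Any schedule with gamma(1) = 1 makes the weights a probability distribution.
gamma = lambda n: 1.0 / math.sqrt(n)   # illustrative schedule only
total = sum(b(n, 20, gamma) for n in range(1, 21))
```

By the telescoping identity $\sum_{n=1}^{t} b_n^t = 1 - \prod_{i=1}^{t}(1-\gamma_i)$, the sum equals one whenever $\gamma_1 = 1$.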
4.1 Proof of Theorem 1
We start by decomposing the regret using Lemma 7: with probability at least $1 - \delta$, the regret of any algorithm can be bounded by
It suffices to bound the first term on the right-hand side. Let $N_t(s, a)$ denote the number of visits to the state-action pair $(s, a)$ before time $t$ (excluding time $t$ itself). For notational simplicity, let $n = N_T(s, a)$ and let $t_n$ be the time step at which $(s, a)$ is visited for the $n$-th time. We can write:
where the first equality follows from regrouping the sum over time steps by state-action pairs, and the second equality is by the definition of $t_n$. The second term on the right-hand side can be bounded using line (1) of the algorithm (see Lemma 3). The rest of the proof writes the first term on the right-hand side in terms of the left-hand side (to telescope) plus some sublinear additive terms. We can write:
By Lemma 6, the term can be written as:
By changing the order of summation on and , we can write:
We proceed by upper-bounding each term in the latter using Lemma 4(3). Note that
Moreover, note that $q^*$ is unique up to an additive constant. So, without loss of generality, we choose $q^*$ to satisfy the uniform bound of Lemma 2 (which also bounds $Q_t$). Replacing these bounds in Lemma 4(3) implies
To simplify the right hand side of the above inequality, observe that
Note that by Assumption 1, $|\bar{J}_t - J^*| \leq c/\sqrt{t}$, so these error terms sum to $O(\sqrt{T})$. Furthermore, the noise term is a martingale difference sequence and can be bounded with probability at least $1 - \delta$ using Azuma's inequality. Moreover, the remaining term is bounded by Lemma 8. Replacing these bounds on the right-hand side of the above inequality, simplifying the result, and plugging back into (4.1) implies
with probability at least $1 - \delta$. Telescoping the left-hand side with the right-hand side and noting that $Q_t$ is uniformly bounded (Lemma 2) implies that
with high probability. Replacing this bound into (4) implies that
with high probability, which completes the proof. ∎
4.2 Auxiliary Lemmas
In this section, we provide some auxiliary lemmas that are used in the proof of Theorem 1. The proof for these lemmas can be found in the appendix.
Lemma 2. The $Q_t$ in Algorithm 1 is uniformly bounded.
Lemma 3. The second term of (4.1) can be bounded by
Lemma 4. The following properties hold:
1. for any .
2. For any , and any , we have .
3. Let be a scalar and define and . Then, for any , and any , we have .
4. For any , we have .
5. For any , we have .
Lemma 5 (Frequently used inequalities).
The following inequalities hold:
Lemma 6. For a fixed state-action pair $(s, a)$, let $n = N_T(s, a)$ and let $t_n$ be the time step at which $(s, a)$ is taken for the $n$-th time. Then,
Lemma 7. With probability at least $1 - \delta$, the regret of any algorithm is bounded as
In this section, we numerically evaluate the performance of our proposed EE-QL algorithm. Two environments are considered: RandomMDP and RiverSwim. The RandomMDP environment is an ergodic MDP in which the transition kernel and the rewards are chosen uniformly at random. The RiverSwim environment is a weakly communicating MDP with states arranged in a chain and two actions (left and right) that simulates an agent swimming in a river. If the agent swims left (i.e., in the direction of the river current), it is always successful. If it decides to swim right, it may fail with some probability. The reward function can be described as follows: the agent receives a small reward for swimming left in the leftmost state, a reward of 1 for swimming right in the rightmost state, and zero reward for all other states and actions. The agent starts from the leftmost state. The optimal policy is to always swim right to reach the high-reward rightmost state.
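For reference, a minimal RiverSwim sketch is given below. The number of states, the failure probabilities for swimming right, and the reward values are common illustrative choices from the literature, assumed here rather than taken from this paper's Table 2.

```python
import random

class RiverSwim:
    """Chain MDP: action 0 = swim left (always succeeds), action 1 = swim
    right (may fail). Small reward at the leftmost state, reward 1 at the
    rightmost. Parameters are common illustrative choices."""
    def __init__(self, S=6, seed=0):
        self.S = S
        self.rng = random.Random(seed)
        self.s = 0                         # start at the leftmost state

    def step(self, a):
        s = self.s
        if a == 0:                         # left: always succeeds
            self.s = max(s - 1, 0)
            reward = 0.005 if s == 0 else 0.0
        else:                              # right: succeeds with prob 0.6
            u = self.rng.random()
            if u < 0.6:
                self.s = min(s + 1, self.S - 1)
            elif u < 0.65:                 # pushed back with prob 0.05
                self.s = max(s - 1, 0)
            # otherwise stays in place
            reward = 1.0 if s == self.S - 1 and self.s == self.S - 1 else 0.0
        return self.s, reward
```

Because the large reward is only reachable by repeatedly swimming against the current, greedy-by-default algorithms without adequate exploration suffer linear regret here.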
We compare our algorithm against Optimistic QL (Wei et al., 2020), MDP-OOMD (Wei et al., 2020), and Politex (Abbasi-Yadkori et al., 2019a) as model-free algorithms, and UCRL2 (Jaksch et al., 2010) and PSRL (Ouyang et al., 2017) as model-based benchmarks. The hyperparameters for these algorithms are tuned to obtain the best performance (see Table 2 for more details). $\bar{J}_t$ is chosen as in (2) with an appropriate constant (see Table 2). We numerically verified that this choice of $\bar{J}_t$ satisfies Assumption 1. Figure 1 shows that in the RiverSwim environment, our algorithm significantly outperforms Optimistic QL, the only existing model-free algorithm with low regret for weakly communicating MDPs. The reason is that the proposed algorithm does not spend optimism on the entire $Q$-value function; rather, the optimism in the face of uncertainty principle is applied to a single scalar, $J^*$. Note that other model-free algorithms such as Politex and MDP-OOMD did not yield sub-linear regret in RiverSwim and are thus removed from the figure. This is because RiverSwim does not satisfy the ergodicity assumption required by these algorithms. Moreover, in both the RiverSwim and RandomMDP environments, our algorithm performs as well as the best existing model-based algorithms in practice, though with less memory.
We proposed EE-QL, the first model-free algorithm with an $\tilde{O}(\sqrt{T})$ regret bound for weakly communicating MDPs in the infinite-horizon average-reward setting. Our algorithm achieves strong numerical performance, significantly better than the existing model-free algorithms and similar to the best model-based algorithms, yet with less memory. The key to obtaining such performance is to avoid optimistic estimation of each entry of the $Q$-value function. Instead, EE-QL uses optimism for a single scalar: the gain of the optimal policy. Our algorithm assumes that a concentrating estimate of $J^*$ is available. This assumption is verified numerically for an optimistic empirical average-reward estimator. The theoretical verification of this assumption is left for future work.
- Abbasi-Yadkori et al. (2019a) Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702, 2019a.
- Abbasi-Yadkori et al. (2019b) Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced politex. arXiv preprint arXiv:1908.10479, 2019b.
- Abounadi et al. (2001) Jinane Abounadi, D Bertsekas, and Vivek S Borkar. Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40(3):681–698, 2001.
- Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
- Bartlett and Tewari (2009) Peter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
- Dong et al. (2019) Kefan Dong, Yuanhao Wang, Xiaoyu Chen, and Liwei Wang. Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. arXiv preprint arXiv:1901.09311, 2019.
- Fruit et al. (2018) Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1573–1581, 2018.
- Fruit et al. (2019) Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Improved analysis of ucrl2b, 2019. Available at rlgammazero.github.io/docs/ucrl2b_improved.pdf.
- Hao et al. (2020) Botao Hao, Nevena Lazic, Yasin Abbasi-Yadkori, Pooria Joulani, and Csaba Szepesvari. Provably efficient adaptive approximate policy iteration. arXiv preprint arXiv:2002.03069, 2020.
- Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
- Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
- Ortner (2018) Ronald Ortner. Regret bounds for reinforcement learning via markov chain concentration. arXiv preprint arXiv:1808.01813, 2018.
- Ouyang et al. (2017) Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342, 2017.
- Puterman (2014) Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Talebi and Maillard (2018) Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. In Algorithmic Learning Theory, pages 770–805, 2018.
- Wei et al. (2020) Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning, 2020.
- Zanette and Brunskill (2019) Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, 2019.
- Zhang and Ji (2019) Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, 2019.
Appendix A Proof of Lemma 2
We first prove the lemma for the case where $\bar{J}_t = J^*$ and then extend the proof to the general case. Let $\mathcal{T}$ be the operator on the space of $Q$-functions defined by
$$(\mathcal{T}Q)(s, a) := r(s, a) - J^* + \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a').$$
Note that $\mathcal{T}$ is a non-expansive operator because, for any $Q_1, Q_2$,
$$\big|(\mathcal{T}Q_1)(s, a) - (\mathcal{T}Q_2)(s, a)\big| \leq \sum_{s'} p(s' \mid s, a) \Big|\max_{a'} Q_1(s', a') - \max_{a'} Q_2(s', a')\Big| \leq \|Q_1 - Q_2\|_\infty,$$
where $\|\cdot\|_\infty$ denotes the sup norm. Thus, $\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty \leq \|Q_1 - Q_2\|_\infty$. Moreover, note that $q^*$ is a fixed point of $\mathcal{T}$ by the Bellman equation, i.e., $\mathcal{T}q^* = q^*$. For the case that $\bar{J}_t = J^*$, the $Q_t$ of the algorithm can be obtained by applying a sequence of these non-expansive operators. We then have