Reinforcement learning (RL) (Sutton and Barto, 2018)
is a field of machine learning that tackles the problem of learning how to act in an unknown dynamic environment. An agent interacts with the environment and receives feedback on its actions in the form of a state-dependent reward signal. Using this experience, the agent’s goal is to find a policy that maximizes the long-term reward.
There are two main approaches for learning such a policy: model-based and model-free. The model-based approach estimates the system’s model and uses it to assess the long-term effects of actions via full-planning (e.g., Jaksch et al. 2010). Model-based RL algorithms usually enjoy good performance guarantees, which are measured by their regret – the difference between the sum of rewards gained by playing an optimal policy and the sum of rewards that the agent accumulates (Jaksch et al., 2010; Bartlett and Tewari, 2009). Nevertheless, model-based algorithms suffer from high space and computation complexity. The first is caused by the need to store a model; the second is due to the frequent full-planning, which requires a full solution of the estimated model. Alternatively, model-free RL algorithms directly estimate quantities that take into account the long-term effect of an action, thus avoiding model estimation and planning operations altogether (Jin et al., 2018). These algorithms usually enjoy better computational and space complexity, but seem to have worse performance guarantees.
In many applications, the high computational complexity of model-based RL is infeasible. Thus, practical model-based approaches alleviate this computational burden by using short-term planning, e.g., Dyna (Sutton, 1991), instead of full-planning. To the best of our knowledge, there are no regret guarantees for such algorithms, even in the tabular setting. This raises the following question: Can a model-based approach coupled with short-term planning enjoy the favorable performance guarantees of full-planning model-based RL?
In this work, we show that model-based algorithms that use 1-step planning can achieve the same performance as algorithms that perform full-planning, thus answering the above question affirmatively. As Table 1 shows, this substantially narrows the computational complexity gap between the model-based and model-free approaches. To this end, we adapt the Real-Time Dynamic-Programming (RTDP) algorithm (Barto et al., 1995), which finds the optimal policy of a known model by acting greedily based on 1-step planning. We demonstrate how RTDP can be incorporated into model-based RL algorithms, and prove that the regret of the resulting algorithms remains unchanged, while their computational complexity drastically decreases.
The contributions of our paper are as follows: we first prove regret bounds for RTDP when the model is known. To do so, we establish some concentration results on Decreasing Bounded Processes, which are of independent interest. We then show that the regret bound translates into a Uniform Probably Approximately Correct (PAC) (Dann et al., 2017) bound for RTDP that greatly improves existing PAC results (Strehl et al., 2006). Next, we move to the learning problem, where the model is unknown. Based on the results developed for RTDP, and focusing on finite-horizon MDPs, we adapt UCRL2 (Jaksch et al., 2010) and EULER (Zanette and Brunskill, 2019), both of which act by full-planning, into UCRL2 with Greedy Policies (UCRL2-GP) and EULER with Greedy Policies (EULER-GP); model-based algorithms that act by 1-step planning. The adapted versions are shown to preserve the performance guarantees, while improving in terms of computational complexity.
Table 1: Regret, time complexity, and space complexity of UCRL2 (Jaksch et al., 2010), UCBVI (Azar et al., 2017), EULER (Zanette and Brunskill, 2019), and Q-v2 (Jin et al., 2018).
2 Notations and Definitions
We consider finite-horizon MDPs with time-independent dynamics (Bertsekas and Tsitsiklis, 1996). A finite-horizon MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, r, p, H)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces with cardinalities $S$ and $A$, respectively. The immediate reward for taking an action $a$ at state $s$ is a random variable with expectation $r(s,a)$. The transition probability is $p(s' \mid s, a)$, the probability of transitioning to state $s'$ upon taking action $a$ at state $s$. Furthermore, $\Gamma$ is the maximum number of non-zero transition probabilities across the entire state-action pairs. If this number is unknown to the designer of the algorithm in advance, then we set $\Gamma = S$. The initial state in each episode is arbitrarily chosen, and $H$ is the horizon, i.e., the number of time-steps in each episode. We define $[N] := \{1, \ldots, N\}$ for all $N \in \mathbb{N}$, and throughout the paper use $h$ and $k$ to denote a time-step inside an episode and the index of an episode, respectively.
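For concreteness, such a tabular finite-horizon MDP can be held in a small container; a minimal sketch (the class name and the two-state example are our own illustration, not from the paper):

```python
import numpy as np

class FiniteHorizonMDP:
    """Tabular finite-horizon MDP with S states, A actions, and horizon H.

    P[s, a] is the transition distribution p(.|s, a); R[s, a] is the
    expected immediate reward r(s, a)."""
    def __init__(self, P, R, H):
        self.P = np.asarray(P, dtype=float)   # shape (S, A, S)
        self.R = np.asarray(R, dtype=float)   # shape (S, A)
        self.S, self.A = self.R.shape
        self.H = H
        assert self.P.shape == (self.S, self.A, self.S)
        assert np.allclose(self.P.sum(axis=-1), 1.0)  # each P[s, a] is a distribution

# A two-state, two-action example: action 1 always moves to state 1, which pays reward 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0   # action 0 always leads to state 0
P[:, 1, 1] = 1.0   # action 1 always leads to state 1
R = np.array([[0.0, 0.0], [1.0, 1.0]])
mdp = FiniteHorizonMDP(P, R, H=3)
```

The same toy MDP is reused in the sketches below.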
A deterministic policy $\pi : \mathcal{S} \times [H] \to \mathcal{A}$ is a mapping from states and time-step indices to actions. We denote by $\pi_h(s)$ the action taken at time $h$ at state $s$ according to a policy $\pi$. The quality of a policy $\pi$ from state $s$ at time $h$ is measured by its value function, which is defined as
$$V^\pi_h(s) = \mathbb{E}\left[\sum_{h'=h}^{H} r\big(s_{h'}, \pi_{h'}(s_{h'})\big) \;\middle|\; s_h = s\right],$$
where the expectation is over the environment’s randomness. An optimal policy $\pi^*$ maximizes this value for all states $s$ and time-steps $h$, and the corresponding optimal value is denoted by $V^*_h(s)$ for all $h \in [H]$. The optimal value satisfies the optimal Bellman equation, i.e.,
$$V^*_h(s) = \max_{a \in \mathcal{A}} \Big\{ r(s,a) + \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, V^*_{h+1}(s') \Big\}, \qquad V^*_{H+1}(s) = 0.$$
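Full-planning solves the optimal Bellman equation by backward induction over the time-steps; a minimal NumPy sketch (our own illustration, not the paper’s pseudocode):

```python
import numpy as np

def backward_induction(P, R, H):
    """Compute the optimal values V*_h and a greedy optimal policy.

    P: (S, A, S) transition kernel, R: (S, A) expected rewards, H: horizon.
    Returns V of shape (H + 1, S) with V[H] = 0, and pi of shape (H, S)."""
    S, A = R.shape
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):       # sweep backwards: h = H-1, ..., 0
        Q = R + P @ V[h + 1]             # Q[s, a] = r(s,a) + sum_s' p(s'|s,a) V[h+1, s']
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, pi

# the two-state toy MDP: action 1 moves to state 1, which pays reward 1
P = np.zeros((2, 2, 2)); P[:, 0, 0] = 1.0; P[:, 1, 1] = 1.0
R = np.array([[0.0, 0.0], [1.0, 1.0]])
V, pi = backward_induction(P, R, H=3)
```

Note the cost: each backward sweep touches every state-action pair, which is exactly the per-episode burden that 1-step planning avoids.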
We consider an agent that repeatedly interacts with an MDP in a sequence of episodes $k = 1, \ldots, K$. The performance of the agent is measured by its regret, defined as $\mathrm{Regret}(K) := \sum_{k=1}^{K} \big( V^*_1(s^k_1) - V^{\pi_k}_1(s^k_1) \big)$. Throughout this work, the policy $\pi_k$ is computed by a 1-step planning operation with respect to the value function estimated by the algorithm at the end of episode $k-1$, denoted by $\bar{V}^{k-1}$. We also call such a policy a greedy policy. Moreover, $s^k_h$ and $a^k_h$ stand, respectively, for the state and the action taken at the $h$-th time-step of the $k$-th episode.
Next, we define the filtration $\mathcal{F}_k$ that includes all events (states, actions, and rewards) until the end of the $k$-th episode, as well as the initial state of the $(k+1)$-th episode. We denote by $T = KH$ the total number of time-steps (samples). Moreover, we denote by $n_k(s,a)$ the number of times that the agent has visited the state-action pair $(s,a)$, and by $\hat{X}_k$ the empirical average of a random variable $X$. Both quantities are based on experience gathered until the end of the $(k-1)$-th episode and are $\mathcal{F}_{k-1}$ measurable. We also define the probability to visit the state-action pair $(s,a)$ at the $k$-th episode at time-step $h$ by $p^k_h(s,a) := \Pr\big(s^k_h = s,\, a^k_h = a \mid \mathcal{F}_{k-1}\big)$. We note that $\pi_k$ is $\mathcal{F}_{k-1}$ measurable, and thus so is $p^k_h(s,a)$.
We use $\tilde{O}(X)$ to refer to a quantity that equals $X$ up to a poly-log expression of a quantity at most polynomial in $S$, $A$, $T$, $K$, $H$, and $\delta^{-1}$. Similarly, $\lesssim$ represents $\leq$ up to numerical constants or poly-log factors. We define $pV := \sum_{s'} p(s') V(s')$, where $p$ is a probability distribution over the domain of $V$. Lastly, $\Delta_{\mathcal{S}}$ is the set of probability distributions over the state space $\mathcal{S}$.
3 Real-Time Dynamic Programming
RTDP (Barto et al., 1995) is a well-known algorithm that solves an MDP when a model of the environment is given. Unlike, e.g., Value Iteration (VI) (Bertsekas and Tsitsiklis, 1996), which solves an MDP by offline calculations, RTDP solves an MDP in a real-time manner. As mentioned in Barto et al. (1995), RTDP can be interpreted as an asynchronous VI adapted into a real-time algorithm.
Algorithm 1 contains the pseudocode of RTDP for finite-horizon MDPs. The value function is initialized with an optimistic value, i.e., an upper bound of the optimal value. At each time-step $h$ of episode $k$, the agent acts from the current state $s^k_h$ greedily with respect to the current value at the next time-step, $\bar{V}_{h+1}$. It then updates the value of $(s^k_h, h)$ according to the optimal Bellman operator. We denote by $\bar{V}$ the value function of RTDP and, as we show in the following, it always upper bounds $V^*$. Note that since the action at a fixed state is chosen according to $\bar{V}$, the policy $\pi_k$ is $\mathcal{F}_{k-1}$ measurable.
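A minimal sketch of one RTDP episode under a known model (the tie-breaking rule and the toy MDP are our own choices, not the paper’s exact pseudocode):

```python
import numpy as np

def rtdp_episode(P, R, H, V, s0, rng):
    """One RTDP episode on a known model: act greedily with respect to the
    current value at the next time-step, then apply a Bellman update at the
    visited state only. V has shape (H + 1, S), is initialized optimistically
    (here V[h] = H - h), and only decreases over episodes."""
    s = s0
    for h in range(H):
        Q = R[s] + P[s] @ V[h + 1]             # 1-step lookahead from the current state
        a = int(Q.argmax())                    # greedy action (first index on ties)
        V[h, s] = Q.max()                      # Bellman update at (s, h) only
        s = rng.choice(P.shape[0], p=P[s, a])  # environment transition
    return V

rng = np.random.default_rng(0)
# the two-state toy MDP: action 1 moves to state 1, which pays reward 1
P = np.zeros((2, 2, 2)); P[:, 0, 0] = 1.0; P[:, 1, 1] = 1.0
R = np.array([[0.0, 0.0], [1.0, 1.0]])
H = 3
V = np.array([[float(H - h)] * 2 for h in range(H + 1)])  # optimistic initialization
for _ in range(10):
    V = rtdp_episode(P, R, H, V, s0=0, rng=rng)
```

On this toy MDP the value at the initial state converges to the optimal value within a few episodes, while state-time pairs off the greedy trajectories keep their optimistic initialization; each step costs only one 1-step backup rather than a full sweep.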
Since RTDP is an online algorithm, i.e., it updates its value estimates through interactions with the environment, it is natural to measure its performance in terms of the regret. The rest of this section is devoted to supplying expected and high-probability bounds on the regret of RTDP, which will also lead to PAC bounds for this algorithm. In Section 4, based on the observations from this section, we will establish minimax regret bounds for 1-step greedy model-based RL.
We start by stating two basic properties of RTDP in the following lemma: the value is always optimistic and decreases in $k$ (see proof in Appendix B). Although the first property is known (Barto et al., 1995), to the best of our knowledge, the second one has not been proven in previous work.
Lemma 1. For all $k$, $h$, and $s$, it holds that (i) $\bar{V}^k_h(s) \ge V^*_h(s)$ and (ii) $\bar{V}^{k+1}_h(s) \le \bar{V}^k_h(s)$.
The following lemma, that we believe is new, relates the difference between the optimistic value and the real value to the expected cumulative update of the value function at the end of the episode (see proof in Appendix B).
Lemma 2 (Value Update for Exact Model).
The expected cumulative value update of RTDP at the $k$-th episode satisfies
The result relates the difference between the optimistic value and the value of the greedy policy to the expected update along the trajectory created by following $\pi_k$. Thus, for example, if the optimistic value is overestimated, then the value update throughout this episode is expected to be large.
3.1 Regret and PAC Analysis
Using Lemma 1, we observe that the sequence of values is decreasing and bounded from below. Thus, intuitively, the decrements of the values cannot be indefinitely large. Importantly, Lemma 2 states that when the expected decrements of the values are small, then $V^{\pi_k}$ is close to $\bar{V}^k$, and thus, to $V^*$, since the former upper bounds the latter.
Building on this reasoning, we are led to establish a general result on a decreasing process. This result will allow us to formally justify the aforementioned reasoning and derive regret bounds for RTDP. The proof utilizes self-normalized concentration bounds (de la Peña et al., 2007), applied to martingales, and can be found in Appendix A.
Definition 1 (Decreasing Bounded Process).
We call a random process $(X_n, \mathcal{F}_n)_{n \ge 0}$, where $(\mathcal{F}_n)_{n \ge 0}$ is a filtration and $X_n$ is adapted to this filtration, a Decreasing Bounded Process if it satisfies the following properties:
1. $X_n$ decreases, i.e., $X_{n+1} \le X_n$ a.s.
2. $X_0 \le C$ and $X_n \ge 0$ for all $n$, a.s.
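For intuition, here is a toy instance of such a process (the dynamics are our own choice). The key observation behind the results below is that the total decrease telescopes and is therefore bounded by the initial value, no matter how many rounds elapse:

```python
import numpy as np

rng = np.random.default_rng(1)
C = 10.0                  # the bound on the process
X = [C]
for n in range(1000):
    # toy Decreasing Bounded Process: subtract a random non-negative amount,
    # clipped so that the process never drops below zero
    X.append(max(X[-1] - rng.exponential(0.05), 0.0))
X = np.array(X)

# the total decrease telescopes: sum of decrements = X_0 - X_N <= C
decrements = X[:-1] - X[1:]
```

The concentration results of this section sharpen this deterministic telescoping argument into high-probability statements about the conditional expected decrements.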
Theorem 3 (Regret Bound of a Decreasing Bounded Process).
Let $(X_n, \mathcal{F}_n)_{n \ge 0}$ be a Decreasing Bounded Process and let $R_N$ be its $N$-round regret. Then,
Specifically, it holds that
We are now ready to prove the central result of this section, the expected and high-probability regret bounds on RTDP (see full proof in Appendix B).
Theorem 4 (Regret Bounds for RTDP).
The following regret bounds hold for RTDP:
For any $\delta \in (0,1)$, with probability at least $1-\delta$, for all $K \in \mathbb{N}$,
Theorem 4 exhibits a regret bound that does not depend on the number of episodes $K$. While it is expected that RTDP, which has access to the exact model, would achieve better performance than an RL algorithm with no such access, a regret bound independent of $K$ is a noteworthy result. Indeed, it leads to the following Uniform PAC (see Dann et al. 2017 for the definition) and PAC guarantees for RTDP (see proofs in Appendix B). To the best of our knowledge, both are the first PAC guarantees for RTDP. (Existing PAC results on RTDP analyze variations of RTDP in which the accuracy level $\epsilon$ is an input parameter of the algorithm.)
Corollary 5 (RTDP is Uniform PAC).
Let $\epsilon > 0$ and let $N_\epsilon$ be the number of episodes in which RTDP outputs a policy with $V^*_1(s^k_1) - V^{\pi_k}_1(s^k_1) > \epsilon$. Then,
Corollary 6 (RTDP is PAC).
Let $\delta > 0$ and let $N$ be the number of episodes in which RTDP outputs a non-optimal policy. Define the (unknown) gap of the MDP, $\Delta$. Then,
4 Exploration in Model-based RL: Greedy Policy Achieves Minimax Regret
We start this section by formulating a general optimistic RL scheme that acts by 1-step planning (see Algorithm 2). Then, we establish Lemma 7, which generalizes Lemma 2 to the case where a non-exact model is used for the value updates. Using this lemma, we offer a novel regret decomposition for algorithms that follow Algorithm 2. Based on this decomposition, we analyze generalizations of UCRL2 (Jaksch et al., 2010) (for finite-horizon MDPs) and EULER (Zanette and Brunskill, 2019) that use greedy policies instead of solving an MDP (full-planning) at the beginning of each episode. Surprisingly, we find that neither generalized algorithm suffers from performance degradation, up to numerical constants and logarithmic factors. Thus, we conclude that there exists an RL algorithm that achieves the minimax regret bound while acting according to greedy policies.
Consider the general RL scheme that explores by greedy policies, as depicted in Algorithm 2. The value is initialized optimistically, and the algorithm interacts with the unknown environment in an episodic manner. At each time-step $h$, a greedy policy from the current state, $s^k_h$, is calculated optimistically based on the empirical model and the current value at the next time-step, $\bar{V}_{h+1}$. This is done in a subroutine called ‘ModelBaseOptimisticQ’. We further assume the optimistic $Q$-function has the form $\bar{Q}_h(s,a) = \tilde{r}(s,a) + \tilde{p}(\cdot \mid s,a)\bar{V}_{h+1}$ and refer to $(\tilde{r}, \tilde{p})$ as the optimistic model. The agent interacts with the environment based on the greedy policy with respect to $\bar{Q}$ and uses the gathered experience to update the empirical model at the end of the episode.
By construction of the update rule (see Line 7), the value is a decreasing function of $k$, for all $(s, h)$. Thus, property (ii) in Lemma 1 holds for Algorithm 2. Furthermore, the algorithms analyzed in this section will also be optimistic with high probability, i.e., property (i) in Lemma 1 also holds. Finally, since the value update uses the empirical quantities from the previous episode, the policy $\pi_k$ is still $\mathcal{F}_{k-1}$ measurable.
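One step of this scheme can be sketched as follows. The Hoeffding-style bonus below is a generic stand-in for the UCRL2 / EULER bonuses analyzed in the paper, and all names are our own:

```python
import numpy as np

def optimistic_greedy_step(s, h, V, n, r_hat, p_hat, delta=0.1):
    """One step of the greedy optimistic scheme at state s, time-step h.

    n[s, a] are visit counts and (r_hat, p_hat) is the empirical model. The
    bonus is a generic Hoeffding-style term (our own simplification of the
    UCRL2 / EULER bonuses). Returns the greedy action; mutates V in place."""
    H = V.shape[0] - 1
    bonus = (H - h) * np.sqrt(np.log(2.0 / delta) / np.maximum(n[s], 1))
    Q = r_hat[s] + bonus + p_hat[s] @ V[h + 1]   # optimistic 1-step lookahead
    a = int(Q.argmax())
    V[h, s] = min(V[h, s], Q.max())              # update keeps the value non-increasing
    return a

# toy usage: 2 states, 2 actions, horizon 2
V = np.array([[2.0, 2.0], [1.0, 1.0], [0.0, 0.0]])
n = np.array([[5, 1], [3, 3]])
r_hat = np.array([[0.1, 0.9], [0.5, 0.5]])
p_hat = np.full((2, 2, 2), 0.5)
a = optimistic_greedy_step(s=0, h=0, V=V, n=n, r_hat=r_hat, p_hat=p_hat)
```

The `min` in the update rule is what makes property (ii) of Lemma 1 hold by construction, regardless of the bonus used.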
Lemma 7 (Value Update for Optimistic Model).
The expected cumulative value update of Algorithm 2 in the $k$-th episode is bounded by
In the rest of the section, we consider two instantiations of the subroutine ‘ModelBaseOptimisticQ’ in Algorithm 2. We use the bonus terms of UCRL2 and of EULER to acquire an optimistic $Q$-function. These two options then lead to the UCRL2 with Greedy Policies (UCRL2-GP) and EULER with Greedy Policies (EULER-GP) algorithms.
4.1 UCRL2 with Greedy Policies for Finite-Horizon MDPs
We form the optimistic local model based on the confidence set of UCRL2 (Jaksch et al., 2010). This amounts to using Algorithm 3 as the subroutine ‘ModelBaseOptimisticQ’ in Algorithm 2. The maximization problem on Line 3 of Algorithm 3 is common when using a bonus based on an optimistic model (Jaksch et al., 2010), and it can be solved efficiently (e.g., Strehl and Littman 2008, Section 3.1.5). A full version of the algorithm can be found in Appendix D.
Thus, Algorithm 3 saves the need to perform Extended Value Iteration (Jaksch et al., 2010), whose per-episode cost is larger by an extra factor of $S$. Despite the significant improvement in terms of computational complexity, the regret of UCRL2-GP is similar to that of UCRL2 (Jaksch et al., 2010), as the following theorem formalizes (see proof in Appendix D).
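The inner maximization over the L1 confidence ball admits a simple procedure (cf. Strehl and Littman 2008): pour the allowed extra mass onto the highest-value state and remove the same amount from the lowest-value states. A sketch of this standard trick (our own implementation; with a sort it costs O(S log S), or O(S) given a precomputed ordering of the value function):

```python
import numpy as np

def max_expectation_l1_ball(p_hat, V, eps):
    """Maximize sum_s p(s) V(s) over distributions p with ||p - p_hat||_1 <= eps."""
    p = p_hat.astype(float).copy()
    best = int(np.argmax(V))
    p[best] = min(1.0, p[best] + eps / 2.0)   # pour mass onto the best state
    excess = p.sum() - 1.0                    # mass to remove elsewhere
    for s in np.argsort(V):                   # lowest-value states first
        if excess <= 0:
            break
        if s == best:
            continue
        take = min(p[s], excess)
        p[s] -= take
        excess -= take
    return p

p_hat = np.array([0.5, 0.3, 0.2])
V = np.array([0.0, 1.0, 2.0])
p_opt = max_expectation_l1_ball(p_hat, V, eps=0.2)
```

By construction the output stays in the simplex, moves at most eps of total variation, and weakly increases the expected value relative to the empirical model.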
Theorem 8 (Regret Bound of UCRL2-GP).
For any time $T$, with probability at least $1-\delta$, the regret of UCRL2-GP is bounded by
Thus, the regret is upper bounded by two terms. As in Theorem 4, by applying Lemma 11 (Appendix A), the first term in (4) is a sum of Decreasing Bounded Processes, and can thus be bounded by . The presence of the second term in (4) is common in recent regret analyses (e.g., Dann et al. 2017). Using standard techniques (Jaksch et al., 2010; Dann et al., 2017; Zanette and Brunskill, 2019), this term can be bounded (up to additive constant factors) with high probability by . ∎
4.2 EULER with Greedy Policies
In this section, we use bonus terms as in EULER (Zanette and Brunskill, 2019). Similarly to the previous section, this amounts to replacing the subroutine ‘ModelBaseOptimisticQ’ in Algorithm 2 with a subroutine based on the bonus terms from Zanette and Brunskill (2019). Algorithm 5 in Appendix E contains the pseudocode of the algorithm. The bonus terms in EULER are based on the empirical Bernstein inequality and on tracking both an upper bound and a lower bound on the optimal value. Using these, EULER achieves both minimax optimal and problem-dependent regret bounds.
EULER (Zanette and Brunskill, 2019) performs computations per episode (same as the VI algorithm), while EULER-GP requires only . Despite this advantage in computational complexity, EULER-GP exhibits similar minimax regret bounds to EULER (see proof in Appendix E), much like the equivalent performance of UCRL2 and UCRL2-GP proved in Section 4.1.
Theorem 9 (Regret Bound of EULER-GP).
Let $G$ be an upper bound on the total reward collected within an episode. With probability at least $1-\delta$, jointly for all episodes $K$, the regret of EULER-GP is bounded by
5 Related Work
Real-Time Dynamic Programming: RTDP (Barto et al., 1995) has been extensively used and has many variants that exhibit superior empirical performance (e.g., Bonet and Geffner 2003; McMahan et al. 2005; Smith and Simmons 2006). For planning in discounted MDPs, Strehl et al. (2006) proved PAC bounds for a modified version of the algorithm that updates the value only if the update size is larger than $\epsilon$. Bonet and Geffner (2003) proved a convergence rate for Labeled RTDP in the stochastic shortest path problem, with an equivalent rate in finite-horizon MDPs. Nevertheless, these algorithms explicitly use $\epsilon$ to mark states with an accurate value estimate. We prove a convergence rate for RTDP without knowing $\epsilon$. Indeed, Strehl et al. (2006) posed whether the original RTDP is PAC as an open problem. Furthermore, no regret bound for RTDP has been reported in the literature.
Regret bounds for RL: The most renowned algorithms with regret guarantees for undiscounted infinite-horizon MDPs are UCRL2 (Jaksch et al., 2010) and REGAL (Bartlett and Tewari, 2009), which have been extended throughout the years (e.g., by Fruit et al. 2018; Talebi and Maillard 2018). Recently, there has been increasing interest in regret bounds for MDPs with finite horizon and stationary dynamics. In this scenario, Azar et al. (2017) proposed UCBVI, whose regret bound improves upon UCRL2’s and is asymptotically tight (Osband and Van Roy, 2016). Dann et al. (2018) presented ORLC, which achieves tight regret bounds and (nearly) tight PAC guarantees for non-stationary MDPs. Finally, Zanette and Brunskill (2019) proposed EULER, an algorithm that enjoys tight minimax regret bounds and has additional problem-dependent bounds that encapsulate the MDP’s complexity. All of these algorithms are model-based and require full-planning. Model-free RL was analyzed by Jin et al. (2018); there, the authors exhibit regret bounds that are worse by a factor of $\sqrt{H}$ relative to the lower bound. To the best of our knowledge, there are no model-based algorithms with regret guarantees that avoid full-planning. It is worth noting that while all the above algorithms, and the ones in this work, rely on the Optimism in the Face of Uncertainty principle (Lai and Robbins, 1985), there also exist Thompson-Sampling model-based RL algorithms (Osband et al., 2013; Gopalan and Mannor, 2015; Agrawal and Jia, 2017; Osband and Van Roy, 2017). There, a model is sampled from a distribution over models, on which full-planning takes place.
Greedy policies in model-based RL: By adjusting RTDP to the case where the model is unknown, Strehl et al. (2012) formulated model-based RL algorithms that act using a greedy policy. They proved a sample complexity bound for discounted MDPs. To the best of our knowledge, there are no regret bounds for model-based RL algorithms that act by greedy policies.
Practical model-based RL: Due to the high computational complexity of planning in model-based RL, most practical algorithms are model-free (e.g., Mnih et al. 2015). Algorithms that do use a model usually take advantage of only local information. For example, Dyna (Sutton, 1991; Peng et al., 2018) randomly samples state-action pairs and updates them according to a local model. Other papers use the local model to plan for a short horizon from the current state (Tamar et al., 2016; Hafner et al., 2018). The performance of such algorithms depends heavily on the planning horizon, which in turn dramatically increases the computational complexity.
6 Conclusions and Future Work
In this work, we established that tabular model-based RL algorithms can explore by 1-step planning instead of full-planning, without suffering from performance degradation. Specifically, exploring with model-based greedy policies can be minimax optimal in terms of regret. Differently put, the variance caused by exploring with greedy policies is smaller than the variance caused by learning a sufficiently good model. Indeed, the extra term that appears due to greedy exploration (e.g., the first term in (4)) is a constant term, smaller than the existing constant terms of UCRL2 and EULER.
This work raises and highlights some interesting research questions. The obvious ones are extensions to average-reward and discounted MDPs, as well as to Thompson-sampling-based RL algorithms. Although these scenarios are harder or different in terms of analysis, we believe this work introduces the relevant approach to tackle these questions. Another interesting question is the applicability of the results in large-scale problems, where tabular representation is infeasible and approximation must be used. There, algorithms that act using lookahead policies, instead of 1-step planning, are expected to yield better performance, as they are less sensitive to value approximation errors (e.g., Bertsekas and Tsitsiklis 1996; Jiang et al. 2018; Efroni et al. 2018b, a). Even then, full-planning, as opposed to short-horizon planning, might be unnecessary. Lastly, establishing whether the model-based approach is or is not provably better than the model-free approach, as the current state of the literature suggests, remains an important unsolved open problem.
We thank Oren Louidor for illuminating discussions regarding Decreasing Bounded Processes, and Esther Derman for very helpful comments.
- Agrawal and Jia  Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
- Azar et al.  Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
- Bartlett and Tewari  Peter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
- Barto et al.  Andrew G Barto, Steven J Bradtke, and Satinder P Singh. Learning to act using real-time dynamic programming. Artificial intelligence, 72(1-2):81–138, 1995.
- Bertsekas and Tsitsiklis  Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming, volume 5. Athena Scientific Belmont, MA, 1996.
- Bonet and Geffner  Blai Bonet and Hector Geffner. Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS, volume 3, pages 12–21, 2003.
- Dann et al.  Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
- Dann et al.  Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.
- de la Peña et al.  Victor H de la Peña, Michael J Klass, Tze Leung Lai, et al. Pseudo-maximization and self-normalized processes. Probability Surveys, 4:172–192, 2007.
- de la Peña et al.  Victor H de la Peña, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008.
- Efroni et al. [2018a] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. How to combine tree-search methods in reinforcement learning. arXiv preprint arXiv:1809.01843, 2018a.
- Efroni et al. [2018b] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Multiple-step greedy policies in approximate and online reinforcement learning. In Advances in Neural Information Processing Systems, pages 5238–5247, 2018b.
- Fruit et al.  Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020, 2018.
- Gopalan and Mannor  Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pages 861–898, 2015.
- Hafner et al.  Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
- Jaksch et al.  Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Jiang et al.  Daniel R Jiang, Emmanuel Ekwedike, and Han Liu. Feedback-based tree search for reinforcement learning. arXiv preprint arXiv:1805.05935, 2018.
- Jin et al.  Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
- Lai and Robbins  Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
- Maurer and Pontil  Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
- McMahan et al.  H Brendan McMahan, Maxim Likhachev, and Geoffrey J Gordon. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In Proceedings of the 22nd International Conference on Machine Learning, pages 569–576. ACM, 2005.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Osband and Van Roy  Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
- Osband and Van Roy  Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2701–2710. JMLR. org, 2017.
- Osband et al.  Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
- Peng et al.  Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Kam-Fai Wong, and Shang-Yu Su. Deep dyna-q: Integrating planning for task-completion dialogue policy learning. arXiv preprint arXiv:1801.06176, 2018.
- Smith and Simmons  Trey Smith and Reid Simmons. Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In AAAI, pages 1227–1232, 2006.
- Strehl and Littman  Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
- Strehl et al.  Alexander L Strehl, Lihong Li, and Michael L Littman. PAC reinforcement learning bounds for RTDP and Rand-RTDP. In Proceedings of the AAAI Workshop on Learning for Search, 2006.
- Strehl et al.  Alexander L Strehl, Lihong Li, and Michael L Littman. Incremental model-based learners with formal learning-time guarantees. arXiv preprint arXiv:1206.6870, 2012.
- Sutton  Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
- Sutton and Barto  Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Talebi and Maillard  Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. arXiv preprint arXiv:1803.01626, 2018.
- Tamar et al.  Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.
- Weissman et al.  Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
- Zanette and Brunskill  Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.
Appendix A Proofs on Decreasing Bounded Processes
In this section, we state and prove useful results on Decreasing Bounded Processes (see Definition 1). These results will be in use in proofs of the central theorems of this work.
Without loss of generality, assume $C > 0$, since otherwise the results are trivial. We start by remarking that $R_N$ is almost surely monotonically increasing, since its increments are non-negative. Define the martingale difference process
and the martingale process . Since almost surely, can be bounded by . Also define the quadratic variations as and . Next, recall Theorem 2.7 of de la Peña et al. (2007):
Let and be two random variables, such that for all , we have
In (*), we used the fact that a.s., which allows us to conclude that the cross-term is non-positive. In (**), we bounded . We can also bound , since each of the summands is a.s. non-negative, and thus,
Combining all of the above bounds yields
Finally, we can bound by
Combining everything we obtain
Alternatively, substituting into (6), we have
Next, notice that for , the function is monotonically increasing for any :
Moreover, for ,
where the inequality holds since . Thus, if , then , and we can bound the probability that by
and setting , we obtain
We remark that since is monotonically increasing a.s., this bound also implies that
To obtain a uniform bound, that is, a bound that holds for all $N$ simultaneously, note that the random sequence is monotonically increasing in $N$ and bounded. Thus, due to monotone convergence