Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies

05/27/2019 ∙ by Yonathan Efroni, et al. ∙ 0

State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full-planning on Markov Decision Processes (MDPs) built by the gathered experience. In this paper, we focus on model-based RL in the finite-state finite-horizon MDP setting and establish that exploring with greedy policies -- act by 1-step planning -- can achieve tight minimax performance in terms of regret, Õ(√(HSAT)). Thus, full-planning in model-based RL can be avoided altogether without any performance degradation, and, by doing so, the computational complexity decreases by a factor of S. The results are based on a novel analysis of real-time dynamic programming, then extended to model-based RL. Specifically, we generalize existing algorithms that perform full-planning to such that act by 1-step planning. For these generalizations, we prove regret bounds with the same rate as their full-planning counterparts.



There are no comments yet.


This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Reinforcement learning (RL) (Sutton and Barto, 2018)

is a field of machine learning that tackles the problem of learning how to act in a

unknown dynamic environment. An agent interacts with the environment, and receives feedback on its actions, in the form of a state-dependent reward signal. Using this experience, the agent’s goal is then to find a policy that maximizes the long-term reward.

There are two main approaches for learning such a policy: model-based and model-free. The model-based approach estimates the system’s model and uses it to assess the long-term effects of actions via

full-planning (e.g., Jaksch et al. 2010). Model-based RL algorithms usually enjoy good performance guarantees, which are measured by their regret – the difference between the sum of rewards gained by playing an optimal policy and the sum of rewards that the agent accumulates (Jaksch et al., 2010; Bartlett and Tewari, 2009). Nevertheless, model-based algorithms suffer from high space and computation complexity. The first is caused by the need for storing a model. The second is due to the frequent full-planning, which requires a full solution of the estimated model. Alternatively, model-free RL algorithms directly estimate quantities that take into account the long-term effect of an action, thus, avoiding model estimation and planning operations altogether (Jin et al., 2018). These algorithms usually enjoy better computational and space complexity, but seem to have worse performance guarantees.

In many applications, the high computational complexity of model-based RL is infeasible. Thus, practical model-based approaches alleviate this computational burden by using short-term planning e.g., Dyna (Sutton, 1991), instead of full-planning. To the best of our knowledge, there are no regret guarantees for such algorithms, even in the tabular setting. This raises the following question: Can model-based approach coupled with short-term planning enjoy the favorable performance of model-based RL?

In this work, we show that model-based algorithms that use 1-step planning can achieve the same performance as algorithms that perform full-planning, and thus, answering affirmatively to the above question. As Table 1 shows, this decreases the computational complexity gap between model-based and model-free approaches by a factor of . To this end, we adapt the Real-Time Dynamic-Programming (RTDP) algorithm (Barto et al., 1995) that finds the optimal policy of a known model by acting greedily based on 1-step planning. We demonstrate how RTDP can be incorporated into model-based RL algorithms, and prove that the regret of the resulting algorithms remains unchanged, while their computational complexity drastically decreases.

The contributions of our paper are as follows: we first prove regret bounds for RTDP when the model is known. To do so, we establish some concentration results on Decreasing Bounded Processes, which are of independent interest. We then show that the regret bound translates into a Uniform Probably Approximately Correct (PAC)

(Dann et al., 2017) bound for RTDP that greatly improves existing PAC results (Strehl et al., 2006). Next, we move to the learning problem, where the model is unknown. Based on the results developed for RTDP, and focusing on finite-horizon MDPs, we adapt UCRL2 (Jaksch et al., 2010) and EULER (Zanette and Brunskill, 2019), both act by full-planning, to UCRL2 with Greedy Policies (UCRL2-GP) and EULER with Greedy Policies (EULER-GP);model-based algorithms that act by 1-step planning. The adapted versions are shown to preserve the performance guarantees, while improve in terms of computational complexity.

Algorithm Regret Time Complexity Space Complexity
UCRL2 (Jaksch et al., 2010)
UCBVI (Azar et al., 2017)
EULER (Zanette and Brunskill, 2019)
Q-v2 (Jin et al., 2018)

Lower bounds
Table 1: Comparison of our bounds with several state-of-the-art bounds for RL in tabular finite-horizon MDPs. The time complexity of the algorithms is per episode, and are the sizes of the state and action sets, respectively, is the horizon of the MDP, and is the total number of samples that the algorithm gathers. Lastly, is the maximum number of non-zero transition probabilities across the entire state-action pairs. The algorithms proposed in this paper are highlighted in gray.

2 Notations and Definitions

We consider finite-horizon MDPs with time-independent dynamics (Bertsekas and Tsitsiklis, 1996). A finite-horizon MDP is defined by the tuple , where and are the state and action spaces with cardinalities and , respectively. The immediate reward for taking an action at state

is a random variable

with expectation . The transition probability is , the probability of transitioning to state upon taking action at state . Furthermore, is the maximum number of non-zero transition probabilities across the entire state-action pairs. If this number is unknown to the designer of the algorithm in advanced, then we set . The initial state in each episode is arbitrarily chosen and is the horizon, i.e., the number of time-steps in each episode. We define for all , and throughout the paper use and to denote time-step inside an episode and the index of an episode, respectively.

A deterministic policy is a mapping from states and time-step indices to actions. We denote by , the action taken at time at state according to a policy . The quality of a policy from state at time is measured by its value function, which is defined as

where the expectation is over the environment’s randomness. An optimal policy maximizes this value for all states and time-steps , and the corresponding optimal value is denoted by for all . The optimal value satisfies the optimal Bellman equation, i.e.,


We consider an agent that repeatedly interacts with an MDP in a sequence of episodes . The performance of the agent is measured by its regret, defined as . Throughout this work, the policy is computed by a 1-step planning operation with respect to the value function estimated by the algorithm at the end of episode , denoted by . We also call such policy a greedy policy. Moreover, and stand, respectively, for the state and the action taken at the time-step of the episode.

Next, we define the filtration that includes all events (states, actions, and rewards) until the end of the episode, as well as the initial state of the episode . We denote by , the total number of time-steps (samples). Moreover, we denote by , the number of times that the agent has visited state-action pair , and by , the empirical average of a random variable . Both quantities are based on experience gathered until the end of the episode and are measurable. We also define the probability to visit the state-action pair at the episode at time-step by . We note that is measurable, and thus, . Also denote .

We use to refer to a quantity that depends on up to poly-log expression of a quantity at most polynomial in , , , , , and . Similarly, represents up to numerical constants or poly-log factors. We define , where

is a probability distribution over the domain of

, and use . Lastly, is the set of probability distributions over the state space .

3 Real-Time Dynamic Programming

  Initialize: .
  for  do
     for  do
        Act with and observe .
     end for
  end for
Algorithm 1 Real-Time Dynamic Programming

RTDP (Barto et al., 1995) is a well-known algorithm that solves an MDP when a model of the environment is given. Unlike, e.g., Value Iteration (VI) (Bertsekas and Tsitsiklis, 1996) that solves an MDP by offline calculations, RTDP solves an MDP in a real-time manner. As mentioned in Barto et al. (1995), RTDP can be interpreted as an asynchronous VI adjusted to a real-time algorithm.

Algorithm 1 contains the pseudocode of RTDP for finite-horizon MDPs. The value function is initialized with an optimistic value, i.e., an upper bound of the optimal value. At each time-step and episode , the agent acts from the current state greedily with respect to the current value at the next time step, . It then updates the value of according to the optimal Bellman operator. We denote by , the value function, and as we show in the following, it always upper bounds . Note that since the action at a fixed state is chosen according to , then is measurable.

Since RTDP is an online algorithm, i.e., it updates its value estimates through interactions with the environment, it is natural to measure its performance in terms of the regret. The rest of this section is devoted to supplying expected and high-probability bounds on the regret of RTDP, which will also lead to PAC bounds for this algorithm. In Section 4, based on the observations from this section, we will establish minimax regret bounds for 1-step greedy model-based RL.

We start by stating two basic properties of RTDP in the following lemma: the value is always optimistic and decreases in (see proof in Appendix B). Although the first property is known (Barto et al., 1995), to the best of our knowledge, the second one has not been proven in previous work.

Lemma 1.

For all , , and , it holds that (i) and (ii) .

The following lemma, that we believe is new, relates the difference between the optimistic value and the real value to the expected cumulative update of the value function at the end of the episode (see proof in Appendix B).

Lemma 2 (Value Update for Exact Model).

The expected cumulative value update of RTDP at the episode satisfies

The result relates the difference of the optimistic value and the value of the greedy policy to the expected update along the trajectory, created by following . Thus, for example, if the optimistic value is overestimated, then the value update throughout this episode is expected to be large.

3.1 Regret and PAC Analysis

Using Lemma 1, we observe that the sequence of values is decreasing and bounded from below. Thus, intuitively, the decrements of the values cannot be indefinitely large. Importantly, Lemma 2 states that when the expected decrements of the values are small, then is close to , and thus, to , since the former upper bounds the latter.

Building on this reasoning, we are led to establish a general result on a decreasing process. This result will allow us to formally justify the aforementioned reasoning and derive regret bounds for RTDP. The proof utilizes self-normalized concentration bounds (de la Peña et al., 2007), applied on martingales, and can be found in Appendix A.

Definition 1 (Decreasing Bounded Process).

We call a random process , where is a filtration and is adapted to this filtration, a Decreasing Bounded Process, if it satisfies the following properties:

  1. decreases, i.e., a.s. .

  2. and for all a.s. .

Theorem 3 (Regret Bound of a Decreasing Bounded Process).

Let be a Decreasing Bounded Process and be its -round regret. Then,

Specifically, it holds that

We are now ready to prove the central result of this section, the expected and high-probability regret bounds on RTDP (see full proof in Appendix B).

Theorem 4 (Regret Bounds for RTDP).

The following regret bounds hold for RTDP:

  1. For any , with probability , for all ,

Proof Sketch.

We give a sketch of the proof of the second claim. Applying Lemmas 1 and then 2,


We then establish (see Lemma 34) that RHS of (2) is, in fact, a sum of Decreasing Bounded Processes, i.e.,


Since for any fixed , is a decreasing process by Lemma 1, we can use Theorem 3, for a fixed , and conclude the proof by applying the union bound on all terms in (3). ∎

Theorem 4 exhibits a regret bound that does not depend on . While it is expected that RTDP, that has access to the exact model, would achieve better performance than an RL algorithm with no such access, a regret bound independent of is a noteworthy result. Indeed, it leads to the following Uniform PAC (see Dann et al. 2017 for the definition) and PAC guarantees for RTDP (see proofs in Appendix B). To the best of our knowledge, both are the first PAC guarantees for RTDP.111Existing PAC results on RTDP analyze variations of RTDP in which is an input parameter of the algorithm.

Corollary 5 (RTDP is Uniform PAC).

Let and be the number of episodes in which RTDP outputs a policy with . Then,

Corollary 6 (RTDP is Pac).

Let and be the number of episodes in which RTDP outputs a non optimal policy. Define the (unknown) gap of the MDP, Then,

4 Exploration in Model-based RL: Greedy Policy Achieves Minimax Regret

We start this section by formulating a general optimistic RL scheme that acts by 1-step planning (see Algorithm 2). Then, we establish Lemma 7, which generalizes Lemma 2 to the case where a non-exact model is used for the value updates. Using this lemma, we offer a novel regret decomposition for algorithms which follow Algorithm 2. Based on the decomposition, we analyze generalizations of UCRL2 (Jaksch et al., 2010) (for finite horizon MDPs) and EULER (Zanette and Brunskill, 2019), that use greedy policies instead of solving an MDP (full planning) at the beginning of each episode. Surprisingly, we find that both generalized algorithms do not suffer from performance degradation, up to numerical constants and logarithmic factors. Thus, we conclude that there exists an RL algorithm that achieves the minimax regret bound, while acting according to greedy policies.

1:  Initialize: .
2:  for  do
3:     Initialize
4:     for  do
8:        Act with and observe .
9:     end for
10:     Update with all experience gathered in episode.
11:  end for
Algorithm 2 Model-based RL with Greedy Policies

Consider the general RL scheme that explores by greedy policies as depicted in Algorithm 2. The value is initialized optimistically and the algorithm interacts with the unknown environment in an episodic manner. At each time-step , a greedy policy from the current state, , is calculated optimistically based on the empirical model and the current value at the next time-step . This is done in a subroutine called ‘ModelBaseOptimisticQ’. We further assume the optimistic -function has the form and refer to as the optimistic model. The agent interacts with the environment based on the greedy policy with respect to and uses the gathered experience to update the empirical model at the end of the episode.

By construction of the update rule (see Line 7), the value is a decreasing function of , for all . Thus, property (ii) in Lemma 1 holds for Algorithm 2. Furthermore, the algorithms analyzed in this section will also be optimistic with high probability, i.e., property (i) in Lemma 1 also holds. Finally, since the value update uses the empirical quantities , , and from the previous episode, policy is still measurable.

The following lemma generalizes Lemma 2 to the case where, unlike in RTDP, the update rule does not use the exact model (see proof in Appendix C).

Lemma 7 (Value Update for Optimistic Model).

The expected cumulative value update of Algorithm 2 in the episode is bounded by

In the rest of the section, we consider two instantiations of the subroutine ‘ModelBaseOptimisticQ’ in Algorithm 2. We use the bonus terms of UCRL2 and of EULER to acquire an optimistic -function, . These two options then lead to UCRL2 with Greedy Policies (UCRL2-GP) and EULER with Greedy Policies (EULER-GP) algorithms.

4.1 UCRL2 with Greedy Policies for Finite-Horizon MDPs

5:  Return
Algorithm 3 UCRL2 with Greedy Policies (UCRL2-GP)

We form the optimistic local model based on the confidence set of UCRL2 (Jaksch et al., 2010). This amounts to use Algorithm 3 as the subroutine ‘ModelBaseOptimisticQ’ in Algorithm 2. The maximization problem on Line 3 of Algorithm 3 is common, when using bonus based on an optimistic model (Jaksch et al., 2010), and it can be solved efficiently in operations (e.g., Strehl and Littman 2008, Section 3.1.5). A full version of the algorithm can be found in Appendix D.

Thus, Algorithm 3 performs operations per episode. This saves the need to perform Extended Value Iteration (Jaksch et al., 2010), that costs operations per episode (an extra factor of ). Despite the significant improvement in terms of computational complexity, the regret of UCRL2-GP is similar to the one of UCRL2 (Jaksch et al., 2010) as the following theorem formalizes (see proof in Appendix D).

Theorem 8 (Regret Bound of UCRL2-GP).

For any time , with probability at least , the regret of UCRL2-GP is bounded by .

Proof Sketch.

Using the optimism of the value function (see Section D.2) and by applying Lemma 7, we bound the regret as follows:


Thus, the regret is upper bounded by two terms. As in Theorem 4, by applying Lemma 11 (Appendix A), the first term in (4) is a sum of Decreasing Bounded Processes, and can thus be bounded by . The presence of the second term in (4) is common in recent regret analyses (e.g., Dann et al. 2017). Using standard techniques (Jaksch et al., 2010; Dann et al., 2017; Zanette and Brunskill, 2019), this term can be bounded (up to additive constant factors) with high probability by . ∎

4.2 EULER with Greedy Policies

In this section, we use bonus terms as in EULER  (Zanette and Brunskill, 2019). Similar to the previous section, this amounts to replacing the subroutine ‘ModelBaseOptimisticQ’ in Algorithm 2 with a subroutine based on the bonus terms from (Zanette and Brunskill, 2019). Algorithm 5 in Appendix E contains the pseudocode of the algorithm. The bonus terms in EULER are based on the empirical Bernstein inequality and tracking both an upper bound and a lower-bound on . Using these, EULER achieves both minimax optimal and problem dependent regret bounds.

EULER (Zanette and Brunskill, 2019) performs computations per episode (same as the VI algorithm), while EULER-GP requires only . Despite this advantage in computational complexity, EULER-GP exhibits similar minimax regret bounds to EULER (see proof in Appendix E), much like the equivalent performance of UCRL2 and UCRL2-GP proved in Section 4.1.

Theorem 9 (Regret Bound of EULER-GP).

Let be an upper bound on the total reward collected within an episode. Define and . With probability at least , for any time jointly on all episodes , the regret of EULER-GP is bounded by Thus, it is also bounded by .

Note that Theorem 9 exhibits similar problem-dependent regret-bounds as in Theorem 1 of (Zanette and Brunskill, 2019). Thus, the same corollaries derived in (Zanette and Brunskill, 2019) for EULER can also be applied to EULER-GP.

5 Related Work

Real-Time Dynamic Programming: RTDP (Barto et al., 1995) has been extensively used and has many variants that exhibit superior empirical performance (e.g., (Bonet and Geffner, 2003; McMahan et al., 2005; Smith and Simmons, 2006)). For planning in discounted MDPs, Strehl et al. (2006) proved -PAC bounds of , for a modified version of the algorithm that updates the value only if the update size is larger than . Bonet and Geffner (2003) proved convergence rate of for Labeled RTDP in the stochastic shortest path problem, which is equivalent to in finite-horizon MDP. Nevertheless, these algorithms explicitly use to mark states with accurate value estimate. We prove that RTDP converges in a rate of without knowing . Indeed, Strehl et al. (2006) posed whether the original RTDP is PAC as an open problem. Furthermore, no regret bound for RTDP has been reported in the literature.

Regret bounds for RL: The most renowned algorithms with regret guarantees for undiscounted infinite-horizon MDPs are UCRL2 (Jaksch et al., 2010) and REGAL (Bartlett and Tewari, 2009), which have been extended throughout the years (e.g., by Fruit et al. 2018; Talebi and Maillard 2018). Recently, there is an increasing interest in regret bounds for MDPs with finite horizon and stationary dynamics. In this scenario, UCRL2 enjoys a regret bound of order . Azar et al. (2017) proposed UCBVI, with improved regret bound of order , which is also asymptotically tight (Osband and Van Roy, 2016). Dann et al. (2018) presented ORLC that achieves tight regret bounds and (nearly) tight PAC guarantees for non-stationary MDPs. Finally, Zanette and Brunskill (2019) proposed EULER, an algorithm that enjoys tight minimax regret bounds and has additional problem-dependent bounds that encapsulate the MDP’s complexity. All of these algorithms are model-based and require full-planning. Model-free RL was analyzed by (Jin et al., 2018). There, the authors exhibit regret bounds that are worse by a factor of relatively to the lower-bound. To the best of our knowledge, there are no model-based algorithms with regret guarantees that avoid full-planning. It is worth noting that while all the above algorithms, and the ones in this work, rely on the Optimism in the Face of Uncertainty principle (Lai and Robbins, 1985)

, Thompson Sampling model-based RL algorithms exist

(Osband et al., 2013; Gopalan and Mannor, 2015; Agrawal and Jia, 2017; Osband and Van Roy, 2017). There, a model is sampled from a distribution over models, on which full-planning takes place.

Greedy policies in model-based RL: By adjusting RTDP to the case where the model is unknown, Strehl et al. (2012) formulated model-based RL algorithms that act using a greedy policy. They proved a sample complexity bound for discounted MDPs. To the best of our knowledge, there are no regret bounds for model-based RL algorithms that act by greedy policies.

Practical model-based RL: Due to the high computational complexity of planning in model-based RL, most of the practical algorithms are model-free (e.g., Mnih et al. 2015). Algorithms that do use a model usually only take advantage of local information. For example, Dyna (Sutton, 1991; Peng et al., 2018) randomly samples state-action pairs and updates them according to a local model. Other papers use the local model to plan for a short horizon from the current state (Tamar et al., 2016; Hafner et al., 2018). The performance of such algorithms depends heavily on the planning horizon, that in turn dramatically increases the computational complexity.

6 Conclusions and Future Work

In this work, we established that tabular model-based RL algorithms can explore by 1-step planning instead of full-planning, without suffering from performance degradation. Specifically, exploring with model-based greedy policies can be minimax optimal in terms of regret. Differently put, the variance caused by exploring with greedy policies is smaller than the variance caused by learning a sufficiently good model. Indeed, the extra term which appears due to the greedy exploration is

(e.g., the first term in (4)); a constant term, smaller than the existing constant terms of UCRL2 and EULER.

This work raises and highlights some interesting research questions. The obvious ones are extensions to average and discounted MDPs, as well as to Thompson sampling based RL algorithms. Although these scenarios are harder or different in terms of analysis, we believe this work introduces the relevant approach to tackle this question. Another interesting question is the applicability of the results in large-scale problems, when tabular representation is infeasible and approximation must be used. There, algorithms that act using lookahead policies, instead of 1-step planning, are expected to yield better performance, as they are less sensitive to value approximation errors (e.g., Bertsekas and Tsitsiklis 1996; Jiang et al. 2018; Efroni et al. 2018b, a). Even then, full-planning, as opposed to using a short-horizon planning, might be unnecessary. Lastly, establishing whether the model-based approach is or is not provably better than the model-free approach, as the current state of the literature suggests, is yet an important and unsolved open problem.


We thank Oren Louidor for illuminating discussions relating the Decreasing Bounded Process, and Esther Derman for very helpful comments.


Appendix A Proofs on Decreasing Bounded Processes

In this section, we state and prove useful results on Decreasing Bounded Processes (see Definition 1). These results will be in use in proofs of the central theorems of this work.

See 3


Without loss of generality, assume , since otherwise the results are trivial. We start by remarking that is almost surely monotonically increasing, since . Define the martingale difference process

and the martingale process . Since almost surely, can be bounded by . Also define the quadratic variations as and . Next, recall Theorem 2.7 of [de la Peña et al., 2007]:

Theorem 10.

Let and be two random variables, such that for all , we have


Then, ,


Condition (5) holds for and , due to Theorem 9.21 of [de la Peña et al., 2008]. can be easily bounded by . To bound , we first calculate and :


In we used the fact that a.s., which allows us to conclude that the cross-term is non-positive. In (**), we bounded . We can also bound , since each of the summands is a.s. non-negative, and thus,

Combining all of the above bounds yields

Finally, we can bound by

Combining everything we obtain

Or, substituting in (6), we have

Next, notice that for , the function is monotonically increasing for any :

Moreover, for ,

where the inequality holds since . Thus, if , then , and we can bound the probability that by

and setting , we obtain

We remark that since is monotonically increasing a.s., this bound also implies that

To obtain a uniform bound, that is, bound that holds for all , note that the random sequence is monotonically increasing in and bounded. Thus, due to monotone convergence