Directed Exploration in PAC Model-Free Reinforcement Learning

08/31/2018 ∙ by Min-hwan Oh, et al. ∙ 0

We study an exploration method for model-free RL that generalizes the counter-based exploration bonus methods and takes into account long term exploratory value of actions rather than a single step look-ahead. We propose a model-free RL method that modifies Delayed Q-learning and utilizes the long-term exploration bonus with provable efficiency. We show that our proposed method finds a near-optimal policy in polynomial time (PAC-MDP), and also provide experimental evidence that our proposed algorithm is an efficient exploration method.




1 Introduction

In reinforcement learning (RL), an agent whose objective is to maximize the expected sum of rewards starts out making decisions in an unknown environment. It learns by trial and error while collecting reward and information. However, the agent cannot act near-optimally until it has explored the environment sufficiently and identified all of the opportunities for high reward. One of the fundamental challenges in RL is to balance exploration and exploitation: whether to act non-greedily with respect to current estimates in order to gain new information, or to act consistently with past experience to maximize reward.

Common dithering strategies, such as $\epsilon$-greedy or sampling from a Boltzmann distribution (softmax) over the learned Q-values, have been widely applied to standard RL methods as exploration strategies. However, these naive approaches can lead to highly inefficient exploration, in the sense that they waste exploration resources on actions and trajectories that are already well known. In other words, they are not directed towards gaining more knowledge and do not bias actions in the direction of unexplored trajectories (Thrun, 1992; Little & Sommer, 2013; Osband et al., 2016).

In order to avoid wasteful exploration and guide the agent toward more directed exploration, much previous work has adopted exploration bonuses. The most commonly used exploration bonus is based on counting. That is, for each pair $(s, a)$, maintain an integer value that indicates how many times the agent has performed action $a$ at state $s$ up to time $t$. Counter-based methods have been widely used both in practice and in theory (Strehl & Littman, 2005, 2008; Kolter & Ng, 2009; Bellemare et al., 2016; Tang et al., 2017; Ostrovski et al., 2017). However, these methods are still limited in that the exploratory value of a state-action pair is evaluated only with respect to its immediate outcome, one step ahead (Choshen et al., 2018). Recent work (Choshen et al., 2018) proposes an exploration method for model-free RL that generalizes the counter-based exploration bonus methods and takes into account the long-term exploratory value of actions rather than a single-step look-ahead. Inspired by their use of propagated exploration value, we propose a model-free RL method that utilizes this long-term exploration bonus with provable efficiency. We show that our proposed method finds a near-optimal policy in polynomial time, and give experimental evidence that it is an efficient exploration method.
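The counter-based bonus described above can be illustrated with a minimal sketch. The class name `CountBonus`, the scale `beta`, and the specific form `beta / sqrt(n + 1)` are assumptions chosen for illustration; the cited works use several different bonus shapes, and this is not the method proposed in this paper.

```python
import math


class CountBonus:
    """Visit counter per (state, action) with a bonus that shrinks as counts grow."""

    def __init__(self, beta):
        self.beta = beta      # bonus scale (hypothetical parameter)
        self.counts = {}      # (state, action) -> number of visits so far

    def visit(self, state, action):
        key = (state, action)
        self.counts[key] = self.counts.get(key, 0) + 1

    def bonus(self, state, action):
        n = self.counts.get((state, action), 0)
        # The +1 keeps the bonus finite for unvisited pairs (an assumption).
        return self.beta / math.sqrt(n + 1)


def greedy_with_bonus(q_values, bonus, state, actions):
    """Pick the action maximizing Q(s, a) plus the count-based bonus."""
    return max(
        actions,
        key=lambda a: q_values.get((state, a), 0.0) + bonus.bonus(state, a),
    )
```

Note that the bonus here depends only on the count of the pair itself, which is the one-step limitation discussed in the text.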

2 Preliminaries

2.1 Markov decision processes

A standard assumption of RL is that the environment is a (discounted-reward, finite) Markov decision process (MDP). Here we only introduce the notational framework used in this work. A finite MDP $M$ is a tuple $(S, A, P, R, \gamma)$, where $S$ is a finite set of states; $A$ is a finite set of possible actions; $P$ is the transition distribution; $R$ is the reward distribution; $\gamma$ is a discount factor with $0 \le \gamma < 1$. We assume that all the (random) immediate rewards are nonnegative and are upper-bounded by a constant $R_{\max}$. A policy $\pi$ is a mapping that assigns to every history $h$ a probability mass function over the actions $A$. Following a policy in the MDP means that $a_t \sim \pi(\cdot \mid h_t)$, where $h_t$ is the history up to time $t$. A stationary policy is one whose choice of action depends only on the current state. The discounted state- and action-value functions will be denoted by $V^\pi$ and $Q^\pi$, respectively, for any (not necessarily stationary) policy $\pi$. The optimal state- and action-value functions will be denoted by $V^*$ and $Q^*$.

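Given a stream of experience generated when algorithm $\mathcal{A}$ interacts with $M$, we define its value at time $t$, conditioned on the past $h_t$, as $V^{\mathcal{A}}_{t,M} = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid h_t\right]$. Let $V_{\max}$ be an upper bound on the state values in the MDP.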
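As a small illustration of the discounted value and of one valid choice of upper bound, the sketch below computes a discounted return for a finite reward sequence and the geometric-series bound that holds whenever rewards lie in a bounded nonnegative range. The function names are hypothetical, and the bound shown is only one admissible upper bound, not necessarily the one used in the analysis.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def value_upper_bound(r_max, gamma):
    """Geometric-series bound on any discounted value when rewards lie in [0, r_max]."""
    return r_max / (1.0 - gamma)
```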

2.2 Sample Complexity

One common evaluation criterion for RL algorithms is the number of non-optimal actions taken. Roughly, this quantity bounds how many mistakes the agent makes.

Definition 1 (Kakade 2003).

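Let $\epsilon > 0$ be a prescribed accuracy and $\delta > 0$ be an allowed probability of failure. The expression $\zeta(\epsilon, \delta, |S|, |A|, \gamma, R_{\max})$ is a sample complexity bound for algorithm $\mathcal{A}$ if the following holds: Take any $\epsilon > 0$, $\delta \in (0, 1)$, $|S| > 0$, $|A| > 0$, $\gamma \in [0, 1)$, $R_{\max} > 0$, and any MDP $M$ with $|S|$ states, $|A|$ actions, discount factor $\gamma$, and rewards bounded by $R_{\max}$. Let $\mathcal{A}$ interact with $M$, resulting in the process $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$. Then, independently of the choice of $s_0$, with probability at least $1 - \delta$, the number of timesteps $t$ such that $V^{\mathcal{A}}_{t,M} < V^*(s_t) - \epsilon$ is at most $\zeta(\epsilon, \delta, |S|, |A|, \gamma, R_{\max})$.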

An algorithm whose sample complexity is polynomial in $|S|$, $|A|$, $1/\epsilon$, $\log(1/\delta)$, $1/(1-\gamma)$, and $R_{\max}$ is called PAC-MDP (probably approximately correct in MDPs).
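The quantity that a sample complexity bound controls can be made concrete with a short sketch. It assumes, purely for illustration, that the algorithm's value and the optimal value at each visited state are already available as two lists; in practice neither is known to the learner, and the function name is hypothetical.

```python
def count_mistakes(algorithm_values, optimal_values, epsilon):
    """Number of timesteps where the algorithm's value is more than epsilon below optimal."""
    return sum(
        1
        for v_alg, v_opt in zip(algorithm_values, optimal_values)
        if v_alg < v_opt - epsilon
    )
```

A PAC-MDP guarantee says that, with high probability, this count stays below a polynomial bound no matter how long the algorithm runs.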

2.3 Previous sample complexity results

The E$^3$ algorithm (Kearns & Singh, 2002) and its successor, R-max (Brafman & Tennenholtz, 2002), were the first algorithms with polynomial time bounds for finding near-optimal policies. These methods maintain a complete, but possibly inaccurate, model of the environment and act based on the optimal policy derived from this model. The model is initialized optimistically: all actions in all states return the maximal possible reward, and the model is updated each time a state becomes known. R-max has the sample complexity of . The MBIE algorithm (Strehl & Littman, 2005, 2008) applies confidence bounds to compute an optimistic policy and has the same sample complexity . There are variants of R-max, such as the OIM algorithm (Szita & Lőrincz, 2008) and MoRMax (Szita & Szepesvári, 2010). MoRMax is shown to have the smallest sample complexity among these algorithms for discounted finite MDPs. All of the algorithms mentioned so far are model-based. Unlike these methods, which build an approximate model of the environment, Delayed Q-learning (Strehl et al., 2006) approximates an action-value function directly. Delayed Q-learning is the first model-free method with known complexity bounds with sample-complexity.

2.4 E-values

Choshen et al. (2018) propose a method using a parallel E-value MDP, which has the same transition model as the original MDP but has no rewards associated with any of the state-action pairs. Hence, the true value of all state-action pairs is 0. With an initial value of 1 for all state-action pairs, they show (empirically) that these E-values represent the missing knowledge and thus can be used for propagating directed exploration. Intuitively, the value of $E(s, a)$ at a given timestep during training stands for uncertainty and decreases each time the agent experiences the pair $(s, a)$. The on-policy SARSA update rule (Singh et al., 2000) is applied to the E-value MDP, where the acting policy is selected on the original MDP.

While the E-value MDP is training, the proposed method applies a log transformation to the E-values to get the corresponding exploration bonus term for the original MDP. This bonus term is shown to be equivalent to counter-based methods for finite MDPs when the discount factor $\gamma_E$ of the E-value MDP is set to 0 and a fixed learning rate is used for all updates. Hence, Choshen et al. (2018) argue that, with $\gamma_E > 0$, the logarithm of E-values can be thought of as a generalization of visit counters, with propagation of the values along state-action pairs. Although the empirical results demonstrate efficient exploration in the experiments used in their work, the theoretical analysis of their proposed algorithm is lacking, essentially only showing convergence under infinite visiting. In this work, we show that E-values can be incorporated into a PAC-MDP algorithm with a theoretical guarantee.
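The counter equivalence can be checked numerically with a minimal sketch. It assumes a fixed learning rate `alpha`, a reward-free SARSA-style target, and, for simplicity, a successor E-value held constant at `e_next` (a hypothetical simplification; in the real process the successor value also changes over time). All function names are illustrative.

```python
import math


def e_value_update(e_current, e_next, alpha, gamma_e):
    """One SARSA-style update on the reward-free E-value MDP (the reward term is 0)."""
    return (1.0 - alpha) * e_current + alpha * gamma_e * e_next


def e_value_after_visits(num_visits, alpha, gamma_e, e_next=1.0):
    """E-value of a pair after num_visits updates, starting from the initial value 1."""
    e_value = 1.0
    for _ in range(num_visits):
        e_value = e_value_update(e_value, e_next, alpha, gamma_e)
    return e_value


def generalized_counter(e_value, alpha):
    """Logarithm of the E-value in base (1 - alpha)."""
    return math.log(e_value) / math.log(1.0 - alpha)
```

With a discount of 0, each update multiplies the E-value by `1 - alpha`, so the log transformation recovers the visit count exactly. With a positive discount, the E-value also reflects how unexplored the successor is, and the generalized counter comes out smaller than the raw count while the successor remains uncertain.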

2.5 Delayed Q-learning

In Delayed Q-learning (Strehl et al., 2006, 2009), the agent only observes one sample transition for the action it takes in the current state. Delayed Q-learning uses optimistic initialization of the value function and waits until $m$ transitions from $(s, a)$ are gathered before considering an update of $Q(s, a)$ (this is where the “delay” comes from). When $m$ is sufficiently large (but still bounded by a polynomial), the new value of $Q(s, a)$ is still optimistic with high probability (Li, 2009). It maintains the known state-action set $K$, similar to the approaches introduced in earlier model-based PAC-MDP algorithms (Brafman & Tennenholtz, 2002), as well as a Boolean flag for each state-action pair that is set to TRUE when the pair does not belong to the set $K$, which allows an update to $Q(s, a)$. These tools allow us to bound the number of occurrences of the undesired “escape” events from $K$. A variant of Delayed Q-learning uses techniques such as interval estimation to attempt an update before the $m$-th visit as long as the current estimate satisfies the update criterion (Strehl, 2007).
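The delayed update can be sketched for a single state-action pair as follows. This follows the attempted-update rule of the original Delayed Q-learning of Strehl et al. (2006) as we understand it, not the modified rule proposed in this paper; the function name and the representation of samples as a list of precomputed targets are assumptions.

```python
def attempt_delayed_update(q_sa, targets, epsilon1):
    """Attempt one Delayed Q-learning update for a single (s, a).

    targets holds the m sampled values r + gamma * max_a' Q(s', a')
    gathered since the last attempt. Returns (new_q, success).
    """
    average_target = sum(targets) / len(targets)
    # Update only if the estimate would drop by a meaningful amount.
    if q_sa - average_target >= 2.0 * epsilon1:
        # Adding epsilon1 keeps the new estimate optimistic with high probability.
        return average_target + epsilon1, True
    return q_sa, False
```

Because each successful update lowers the estimate by at least `epsilon1` and the estimate starts from a bounded optimistic value, the number of successful updates per pair is bounded, which is what the PAC analysis exploits.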

3 Directed Delayed Q-learning

Our proposed algorithm, Directed Delayed Q-learning, maintains Q-value estimates $Q(s, a)$ and E-value estimates $E(s, a)$ for each state-action pair $(s, a)$. At each timestep $t$, let $Q_t(s, a)$ denote the algorithm’s current Q-value estimate and $E_t(s, a)$ denote its current E-value estimate. The agent always acts greedily with respect to the Q-value estimates plus the exploration bonus, meaning that if state $s$ is the $t$-th state reached, the next action is chosen by


Let denote for convenience. Our proposed method is based on Delayed Q-learning (Strehl et al., 2006, 2009). It modifies Delayed Q-learning in two ways: we introduce an exploration bonus using E-values to take into account the long-term exploratory value of actions, and we perform delayed updates to E-values along with Q-values. We also adopt the interval estimation technique (Strehl, 2007) to update the value function whenever a current Monte Carlo estimate differs sufficiently from the target value function, instead of waiting until the agent collects a fixed number of samples to estimate a new value function for each attempted update. The term is introduced to account for Monte Carlo estimate errors in the case of a premature delay, where is the inner counter of state-action pairs within each update and resets after a successful update or the $m$-th attempted update. Note that differs from a global counter, which keeps track of the number of state-action visits for the entire duration of learning and which is generalized by E-values. It is also important to note that the proposed E-value-based exploration bonus can still be applied to the fully delayed version of Delayed Q-learning with fixed delay intervals (in fact, with the same PAC bound). We apply the interval estimation technique for empirical performance gains.

Furthermore, our proposed method differs from that of Choshen et al. (2018), who still apply dithering strategies ($\epsilon$-greedy and softmax policies) over the sum of the Q-value and an E-value-based exploration bonus. In contrast, our proposed algorithm acts greedily with respect to Equation (1). We also update E-values with off-policy rather than on-policy updates to ensure a monotonic decrease in value for every update.

In addition to Q-value and E-value estimates, similarly to Delayed Q-learning, our algorithm maintains a Boolean variable $LEARN(s, a)$ for each $(s, a)$. (The maintenance of this variable is essentially the same as in Delayed Q-learning; for details, see Strehl et al., 2006, 2009.) This variable indicates whether the agent currently considers a modification to its Q-value and E-value estimates. The algorithm also relies on other free parameters, including a positive integer $m$, a positive real number $\epsilon_1$, the E-value discount factor $\gamma_E$, and the base of the log transformation. In the analysis, which is provided in the Appendix, we provide precise values for these parameters in terms of the other inputs that guarantee the resulting algorithm is PAC-MDP. We provide an efficient implementation, Algorithm 1, of Directed Delayed Q-learning.

  for all  do
               // Q-value estimate
               // exploration value estimate
               // inner loop estimate for Q-values
               // inner loop estimate for -values
               // inner counter
               // beginning time of attempted update
               // the LEARN flags
  end for
            // time of most recent Q-value change
  for  do
     Let denote the state at time
      Choose action
     Observe immediate reward and next state
     if  then
     end if
     if  then
        if   then
        else if  then
           if  then
           end if
        end if
     end if
  end for
Algorithm 1 Directed Delayed Q-learning

3.1 Update Criteria

While the agent considers learning for a given state-action pair $(s, a)$, each time $(s, a)$ is experienced, the agent updates its surrogate Q-value and E-value estimates and attempts an update to the global Q-value and E-value, up to $m$ times. If the update fails even on the $m$-th attempt, the agent discards the current surrogate estimates and starts collecting new samples. If successful, the following updates occur: and . To ensure that every successful update decreases by at least , we require the following condition to be satisfied for an update to occur:

If the above condition does not hold, then there is no update to be performed for and .

Figure 1: Flow of Q-value and E-value updates during execution of Directed Delayed Q-learning. Delay can be up to $m$ times per attempted update, which may or may not succeed.

3.2 Main Theoretical Result

The main theoretical result, whose proof is provided in Appendix in the supplementary material, is that the Directed Delayed Q-learning algorithm is PAC-MDP:

Theorem 1.

Let $M$ be any MDP and let $\epsilon$ and $\delta$ be two positive real numbers. If Directed Delayed Q-learning is executed on MDP $M$, then the following holds. Let $\mathcal{A}_t$ denote the policy of Directed Delayed Q-learning at time $t$ and $s_t$ denote the state at time $t$. With probability at least $1 - \delta$, $V^{\mathcal{A}_t}_M(s_t) \ge V^*_M(s_t) - \epsilon$ is true for all but


4 Experiments

To assess the empirical performances of Directed Delayed Q-learning, we compared its performance to other model-free RL methods as well as different values of . Experiments were run on chain MDPs with varying length . The agent begins at the far left state and at every time step has the choice to move left or right. Each move can fail with probability 0.2, which results in the opposite action. The agent receives a small reward for reaching the leftmost state, but the optimal policy is to attempt to move to the far right state and receive a much larger reward . Chains with length and are reported below. These environments are intended to be expository rather than entirely realistic. Balancing a well known and mildly successful strategy versus an unknown, but potentially more rewarding, approach can emerge in many practical applications (Osband et al., 2016).
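The chain environment described above can be sketched as follows. The class name, the reward magnitudes passed to the constructor, and the convention that a reward is paid on arriving at an end state are assumptions for illustration, since the exact reward values and chain lengths are not given here.

```python
import random


class ChainMDP:
    """Chain of `length` states; the agent starts at the far left (state 0)."""

    LEFT, RIGHT = 0, 1

    def __init__(self, length, small_reward, large_reward, slip_prob=0.2, seed=None):
        self.length = length
        self.small_reward = small_reward   # paid at the leftmost state (hypothetical value)
        self.large_reward = large_reward   # paid at the rightmost state (hypothetical value)
        self.slip_prob = slip_prob         # chance that a move turns into the opposite action
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # With probability slip_prob the move fails and the opposite action is executed.
        if self.rng.random() < self.slip_prob:
            action = 1 - action
        if action == self.RIGHT:
            self.state = min(self.state + 1, self.length - 1)
        else:
            self.state = max(self.state - 1, 0)
        if self.state == 0:
            reward = self.small_reward
        elif self.state == self.length - 1:
            reward = self.large_reward
        else:
            reward = 0.0
        return self.state, reward
```

A learner that settles for the small reward on the left never discovers the far right state, and the number of consecutive successful right moves needed grows with the chain length, which is why this family of environments stresses deep exploration.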

| Method | Cumulative reward |
| --- | --- |
| Directed Delayed QL | 7089.59 ± 48.98 |
| Delayed QL | 4325.38 ± 59.31 |
| QL + $\epsilon$-greedy | 2435.11 ± 134.3 |

Table 1: Results on Chain MDPs with

| Method | Cumulative reward |
| --- | --- |
| Directed Delayed QL | 5581.02 ± 94.72 |
| Delayed QL | 531.95 ± 58.66 |
| QL + $\epsilon$-greedy | 2.98 ± 0.012 |

Table 2: Results on Chain MDPs with

In all experiments, each algorithm ran for 10,000 timesteps and the undiscounted total sum of reward was recorded. Tables 1 and 2 show the average and 95% confidence intervals over 300 independent test runs. The results show that Directed Delayed Q-learning significantly outperforms the other model-free methods. In particular, the gap between the performances of the algorithms widens sharply as the chain length increases, which suggests that the larger value of is beneficial especially in environments where reward is more sparse and deeper exploration is required.
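One common way to produce an average with a 95% confidence interval from independent runs is sketched below. It uses the normal approximation with the sample standard error; whether the reported intervals were computed this way is an assumption, and the function name is hypothetical.

```python
import math
import statistics


def mean_with_ci95(samples):
    """Return (mean, half_width) of a normal-approximation 95% confidence interval."""
    mean = statistics.fmean(samples)
    # Standard error of the mean from the sample standard deviation.
    std_err = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 1.96 * std_err
```

Applied to the cumulative rewards of the independent runs of one method, this yields a pair that can be reported in the "mean ± half-width" form.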

5 Conclusion

We presented Directed Delayed Q-learning, a provably efficient model-free reinforcement learning algorithm that takes into account long-term exploratory information. It has the same desirable sample complexity as Delayed Q-learning. The experiments show that Directed Delayed Q-learning performs significantly better than other model-free RL methods on challenging environments.


References

  • Bellemare et al. (2016) Bellemare, Marc, Srinivasan, Sriram, Ostrovski, Georg, Schaul, Tom, Saxton, David, and Munos, Remi. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
  • Brafman & Tennenholtz (2002) Brafman, Ronen I and Tennenholtz, Moshe. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
  • Choshen et al. (2018) Choshen, Leshem, Fox, Lior, and Loewenstein, Yonatan. Dora the explorer: Directed outreaching reinforcement action-selection. In International Conference on Learning Representations, 2018.
  • Kakade (2003) Kakade, Sham Machandranath. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.
  • Kearns & Singh (2002) Kearns, Michael and Singh, Satinder. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
  • Kolter & Ng (2009) Kolter, J Zico and Ng, Andrew Y. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 513–520. ACM, 2009.
  • Li (2009) Li, Lihong. A unifying framework for computational reinforcement learning theory. Rutgers The State University of New Jersey-New Brunswick, 2009.
  • Little & Sommer (2013) Little, Daniel Ying-Jeh and Sommer, Friedrich Tobias. Learning and exploration in action-perception loops. Frontiers in neural circuits, 7:37, 2013.
  • Osband et al. (2016) Osband, Ian, Van Roy, Benjamin, and Wen, Zheng. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pp. 2377–2386, 2016.
  • Ostrovski et al. (2017) Ostrovski, Georg, Bellemare, Marc G, Oord, Aaron van den, and Munos, Rémi. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.
  • Singh et al. (2000) Singh, Satinder, Jaakkola, Tommi, Littman, Michael L, and Szepesvári, Csaba. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine learning, 38(3):287–308, 2000.
  • Strehl (2007) Strehl, Alexander L. Probably approximately correct (PAC) exploration in reinforcement learning. PhD thesis, Rutgers University-Graduate School-New Brunswick, 2007.
  • Strehl & Littman (2005) Strehl, Alexander L and Littman, Michael L. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd international conference on Machine learning, pp. 856–863. ACM, 2005.
  • Strehl & Littman (2008) Strehl, Alexander L and Littman, Michael L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Strehl et al. (2006) Strehl, Alexander L, Li, Lihong, Wiewiora, Eric, Langford, John, and Littman, Michael L. PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pp. 881–888. ACM, 2006.
  • Strehl et al. (2009) Strehl, Alexander L, Li, Lihong, and Littman, Michael L. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
  • Szita & Lőrincz (2008) Szita, István and Lőrincz, András. The many faces of optimism: a unifying approach. In Proceedings of the 25th international conference on Machine learning, pp. 1048–1055. ACM, 2008.
  • Szita & Szepesvári (2010) Szita, István and Szepesvári, Csaba. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 1031–1038, 2010.
  • Tang et al. (2017) Tang, Haoran, Houthooft, Rein, Foote, Davis, Stooke, Adam, Chen, Xi, Duan, Yan, Schulman, John, DeTurck, Filip, and Abbeel, Pieter. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2750–2759, 2017.
  • Thrun (1992) Thrun, Sebastian B. Efficient exploration in reinforcement learning. Technical report, Carnegie-Mellon University, 1992.

Appendix A Analysis

In this section, we prove the main theoretical result, Theorem 1. The proof follows the structure of Strehl et al. (2009), but spells out the steps that are specific to our proposed method. The following theorem (Theorem 10 in Strehl et al., 2009) will come in handy to show that our proposed algorithm is PAC-MDP.

Theorem 2 (Strehl et al. 2009).

Let $\mathcal{A}(\epsilon, \delta)$ be any greedy learning algorithm such that, for every timestep $t$, there exists a set $K_t$ of state-action pairs that depends only on the agent’s history up to timestep $t$. We assume that $K_t = K_{t+1}$ unless, during timestep $t$, an update to some state-action value occurs or the escape event $A_K$ happens. Let $M_{K_t}$ be the known state-action MDP and $\pi_t$ be the current greedy policy, that is, for all states $s$, $\pi_t(s) = \arg\max_a Q_t(s, a)$. Furthermore, assume $Q_t(s, a) \le V_{\max}$ for all $t$ and $(s, a)$. Suppose that for any inputs $\epsilon$ and $\delta$, with probability at least $1 - \delta$, the following conditions hold for all states $s$, actions $a$, and timesteps $t$: (a) $V_t(s) \ge V^*(s) - \epsilon$ (optimism), (b) $V_t(s) - V^{\pi_t}_{M_{K_t}}(s) \le \epsilon$ (accuracy), and (c) the total number of updates of action-value estimates plus the number of times the escape event from $K_t$, $A_K$, can occur is bounded by $\zeta(\epsilon, \delta)$ (learning complexity). Then, when $\mathcal{A}(\epsilon, \delta)$ is executed on any MDP $M$, it will follow a $4\epsilon$-optimal policy from its current state on all but

timesteps, with probability at least .

Recall that we define to be for convenience. We first bound the number of successful updates. Since every successful update of results in a decrease of at least and is initialized to , there are at most successful updates of a fixed state-action pair $(s, a)$. Therefore, the total number of successful updates is at most . So there can be at most attempted updates for each pair $(s, a)$. Hence, there are at most total attempted updates.

Following the construction of the set of low Bellman error state-action pairs in Delayed Q-learning (Strehl et al., 2009), during timestep $t$ of learning, we define $K_t$ to be the set of all state-action pairs $(s, a)$ such that:

Definition 2.

Define Event A1 to be the event that for all timesteps , if and an attempted update of (s,a) occurs during timestep , then the update will be successful, where are last timesteps during which is experienced consecutively.

During any given infinite-length execution of Directed Delayed Q-learning, when as above, our value function estimate is very inconsistent with our current value function estimates. Thus, we would expect our next attempted update to succeed. The next lemma shows that this update occurs with high probability. The proof follows the structure of the corresponding lemma of Strehl et al. (2009), but also bounds the E-value estimates and specifies additional parameter values. We specify a value and first consider values and .

Lemma 1.

The probability that A1 is violated during execution of Directed Delayed Q-learning is at most with , and .


Fix a state-action pair and suppose that it has been visited times until timestep , at steps . Consider rewards, , and next states, for

. Define the random variables

. Clearly, . Using the Hoeffding bound with the choice of above, it can be shown that


holds with probability at least . Similarly, define the random variable where . Note that . Again, using the Hoeffding bound with the choice of above, it can be shown that


holds with probability at least . Note that given , we can choose the constants and such that implies . Here we choose (this choice will be useful when we bound )

We choose since . Hence, it does not matter what value is (as long as ) to determine the PAC-bound. We show that if an attempted update is performed for using these samples, then the resulting update will succeed with high probability.

The first inequality follows from and for all and . The second inequality follows from (3) and (4) along with a suitable choice of . The third step uses the assumption on A1, i.e. , therefore (2) doesn’t hold. Hence, if we choose , then with probability at least we have:


The following lemma states that our proposed algorithm will maintain optimistic action values with high probability.

Lemma 2.

During execution of Directed Delayed Q-learning, holds for all timesteps and state-action pairs , with probability at least .


Fix a state-action pair and suppose that it has been visited times until timestep , at steps . Define the random variables by Note that and for all and the sequence is a martingale difference sequence. Applying Azuma’s lemma, we have


Let the right-hand side be equal to . Then with , we have that


holds for all attempted updates, with probability at least . Assuming this equation does hold, the proof of the lemma is by induction on the timestep . Note that since for all and , it suffices to show for all . For the base case, note that for all . Now, suppose that holds true for all . Hence, and for all . Then we have . ∎
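The concentration arguments in Lemmas 1 and 2 determine how many samples are needed before an averaged target can be trusted. As a rough numerical illustration, the sketch below uses the standard one-sided Hoeffding inequality for independent samples confined to an interval of width `value_range`; the constants are those of the textbook inequality, not the parameter choices of the lemmas, and the function names are hypothetical.

```python
import math


def hoeffding_failure_prob(num_samples, deviation, value_range):
    """Upper bound on P(sample mean - true mean >= deviation)."""
    return math.exp(-2.0 * num_samples * deviation ** 2 / value_range ** 2)


def hoeffding_sample_size(deviation, failure_prob, value_range):
    """Smallest number of samples for which the bound above is at most failure_prob."""
    return math.ceil(
        value_range ** 2 * math.log(1.0 / failure_prob) / (2.0 * deviation ** 2)
    )
```

The required sample size grows with the square of the value range over the deviation and only logarithmically in the inverse failure probability, which is the source of the polynomial dependence on the accuracy parameter and the logarithmic dependence on the failure probability in bounds of this kind.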

Lemma 3 (Strehl et al. 2009).

The number of timesteps such that a state-action pair is experienced is at most .


See Lemma 25 in Strehl et al. (2009) for the proof. Note that although the proposed algorithm can update Q-value and E-value estimates before $m$ attempts, it can take up to $m$ attempts (and still not succeed) in the worst case. Therefore, the analysis for this lemma is the same as in Strehl et al. (2009).

Next, we bound for all state-action pairs.

Lemma 4.

If and , then for all and .


Since , and , we have

Using these lemmas, we can prove the main result, Theorem 1.


(of Theorem 1) We show that combining the lemmas above satisfies the conditions of Theorem 2. First, set as in Lemma 1 and let and . Let . Then, by Lemma 4, for all and . By Lemma 1, event A1 holds with probability at least . Then, the optimism condition (a) is satisfied by Lemma 2. Note that for all if , then equation (1) holds. Otherwise . Hence, and can be off by at most in reward at each time . Therefore, , which satisfies condition (b); see, e.g., Strehl et al. (2009). Now, from Lemma 3, we have , where is the number of updates and escape events that occur during execution of Directed Delayed Q-learning. Hence, putting the results together, the algorithm will follow a -optimal policy from its current state on all but

timesteps. ∎