In reinforcement learning (RL), an agent’s goal is to maximize the expected sum of future rewards obtained through interactions with an unknown environment. In doing so, the agent must balance exploration (acting to improve its knowledge of the environment) and exploitation (acting to maximize rewards according to its current knowledge). In the tabular setting, where each state can be modelled in isolation, near-optimal exploration is by now well understood and a number of algorithms provide finite-time guarantees (Brafman and Tennenholtz, 2002; Strehl and Littman, 2008; Jaksch et al., 2010; Szita and Szepesvári, 2010; Osband and Van Roy, 2014; Azar et al., 2017). To guarantee near-optimality, however, the sample complexity of theoretically-motivated exploration algorithms must scale at least linearly with the number of states in the environment (Azar et al., 2012).
Yet, recent empirical successes have shown that practical exploration is not hopeless (Bellemare et al., 2016; Ostrovski et al., 2017; Pathak et al., 2017; Plappert et al., 2018; Fortunato et al., 2018; Burda et al., 2018). In this paper we use the term approximate exploration to describe algorithms which sacrifice near-optimality in order to explore more quickly. A desirable characteristic of these algorithms is fast convergence to a reasonable policy; near-optimality may be achieved when the environment is “nice enough”.
Our specific aim is to gain new theoretical understanding of the pseudo-count method, introduced by Bellemare et al. (2016) as a means of estimating visit counts in non-tabular settings, and of how it pertains to approximate exploration. Our study revolves around the MBIE-EB algorithm (Model-based Interval Estimation with Exploration Bonuses; Strehl and Littman, 2008) as a simple illustration of the general “optimism in the face of uncertainty” principle in an approximate exploration setting; MBIE-EB drives exploration by augmenting the empirical reward function with a count-based exploration bonus, which can be derived from either real counts or pseudo-counts.
As a warm-up, we construct an explicitly approximate exploration algorithm by applying MBIE-EB to an abstract environment based on state abstraction (Li et al., 2006; Abel et al., 2016). In this setting we derive performance bounds that simultaneously depend on the quality and size of the aggregation: by taking a finer or coarser aggregation, one can trade off exploration speed and accuracy. We then relate pseudo-counts to these aggregations and show how using pseudo-counts within MBIE-EB can lead to under-exploration (failing to achieve theoretical guarantees) or over-exploration (using an excessive number of samples to do so). Additionally, we quantify the magnitude of both phenomena.
Finally, we show that using pseudo-counts for exploration in the wild, as has been done in practice, produces implicitly approximate exploration. Specifically, under certain assumptions on the density model generating the pseudo-counts, these behave approximately as if derived from a particular abstraction. This is in general problematic, as in pathological cases this prohibits any kind of theoretical guarantees. As an interesting corollary, we find a surprising mismatch between the behaviour of these pseudo-counts and what might be expected given the abstraction they implicitly define.
2 Background and Notations
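We consider a Markov decision process (MDP) represented by a 5-tuple $M = \langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$, with $\mathcal{S}$ a finite state space, $\mathcal{A}$ a finite set of actions, $T$ a transition probability distribution, $R$ a reward function, and $\gamma \in [0, 1)$ the discount factor. The goal of reinforcement learning is to find the optimal policy $\pi^*$ which maximizes the expected discounted sum of future rewards. For any policy $\pi$, the $Q$-value of any state-action pair $(s, a)$ describes the expected discounted return after taking action $a$ in state $s$, then following $\pi$, and can be obtained using the Bellman equation

$$Q^\pi(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)} \left[ V^\pi(s') \right].$$

We also introduce $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ Q^\pi(s, a) \right]$, which is the expected discounted return when starting in $s$ and following $\pi$. The $Q$-value of the optimal policy verifies the optimal Bellman equation

$$Q^*(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim T(\cdot \mid s, a)} \left[ \max_{a' \in \mathcal{A}} Q^*(s', a') \right].$$

We also write $V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a)$. Furthermore we assume without loss of generality that rewards are bounded between 0 and 1, and we denote by $Q_{\max}$ the maximum value.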
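As a concrete reading of the optimal Bellman equation, the following is a minimal value-iteration sketch for a tabular MDP. The array layout (transitions of shape states x actions x states, rewards of shape states x actions), the function name and the toy two-state MDP are illustrative assumptions, not part of the original text.

```python
import numpy as np


def value_iteration(T, R, gamma, tol=1e-8, max_iters=10_000):
    """Iterate the optimal Bellman backup until the Q-values stop changing.

    T: array of shape (S, A, S), T[s, a, s2] = probability of reaching s2.
    R: array of shape (S, A) with rewards in [0, 1].
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_iters):
        V = Q.max(axis=1)              # V(s) = max_a Q(s, a)
        Q_new = R + gamma * (T @ V)    # (S, A, S) @ (S,) -> (S, A)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q


# Hypothetical toy MDP: state 1 is absorbing and pays reward 1 forever.
T = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0   # action 0 in state 0: stay in state 0
T[0, 1, 1] = 1.0   # action 1 in state 0: move to state 1
T[1, :, 1] = 1.0   # both actions in state 1: stay in state 1
R = np.array([[0.0, 0.0], [1.0, 1.0]])
gamma = 0.9

Q_star = value_iteration(T, R, gamma)
V_star = Q_star.max(axis=1)
```

With rewards in [0, 1], every entry of `Q_star` is bounded by 1 / (1 - gamma), which is 10 for this choice of discount factor.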
2.1 Approximate state abstraction
We use here the notation from Abel et al. (2016). An abstraction is defined as a mapping from the state space of a ground MDP, , to that of an abstract MDP, , using a state aggregation function . We will write and for the ground and abstract MDPs respectively. The abstract state space is defined as the image of the ground state space by the mapping
We will write for the abstract state associated to a state in the ground space. We define
Let be a weighting such that for all , and . We define the abstract rewards and transition functions as the following convex combinations
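To make the convex-combination construction concrete, here is a small sketch that aggregates a tabular ground MDP into an abstract one. It assumes the usual weighted-aggregation form from the state abstraction literature (abstract rewards are weighted averages of member rewards, abstract transitions sum the weighted mass flowing into each aggregation); the function name, array shapes and the example numbers are hypothetical.

```python
import numpy as np


def build_abstract_mdp(T, R, phi, w):
    """Aggregate a ground MDP (T, R) using aggregation phi and weights w.

    T:   (S, A, S) ground transition probabilities.
    R:   (S, A) ground rewards.
    phi: (S,) integer array, phi[g] = index of the abstract state of g.
    w:   (S,) non-negative weights summing to 1 inside each abstract state.
    """
    n_abstract = int(phi.max()) + 1
    membership = np.eye(n_abstract)[phi]     # (S, K), one-hot abstract index
    weighted = membership * w[:, None]       # (S, K), weight of g inside k
    # Abstract reward: weighted average of the rewards of member states.
    R_abs = np.einsum("gk,ga->ka", weighted, R)
    # Abstract transition: weighted mass flowing from members of k to members of j.
    T_abs = np.einsum("gk,gah,hj->kaj", weighted, T, membership)
    return T_abs, R_abs


# Hypothetical example: three ground states, one action; states 0 and 1 are merged.
T = np.zeros((3, 1, 3))
T[0, 0] = [0.2, 0.3, 0.5]
T[1, 0] = [0.0, 0.4, 0.6]
T[2, 0] = [0.0, 0.0, 1.0]
R = np.array([[0.2], [0.4], [1.0]])
phi = np.array([0, 0, 1])
w = np.array([0.5, 0.5, 1.0])

T_abs, R_abs = build_abstract_mdp(T, R, phi, w)
```

Because the weights sum to one within each aggregation, each row of `T_abs` remains a probability distribution over abstract states.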
Prior work such as Li et al. (2006) has been mostly focused on exact abstraction in MDPs. While interesting, this notion is usually too restrictive and we will instead consider approximate abstractions (Abel et al., 2016)
Let and , defines an approximate state abstraction as follows
Throughout this paper we will illustrate our results with the model similarity abstraction, also known as approximate homomorphism (Ravindran and Barto, 2004) or -equivalent MDP (Even-Dar and Mansour, 2003)
Given , we let be such that:
Where co-aggregated states have close rewards and transition probabilities to other aggregations.
Let and be the optimal policies in the abstract and ground MDPs. We are interested in the quality of the policy learned in the abstraction when applied in the ground MDP. For a state and a state aggregation function , we define such that
We will also write and (resp. and ) the optimal Q-value and value functions in the ground (resp. abstract) MDP.
2.2 Optimal exploration and model-based interval estimation exploration bonus (MBIE-EB)
Exploration efficiency in reinforcement learning can be evaluated using the notions of sample complexity and PAC-MDP introduced by Kakade et al. (2003). We now briefly introduce both of these.
Define the sample complexity of an algorithm A to be the number of time steps where its policy at state is not -optimal: . An algorithm A is said to be PAC-MDP ("Probably Approximately Correct for MDPs") if given a fixed and its sample complexity is less than a polynomial function in the parameters with probability at least .
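The definition of sample complexity can be read as a simple count over a trajectory. The sketch below assumes we are given, for each time step, the value of the algorithm's current policy at the visited state and the optimal value at that state; in practice these quantities are not available to the learner, so this is purely an illustration of the definition. The function name and the example numbers are hypothetical.

```python
def sample_complexity(policy_values, optimal_values, epsilon):
    """Count time steps at which the current policy is not epsilon-optimal.

    policy_values[t]  : value of the algorithm's policy at the state visited at step t.
    optimal_values[t] : optimal value at the same state.
    """
    return sum(
        1
        for v, v_star in zip(policy_values, optimal_values)
        if v < v_star - epsilon
    )


# Hypothetical trajectory of length 3: only the first step is more than 0.2 sub-optimal.
n_mistakes = sample_complexity([0.5, 0.9, 1.0], [1.0, 1.0, 1.0], epsilon=0.2)
```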
We focus on MBIE-EB as a simple algorithm based on the state-action visit count, noting that more refined algorithms now exist with better sample guarantees (e.g. Azar et al., 2017; Dann et al., 2017) and that our analysis would extend easily to other algorithms based on state-action visit count. MBIE-EB learns the optimal policy by solving an empirical MDP based upon estimates of rewards and transitions and augments rewards with an exploration bonus
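The bonus-augmented backup can be sketched as follows, assuming the standard count-based bonus of Strehl and Littman (2008), namely a constant beta divided by the square root of the state-action visit count. The array layout, the function name and the treatment of unvisited pairs (set to the maximum value) are assumptions made for this illustration only.

```python
import numpy as np


def mbie_eb_q_values(counts_sas, reward_sums, gamma, beta, n_iters=500):
    """Value iteration on the empirical MDP with a count-based bonus.

    counts_sas:  (S, A, S) observed transition counts.
    reward_sums: (S, A) sum of rewards observed for each pair.
    """
    n_sa = counts_sas.sum(axis=2)                 # N(s, a)
    safe_n = np.maximum(n_sa, 1)                  # avoid division by zero
    R_hat = reward_sums / safe_n                  # empirical mean reward
    T_hat = counts_sas / safe_n[:, :, None]       # empirical transitions
    bonus = beta / np.sqrt(safe_n)                # exploration bonus
    q_max = 1.0 / (1.0 - gamma)
    Q = np.zeros_like(R_hat, dtype=float)
    for _ in range(n_iters):
        Q = R_hat + bonus + gamma * (T_hat @ Q.max(axis=1))
        # Assumption: pairs that were never visited are treated optimistically.
        Q = np.where(n_sa == 0, q_max, Q)
    return Q


# Hypothetical single-state example with zero reward: the rarely tried action
# (4 visits) receives a larger bonus than the frequently tried one (100 visits).
counts = np.array([[[100.0], [4.0]]])
rewards = np.zeros((1, 2))
Q_tilde = mbie_eb_q_values(counts, rewards, gamma=0.9, beta=1.0)
```

Acting greedily with respect to `Q_tilde` then favours the less-visited action, which is the optimism-driven behaviour the bonus is meant to produce.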
Theorem 1 (Strehl and Littman (2008)).
Let and consider an MDP . Let denote MBIE-EB executed on M with parameter , with an algorithmic constant, and let denote the state at time t. With probability at least , will hold for all but time steps, with
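Pseudo-counts have been proposed as a way to estimate counts using a density model $\rho$ over state-action pairs. Given a sequence of states $s_{1:n}$ and a sequence of actions $a_{1:n}$, we write $\rho_n(s, a) := \rho(s, a \, ; s_{1:n}, a_{1:n})$ the probability assigned to $(s, a)$ after training on $(s_{1:n}, a_{1:n})$. After training on $(s, a)$, we write the new probability assigned as $\rho'_n(s, a) := \rho(s, a \, ; s_{1:n}s, a_{1:n}a)$, where $s_{1:n}s$ denotes the concatenation of sequences $s_{1:n}$ and $s$. We require the model to be learning-positive, i.e. $\rho'_n(s, a) \geq \rho_n(s, a)$, and define the pseudo-count

$$\hat{N}_n(s, a) = \frac{\rho_n(s, a) \left( 1 - \rho'_n(s, a) \right)}{\rho'_n(s, a) - \rho_n(s, a)},$$

which is derived from requiring a one unit increase of the pseudo-count after observing $(s, a)$:

$$\rho_n(s, a) = \frac{\hat{N}_n(s, a)}{\hat{n}}, \qquad \rho'_n(s, a) = \frac{\hat{N}_n(s, a) + 1}{\hat{n} + 1},$$

where $\hat{n}$ is the pseudo-count total. We also define the empirical density derived from the state-action visit count $N_n(s, a)$:

$$\mu_n(s, a) = \frac{N_n(s, a)}{n}.$$

Notice that when $\rho_n = \mu_n$ the pseudo-count is consistent and recovers $N_n$. We will also be interested in exploration in abstractions, and to that end define the count of an aggregation $A$:

$$N_n(A, a) = \sum_{s \in A} N_n(s, a).$$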
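The consistency property mentioned above can be checked numerically. The sketch below assumes the pseudo-count formula of Bellemare et al. (2016), in which the pseudo-count equals rho (1 - rho') / (rho' - rho), where rho and rho' are the probabilities assigned to a pair before and after one extra observation of it. The `EmpiricalDensity` class and the state and action labels are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter


class EmpiricalDensity:
    """Density model that is exactly the empirical distribution of observed pairs."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def prob(self, x):
        # Probability of x before observing it once more.
        return self.counts[x] / self.total if self.total else 0.0

    def prob_after_update(self, x):
        # Probability of x if the model were trained on one more occurrence of x.
        return (self.counts[x] + 1) / (self.total + 1)

    def update(self, x):
        self.counts[x] += 1
        self.total += 1


def pseudo_count(rho, rho_prime):
    """Pseudo-count from the probabilities before (rho) and after (rho_prime) an update."""
    if rho_prime < rho:
        raise ValueError("the density model must be learning-positive")
    if rho_prime == rho:
        return float("inf")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


model = EmpiricalDensity()
for pair in [("s0", "left")] * 3 + [("s1", "right")] * 2:
    model.update(pair)

seen = ("s0", "left")
n_hat_seen = pseudo_count(model.prob(seen), model.prob_after_update(seen))

unseen = ("s2", "left")
n_hat_unseen = pseudo_count(model.prob(unseen), model.prob_after_update(unseen))
```

For the empirical density the pseudo-count of a pair seen three times is three (up to floating-point error), and the pseudo-count of an unseen pair is zero.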
3 Explicitly approximate exploration
While PAC-MDP algorithms provide guarantees that the agent will act close to optimally with high probability, their sample complexity must increase at least linearly with the size of the state space (Azar et al., 2012). In practice, algorithms are often given small budgets and may not be able to discover the optimal policy within this time. Dealing with smaller sample budgets is exactly the motivation behind approximate exploration methods such as Bellemare et al. (2016)’s, which we will analyze in greater detail in later sections.
When faced with a small budget, it might be appealing to sacrifice near-optimality for sample complexity. One way to do so is to derive the exploratory policy from a given abstraction. We call this process explicitly approximate exploration. As we now show, using a model similarity abstraction is a particularly appealing scheme for explicitly approximate exploration. MBIE-EB applied to the abstract MDP solves the following equation
To provide a setting facilitating exploration we first require the abstraction to have sub-optimality bounded in :
An abstraction is said to have sub-optimality bounded in if there exists a function , monotonically increasing in , with such that
And for any policy .
Definition 3 requires that for small enough we can recover a near-optimal policy using while working with a state space that can be significantly smaller than the ground state space. This property is verified by several abstractions studied by Abel et al. (2016).
However, when the abstraction is only approximate, learning the optimal policy of the abstract MDP does not imply recovering the optimal policy of the ground MDP.
For any , there exists and an MDP which defines a model similarity abstraction of parameter over its abstract space such that is not -optimal.
Given an approximate abstraction with sub-optimality bounded in , let , the (time-dependent) policy obtained while running MBIE-EB in the abstract MDP with and the derived MBIE policy , then with probability , the following bound holds for all but time steps:
Proposition 2 informs us that even though we cannot guarantee -optimality for arbitrary , the abstraction may explore significantly faster, with a sample complexity that depends on rather than .
4 Under- and over-exploration with pseudo-counts
Results from Ostrovski et al. (2017) suggest that the choice of density model plays a crucial role in the exploratory value of pseudo-count bonuses. Thus far, the only theoretical guarantee concerning pseudo-counts is given by Theorem 2 from Bellemare et al. (2016), which quantifies the asymptotic behaviour of pseudo-counts derived from a density model. We provide here an analysis of the finite-time behaviour of pseudo-counts, which is then used to give PAC-MDP guarantees. We show that for any given abstraction a density model can be learned over the abstraction and then used to approximate the bonus of Equation 6.
Let be a density model and a state abstraction with abstract state space . We define a density model over :
Similarly, . We also define a pseudo-count and total count such that,
We begin with two assumptions on our density model.
Given an abstraction , there exist constants such that for all and all sequences ,
Suppose Assumption 1 holds. Then the ratio of pseudo-counts to empirical counts is bounded and we have
Theorem 2 gives a sufficient condition for the pseudo-counts to behave multiplicatively like empirical counts. As already observed by Bellemare et al. (2016), this requires that tracks the empirical distribution , in particular converging at a rate of . However, our result allows this rate to vary over time.
Our result highlights the interplay between the choice of abstraction and the behaviour of the pseudo-counts. On one hand, applying Assumption 1 is quite restrictive, requiring that the density model basically match the empirical distribution. By choosing a coarser abstraction we relax this requirement, at the cost of near-optimality. In Section 5 we will instantiate the result by viewing the density model as inducing a particular state abstraction.
We now consider the following variant of MBIE-EB:
In this variant, the exploration bonus need not match the empirical count. To understand the effect of this change, consider the following two related settings. In the first setting, the pseudo-count increases slowly and consistently underestimates the empirical count. The pseudo-count exploration bonus, which is inversely proportional to the square root of the pseudo-count, will therefore remain high for a longer time. In the second setting, the pseudo-count increases quickly and consistently overestimates the empirical count. In turn, the pseudo-count bonus will go to zero much faster than the bonus derived from empirical counts. These two settings correspond to what we call over- and under-exploration, respectively: in the first the agent spends an excessive number of samples exploring, while in the second it may stop exploring before the guarantees are met. We will use Theorem 2 to quantify these two effects.
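A small calculation illustrates the two settings. Assume, purely for illustration, that the pseudo-count is a constant multiple of the true visit count and that the bonus takes the MBIE-EB form of a constant beta over the square root of the count. The function below returns how many true visits are needed before the bonus falls below a given threshold; its name and all numerical values are hypothetical.

```python
import math


def visits_until_bonus_below(beta, threshold, count_scale=1.0):
    """Smallest true visit count N such that beta / sqrt(count_scale * N) <= threshold.

    count_scale < 1 models a pseudo-count that underestimates the true count,
    count_scale > 1 models a pseudo-count that overestimates it.
    """
    return math.ceil(beta ** 2 / (count_scale * threshold ** 2))


exact = visits_until_bonus_below(beta=1.0, threshold=0.25)                    # true counts
slow = visits_until_bonus_below(beta=1.0, threshold=0.25, count_scale=0.5)    # underestimate
fast = visits_until_bonus_below(beta=1.0, threshold=0.25, count_scale=2.0)    # overestimate
```

With these numbers the bonus based on true counts drops below the threshold after 16 visits; halving the count doubles this to 32 visits, whereas doubling the count cuts it to 8 visits, so the state stops looking novel twice as early.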
Suppose that satisfies Assumption 1. By rearranging the terms, we find that, for any ,
Hence the uncertainty over carries over to the exploration bonus. Critically, the constant in MBIE-EB is tuned to guarantee that each state is visited at least times, with probability . The following lemma relates a change in with a change in these two quantities.
Perhaps unsurprisingly, over-exploration is rather mild, while under-exploration can cause exploration to fail altogether. A pseudo-count bonus derived from a density model satisfying the assumption of Theorem 2 must under-explore, unless (which implies , since is a probability distribution).
Lemma 1 suggests that we can correct for under-exploration by using a larger constant , for
Consider a variant A’ of MBIE-EB defined with an exploration bonus derived from a density model satisfying the assumption of Theorem 2, and the exploration constant . Then A’
does not under-explore, and
over-explores by a factor of at most .
5 Implicitly approximate exploration
In previous sections we studied an algorithm which is aware of, and takes into account, the state abstraction. In practice, however, bonus-based methods have been combined with a number of function approximation schemes; as noted by Bellemare et al. (2016), the degree of compatibility between the value function and the exploration bonus is sure to impact performance. We now combine the ideas of the two previous sections and study how the particular density model used to generate pseudo-counts induces an implicit approximation.
When does Assumption 1 hold? In general we cannot expect it to be valid for any given abstraction; in particular, it is unrealistic to hope that it will be verified in the ground state space. On the other hand, it is natural to assume that there exists an abstraction defined by the density model which satisfies the assumption with reasonable constants. In this section we will see that a density model defines an induced abstraction. In turn, we will quantify how this abstraction provides us with a handle on Assumption 1.
5.1 Induced abstraction
From a density model, we define a state abstraction function as follows.
For , the induced abstraction is such that
In words, two ground states are aggregated if the density model always assigns a close likelihood to both for each action. For example, this is the case when the visit counts of nearby states in a grid world are aggregated together; we will study such a model shortly. The definition of this abstraction is independent of the sequence of states the model was trained on and only depends on the model. It follows from this definition that co-aggregated states have similar pseudo-counts: from Theorem 2, for two ground states
Suppose that the induced pseudo-count satisfies Assumption 1. One may expect that this is sufficient to obtain similar guarantees to those of Theorem 3, by relating the ground pseudo-count (computed from ) to the abstract pseudo-count (which we could compute from ). In particular, for a small , we may expect the following relationship
whereby an abstract state’s pseudo-count is divided uniformly between the pseudo-counts of the states of the abstraction. Surprisingly, this is not the case, and in fact as we will show is greater than its corresponding . The following makes this precise:
Let be a density model and , , , and as before. Then for and
with , and are given by:
For an exact abstraction ()
Two remarks are in order. First, for any kind of aggregation, implies . Second, when , that is, as the density concentrates within a single aggregation, then the pseudo-counts for individual states grow unboundedly. Our result highlights an intriguing property of pseudo-counts: when the density model generalizes (in our case, by assigning the same probability to aggregated states) then the pseudo-counts of individual states increase faster than under the true, empirical density.
One particularly striking instance of this effect occurs when is itself defined from an abstraction . That is, consider the density model which assigns a uniform probability to all states within an aggregation:
Lemma 3.1 applies and we deduce that the pseudo-count associated with is greater than the visit count for its aggregation: . From Lemma 3.1 we conclude that, unless the induced abstraction is trivial, we cannot prevent under-exploration when using a pseudo-count based bonus. One way to derive meaningful guarantees is to bound the lemma’s multiplicative constant, by requiring that no abstraction be visited too often.
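The over-estimation can be reproduced numerically. The sketch below assumes a density model that takes the empirical mass of an aggregation (for a given action) and spreads it uniformly over the ground states of that aggregation, and then applies the pseudo-count formula of Bellemare et al. (2016). The exact form of the density, the function name and the example numbers are assumptions made for this illustration.

```python
def aggregated_pseudo_count(n_agg, n_total, agg_size):
    """Pseudo-count of one ground pair under a uniform-within-aggregation density.

    n_agg:    visits to the pair's aggregation (for the given action), with n_agg < n_total.
    n_total:  total number of observations so far.
    agg_size: number of ground states in the aggregation.
    """
    # Probability of the ground pair before and after one more visit to it.
    rho = n_agg / (n_total * agg_size)
    rho_prime = (n_agg + 1) / ((n_total + 1) * agg_size)
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


trivial = aggregated_pseudo_count(n_agg=3, n_total=5, agg_size=1)   # no aggregation
merged = aggregated_pseudo_count(n_agg=3, n_total=5, agg_size=4)    # four merged states
peaked = aggregated_pseudo_count(n_agg=4, n_total=5, agg_size=4)    # mass concentrates
```

With a single state per aggregation the pseudo-count equals the visit count (3). With four merged states it rises to 7.5 for the same three visits, and to 19 once four of the five observations fall in the aggregation, which matches the qualitative behaviour described above.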
Consider a state . If there exists such that then for an exact abstraction:
In particular, this result justifies how pseudo-counts generalize to unseen states: while a pair may not have been observed, the pseudo-count will increase as long as other pairs in the same aggregation are being visited.
One way to guarantee the existence of a uniform constant in Proposition 3 is to inject random noise in the behaviour of the agent, for example by acting $\epsilon$-greedily with respect to the MBIE-EB $Q$-values. In this case, a bound on can be derived by considering the rate of convergence to the stationary distribution of the induced Markov chain (see e.g. Fill, 1991).
5.2 Over-estimation impact on exploration
We now provide an example of an MDP (see Figure 0(a)) where the overestimation described previously can hurt exploration.
In this example, the initial state distribution is uniform over states . Each episode lasts for a single timestep. The agent can either choose the action left, transition to collecting a small reward or choose the action right which leads to state with probability collecting a reward , otherwise, the agent remains at the same state. In this setting it seems natural to aggregate states as they share similar properties.
We apply MBIE-EB to this environment and compare pseudo-counts derived from a density model similar to Equation 8 with the empirical count of the aggregation. From Corollary 3.1 we know that for any action and state . Furthermore, at the beginning of training, as the agent explores and alternates between the two actions at similar frequency, the overestimation grows linearly. When is small this can induce the agent to under-explore and choose the sub-optimal action left. We run our example with ; the value of action left is whereas it is for action right. Figure 0(b) shows the time to converge to the optimal policy over 20 seeds for different values of the MBIE-EB constant . While this example is pathological, it shows the impact pseudo-count over-estimation can have on exploration. In the next section we provide a way around this issue.
5.3 Correcting for counts over-estimation
The over-estimation issue detailed in Corollary 3.1 is a consequence of pseudo-counts postulating in their definition that the count of a single state should increase after updating the density model. In practice, when a state is visited, the count of every other state in the same aggregation should increase too. It is possible to derive a new pseudo-count bonus verifying this property, as we now show.
Let be the pseudo-count defined such that for any state-action pair
with the pseudo-count total. Then can be computed as follows
where and .
For an exact induced abstraction, does not suffer from the over-estimation previously mentioned and we have
for any state action pair .
Theorem 4 shows that it is possible to mitigate pseudo-count over-estimation at the cost of more compute, as the density model needs to be updated twice at each timestep. For the density model defined from an abstraction in Equation (8), the pseudo-count will this time exactly match the count of the abstraction. It should also be noted that we have , so any reward bonus derived from will be higher than if it were derived from instead, which may be beneficial in the function approximation setting, where the intrinsic reward would provide more signal.
5.4 Empirical evaluation
Combining Theorem 2 and Lemma 2 (or applying Theorem 4) we can bound the ratio of pseudo counts to empirical counts for a given abstraction verifying Assumption 1. Nevertheless the impact of a bonus derived from an abstraction to explore in the ground state space has not been quantified. This was referred to by Bellemare et al. (2016) as the lack of compatibility between the exploration bonus and the value function. While we were not able to derive theoretical results regarding this particular case, we provide an empirical study on a grid world.
We use a 9-room domain (see Figure 1(a)) where the agent starts from the bottom left and needs to reach one of four top right states to receive a positive reward of 1. The agent has access to four actions: up, down, left, right. Transitions are deterministic; moving into walls leaves the agent in the same position. The environment runs until the agent reaches the goal, at which point the agent is rewarded and the episode starts over from the initial position.
We compare MBIE-EB using the empirical count from Equation (4) with the variant of MBIE-EB using pseudo-count bonuses (MBIE-EB-PC) from Equation (7). (Footnote 1: We did not notice any significant difference between and and used for all experiments.) Pseudo-counts are derived from a density model (Equation 8) which assigns a uniform probability to states within the same room as shown in Figure 1(b). We also investigate the impact of an $\epsilon$-greedy policy as proposed in the previous subsection. Figure 3 depicts the cumulative rewards received by both our agents for different values of and . Each experiment is averaged over 5 seeds; the shaded area represents variance. It demonstrates that:
- MBIE-EB fulfills the task relatively well in most instances, while the lack of compatibility between the value function and the pseudo-count exploration bonus can impact performance to the point where MBIE-EB-PC fails completely (Figure 2(b)).
- While MBIE-EB is not much affected by the $\epsilon$-greedy policy, the parameter $\epsilon$ is critical for MBIE-EB-PC. While the pseudo-count bonus provides a signal to explore across rooms, a high value of $\epsilon$ is necessary for the agent to maneuver within individual rooms. In order to avoid under-exploration, higher values of work best.
- By not assigning a count to every state-action pair, MBIE-EB-PC can act greedily with respect to the environment and achieves a higher cumulative reward in the first 10,000 timesteps than MBIE-EB.
- MBIE-EB-PC is more robust to a wider range of values of , suggesting that exploration in the ground MDP is more subject to over-exploration.
6 Related Work
Performance bounds for efficient learning of MDPs have been thoroughly studied. In the context of PAC-MDP algorithms, model-based approaches such as Rmax (Brafman and Tennenholtz, 2002), MBIE and MBIE-EB (Strehl and Littman, 2008) or $E^3$ (Kearns and Singh, 2002) build an empirical model of a set of the environment's state-action pairs using the agent’s past experience. Strehl et al. (2006) also investigated the model-free case with delayed Q-learning and showed that they could reduce the dependence of the sample complexity on the state space dimension. The Bayesian Exploration Bonus proposed by Kolter and Ng (2009) is not PAC-MDP but offers the guarantee to act optimally with respect to the agent’s prior except for a polynomial number of timesteps.
In the average reward case, UCRL (Jaksch et al., 2010) was shown to obtain a low regret on MDPs with finite diameter. Many extensions exploit the structure of the MDP to further improve the regret bound (Ortner, 2013; Osband and Van Roy, 2014; Hutter, 2014; Fruit et al., 2018; Ok et al., 2018). Similarly, Kearns and Koller (1999) presented a variant of $E^3$ which is also PAC-MDP. Temporal abstraction in the form of extended actions (Sutton et al., 1999) has recently been studied for exploration. Brunskill and Li (2014) proposed a variant of Rmax for SMDPs and Fruit and Lazaric (2017) extended UCRL to MDPs where a set of options is available; both have shown promising results when a good set of options is available.
Finding abstractions in order to handle large state spaces remains a long-standing goal in reinforcement learning, and much work in the literature has focused on finding metrics to quantify state similarity (Bean et al., 1987; Andre and Russell, 2002). Li et al. (2006) provided a unifying view of exact abstractions that preserve optimality. Metrics related to the model similarity metric include bisimulation (Ferns et al., 2004, 2006), bounded-parameter MDPs (Givan et al., 2000), and -similarity (Even-Dar and Mansour, 2003; Ortner, 2007).
In this work we build on previous results related to state abstraction and exploration. We highlighted how they can help to better understand the success of exploration using pseudo-counts in the non-tabular case. As it turns out, in finite time, optimal exploration might be too hard to obtain, and we have to settle for approximate solutions that trade off speed of convergence against guarantees on the learned policy.
It is unlikely that practical exploration will enjoy near-optimality guarantees as powerful as those given by theoretical methods. In most environments, there are simply too many places to get lost. Alternative schemes – such as the value-based exploration idea proposed by Leike (2016) – may help but only so much. In our work, we showed that abstractions allow us to impose a certain prior on the shape that exploration needs to take.
We also found that pseudo-count based methods, like other abstraction-based schemes, can fail dramatically when they are incompatible with the environment. While this is expected given their practical trade-off, we believe our work moves us towards a better understanding of bonus-based methods in practice. An interesting question is whether adaptive schemes can be designed that would enjoy both the speed of exploration of coarse abstractions with the near-optimality guarantees of fine ones.
We would like to thank Mohammad Azar, Sai Krishna, Tristan Deleu, Alexandre Piché, Carles Gelada and Michael Noukhovitch for careful reading and insightful comments on an earlier version of the paper. This work was funded by FRQNT through the CHIST-ERA IGLU project.
- Abel et al.  David Abel, D. Ellis Hershkowitz, and Michael L. Littman. Near optimal behavior via approximate state abstraction. In Proceedings of the International Conference on Machine Learning, pages 2915–2923, 2016.
- Andre and Russell  David Andre and Stuart J Russell. State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, pages 119–125, 2002.
- Azar et al.  Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert Kappen. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the International Conference on Machine Learning, 2012.
- Azar et al.  Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, 2017.
- Bean et al.  James C Bean, John R Birge, and Robert L Smith. Aggregation in dynamic programming. Operations Research, 35(2):215–220, 1987.
- Bellemare et al.  Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
- Brafman and Tennenholtz  Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Brunskill and Li  Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 316–324, 2014.
- Burda et al.  Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
- Dann et al.  Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and Regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
- Even-Dar and Mansour  Eyal Even-Dar and Yishay Mansour. Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, pages 581–594. Springer, 2003.
- Ferns et al.  Norman Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 162–169. AUAI Press, 2004.
- Ferns et al.  Norman Ferns, Pablo Samuel Castro, Doina Precup, and Prakash Panangaden. Methods for computing state similarity in Markov decision processes. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2006.
- Fill  J. A. Fill. Eigenvalue bounds on convergence to stationarity for nonreversible markov chains. Annals of Applied Probability, 1:62–87, 1991.
- Fortunato et al.  Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. In Proceedings of the International Conference on Learning Representations, 2018.
- Fruit and Lazaric  Ronan Fruit and Alessandro Lazaric. Exploration–Exploitation in MDPs with Options. Artificial Intelligence and Statistics, 2017.
- Fruit et al.  Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In Proceedings of the International Conference on Machine Learning, volume 80, pages 1573–1581, 2018.
- Givan et al.  Robert Givan, Sonia Leach, and Thomas Dean. Bounded-parameter Markov decision processes. Artificial Intelligence, 122(1-2):71–109, 2000.
- Hutter  Marcus Hutter. Extreme state aggregation beyond MDPs. In International Conference on Algorithmic Learning Theory, pages 185–199. Springer, 2014.
- Jaksch et al.  Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Kakade et al.  Sham Machandranath Kakade et al. On the sample complexity of reinforcement learning. PhD thesis, University of London, 2003.
- Kearns and Koller  Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. IJCAI, 16, 1999.
- Kearns and Singh  Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
- Kolter and Ng  J Zico Kolter and Andrew Y Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513–520. ACM, 2009.
- Leike  Jan Leike. Exploration potential. arXiv preprint arXiv:1609.04994, 2016.
- Li  Lihong Li. A unifying framework for computational reinforcement learning theory. PhD thesis, Rutgers University-Graduate School-New Brunswick, 2009.
- Li et al.  Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for MDPs. In ISAIM, 2006.
- Ok et al.  Jungseul Ok, Alexandre Proutiere, and Damianos Tranos. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems 31, pages 8888–8896, 2018.
- Ortner  Ronald Ortner. Pseudometrics for state aggregation in average reward Markov decision processes. In International Conference on Algorithmic Learning Theory, pages 373–387. Springer, 2007.
- Ortner  Ronald Ortner. Adaptive aggregation for reinforcement learning in average reward markov decision processes. Annals of Operations Research, 208(1):321–336, 2013.
- Osband and Van Roy  Ian Osband and Benjamin Van Roy. Near-optimal reinforcement learning in factored MDPs. In Advances in Neural Information Processing Systems, pages 604–612, 2014.
- Ostrovski et al.  Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with Neural Density Models. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2721–2730. PMLR, 2017.
- Pathak et al.  Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the International Conference on Machine Learning, 2017.
- Plappert et al.  Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. In Proceedings of the International Conference on Learning Representations, 2018.
- Ravindran and Barto  Balaraman Ravindran and Andrew G Barto. Approximate homomorphisms: A framework for non-exact minimization in markov decision processes. In Proceedings of the Fifth International Conference on Knowledge Based Computer Systems, 2004.
- Strehl and Littman  Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
- Strehl et al.  Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM, 2006.
- Sutton et al.  Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
- Szita and Szepesvári  István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, 2010.
Appendix A Proofs
For a model similarity abstraction we have the following inequality:
Note that we have the following inequalities
The previous Lemma can be used to show a model similarity abstraction has sub-optimality bounded in and improves the bound of Abel et al. , which has a dependency, due to an issue in the original proof. To the best of our knowledge, ours is the first complete proof of this result.
A model similarity abstraction (Def. 1) has sub-optimality bounded in
Using arguments similar to those in Lemma 3, we can show that: