1 Introduction
In reinforcement learning (RL), an agent’s goal is to maximize the expected sum of future rewards obtained through interactions with an unknown environment. In doing so, the agent must balance exploration (acting to improve its knowledge of the environment) and exploitation (acting to maximize rewards according to its current knowledge). In the tabular setting, where each state can be modelled in isolation, near-optimal exploration is by now well understood and a number of algorithms provide finite-time guarantees (Brafman and Tennenholtz, 2002; Strehl and Littman, 2008; Jaksch et al., 2010; Szita and Szepesvári, 2010; Osband and Van Roy, 2014; Azar et al., 2017). To guarantee near-optimality, however, the sample complexity of theoretically-motivated exploration algorithms must scale at least linearly with the number of states in the environment (Azar et al., 2012).
Yet, recent empirical successes have shown that practical exploration is not hopeless (Bellemare et al., 2016; Ostrovski et al., 2017; Pathak et al., 2017; Plappert et al., 2018; Fortunato et al., 2018; Burda et al., 2018). In this paper we use the term approximate exploration to describe algorithms which sacrifice near-optimality in order to explore more quickly. A desirable characteristic of these algorithms is fast convergence to a reasonable policy; near-optimality may be achieved when the environment is “nice enough”.
Our specific aim is to gain new theoretical understanding of the pseudocount method, introduced by Bellemare et al. (2016) as a means of estimating visit counts in non-tabular settings, and of how it pertains to approximate exploration. Our study revolves around the MBIE-EB algorithm (Model-based Interval Estimation with Exploration Bonuses; Strehl and Littman, 2008) as a simple illustration of the general “optimism in the face of uncertainty” principle in an approximate exploration setting; MBIE-EB drives exploration by augmenting the empirical reward function with a count-based exploration bonus, which can be derived either from real counts or from pseudocounts. As a warm-up, we construct an explicitly approximate exploration algorithm by applying MBIE-EB to an abstract environment based on state abstraction (Li et al., 2006; Abel et al., 2016). In this setting we derive performance bounds that simultaneously depend on the quality and size of the aggregation: by taking a finer or coarser aggregation, one can trade off exploration speed and accuracy. We then relate pseudocounts to these aggregations and show how using pseudocounts within MBIE-EB can lead to under-exploration (failing to achieve theoretical guarantees) or over-exploration (using an excessive number of samples to do so). Additionally, we quantify the magnitude of both phenomena.
Finally, we show that using pseudocounts for exploration in the wild, as has been done in practice, produces implicitly approximate exploration. Specifically, under certain assumptions on the density model generating the pseudocounts, these behave approximately as if derived from a particular abstraction. This is in general problematic, as in pathological cases this prohibits any kind of theoretical guarantees. As an interesting corollary, we find a surprising mismatch between the behaviour of these pseudocounts and what might be expected given the abstraction they implicitly define.
2 Background and Notations
We consider a Markov decision process (MDP) represented by a 5-tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, with $\mathcal{S}$ a finite state space, $\mathcal{A}$ a finite set of actions, $P$ a transition probability distribution, $R$ a reward function, and $\gamma \in [0, 1)$ the discount factor. The goal of reinforcement learning is to find the optimal policy $\pi^*$ which maximizes the expected discounted sum of future rewards. For any policy $\pi$, the value $Q^\pi(s, a)$ of any state-action pair describes the expected discounted return after taking action $a$ in state $s$, then following $\pi$, and can be obtained using the Bellman equation
$$Q^\pi(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \, Q^\pi(s', \pi(s')).$$
We also introduce $V^\pi(s) = Q^\pi(s, \pi(s))$, which is the expected discounted return when starting in $s$ and following $\pi$. The value of the optimal policy verifies the optimal Bellman equation
$$Q^*(s, a) = R(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \max_{a'} Q^*(s', a').$$
We also write $V^*(s) = \max_a Q^*(s, a)$. Furthermore, we assume without loss of generality that rewards are bounded between 0 and 1, and we denote by $Q_{\max}$ the maximum value.
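The optimal Bellman equation above can be solved exactly in the tabular case. As a concrete illustration, here is a minimal value-iteration sketch; the function name and array layout are ours, not the paper's:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Solve the optimal Bellman equation for a tabular MDP.

    P: transition tensor of shape (S, A, S); R: reward matrix (S, A);
    gamma: discount factor in [0, 1). Returns the optimal Q-values.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * E[max_a' Q(s',a')]
        V = Q.max(axis=1)
        Q_new = R + gamma * (P @ V)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```

Since rewards lie in $[0, 1]$, the returned values are bounded by $Q_{\max} = \frac{1}{1-\gamma}$, consistent with the assumption above.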
2.1 Approximate state abstraction
We use here the notation from Abel et al. (2016). An abstraction is defined as a mapping from the state space of a ground MDP, $M_G$, to that of an abstract MDP, $M_A$, using a state aggregation function $\phi$. We will write $M_G = \langle \mathcal{S}_G, \mathcal{A}, P_G, R_G, \gamma \rangle$ and $M_A = \langle \mathcal{S}_A, \mathcal{A}, P_A, R_A, \gamma \rangle$ for the ground and abstract MDPs respectively. The abstract state space is defined as the image of the ground state space by the mapping $\phi$:
$$\mathcal{S}_A = \phi(\mathcal{S}_G). \qquad (1)$$
We will write $s_A = \phi(s)$ for the abstract state associated to a state $s$ in the ground space, and define the aggregation of a state as the pre-image of its abstract state:
$$\phi^{-1}(s_A) = \{ s \in \mathcal{S}_G : \phi(s) = s_A \}.$$
Let $w$ be a weighting such that for all $s \in \mathcal{S}_G$, $w(s) \in [0, 1]$, and $\sum_{s' \in \phi^{-1}(\phi(s))} w(s') = 1$. We define the abstract rewards and transition functions as the following convex combinations:
$$R_A(s_A, a) = \sum_{s \in \phi^{-1}(s_A)} w(s) \, R_G(s, a), \qquad P_A(s_A' \mid s_A, a) = \sum_{s \in \phi^{-1}(s_A)} \sum_{s' \in \phi^{-1}(s_A')} w(s) \, P_G(s' \mid s, a).$$
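The convex combinations above can be sketched in code. This is a hedged illustration with our own array layout: `phi` maps each ground state to an abstract-state index, and `w` is a weighting assumed to sum to one inside each aggregation:

```python
import numpy as np

def abstract_mdp(P, R, phi, w):
    """Build the abstract MDP induced by aggregation phi and weighting w.

    P: (S, A, S) ground transitions; R: (S, A) ground rewards;
    phi: length-S integer array mapping ground states to abstract states;
    w: length-S weights, nonnegative and summing to 1 inside each aggregation.
    Abstract rewards/transitions are the w-weighted convex combinations.
    """
    n_abs = phi.max() + 1
    S, A = R.shape
    R_abs = np.zeros((n_abs, A))
    P_abs = np.zeros((n_abs, A, n_abs))
    for s in range(S):
        R_abs[phi[s]] += w[s] * R[s]
        for t in range(S):
            # mass from ground transition s -> t lands on aggregation of t
            P_abs[phi[s], :, phi[t]] += w[s] * P[s, :, t]
    return P_abs, R_abs
```

Because $w$ is a convex weighting, each row of `P_abs` remains a probability distribution over abstract states.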
Prior work such as Li et al. (2006) has mostly focused on exact abstraction in MDPs. While interesting, this notion is usually too restrictive, and we will instead consider approximate abstractions (Abel et al., 2016).
Definition 1.
Let $f : \mathcal{S}_G \times \mathcal{A} \to \mathbb{R}$ and $\epsilon \geq 0$; then $\phi_{f, \epsilon}$ defines an approximate state abstraction as follows:
$$\phi_{f, \epsilon}(s_1) = \phi_{f, \epsilon}(s_2) \implies \forall a, \; |f(s_1, a) - f(s_2, a)| \leq \epsilon.$$
Throughout this paper we will illustrate our results with the model similarity abstraction, also known as approximate homomorphism (Ravindran and Barto, 2004) or approximate equivalence of MDPs (Even-Dar and Mansour, 2003).
Example 1.
Given $\epsilon \geq 0$, we let $\phi_{\text{model}, \epsilon}$ be such that $\phi_{\text{model}, \epsilon}(s_1) = \phi_{\text{model}, \epsilon}(s_2)$ implies, for all actions $a$ and abstract states $s_A$,
$$|R_G(s_1, a) - R_G(s_2, a)| \leq \epsilon \quad \text{and} \quad \Big| \sum_{s' \in \phi^{-1}(s_A)} \big( P_G(s' \mid s_1, a) - P_G(s' \mid s_2, a) \big) \Big| \leq \epsilon,$$
so that co-aggregated states have close rewards and close transition probabilities to other aggregations.
Let $\pi_A^*$ and $\pi_G^*$ be the optimal policies in the abstract and ground MDPs, respectively. We are interested in the quality of the policy learned in the abstraction when applied in the ground MDP. For a state $s$ and a state aggregation function $\phi$, we define $\pi_{GA}^*$ such that $\pi_{GA}^*(s) = \pi_A^*(\phi(s))$.
We will also write $Q_G^*$ and $V_G^*$ (resp. $Q_A^*$ and $V_A^*$) for the optimal Q-value and value functions in the ground (resp. abstract) MDP.
2.2 Optimal exploration and model-based interval estimation with exploration bonus (MBIE-EB)
Exploration efficiency in reinforcement learning can be evaluated using the notions of sample complexity and PAC-MDP introduced by Kakade (2003). We now briefly introduce both of these.
Definition 2.
Define the sample complexity of an algorithm A to be the number of time steps $t$ where its policy $\pi_t$ at state $s_t$ is not $\epsilon$-optimal, i.e. $V^{\pi_t}(s_t) < V^*(s_t) - \epsilon$. An algorithm A is said to be PAC-MDP ("Probably Approximately Correct for MDPs") if, given fixed $\epsilon > 0$ and $\delta > 0$, its sample complexity is less than a polynomial function in the parameters $(|\mathcal{S}|, |\mathcal{A}|, \frac{1}{\epsilon}, \frac{1}{\delta}, \frac{1}{1-\gamma})$ with probability at least $1 - \delta$.
We focus on MBIE-EB as a simple algorithm based on the state-action visit count, noting that more refined algorithms now exist with better sample complexity guarantees (e.g. Azar et al., 2017; Dann et al., 2017) and that our analysis would extend easily to other algorithms based on state-action visit counts. MBIE-EB learns the optimal policy by solving an empirical MDP based upon estimates of rewards and transitions, with rewards augmented by an exploration bonus:
$$\tilde{Q}(s, a) = \hat{R}(s, a) + \gamma \sum_{s'} \hat{P}(s' \mid s, a) \max_{a'} \tilde{Q}(s', a') + \frac{\beta}{\sqrt{n(s, a)}}, \qquad (4)$$
where $n(s, a)$ is the visit count of $(s, a)$ and $\beta$ an exploration constant.
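The bonus-augmented reward used by MBIE-EB can be sketched as follows. This is a hedged illustration; clipping unvisited counts at one is our own implementation choice, not the paper's:

```python
import numpy as np

def mbie_eb_rewards(R_hat, counts, beta):
    """Augment empirical rewards with the MBIE-EB count-based bonus.

    R_hat: (S, A) empirical mean rewards; counts: (S, A) visit counts;
    beta: exploration constant. Unvisited pairs receive the maximal bonus
    by clipping counts below at 1 (an assumption made for this sketch).
    """
    bonus = beta / np.sqrt(np.maximum(counts, 1))
    return R_hat + bonus
```

Solving the empirical MDP with these augmented rewards (e.g. by value iteration) yields the optimistic values $\tilde{Q}$.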
Theorem 1 (Strehl and Littman (2008)).
Let $\epsilon > 0$ and consider an MDP $M$. Let A denote MBIE-EB executed on $M$ with parameter $\beta$, with $\beta$ an algorithmic constant, and let $s_t$ denote the state at time $t$. With probability at least $1 - \delta$, $V^{\pi_t}(s_t) \geq V^*(s_t) - \epsilon$ will hold for all but $\zeta(\epsilon, \delta)$ time steps, with
$$\zeta(\epsilon, \delta) = \mathrm{poly}\!\left(|\mathcal{S}|, |\mathcal{A}|, \tfrac{1}{\epsilon}, \tfrac{1}{\delta}, \tfrac{1}{1-\gamma}\right). \qquad (5)$$
2.3 Pseudocounts
Pseudocounts have been proposed as a way to estimate counts using a density model over state-action pairs. Given a sequence of states $s_{1:t}$ and a sequence of actions $a_{1:t}$, we write $\rho_t(s, a)$ for the probability assigned to $(s, a)$ after training on $(s_{1:t}, a_{1:t})$. After training on $(s, a)$, we write the new probability assigned as $\rho'_t(s, a)$, corresponding to the model trained on the concatenation of the past sequence with $(s, a)$. We require the model to be learning-positive, i.e. $\rho'_t(s, a) > \rho_t(s, a)$, and define the pseudocount
$$\hat{N}_t(s, a) = \frac{\rho_t(s, a) \big( 1 - \rho'_t(s, a) \big)}{\rho'_t(s, a) - \rho_t(s, a)},$$
which is derived from requiring a one unit increase of the pseudocount after observing $(s, a)$:
$$\rho_t(s, a) = \frac{\hat{N}_t(s, a)}{\hat{n}_t}, \qquad \rho'_t(s, a) = \frac{\hat{N}_t(s, a) + 1}{\hat{n}_t + 1},$$
where $\hat{n}_t$ is the pseudocount total. We also define the empirical density $\mu_t(s, a) = N_t(s, a) / t$ derived from the state-action visit count $N_t(s, a)$. Notice that when $\rho_t = \mu_t$ the pseudocount is consistent and recovers $\hat{N}_t(s, a) = N_t(s, a)$. We will also be interested in exploration in abstractions, and to that end define the count of an aggregation $A$: $N_t(A, a) = \sum_{s \in A} N_t(s, a)$.
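The pseudocount implied by the one-unit-increase requirement can be written out directly; a quick sketch (the function name is ours):

```python
def pseudo_count(rho, rho_prime):
    """Pseudocount from a learning-positive density model.

    rho: probability the model assigns to (s, a) before training on it;
    rho_prime: probability after one more training step on (s, a).
    Derived by requiring the implied count to rise by one unit.
    """
    assert rho_prime > rho >= 0, "model must be learning-positive"
    return rho * (1.0 - rho_prime) / (rho_prime - rho)
```

With the empirical density (2 visits out of 4 steps, then 3 out of 5 after one more visit), the formula recovers the true count of 2, illustrating the consistency property.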
3 Explicitly approximate exploration
While PAC-MDP algorithms provide guarantees that the agent will act close to optimally with high probability, their sample complexity must increase at least linearly with the size of the state space (Azar et al., 2012). In practice, algorithms are often given small budgets and may not be able to discover the optimal policy within this time. Dealing with smaller sample budgets is exactly the motivation behind approximate exploration methods such as that of Bellemare et al. (2016), which we will analyze in greater detail in later sections.
When faced with a small budget, it might be appealing to sacrifice near-optimality for a smaller sample complexity. One way to do so is to derive the exploratory policy from a given abstraction. We call this process explicitly approximate exploration. As we now show, using a model similarity abstraction is a particularly appealing scheme for explicitly approximate exploration. MBIE-EB applied to the abstract MDP solves the following equation:
$$\tilde{Q}(s_A, a) = \hat{R}_A(s_A, a) + \gamma \sum_{s_A'} \hat{P}_A(s_A' \mid s_A, a) \max_{a'} \tilde{Q}(s_A', a') + \frac{\beta}{\sqrt{n(s_A, a)}}. \qquad (6)$$
To provide a setting facilitating exploration, we first require the abstraction to have suboptimality bounded in $\epsilon$:
Definition 3.
An abstraction $\phi$ is said to have suboptimality bounded in $\epsilon$ if there exists a function $g$, monotonically increasing in $\epsilon$ and with $g(0) = 0$, such that for all states $s \in \mathcal{S}_G$,
$$V_G^{\pi_{GA}}(s) \geq V_G^*(s) - g(\epsilon),$$
for any policy $\pi_{GA}$ derived from a policy optimal in the abstract MDP.
Definition 3 requires that for a small enough $\epsilon$ we can recover a near-optimal policy using $\phi$ while working with a state space that can be significantly smaller than the ground state space. This property is verified by several abstractions studied by Abel et al. (2016).
However, when the abstraction is only approximate, learning the optimal policy of the abstract MDP does not imply recovering the optimal policy of the ground MDP.
Proposition 1.
For any $\epsilon > 0$, there exists a discount factor $\gamma$ and an MDP $M_G$ which defines a model similarity abstraction of parameter $\epsilon$ over its abstract space such that $\pi_{GA}^*$ is not optimal.
We can nevertheless benefit from exploring using the abstract MDP. Combining Theorem 1 and Definition 3:
Proposition 2.
Given an approximate abstraction $\phi$ with suboptimality bounded in $\epsilon$, let $\pi_{A, t}$ be the (time-dependent) policy obtained while running MBIE-EB in the abstract MDP with accuracy $\epsilon' > 0$, and let $\pi_{GA, t}$ denote the derived MBIE policy. Then, with probability $1 - \delta$, the following bound holds for all but $\zeta(\epsilon', \delta)$ time steps:
$$V_G^{\pi_{GA, t}}(s_t) \geq V_G^*(s_t) - g(\epsilon) - \epsilon'.$$
Proposition 2 informs us that even though we cannot guarantee optimality for arbitrary $\epsilon$, the abstraction may explore significantly faster, with a sample complexity that depends on $|\mathcal{S}_A|$ rather than $|\mathcal{S}_G|$.
4 Under- and over-exploration with pseudocounts
Results from Ostrovski et al. (2017) suggest that the choice of density model plays a crucial role in the exploratory value of pseudocount bonuses. Thus far, the only theoretical guarantee concerning pseudocounts is Theorem 2 of Bellemare et al. (2016), which quantifies the asymptotic behaviour of pseudocounts derived from a density model. We provide here an analysis of the finite-time behaviour of pseudocounts, which is then used to give PAC-MDP guarantees. We show that, for any given abstraction, a density model can be learned over the abstraction and then used to approximate the bonus of Equation 6.
Definition 4.
Let $\rho$ be a density model and $\phi$ a state abstraction with abstract state space $\mathcal{S}_A$. We define a density model $\rho^\phi$ over $\mathcal{S}_A$:
$$\rho_t^\phi(s_A, a) = \sum_{s \in \phi^{-1}(s_A)} \rho_t(s, a).$$
Similarly, ${\rho'}_t^\phi(s_A, a) = \sum_{s \in \phi^{-1}(s_A)} \rho'_t(s, a)$. We also define a pseudocount $\hat{N}_t^\phi$ and total count $\hat{n}_t^\phi$ such that
$$\rho_t^\phi(s_A, a) = \frac{\hat{N}_t^\phi(s_A, a)}{\hat{n}_t^\phi}, \qquad {\rho'}_t^\phi(s_A, a) = \frac{\hat{N}_t^\phi(s_A, a) + 1}{\hat{n}_t^\phi + 1}.$$
We begin with two assumptions on our density model.
Assumption 1.
Given an abstraction $\phi$, there exist constants $b \geq a > 0$ such that for all abstract state-action pairs $(s_A, a')$ and all sequences of transitions,
$$a \, \mu_t^\phi(s_A, a') \leq \rho_t^\phi(s_A, a') \leq b \, \mu_t^\phi(s_A, a'),$$
and similarly for ${\rho'}_t^\phi$, where $\mu_t^\phi(s_A, a') = N_t(s_A, a') / t$ denotes the empirical density over abstract states.
Theorem 2.
Suppose Assumption 1 holds. Then the ratio of pseudocounts to empirical counts is bounded above and below in terms of the constants $a$ and $b$.
Theorem 2 gives a sufficient condition for the pseudocounts to behave multiplicatively like empirical counts. As already observed by Bellemare et al. (2016), this requires that $\rho$ track the empirical distribution $\mu$, in particular converging at a rate of $1/t$. However, our result allows this rate to vary over time.
Our result highlights the interplay between the choice of abstraction and the behaviour of the pseudocounts. On one hand, applying Assumption 1 is quite restrictive, requiring that the density model basically match the empirical distribution. By choosing a coarser abstraction we relax this requirement, at the cost of nearoptimality. In Section 5 we will instantiate the result by viewing the density model as inducing a particular state abstraction.
We now consider the following variant of MBIE-EB:
$$\tilde{Q}(s, a) = \hat{R}(s, a) + \gamma \sum_{s'} \hat{P}(s' \mid s, a) \max_{a'} \tilde{Q}(s', a') + \frac{\beta}{\sqrt{\hat{N}_t(s, a)}}. \qquad (7)$$
In this variant, the exploration bonus need not match the empirical count. To understand the effect of this change, consider the following two related settings. In the first setting, $\hat{N}_t$ increases slowly and consistently underestimates $N_t$. The pseudocount exploration bonus, which is inversely proportional to $\hat{N}_t$, will therefore remain high for a longer time. In the second setting, $\hat{N}_t$ increases quickly and consistently overestimates $N_t$; in turn, the pseudocount bonus will go to zero much faster than the bonus derived from empirical counts. These two settings correspond to what we call over- and under-exploration, respectively. We will use Theorem 2 to quantify these two effects.
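To make the two settings concrete, suppose the pseudocount is known to satisfy $a N \leq \hat{N} \leq b N$, as in the conclusion of Theorem 2. The induced interval on the bonus can be sketched as follows (a toy illustration; the function name is ours):

```python
import math

def bonus_range(count, beta, a, b):
    """Bonus interval implied by a multiplicative pseudocount bound.

    If a * N(s,a) <= N_hat(s,a) <= b * N(s,a) with 0 < a <= b, the
    pseudocount bonus beta / sqrt(N_hat) lies in the interval below.
    """
    lo = beta / math.sqrt(b * count)  # counts overestimated: bonus vanishes faster
    hi = beta / math.sqrt(a * count)  # counts underestimated: bonus stays high
    return lo, hi
```

The gap between the two endpoints, a factor of $\sqrt{b/a}$, is exactly the uncertainty that carries over to the exploration bonus.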
Suppose that $\rho$ satisfies Assumption 1. By rearranging the terms, we find that, for any state-action pair, the pseudocount bonus $\beta / \sqrt{\hat{N}_t(s, a)}$ lies within a multiplicative factor, determined by $a$ and $b$, of the count bonus $\beta / \sqrt{N_t(s, a)}$. Hence the uncertainty over $\hat{N}_t$ carries over to the exploration bonus. Critically, the constant $\beta$ in MBIE-EB is tuned to guarantee that each state is visited a minimum number of times, with probability $1 - \delta$. The following lemma relates a change in $\beta$ with a change in these two quantities.
Lemma 1.
Perhaps unsurprisingly, over-exploration is rather mild, while under-exploration can cause exploration to fail altogether. A pseudocount bonus derived from a density model satisfying the assumption of Theorem 2 must under-explore, unless $b = 1$ (which implies $\rho_t^\phi = \mu_t^\phi$, since $\rho_t^\phi$ is a probability distribution).
Lemma 1 suggests that we can correct for under-exploration by using a larger exploration constant:
Theorem 3.
Consider a variant A’ of MBIE-EB defined with an exploration bonus derived from a density model satisfying the assumption of Theorem 2, and with exploration constant $\beta' = \beta \sqrt{b}$. Then A’

does not under-explore, and

over-explores by a factor of at most $b / a$.
5 Implicitly approximate exploration
In previous sections we studied an algorithm which is aware of, and takes into account, the state abstraction. In practice, however, bonus-based methods have been combined with a number of function approximation schemes; as noted by Bellemare et al. (2016), the degree of compatibility between the value function and the exploration bonus is sure to impact performance. We now combine the ideas of the two previous sections and study how the particular density model used to generate pseudocounts induces an implicit approximation.
When does Assumption 1 hold? In general we cannot expect it to be valid for any given abstraction; in particular, it is unrealistic to hope that it will be verified in the ground state space. On the other hand, it is natural to assume that there exists an abstraction defined by the density model which satisfies the assumption with reasonable constants. In this section we will see that a density model defines an induced abstraction. In turn, we will quantify how this abstraction provides us with a handle on Assumption 1.
5.1 Induced abstraction
From a density model, we define a state abstraction function as follows.
Definition 5.
For $\epsilon \geq 0$, the induced abstraction $\phi_{\rho, \epsilon}$ is such that $\phi_{\rho, \epsilon}(s_1) = \phi_{\rho, \epsilon}(s_2)$ implies, for all actions $a$ and all training sequences, $|\rho(s_1, a) - \rho(s_2, a)| \leq \epsilon$.
In words, two ground states are aggregated if the density model always assigns a close likelihood to both for each action. For example, this is the case when the visit counts of nearby states in a grid world are aggregated together; we will study such a model shortly. The definition of this abstraction is independent of the sequence of states the model was trained on and depends only on the model. From this definition, co-aggregated states have similar pseudocounts: by Theorem 2, the pseudocounts of two ground states in the same aggregation differ by at most a factor controlled by $\epsilon$.
Suppose that the induced pseudocount satisfies Assumption 1. One may expect that this is sufficient to obtain guarantees similar to those of Theorem 3, by relating the ground pseudocount (computed from $\rho$) to the abstract pseudocount (which we could compute from $\rho^\phi$). In particular, for a small $\epsilon$, we may expect an abstract state's pseudocount to be divided uniformly between the pseudocounts of the states of the aggregation. Surprisingly, this is not the case; in fact, as we will show, $\hat{N}_t(s, a)$ is greater than the corresponding aggregation count $N_t(A, a)$. The following makes this precise:
Lemma 2.
Let $\rho$ be a density model, with $\phi_\rho$, $\hat{N}_t$, $N_t$, and $A$ as before. Then, for a state $s$ in an aggregation $A$, the pseudocount $\hat{N}_t(s, a)$ is lower-bounded by a multiple of the aggregation count $N_t(A, a)$, with a multiplicative constant that grows as the probability the model assigns to $A$ approaches one.
Corollary 3.1.
For an exact abstraction ($\epsilon = 0$), $\hat{N}_t(s, a) \geq N_t(A, a)$ for every state $s$ in the aggregation $A$.
Two remarks are in order. First, for any kind of aggregation, an exact induced abstraction implies $\hat{N}_t(s, a) \geq N_t(A, a)$. Second, as the density concentrates within a single aggregation, the pseudocounts for individual states grow unboundedly. Our result highlights an intriguing property of pseudocounts: when the density model generalizes (in our case, by assigning the same probability to aggregated states), the pseudocounts of individual states increase faster than under the true, empirical density.
One particularly striking instance of this effect occurs when $\rho$ is itself defined from an abstraction $\phi$. That is, consider the density model which assigns a uniform probability to all states within an aggregation:
$$\rho_t(s, a) = \frac{1}{|\phi^{-1}(\phi(s))|} \cdot \frac{N_t(\phi(s), a)}{t}. \qquad (8)$$
Corollary 3.1 applies, and we deduce that the pseudocount associated with a state $s \in A$ is greater than the visit count for its aggregation: $\hat{N}_t(s, a) \geq N_t(A, a)$. From Corollary 3.1 we conclude that, unless the induced abstraction is trivial, we cannot prevent under-exploration when using a pseudocount-based bonus. One way to derive meaningful guarantees is to bound the lemma's multiplicative constant, by requiring that no aggregation be visited too often.
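The overestimation under the uniform-over-aggregation model can be checked numerically. The sketch below applies the pseudocount formula to a model that spreads the aggregation's empirical frequency over its $k$ states (the parameterization is our own):

```python
def aggregation_pseudo_count(n_agg, t, k):
    """Pseudocount of one state under the uniform-over-aggregation model.

    The model assigns each of the k states in the aggregation the
    probability n_agg / (t * k), where n_agg is the aggregation's visit
    count after t steps; the usual pseudocount formula is then applied
    after one more visit to the aggregation.
    """
    rho = n_agg / (t * k)
    rho_prime = (n_agg + 1) / ((t + 1) * k)
    return rho * (1.0 - rho_prime) / (rho_prime - rho)
```

With $k = 1$ the empirical count is recovered exactly, while with $k = 2$ two visits out of four steps already yield a pseudocount of 3.5 per state, exceeding the aggregation count of 2.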
Proposition 3.
Consider a state $s$ in an aggregation $A$. If there exists $c < 1$ such that $N_t(A, a) / t \leq c$ for all $t$, then for an exact abstraction the overestimation is bounded by a constant depending on $c$:
$$\hat{N}_t(s, a) \lesssim \frac{1}{1 - c} \, N_t(A, a).$$
In particular, this result justifies how pseudocounts generalize to unseen states: while a pair $(s, a)$ may not have been observed, its pseudocount will increase as long as other pairs in the same aggregation are being visited.
One way to guarantee the existence of a uniform constant in Proposition 3 is to inject random noise into the behaviour of the agent, for example by acting $\epsilon$-greedily with respect to the MBIE-EB values. In this case, a bound can be derived by considering the rate of convergence to the stationary distribution of the induced Markov chain (see e.g. Fill, 1991).
5.2 Overestimation impact on exploration
We now provide an example MDP (see Figure 0(a)) where the overestimation described previously can hurt exploration.
In this example, the initial state distribution is uniform over a set of starting states, and each episode lasts a single timestep. The agent can either choose the action left, transitioning to an absorbing state and collecting a small reward, or choose the action right, which with some probability leads to a state with a larger reward; otherwise, the agent remains in the same state. In this setting it seems natural to aggregate the starting states, as they share similar properties.
We apply MBIE-EB to this environment and compare pseudocounts derived from a density model similar to Equation 8 with the empirical count of the aggregation. From Corollary 3.1 we know that the pseudocount overestimates the aggregation count for any action and aggregated state; furthermore, at the beginning of training, as the agent explores and alternates between the two actions at similar frequencies, the overestimation grows linearly. When $\beta$ is small this can induce the agent to under-explore and choose the suboptimal action left; in our instantiation, the value of action left is smaller than that of action right. Figure 0(b) shows the time to converge to the optimal policy over 20 seeds for different values of the MBIE-EB constant $\beta$. While this example is pathological, it shows the impact pseudocount overestimation can have on exploration; in the next section we provide a way around this issue.
5.3 Correcting for count overestimation
The overestimation issue detailed in Corollary 3.1 is a consequence of pseudocounts postulating, in their definition, that the count of a single state should increase after updating the density model. In practice, when a state is visited, the count of every other state in the same aggregation should increase too. It is possible to derive a new pseudocount bonus satisfying this property, as we now show.
Theorem 4.
Let $\bar{N}_t$ be the pseudocount defined such that, for any state-action pair, a visit to $(s, a)$ increases the implied count of every pair in the same aggregation by one unit, with $\bar{n}_t$ the corresponding pseudocount total. Then $\bar{N}_t$ can be computed from the probabilities the density model assigns before and after updating. For an exact induced abstraction, $\bar{N}_t$ does not suffer from the overestimation previously mentioned, and we have $\bar{N}_t(s, a) \leq \hat{N}_t(s, a)$ for any state-action pair.
Theorem 4 shows that it is possible to mitigate pseudocount overestimation at the cost of more compute, as the density model needs to be updated twice at each timestep. For the density model defined from an abstraction in Equation (8), the pseudocount will this time exactly match the count of the abstraction. It should also be noted that $\bar{N}_t(s, a) \leq \hat{N}_t(s, a)$, so any reward bonus derived from $\bar{N}_t$ will be higher than if it were derived from $\hat{N}_t$ instead, which may be beneficial in the function approximation setting, where the intrinsic reward would provide more signal.
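One way to realize the "every co-aggregated count increases" requirement is sketched below, assuming the aggregation size $k$ is known; this is our own illustration of the idea, not necessarily the paper's exact construction. Solving $\rho = \bar{N} / \bar{n}$ and $\rho' = (\bar{N} + 1) / (\bar{n} + k)$ for $\bar{N}$ gives:

```python
def corrected_pseudo_count(rho, rho_prime, k):
    """Corrected pseudocount: a visit raises the implied count of all
    k states in the aggregation by one (so the total rises by k).

    Solving rho = N / n and rho_prime = (N + 1) / (n + k) for N yields
    the formula below. A sketch assuming the aggregation size k is
    known; k = 1 recovers the usual pseudocount.
    """
    assert rho_prime > rho >= 0, "model must be learning-positive"
    return rho * (1.0 - k * rho_prime) / (rho_prime - rho)
```

Under the uniform-over-aggregation model of Equation (8), this corrected count recovers the aggregation's visit count exactly, in line with the discussion above.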
5.4 Empirical evaluation
Combining Theorem 2 and Lemma 2 (or applying Theorem 4), we can bound the ratio of pseudocounts to empirical counts for a given abstraction verifying Assumption 1. Nevertheless, the impact of a bonus derived from an abstraction on exploration in the ground state space has not been quantified. This was referred to by Bellemare et al. (2016) as the lack of compatibility between the exploration bonus and the value function. While we were not able to derive theoretical results for this particular case, we provide an empirical study on a grid world.
We use a 9-room domain (see Figure 1(a)) where the agent starts from the bottom left and needs to reach one of four top-right states to receive a positive reward of 1. The agent has access to four actions: up, down, left, right. Transitions are deterministic; moving into walls leaves the agent in the same position. The environment runs until the agent reaches the goal, at which point the agent is rewarded and the episode starts over from the initial position.
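A minimal deterministic grid world in this spirit can be sketched as follows (the layout format and class name are ours; the actual 9-room layout is larger):

```python
# Minimal grid world: '#' wall, 'S' start, 'G' goal, '.' floor.
# Deterministic transitions; moving into a wall is a no-op; reaching
# the goal yields reward 1 and restarts the episode from the start.
class GridWorld:
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, layout):
        self.grid = [list(row) for row in layout]
        self.start = next((i, j) for i, row in enumerate(self.grid)
                          for j, c in enumerate(row) if c == 'S')
        self.pos = self.start

    def step(self, a):
        di, dj = self.ACTIONS[a]
        ni, nj = self.pos[0] + di, self.pos[1] + dj
        if self.grid[ni][nj] != '#':      # walls leave the agent in place
            self.pos = (ni, nj)
        if self.grid[self.pos[0]][self.pos[1]] == 'G':
            self.pos = self.start          # rewarded, episode restarts
            return self.start, 1.0, True
        return self.pos, 0.0, False
```

An agent running MBIE-EB (or its pseudocount variant) would interact with such an environment through repeated calls to `step`.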
We compare MBIE-EB using the empirical count bonus from Equation (4) with the variant of MBIE-EB using pseudocount bonuses, MBIE-EB-PC, from Equation (7) (we did not notice any significant difference between the two alternatives and used the same setting for all experiments). Pseudocounts are derived from a density model (Equation 8) which assigns a uniform probability to states within the same room, as shown in Figure 1(b). We also investigate the impact of an $\epsilon$-greedy policy, as proposed in the previous subsection. Figure 3 depicts the cumulative rewards received by both agents for different values of $\beta$ and $\epsilon$. Each experiment is averaged over 5 seeds; shaded regions represent variance. It demonstrates that:

MBIE-EB fulfills the task relatively well in most instances, while the lack of compatibility between the value function and the pseudocount exploration bonus can impact performance to the point where MBIE-EB-PC fails completely (Figure 2(b)).

While MBIE-EB is not much affected by the $\epsilon$-greedy policy, the parameter $\epsilon$ is critical for MBIE-EB-PC. The pseudocount bonus provides a signal to explore across rooms, but a high value of $\epsilon$ is necessary for the agent to maneuver within individual rooms. In order to avoid under-exploration, higher values of $\epsilon$ work best.

By not assigning a count to every state-action pair, MBIE-EB-PC can act greedily with respect to the environment and achieves a higher cumulative reward in the first 10,000 timesteps than MBIE-EB.

MBIE-EB-PC is robust to a wider range of values of $\beta$, suggesting that exploration in the ground MDP is more subject to over-exploration.
6 Related Work
Performance bounds for efficient learning of MDPs have been thoroughly studied. In the context of PAC-MDP algorithms, model-based approaches such as R-max (Brafman and Tennenholtz, 2002), MBIE and MBIE-EB (Strehl and Littman, 2008), or E$^3$ (Kearns and Singh, 2002) build an empirical model of the environment's state-action pairs using the agent's past experience. Strehl et al. (2006) also investigated the model-free case with delayed Q-learning and showed that it lowers the sample complexity's dependence on the size of the state space. The Bayesian Exploration Bonus proposed by Kolter and Ng (2009) is not PAC-MDP but offers the guarantee of acting optimally with respect to the agent's prior except for a polynomial number of timesteps.
In the average reward case, UCRL (Jaksch et al., 2010) was shown to obtain low regret on MDPs with finite diameter. Many extensions exploit the structure of the MDP to further improve the regret bound (Ortner, 2013; Osband and Van Roy, 2014; Hutter, 2014; Fruit et al., 2018; Ok et al., 2018). Similarly, Kearns and Koller (1999) presented a variant of E$^3$ for factored MDPs which is also PAC-MDP. Temporal abstraction in the form of extended actions (Sutton et al., 1999) has recently been studied for exploration. Brunskill and Li (2014) proposed a variant of R-max for SMDPs, and Fruit and Lazaric (2017) extended UCRL to MDPs where a set of options is available; both have shown promising results when a good set of options is available.
Finding abstractions in order to handle large state spaces remains a long-standing goal in reinforcement learning, and much work in the literature has focused on finding metrics to quantify state similarity (Bean et al., 1987; Andre and Russell, 2002). Li et al. (2006) provided a unifying view of exact abstractions that preserve optimality. Metrics related to the model similarity metric include bisimulation (Ferns et al., 2004, 2006), bounded-parameter MDPs (Givan et al., 2000), and MDP similarity (Even-Dar and Mansour, 2003; Ortner, 2007).
Conclusion
In this work we built on previous results relating state abstraction and exploration, highlighting how they help us better understand the success of pseudocount-based exploration in the non-tabular case. As it turns out, with finite time, optimal exploration might be too hard to obtain, and we have to settle for approximate solutions that trade off speed of convergence and guarantees with respect to the learned policy.
It is unlikely that practical exploration will enjoy near-optimality guarantees as powerful as those given by theoretical methods. In most environments, there are simply too many places to get lost. Alternative schemes, such as the value-based exploration idea proposed by Leike (2016), may help, but only so much. In our work, we showed that abstractions allow us to impose a certain prior on the shape that exploration needs to take.
We also found that pseudocount-based methods, like other abstraction-based schemes, can fail dramatically when they are incompatible with the environment. While this is expected given their practical trade-off, we believe our work moves us toward a better understanding of bonus-based methods in practice. An interesting question is whether adaptive schemes can be designed that would enjoy both the speed of exploration of coarse abstractions and the near-optimality guarantees of fine ones.
Acknowledgements
We would like to thank Mohammad Azar, Sai Krishna, Tristan Deleu, Alexandre Piché, Carles Gelada and Michael Noukhovitch for careful reading and insightful comments on an earlier version of the paper. This work was funded by FRQNT through the CHISTERA IGLU project.
References

Abel et al. [2016] David Abel, D. Ellis Hershkowitz, and Michael L. Littman. Near optimal behavior via approximate state abstraction. In Proceedings of the International Conference on Machine Learning, pages 2915–2923, 2016.
Andre and Russell [2002] David Andre and Stuart J Russell. State abstraction for programmable reinforcement learning agents. In AAAI/IAAI, pages 119–125, 2002.
 Azar et al. [2012] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert Kappen. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the International Conference on Machine Learning, 2012.
 Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, 2017.
 Bean et al. [1987] James C Bean, John R Birge, and Robert L Smith. Aggregation in dynamic programming. Operations Research, 35(2):215–220, 1987.
 Bellemare et al. [2016] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying countbased exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
Brafman and Tennenholtz [2002] Ronen I Brafman and Moshe Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
Brunskill and Li [2014] Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the International Conference on Machine Learning, pages 316–324, 2014.
 Burda et al. [2018] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
 Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and Regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
Even-Dar and Mansour [2003] Eyal Even-Dar and Yishay Mansour. Approximate equivalence of Markov decision processes. In Learning Theory and Kernel Machines, pages 581–594. Springer, 2003.

Ferns et al. [2004] Norman Ferns, Prakash Panangaden, and Doina Precup. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 162–169. AUAI Press, 2004.
Ferns et al. [2006] Norman Ferns, Pablo Samuel Castro, Doina Precup, and Prakash Panangaden. Methods for computing state similarity in Markov decision processes. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2006.
Fill [1991] J. A. Fill. Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains. Annals of Applied Probability, 1:62–87, 1991.
 Fortunato et al. [2018] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. In Proceedings of the International Conference on Learning Representations, 2018.
 Fruit and Lazaric [2017] Ronan Fruit and Alessandro Lazaric. Exploration–Exploitation in MDPs with Options. Artificial Intelligence and Statistics, 2017.
 Fruit et al. [2018] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient biasspanconstrained explorationexploitation in reinforcement learning. In Proceedings of the International Conference on Machine Learning, volume 80, pages 1573–1581, 2018.
 Givan et al. [2000] Robert Givan, Sonia Leach, and Thomas Dean. Boundedparameter Markov decision processes. Artificial Intelligence, 122(12):71–109, 2000.
 Hutter [2014] Marcus Hutter. Extreme state aggregation beyond MDPs. In International Conference on Algorithmic Learning Theory, pages 185–199. Springer, 2014.
 Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Nearoptimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
Kakade [2003] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, 2003.
 Kearns and Koller [1999] Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. IJCAI, 16, 1999.
 Kearns and Singh [2002] Michael Kearns and Satinder Singh. Nearoptimal reinforcement learning in polynomial time. Machine learning, 49(23):209–232, 2002.
 Kolter and Ng [2009] J Zico Kolter and Andrew Y Ng. NearBayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513–520. ACM, 2009.
 Leike [2016] Jan Leike. Exploration potential. arXiv preprint arXiv:1609.04994, 2016.
 Li [2009] Lihong Li. A unifying framework for computational reinforcement learning theory. PhD thesis, Rutgers UniversityGraduate SchoolNew Brunswick, 2009.
 Li et al. [2006] Lihong Li, Thomas J Walsh, and Michael L Littman. Towards a unified theory of state abstraction for MDPs. In ISAIM, 2006.
 Ok et al. [2018] Jungseul Ok, Alexandre Proutiere, and Damianos Tranos. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems 31, pages 8888–8896, 2018.
 Ortner [2007] Ronald Ortner. Pseudometrics for state aggregation in average reward Markov decision processes. In International Conference on Algorithmic Learning Theory, pages 373–387. Springer, 2007.
 Ortner [2013] Ronald Ortner. Adaptive aggregation for reinforcement learning in average reward markov decision processes. Annals of Operations Research, 208(1):321–336, 2013.
 Osband and Van Roy [2014] Ian Osband and Benjamin Van Roy. Nearoptimal reinforcement learning in factored MDPs. In Advances in Neural Information Processing Systems, pages 604–612, 2014.
 Ostrovski et al. [2017] Georg Ostrovski, Marc G. Bellemare, Aäron van den Oord, and Rémi Munos. Countbased exploration with Neural Density Models. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 2721–2730. PMLR, 2017.
 Pathak et al. [2017] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiositydriven exploration by selfsupervised prediction. In Proceedings of the International Conference on Machine Learning, 2017.
 Plappert et al. [2018] Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter space noise for exploration. In Proceedings of the International Conference on Learning Representations, 2018.
Ravindran and Barto [2004] Balaraman Ravindran and Andrew G Barto. Approximate homomorphisms: A framework for non-exact minimization in Markov decision processes. In Proceedings of the Fifth International Conference on Knowledge Based Computer Systems, 2004.
 Strehl and Littman [2008] Alexander L Strehl and Michael L Littman. An analysis of modelbased interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
 Strehl et al. [2006] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC modelfree reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM, 2006.
 Sutton et al. [1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semiMDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(12):181–211, 1999.
 Szita and Szepesvári [2010] István Szita and Csaba Szepesvári. Modelbased reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, 2010.
Appendix A Proofs
Lemma 3.
For a model similarity abstraction we have the following inequality:
Proof.
Note that we have the following inequalities
Then
∎
Remark.
The previous lemma can be used to show that a model similarity abstraction has suboptimality bounded in $\epsilon$; it also improves the bound of Abel et al. [2016], which has a worse horizon dependency due to an issue in the original proof. To the best of our knowledge, ours is the first complete proof of this result.
Lemma 4.
A model similarity abstraction (Def. 1) has suboptimality bounded in $\epsilon$:
(9) 
Proof.
Using arguments similar to those in Lemma 3, we can show that: