 # Unifying Count-Based Exploration and Intrinsic Motivation

We consider an agent's uncertainty about its environment and the problem of generalizing this uncertainty across observations. Specifically, we focus on the problem of exploration in non-tabular reinforcement learning. Drawing inspiration from the intrinsic motivation literature, we use density models to measure uncertainty, and propose a novel algorithm for deriving a pseudo-count from an arbitrary density model. This technique enables us to generalize count-based exploration algorithms to the non-tabular case. We apply our ideas to Atari 2600 games, providing sensible pseudo-counts from raw pixels. We transform these pseudo-counts into intrinsic rewards and obtain significantly improved exploration in a number of hard games, including the infamously difficult Montezuma's Revenge.


## 1 Introduction

Exploration algorithms for Markov Decision Processes (MDPs) are typically concerned with reducing the agent’s uncertainty over the environment’s reward and transition functions. In a tabular setting, this uncertainty can be quantified using confidence intervals derived from Chernoff bounds, or inferred from a posterior over the environment parameters. In fact, both confidence intervals and posterior shrink as the inverse square root of the state-action visit count $N(x, a)$, making this quantity fundamental to most theoretical results on exploration.

Count-based exploration methods directly use visit counts to guide an agent’s behaviour towards reducing uncertainty. For example, Model-based Interval Estimation with Exploration Bonuses (MBIE-EB; Strehl and Littman, 2008) solves the augmented Bellman equation

$$V(x) = \max_{a \in \mathcal{A}} \left[ \hat{R}(x, a) + \gamma \, \mathbb{E}_{\hat{P}}\left[ V(x') \right] + \beta N(x, a)^{-1/2} \right],$$

involving the empirical reward $\hat{R}$, the empirical transition function $\hat{P}$, and an exploration bonus proportional to $N(x, a)^{-1/2}$. This bonus accounts for uncertainties in both transition and reward functions and enables a finite-time bound on the agent’s suboptimality.
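As a rough illustration of the augmented Bellman equation above, the following sketch runs value iteration with a count-based bonus on a toy tabular problem. The function name `mbie_eb_values`, the array layout, and the values of `beta` and `gamma` are assumptions chosen for the example, not settings taken from Strehl and Littman (2008).

```python
import numpy as np

def mbie_eb_values(R_hat, P_hat, N, beta=0.05, gamma=0.9, iters=500):
    """Iterate V(x) = max_a [R_hat(x,a) + gamma * E_{P_hat}[V(x')] + beta * N(x,a)^(-1/2)].

    R_hat: (S, A) empirical rewards; P_hat: (S, A, S) empirical transitions;
    N: (S, A) visit counts, assumed >= 1 here so that the bonus stays finite.
    """
    bonus = beta / np.sqrt(N)
    V = np.zeros(R_hat.shape[0])
    for _ in range(iters):
        Q = R_hat + gamma * (P_hat @ V) + bonus  # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return V, Q

# Toy problem (hypothetical): two states, two actions, every action is a self-loop,
# no extrinsic reward. Only the visit counts differ.
R_hat = np.zeros((2, 2))
P_hat = np.zeros((2, 2, 2))
P_hat[0, :, 0] = 1.0
P_hat[1, :, 1] = 1.0
N = np.array([[1.0, 100.0], [100.0, 100.0]])

V, Q = mbie_eb_values(R_hat, P_hat, N)
```

In this toy problem the rarely tried action in state 0 receives the largest bonus, so the greedy policy with respect to `Q` prefers it even though no reward has been observed.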

In spite of their pleasant theoretical guarantees, count-based methods have not played a role in the contemporary successes of reinforcement learning (e.g. Mnih et al., 2015). Instead, most practical methods still rely on simple rules such as $\epsilon$-greedy. The issue is that visit counts are not directly useful in large domains, where states are rarely visited more than once.

Answering a different scientific question, intrinsic motivation aims to provide qualitative guidance for exploration (Schmidhuber, 1991; Oudeyer et al., 2007; Barto, 2013). This guidance can be summarized as “explore what surprises you”. A typical approach guides the agent based on change in prediction error, or learning progress. If $e_n(A)$ is the error made by the agent at time $n$ over some event $A$, and $e_{n+1}(A)$ the same error after observing a new piece of information, then learning progress is

$$e_n(A) - e_{n+1}(A).$$

Intrinsic motivation methods are attractive as they remain applicable in the absence of the Markov property or the lack of a tabular representation, both of which are required by count-based algorithms. Yet the theoretical foundations of intrinsic motivation remain largely absent from the literature, which may explain its slow rate of adoption as a standard approach to exploration.

In this paper we provide formal evidence that intrinsic motivation and count-based exploration are but two sides of the same coin. Specifically, we consider a frequently used measure of learning progress, information gain (Cover and Thomas, 1991). Defined as the Kullback-Leibler divergence of a prior distribution from its posterior, information gain can be related to the confidence intervals used in count-based exploration. Our contribution is to propose a new quantity, the pseudo-count, which connects information-gain-as-learning-progress and count-based exploration.

We derive our pseudo-count from a density model over the state space. This departs from more traditional approaches to intrinsic motivation, which consider learning progress with respect to a transition model. We expose the relationship between pseudo-counts, a variant of Schmidhuber’s compression progress we call prediction gain, and information gain. Combined with Kolter and Ng’s negative result on the frequentist suboptimality of Bayesian bonuses, our result highlights the theoretical advantages of pseudo-counts compared to many existing intrinsic motivation methods.

The pseudo-counts we introduce here are best thought of as “function approximation for exploration”. We bring them to bear on Atari 2600 games from the Arcade Learning Environment (Bellemare et al., 2013), focusing on games where myopic exploration fails. We extract our pseudo-counts from a simple density model and use them within a variant of MBIE-EB. We apply them to an experience replay setting and to an actor-critic setting, and find improved performance in both cases. Our approach produces dramatic progress on the reputedly most difficult Atari 2600 game, Montezuma’s Revenge: within a fraction of the training time, our agent explores a significant portion of the first level and obtains significantly higher scores than previously published agents.

## 2 Notation

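We consider a countable state space $\mathcal{X}$. We denote a sequence of length $n$ from $\mathcal{X}$ by $x_{1:n} \in \mathcal{X}^n$, the set of finite sequences from $\mathcal{X}$ by $\mathcal{X}^*$, write $x_{1:n}x$ to mean the concatenation of $x_{1:n}$ and a state $x \in \mathcal{X}$, and denote the empty sequence by $\epsilon$. A model over $\mathcal{X}$ is a mapping from $\mathcal{X}^*$ to probability distributions over $\mathcal{X}$. That is, for each $x_{1:n} \in \mathcal{X}^n$ the model provides a probability distribution

$$\rho_n(x) := \rho(x \,;\, x_{1:n}).$$

Note that we do not require $\rho_n(x)$ to be strictly positive for all $x$ and $x_{1:n}$. When it is, however, we may understand $\rho_n(x)$ to be the usual conditional probability of $X_{n+1} = x$ given $X_1 \dots X_n = x_{1:n}$.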

We will take particular interest in the empirical distribution $\mu_n$ derived from the sequence $x_{1:n}$. If $N_n(x) := N(x, x_{1:n})$ is the number of occurrences of a state $x$ in the sequence $x_{1:n}$, then

$$\mu_n(x) := \mu(x \,;\, x_{1:n}) := \frac{N_n(x)}{n}.$$

We call $N_n$ the empirical count function, or simply empirical count. The above notation extends to state-action spaces, and we write $N_n(x, a)$ to explicitly refer to the number of occurrences of a state-action pair when the argument requires it. When $x_{1:n}$ is generated by an ergodic Markov chain, for example if we follow a fixed policy in a finite-state MDP, then the limit point of $\mu_n$ is the chain’s stationary distribution.

In our setting, a density model is any model that assumes states are independently (but not necessarily identically) distributed; a density model is thus a particular kind of generative model. We emphasize that a density model differs from a forward model, which takes into account the temporal relationship between successive states. Note that $\mu_n$ is itself a density model.

## 3 From Densities to Counts

In the introduction we argued that the visit count $N_n(x)$ (and consequently, $N_n(x, a)$) is not directly useful in practical settings, since states are rarely revisited. Specifically, $N_n(x)$ is almost always zero and cannot help answer the question “How novel is this state?” Nor is the problem solved by a Bayesian approach: even variable-alphabet models (e.g. Hutter, 2013) must assign a small, diminishing probability to yet-unseen states. To estimate the uncertainty of an agent’s knowledge, we must instead look for a quantity which generalizes across states. Guided by ideas from the intrinsic motivation literature, we now derive such a quantity. We call it a pseudo-count as it extends the familiar notion from Bayesian estimation.

### 3.1 Pseudo-Counts and the Recoding Probability

We are given a density model $\rho$ over $\mathcal{X}$. This density model may be approximate, biased, or even inconsistent. We begin by introducing the recoding probability of a state $x$:

$$\rho'_n(x) := \rho(x \,;\, x_{1:n}x).$$

This is the probability assigned to $x$ by our density model after observing a new occurrence of $x$. The term “recoding” is inspired by the statistical compression literature, where coding costs are inversely related to probabilities (Cover and Thomas, 1991). When $\rho$ admits a conditional probability distribution,

$$\rho'_n(x) = \Pr\nolimits_\rho(X_{n+2} = x \mid X_1 \dots X_n = x_{1:n}, X_{n+1} = x).$$

We now postulate two unknowns: a pseudo-count function $\hat{N}_n(x)$, and a pseudo-count total $\hat{n}$. We relate these two unknowns through two constraints:

$$\rho_n(x) = \frac{\hat{N}_n(x)}{\hat{n}}, \qquad \rho'_n(x) = \frac{\hat{N}_n(x) + 1}{\hat{n} + 1}. \tag{1}$$

In words: we require that, after observing one instance of $x$, the density model’s increase in prediction of that same $x$ should correspond to a unit increase in pseudo-count. The pseudo-count itself is derived from solving the linear system (1):

$$\hat{N}_n(x) = \frac{\rho_n(x)\,(1 - \rho'_n(x))}{\rho'_n(x) - \rho_n(x)} = \hat{n}\,\rho_n(x). \tag{2}$$

Note that the equations (1) yield $\hat{N}_n(x) = 0$ (with $\hat{n} = \infty$) when $\rho_n(x) = \rho'_n(x) = 0$, and are inconsistent when $\rho_n(x) < \rho'_n(x) = 1$. These cases may arise from poorly behaved density models, but are easily accounted for. From here onwards we will assume a consistent system of equations.
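To make the step from (1) to (2) concrete: substituting $\hat{N}_n(x) = \hat{n}\rho_n(x)$ into the second constraint gives $\hat{n}\rho_n(x) + 1 = (\hat{n} + 1)\rho'_n(x)$, hence $\hat{n} = (1 - \rho'_n(x)) / (\rho'_n(x) - \rho_n(x))$. The sketch below evaluates the resulting formula. The function name `pseudo_count` and the way the degenerate cases are handled are illustrative choices, not something the text prescribes.

```python
import math

def pseudo_count(rho, rho_prime):
    """Pseudo-count from Equation (2): rho * (1 - rho') / (rho' - rho)."""
    if rho == 0.0 and rho_prime == 0.0:
        return 0.0       # degenerate case, pseudo-count total is infinite
    if rho_prime == rho:
        return math.inf  # the model's prediction did not change
    if rho_prime == 1.0:
        raise ValueError("inconsistent system: rho < rho' = 1")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

# Sanity check with the empirical distribution: a state seen 3 times out of 10
# has rho = 3/10 and recoding probability rho' = 4/11.
n_hat_x = pseudo_count(3 / 10, 4 / 11)
```

With the empirical distribution as the density model, the value returned is (up to floating-point error) the true count of 3, and the implied pseudo-count total `n_hat_x / 0.3` is 10.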

###### Definition 1 (Learning-positive density model).

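A density model $\rho$ is learning-positive if for all $x_{1:n} \in \mathcal{X}^n$ and all $x \in \mathcal{X}$, $\rho'_n(x) \ge \rho_n(x)$.

By inspecting (2), we see that

1. $\hat{N}_n(x) \ge 0$ if and only if $\rho$ is learning-positive;

2. $\hat{N}_n(x) = 0$ if and only if $\rho_n(x) = 0$; and

3. $\hat{N}_n(x) = \infty$ if and only if $\rho_n(x) = \rho'_n(x)$.

In many cases of interest, the pseudo-count $\hat{N}_n(x)$ matches our intuition. If $\rho_n = \mu_n$ then $\hat{N}_n = N_n$. Similarly, if $\rho_n$ is a Dirichlet estimator then $\hat{N}_n$ recovers the usual notion of pseudo-count. More importantly, if the model generalizes across states then so do pseudo-counts.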

### 3.2 Estimating the Frequency of a Salient Event in Freeway

As an illustrative example, we employ our method to estimate the number of occurrences of an infrequent event in the Atari 2600 video game Freeway (Figure 1, screenshot). We use the Arcade Learning Environment (Bellemare et al., 2013). We will demonstrate the following:

1. Pseudo-counts are roughly zero for novel events,

2. they exhibit credible magnitudes,

3. they respect the ordering of state frequency,

4. they grow linearly (on average) with real counts,

5. they are robust in the presence of nonstationary data.

These properties suggest that pseudo-counts provide an appropriate generalized notion of visit counts in non-tabular settings.

In Freeway, the agent must navigate a chicken across a busy road. As our example, we consider estimating the number of times the chicken has reached the very top of the screen. As is the case for many Atari 2600 games, this naturally salient event is associated with an increase in score, which ALE translates into a positive reward. We may reasonably imagine that knowing how certain we are about this part of the environment is useful. After crossing, the chicken is teleported back to the bottom of the screen.

To highlight the robustness of our pseudo-count, we consider a nonstationary policy which waits for 250,000 frames, then applies the up action for 250,000 frames, then waits, then goes up again. The salient event only occurs during up periods. It also occurs with the cars in different positions, thus requiring generalization. As a point of reference, we record the pseudo-counts for both the salient event and visits to the chicken’s start position.

We use a simplified, pixel-level version of the CTS model for Atari 2600 frames proposed by Bellemare et al. (2014), ignoring temporal dependencies. While the CTS model is rather impoverished in comparison to state-of-the-art density models for images (e.g. Van den Oord et al., 2016), its count-based nature results in extremely fast learning, making it an appealing candidate for exploration. Further details on the model may be found in the appendix.

Figure 1: Pseudo-counts obtained from a CTS density model applied to Freeway, along with a frame representative of the salient event (crossing the road). Shaded areas depict periods during which the agent observes the salient event, dotted lines interpolate across periods during which the salient event is not observed. The reported values are 10,000-frame averages.

Examining the pseudo-counts depicted in Figure 1 confirms that they exhibit the desirable properties listed above. In particular, the pseudo-count is almost zero on the first occurrence of the salient event; it increases slightly during the 3rd period, since the salient and reference events share some common structure; throughout, it remains smaller than the reference pseudo-count. The linearity on average and robustness to nonstationarity are immediate from the graph. Note, however, that the pseudo-counts are a fraction of the real visit counts (inasmuch as we can define “real”): by the end of the trial, the start position has been visited about 140,000 times, and the topmost part of the screen, 1285 times. Furthermore, the ratio of recorded pseudo-counts differs from the ratio of real counts. Both effects are quantifiable, as we shall show in Section 5.

## 4 The Connection to Intrinsic Motivation

Having argued that pseudo-counts appropriately generalize visit counts, we will now show that they are closely related to information gain, which is commonly used to quantify novelty or curiosity and consequently as an intrinsic reward. Information gain is defined in relation to a mixture model $\xi$ over a class of density models $\mathcal{M}$. This model predicts according to a weighted combination from $\mathcal{M}$:

$$\xi_n(x) := \xi(x \,;\, x_{1:n}) := \int_{\rho \in \mathcal{M}} w_n(\rho)\, \rho(x \,;\, x_{1:n})\, d\rho,$$

with $w_n(\rho)$ the posterior weight of $\rho$. This posterior is defined recursively, starting from a prior distribution $w_0$ over $\mathcal{M}$:

$$w_{n+1}(\rho) := w_n(\rho, x_{n+1}), \qquad w_n(\rho, x) := \frac{w_n(\rho)\, \rho(x \,;\, x_{1:n})}{\xi_n(x)}. \tag{3}$$

Information gain is then the Kullback-Leibler divergence from prior to posterior that results from observing $x$:

$$\mathrm{IG}_n(x) := \mathrm{IG}(x \,;\, x_{1:n}) := \mathrm{KL}\big(w_n(\cdot, x) \,\|\, w_n\big).$$
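The definitions above can be sketched for the simplest case of a finite model class, where the integral reduces to a weighted sum. The function names and the two-model example are hypothetical and only meant to show the mechanics of the posterior update (3) and of information gain.

```python
import numpy as np

def mixture_prob(weights, probs_x):
    """xi_n(x) = sum over rho of w_n(rho) * rho(x; x_{1:n}), finite model class."""
    return float(np.dot(weights, probs_x))

def posterior_update(weights, probs_x):
    """w_n(rho, x) = w_n(rho) * rho(x; x_{1:n}) / xi_n(x), as in Equation (3)."""
    return weights * probs_x / np.dot(weights, probs_x)

def information_gain(weights, probs_x):
    """IG_n(x) = KL(w_n(., x) || w_n), in nats."""
    post = posterior_update(weights, probs_x)
    mask = post > 0  # terms with zero posterior weight contribute nothing
    return float(np.sum(post[mask] * np.log(post[mask] / weights[mask])))

# Two candidate models that assign probability 0.9 and 0.1 to the observed state x,
# with a uniform prior over the two.
w = np.array([0.5, 0.5])
p_x = np.array([0.9, 0.1])

xi = mixture_prob(w, p_x)
w_post = posterior_update(w, p_x)
ig = information_gain(w, p_x)
```

Observing `x` shifts the posterior towards the model that predicted it well, and the information gain measures the size of that shift.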

Computing the information gain of a complex density model is often impractical, if not downright intractable. However, a quantity which we call the prediction gain provides us with a good approximation of the information gain. We define the prediction gain of a density model $\rho$ (and in particular, $\xi$) as the difference between the recoding log-probability and log-probability of $x$:

$$\mathrm{PG}_n(x) := \log \rho'_n(x) - \log \rho_n(x).$$

Prediction gain is nonnegative if and only if $\rho$ is learning-positive. It is related to the pseudo-count:

$$\hat{N}_n(x) \approx \left( e^{\mathrm{PG}_n(x)} - 1 \right)^{-1},$$

with equality when $\rho'_n(x) \to 0$. As the following theorem shows, prediction gain allows us to relate pseudo-count and information gain.
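The approximation above can be checked numerically. Dividing the numerator and denominator of Equation (2) by $\rho_n(x)$ shows that the exact pseudo-count equals $(1 - \rho'_n(x))$ times the approximation, which is why the two agree as $\rho'_n(x) \to 0$. The function names and the probabilities below are made up for illustration.

```python
import math

def prediction_gain(rho, rho_prime):
    """PG_n(x) = log rho'_n(x) - log rho_n(x)."""
    return math.log(rho_prime) - math.log(rho)

def pseudo_count(rho, rho_prime):
    """Exact pseudo-count from Equation (2)."""
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def pseudo_count_from_pg(pg):
    """Approximate pseudo-count (e^PG - 1)^(-1)."""
    return 1.0 / math.expm1(pg)

# Small probabilities, as is typical for image density models (values are made up).
rho, rho_prime = 1e-4, 2e-4
exact = pseudo_count(rho, rho_prime)
approx = pseudo_count_from_pg(prediction_gain(rho, rho_prime))
```

For these values the relative gap between `exact` and `approx` is `rho_prime`, that is, two parts in ten thousand.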

###### Theorem 1.

Consider a sequence $x_{1:n} \in \mathcal{X}^n$. Let $\xi$ be a mixture model over a class of learning-positive models $\mathcal{M}$. Let $\hat{N}_n$ be the pseudo-count derived from $\xi$ (Equation 2). For this model,

$$\mathrm{IG}_n(x) \le \mathrm{PG}_n(x) \le \hat{N}_n(x)^{-1} \quad \text{and} \quad \mathrm{PG}_n(x) \le \hat{N}_n(x)^{-1/2}.$$

Theorem 1 suggests that using an exploration bonus proportional to $\hat{N}_n(x)^{-1/2}$, similar to the MBIE-EB bonus, leads to a behaviour at least as exploratory as one derived from an information gain bonus. Since pseudo-counts correspond to empirical counts in the tabular setting, this approach also preserves known theoretical guarantees. In fact, we are confident pseudo-counts may be used to prove similar results in non-tabular settings.
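The two upper bounds on prediction gain in Theorem 1 can be spot-checked on a simple estimator. The sketch below assumes a Laplace (add-one) estimator over a hypothetical four-state alphabet, which is a Dirichlet-multinomial mixture and therefore falls within the theorem's setting. It does not check the information gain inequality.

```python
import math

def laplace(count, n, num_states):
    """Add-one estimator: (N_n(x) + 1) / (n + |X|)."""
    return (count + 1.0) / (n + num_states)

def theorem1_upper_bounds_hold(count, n, num_states):
    """Check PG <= 1 / N_hat and PG <= N_hat^(-1/2) for one state."""
    rho = laplace(count, n, num_states)
    rho_prime = laplace(count + 1, n + 1, num_states)  # after one more occurrence of x
    pg = math.log(rho_prime) - math.log(rho)
    n_hat = rho * (1.0 - rho_prime) / (rho_prime - rho)
    return pg <= 1.0 / n_hat and pg <= 1.0 / math.sqrt(n_hat)

# Sweep over every possible count of x in a sequence of length 100.
all_hold = all(theorem1_upper_bounds_hold(c, 100, 4) for c in range(0, 101))
```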

On the other hand, it may be difficult to provide theoretical guarantees about existing bonus-based intrinsic motivation approaches. Kolter and Ng (2009) showed that no algorithm based on a bonus upper bounded by $\beta N_n(x)^{-1}$ for any $\beta > 0$ can guarantee PAC-MDP optimality. Again considering the tabular setting and combining their result with Theorem 1, we conclude that bonuses proportional to immediate information (or prediction) gain are insufficient for theoretically near-optimal exploration: to paraphrase Kolter and Ng, these methods explore too little in comparison to pseudo-count bonuses. By inspecting (2) we come to a similar negative conclusion for bonuses proportional to the L1 or L2 distance between $\xi'_n$ and $\xi_n$.

Unlike many intrinsic motivation algorithms, pseudo-counts also do not rely on learning a forward (transition and/or reward) model. This point is especially important because a number of powerful density models for images exist (Van den Oord et al., 2016), and because optimality guarantees cannot in general exist for intrinsic motivation algorithms based on forward models.

## 5 Asymptotic Analysis

In this section we analyze the limiting behaviour of the ratio $\hat{N}_n / N_n$. We use this analysis to assert the consistency of pseudo-counts derived from tabular density models, i.e. models which maintain per-state visit counts. In the appendix we use the same result to bound the approximation error of pseudo-counts derived from directed graphical models, of which our CTS model is a special case.

Consider a fixed, infinite sequence $x_1, x_2, \dots$ from $\mathcal{X}$. We define the limit of a sequence of functions $\big(f(x \,;\, x_{1:n}) : n \in \mathbb{N}\big)$ with respect to the length $n$ of the subsequence $x_{1:n}$. We additionally assume that the empirical distribution $\mu_n$ converges pointwise to a distribution $\mu$, and write $\mu'_n(x)$ for the recoding probability of $x$ under $\mu_n$. We begin with two assumptions on our density model.

###### Assumption 1.

The limits

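$$\text{(a)} \;\; r(x) := \lim_{n \to \infty} \frac{\rho_n(x)}{\mu_n(x)} \qquad \text{(b)} \;\; \dot{r}(x) := \lim_{n \to \infty} \frac{\rho'_n(x) - \rho_n(x)}{\mu'_n(x) - \mu_n(x)}$$

exist for all $x$; furthermore, $\dot{r}(x) > 0$.

Assumption (a) states that $\rho$ should eventually assign a probability to $x$ proportional to the limiting empirical distribution $\mu(x)$. In particular there must be a state $x$ for which $r(x) < 1$, unless $\rho_n \to \mu$. Assumption (b), on the other hand, imposes a restriction on the learning rate of $\rho$ relative to $\mu$’s. As both $r(x)$ and $\mu(x)$ exist, Assumption 1 also implies that $\rho_n(x)$ and $\rho'_n(x)$ have a common limit.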

###### Theorem 2.

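Under Assumption 1, the limit of the ratio of pseudo-counts $\hat{N}_n(x)$ to empirical counts $N_n(x)$ exists for all $x$. This limit is

$$\lim_{n \to \infty} \frac{\hat{N}_n(x)}{N_n(x)} = \frac{r(x)}{\dot{r}(x)} \left( \frac{1 - \mu(x)\, r(x)}{1 - \mu(x)} \right).$$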

The model’s relative rate of change, whose convergence to $\dot{r}(x)$ we require, plays an essential role in the ratio of pseudo- to empirical counts. To see this, consider a sequence $(x_n : n \in \mathbb{N})$ generated i.i.d. from a distribution $\mu$ over a finite state space, and a density model defined from a sequence of nonincreasing step-sizes $(\alpha_n : n \in \mathbb{N})$:

$$\rho_n(x) = (1 - \alpha_n)\, \rho_{n-1}(x) + \alpha_n \mathbb{I}\{x_n = x\},$$

with initial condition $\rho_0(x) = |\mathcal{X}|^{-1}$. For $\alpha_n = n^{-1}$, this density model is the empirical distribution. For $\alpha_n = n^{-2/3}$, we may appeal to well-known results from stochastic approximation (e.g. Bertsekas and Tsitsiklis, 1996) and find that almost surely

$$\lim_{n \to \infty} \rho_n(x) = \mu(x) \quad \text{but} \quad \lim_{n \to \infty} \frac{\rho'_n(x) - \rho_n(x)}{\mu'_n(x) - \mu_n(x)} = \infty.$$

Since $\mu'_n(x) - \mu_n(x) = n^{-1}(1 - \mu'_n(x))$, we may think of Assumption 1(b) as also requiring $\rho$ to converge at a rate of $\Theta(1/n)$ for a comparison with the empirical count to be meaningful. Note, however, that a density model that does not satisfy Assumption 1(b) may still yield useful (but incommensurable) pseudo-counts.
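The role of the step-size can be sketched in a few lines. The code below implements the recursive update for an arbitrary step-size schedule and evaluates the relative rate of change from Assumption 1(b). The function names and the short example sequence are assumptions made for illustration, and the slower schedule $n^{-2/3}$ is used only as an example of a step-size that decays more slowly than $1/n$.

```python
def step_size_model(sequence, states, alpha):
    """Run rho_n(x) = (1 - alpha(n)) * rho_{n-1}(x) + alpha(n) * 1{x_n = x}."""
    rho = {s: 1.0 / len(states) for s in states}  # uniform initial condition
    for n, x_n in enumerate(sequence, start=1):
        a = alpha(n)
        rho = {s: (1.0 - a) * p + a * (1.0 if s == x_n else 0.0)
               for s, p in rho.items()}
    return rho

def relative_rate(rho_x, mu_x, n, alpha):
    """(rho'_n(x) - rho_n(x)) / (mu'_n(x) - mu_n(x)) for a single state x."""
    rho_gain = alpha(n + 1) * (1.0 - rho_x)  # recoding applies one more update with x
    mu_gain = (1.0 - mu_x) / (n + 1)         # same update with step-size 1 / (n + 1)
    return rho_gain / mu_gain

# With alpha_n = 1/n the model reproduces the empirical distribution.
rho = step_size_model(["a", "b", "a", "a"], ["a", "b"], lambda n: 1.0 / n)

# With a slower-decaying step-size the relative rate grows like (n + 1)^(1/3).
rate_slow = relative_rate(0.3, 0.3, 999, lambda n: n ** (-2.0 / 3.0))
rate_emp = relative_rate(0.3, 0.3, 999, lambda n: 1.0 / n)
```

Here `rho` matches the empirical frequencies 3/4 and 1/4, `rate_emp` is 1, and `rate_slow` is about 10 after 999 observations, consistent with a ratio that diverges as $n$ grows.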

###### Corollary 1.

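Let $\phi(x) > 0$ with $\sum_{x \in \mathcal{X}} \phi(x) < \infty$ and consider the count-based estimator

$$\rho_n(x) = \frac{N_n(x) + \phi(x)}{n + \sum_{x' \in \mathcal{X}} \phi(x')}.$$

If $\hat{N}_n$ is the pseudo-count corresponding to $\rho_n$ then $\hat{N}_n(x) / N_n(x) \to 1$ for all $x$ with $\mu(x) > 0$.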
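A quick exact computation illustrates the corollary. The sketch below uses rational arithmetic so that no rounding is involved. The function names, the prior mass `phi_x = 1/2`, and the total prior mass of 2 are hypothetical choices.

```python
from fractions import Fraction

def count_based_estimator(count, n, phi_x, phi_total):
    """rho_n(x) = (N_n(x) + phi(x)) / (n + sum of phi over all states)."""
    return Fraction(count + phi_x) / (n + phi_total)

def pseudo_to_empirical_ratio(count, n, phi_x, phi_total):
    """N_hat_n(x) / N_n(x), with N_hat_n(x) computed from Equation (2)."""
    rho = count_based_estimator(count, n, phi_x, phi_total)
    rho_prime = count_based_estimator(count + 1, n + 1, phi_x, phi_total)
    pseudo = rho * (1 - rho_prime) / (rho_prime - rho)
    return pseudo / count

phi_x, phi_total = Fraction(1, 2), Fraction(2)

# A state that makes up a quarter of the sequence, at two sequence lengths.
ratio_short = pseudo_to_empirical_ratio(10, 40, phi_x, phi_total)
ratio_long = pseudo_to_empirical_ratio(1000, 4000, phi_x, phi_total)
```

For this estimator the pseudo-count works out to the empirical count plus `phi_x`, so the ratio exceeds 1 by `phi_x / count` and shrinks towards 1 as the state is visited more often.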

## 6 Empirical Evaluation

In this section we demonstrate the use of pseudo-counts to guide exploration. We return to the Arcade Learning Environment, now using the CTS model to generate an exploration bonus.

### 6.1 Exploration in Hard Atari 2600 Games

From 60 games available through the Arcade Learning Environment we selected five “hard” games, in the sense that an $\epsilon$-greedy policy is inefficient at exploring them. We used a bonus of the form

$$R^+_n(x, a) := \beta \left( \hat{N}_n(x) + 0.01 \right)^{-1/2}, \tag{4}$$

where $\beta$ was selected from a coarse parameter sweep. We also compared our method to the optimistic initialization trick proposed by Machado et al. (2015). We trained our agents’ Q-functions with Double DQN (van Hasselt et al., 2016), with one important modification: we mixed the Double Q-Learning target with the Monte Carlo return. This modification led to improved results both with and without exploration bonuses (details in the appendix).

Figure 2: Average training score with and without exploration bonus or optimistic initialization in 5 Atari 2600 games. Shaded areas denote inter-quartile range, dotted lines show min/max scores.
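The bonus (4) is simple to compute once a pseudo-count is available. In the sketch below the value `beta = 0.05` is a placeholder rather than a reported setting, and adding the bonus directly to the environment reward is an assumption about how the bonus is consumed by the learner.

```python
import math

def exploration_bonus(pseudo_count, beta):
    """Equation (4): beta * (N_hat_n(x) + 0.01)^(-1/2)."""
    return beta / math.sqrt(pseudo_count + 0.01)

def augmented_reward(reward, pseudo_count, beta):
    """Environment reward plus exploration bonus (assumed additive)."""
    return reward + exploration_bonus(pseudo_count, beta)

beta = 0.05  # placeholder value for illustration
bonus_novel = exploration_bonus(0.0, beta)       # never-seen state
bonus_familiar = exploration_bonus(99.99, beta)  # pseudo-count close to 100
```

The constant 0.01 caps the bonus at `beta / 0.1` for states with a pseudo-count of zero, while frequently visited states receive a bonus that decays as the inverse square root of their pseudo-count.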

Figure 2 depicts the result of our experiment, averaged across 5 trials. Although optimistic initialization helps in Freeway, it otherwise yields performance similar to DQN. By contrast, the count-based exploration bonus enables us to make quick progress on a number of games, most dramatically in Montezuma’s Revenge and Venture.

Montezuma’s Revenge is perhaps the hardest Atari 2600 game available through the ALE. The game is infamous for its hostile, unforgiving environment: the agent must navigate a number of different rooms, each filled with traps. Due to its sparse reward function, most published agents achieve an average score close to zero and completely fail to explore most of the 24 rooms that constitute the first level (Figure 3, top). By contrast, within 50 million frames our agent learns a policy which consistently navigates through 15 rooms (Figure 3, bottom). Our agent also achieves a score higher than anything previously reported, with one run consistently achieving 6600 points by 100 million frames (half the training samples used by Mnih et al. (2015)). We believe the success of our method in this game is a strong indicator of the usefulness of pseudo-counts for exploration. (A video of our agent playing is available at https://youtu.be/0yI2wJ6F8r0.)

Figure 3: “Known world” of a DQN agent trained for 50 million frames with (right) and without (left) count-based exploration bonuses, in Montezuma’s Revenge.

### 6.2 Exploration for Actor-Critic Methods

We next used our exploration bonuses in conjunction with the A3C (Asynchronous Advantage Actor-Critic) algorithm of Mnih et al. (2016). One appeal of actor-critic methods is their explicit separation of policy and Q-function parameters, which leads to a richer behaviour space. This very separation, however, often leads to deficient exploration: to produce any sensible results, the A3C policy must be regularized with an entropy cost. We trained A3C on 60 Atari 2600 games, with and without the exploration bonus (4). We refer to our augmented algorithm as A3C+. Full details and additional results may be found in the appendix.

We found that A3C fails to learn in 15 games, in the sense that the agent does not achieve a score 50% better than random. In comparison, there are only 10 games for which A3C+ fails to improve on the random agent; of these, 8 are games where DQN fails in the same sense. We normalized the two algorithms’ scores so that 0 and 1 are respectively the minimum and maximum of the random agent’s and A3C’s end-of-training score on a particular game. Figure 4 depicts the in-training median score for A3C and A3C+, along with 1st and 3rd quartile intervals. Not only does A3C+ achieve slightly superior median performance, but it also significantly outperforms A3C on at least a quarter of the games. This is particularly important given the large proportion of Atari 2600 games for which an $\epsilon$-greedy policy is sufficient for exploration.

Figure 4: Median and interquartile performance across 60 Atari 2600 games for A3C and A3C+.
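The score normalization described above can be written as a small helper. This is a sketch of one reading of the description: the function name is hypothetical, and the handling of the case where both baseline scores coincide is an added assumption.

```python
def normalized_score(score, random_score, a3c_score):
    """Map a raw score so that 0 and 1 are the min and max of the two baselines."""
    low = min(random_score, a3c_score)
    high = max(random_score, a3c_score)
    if high == low:
        raise ValueError("baseline scores coincide; normalization is undefined")
    return (score - low) / (high - low)

# Made-up scores: random agent 0, A3C 100 at the end of training.
on_par = normalized_score(100.0, 0.0, 100.0)
better = normalized_score(150.0, 0.0, 100.0)
```

Under this reading, a normalized score above 1 means the agent exceeds the better of the two baselines on that game.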

## 7 Related Work

Information-theoretic quantities have been repeatedly used to describe intrinsically motivated behaviour. Closely related to prediction gain is Schmidhuber (1991)’s notion of compression progress, which equates novelty with an agent’s improvement in its ability to compress its past. More recently, Lopes et al. (2012) showed the relationship between time-averaged prediction gain and visit counts in a tabular setting; their result is a special case of Theorem 2. Orseau et al. (2013) demonstrated that maximizing the sum of future information gains does lead to optimal behaviour, even though maximizing immediate information gain does not (Section 4). Finally, there may be a connection between sequential normalized maximum likelihood estimators and our pseudo-count derivation (see e.g. Ollivier, 2015).

Intrinsic motivation has also been studied in reinforcement learning proper, in particular in the context of discovering skills (Singh et al., 2004; Barto, 2013). Recently, Stadie et al. (2015) used a squared prediction error bonus for exploring in Atari 2600 games. Closest to our work is Houthooft et al. (2016)’s variational approach to intrinsic motivation, which is equivalent to a second order Taylor approximation to prediction gain. Mohamed and Rezende (2015) also considered a variational approach to the different problem of maximizing an agent’s ability to influence its environment.

Aside from Orseau et al.’s above-cited work, it is only recently that theoretical guarantees for exploration have emerged for non-tabular, stateful settings. We note Pazis and Parr (2016)’s PAC-MDP result for metric spaces and Leike et al. (2016)’s asymptotic analysis of Thompson sampling in general environments.

## 8 Future Directions

The last few years have seen tremendous advances in learning representations for reinforcement learning. Surprisingly, these advances have yet to carry over to the problem of exploration. In this paper, we reconciled counts, the fundamental unit of uncertainty, with prediction-based heuristics and intrinsic motivation. Combining our work with more ideas from deep learning and better density models seems a plausible avenue for quick progress in practical, efficient exploration. We now conclude by outlining a few research directions we believe are promising.

Induced metric. We did not address the question of where the generalization comes from. Clearly, the choice of density model induces a particular metric over the state space. A better understanding of this metric should allow us to tailor the density model to the problem of exploration.

Compatible value function. There may be a mismatch in the learning rates of the density model and the value function: DQN learns much more slowly than our CTS model. As such, it should be beneficial to design value functions compatible with density models (or vice-versa).

The continuous case. Although we focused here on countable state spaces, we can as easily define a pseudo-count in terms of probability density functions. At present it is unclear whether this provides us with the right notion of counts for continuous spaces.

#### Acknowledgments

The authors would like to thank Laurent Orseau, Alex Graves, Joel Veness, Charles Blundell, Shakir Mohamed, Ivo Danihelka, Ian Osband, Matt Hoffman, Greg Wayne, Will Dabney, and Aäron van den Oord for their excellent feedback early and late in the writing, and Pierre-Yves Oudeyer and Yann Ollivier for pointing out additional connections to the literature.

## References

• Barto (2013) Barto, A. G. (2013). Intrinsic motivation and reinforcement learning. In Intrinsically Motivated Learning in Natural and Artificial Systems, pages 17–47. Springer.
• Bellemare et al. (2014) Bellemare, M., Veness, J., and Talvitie, E. (2014). Skip context tree switching. In Proceedings of the 31st International Conference on Machine Learning, pages 1458–1466.
• Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.
• Bellemare et al. (2016) Bellemare, M. G., Ostrovski, G., Guez, A., Thomas, P. S., and Munos, R. (2016). Increasing the action gap: New operators for reinforcement learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
• Bertsekas and Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
• Cover and Thomas (1991) Cover, T. M. and Thomas, J. A. (1991). Elements of information theory. John Wiley & Sons.
• Houthooft et al. (2016) Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). Variational information maximizing exploration.
• Hutter (2013) Hutter, M. (2013). Sparse adaptive dirichlet-multinomial-like processes. In Proceedings of the Conference on Online Learning Theory.
• Kolter and Ng (2009) Kolter, Z. J. and Ng, A. Y. (2009). Near-bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning.
• Leike et al. (2016) Leike, J., Lattimore, T., Orseau, L., and Hutter, M. (2016). Thompson sampling is asymptotically optimal in general environments. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
• Lopes et al. (2012) Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems 25.
• Machado et al. (2015) Machado, M. C., Srinivasan, S., and Bowling, M. (2015). Domain-independent optimistic initialization for reinforcement learning. AAAI Workshop on Learning for General Competency in Video Games.
• Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning.
• Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
• Mohamed and Rezende (2015) Mohamed, S. and Rezende, D. J. (2015). Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 28.
• Ollivier (2015) Ollivier, Y. (2015). Laplace’s rule of succession in information geometry. arXiv preprint arXiv:1503.04304.
• Orseau et al. (2013) Orseau, L., Lattimore, T., and Hutter, M. (2013). Universal knowledge-seeking agents for stochastic environments. In Proceedings of the Conference on Algorithmic Learning Theory.
• Oudeyer et al. (2007) Oudeyer, P., Kaplan, F., and Hafner, V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286.
• Pazis and Parr (2016) Pazis, J. and Parr, R. (2016). Efficient PAC-optimal exploration in concurrent, continuous state MDPs with delayed updates. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
• Schmidhuber (1991) Schmidhuber, J. (1991). A possibility for implementing curiosity and boredom in model-building neural controllers. In From animals to animats: proceedings of the first international conference on simulation of adaptive behavior.
• Schmidhuber (2008) Schmidhuber, J. (2008). Driven by compression progress. In Knowledge-Based Intelligent Information and Engineering Systems. Springer.
• Singh et al. (2004) Singh, S., Barto, A. G., and Chentanez, N. (2004). Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems 16.
• Stadie et al. (2015) Stadie, B. C., Levine, S., and Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814.
• Strehl and Littman (2008) Strehl, A. L. and Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331.
• Van den Oord et al. (2016) Van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. (2016). Pixel recurrent neural networks. In Proceedings of the 33rd International Conference on Machine Learning.
• van Hasselt et al. (2016) van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence.
• Veness et al. (2015) Veness, J., Bellemare, M. G., Hutter, M., Chua, A., and Desjardins, G. (2015). Compress and control. In Proceedings of the 29th AAAI Conference on Artificial Intelligence.
• Wainwright and Jordan (2008) Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.

## Appendix A The Connection to Intrinsic Motivation

The following provides an identity connecting information gain and prediction gain.

###### Lemma 1.

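Consider a mixture model $\xi$ over $\mathcal{M}$ with prediction gain $\mathrm{PG}_n$ and information gain $\mathrm{IG}_n$, a fixed $x \in \mathcal{X}$, and let $w'_n := w_n(\cdot, x)$ be the posterior of $\xi$ over $\mathcal{M}$ after observing $x$. Let $w''_n := w'_n(\cdot, x)$ be the same posterior after observing $x$ a second time, and let $\mathrm{PG}^{\rho}_n(x)$ denote the prediction gain of $\rho \in \mathcal{M}$. Then, with the expectation taken over $\rho \sim w'_n$,

$$\mathrm{PG}_n(x) = \mathrm{KL}(w'_n \,\|\, w_n) + \mathrm{KL}(w'_n \,\|\, w''_n) + \mathbb{E}_{w'_n}\big[\mathrm{PG}^{\rho}_n(x)\big] = \mathrm{IG}_n(x) + \mathrm{KL}(w'_n \,\|\, w''_n) + \mathbb{E}_{w'_n}\big[\mathrm{PG}^{\rho}_n(x)\big].$$

In particular, if $\mathcal{M}$ is a class of non-adaptive models in the sense that $\rho_n(x) = \rho(x)$ for all $x_{1:n}$, then

$$\mathrm{PG}_n(x) = \mathrm{KL}(w'_n \,\|\, w_n) + \mathrm{KL}(w'_n \,\|\, w''_n) = \mathrm{IG}_n(x) + \mathrm{KL}(w'_n \,\|\, w''_n).$$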

A model which is non-adaptive is also learning-positive in the sense of Definition 1. Many common mixture models, for example Dirichlet-multinomial estimators, are mixtures over non-adaptive models.
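The non-adaptive case of Lemma 1 can be checked numerically. The sketch below assumes a small mixture of fixed Bernoulli models; the parameter values in `thetas`, the prior `w`, and the helper names are illustrative choices, not part of the paper. It compares the prediction gain with the sum of the two Kullback-Leibler terms.

```python
import math

def posterior(weights, likelihoods):
    """Bayes update w(rho, x) = w(rho) * rho(x) / xi(x); also returns xi(x)."""
    xi = sum(w * p for w, p in zip(weights, likelihoods))
    return [w * p / xi for w, p in zip(weights, likelihoods)], xi

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical model class: three fixed (non-adaptive) Bernoulli models.
thetas = [0.1, 0.5, 0.9]
w = [0.2, 0.5, 0.3]          # current posterior w_n
x = 1                        # the observation being evaluated
lik = [t if x == 1 else 1.0 - t for t in thetas]

w1, xi = posterior(w, lik)         # w'_n  and xi_n(x)
w2, xi_prime = posterior(w1, lik)  # w''_n and xi'_n(x)

pg = math.log(xi_prime) - math.log(xi)  # prediction gain
ig = kl(w1, w)                          # information gain KL(w'_n || w_n)
residual = kl(w1, w2)                   # KL(w'_n || w''_n)
```

Under these assumptions `pg` and `ig + residual` agree up to floating-point error, and `ig` does not exceed `pg`.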

###### Proof.

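We rewrite the posterior update rule (3) to show that for any $\rho \in \mathcal{M}$ and any $x \in \mathcal{X}$,

$$\xi_n(x) = \frac{\rho_n(x)\, w_n(\rho)}{w_n(\rho, x)}.$$

Write $\mathbb{E}_{w'_n} := \mathbb{E}_{\rho \sim w'_n}$. Since $\xi'_n(x) / \xi_n(x)$ does not depend on $\rho$, we may take its expectation under $w'_n$ without changing its value. Now

$$\begin{aligned}
\mathrm{PG}_n(x) &= \log \frac{\xi'_n(x)}{\xi_n(x)} = \mathbb{E}_{w'_n}\left[ \log \frac{\xi'_n(x)}{\xi_n(x)} \right] \\
&= \mathbb{E}_{w'_n}\left[ \log \frac{w'_n(\rho)}{w_n(\rho)} \cdot \frac{w'_n(\rho)}{w''_n(\rho)} \cdot \frac{\rho'_n(x)}{\rho_n(x)} \right] \\
&= \mathrm{KL}(w'_n \,\|\, w_n) + \mathrm{KL}(w'_n \,\|\, w''_n) + \mathbb{E}_{w'_n}\big[\mathrm{PG}^{\rho}_n(x)\big] \\
&= \mathrm{IG}_n(x) + \mathrm{KL}(w'_n \,\|\, w''_n) + \mathbb{E}_{w'_n}\big[\mathrm{PG}^{\rho}_n(x)\big].
\end{aligned}$$

The second statement follows immediately. ∎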

###### Lemma 2.

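The functions $f(x) := e^x - 1 - x$ and $g(x) := e^x - 1 - x^2$ are nonnegative on $x \in [0, \infty)$.

###### Proof.

The statement regarding $f(x)$ follows directly from the Taylor expansion for $e^x$. Now, the first derivative of $g(x)$ is $e^x - 2x$. It is clearly positive for $x \ge 1$. For $x \in [0, 1]$,

$$e^x - 2x = \sum_{i=0}^{\infty} \frac{x^i}{i!} - 2x \ge 1 - x \ge 0.$$

Since $g(0) = 0$, the second result follows. ∎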
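As a quick sanity check of Lemma 2, the sketch below evaluates the two functions on a grid. It assumes the forms $e^x - 1 - x$ and $e^x - 1 - x^2$ implied by the proof and by their use in Theorem 1; the grid range and spacing are arbitrary choices.

```python
import math

def f(x):
    # e^x - 1 - x, using expm1 for accuracy near zero
    return math.expm1(x) - x

def g(x):
    # e^x - 1 - x^2
    return math.expm1(x) - x * x

# Grid over [0, 20] with spacing 0.001 (an arbitrary choice).
grid = [i / 1000.0 for i in range(20001)]
min_f = min(f(x) for x in grid)
min_g = min(g(x) for x in grid)
```

Both minima are attained at $x = 0$, where the functions vanish.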

###### Proof (Theorem 1).

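The inequality $\mathrm{IG}_n(x) \le \mathrm{PG}_n(x)$ follows directly from Lemma 1, the nonnegativity of the Kullback-Leibler divergence, and the fact that all models in $\mathcal{M}$ are learning-positive. For the inequality $\mathrm{PG}_n(x) \le \hat{N}_n(x)^{-1}$, we write

$$\begin{aligned}
\hat{N}_n(x)^{-1} &= (1 - \xi'_n(x))^{-1}\, \frac{\xi'_n(x) - \xi_n(x)}{\xi_n(x)} \\
&= (1 - \xi'_n(x))^{-1} \left( \frac{\xi'_n(x)}{\xi_n(x)} - 1 \right) \\
&\overset{(a)}{=} (1 - \xi'_n(x))^{-1} \left( e^{\mathrm{PG}_n(x)} - 1 \right) \\
&\overset{(b)}{\ge} e^{\mathrm{PG}_n(x)} - 1 \\
&\overset{(c)}{\ge} \mathrm{PG}_n(x),
\end{aligned}$$

where (a) follows by definition of prediction gain, (b) from $\xi'_n(x) \in [0, 1)$, and (c) from Lemma 2. Using the second part of Lemma 2 in (c) yields the inequality $\mathrm{PG}_n(x) \le \hat{N}_n(x)^{-1/2}$. ∎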
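The chain of inequalities can be illustrated on a toy example. The sketch below assumes a mixture over three fixed Bernoulli models (a hypothetical choice) and a short, arbitrary observation sequence. At each step it computes the information gain, the prediction gain, and the pseudo-count $\hat{N}_n(x) = \xi_n(x)(1 - \xi'_n(x)) / (\xi'_n(x) - \xi_n(x))$, then records whether the bounds hold.

```python
import math

THETAS = [0.1, 0.5, 0.9]  # hypothetical non-adaptive Bernoulli models

def likelihoods(x):
    return [t if x == 1 else 1.0 - t for t in THETAS]

def update(w, x):
    """Return the posterior after observing x, and the mixture probability xi(x)."""
    lik = likelihoods(x)
    xi = sum(wi * li for wi, li in zip(w, lik))
    return [wi * li / xi for wi, li in zip(w, lik)], xi

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gains_and_pseudo_count(w, x):
    w1, xi = update(w, x)
    _, xi_prime = update(w1, x)
    ig = kl(w1, w)
    pg = math.log(xi_prime / xi)
    n_hat = xi * (1.0 - xi_prime) / (xi_prime - xi)
    return ig, pg, n_hat

tol = 1e-12
w = [1.0 / 3.0] * 3
checks = []
for observed in [1, 1, 0, 1, 0, 0, 1, 1]:  # arbitrary sequence
    for x in (0, 1):
        ig, pg, n_hat = gains_and_pseudo_count(w, x)
        checks.append(
            ig <= pg + tol
            and pg <= 1.0 / n_hat + tol
            and pg <= n_hat ** -0.5 + tol
        )
    w, _ = update(w, observed)
```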

## Appendix B Asymptotic Analysis

We begin with a simple lemma which will prove useful throughout.

###### Lemma 3.

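The rate of change of the empirical distribution, $\mu'_n(x) - \mu_n(x)$, is such that

$$n\big(\mu'_n(x) - \mu_n(x)\big) = 1 - \mu'_n(x).$$

###### Proof.

We expand the definition of $\mu_n$ and $\mu'_n$:

$$\begin{aligned}
n\big(\mu'_n(x) - \mu_n(x)\big) &= n\left[ \frac{N_n(x) + 1}{n + 1} - \frac{N_n(x)}{n} \right] \\
&= \left[ \frac{n}{n + 1}\big(N_n(x) + 1\big) - N_n(x) \right] \\
&= \left[ 1 - \frac{N_n(x) + 1}{n + 1} \right] \\
&= 1 - \mu'_n(x). \qquad ∎
\end{aligned}$$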
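The identity of Lemma 3 is purely algebraic, so it can be verified exactly with rational arithmetic. The following sketch (the function name `lemma3_gap` is illustrative) evaluates the difference between the two sides for all counts up to a small horizon.

```python
from fractions import Fraction

def lemma3_gap(count, n):
    """Return n * (mu'_n(x) - mu_n(x)) - (1 - mu'_n(x)) for N_n(x) = count."""
    mu = Fraction(count, n)
    mu_prime = Fraction(count + 1, n + 1)
    return n * (mu_prime - mu) - (1 - mu_prime)

# Every admissible count 0 <= N_n(x) <= n for n = 1, ..., 29.
gaps = [lemma3_gap(c, n) for n in range(1, 30) for c in range(n + 1)]
```

Every entry of `gaps` is exactly zero.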

Using this lemma, we derive an asymptotic relationship between $\hat{N}_n$ and $N_n$.

###### Proof (Theorem 2).

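We expand the definition of $\hat{N}_n(x)$ and $N_n(x)$:

$$\begin{aligned}
\frac{\hat{N}_n(x)}{N_n(x)} &= \frac{\rho_n(x)\big(1 - \rho'_n(x)\big)}{N_n(x)\big(\rho'_n(x) - \rho_n(x)\big)} \\
&= \frac{\rho_n(x)\big(1 - \rho'_n(x)\big)}{n \mu_n(x)\big(\rho'_n(x) - \rho_n(x)\big)} \\
&= \frac{\rho_n(x)\big(\mu'_n(x) - \mu_n(x)\big)}{\mu_n(x)\big(\rho'_n(x) - \rho_n(x)\big)} \cdot \frac{1 - \rho'_n(x)}{n\big(\mu'_n(x) - \mu_n(x)\big)} \\
&= \frac{\rho_n(x)}{\mu_n(x)} \cdot \frac{\mu'_n(x) - \mu_n(x)}{\rho'_n(x) - \rho_n(x)} \cdot \frac{1 - \rho'_n(x)}{1 - \mu'_n(x)},
\end{aligned}$$

with the last line following from Lemma 3. Under Assumption 1, all terms of the right-hand side converge as $n \to \infty$. Taking the limit on both sides,

$$\lim_{n \to \infty} \frac{\hat{N}_n(x)}{N_n(x)} \overset{(a)}{=} \frac{r(x)}{\dot{r}(x)} \lim_{n \to \infty} \frac{1 - \rho'_n(x)}{1 - \mu'_n(x)} \overset{(b)}{=} \frac{r(x)}{\dot{r}(x)} \cdot \frac{1 - \mu(x) r(x)}{1 - \mu(x)},$$

where (a) is justified by the existence of the relevant limits and $\dot{r}(x) > 0$, and (b) follows from writing $\rho'_n(x)$ as $\mu_n(x)\, \rho'_n(x) / \mu_n(x)$, where all limits involved exist. ∎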
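To make the limit concrete, consider a hypothetical example that is not taken from the paper: states are pairs of bits, the data consist of equal numbers of $(0,0)$ and $(1,1)$, and the density model is the product of the two empirical marginals. For $x = (0,0)$ one gets $\mu(x) = 1/2$, $r(x) = 1/2$ and $\dot{r}(x) = 1$, so the formula above predicts a limiting ratio of $3/4$. The sketch computes the exact ratio for finite $n$.

```python
from fractions import Fraction

def ratio(n):
    """Pseudo-count over true count for x = (0, 0), with n even and N_n(x) = n / 2."""
    half = n // 2
    marginal = Fraction(half, n)                # empirical marginal P(bit = 0)
    marginal_prime = Fraction(half + 1, n + 1)  # after observing x once more
    rho = marginal ** 2                         # product of two identical marginals
    rho_prime = marginal_prime ** 2
    pseudo_count = rho * (1 - rho_prime) / (rho_prime - rho)
    return pseudo_count / half

# Limit predicted by the formula, under the assumed values of r, r-dot and mu.
r, r_dot, mu = Fraction(1, 2), Fraction(1), Fraction(1, 2)
predicted = (r / r_dot) * (1 - mu * r) / (1 - mu)
```

In this example the pseudo-count undercounts by a quarter in the limit, because the product-of-marginals model cannot represent the correlation between the two bits.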

### B.1 Directed Graphical Models

We say that $\mathcal{X}$ is a factored state space if it is the Cartesian product of $k$ subspaces, i.e. $\mathcal{X} := \mathcal{X}_1 \times \dots \times \mathcal{X}_k$. This factored structure allows us to construct approximate density models over $\mathcal{X}$, for example by modelling the joint density as a product of marginals. We write the $i$-th factor of a state $x \in \mathcal{X}$ as $x^i$, and write the sequence of the $i$-th factor across $x_{1:n}$ as $x^i_{1:n}$.

We will show that directed graphical models (Wainwright and Jordan, 2008) satisfy Assumption 1. A directed graphical model describes a probability distribution over a factored state space. To the $i$-th factor $x^i$ is associated a parent set $\pi(i) \subseteq \{1, \dots, i-1\}$. Let $x^{\pi(i)}$ denote the value of the factors in the parent set. The $i$-th factor model is $\rho^i_n(x^i; x^{\pi(i)}) := \rho^i(x^i; x_{1:n}, x^{\pi(i)})$, with the understanding that $\rho^i$ is allowed to make a different prediction for each value of $x^{\pi(i)}$. The state $x$ is assigned the joint probability

$$\rho_{\mathrm{GM}}(x; x_{1:n}) := \prod_{i=1}^{k} \rho^i_n(x^i; x^{\pi(i)}).$$

Common choices for $\rho^i_n$ include the conditional empirical distribution and the Dirichlet estimator.
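The construction above can be sketched in a few lines. The class below is a minimal illustration, not the CTS model used in the experiments: the class name, the additive-smoothing factor estimator with parameter `alpha`, and the zero-based factor indices are all assumptions made for the example. It multiplies per-factor conditional estimates and derives the pseudo-count from the probability before and after a hypothetical extra observation of $x$.

```python
from collections import Counter

class FactoredDensityModel:
    """Product of per-factor conditional estimators (hypothetical sketch)."""

    def __init__(self, parents, alphabet_sizes, alpha=0.5):
        self.parents = parents        # parents[i]: tuple of indices j < i
        self.sizes = alphabet_sizes   # number of values factor i can take
        self.alpha = alpha            # additive smoothing constant (assumed)
        self.joint = [Counter() for _ in parents]    # counts of (context, value)
        self.context = [Counter() for _ in parents]  # counts of context alone

    def _context(self, x, i):
        return tuple(x[j] for j in self.parents[i])

    def prob(self, x, bump=0):
        """rho(x); with bump=1, the probability after observing x once more."""
        p = 1.0
        for i in range(len(self.parents)):
            ctx = self._context(x, i)
            num = self.joint[i][(ctx, x[i])] + bump + self.alpha
            den = self.context[i][ctx] + bump + self.alpha * self.sizes[i]
            p *= num / den
        return p

    def update(self, x):
        for i in range(len(self.parents)):
            ctx = self._context(x, i)
            self.joint[i][(ctx, x[i])] += 1
            self.context[i][ctx] += 1

    def pseudo_count(self, x):
        rho, rho_prime = self.prob(x), self.prob(x, bump=1)
        return rho * (1.0 - rho_prime) / (rho_prime - rho)

# Two binary factors; the second is conditioned on the first.
model = FactoredDensityModel(parents=[(), (0,)], alphabet_sizes=[2, 2])
for state in [(0, 0), (0, 1), (1, 1), (0, 0)]:
    model.update(state)
total = sum(model.prob((a, b)) for a in (0, 1) for b in (0, 1))
n_hat = model.pseudo_count((0, 0))
```

Because each factor estimator is normalized over its own alphabet, the joint probabilities sum to one, and `prob(x, bump=1)` matches the probability obtained by actually calling `update(x)` first.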

###### Proposition 1.

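Suppose that each factor model $\rho^i_n$ converges to the conditional probability distribution $\mu(x^i \mid x^{\pi(i)})$ and that for each $x^i$ with $\mu(x^i \mid x^{\pi(i)}) > 0$,

$$\lim_{n \to \infty} \frac{\rho^i(x^i; x_{1:n}x, x^{\pi(i)}) - \rho^i(x^i; x_{1:n}, x^{\pi(i)})}{\mu(x^i; x_{1:n}x, x^{\pi(i)}) - \mu(x^i; x_{1:n}, x^{\pi(i)})} = 1.$$

Then for all $x$ with $\mu(x) > 0$, the density model $\rho_{\mathrm{GM}}$ satisfies Assumption 1 with

$$r(x) = \frac{\prod_{i=1}^{k} \mu(x^i \mid x^{\pi(i)})}{\mu(x)} \quad \text{and} \quad \dot{r}(x) = \frac{\sum_{i=1}^{k} \big(1 - \mu(x^i \mid x^{\pi(i)})\big) \prod_{j \ne i} \mu(x^j \mid x^{\pi(j)})}{1 - \mu(x)}.$$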

The CTS density model used in our experiments is in fact a particular kind of induced graphical model. The result above thus describes how the pseudo-counts computed in Section 3.2 are asymptotically related to the empirical counts.

###### Proof.

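By hypothesis, $\rho^i_n(x^i; x^{\pi(i)}) \to \mu(x^i \mid x^{\pi(i)})$. Combining this with $\mu_n(x) \to \mu(x) > 0$,

$$\begin{aligned}
r(x) &= \lim_{n \to \infty} \frac{\rho_{\mathrm{GM}}(x; x_{1:n})}{\mu_n(x)} \\
&= \lim_{n \to \infty} \frac{\prod_{i=1}^{k} \rho^i_n(x^i; x^{\pi(i)})}{\mu_n(x)} \\
&= \frac{\prod_{i=1}^{k} \mu(x^i \mid x^{\pi(i)})}{\mu(x)}.
\end{aligned}$$

Similarly,

$$\begin{aligned}
\dot{r}(x) &= \lim_{n \to \infty} \frac{\rho'_{\mathrm{GM}}(x; x_{1:n}) - \rho_{\mathrm{GM}}(x; x_{1:n})}{\mu'_n(x) - \mu_n(x)} \\
&\overset{(a)}{=} \lim_{n \to \infty} \frac{\big(\rho'_{\mathrm{GM}}(x; x_{1:n}) - \rho_{\mathrm{GM}}(x; x_{1:n})\big)\, n}{1 - \mu'_n(x)} \\
&= \lim_{n \to \infty} \frac{\big(\rho'_{\mathrm{GM}}(x; x_{1:n}) - \rho_{\mathrm{GM}}(x; x_{1:n})\big)\, n}{1 - \mu(x)},
\end{aligned}$$

where in (a) we used the identity $n(\mu'_n(x) - \mu_n(x)) = 1 - \mu'_n(x)$ of Lemma 3, which was also used in the proof of Theorem 2. Now

$$\begin{aligned}
\dot{r}(x) &= (1 - \mu(x))^{-1} \lim_{n \to \infty} \big(\rho'_{\mathrm{GM}}(x; x_{1:n}) - \rho_{\mathrm{GM}}(x; x_{1:n})\big)\, n \\
&= (1 - \mu(x))^{-1} \lim_{n \to \infty} \left( \prod_{i=1}^{k} \rho^i(x^i; x_{1:n}x, x^{\pi(i)}) - \prod_{i=1}^{k} \rho^i(x^i; x_{1:n}, x^{\pi(i)}) \right) n.
\end{aligned}$$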

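Let $c_i := \rho^i(x^i; x_{1:n}, x^{\pi(i)})$ and $c'_i := \rho^i(x^i; x_{1:n}x, x^{\pi(i)})$. The difference of products above is

$$\begin{aligned}
\prod_{i=1}^{k} \rho^i(x^i; x_{1:n}x, x^{\pi(i)}) - \prod_{i=1}^{k} \rho^i(x^i; x_{1:n}, x^{\pi(i)}) &= c'_1 c'_2 \cdots c'_k - c_1 c_2 \cdots c_k \\
&= (c'_1 - c_1)(c'_2 \cdots c'_k) + c_1 (c'_2 \cdots c'_k - c_2 \cdots c_k),
\end{aligned}$$

and, applying the same expansion recursively to the remaining difference,

$$\dot{r}(x) = (1 - \mu(x))^{-1} \lim_{n \to \infty} \sum_{i=1}^{k} n (c'_i - c_i) \Big( \prod_{j < i} c_j \Big) \Big( \prod_{j > i} c'_j \Big).$$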

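By the hypothesis on the rate of change of $\rho^i$ and the identity of Lemma 3, we have

$$\lim_{n \to \infty} n (c'_i - c_i) = 1 - \mu(x^i \mid x^{\pi(i)}).$$

Since the limits of $c'_i$ and $c_i$ are both $\mu(x^i \mid x^{\pi(i)})$, we deduce that

$$\dot{r}(x) = \frac{\sum_{i=1}^{k} \big(1 - \mu(x^i \mid x^{\pi(i)})\big) \prod_{j \ne i} \mu(x^j \mid x^{\pi(j)})}{1 - \mu(x)}.$$

Now, if $\mu(x) > 0$ then also $\mu(x^i \mid x^{\pi(i)}) > 0$ for each factor $x^i$. Hence $\dot{r}(x) > 0$. ∎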
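The proof hinges on expanding a difference of products into a telescoping sum, $\prod_i c'_i - \prod_i c_i = \sum_i (c'_i - c_i) \prod_{j<i} c_j \prod_{j>i} c'_j$. The sketch below checks this identity exactly for one arbitrary choice of values; the numbers and helper names are illustrative only.

```python
from fractions import Fraction
from functools import reduce
from operator import mul

def product(values):
    """Product of a sequence of Fractions (1 for the empty sequence)."""
    return reduce(mul, values, Fraction(1))

def telescoped_sum(c, c_prime):
    """Sum over i of (c'_i - c_i) * prod_{j<i} c_j * prod_{j>i} c'_j."""
    return sum(
        (c_prime[i] - c[i]) * product(c[:i]) * product(c_prime[i + 1:])
        for i in range(len(c))
    )

# Arbitrary factor probabilities before (c) and after (c_prime) an update.
c = [Fraction(1, 2), Fraction(1, 3), Fraction(3, 4)]
c_prime = [Fraction(2, 3), Fraction(2, 5), Fraction(4, 5)]

lhs = product(c_prime) - product(c)
rhs = telescoped_sum(c, c_prime)
```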

### B.2 Tabular Density Models (Corollary 1)

We shall prove the following, which includes Corollary 1 as a special case.

###### Lemma 4.

Consider . Suppose that for all and every

1. , and

2. .

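Let $\rho_n(x)$ be the count-based estimator

$$\rho_n(x) = \frac{N_n(x) + \phi(x, x_{1:n})}{n + \sum_{x' \in \mathcal{X}} \phi(x', x_{1:n})}.$$

If $\hat{N}_n$ is the pseudo-count corresponding to $\rho_n$ then $\hat{N}_n(x) / N_n(x) \to 1$ for all $x$ with $\mu(x) > 0$.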

Condition 2 is satisfied if with monotonically increasing in (but not too quickly!) and converging to some distribution for all sequences . This is the case for most tabular density models.
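For intuition, consider the simplest instance of such an estimator, with a constant $\phi$ that depends on neither the state nor the history (an assumption made for this example). A short calculation then gives $\hat{N}_n(x) = N_n(x) + \phi$ exactly, so the ratio to the true count tends to one as the count grows. The sketch below confirms this with exact arithmetic; the function name and parameter values are illustrative.

```python
from fractions import Fraction

def pseudo_count(count, n, phi, num_states):
    """Pseudo-count of rho_n(x) = (N_n(x) + phi) / (n + |X| * phi), constant phi."""
    rho = (count + phi) / (n + num_states * phi)
    rho_prime = (count + 1 + phi) / (n + 1 + num_states * phi)
    return rho * (1 - rho_prime) / (rho_prime - rho)

phi = Fraction(1, 2)   # smoothing constant (assumed value)
num_states = 4         # size of the state space (assumed value)

# State visited on a fixed fraction (30%) of the first n steps.
ratios = [
    pseudo_count(3 * n // 10, n, phi, num_states) / (3 * n // 10)
    for n in (10, 100, 1000, 10000)
]
```

The entries of `ratios` are $1 + \phi / N_n(x)$ and decrease towards one.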

###### Proof.

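We will show that the condition on the rate of change required by Proposition 1 is satisfied under the stated conditions. Let $\phi_n(x) := \phi(x, x_{1:n})$, $\phi'_n(x) := \phi(x, x_{1:n}x)$, and $\phi_n := \sum_{x \in \mathcal{X}} \phi_n(x)$, with $\phi'_n$ defined analogously from $\phi'_n(x)$. By hypothesis,

$$\rho_n(x) = \frac{N_n(x) + \phi_n(x)}{n + \phi_n}, \qquad \rho'_n(x) = \frac{N_n(x) + \phi'_n(x) + 1}{n + \phi'_n + 1}.$$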

Note that we do not require . Now

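$$\begin{aligned}
\rho'_n(x) - \rho_n(x) &= \frac{n + \phi_n}{n + \phi_n} \big[ \rho'_n(x) - \rho_n(x) \big] \\
&= \frac{n + 1 + \phi'_n}{n + \phi_n} \rho'_n(x) - \rho_n(x) - \frac{\big(1 + (\phi'_n - \phi_n)\big) \rho'_n(x)}{n + \phi_n} \\
&= \frac{1}{n + \phi_n} \Big[ \big(N_n(x) + 1 + \phi'_n(x)\big) - \big(N_n(x) + \phi_n(x)\big) - \big(1 + (\phi'_n - \phi_n)\big) \rho'_n(x) \Big] \\
&= \frac{1}{n + \phi_n} \Big[ 1 - \rho'_n(x) + \big(\phi'_n(x) - \phi_n(x)\big) - \rho'_n(x) (\phi'_n - \phi_n) \Big].
\end{aligned}$$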
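Since the expansion of $\rho'_n(x) - \rho_n(x)$ is a purely algebraic identity, it can be checked exactly for arbitrary inputs. In the sketch below the argument names mirror $N_n(x)$, $\phi_n(x)$, $\phi'_n(x)$, $\phi_n$ and $\phi'_n$, and the numerical values are arbitrary.

```python
from fractions import Fraction

def both_sides(count, n, phi_x, phi_x_prime, phi_total, phi_total_prime):
    """Return (rho'_n(x) - rho_n(x), closed-form expression from the expansion)."""
    rho = (count + phi_x) / (n + phi_total)
    rho_prime = (count + phi_x_prime + 1) / (n + phi_total_prime + 1)
    lhs = rho_prime - rho
    rhs = (
        1
        - rho_prime
        + (phi_x_prime - phi_x)
        - rho_prime * (phi_total_prime - phi_total)
    ) / (n + phi_total)
    return lhs, rhs

# Arbitrary values; the phi terms need not be equal before and after the update.
lhs, rhs = both_sides(
    count=3,
    n=10,
    phi_x=Fraction(1, 2),
    phi_x_prime=Fraction(2, 3),
    phi_total=Fraction(2),
    phi_total_prime=Fraction(9, 4),
)
```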

Using Lemma 3 we deduce that

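$$\frac{\rho'_n(x) - \rho_n(x)}{\mu'_n(x) - \mu_n(x)} = \frac{n}{n + \phi_n} \cdot \frac{1 - \rho'_n(x) + \big(\phi'_n(x) - \phi_n(x)\big) - \rho'_n(x) (\phi'_n - \phi_n)}{1 - \mu'_n(x)}.$$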

Since and similarly for , then pointwise implies that also. For any ,

 0≤limn→∞ϕn(