1 Introduction
In reinforcement learning (RL), agents acting in unknown environments face the exploration versus exploitation tradeoff. Without adequate exploration, the agent might fail to discover effective control strategies, particularly in complex domains. Both PAC-MDP algorithms, such as MBIE-EB [1], and Bayesian algorithms, such as Bayesian Exploration Bonuses (BEB) [2], have managed this tradeoff by assigning exploration bonuses to novel states. In these methods, the novelty of a state-action pair is derived from the number of times an agent has visited that pair. While these approaches offer strong formal guarantees, their requirement of an enumerable representation of the agent’s environment renders them impractical for large-scale tasks. As such, exploration in large RL tasks is still most often performed using simple heuristics, such as the epsilon-greedy strategy [3], which can be inadequate in more complex settings.
In this paper, we evaluate several exploration strategies that can be scaled up to complex tasks with high-dimensional inputs. Our results show that Boltzmann exploration and Thompson sampling significantly improve on the naïve epsilon-greedy strategy. However, we show that the biggest and most consistent improvement can be achieved by assigning exploration bonuses based on a learned model of the system dynamics with learned representations. To that end, we describe a method that learns a state representation from observations, trains a dynamics model using this representation concurrently with the policy, and uses the misprediction error in this model to assess the novelty of each state. Novel states are expected to disagree more strongly with the model than those states that have been visited frequently in the past, and assigning exploration bonuses based on this disagreement can produce rapid and effective exploration.
Using learned model dynamics to assess a state’s novelty presents several challenges. Capturing an adequate representation of the agent’s environment for use in dynamics predictions can be accomplished by training a model to predict the next state from the previous ground-truth state-action pair. However, one would not expect pixel intensity values to adequately capture the salient features of a given state space. To provide a more suitable representation of the system’s state space, we propose a method for encoding the state space into lower-dimensional domains. To achieve sufficient generality and scalability, we model the system’s dynamics with a deep neural network. This allows for on-the-fly learning of a model representation that can easily be trained in parallel with learning a policy.
Our main contribution is a scalable and efficient method for assigning exploration bonuses in large RL problems with complex observations, as well as an extensive empirical evaluation of this approach and other simple alternative strategies, such as Boltzmann exploration and Thompson sampling. Our approach assigns model-based exploration bonuses from learned representations and dynamics, using only the observations and actions. It can scale to large problems where Bayesian approaches to exploration become impractical, and we show that it achieves significant improvement in learning speed on the task of learning to play Atari games from raw images [24]. Our approach achieves state-of-the-art results on a number of games, and achieves particularly large improvements for games on which human players strongly outperform prior methods. Aside from achieving a high final score, our method also achieves substantially faster learning. To evaluate the speed of the learning process, we propose the AUC-100 benchmark, which measures learning progress on the Atari domain.
2 Preliminaries
We consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(S, A, P, R, \rho_0, \gamma)$, where $S$ is a finite set of states, $A$ a finite set of actions, $P$ the transition probability distribution, $R$ the reward function, $\rho_0$ an initial state distribution, and $\gamma$ the discount factor. We are interested in finding a policy $\pi$ that maximizes the expected reward over all time. This maximization can be accomplished using a variety of reinforcement learning algorithms.
In this work, we are concerned with online reinforcement learning wherein the algorithm receives a tuple $(s_t, a_t, s_{t+1}, r_t)$ at each step. Here, $s_t$ is the previous state, $a_t$ is the previous action, $s_{t+1}$ is the new state, and $r_t$ is the reward collected as a result of this transition. The reinforcement learning algorithm must use this tuple to update its policy and maximize long-term reward and then choose the new action $a_{t+1}$. It is often insufficient to simply choose the best action based on previous experience, since this strategy can quickly fall into a local optimum. Instead, the learning algorithm must perform exploration. Prior work has suggested methods that address the exploration problem by acting with “optimism under uncertainty.” If one assumes that the reinforcement learning algorithm will tend to choose the best action, it can be encouraged to visit state-action pairs that it has not frequently seen by augmenting the reward function to deliver a bonus for visiting novel states. This is accomplished with the augmented reward function
$$R_{\mathrm{Bonus}}(s, a) = R(s, a) + \beta\, N(s, a) \qquad (1)$$
where $N(s, a)$ is a novelty function designed to capture the novelty of a given state-action pair and $\beta$ weights the bonus. Prior work has suggested a variety of different novelty functions, e.g., [1, 2], based on state visitation frequency.
While such methods offer a number of appealing guarantees, such as near-Bayesian exploration in polynomial time [2], they require a concise, often discrete representation of the agent’s state-action space to measure state visitation frequencies. In our approach, we will employ function approximation and representation learning to devise an alternative to these requirements.
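For reference, a visitation-count bonus of the kind used by prior work can be sketched in a few lines. The sketch below assumes a small, hashable state-action space and a BEB-style bonus of the form $\beta / (1 + n(s, a))$ in the spirit of [2]; the class name `CountBonus` and the value of `beta` are illustrative assumptions, not part of any published implementation.

```python
from collections import defaultdict


class CountBonus:
    """Tabular visit-count exploration bonus (BEB-style, illustrative only)."""

    def __init__(self, beta=1.0):
        self.beta = beta                  # bonus scale (assumed value)
        self.counts = defaultdict(int)    # n(s, a), zero for unseen pairs

    def augmented_reward(self, state, action, reward):
        """Return R(s, a) + beta / (1 + n(s, a)), then record the visit."""
        n = self.counts[(state, action)]
        self.counts[(state, action)] += 1
        return reward + self.beta / (1.0 + n)


bonus = CountBonus(beta=1.0)
first = bonus.augmented_reward("s0", "left", reward=0.0)   # unseen pair
second = bonus.augmented_reward("s0", "left", reward=0.0)  # seen once before
```

The table of counts is exactly what becomes impractical when states are raw images, since nearly every frame is a new key.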
3 Model Learning For Exploration Bonuses
We would like to encourage agent exploration by giving the agent exploration bonuses for visiting novel states. Identifying states as novel requires that we supply some representation of the agent’s state space, as well as a mechanism to use this representation to assess novelty. Unsupervised learning methods offer one promising avenue for acquiring a concise representation of the state with a good similarity metric. This can be accomplished using dimensionality reduction, clustering, or graph-based techniques [4, 5]. In our work, we draw on recent developments in representation learning with neural networks, as discussed in the following section. However, even with a good learned state representation, maintaining a table of visitation frequencies becomes impractical for complex tasks. Instead, we learn a model of the task dynamics that can be used to assess the novelty of a new state.
Formally, let $\sigma(s)$ denote the encoding of the state $s$, and let $M_\phi$ be a dynamics predictor parameterized by $\phi$. $M_\phi$ takes an encoded version of a state $s$ at time $t$ and the agent’s action at time $t$ and attempts to predict an encoded version of the agent’s state at time $t + 1$. The parameterization of $M_\phi$ is discussed further in the next section.
For each state transition $(s_t, a_t, s_{t+1})$, we can attempt to predict $\sigma(s_{t+1})$ from $(\sigma(s_t), a_t)$ using our predictive model $M_\phi$. This prediction will have some error
$$e(s_t, a_t) = \| \sigma(s_{t+1}) - M_\phi(\sigma(s_t), a_t) \|_2^2 \qquad (2)$$
Let $\bar{e}_T$, the normalized prediction error at time $T$, be given by $\bar{e}_T = e_T / \max_{t \le T} e_t$. We can assign a novelty function to $(s_t, a_t)$ via
$$N(s_t, a_t) = \frac{\bar{e}_t(s_t, a_t)}{t \cdot C} \qquad (3)$$
where $C > 0$ is a decay constant. We can now realize our augmented reward function as
$$R_{\mathrm{Bonus}}(s, a) = R(s, a) + \beta \left( \frac{\bar{e}_t(s_t, a_t)}{t \cdot C} \right) \qquad (4)$$
This approach is motivated by the idea that, as our ability to model the dynamics of a particular state-action pair improves, we have come to understand the state better and hence its novelty is lower. When we don’t understand the state-action pair well enough to make accurate predictions, we assume that more knowledge about that particular area of the model dynamics is needed and hence a higher novelty measure is assigned.
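The bonus computation in Equations (2) to (4) can be sketched as follows. This is a minimal illustration that assumes the novelty is the prediction error, normalized by the largest error seen so far and divided by the time step times a decay constant. The class name `NoveltyBonus` and the default values of `beta` and `decay_c` are hypothetical, and encodings are plain Python lists.

```python
class NoveltyBonus:
    """Prediction-error exploration bonus with a running maximum (sketch)."""

    def __init__(self, beta=0.01, decay_c=1.0):
        self.beta = beta          # bonus weight (assumed value)
        self.decay_c = decay_c    # decay constant C (assumed value)
        self.max_error = 0.0      # largest prediction error seen so far
        self.t = 0                # number of transitions processed

    @staticmethod
    def prediction_error(encoded_next, predicted_next):
        """Squared Euclidean distance between the encoding and its prediction."""
        return sum((x - y) ** 2 for x, y in zip(encoded_next, predicted_next))

    def novelty(self, encoded_next, predicted_next):
        """Normalized error divided by (t * C); advances the time step."""
        self.t += 1
        error = self.prediction_error(encoded_next, predicted_next)
        self.max_error = max(self.max_error, error)
        normalized = error / self.max_error if self.max_error > 0 else 0.0
        return normalized / (self.t * self.decay_c)

    def augmented_reward(self, reward, encoded_next, predicted_next):
        """Environment reward plus the weighted novelty bonus."""
        return reward + self.beta * self.novelty(encoded_next, predicted_next)


nb = NoveltyBonus(beta=1.0, decay_c=1.0)
r1 = nb.augmented_reward(0.0, [1.0, 0.0], [0.0, 0.0])  # poorly predicted
r2 = nb.augmented_reward(0.0, [0.5, 0.0], [0.0, 0.0])  # better predicted, later
```

In this toy run the second transition receives a smaller bonus both because its error is lower relative to the running maximum and because the divisor grows with the time step.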
Using learned model dynamics to assign novelty functions allows us to address the exploration versus exploitation problem in a non-greedy way. With an appropriate representation $\sigma$, even when we encounter a new state-action pair $(s_t, a_t)$, we expect the prediction $M_\phi(\sigma(s_t), a_t)$ to be accurate so long as enough similar state-action pairs have been encountered.
Our model-based exploration bonuses can be incorporated into any online reinforcement learning algorithm that updates the policy based on state, action, reward tuples of the form $(s_t, a_t, s_{t+1}, r_t)$, such as Q-learning or actor-critic algorithms. Our method is summarized in Algorithm 1. At each step, we receive a tuple $(s_t, a_t, s_{t+1}, r_t)$ and compute the Euclidean distance between the encoded state $\sigma(s_{t+1})$ and the prediction made by our model $M_\phi(\sigma(s_t), a_t)$. This is used to compute the exploration-augmented reward $R_{\mathrm{Bonus}}$ using Equation (4). The tuples $(s_t, a_t, s_{t+1}, R_{\mathrm{Bonus}})$ are stored in a memory bank at the end of every step. Every step, the policy is updated. (In our implementation, the memory bank is used to retrain the RL algorithm via experience replay once per epoch, that is, every 50000 steps. Hence, 49999 of these policy updates will simply do nothing.) Once per epoch, corresponding to 50000 observations in our implementation, the dynamics model $M_\phi$ is updated to improve its accuracy. If desired, the representation encoder $\sigma$ can also be updated at this time. We found retraining $\sigma$ once every 5 epochs to be sufficient.
This approach is modular and compatible with any representation of $\sigma$ and $M$, as well as any reinforcement learning method that updates its policy based on a continuous stream of observation, action, reward tuples. Incorporating exploration bonuses does make the reinforcement learning task nonstationary, though we did not find this to be a major issue in practice, as shown in our experimental evaluation. In the following section, we discuss the particular choice for $\sigma$ and $M$ that we use for learning policies for playing Atari games from raw images.
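The loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the function and argument names are hypothetical, the bonus is taken to be the normalized prediction error divided by the time step times a decay constant, and the environment, encoder and model in the usage lines are toy stand-ins.

```python
import random


def run_exploration_loop(env_step, encode, predict, update_policy, update_model,
                         choose_action, initial_state, num_steps,
                         epoch_length=50000, beta=0.01, decay_c=1.0):
    """Online RL loop with a prediction-error exploration bonus (sketch)."""
    memory = []          # memory bank of (s, a, s', augmented reward)
    max_error = 0.0      # running maximum used to normalize the error
    state = initial_state
    for t in range(1, num_steps + 1):
        action = choose_action(state)
        next_state, reward = env_step(state, action)
        # Prediction error measured in the encoded space.
        target = encode(next_state)
        guess = predict(encode(state), action)
        error = sum((x - y) ** 2 for x, y in zip(target, guess))
        max_error = max(max_error, error)
        normalized = error / max_error if max_error > 0 else 0.0
        bonus_reward = reward + beta * normalized / (t * decay_c)
        memory.append((state, action, next_state, bonus_reward))
        update_policy(memory)          # called every step
        if t % epoch_length == 0:
            update_model(memory)       # dynamics model retrained once per epoch
        state = next_state
    return memory


# Toy usage with stand-in components (all hypothetical).
rng = random.Random(0)
model_updates = []
memory = run_exploration_loop(
    env_step=lambda s, a: ((s + a) % 5, 1.0 if (s + a) % 5 == 0 else 0.0),
    encode=lambda s: [float(s)],
    predict=lambda z, a: z,            # model that always predicts "no change"
    update_policy=lambda mem: None,
    update_model=lambda mem: model_updates.append(len(mem)),
    choose_action=lambda s: rng.choice([0, 1]),
    initial_state=0,
    num_steps=20,
    epoch_length=10,
)
```

Passing the components in as callables mirrors the modularity of the approach: any encoder, dynamics model or policy learner with these interfaces can be substituted.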
4 Deep Learning Architectures
Though the dynamics model $M_\phi$ and the encoder $\sigma$ from the previous section can be parametrized by any appropriate method, we found that using deep neural networks for both achieved good empirical results on the Atari games benchmark. In this section, we discuss the particular networks used in our implementation.
4.1 Autoencoders
The most direct way of learning a dynamics model is to predict the state at the next time step, which in the Atari games benchmark corresponds to the next frame’s pixel intensity values. However, directly predicting these pixel intensity values is unsatisfactory, since we do not expect pixel intensity to capture the salient features of the environment in a robust way. In our experiments, a dynamics model trained to predict raw frames exhibited extremely poor behavior, assigning nearly equal exploration bonuses at most time steps, as discussed in our experimental results section.
To overcome these difficulties, we seek a function $\sigma$ which encodes a lower-dimensional representation of the state $s$. For the task of representing Atari frames, we found that an autoencoder could be used to successfully obtain an encoding function $\sigma$ and achieve dimensionality reduction and feature extraction [6].
Our autoencoder has 8 hidden layers, followed by a Euclidean loss layer, which computes the distance between the output features and the original input image. The hidden layers are reduced in dimension until maximal compression occurs with 128 units. After this, the activations are decoded by passing through hidden layers of increasingly large size. We train the network on a set of 250,000 images and test on a further set of 25,000 images. We compared two separate methodologies for capturing these images.

Static AE: A random agent plays for enough time to collect the required images. The autoencoder is trained offline before the policy learning algorithm begins.

Dynamic AE: Initialize with an epsilon-greedy strategy and collect images and actions while the agent acts under the policy learning algorithm. After 5 epochs, train the autoencoder from this data. Continue to collect data and periodically retrain the autoencoder in parallel with the policy training algorithm.
We found that the reconstructed input achieves a small but nontrivial residual on the test set regardless of which autoencoder training technique is used, suggesting that in both cases it learns underlying features of the state space while avoiding overfitting.
To obtain a lower-dimensional representation of the agent’s state space, a snapshot of the network’s first six layers is saved. The sixth layer’s output (circled in Figure 1) is then used as an encoding for the original state space. That is, we construct an encoding $\sigma(s)$ by running $s$ through the first six hidden layers of our autoencoder and then taking the sixth layer’s output to be $\sigma(s)$. In practice, we found that using the sixth layer’s output (rather than the bottleneck at the fifth layer) obtained the best model learning results. See the appendix for further discussion of this result.
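Taking an intermediate layer as the encoding can be illustrated with a toy fully connected network. The sketch below is an assumption-laden stand-in: the layer widths, the sigmoid activation and the random (untrained) weights are all hypothetical, the final reconstruction layer is omitted, and only the idea of truncating the forward pass after the sixth hidden layer is taken from the text.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random dense layer (weights, biases); stands in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layers, x, upto=None):
    """Apply the first `upto` layers (all layers if None) with sigmoid units."""
    for weights, biases in layers[:upto]:
        x = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
             for row, b in zip(weights, biases)]
    return x


rng = random.Random(0)
input_dim = 32                               # toy stand-in for a flattened frame
hidden = [24, 16, 12, 10, 8, 10, 12, 16]     # 8 hidden layers, narrowest at layer 5
sizes = [input_dim] + hidden
autoencoder = [make_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def encode(state):
    """Encoding of a state: the output of the sixth hidden layer."""
    return forward(autoencoder, state, upto=6)


frame = [rng.random() for _ in range(input_dim)]
code = encode(frame)
```

Here `code` has the width of the sixth hidden layer, one step past the narrowest layer in this toy configuration.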
4.2 Model Learning Architecture
Equipped with an encoding $\sigma$, we can now consider the task of predicting model dynamics. For this task, a much simpler two-layer neural network $M_\phi$ suffices. $M_\phi$ takes as input the encoded version of a state $s_t$ at time $t$ along with the agent’s action $a_t$ and seeks to predict the encoded next frame $\sigma(s_{t+1})$. Loss is computed via a Euclidean loss layer regressing on the ground truth $\sigma(s_{t+1})$. We find that this model initially learns a representation close to the identity function and consequently the loss residual is similar for most state-action pairs. However, after approximately 5 epochs, it begins to learn more complex dynamics and consequently better identifies novel states. We evaluate the quality of the learned model in the appendix.
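A two-layer predictor of this kind can be sketched in plain Python. The details below are assumptions made for illustration: the action is fed in as a one-hot vector concatenated with the encoding, the hidden layer uses ReLU units, the output layer is linear, and the weights are random rather than trained. The class name `DynamicsModel` is hypothetical.

```python
import random


def one_hot(action, num_actions):
    """One-hot encoding of a discrete action index."""
    return [1.0 if i == action else 0.0 for i in range(num_actions)]


def dense(weights, biases, x):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


class DynamicsModel:
    """Two-layer network mapping (encoding, action) to a predicted next encoding."""

    def __init__(self, code_dim, num_actions, hidden_dim, seed=0):
        rng = random.Random(seed)
        in_dim = code_dim + num_actions
        self.num_actions = num_actions
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
                   for _ in range(code_dim)]
        self.b2 = [0.0] * code_dim

    def predict(self, code, action):
        x = list(code) + one_hot(action, self.num_actions)
        hidden = [max(0.0, v) for v in dense(self.w1, self.b1, x)]  # ReLU layer
        return dense(self.w2, self.b2, hidden)                      # linear output

    def loss(self, code, action, next_code):
        """Squared Euclidean loss against the ground-truth next encoding."""
        pred = self.predict(code, action)
        return sum((p - t) ** 2 for p, t in zip(pred, next_code))


model = DynamicsModel(code_dim=4, num_actions=3, hidden_dim=8)
code = [0.1, 0.2, 0.3, 0.4]
pred = model.predict(code, action=2)
perfect = model.loss(code, 2, pred)   # loss against the model's own prediction
```

The same `loss` value doubles as the raw prediction error from which the exploration bonus is derived.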
5 Related Work
Exploration is an intensely studied area of reinforcement learning. Many of the pioneering algorithms in this area, such as [7] and [8], achieve efficient exploration that scales polynomially with the number of parameters in the agent’s state space (see also [9, 10]). However, as the size of state spaces increases, these methods quickly become intractable. A number of prior methods also examine various techniques for using models and prediction to incentivize exploration [11, 12, 13, 14]. However, such methods typically operate directly on the transition matrix of a discrete MDP, and do not provide for a straightforward extension to very large or continuous spaces, where function approximation is required. A number of prior methods have also been proposed to incorporate domain-specific factors to improve exploration. Doshi-Velez et al. [15] proposed incorporating priors into policy optimization, while Lang et al. [16] developed a method specific to relational domains. Finally, Schmidhuber et al. have developed a curiosity-driven approach to exploration which uses model predictors to aid in control [17].
Several exploration techniques have been proposed that can extend more readily to large state spaces. Among these, methods such as C-PACE [18] and metric-$E^3$ [19] require a good metric on the state space that satisfies the assumptions of the algorithm. The corresponding representation learning issue has some parallels to the representation problem that we address by using an autoencoder, but it is unclear how the appropriate metric for the prior methods can be acquired automatically on tasks with raw sensory input, such as the Atari games in our experimental evaluation. Methods based on Monte Carlo tree search can also scale gracefully to complex domains [20], and indeed previous work has applied such techniques to the task of playing Atari games from screen images [21]. However, this approach is computationally very intensive, and requires access to a generative model of the system in order to perform the tree search, which is not always available in online reinforcement learning. On the other hand, our method readily integrates into any online reinforcement learning algorithm.
Finally, several recent papers have focused on driving the Q value higher. In [22], the authors use network dropout to perform Thompson sampling. In Boltzmann exploration, a positive probability is assigned to any possible action according to its expected utility and according to a temperature parameter [23]. Both of these methods focus on controlling Q values rather than model-based exploration. A comparison to both is provided in the next section.
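Boltzmann exploration as described here amounts to sampling from a softmax over the Q-values. The sketch below assumes a list of Q-values for the current state and a fixed temperature; the function names are illustrative, and the temperature schedule used in the experiments is not specified in the text.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """Softmax over Q-values; a higher temperature gives a flatter distribution."""
    top = max(q_values)                      # subtract the max for stability
    exps = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_action(q_values, temperature, rng=random):
    """Sample an action index with probability given by the softmax."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


probs = boltzmann_probabilities([1.0, 2.0, 3.0], temperature=1.0)
action = boltzmann_action([1.0, 2.0, 3.0], temperature=1.0,
                          rng=random.Random(0))
```

Unlike epsilon-greedy, every action keeps a nonzero probability that depends on how its estimated value compares with the alternatives.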
6 Experimental Results
We evaluate our approach on 14 games from the Arcade Learning Environment [24]. The task consists of choosing actions in an Atari emulator based on raw images of the screen. Previous work has tackled this task using Q-learning with epsilon-greedy exploration [3], as well as Monte Carlo tree search [21] and policy gradient methods [25]. We use Deep Q Networks (DQN) [3] as the reinforcement learning algorithm within our method, and compare its performance to the same DQN method using only epsilon-greedy exploration, Boltzmann exploration, and a Thompson sampling approach.
The results for 14 games in the Arcade Learning Environment are presented in Table 1. We chose those games that were particularly challenging for prior methods and ones where human experts outperform prior learning methods. We evaluated two versions of our approach: using either an autoencoder trained in advance by running epsilon-greedy Q-learning to collect data (denoted as “Static AE”), or using an autoencoder trained concurrently with the model and policy on the same image data (denoted as “Dynamic AE”). Table 1 also shows results from the DQN implementation reported in previous work, along with human expert performance on each game [3]. Note that our DQN implementation did not attain the same score on all of the games as prior work due to a shorter running time. Since we are primarily concerned with the rate of learning and not the final results, we do not consider this a deficiency. To directly evaluate the benefit of including exploration bonuses, we compare the performance of our approach primarily to our own DQN implementation, with the prior scores provided for reference.
In addition to raw game scores and learning curves, we also analyze our results on a new benchmark we have named Area Under Curve 100 (AUC-100). For each game, this benchmark computes the area under the game-score learning curve (using the trapezoid rule to approximate the integral). This area is then normalized by 100 times the maximum game score achieved in [3], which represents 100 epochs of play at the best-known levels. This metric more effectively captures improvements in the rate of learning and does not require running the games for 1000 epochs as in [3]. For this reason, we suggest it as an alternative metric to raw game score.
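The metric can be sketched directly from this description. The function below assumes the learning curve is given as one score per evaluation point with unit (one-epoch) spacing, and that `reference_score` is the maximum score reported in [3]; the function name and the exact sampling convention are assumptions.

```python
def auc_100(scores, reference_score, num_epochs=100):
    """Trapezoid-rule area under the curve over (num_epochs * reference_score)."""
    area = sum((a + b) / 2.0 for a, b in zip(scores[:-1], scores[1:]))
    return area / (num_epochs * reference_score)


# A curve that rises linearly from 0 to the reference score over two epochs.
half = auc_100([0.0, 50.0, 100.0], reference_score=100.0, num_epochs=2)

# A curve that sits at the reference score for all 100 epochs.
full = auc_100([200.0] * 101, reference_score=200.0)
```

A value of 1 corresponds to playing at the reference level for the entire run, and values above 1 are possible when the agent exceeds the reference score.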
Bowling
The policy without exploration tended to fixate on a set pattern of knocking down six pins per frame. When bonuses were added, the dynamics learner quickly became adept at predicting this outcome and was thus encouraged to explore other release points.
Frostbite
This game’s dynamics changed substantially via the addition of extra platforms as the player progressed. As the dynamics of these more complex systems were not well understood, the system was encouraged to visit them often (which required making further progress in the game).
Seaquest
A submarine must surface for air between bouts of fighting sharks. However, if the player resurfaces too soon they will suffer a penalty with effects on the game’s dynamics. Since these effects are poorly understood by the model learning algorithm, resurfacing receives a high exploration bonus and hence the agent eventually learns to successfully resurface at the correct time.
Q*bert
Exploration bonuses resulted in a lower score. In Q*bert, the background changes color after level one. The dynamics predictor is unable to quickly adapt to such a dramatic change in the environment and consequently, nearly equal exploration bonuses are assigned to almost every state that is visited. This negatively impacts the final policy.
Learning curves for each of the games are shown in Figure 3. Note that both of the exploration bonus algorithms learn significantly faster than epsilon-greedy Q-learning, and often continue learning even after the epsilon-greedy strategy converges. All games had the inputs normalized according to [3] and were run for 100 epochs (where one epoch is 50,000 time steps). Between each epoch, the policy was updated and then the new policy underwent 10,000 time steps of testing. The results represent the average testing score across three trials after 100 epochs each.
Game  DQN (100 epochs)  Exploration Static AE (100 epochs)  Exploration Dynamic AE (100 epochs)  Boltzmann Exploration (100 epochs)  Thompson Sampling (100 epochs)  DQN [3] (1000 epochs)  Human Expert [3] 
Alien  1018  1436  1190  1301  1322  3069  6875 
Asteroids  1043  1486  939  1287  812  1629  13157 
Bank Heist  102  131  95  101  129  429.7  734.4 
Beam Rider  1604  1520  1640  1228  1361  6846  5775 
Bowling  68.1  130  133  113  85.2  42.4  154.8 
Breakout  146  162  178  219  222  401.2  31.8 
Enduro  281  264  277  284  236  301.8  309.6 
Freeway  10.5  10.5  12.5  13.9  12.0  30.3  29.6 
Frostbite  369  649  380  605  494  328.3  4335 
Montezuma  0.0  0.0  0.0  0  0  0.0  4367 
Pong  17.6  18.5  18.2  18.2  18.2  18.9  9.3 
Q*bert  4649  3291  3263  4014  3251  10596  13455 
Seaquest  2106  2636  4472  3808  1337  5286  20182 
Space Invaders  634  649  716  697  459  1976  1652 
The results show that more nuanced exploration strategies generally improve on the naive epsilon-greedy approach, with the Boltzmann and Thompson sampling methods achieving the best results on three of the games. However, exploration bonuses achieve the fastest learning and the best results most consistently, outperforming the other three methods on 7 of the 14 games in terms of AUC-100.
7 Conclusion
In this paper, we evaluated several scalable and efficient exploration algorithms for reinforcement learning in tasks with complex, high-dimensional observations. Our results show that a new method based on assigning exploration bonuses most consistently achieves the largest improvement on a range of challenging Atari games, particularly those on which human players outperform prior learning methods. Our exploration method learns a model of the dynamics concurrently with the policy. This model predicts a learned representation of the state, and a function of this prediction error is added to the reward as an exploration bonus to encourage the policy to visit states with high novelty.
One of the limitations of our approach is that the misprediction error metric assumes that any misprediction in the state is caused by inaccuracies in the model. While this is true in deterministic environments, stochastic dynamics violate this assumption. An extension of our approach to stochastic systems requires a more nuanced treatment of the distinction between stochastic dynamics and uncertain dynamics, which we hope to explore in future work. Another intriguing direction for future work is to examine how the learned dynamics model can be incorporated into the policy learning process, beyond just providing exploration bonuses. This could in principle enable substantially faster learning than purely model-free approaches.
References

[1] A. L. Strehl and M. L. Littman, An Analysis of Model-Based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 74, 1209–1331.
[2] J. Z. Kolter and A. Y. Ng, Near-Bayesian Exploration in Polynomial Time. Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1–8, 2009.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, et al. Human-level Control Through Deep Reinforcement Learning. Nature, 518(7540):529–533, 2015.
[4] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer series in statistics. Springer, New York, 2001.
[5] M. Belkin, P. Niyogi and V. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. JMLR, vol. 7, pp. 2399–2434, Nov. 2006.
[6] G. E. Hinton and R. Salakhutdinov, Reducing the dimensionality of data with neural networks. Science, 313, 504–507, 2006.
[7] R. I. Brafman and M. Tennenholtz, R-max, a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2002.
[8] M. Kearns and D. Koller, Efficient reinforcement learning in factored MDPs. Proc. IJCAI, 1999.
[9] M. Kearns and S. Singh, Near-optimal reinforcement learning in polynomial time. Machine Learning Journal, 2002.
[10] W. D. Smart and L. P. Kaelbling, Practical reinforcement learning in continuous spaces. Proc. ICML, 2000.
[11] J. Sorg, S. Singh, R. L. Lewis, Variance-Based Rewards for Approximate Bayesian Reinforcement Learning. Proc. UAI, 2010.
[12] M. Lopes, T. Lang, M. Toussaint and P.-Y. Oudeyer, Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress. NIPS, 2012.
[13] M. Geist and O. Pietquin, Managing Uncertainty within Value Function Approximation in Reinforcement Learning. Workshop on Active Learning and Experimental Design, 2010.
[14] M. Araya, O. Buffet, and V. Thomas, Near-optimal BRL Using Optimistic Local Transitions. (ICML-12), ser. ICML ’12, J. Langford and J. Pineau, Eds. New York, NY, USA: Omnipress, Jul. 2012, pp. 97–104.
[15] F. Doshi-Velez, D. Wingate, N. Roy, and J. Tenenbaum, Nonparametric Bayesian Policy Priors for Reinforcement Learning. NIPS, 2014.
[16] T. Lang, M. Toussaint, K. Kersting, Exploration in relational domains for model-based reinforcement learning. Proc. AAMAS, 2014.
[17] J. Schmidhuber, Developmental Robotics, Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts. Connection Science, vol. 18 (2), pp. 173–187, 2006.
[18] J. Pazis and R. Parr, PAC Optimal Exploration in Continuous Space Markov Decision Processes. Proc. AAAI, 2013.
[19] S. Kakade, M. Kearns, and J. Langford, Exploration in metric state spaces. Proc. ICML, 2003.
[20] A. Guez, D. Silver, P. Dayan, Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search. NIPS, 2014.
[21] X. Guo, S. Singh, H. Lee, R. Lewis, X. Wang, Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. NIPS, 2014.
[22] Y. Gal and Z. Ghahramani, Dropout as a Bayesian approximation: Insights and applications. Deep Learning Workshop, ICML.
[23] D. Carmel and S. Markovitch, Exploration Strategies for Model-based Learning in Multi-agent Systems. Autonomous Agents and Multi-Agent Systems, Volume 2, Issue 2, pp. 141–172.
[24] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, The Arcade Learning Environment: An Evaluation Platform for General Agents. JAIR, Volume 47, pp. 235–279, June 2013.
[25] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, P. Abbeel, Trust Region Policy Optimization. arXiv preprint 1502.05477.
8 Appendix
8.1 On autoencoder layer selection
Recall that we trained an autoencoder to encode the game’s state space. We then trained a predictive model on the next autoencoded frame rather than directly training on the pixel intensity values of the next frame. To obtain the encoded space, we ran each state through an eight-layer autoencoder for training and then used the autoencoder’s sixth layer as an encoded state space. We chose to use the sixth layer rather than the bottleneck fourth layer because we found that, over 20 iterations of Seaquest at 100 epochs per iteration, using this layer for encoding delivered measurably better performance than using the bottleneck layer. The results of that experiment are presented below.
8.2 On the quality of the learned model dynamics
Evaluating the quality of the learned dynamics model is somewhat difficult because the system is rewarded for achieving higher error rates. A dynamics model that converges quickly is not useful for exploration bonuses. Nevertheless, when we plot the mean of the normalized residuals across all games and all trials used in our experiments, we see that the errors of the learned dynamics models continually decrease over time. The mean normalized residual after 100 epochs is approximately half of the maximal mean achieved. This suggests that each dynamics model was able to correctly learn properties of the underlying dynamics for its given game.
8.3 Raw AUC-100 scores
Game  DQN  Exploration Static AE  Exploration Dynamic AE  Boltzmann Exploration  Thompson Sampling 
Alien  0.153  0.198  0.171  0.187  0.204 
Asteroids  0.259  0.415  0.254  0.456  0.223 
Bank Heist  0.0715  0.1459  0.089  0.089  0.1303 
Beam Rider  0.1122  0.0919  0.1112  0.0817  0.0897 
Bowling  0.964  1.493  1.836  1.338  1.122 
Breakout  0.191  0.202  0.192  0.294  0.254 
Enduro  0.518  0.495  0.589  0.538  0.466 
Freeway  0.206  0.213  0.295  0.313  0.228 
Frostbite  0.573  0.971  0.622  0.928  0.746 
Montezuma  0.0  0.0  0.0  0  0 
Pong  0.52  0.56  0.424  0.612  0.612 
Q*bert  0.155  0.104  0.121  0.13  0.127 
Seaquest  0.16  0.172  0.265  0.194  0.174 
Space Invaders  0.205  0.183  0.219  0.183  0.146 