Despite recent entreaties for better practices to yield reproducible research (henderson2018deep; machado2018revisiting), the literature on exploration in reinforcement learning still lacks a systematic comparison between existing methods. In the context of the Arcade Learning Environment (ALE; bellemare2013arcade), we observe comparisons of agents trained under different regimes: with or without reset, using varying numbers of training frames, with or without sticky actions (machado2018revisiting), and evaluating only on a small subset of the available games. This makes it nearly impossible to assess the field’s progress towards efficient exploration.
Our goal here is to revisit some of the recent bonus based exploration methods using a common evaluation regime. We do so by
Comparing all methods on the same set of Atari 2600 games;
Applying these bonuses on the same value-based agent architecture, Rainbow (hessel2018rainbow);
Fixing the number of samples each algorithm uses during training to 200 million game frames.
As an additional point of comparison, we also evaluate in the same setting NoisyNets (fortunato18noisy), part of the original Rainbow algorithm, and $\epsilon$-greedy exploration.
We study three questions relevant to exploration in the ALE:
How well do different methods perform on Montezuma’s Revenge?
Do these methods generalize to bellemare2016unifying’s set of “hard exploration games”, when their hyperparameters are tuned only on Montezuma’s Revenge?
Do they generalize to other Atari 2600 games?
We find that, despite frequent claims of state-of-the-art results in Montezuma’s Revenge, when the learning algorithm and sample complexity are kept fixed across the different methods, little to no performance gain can be observed over older methods. Furthermore, our results suggest that performance on Montezuma’s Revenge is not indicative of performance on other hard exploration games. In fact, on 5 out of 6 hard exploration games, the performance of the considered bonus-based methods is on par with an $\epsilon$-greedy algorithm, and significantly lower than human-level performance. Finally, we find that, while exploration bonuses improve performance on hard exploration games, they typically hurt performance on the easier Atari 2600 games. Taken together, our results suggest that more research is needed to make bonus-based exploration robust and reliable, and serve as a reminder of the pitfalls of developing and evaluating methods primarily on a single domain.
2 Related Work
Exploration methods may encourage agents toward unexplored parts of the state space in different ways. Count-based methods generalize previous work that was limited to tabular settings (strehl2008analysis) and estimate counts in high-dimensional state spaces (bellemare2016unifying; count-based; tang2017exploration; choshen2018dora; machado18_SR). Prediction error has also been used as a novelty signal to compute an exploration bonus (stadie2015incentivizing; pathak17curiositydriven; burda2018exploration). Other methods rely on uncertainty estimates of the value function (osband2016deep; o2017uncertainty; touati2018randomized).
burda2018large benchmark various exploration methods based on prediction error within a set of simulated environments, including some Atari 2600 games. However, their study differs from ours as their setting ignores the environment reward and instead learns exclusively from the intrinsic reward signal.
3 Exploration methods
We focus on bonus-based methods that encourage exploration through a reward signal. At each time step the agent is trained with the reward $r_t = r^e_t + \beta \cdot r^i_t$, where $r^e_t$ is the extrinsic reward provided by the environment, $r^i_t$ the intrinsic reward computed by the agent, and $\beta$ a scaling parameter. We now summarize different ways to compute the intrinsic reward $r^i_t$.
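As a concrete (if trivial) illustration, the reward actually fed to the learner can be sketched as follows; the default value of beta here is an arbitrary placeholder, not a tuned value from any of the methods below:

```python
def mixed_reward(extrinsic, intrinsic, beta=0.01):
    """Reward optimized by a bonus-based agent: r = r^e + beta * r^i.
    beta's default is an arbitrary placeholder, not a tuned value."""
    return extrinsic + beta * intrinsic
```

With `beta = 0` this reduces to the usual extrinsic-only objective, which is why bonus-based methods can be compared against plain $\epsilon$-greedy agents in the same framework.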
3.1 Pseudo-counts

Pseudo-counts (bellemare2016unifying; count-based) were proposed as a way to estimate counts in high-dimensional state spaces using a density model. The agent is then encouraged to visit states with a low visit count. Let $\rho$ be a density model over the state space and $\rho_t(s)$ the density assigned to $s$ after being trained on a sequence of states $s_1, \dots, s_t$. We will write $\rho'_t(s)$ the density assigned to $s$ if $\rho$ were to be updated with $s$. We require $\rho$ to be learning-positive (i.e. $\rho'_t(s) \geq \rho_t(s)$) and define the prediction gain as $PG_t(s) = \log \rho'_t(s) - \log \rho_t(s)$. The pseudo-count $\hat{N}_t(s) \approx \big(e^{PG_t(s)} - 1\big)^{-1}$ can then be used to compute the intrinsic reward $r^i(s_t) = \big(\hat{N}_t(s_t)\big)^{-1/2}$.
CTS (bellemare2014skip) and PixelCNN (oord2016pixel) have both been used as density models. We will disambiguate these agents by the name of their density model.
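The pseudo-count computation can be illustrated with a toy density model. The sketch below uses exact empirical counts over discrete states purely for exposition; the agents discussed here use CTS or PixelCNN density models over screen images instead:

```python
import math
from collections import Counter

class PseudoCountBonus:
    """Toy pseudo-count bonus over discrete states, assuming an exact
    empirical density model. Illustrative only; real agents fit a CTS
    or PixelCNN density model to frames."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def bonus(self, state):
        if self.total == 0 or self.counts[state] == 0:
            n_hat = 0.0  # unseen state: pseudo-count of zero, maximal bonus
        else:
            rho = self.counts[state] / self.total                    # density before update
            rho_next = (self.counts[state] + 1) / (self.total + 1)   # density after update
            pg = math.log(rho_next) - math.log(rho)                  # prediction gain
            n_hat = 1.0 / (math.exp(pg) - 1.0) if pg > 0 else float("inf")
        self.counts[state] += 1
        self.total += 1
        return (n_hat + 0.01) ** -0.5  # bonus shrinks as the pseudo-count grows
```

For this exact empirical model the pseudo-count closely tracks the true visit count; the appeal of the construction is that it remains well-defined when $\rho$ is a learned model over images, where exact counts are unavailable.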
3.2 Intrinsic Curiosity Module
Intrinsic Curiosity Module (ICM, pathak17curiositydriven) promotes exploration via curiosity. pathak17curiositydriven formulate curiosity as the agent’s ability to predict the consequences of its own actions in a learned feature space. ICM includes a learned embedding $\phi$, a forward model and an inverse model. The embedding is trained through the inverse model, which, in turn, has to predict the agent’s action between two states $s_t$ and $s_{t+1}$ using their embeddings $\phi(s_t)$ and $\phi(s_{t+1})$. Given a transition $(s_t, a_t, s_{t+1})$, the intrinsic reward is then given by the error of the forward model in the embedding space between $\phi(s_{t+1})$ and the predicted estimate $\hat{\phi}(s_{t+1})$: $r^i(s_t, a_t) = \frac{1}{2}\big\|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\big\|_2^2$.
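A minimal sketch of the forward-model bonus follows. For brevity the embedding is a fixed random projection and the forward model a linear map trained by SGD, whereas the real module learns the embedding jointly through the inverse model; all sizes, names and learning rates are hypothetical:

```python
import numpy as np

class ICMSketch:
    """Illustrative ICM-style bonus: intrinsic reward = forward-model error
    in a feature space. The embedding here is a fixed random projection;
    real ICM learns it jointly via an inverse model."""

    def __init__(self, obs_dim, n_actions, feat_dim=16, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(size=(feat_dim, obs_dim)) / np.sqrt(obs_dim)
        self.W = np.zeros((feat_dim, feat_dim + n_actions))  # linear forward model
        self.n_actions = n_actions
        self.lr = lr

    def bonus(self, s, a, s_next):
        phi, phi_next = self.embed @ s, self.embed @ s_next
        x = np.concatenate([phi, np.eye(self.n_actions)[a]])  # features + one-hot action
        err = self.W @ x - phi_next            # forward-model error in embedding space
        self.W -= self.lr * np.outer(err, x)   # one SGD step on the forward model
        return 0.5 * float(err @ err)          # intrinsic reward
```

Repeatedly seeing the same transition drives the forward-model error, and hence the bonus, toward zero, which is the mechanism ICM relies on to favor novel transitions.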
3.3 Random Network Distillation
Random Network Distillation (RND, burda2018exploration)
derives a bonus from the prediction error of a random network. The intuition is that the prediction error will be low on states that are similar to those previously visited and high on newly visited states. A neural network $\hat{f}$ with parameters $\theta$ is trained to predict the output of a fixed, randomly initialized neural network $f$: $r^i(s_t) = \big\|\hat{f}(s_t; \theta) - f(s_t)\big\|_2^2$.
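The RND bonus can be sketched with linear maps standing in for the convolutional networks of burda2018exploration; all dimensions and the learning rate below are illustrative:

```python
import numpy as np

class RNDSketch:
    """Illustrative RND bonus: a predictor is trained to match a fixed,
    randomly initialized target network; the squared prediction error is
    the intrinsic reward. Linear maps stand in for the paper's convnets."""

    def __init__(self, obs_dim, out_dim=8, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.target = rng.normal(size=(out_dim, obs_dim)) / np.sqrt(obs_dim)  # fixed f
        self.pred = np.zeros((out_dim, obs_dim))                              # trained f_hat
        self.lr = lr

    def bonus(self, s):
        err = self.pred @ s - self.target @ s    # f_hat(s; theta) - f(s)
        self.pred -= self.lr * np.outer(err, s)  # SGD step toward the target output
        return float(err @ err)                  # ||f_hat(s; theta) - f(s)||^2
```

As with ICM, frequently visited states yield a vanishing bonus while unfamiliar states keep a large prediction error, but here no environment dynamics need to be modeled.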
3.4 NoisyNets

Though it does not generate an exploration bonus, we also evaluate NoisyNets (fortunato18noisy), as it was chosen as the exploration strategy of the original Rainbow implementation (hessel2018rainbow). NoisyNets adds noise in parameter space and proposes to replace standard fully-connected layers $y = Wx + b$ by a noisy version that combines a deterministic and a noisy stream:
$$y = (Wx + b) + \big((W_{\text{noisy}} \odot \varepsilon^w)x + b_{\text{noisy}} \odot \varepsilon^b\big),$$
where $\varepsilon^w$ and $\varepsilon^b$ are random variables and $\odot$ denotes elementwise multiplication.
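A noisy layer of this form can be sketched as follows; the parameter names and the use of independent (rather than factorized) Gaussian noise are our simplifications:

```python
import numpy as np

def noisy_linear(x, w, b, w_noisy, b_noisy, rng):
    """NoisyNet-style layer: y = (W x + b) + ((W_noisy * eps_w) x + b_noisy * eps_b),
    with fresh Gaussian noise sampled on every forward pass. Independent noise
    is used here for simplicity; the original also proposes a factorized scheme."""
    eps_w = rng.normal(size=w_noisy.shape)  # resampled each call
    eps_b = rng.normal(size=b_noisy.shape)
    return (w @ x + b) + ((w_noisy * eps_w) @ x + b_noisy * eps_b)
```

Note that when the learned noise parameters `w_noisy` and `b_noisy` are zero, the layer degenerates to a standard deterministic linear layer, so the amount of exploration noise is itself learned.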
4 Evaluation protocol
We evaluate two key properties of exploration methods in the ALE:
Sample efficiency: obtaining a decent policy quickly.
Robustness: performing well across different games of the ALE with the same set of hyperparameters.
Sample efficiency is a key objective for exploration methods, yet, because published agents are often trained under different regimes, it is often not possible to directly compare their performance. They often employ different reinforcement learning algorithms, varying quantities of training frames, or inconsistent hyperparameter tuning. As a remedy, we fix our training protocol and train bonus-based methods with a common agent, the Rainbow implementation provided by the Dopamine framework (castro2018dopamine), which includes Rainbow’s three most important components: $n$-step updates (mnih2016asynchronous), prioritized experience replay (schaul2015prioritized) and distributional reinforcement learning (bellemare2017distributional). To avoid introducing bias in favor of a particular method we also kept the original hyperparameters fixed. Our agents are trained for 200 million frames following mnih2015human’s original setting. Nevertheless, we also acknowledge the emerging trend of training agents an order of magnitude longer in order to produce a high-scoring policy, irrespective of the sample cost (Espeholt2018IMPALASD; burda2018exploration; kapturowski2018recurrent).
The ALE was designed with the assumption that few games would be used for training and the remaining ones for evaluation. Nonetheless, it has become common to do hyperparameter tuning on Montezuma’s Revenge and only evaluate on the ALE’s other hard exploration games with sparse rewards: Freeway, Gravitar, Solaris, Venture, and Private Eye. While this may be due to limited computational resources, doing so may come at a price on easier exploration problems, as we will see later on. For this reason we chose to also evaluate performance on the original Atari training set (Freeway, Asterix, Beam Rider, Seaquest, Space Invaders). Except for Freeway, these are all considered easy exploration problems (bellemare2016unifying).
5 Empirical Results
In this section we present an experimental study of exploration methods using the protocol described previously.
5.1 Montezuma’s Revenge
We begin by establishing a benchmark of bonus-based methods on Montezuma’s Revenge when each method is tuned on the same game. Details regarding implementation and hyperparameter tuning may be found in Appendix B. Figure 1 shows training curves (averaged over 5 random seeds) for Rainbow augmented with different exploration bonuses.
As anticipated, $\epsilon$-greedy exploration performs poorly. Other strategies are able to consistently reach 2500 points and often make further progress. However, we find that pseudo-counts with CTS still outperform more recent bonuses, reaching a score of 5000 points within 200 million frames. Of note, the performance we report for each method improves on the performance originally reported by its authors. This is mostly due to the fact that these methods were originally built on weaker Deep Q-Network (mnih2015human) variants. This again emphasizes the importance of the agent architecture when evaluating exploration methods.
Regarding RND performance, we note that our implementation only uses the bonus of Section 3.3 and does not appeal to other techniques presented in the same paper that were shown to be critical to the final performance of the algorithm. However, we might expect that such techniques would also benefit other bonus-based methods, and leave this to future work.
5.2 Hard exploration games
We now turn our attention to the set of games categorized as hard exploration games by bellemare2016unifying, which is often used as an evaluation set for exploration methods. Training curves for a few games are shown in Figure 2; the remaining ones are in Appendix A. We find that performance of each method on Montezuma’s Revenge does not correlate with performance on other hard exploration problems, and the gap between different methods is not as large as it was on Montezuma’s Revenge. Surprisingly, in our setting, there is also no visible difference between $\epsilon$-greedy exploration and more sophisticated exploration strategies. $\epsilon$-greedy exploration remains competitive and even outperforms other methods by a significant margin on Gravitar. Similar results have been reported previously (machado18_SR; burda2018exploration).
These games were originally classified as hard exploration problems because DQN with $\epsilon$-greedy exploration was unable to reach a high-scoring policy; however, these conclusions must be revisited with stronger base agents. Progress in these games may be due to better credit assignment methods and not to the underlying exploration bonus.
5.3 ALE training set
While the benefit of exploration bonuses has been shown on a few games, they can also have a negative impact by skewing the reward landscape. To get a more complete picture, we also evaluated our agents on the original Atari training set, which includes many easy exploration games. Figure 2 shows training curves for Asterix and Seaquest; the remaining games can be found in Appendix A. In this setting we noticed a trend opposite to the one observed on Montezuma’s Revenge. The pseudo-count method ends up performing worse on every game except Seaquest. RND and ICM are able to consistently match the level of $\epsilon$-greedy exploration, but not exceed it. The earlier benefits conferred by pseudo-counts turn into a considerable detriment when the exploration problem is not difficult. Finally, since NoisyNets optimizes the true environment reward, and not a proxy reward, it consistently matches $\epsilon$-greedy exploration and occasionally outperforms it. Overall, we found that bonus-based methods are generally detrimental in the context of easy exploration problems. Despite its limited performance on Montezuma’s Revenge, NoisyNets gave the most consistent results across our evaluation.
6 Conclusion

Many exploration methods in reinforcement learning are introduced with confounding factors: longer training duration, different model architectures, and new hyperparameters. This obscures the underlying signal of the exploration method. Therefore, following a growing trend in the reinforcement learning community, we advocate for better practices in the empirical evaluation of exploration methods to fairly assess the contribution of newly proposed ones. In a standardized training environment and context, we found that $\epsilon$-greedy exploration can often compete with more elaborate methods on the ALE. This shows that more work is still needed to address the exploration problem in complex environments.
Acknowledgments

The authors would like to thank Hugo Larochelle, Benjamin Eysenbach, Danijar Hafner and Ahmed Touati for insightful discussions, as well as Sylvain Gelly for careful reading and comments on an earlier draft of this paper.
Appendix A Additional figures
The variance of the return on Montezuma’s Revenge is high because the reward is a step function; for clarity we also provide all the training curves in Figure 5.
Appendix B Hyperparameter tuning
Except for NoisyNets, all methods are tuned with respect to their final performance on Montezuma’s Revenge after training for 200 million frames, over five runs.
B.1 Rainbow and Atari preprocessing
We used the standard architecture and Atari preprocessing from mnih2015human. Following machado2018revisiting’s recommendations, we enabled sticky actions and deactivated the termination-on-life-loss heuristic. The remaining hyperparameters were chosen to match hessel2018rainbow’s implementation.
| Hyperparameter | Value |
| Min history to start learning | 80K frames |
| Target network update period | 32K frames |
| Adam learning rate | |
| Distributional min/max values | [-10, 10] |
Every method except NoisyNets is trained with $\epsilon$-greedy exploration following the schedule used in Rainbow, with $\epsilon$ decaying to its final value over the first 1M frames.
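Such a schedule is typically a linear interpolation over training steps; the endpoint values in the sketch below are hypothetical placeholders, not the ones used in our experiments:

```python
def epsilon_schedule(step, eps_initial=1.0, eps_final=0.01, decay_frames=1_000_000):
    """Linearly decay epsilon from eps_initial to eps_final over decay_frames,
    then hold it constant. Endpoint values here are hypothetical placeholders."""
    frac = min(step / decay_frames, 1.0)
    return eps_initial + frac * (eps_final - eps_initial)
```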
We kept the original hyperparameters used in fortunato18noisy and hessel2018rainbow.
B.2 Pseudo-counts

We followed bellemare2016unifying’s preprocessing: inputs are greyscale images, with pixel values quantized to 8 bins.
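This quantization amounts to integer division of 8-bit pixel intensities into 8 bins; a minimal sketch (frame shape arbitrary):

```python
import numpy as np

def preprocess(frame):
    """Quantize an 8-bit greyscale frame to 8 intensity bins (256 / 32 = 8),
    mirroring the density-model preprocessing described above."""
    return (np.asarray(frame, dtype=np.uint8) // 32).astype(np.uint8)
```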
For CTS, we tuned the bonus scaling factor and selected the value that performed best.
For PixelCNN, we tuned the bonus scaling factor and the prediction gain decay constant. We ran a sweep over both parameters and selected the combination that performed best.
B.3 Intrinsic Curiosity Module

We tuned the bonus scaling factor and the scalar that weighs the inverse model loss against the forward model loss. We ran a sweep over both parameters and selected the combination that performed best.
B.4 Random Network Distillation

Following burda2018exploration, we did not clip the intrinsic reward while the extrinsic reward was clipped (we also found in our initial experiments that clipping the intrinsic reward led to worse performance). We tuned the reward scaling factor and the Adam learning rate used by the RND optimizer. We ran a sweep over both parameters and selected the combination that performed best.