This paper addresses the “exploration vs. exploitation” dilemma arising in Reinforcement Learning (RL). It suggests that an agent learning through interactions should balance its action selection process between probing the environment to discover new rewards (exploration) and using the information acquired in the past to adopt an acceptable behaviour (exploitation). This trade-off is usually obtained by modifying the actions selected by the RL agent (e.g., $\epsilon$-greedy selection, Gibbs sampling, optimism (Auer and Ortner, 2007; Geist and Pietquin, 2011)), by perturbing the parameters of the agent (Fortunato et al., 2018; Plappert et al., 2018), or by modifying the reward it receives (e.g., exploration bonus or intrinsic motivation (Bellemare et al., 2016; Tang et al., 2017)). Those methods often rely on many meta-parameters that are hard to tune, are ad hoc to the problem at hand and, most importantly, can lead to sub-optimal policies.
Here, we adopt a disruptive but simple and generic perspective, where we disentangle explicitly exploration and exploitation in a deep RL architecture, as depicted in Figure 1. Different losses are optimized in parallel, one of them coming from the true RL objective (that is maximizing the cumulative rewards gathered in the environment) and others being related to exploration (e.g., exploration bonus or intrinsic motivation). Every loss is used in turn to compute a policy that generates transitions, all shared in a single replay buffer. Off-policy RL methods are then applied to these transitions to optimize every loss including the true RL one. This approach is generic, as we can combine many existing exploration strategies, and make use of any off-policy RL algorithm.
After discussing related works, we present the general proposed strategy, which we call MuleX for “Multiple losses for eXploration”, as well as a specific instantiation based on DQN with an agent combining an exploiter and an explorer optimizing for a count-based exploratory loss. Then, we showcase this approach on a hard-exploration environment. Notably, we show through an ablation study that the proposed approach is more efficient than any of its individual components, learns faster, and is more stable.
2 Related work
The exploration-exploitation dilemma is a core problem of RL. Its simplest form is the stochastic bandit problem (Bubeck and Cesa-Bianchi, 2012), where an agent has to sequentially pull arms associated with stochastic rewards so as to maximize the expected cumulative reward. In this case, the uncertainty comes only from the stochasticity of the rewards. In RL, things are more involved, as a decision can have long-term consequences.
A common and simple approach is to disrupt the greedy action of the policy, for example with an $\epsilon$-greedy policy or with Gibbs sampling (Sutton and Barto, 2018). These common approaches, although simple, are usually inefficient in hard exploration problems (for example, when rewards are scarce and far from the initial state). An alternative approach consists in perturbing the parameters of the agent, instead of its actions (Sehnke et al., 2010; Plappert et al., 2018; Fortunato et al., 2018). This comes with various motivations, such as performing a consistent exploration. It has been recently shown that this kind of exploration has a lower sample complexity (Vemula et al., 2019). A third approach consists in enhancing the reward with an exploration bonus, which can be based on some form of intrinsic motivation, novelty measure, or expert knowledge. For example, novelty can be measured by the number of times a state has been visited (Brafman and Tennenholtz, 2002), an approach that was successfully scaled to deep RL (Bellemare et al., 2016; Tang et al., 2017). Other novelty measures such as prediction error (Burda et al., 2019b) have been proposed. However, modifying the reward changes the problem at hand, and the bonus should generally be carefully annealed. Finally, optimism in the face of uncertainty, a well-known paradigm in online learning, has also been applied to RL (Auer and Ortner, 2007; Geist and Pietquin, 2011). It consists in a modification of the action-selection probability to increase the chance of selecting an action with a high upper confidence bound on the reward. Yet, this is often computationally intractable as it requires computing second-order statistics, either on the state visitation frequency, or the model parameters.
All aforementioned approaches share the property of having a single exploration strategy. In some distributed RL algorithms, it is advocated that different workers use different parameters for their exploration strategy, such as the value of $\epsilon$ in an $\epsilon$-greedy strategy (Mnih et al., 2016). Yet, the exploration strategy is homogeneous, and a single loss is optimized by the learner. Jaderberg et al. (2018) propose a similar strategy in a population-based multi-agent approach. Different agents have different (entangled) exploration strategies, and a second optimization process evolves them according to the true environment rewards.
What is common between all these methods is that they entangle exploration and exploitation into a single policy. A recent exception was proposed by Colas et al. (2018). They have first a pure exploration phase (based on intrinsic motivation) to feed a replay buffer that is then, in a second step, used to train the task policy with a classic off-policy algorithm (that has its own exploration mixed in). The approach we propose is different, notably by the fact that we optimize for different losses at the same time: one for exploitation, and one or more for exploration. Our scheme results in distinct policies, which interact for gathering samples.
While MuleX is generally applicable to any type of off-policy RL agent, we describe an instantiation based on DQN (Mnih et al., 2015) for simplicity and extend it to Rainbow in the Appendix.
Consider the standard RL setting where an agent learns to solve a given Markov Decision Process (MDP) defined by the 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ of the state space, action space, transition function, reward function, and discount factor, respectively. To solve this problem, the agent iterates between acting in the environment in order to collect transitions $(s, a, r, s')$, which are stored in a replay buffer, and updating the function approximator $Q_\theta$ for the $Q$-function using transitions from that replay buffer by optimizing:
$$\min_\theta \; \mathbb{E}_{(s, a, r, s')} \left[ \left( Q_\theta(s, a) - y \right)^2 \right],$$
where the target $y$ is the bootstrap estimate of the optimal expected return:
$$y = r + \gamma \max_{a'} Q_\theta(s', a').$$
A policy is associated with this $Q$-function by taking the best action according to $Q_\theta$ at each step.
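As a minimal sketch of the update described above, the bootstrap target and the squared error can be computed on a batch of transitions as follows. The function names are hypothetical, and the handling of terminal transitions (no bootstrapping when the episode ends) is an assumption that the text does not spell out.

```python
import numpy as np

def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bootstrap targets: reward plus discounted max over next-state Q-values.

    Assumption: terminal transitions (done=True) do not bootstrap.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_values = np.asarray(next_q_values, dtype=float)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    return rewards + gamma * not_done * next_q_values.max(axis=1)

def squared_td_loss(q_taken, targets):
    """Mean squared error between Q(s, a) of the taken actions and the targets."""
    q_taken = np.asarray(q_taken, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean((q_taken - targets) ** 2))
```

In a full agent, the loss would be minimized with respect to the network parameters by a gradient-based optimizer; only the target computation is shown here.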
The minimal way of doing exploration is by making the policy $\epsilon$-greedy, meaning that with probability $\epsilon$, a random action is taken instead of the best one. While this theoretically finds the optimal policy in the limit, it does not work well on hard exploration tasks in practical settings where time is constrained.
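For illustration, this action-selection rule can be sketched in a few lines. The function name and the use of a NumPy random generator are assumptions made for this example only.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Return a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

With `epsilon=0` the rule reduces to the purely greedy policy, and with `epsilon=1` it becomes a uniform random walk over actions.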
A widespread approach to encourage more structured exploration is to augment the environment’s reward with a bonus for exploration, be it through hand-engineered guiding rewards, intrinsic motivation, or curiosity. These boni, denoted by $b$, are typically added to the environment’s reward with some weighting factor $\beta$, resulting in a different target for the DQN optimization:
$$y = r + \beta b + \gamma \max_{a'} Q_\theta(s', a').$$
The weight $\beta$ needs to be tuned such that it does encourage exploration while simultaneously not drowning the actual task-based reward $r$.
Note that the $Q$-function learned using such modified rewards solves an MDP that differs from the original one and that, in addition, is usually non-stationary, as the “novelty” of states changes over time. This is an undesired side-effect of encouraging exploration through such boni. A manually tuned annealing schedule is often imposed on $\beta$ to mitigate this, but even then, exploratory behaviour can still be observed in the final agent.
One especially prominent type of such bonus is derived from the bandits literature, where favourable theoretical bounds on regret have been shown (Auer, 2002) when adding $\sqrt{2 \ln n / n_i}$ to an arm’s estimated value, where $n$ is the total number of pulls, and $n_i$ is that arm’s number of pulls. This was translated into the RL setting (Strehl and Littman, 2008; Bellemare et al., 2016) by suggesting a bonus related to the number of times a configuration has been visited: $b(s) = 1/\sqrt{N(s)}$, with a visit counter $N(s)$. The term $\sqrt{2 \ln n}$ is almost constant and can be absorbed into the weight $\beta$. One important difference between what is suggested in bandits and what is done in these papers is that, in the bandits literature, the bonus does not alter the arm’s value estimate itself; it is only used in the action-selection process. One argument for adding the bonus to the reward is to encourage long-term planning towards visiting unexplored areas of the state space. The bandit approach has been adopted in RL as well (Auer and Ortner, 2007; Geist and Pietquin, 2011), but this line of work does not immediately translate to the deep RL framework.
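A count-based bonus of this kind, and the additive reward it is usually combined with, can be sketched as follows. This is an illustrative tabular version under stated assumptions: states are hashable (e.g. grid coordinates), the bonus is the inverse square root of the visit count, and the names `CountBonus` and `additive_reward` are hypothetical.

```python
import math
from collections import Counter

class CountBonus:
    """Tabular visit counter returning an inverse-square-root novelty bonus."""

    def __init__(self):
        self.counts = Counter()

    def visit(self, state):
        # Increment the visit count, then return the bonus for this visit.
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])

def additive_reward(task_reward, bonus, beta):
    """Single entangled reward used by the additive approach."""
    return task_reward + beta * bonus
```

The bonus decays as a state is revisited, which is what makes the resulting reward non-stationary from the learner's point of view.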
Our proposal is to disentangle these rewards in the agent by learning a separate $Q$-function for each of them:
$$y^{\text{task}} = r + \gamma \max_{a'} Q^{\text{task}}(s', a'), \qquad y^{b} = b + \gamma \max_{a'} Q^{b}(s', a').$$
Note that all $Q$-functions are learned from the same, shared transitions. This way, by acting according to each individual $Q$-function, we obtain one policy corresponding to each reward, which can be used to act according to that reward’s intent. Most importantly, $Q^{\text{task}}$ attempts to solve the actual task throughout the whole training.
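The following sketch illustrates how two targets can be computed from one shared batch, one per reward, without ever summing the rewards. The batch layout (a dictionary with `task_reward`, `bonus`, and `done` entries), the function names, and the treatment of terminal transitions are assumptions for this example, not the authors' implementation.

```python
import numpy as np

def bootstrap_target(rewards, next_q, dones, gamma):
    """Reward plus discounted max next-state value; no bootstrap at terminals (assumed)."""
    not_done = 1.0 - np.asarray(dones, dtype=float)
    next_max = np.asarray(next_q, dtype=float).max(axis=1)
    return np.asarray(rewards, dtype=float) + gamma * not_done * next_max

def disentangled_targets(batch, next_q_task, next_q_explore, gamma=0.99):
    """One target per Q-function, both computed from the same shared transitions."""
    y_task = bootstrap_target(batch["task_reward"], next_q_task, batch["done"], gamma)
    y_explore = bootstrap_target(batch["bonus"], next_q_explore, batch["done"], gamma)
    return y_task, y_explore
```

Because the task target never sees the bonus, no weighting factor between the two rewards appears anywhere in this computation.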
Multiple policies are thus available, some focused on exploration, one on solving the task, and the RL agent needs to decide with which one to act and collect transitions. Many strategies can be conceived, including learning-based ones, but they should involve all policies (see Sec. 4.4). In this work, we show that even using a simple random heuristic, our proposed framework has considerable benefits over typical methods relying on the sum of rewards to optimize the policy. The heuristic works as follows. First, choose which policy to use for acting according to a categorical distribution over the policies, whose parameters we call the start probabilities. Then, sample the number of steps for which this policy should act from a geometric distribution. After acting for that number of steps, repeat the procedure. While deriving the parameter of the geometric distribution from the MDP’s discount factor $\gamma$ is a reasonable approach, we leave it as a free hyperparameter to be optimized.
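The random switching heuristic can be sketched as below. The function name, the parameter names `start_probs` and `stop_prob`, and the choice to ignore episode boundaries when switching are assumptions made for this illustration.

```python
import numpy as np

def acting_schedule(start_probs, stop_prob, total_steps, rng):
    """Return, for each environment step, the index of the policy that acts.

    start_probs: categorical probabilities over policies (must sum to 1).
    stop_prob:   parameter of the geometric distribution over durations,
                 so the expected duration is 1 / stop_prob steps.
    """
    schedule = []
    while len(schedule) < total_steps:
        policy = int(rng.choice(len(start_probs), p=start_probs))
        duration = int(rng.geometric(stop_prob))  # always at least 1 step
        schedule.extend([policy] * duration)
    return schedule[:total_steps]
```

For example, `acting_schedule([0.5, 0.5], 0.01, 2500, np.random.default_rng(0))` would alternate between two policies in stretches of about 100 steps on average.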
4 Experiments and results
We experiment on a grid-world environment inspired by two popular environments for exploration in RL research, Montezuma’s Revenge and the classic four-rooms, but in which we can explicitly control the various aspects that influence exploration. The environment, which we call Montezuminha, is shown in Figure 2. The agent starts in the upper-left room, where a key opens the door to the upper-right room. There, a second key opens another door in the first room, leading to the third room (lower left). There, the agent can either end the game by finding the exit or explore the last room (lower right) to get an extra reward. Every collected item gives a +1 reward, as does reaching the exit. The maximum score is then +4, provided that the agent avoids an early exit in the third room, which should require strong exploration or extreme luck.
With this, we can explicitly control for the various factors affecting exploration:
- By increasing the size of the rooms, we increase sparsity of rewards. The number of steps along the optimal trajectory grows linearly, while the amount of exploration grows quadratically.
- By making the walls teleport the agent to a rewardless parallel world, we add some distracting states whose exploration is not aligned with solving the task.
- By adding deadly ghosts which move randomly, we add stochasticity to the environment. This is interesting since real tasks can be stochastic due to partial observability and nature.
- We can use an oracle exploration bonus on the plain environment, or an approximate exploration bonus by rendering a textured version of the environment.
We perform our experiments in the Dopamine framework (Castro et al., 2018), where we train the agent for 800 iterations, each one consisting of 2500 training steps followed by 1250 evaluation steps. We limit each episode to 500 steps and perform gradient updates using RMSProp on mini-batches of 32 transitions every 4 training steps. Our neural network takes as input a stack of 4 consecutive frames (one frame is shown in Figure 1(b)), and consists of a shared body with two convolutional layers with 16 and 32 kernels, followed, for each head, by two dense layers with 64 hidden neurons. All reported scores are obtained by running the task-policy in the environment in “evaluation mode,” i.e. without collecting transitions for training.
We focus on comparing our proposed MuleX approach to the typical Additive approach, which optimizes a single policy using as a reward a linear combination of the task and the bonus reward. Our method is instantiated with two policies, one optimizing the task reward and one optimizing only the bonus reward. Note that our purpose is not to compare how different exploration rewards perform; we thus assume that both methods have access to an oracle exploration bonus based on exact state counts. Other works have proposed ways to extend such exploration bonuses to large problems (Bellemare et al., 2016; Tang et al., 2017; Burda et al., 2019b). As another baseline, we include experiments with $\epsilon$-greedy as the sole exploration method.
We perform a large-scale comparison between MuleX, Additive, and $\epsilon$-greedy agents using random hyperparameter search (Bergstra and Bengio, 2012), providing the same budget of 200 trials (repeated 5 times each) to every method. For the $\epsilon$-greedy agent, we logarithmically search over $\epsilon$. For the Additive agent, we logarithmically search over the bonus weight $\beta$. For the MuleX agent, we search over the switching strategy’s start probabilities and duration. Note that there is no bonus weight $\beta$ to be tuned in MuleX. For all agents, we logarithmically search over the optimizer’s learning rate.
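As an illustration of a logarithmic random search, hyperparameters can be drawn log-uniformly as sketched below. The function names are hypothetical, and any ranges passed in (such as the learning-rate interval in the usage note) are placeholders rather than the ranges used in the experiments.

```python
import numpy as np

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

def sample_trials(n_trials, ranges, rng):
    """Draw n_trials independent configurations from a dict of (low, high) ranges."""
    return [
        {name: sample_log_uniform(low, high, rng) for name, (low, high) in ranges.items()}
        for _ in range(n_trials)
    ]
```

For instance, `sample_trials(200, {"learning_rate": (1e-5, 1e-2)}, np.random.default_rng(0))` yields 200 configurations, each of which would then be repeated with several seeds.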
4.1 Faster optimal task-agent
First, we compare the return achieved by MuleX’s task-policy to that achieved by the Additive policy throughout training. Figure 4 depicts these as curves showing the average over all trials (dotted line) as well as the average of the best trials (solid line) according to the area under the curve (AUC). As can be seen, MuleX’s task-policy reaches the highest return about twice as quickly as the Additive policy. In the average case too, MuleX’s task-policy reaches higher return significantly quicker than the Additive policy, because it can focus on solving the task alone from the beginning. This shows that learning separate policies for separate purposes is beneficial over learning a single policy with multiple, entangled purposes.
4.2 Robustness to hyperparameters
Some RL methods can be very sensitive to hyperparameters, which implies that they need extra care to be properly tuned in order to perform at their best.
To investigate the robustness to hyperparameters, we consider again the AUC. For better interpretability, we normalize it by the AUC of an ideal agent which would obtain an ideal return from the first iteration on. A normalized AUC of 1 represents this ideal agent. We plot the density of normalized AUCs over the 1000 runs of the hyperparameter sweep as a violin plot in Figure 4, including markers for the best, worst, and median performances. These plots visualize the distribution of performances over all trials, and give a sense of robustness to hyperparameter values. As we can see, even with such a simple setup, the MuleX approach is significantly simpler to tune than the classical Additive one.
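The normalization can be made concrete with a short sketch. It assumes that the AUC is approximated by the sum of per-iteration evaluation returns (a rectangle rule), which the text does not specify; the function name is hypothetical.

```python
import numpy as np

def normalized_auc(returns_per_iteration, ideal_return):
    """AUC of a learning curve divided by the AUC of an agent that is ideal from the start."""
    returns = np.asarray(returns_per_iteration, dtype=float)
    ideal_auc = ideal_return * len(returns)
    return float(returns.sum() / ideal_auc)
```

Under this convention, an agent that obtains the ideal return at every iteration scores exactly 1, and an agent that never obtains any reward scores 0.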
4.3 Robustness of the task-policy w.r.t. initial states
Another advantage of our framework is that the task-policy is inherently more robust for two main reasons:
- Continuous exploration: By having distinct task and exploration policies, it is straightforward to keep exploring around well-trodden trajectories. In contrast, Additive explores less around the optimal trajectory over time.
- Immediate task solving: When getting into rarely visited states, the policy of the Additive approach exhibits explorative behaviour, because its reward is dominated by the high exploration reward that can be observed in those states. On the other hand, the task-policy learned by MuleX is not contaminated by the exploration bonus and thus still learns to solve the task, i.e. to “get back to the optimal trajectory.”
We experimentally demonstrate this intuition by starting the best final trained policies in all possible states where both doors are open and the extra bonus is collected. In this situation, the top two rooms are far off the optimal trajectory. We then count how many steps are required for the policy to reach the goal starting from there. The result is shown in Figure 5 as a heatmap over all these starting states. Note that the Additive-exploration agent still wants to explore in the rarely seen top half of the map, whereas our MuleX agent’s task policy goes straight to the goal, no matter how far off the optimal trajectory it starts.
4.4 The necessity for all policies to act
Deciding which policy should act in the environment in order to collect transitions is an integral part of MuleX. One could imagine that, in simple environments or when the policies are roughly aligned, it could be enough for the exploration policy to act, and the task policy could be learned completely offline. However, it is well known that offline Q-learning with function approximators can lead to overestimation of the $Q$-values for some state-action pairs and thus seriously impact the performance of the task policy (Fujimoto et al., 2019). We experimentally demonstrate this by instantiating MuleX with two policies, $\pi_1$ and $\pi_2$, which both optimize for the task reward only, using $\epsilon$-greedy exploration. Both policies are trained from the same data and using the same reward. We consider various start probabilities for $\pi_1$ and $\pi_2$ in the acting strategy. Figure 6 confirms that even in a simple environment like Montezuminha, each policy must act to correct possible overestimation of its $Q$-function.
4.5 Varying environment factors
We now explicitly control each of the factors of the environment mentioned at the beginning of Section 4 which affect the exploration properties.
4.5.1 Increased reward sparsity (room size)
In order to investigate the effect of (task-)reward sparsity, we grow the room size from the previously used to and . This has the effect that the space to be explored grows quadratically, while the length of the optimal trajectory grows linearly. We thus extend the maximum episode length to and steps as well as the steps per iteration to and for all agents.
Figure 7 shows the results of this experiment. For each room size, we normalize the scores such that the median AUC of $\epsilon$-greedy is one. This means that the plot shows how much Additive and MuleX improve over $\epsilon$-greedy. The best agents performing increasingly better than the $\epsilon$-greedy baseline confirms that using an exploration bonus becomes increasingly important as the rewards get sparser. However, MuleX offers significant advantages over the Additive method. First, its median runs get better as the room size increases, demonstrating the robustness of our approach. Second, MuleX makes better use of the exploration bonus than Additive: while for , the best agents of both methods perform similarly, the gap increases significantly in favor of MuleX for .
4.5.2 Misleading exploration (teleporting walls)
We also propose a significantly harder variant of Montezuminha, where the agent is teleported to a rewardless parallel world whenever it hits a wall. A more detailed description of this environment is given in the Appendix. The main challenge here is that exploring further is not necessarily aligned with solving the task. The results in Figures 9 and 9 again demonstrate that MuleX significantly outperforms the two baselines.
4.5.3 Stochasticity (random ghost)
While a deterministic environment can be useful for analyzing a method, RL aims at solving a wider range of problems, including stochastic environments. This can pose a problem for exploration methods (Burda et al., 2019a) as well as for algorithms (Ecoffet et al., 2019). For the sake of generality, we make Montezuminha stochastic by introducing deadly ghosts which move randomly and terminate the episode on contact, without giving negative reward. We test the behaviour of MuleX and our baselines in this stochastic version of Montezuminha. See the Appendix for more details and the full results. While the results in Figure 11 show that MuleX performs better than both baselines, it is evident that all three methods struggle, suggesting that further work is needed on exploration in stochastic environments.
4.5.4 Approximate exploration bonus (textures)
Throughout this paper, we have used an oracle for the exploration bonus, because our goal is not to investigate exploration per se, but rather to investigate new ways of integrating such bonus rewards into the training. However, in most application scenarios, one does not have access to perfect (oracle) exploration boni, and it thus makes sense to evaluate how MuleX and Additive agents behave under imperfect exploration boni. For this, we use a textured version of Montezuminha and implement the SimHash-based exploration bonus proposed in Tang et al. (2017).
The results are shown in Figure 11, and we refer to the Appendix for more details. The difference in robustness with respect to hyperparameters is even more striking on this example: MuleX can achieve the maximum return for almost any configuration, while the Additive method requires some tuning of the hyperparameters. MuleX also solves the task much faster, as observed previously.
5 Conclusion and future work
MuleX is a new way to address the classic dilemma: exploitation is disentangled from exploration by continuously optimizing a policy on the task reward, while performing exploration by acting according to a policy driven by a separate exploration objective. This new framework provides clear benefits both in terms of sample efficiency and robustness with respect to the initial state.
While we provide some insights on this new way of integrating bonus rewards, this is only a first step. For example, we could consider elaborate actor selection strategies: intuitively, the task policy should act in well-explored parts of the state space, whereas there is a need for more exploration in rarely visited states. Another step could be applying these ideas to policy-gradient methods, which have been successfully used at scale.
Furthermore, MuleX seems like a natural candidate for life-long learning: because it does not require any sort of annealing of exploration rewards, it can constantly keep exploring without contaminating the task policy.
References
- Auer (2002) Peter Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
- Auer and Ortner (2007) Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems (NIPS), pages 49–56. MIT Press, 2007.
- Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying Count-based Exploration and Intrinsic Motivation. In Advances in Neural Information Processing Systems (NIPS), pages 1471–1479, 2016.
- Bellemare et al. (2017) Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 449–458. JMLR.org, 2017.
- Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. Random Search for Hyper-parameter Optimization. Journal of Machine Learning Research, 13(Feb):281–305, 2012.
- Brafman and Tennenholtz (2002) Ronen I Brafman and Moshe Tennenholtz. R-max - a General Polynomial Time Algorithm for Near-optimal Reinforcement Learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Bubeck and Cesa-Bianchi (2012) Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- Burda et al. (2019a) Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A. Efros. Large-scale study of curiosity-driven learning. In Proceedings of the International Conference on Learning Representations (ICLR), 2019a.
- Burda et al. (2019b) Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. In Proceedings of the International Conference on Learning Representations (ICLR), 2019b.
- Castro et al. (2018) Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A Research Framework for Deep Reinforcement Learning. CoRR, abs/1812.06110, 2018.
- Colas et al. (2018) Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
- Ecoffet et al. (2019) Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Go-Explore: a New Approach for Hard-Exploration Problems. CoRR, abs/1901.10995, 2019.
- Fortunato et al. (2018) Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alexander Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy Networks for Exploration. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
- Fujimoto et al. (2019) Scott Fujimoto, David Meger, and Doina Precup. Where off-policy deep reinforcement learning fails, 2019. URL https://openreview.net/forum?id=S1zlmnA5K7.
- Geist and Pietquin (2011) Matthieu Geist and Olivier Pietquin. Managing uncertainty within the KTD framework. In Active Learning and Experimental Design workshop in conjunction with AISTATS 2010, pages 157–168, 2011.
- Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2018.
- Jaderberg et al. (2018) Max Jaderberg, Wojciech M Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castaneda, Charles Beattie, Neil C Rabinowitz, Ari S Morcos, Avraham Ruderman, et al. Human-level Performance in First-person Multiplayer Games with Population-based Deep Reinforcement Learning. arXiv preprint arXiv:1807.01281, 2018.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level Control Through Deep Reinforcement Learning. Nature, 518(7540):529, 2015.
- Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the International conference on machine learning (ICML), pages 1928–1937, 2016.
- Plappert et al. (2018) Matthias Plappert, Rein Houthooft, Prafulla Dhariwal, Szymon Sidor, Richard Y. Chen, Xi Chen, Tamim Asfour, Pieter Abbeel, and Marcin Andrychowicz. Parameter Space Noise for Exploration. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.
- Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- Sehnke et al. (2010) Frank Sehnke, Christian Osendorfer, Thomas Rückstieß, Alex Graves, Jan Peters, and Jürgen Schmidhuber. Parameter-exploring Policy Gradients. Neural Networks, 23(4):551–559, 2010.
- Strehl and Littman (2008) Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for markov decision processes. J. Comput. Syst. Sci., 74(8):1309–1331, 2008.
- Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT press, 2018.
- Tang et al. (2017) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. #Exploration: A Study of Count-based Exploration for Deep Reinforcement Learning. In Advances in Neural Information Processing Systems (NIPS), pages 2753–2762, 2017.
- Vemula et al. (2019) Anirudh Vemula, Wen Sun, and J Andrew Bagnell. Contrasting Exploration in Parameter and Action Space: A Zeroth-Order Optimization Perspective. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
Appendix A Additional experiments
Rainbow (Hessel et al., 2018) is a combination of extensions that improve over the original DQN (Mnih et al., 2015). We use the Dopamine implementation of Rainbow which includes the following extensions: n-step returns, prioritized experience replay (Schaul et al., 2016) and C51 distributional RL (Bellemare et al., 2017).
Prioritized experience replay was shown to be effective for MDPs with sparse and stationary rewards. This remark drives our choice for MuleX, where the task policy is learned using Rainbow, while the exploration policy, which is learned from dense and non-stationary rewards, is trained using standard DQN. Note that the Additive exploration method in conjunction with Rainbow is an important baseline, although the additive reward does not meet the stationarity assumption.
We directly compare the three methods, namely $\epsilon$-greedy, Additive exploration, and MuleX, using Rainbow with the results obtained with standard DQN, given in Figures 4,4. Figures 13,13 show that all methods benefit from the extensions provided in Rainbow. Besides faster learning, the most notable impact on MuleX is an additional strong increase in robustness to hyperparameters, as the minimum AUC increases significantly.
A.2 Hard variant of the environment
In the variants of Montezuminha considered so far, exploration is always safe in the sense that the exploration is very much aligned with solving the task, and in the worst case only wastes a bit of time. We introduce a hard variant of Montezuminha, shown on Figure 14. When the agent is located in one of the rooms on the left side and touches the wall, it gets teleported to the same location but on the right side. Then, the only way to escape is to reach the object located in the bottom right room which teleports the agent back to the initial state on the left. This parallel world on the right is only distracting when it comes to solving the task but makes the exploration task much harder.
Figures 9 and 9 show the performance of MuleX compared to the baselines. In this variant of Montezuminha, it is now extremely difficult to discover new rewards just by chance, which explains the poor performance of the agent trained using $\epsilon$-greedy exploration. Compared to the Additive baseline, MuleX trains faster and still manages to reach the best return possible within the budget of 800 training iterations.
A.3 Details and full results on stochastic environment
A ghost has a current moving direction, which has a probability of 25% to randomly change at every step. This means that it is possible to reason at least a little about these random ghosts. The ghost can also walk through doors into other rooms, if the doors are open.
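The ghost dynamics described above can be sketched as follows. Several details are assumptions made for this example: the new direction is drawn uniformly from the four grid directions (so it may coincide with the old one), a ghost facing a blocked cell stays in place, and the names `ghost_step` and `is_free` are hypothetical.

```python
import random

DIRECTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def ghost_step(position, direction, rng, change_prob=0.25, is_free=lambda cell: True):
    """Advance a ghost by one step and return its new (position, direction).

    With probability change_prob the moving direction is redrawn at random;
    the ghost then moves one cell if the target cell is free (walls, closed doors).
    """
    if rng.random() < change_prob:
        direction = rng.choice(DIRECTIONS)
    candidate = (position[0] + direction[0], position[1] + direction[1])
    if is_free(candidate):
        position = candidate
    return position, direction
```

Because the direction persists for four steps on average, an agent can partially anticipate where a ghost will be next, which is the limited predictability mentioned above.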
We make the environment stochastic by putting one single ghost into the same room the player starts in. We also increase the room size to as otherwise the chance of collision in the first room is much too high.
A.4 Full results on textured version with pseudocounts
Each type of cell has a texture of pixels associated to it, and the status bar (showing collected items) shows the same texture as used in the room view. The textured environment is shown in Figure 19. Because of the increase in size, we also slightly increase the capacity of the network’s convolutional body by adding a convolution and increasing filter sizes.
For the SimHash exploration bonus, we resize the input to , use 10 value bins, and project the result to a random code of size 256. These settings ensure that it does not degenerate to an oracle reward, but could contain mistakes.
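A SimHash-style pseudo-count bonus in the spirit of Tang et al. (2017) can be sketched as below. This is an illustration under stated assumptions rather than the configuration used in the experiments: the frame is assumed to be already resized and scaled to [0, 1], the projection is a fixed Gaussian matrix, the bonus is the inverse square root of the code's count, and the class name is hypothetical.

```python
import numpy as np
from collections import Counter

class SimHashBonus:
    """Count visits of hashed observations and return an inverse-square-root bonus."""

    def __init__(self, input_dim, code_size=256, n_bins=10, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random projection; its sign pattern is the binary hash code.
        self.projection = rng.standard_normal((code_size, input_dim))
        self.n_bins = n_bins
        self.counts = Counter()

    def bonus(self, frame):
        x = np.asarray(frame, dtype=float).ravel()
        # Quantize pixel values into n_bins levels before hashing.
        binned = np.minimum(np.floor(x * self.n_bins), self.n_bins - 1)
        code = self.projection @ binned > 0
        key = code.tobytes()
        self.counts[key] += 1
        return 1.0 / np.sqrt(self.counts[key])
```

Distinct frames may collide on the same code, and near-identical frames may receive different codes, which is the kind of mistake that distinguishes such a bonus from an oracle count.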
The textured environment is computationally much more demanding, and thus we use only a single random seed for each random hyperparameter, resulting in only 200 runs of each method (instead of 5 repeats resulting in 1000 runs for all other experiments).