PyTorch implementation of Never Give Up: Learning Directed Exploration Strategies
We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies. We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment. A self-supervised inverse dynamics model is used to train the embeddings of the nearest neighbour lookup, biasing the novelty signal towards what the agent can control. We employ the framework of Universal Value Function Approximators (UVFA) to simultaneously learn many directed exploration policies with the same neural network, with different trade-offs between exploration and exploitation. By using the same neural network for different degrees of exploration/exploitation, transfer is demonstrated from predominantly exploratory policies yielding effective exploitative policies. The proposed method can be incorporated to run with modern distributed RL agents that collect large amounts of experience from many actors running in parallel on separate environment instances. Our method doubles the performance of the base agent in all hard exploration games in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features.
Implementation of DiscoMaze environment from sec. 4.1 of arxiv:2002.06038
The problem of exploration remains one of the major challenges in deep reinforcement learning. In general, methods that guarantee finding an optimal policy require the number of visits to each state-action pair to approach infinity. Strategies that become greedy after a finite number of steps may never learn to act optimally; they may converge prematurely to suboptimal policies, and never gather the data they need to learn. Ensuring that all state-action pairs are encountered infinitely often is the general problem of maintaining exploration (François-Lavet et al., 2018; Sutton and Barto, 2018). The simplest approach for tackling this problem is to consider stochastic policies with a non-zero probability of selecting all actions in each state, e.g. $\epsilon$-greedy or Boltzmann exploration. While these techniques will eventually learn the optimal policy in the tabular setting, they are very inefficient and the steps they require grow exponentially with the size of the state space. Despite these shortcomings, they can perform remarkably well in dense reward scenarios (Mnih et al., 2015). In sparse reward settings, however, they can completely fail to learn, as temporally-extended exploration (also called deep exploration) is crucial to even find the very few rewarding states (Osband et al., 2016).
Recent approaches have proposed to provide intrinsic rewards to agents to drive exploration, with a focus on demonstrating performance in non-tabular settings. These intrinsic rewards are proportional to some notion of saliency quantifying how different the current state is from those already visited (Bellemare et al., 2016; Haber et al., 2018; Houthooft et al., 2016; Oh et al., 2015; Ostrovski et al., 2017; Pathak et al., 2017; Stadie et al., 2015). As the agent explores the environment and becomes familiar with it, the exploration bonus disappears and learning is only driven by extrinsic rewards. This is a sensible idea as the goal is to maximise the expected sum of extrinsic rewards. While very good results have been achieved on some very hard exploration tasks, these algorithms face a fundamental limitation: after the novelty of a state has vanished, the agent is not encouraged to visit it again, regardless of the downstream learning opportunities it might allow (Bellemare et al., 2016; Ecoffet et al., 2019; Stanton and Clune, 2018). Other methods estimate predictive forward models (Haber et al., 2018; Houthooft et al., 2016; Oh et al., 2015; Pathak et al., 2017; Stadie et al., 2015) and use the prediction error as the intrinsic motivation. Explicitly building models like this, particularly from observations, is expensive, error prone, and can be difficult to generalize to arbitrary environments. In the absence of the novelty signal, these algorithms reduce to undirected exploration schemes, maintaining exploration in a non-scalable way. To overcome this problem, a careful calibration between the speed of the learning algorithm and that of the vanishing rewards is required (Ecoffet et al., 2019; Ostrovski et al., 2017).
The main idea of our proposed approach is to jointly learn separate exploration and exploitation policies derived from the same network, in such a way that the exploitative policy can concentrate on maximising the extrinsic reward (solving the task at hand) while the exploratory ones can maintain exploration without eventually reducing to an undirected policy. We propose to jointly learn a family of policies, parametrised using the UVFA framework (Schaul et al., 2015a), with various degrees of exploratory behaviour. The learning of the exploratory policies can be thought of as a set of auxiliary tasks that can help build a shared architecture that continues to develop even in the absence of extrinsic rewards (Jaderberg et al., 2016). We use reinforcement learning to approximate the optimal value function corresponding to several different weightings of intrinsic rewards.
We propose an intrinsic reward that combines per-episode and life-long novelty to explicitly encourage the agent to repeatedly visit all controllable states in the environment over an episode. Episodic novelty encourages an agent to periodically revisit familiar (but potentially not fully explored) states over several episodes, but not within the same episode. Life-long novelty gradually down-modulates states that become progressively more familiar across many episodes. Our episodic novelty uses an episodic memory filled with all previously visited states, encoded using the self-supervised objective of Pathak et al. (2017) to avoid uncontrollable parts of the state space. Episodic novelty is then defined as similarity of the current state to previously stored states. This allows the episodic novelty to rapidly adapt within an episode: every observation made by the agent potentially changes the per-episode novelty significantly. Our life-long novelty multiplicatively modulates the episodic similarity signal and is driven by a Random Network Distillation error (Burda et al., 2018b). In contrast to the episodic novelty, the life-long novelty changes slowly, relying upon gradient descent optimisation (as opposed to an episodic memory write for episodic novelty). Thus, this combined notion of novelty is able to generalize in complex tasks with large, high dimensional state spaces in which a given state is never observed twice, and maintain consistent exploration both within an episode and across episodes.
This paper makes the following contributions: (i) defining an exploration bonus combining life-long and episodic novelty to learn exploratory strategies that can maintain exploration throughout the agent’s training process (to never give up), (ii) learning a family of policies that separate exploration and exploitation using a conditional architecture with shared weights, (iii) providing experimental evidence that the proposed method is scalable and performs on par with or better than state-of-the-art methods on hard exploration tasks. Our work differs from Savinov et al. (2018) in that it is not specialised to navigation tasks, and our method incorporates a long-term intrinsic reward and is able to separate exploration and exploitation policies. Unlike Stanton and Clune (2018), our work relies on no privileged information and combines both episodic and non-episodic novelty, obtaining superior results. Our work differs from Beyer et al. (2019) in that we learn multiple policies by sharing weights, rather than just a common replay buffer, and our method does not require exact counts and so can scale to more realistic domains such as Atari. The paper is organized as follows. In Section 2 we describe the proposed intrinsic reward. In Section 3, we describe the proposed agent and general framework. In Section 4 we present experimental evaluation.
We follow the literature on curiosity-driven exploration, where the extrinsic reward is augmented with an intrinsic reward (or exploration bonus). The augmented reward at time $t$ is then defined as $r_t = r^e_t + \beta r^i_t$, where $r^e_t$ and $r^i_t$ are respectively the extrinsic and intrinsic rewards, and $\beta$ is a positive scalar weighting the relevance of the latter. Deep RL agents are typically trained on the augmented reward $r_t$, while performance is measured on extrinsic reward $r^e_t$ only. This section describes the proposed intrinsic reward $r^i_t$.
Our intrinsic reward satisfies three properties: (i) it rapidly discourages revisiting the same state within the same episode, (ii) it slowly discourages visits to states visited many times across episodes, (iii) the notion of state ignores aspects of an environment that are not influenced by an agent’s actions.
We begin by providing a general overview of the computation of the proposed intrinsic reward. Then we provide the details of each one of the components. The reward is composed of two blocks: an episodic novelty module and an (optional) life-long novelty module, represented in red and green respectively in Fig. 1 (right). The episodic novelty module computes our episodic intrinsic reward and is composed of an episodic memory, $M$, and an embedding function $f$, mapping the current observation to a learned representation that we refer to as controllable state. At the beginning of each episode, the episodic memory starts completely empty. At every step, the agent computes an episodic intrinsic reward, $r^{\text{episodic}}_t$, and appends the controllable state corresponding to the current observation to the memory $M$. To determine the bonus, the current observation is compared to the content of the episodic memory. Larger differences produce larger episodic intrinsic rewards. The episodic intrinsic reward $r^{\text{episodic}}_t$ promotes the agent to visit as many different states as possible within a single episode. This means that the notion of novelty ignores inter-episode interactions: a state that has been visited thousands of times gives the same intrinsic reward as a completely new state as long as they are equally novel given the history of the current episode.
A life-long (or inter-episodic) novelty module provides a long-term novelty signal to statefully control the amount of exploration across episodes. We do so by multiplicatively modulating the exploration bonus $r^{\text{episodic}}_t$ with a life-long curiosity factor, $\alpha_t$. Note that this modulation will vanish over time, reducing our method to using the non-modulated reward. Specifically, we combine $\alpha_t$ with $r^{\text{episodic}}_t$ as follows (see also Fig. 1 (right)):
$$r^i_t = r^{\text{episodic}}_t \cdot \min\{\max\{\alpha_t, 1\}, L\} \qquad (1)$$
where $L$ is a chosen maximum reward scaling (we fix $L = 5$ for all our experiments). Mixing rewards this way, we leverage the long-term novelty detection that $\alpha_t$ offers, while $r^i_t$ continues to encourage our agent to explore all the controllable states.
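As a rough illustration of this modulation, the Python sketch below multiplies an episodic bonus by the life-long factor after clipping that factor to the range between 1 and a maximum scale. The function and argument names are illustrative and are not taken from the paper.

```python
def combine_intrinsic_reward(r_episodic: float, alpha: float, max_scale: float) -> float:
    """Scale the episodic bonus by the life-long factor, clipped to [1, max_scale].

    r_episodic: episodic novelty bonus for the current step.
    alpha:      life-long curiosity factor for the current observation.
    max_scale:  upper bound on the multiplicative modulation (L in the text).
    """
    modulator = min(max(alpha, 1.0), max_scale)
    return r_episodic * modulator
```

Because the factor is clipped from below at 1, a state that has become familiar across episodes (small `alpha`) still receives the unmodulated episodic bonus rather than a vanishing one.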
Embedding network: $f : \mathcal{O} \to \mathbb{R}^p$ maps the current observation to a $p$-dimensional vector corresponding to its controllable state. Consider an environment that has a lot of variability independent of the agent’s actions, such as navigating a busy city with many pedestrians and vehicles. An agent could visit a large number of different states (collecting large cumulative intrinsic rewards) without taking any actions. This would not lead to performing any meaningful form of exploration. To avoid such meaningless exploration, given two consecutive observations, we train a Siamese network (Bromley et al., 1994; Koch et al., 2015) $f$ to predict the action taken by the agent to go from one observation to the next (Pathak et al., 2017). Intuitively, all the variability in the environment that is not affected by the action taken by the agent would not be useful to make this prediction. More formally, given a triplet $\{x_t, a_t, x_{t+1}\}$ composed of two consecutive observations, $x_t$ and $x_{t+1}$, and the action taken by the agent $a_t$, we parameterise the conditional likelihood as $p(a \mid x_t, x_{t+1}) = h(f(x_t), f(x_{t+1}))$, where $h$ is a one hidden layer MLP followed by a softmax. The parameters of both $h$ and $f$ are trained via maximum likelihood. This architecture can be thought of as a Siamese network with a one-layer classifier on top, see Fig. 1 (left) for an illustration. For more details about the architecture, see App. H.1, and for hyperparameters, see App. F.
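To make the maximum-likelihood objective concrete, the sketch below computes the negative log-likelihood of the action actually taken, given the classifier's pre-softmax outputs. It assumes the logits have already been produced by the classifier applied to the two embedded observations; the networks themselves are omitted, and the function name is hypothetical.

```python
import math


def inverse_dynamics_nll(logits, action):
    """Negative log-likelihood of `action` under softmax(logits).

    logits: list of floats, one per discrete action (assumed classifier output).
    action: index of the action the agent actually took.
    """
    # Log-sum-exp with the max subtracted for numerical stability.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_norm - logits[action]
```

Minimising this quantity over observed transitions trains both the classifier and the embedding; features that do not help predict the action receive no gradient signal from this loss.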
Episodic memory and intrinsic reward: The episodic memory $M$ is a dynamically-sized slot-based memory that stores the controllable states in an online fashion (Pritzel et al., 2017). At time $t$, the memory contains the controllable states of all the observations visited in the current episode, $\{f(x_0), f(x_1), \ldots, f(x_{t-1})\}$. Inspired by theoretically-justified exploration methods turning state-action counts into a bonus reward (Strehl and Littman, 2008), we define our intrinsic reward as
$$r^{\text{episodic}}_t = \frac{1}{\sqrt{n(f(x_t))}} \approx \frac{1}{\sqrt{\sum_{f_i \in N_k} K(f(x_t), f_i)} + c} \qquad (2)$$
where $n(f(x_t))$ is the counts for the visits to the abstract state $f(x_t)$. We approximate these counts $n(f(x_t))$ as the sum of the similarities given by a kernel function $K : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$, over the content of $M$. In practice, pseudo-counts are computed using the $k$-nearest neighbors of $f(x_t)$ in the memory $M$, denoted by $N_k = \{f_i\}_{i=1}^{k}$. The constant $c$ guarantees a minimum amount of “pseudo-counts” (fixed to $0.001$ in all our experiments). Note that when $K$ is a Dirac delta function, the approximation becomes exact but consequently provides no generalisation of exploration required for very large state spaces. Following Blundell et al. (2016); Pritzel et al. (2017), we use the inverse kernel for $K$,
$$K(x, y) = \frac{\epsilon}{\frac{d^2(x, y)}{d^2_m} + \epsilon}$$
where $\epsilon$ is a small constant (fixed to $10^{-3}$ in all our experiments), $d$ is the Euclidean distance and $d^2_m$ is a running average of the squared Euclidean distance of the $k$-th nearest neighbors. This running average is used to make the kernel more robust to the task being solved, as different tasks may have different typical distances between learnt embeddings. A detailed computation of the episodic reward can be found in Alg. 1 in App. A.1.
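The computation described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the procedure of Alg. 1: it uses a brute-force neighbour search, updates the running average with the mean of the neighbour distances at each step, and omits any further steps the appendix may specify. The class name, method names, and default constants are illustrative.

```python
import math


class EpisodicNovelty:
    """Episodic memory with a k-NN inverse-kernel pseudo-count bonus (sketch)."""

    def __init__(self, k=10, eps=1e-3, c=1e-3):
        self.k = k            # number of nearest neighbours (illustrative default)
        self.eps = eps        # kernel constant (illustrative default)
        self.c = c            # minimum pseudo-count constant (illustrative default)
        self.memory = []      # controllable states stored during this episode
        self.dm2 = 0.0        # running average of squared neighbour distances
        self.n_updates = 0

    def reset(self):
        """Empty the memory at the start of an episode; the running average persists."""
        self.memory = []

    def reward(self, embedding):
        """Return the episodic bonus for `embedding`, then append it to memory."""
        # Squared Euclidean distances to the k nearest stored states.
        sq_dists = sorted(
            sum((a - b) ** 2 for a, b in zip(embedding, stored))
            for stored in self.memory
        )[: self.k]
        self.memory.append(list(embedding))
        if sq_dists:
            # Incremental mean over the per-step average neighbour distance.
            self.n_updates += 1
            step_mean = sum(sq_dists) / len(sq_dists)
            self.dm2 += (step_mean - self.dm2) / self.n_updates
        similarity_sum = 0.0
        for d2 in sq_dists:
            normalised = d2 / self.dm2 if self.dm2 > 0 else 0.0
            similarity_sum += self.eps / (normalised + self.eps)
        return 1.0 / (math.sqrt(similarity_sum) + self.c)
```

With this sketch, revisiting an embedding that is already in memory yields a smaller bonus than visiting one far from everything stored so far, which is the qualitative behaviour the pseudo-count approximation is meant to provide.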
Integrating life-long curiosity: In principle, any long-term novelty estimator could be used as a basis for the modulator $\alpha_t$. We found Random Network Distillation (Burda et al., 2018b, RND) worked well, is simple to implement and easy to parallelize. The RND modulator $\alpha_t$ is defined by introducing a random, untrained convolutional network $g : \mathcal{O} \to \mathbb{R}^k$, and training a predictor network $\hat{g} : \mathcal{O} \to \mathbb{R}^k$ that attempts to predict the outputs of $g$ on all the observations that are seen during training by minimizing $err(x_t) = \|\hat{g}(x_t; \theta) - g(x_t)\|^2$ with respect to the parameters of $\hat{g}$, $\theta$. We then define the modulator $\alpha_t$ as a normalized mean squared error, as done in Burda et al. (2018b): $\alpha_t = 1 + \frac{err(x_t) - \mu_e}{\sigma_e}$, where $\sigma_e$ and $\mu_e$ are running standard deviation and mean for $err(x_t)$. For more details about the architecture, see App. H.2, and for hyperparameters, see App. F.
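The normalisation step can be sketched as below. The sketch takes the prediction error as a plain float and tracks its running mean and standard deviation with Welford's online update; how the running statistics are actually maintained is not specified in this section, so that choice, the class name, and the fallback value of 1 before statistics are available are all assumptions.

```python
import math


class RNDModulator:
    """Turn raw prediction errors into a factor 1 + (err - mean) / std (sketch)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, err):
        """Fold a new prediction error into the running statistics."""
        self.count += 1
        delta = err - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (err - self.mean)

    def alpha(self, err):
        """Return the modulation factor for `err` under the current statistics."""
        if self.count < 2:
            return 1.0  # assumed neutral value until a variance estimate exists
        std = math.sqrt(self.m2 / self.count)
        if std == 0.0:
            return 1.0  # all errors identical so far; avoid dividing by zero
        return 1.0 + (err - self.mean) / std
```

An error equal to the running mean maps to a factor of exactly 1, while unusually large errors (unfamiliar observations) map to factors above 1.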
In the previous section we described an episodic intrinsic reward for learning policies capable of maintaining exploration in a meaningful way throughout the agent’s training process. We now demonstrate how to incorporate this intrinsic reward into a full agent that maintains a collection of value functions, each with a different exploration-exploitation trade-off.
Using intrinsic rewards as a means of exploration subtly changes the underlying Markov Decision Process (MDP) being solved: if the augmented reward $r_t = r^e_t + \beta r^i_t$ varies in ways unpredictable from the action and states, then the decision process may no longer be an MDP, but instead be a Partially Observed MDP (POMDP). Solving POMDPs can be much harder than solving MDPs, so to avoid this complexity we take two approaches: firstly, the intrinsic reward is fed directly as an input to the agent, and secondly, our agent maintains an internal state representation that summarises its history of all inputs (state, action and rewards) within an episode. As the basis of our agent, we use Recurrent Replay Distributed DQN (Kapturowski et al., 2019, R2D2) as it combines a recurrent state, experience replay, off-policy value learning and distributed training, matching our desiderata.
Unlike most of the previously proposed intrinsic rewards (as seen in Section 1), the never-give-up intrinsic reward does not vanish over time, and thus the learned policy will always be partially driven by it. Furthermore, the proposed exploratory behaviour is directly encoded in the value function and as such it cannot be easily turned off. To overcome this problem, we propose to jointly learn an explicit exploitative policy that is only driven by the extrinsic reward of the task at hand.
Proposed architecture: We propose to use a universal value function approximator (UVFA) $Q(x, a, \beta_i)$ to simultaneously approximate the optimal value function with respect to a family of augmented rewards of the form $r^{\beta_i}_t = r^e_t + \beta_i r^i_t$. We employ a discrete number $N$ of values $\{\beta_i\}_{i=0}^{N-1}$ including the special case of $\beta_0 = 0$ and $\beta_{N-1} = \beta$, where $\beta$ is the maximum value chosen. In this way, one can turn off exploratory behaviour simply by acting greedily with respect to $Q(x, a, 0)$. Even before observing any extrinsic reward, we are able to learn a powerful representation and set of skills that can be quickly transferred to the exploitative policy. In principle, one could think of having an architecture with only two policies, one with $\beta_0 = 0$ and one with $\beta_1 > 0$. The advantage of learning a larger number of policies comes from the fact that exploitative and exploratory policies could be quite different from a behaviour standpoint. Having a larger number of policies that change smoothly allows for more efficient training. For a detailed description of the specific values of $\beta_i$ we use in our experiments, see App. A. We adapt the R2D2 agent that uses the dueling network architecture of Wang et al. (2015) with an LSTM layer after a convolutional neural network. We concatenate to the output of the network a one-hot vector encoding the value of $\beta_i$, the previous action, the previous intrinsic reward and the previous extrinsic reward. We describe the precise architecture in App. H.3 and hyperparameters in App. F.
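The extra inputs concatenated to the network output can be sketched as a flat feature vector. In the sketch below the one-hot encoding of the previous action is an assumption (the text only states that the mixture index is one-hot encoded), and all names are hypothetical.

```python
def augmented_reward(r_ext, r_int, beta):
    """Reward seen by the policy conditioned on mixture weight `beta`."""
    return r_ext + beta * r_int


def conditioning_vector(mixture_index, num_mixtures, prev_action, num_actions,
                        prev_r_int, prev_r_ext):
    """Build the vector appended to the convolutional features (sketch).

    Layout: one-hot mixture index, one-hot previous action (assumed encoding),
    previous intrinsic reward, previous extrinsic reward.
    """
    beta_one_hot = [1.0 if i == mixture_index else 0.0 for i in range(num_mixtures)]
    action_one_hot = [1.0 if a == prev_action else 0.0 for a in range(num_actions)]
    return beta_one_hot + action_one_hot + [float(prev_r_int), float(prev_r_ext)]
```

Setting `beta` to zero in `augmented_reward` recovers the purely extrinsic reward, which is the case the exploitative policy is trained on.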
RL Loss functions: As a training loss we use a transformed Retrace double Q-learning loss. In App. E we provide the details of the computation of the Retrace loss (Munos et al., 2016). In addition, we associate for each $\beta_i$ a $\gamma_i$, with $\gamma_0 = 0.997$ and $\gamma_{N-1} = 0.99$. We remark that the exploitative policy $\beta_0$ is associated with the highest discount factor $\gamma_0 = \gamma_{\max}$ and the most exploratory policy $\beta_{N-1}$ with the smallest discount factor $\gamma_{N-1} = \gamma_{\min}$. We can use smaller discount factors for the exploratory policies because the intrinsic reward is dense and the range of values is small, whereas we would like the highest possible discount factor for the exploitative policy in order to be as close as possible to optimizing the undiscounted return. For a detailed description of the specific values of $\{\gamma_i\}_{i=0}^{N-1}$ we use in our experiments, see App. A.
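The pairing of each mixture weight with a discount factor can be sketched as below. The spacing used in the experiments is given in App. A and is not reproduced here; the linear interpolation in this sketch is only a placeholder assumption chosen to show the ordering (largest discount for the exploitative policy, smallest for the most exploratory one), and the function name is hypothetical.

```python
def mixture_schedule(num_mixtures, beta_max, gamma_max, gamma_min):
    """Return a list of (beta_i, gamma_i) pairs for i = 0 .. num_mixtures - 1.

    Placeholder linear spacing (an assumption, not the schedule from App. A):
    beta grows from 0 to beta_max while gamma falls from gamma_max to gamma_min.
    """
    if num_mixtures < 2:
        raise ValueError("this sketch needs at least two mixtures")
    schedule = []
    for i in range(num_mixtures):
        frac = i / (num_mixtures - 1)
        beta_i = beta_max * frac
        gamma_i = gamma_max + (gamma_min - gamma_max) * frac
        schedule.append((beta_i, gamma_i))
    return schedule
```

Whatever spacing is used, the first entry corresponds to the exploitative policy (zero intrinsic weight, highest discount) and the last to the most exploratory one.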
Distributed training: Recent works in deep RL have achieved significantly improved performance by running on distributed training architectures that collect large amounts of experience from many actors running in parallel on separate environment instances (Andrychowicz et al., 2018; Barth-Maron et al., 2018; Burda et al., 2018b; Espeholt et al., 2018; Horgan et al., 2018; Kapturowski et al., 2019; Silver et al., 2016). Our agent builds upon the work by Kapturowski et al. (2019) to decouple learning from acting, with actors (256 unless stated otherwise) feeding experience into a distributed replay buffer and the learner training on randomly sampled batches from it in a prioritized way (Schaul et al., 2015b). Please refer to App. A for details.
We begin by analyzing the exploratory policy of the Never Give Up (NGU) agent with a single reward mixture. We perform such analysis by using a minimal example environment in Section 4.1. We observe the performance of its learned policy, as well as highlight the importance of learning a representation for abstract states. In Section 4.2, we analyze the performance of the full NGU agent, evaluating its effectiveness on the Arcade Learning Environment (ALE; Bellemare et al. (2013)). We measure the performance of the agent against baselines on hard exploration games, as well as dense reward games. We expand on the analysis of the NGU agent by running it on the full set of Atari games, as well as showing multiple ablations on important choices of hyperparameters of the model.
In this section we present a simple example to highlight the effectiveness of the exploratory policy of the NGU agent, as well as the importance of estimating the exploration bonus using a controllable state representation. To isolate the effect of the exploratory policy, we restrict the analysis to the case of a single exploratory policy ($N = 1$, with $\beta = 0.3$). We introduce a gridworld environment, Random Disco Maze, implemented with the pycolab game engine (Stepleton, 2017), depicted in Fig. 2 (left). At each episode, the agent finds itself in a new randomly generated maze of size 21x21. The agent can take four actions (left, right, up, down), moving a single position at a time. The environment is fully observable. If the agent steps into a wall, the episode ends and a new maze is generated. Crucially, at every time step, the color of each wall fragment is randomly sampled from a set of five possible colors, enormously increasing the number of possible states. This irrelevant variability in color presents a serious challenge to algorithms using exploration bonuses based on novelty, as the agent is likely to never see the same state twice. This experiment is purely exploratory, with no external reward. The goal is to see if the proposed model can learn a meaningful directed exploration policy despite the large visual distractions providing a continual stream of observation novelty to the agent. Fig. 2 shows the percentage of unique states (different positions in the maze) visited by agents trained with the proposed model and one in which the mapping $f$ is a fixed random projection (i.e. $f$ is untrained). The proposed model learns to explore any maze sampled from the task-distribution. The agent learns a strategy that resembles depth-first search: it explores as far as possible along each branch before backtracking (often requiring backtracking a few dozen steps to reach an unexplored area).
The model with random projections, as well as our baseline of RND, do not show such exploratory behaviour (see video of the trained agent here: https://youtu.be/9HTY4ruPrHw). Both models do learn to avoid walking into walls, since doing so would limit the amount of intrinsic reward they would receive. However, staying alive is enough: simply oscillating between two states will produce different (and novel) controllable states at every time step.
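To illustrate why observation-level novelty is uninformative in this environment, the sketch below renders a tiny maze in which every wall cell receives a freshly sampled color on each call. It is a standalone toy in plain Python and does not use the pycolab API; the maze layout, symbols, and function name are hypothetical.

```python
import random

NUM_WALL_COLORS = 5  # the text states five possible wall colors


def render(maze, agent_pos, rng):
    """Return a grid observation with wall colors resampled on every call.

    maze:      list of strings, '#' for wall and '.' for free space.
    agent_pos: (row, col) of the agent.
    rng:       a random.Random instance used to sample wall colors.
    """
    observation = []
    for r, row in enumerate(maze):
        rendered_row = []
        for c, cell in enumerate(row):
            if (r, c) == agent_pos:
                rendered_row.append("A")
            elif cell == "#":
                rendered_row.append(rng.randrange(NUM_WALL_COLORS))
            else:
                rendered_row.append(".")
        observation.append(rendered_row)
    return observation
```

Two consecutive calls with the same agent position will usually produce different observations even though nothing the agent controls has changed, which is the kind of variability an action-predicting embedding has no incentive to encode.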
In this section, we evaluate the effectiveness of the NGU agent on the Arcade Learning Environment (ALE; Bellemare et al., 2013). We use standard Atari evaluation protocol and pre-processing as described in Tab. 8 of App. F.4, with the only difference being that we do not use frame stacking. We restrict NGU to using the same setting and data consumption as R2D2, the best performing algorithm on Atari (Kapturowski et al., 2019). While we compare our results with the best published methods on this benchmark, we note that different baselines use very different training regimes with very different computational budgets. Comparing distributed and non-distributed methods is in general difficult. In an effort to properly assess the merits of the proposed model we include two additional baselines: as NGU is based on R2D2 using the Retrace loss (instead of its n-step objective) we include this as a baseline, and since we use RND as a reward modulator, we also include R2D2 with Retrace using the RND intrinsic reward. These methods are all run for 35 billion frames using the same protocol as that of R2D2 (Kapturowski et al., 2019). We detail the use of compute resources of the algorithms in App. D. We report the return averaged over 3 different seeds.
Architecture: We adopt the same core architecture as that used by the R2D2 agent to facilitate comparisons. There are still a few choices to make, namely: the size of the learned controllable states, the clipping factor $L$ in (1), and the number of nearest neighbours $k$ to use for computing pseudo-counts in (2). We selected these hyperparameters by analysing the performance of the single policy agent, NGU($N = 1$), on two representative exploration games: Montezuma’s Revenge and Pitfall!. We report this study in App. B. We used the same fixed set of hyperparameters in all the remaining experiments.
NGU agent: We performed further ablations in order to better understand several major design choices of the full NGU agent on a set of 8 Atari games: the set of 5 dense reward games chosen to select the hyperparameters of Mnih et al. (2015), as well as 3 hard exploration games (Montezuma’s Revenge, Pitfall!, and Private Eye). For a detailed description of the results on these games as well as results on more choices of hyperparameters, please see App. C. The ablations we perform are on the number of mixtures $N$, the impact of the off-policy data used (referred to as CMR below), the maximum magnitude of $\beta$ (by default $0.3$ if not explicitly mentioned), the use of RND to scale the intrinsic reward, and the performance of the agent in absence of extrinsic rewards. We denote by Cross Mixture Ratio (CMR) the proportion in the training batches of experience collected using different values of $\beta_i$ from the one being trained. A CMR of $0$ means training each policy only with data produced by the same $\beta_i$, while a CMR of $0.5$ means using equal amounts of data produced by $\beta_i$ and $\beta_{j \neq i}$. Our base agent NGU has a CMR of $0$.
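The Cross Mixture Ratio can be read as a recipe for composing each training batch. The sketch below shows one way to do that under the assumption that experience is kept in two pools (collected with the mixture being trained, and collected with any other mixture) and sampled uniformly with replacement; the actual agent samples from a prioritized replay buffer, and the function and argument names here are hypothetical.

```python
import random


def sample_mixed_batch(own_data, other_data, batch_size, cmr, rng):
    """Draw a batch in which a fraction `cmr` comes from other mixtures.

    own_data:   transitions collected with the mixture being trained.
    other_data: transitions collected with any other mixture.
    cmr:        cross mixture ratio in [0, 1].
    rng:        a random.Random instance.
    """
    n_other = int(round(cmr * batch_size))
    n_own = batch_size - n_other
    return rng.choices(own_data, k=n_own) + rng.choices(other_data, k=n_other)
```

With a ratio of zero the batch contains only the mixture's own experience, and with a ratio of one half it is split evenly between the two pools.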
The results are shown in Fig. 3. Several conclusions can be extracted from these results: Firstly, sharing experience from all the actors (with CMR of $0.5$) slightly harms overall average performance on hard exploration games. This suggests that the power of acting differently for different conditioning mixtures is mostly acquired through the shared weights of the model rather than shared data. Secondly, we observe an improvement, on average, from increasing the number of mixtures $N$ on hard exploration games. Thirdly, as one can observe in analyzing the value of $\beta$, the value of $\beta = 0.3$ is the best average performing value, whereas $\beta = 0.2$ and $\beta = 0.5$ make the average performance worse on those hard exploration games. These values indicate, in this case, the limits in which NGU either does not have sufficiently exploratory variants ($\beta$ too low) or its policies become too biased towards exploratory behavior ($\beta$ too high). Further, the use of the RND factor seems to be greatly beneficial on these hard exploration games. This matches the great success of existing literature, in which long-term intrinsic rewards appear to have a great impact (Bellemare et al., 2016; Ostrovski et al., 2017; Choi et al., 2018). Additionally, as outlined above, the motivation behind studying these variations on this set of games is that those hyperparameters are of general effect, rather than specific to exploration. However, surprisingly, with the exception of the case of removing the extrinsic reward, they seem to have little effect on the dense reward games we analyze (with all error bars overlapping). This suggests that NGU and its hyperparameters are relatively robust: as extrinsic rewards become dense, intrinsic rewards (and their relative weight to the extrinsic rewards) naturally become less relevant. Finally, even without extrinsic reward $r^e$, we can still obtain average superhuman performance on the 5 dense reward games we evaluate, indicating that the exploration policy of NGU is an adequate high performing prior for this set of tasks. That confirms the findings of Burda et al. (2018a), where they showed that there is a high degree of alignment between the intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments. The heuristics of surviving and exploring what is controllable seem to be highly general and beneficial, as we have seen in the Disco Maze environment in Section 4.1, as well as on Atari.
Hard exploration games: We now evaluate the full NGU agent on the six hard exploration games identified by Bellemare et al. (2016). We summarise the results in Tab. 1. The proposed method achieves similar or higher average return than state-of-the-art baselines on all hard exploration tasks. Remarkably, to the best of our knowledge, this is the first method without use of privileged information that obtains a positive score on Pitfall!, with NGU($N = 1$)-RND obtaining a best score of 15,200. Moreover, in 4 of the 6 games, NGU($N = 32$) appears to substantially improve against the single mixture case NGU($N = 1$). This shows how the exploitative policy is able to leverage the shared weights with all the intrinsically-conditioned mixtures to explore games in which it is hard to do so, but still optimize towards maximizing the final episode score. In Fig. 4 we can see these conclusions more clearly: both in terms of mean and median human normalized scores, NGU greatly improves upon existing algorithms.
While direct comparison of the scores is interesting, the emphasis of this work is on learning directed exploration strategies that encourage the agent to cover as much of the environment as possible. In Fig. 5 (left) we observe the average episodic return of NGU run with and without RND on Pitfall!. NGU($N = 32$) is able to learn a directed exploration policy capable of exploring an average of 50 rooms per episode, crossing 14 rooms before receiving the first extrinsic reward. We also observe that, in this case, using RND makes our model less data efficient. This is also the case for NGU($N = 1$), as observed on NGU($N = 1$)-RND in Tab. 1, the best performing Pitfall! agent. We conjecture three main hypotheses to explain this: firstly, on Pitfall! (and unlike Montezuma’s Revenge) rooms are frequently aliased to one another, thus the agent does not obtain a large reward for discovering new rooms. This phenomenon would explain the results seen in Fig. 5 (right), in which RND greatly improves the results of NGU($N = 32$). Secondly, the presence of a timer in the observation acts as a spurious source of novelty which greatly increases the number of unique states achievable even within a single room. Thirdly, as analyzed in Section 3.7 of Burda et al. (2018b), RND-trained agents often keep ’interacting with danger’ instead of exploring further, and Pitfall! is a game in which this can be highly detrimental, due to the high amount of dangerous elements in each room. Finally, we observe that NGU($N = 1$) obtains better results than NGU($N = 32$). Our intuition is that, in this case, a single policy should be simpler to learn and can achieve quite good results on this task, since exploration and exploitation policies are greatly similar.
Table 2: Results on dense reward games. Columns: Algorithm, Pong, QBert, Breakout, Space Invaders, Beam Rider.
Dense reward games: Tab. 2 shows the results of our method on dense reward games. NGU($N = 1$) underperforms relative to R2D2 on most games (indeed the same can be said of R2D2(Retrace) that serves as the basis of NGU). Since the intrinsic reward signal may be completely misaligned with the goal of the game, these results may be expected. However, there are cases such as Pong, in which NGU($N = 1$) catastrophically fails to learn to perform well. Here is where NGU($N = 32$) solves this issue: the exploitative policy learned by the agent is able to reliably learn to play the game. Nevertheless, NGU($N = 32$) has limitations: even though its learned policies are vastly superhuman and empirically reasonable, they do not match R2D2 on Breakout and Beam Rider. This suggests that the representations learned by using the intrinsic signal still slightly interfere with the learning process of the exploitative mixture. We hypothesize that alleviating this further by having non-shared representations between mixtures should help in solving this issue.
Results on all Atari 57 games: The proposed method achieves an overall median score of 1354.4%, compared to 95% for Nature DQN baseline, 191.8% for IMPALA, 1920.6% for R2D2, and 1451.8% for R2D2 using retrace loss. Please refer to App. G for separate results on individual games. Even though its overall median score is lower than R2D2, NGU maintains good performance on all games, performing above human level on 51 out of the 57 games. This further confirms that the learned exploitative mixture is still able to focus on maximizing the score of the game, allowing the algorithm to obtain strong performance across all games.
Analysis of Multiple Mixtures: in Fig. 6, we can see NGU() evaluated with (used in all reported numerical results) against NGU() evaluated with . We can observe different trends in the games: on Q*Bert the policies of the agent seem to converge to the exploitative policy regardless of the condition, with its learning curve being almost identical to the one shown for R2D2 in Kapturowski et al. (2019). As seen in App. G, this is common in many games. The second most common occurrence is what we see on Pitfall! and Beam Rider, in which the policies learn quantitatively very different behaviour. In these cases, the exploitative policy learns to focus on its objective, and sometimes it does so by benefiting from what the exploratory policy has learned, as is the case in Pitfall!, where R2D2 never achieves a positive score (see videos of NGU on Pitfall! at https://sites.google.com/view/nguiclr2020). Finally, there is the exceptional case of Montezuma’s Revenge, in which the reverse happens: the exploratory policy obtains a better score than the exploitative policy. In this case, extremely long-term credit assignment is required in order for the exploitative policy to consolidate the knowledge of the exploratory policy. This is because, to achieve scores that are higher than k, the agent needs to go to the second level of the game, going through many non-greedy and sometimes irreversible actions. For a more detailed analysis of this specific problem, see App. I.2.
We present a reinforcement learning agent that can effectively learn on both sparse and dense reward scenarios. The proposed agent achieves high scores in all Atari hard-exploration games, while still maintaining a very high average score over the whole Atari-57 suite. Remarkably, it is, to the best of our knowledge, the first algorithm to achieve non-zero rewards on the challenging game of Pitfall! without relying on human demonstrations, hand-crafted features, or manipulating the state of the environment. A central contribution of this work is a method for learning policies that can maintain exploration throughout the training process. In the absence of extrinsic rewards, the method produces a policy that aims at traversing all controllable states of the MDP in a depth-first manner. We highlight that this could have impact beyond this specific application and/or algorithmic choices. For instance, one could use it as a behaviour policy to facilitate learning models of the environment or as a prior for planning methods.
The proposed method is able to leverage large amounts of compute by running on distributed training architectures that collect large amounts of experience from many actors running in parallel on separate environment instances. This has been crucial for solving most challenging tasks in deep RL in recent years (Andrychowicz et al., 2018; Espeholt et al., 2018; Silver et al., 2016), and this method is able to utilize such compute to obtain strong performance on the set of hard-exploration games on Atari. While this is certainly a desirable feature and allows NGU to achieve a remarkable performance, it comes at the price of high sample complexity, consuming a large amount of simulated experience taking several days of wall-clock time. An interesting avenue for future research lies in improving the data efficiency of these agents.
Further, the episodic novelty measure relies on the notion of controllable states to drive exploration. As observed on the Atari hard-exploration games, this strategy performs well on several tasks, but it may not be the right signal for some environments. For instance, in some environments it might take more than two consecutive steps to see the consequences of the actions taken by the agent. An interesting line for future research is learning effective controllable states beyond a simple inverse dynamics model.
Additionally, the proposed work relies on the assumption that while different, one can find good exploratory and exploitative policies that are similar enough to be effectively represented using a shared parameterization (implemented using the UVFA framework). This can be limiting when the two policies are almost adversarial. This can be seen in games such as ‘Surround’ and ‘Ice hockey’.
Finally, the hyperparameter β depends on the scale of the extrinsic reward. Thus, environments with significantly different extrinsic reward scales might require different values of β. An interesting avenue forward is the dynamic adaptation of β, which could be done by using techniques such as Population Based Training (PBT) (Jaderberg et al., 2017) or Meta-gradients (Xu et al., 2018). Another advantage of dynamically tuning this hyperparameter would be to allow the model to become completely exploitative when the agent has reached a point at which further exploration does not lead to improvements in the exploitative policy. This is not trivially achievable, however, as including such a mechanism would require calibrating the adaptation to match the speed of learning of the exploitative policy.
We thank Daan Wierstra, Steph Hughes-Fitt, Andrea Banino, Meire Fortunato, Melissa Tan, Benigno Uria, Borja Ibarz, Mohammad Gheshlaghi Azar, Remi Munos, Bernardo Avila Pires, Andre Barreto, Vali Irimia, Sam Ritter, David Raposo, Tom Schaul and many other colleagues at DeepMind for helpful discussions and comments on the manuscript.
Journal of Artificial Intelligence Research 47, pp. 253–279.
Foundations and Trends® in Machine Learning 11 (3-4), pp. 219–354.
ICML deep learning workshop, Vol. 2.
The evaluation we do is also identical to the one done in R2D2 (Kapturowski et al., 2019): a parallel evaluation worker, which shares weights with actors and learners, runs the Q-network against the environment. This worker and all the actor workers are the two types of workers that draw samples from the environment. For Atari, we apply the standard DQN pre-processing, as used in R2D2. More concretely, this is how the actors, evaluators, and learner are run:
Learner
Sample from the replay buffer a sequence of augmented rewards, intrinsic rewards, observations, actions and discounts.
Use Q-network to learn from with retrace using the procedure used by R2D2. As specified in Fig. 1, is sampled because it is fed as an input to the network.
Use last frames of the sampled sequences to train the action prediction network as specified in Section 2. This means that, for every batch of sequences, all time steps are used to train the RL loss, whereas only time steps per sequence are used to optimize the action prediction loss.
(If using RND) also use last frames of the sampled sequences to train the predictor of RND as also specified in Section 2.
Evaluator and Actor
Obtain , , , and discount .
With these inputs, compute forward pass of R2D2 to obtain .
With , compute using the embedding network as described in Section 2.
(actor) Insert , , , , and in the replay buffer.
Step on the environment with .
As in R2D2, we train the agent with a single GPU-based learner, performing approximately network updates per second (each update on a mini-batch of length- sequences, as explained below), and each actor performing environment steps per second on Atari. We assign to each actor a fixed value in the set and the actor acts according to an -greedy version of this policy. More concretely, for the -th actor we assign the value with . In our experiments, we use the following :
where σ is the sigmoid function. As can be seen in Fig. 7(a), this choice allows the agent to focus more on the two extreme cases, which are the fully exploitative policy and the very exploratory policy.
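As an illustration of such a schedule, the following minimal sketch builds a family of per-actor values that starts at zero, ends at a maximum scale, and passes the intermediate indices through a sigmoid so that most values cluster near the two extremes. It reads the per-actor value as an intrinsic reward scale (written `beta` here). The exact functional form, the `sharpness` constant, and the function names are assumptions made for this example and are not taken from the text above.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def beta_schedule(num_mixtures: int, beta_max: float, sharpness: float = 10.0):
    """Return one value per mixture index i = 0 .. num_mixtures - 1.

    Assumed form (illustrative only):
      i = 0                 -> 0            (fully exploitative)
      i = num_mixtures - 1  -> beta_max     (most exploratory)
      otherwise             -> beta_max * sigmoid(sharpness * (2i - (N - 2)) / (N - 2))
    """
    n = num_mixtures
    if n < 2:
        raise ValueError("this sketch needs at least two mixtures")
    betas = []
    for i in range(n):
        if i == 0:
            betas.append(0.0)
        elif i == n - 1:
            betas.append(beta_max)
        else:
            # Only reached when n >= 3, so (n - 2) is never zero here.
            z = sharpness * (2 * i - (n - 2)) / (n - 2)
            betas.append(beta_max * sigmoid(z))
    return betas


if __name__ == "__main__":
    # beta_max = 0.3 is an illustrative value chosen for the demo.
    for i, b in enumerate(beta_schedule(8, 0.3)):
        print(i, round(b, 4))
```

With this shape, the values near both ends of the index range sit close to 0 or to `beta_max`, and only a few indices in the middle take intermediate values.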
In the replay buffer, we store fixed-length sequences of tuples. In all our experiments we collect sequences of length 80 timesteps, where adjacent sequences overlap by 40 time-steps. These sequences never cross episode boundaries. Additionally, we store in the replay the value of the used by the actor as well as the initial recurrent state, which we use to initialize the network at training time. Please refer to Kapturowski et al. (2019) for a detailed experimental study of trade-offs between different treatments of recurrent states in replay. Given a single batch of trajectories we unroll both online and target networks on the same sequence of states to generate value estimates. We use prioritized experience replay. We followed the same prioritization scheme proposed in Kapturowski et al. (2019) using a mixture of max and mean of the TD-errors with priority exponent . In addition, we associate for each a such that:
where is the maximum discount factor and is the minimal discount factor. This form yields discount factors evenly spaced in log-space between and . For more intuition, we provide a graph of the in Fig. 7(b) in App. A. We remark that the exploitative policy is associated with the highest discount factor and the most exploratory policy with the smallest discount factor . We can use smaller discount factors for the exploratory policies because the intrinsic reward is dense and the range of values is small, whereas we would like the highest possible discount factor for the exploitative policy in order to be as close as possible to optimizing the undiscounted return. In our experiments, we use and .
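The following sketch shows one way to obtain discount factors with the properties described above: index 0 (the exploitative policy) receives the maximum discount, the last index receives the minimum, and intermediate values are evenly spaced in log-space. We assume here that the spacing is applied to the quantity 1 - γ; that assumption, the function name, and the numeric values in the demo are illustrative and are not taken from the text.

```python
import math


def gamma_schedule(num_mixtures: int, gamma_max: float, gamma_min: float):
    """Discounts for indices i = 0 .. N - 1, with log(1 - gamma_i) linear in i.

    Assumed form (illustrative only):
      gamma_i = 1 - exp(((N - 1 - i) * log(1 - gamma_max)
                         + i * log(1 - gamma_min)) / (N - 1))
    """
    n = num_mixtures
    if n < 2:
        raise ValueError("this sketch needs at least two mixtures")
    if not (0.0 < gamma_min <= gamma_max < 1.0):
        raise ValueError("expected 0 < gamma_min <= gamma_max < 1")
    log_hi = math.log(1.0 - gamma_max)  # index 0: exploitative policy
    log_lo = math.log(1.0 - gamma_min)  # index N - 1: most exploratory policy
    return [
        1.0 - math.exp(((n - 1 - i) * log_hi + i * log_lo) / (n - 1))
        for i in range(n)
    ]


if __name__ == "__main__":
    # 0.997 and 0.99 are placeholder values for the demo.
    for i, g in enumerate(gamma_schedule(8, 0.997, 0.99)):
        print(i, round(g, 5))
```

Interpolating in log(1 - γ) rather than in γ itself spaces the effective horizons 1 / (1 - γ) geometrically, which is one natural reading of "evenly spaced in log-space".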
the episodic memory containing at time the previous embeddings .
is the number of nearest neighbours.
is the set of -nearest neighbours of in the memory .
the kernel defined as where is a small constant, is the Euclidean distance and is a running average of the squared Euclidean distance of the -nearest neighbors.
is the pseudo-counts constant.
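To make the roles of these quantities concrete, the sketch below computes an episodic bonus from the ingredients listed above: the k nearest neighbours of the current embedding in memory, an inverse kernel with a small constant, normalisation by a running average of squared neighbour distances, and a pseudo-counts constant. The precise way the terms are combined, the default values, and all identifiers are assumptions made for illustration; the running average is supplied by the caller rather than maintained here.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def episodic_reward(embedding, memory, k, mean_sq_dist, eps=1e-3, c=1e-3):
    """Episodic novelty bonus for `embedding` given the episodic `memory`.

    Assumed form (illustrative only):
      K(x, y) = eps / (d^2(x, y) / mean_sq_dist + eps)
      reward  = 1 / (sqrt(sum of K over the k nearest neighbours) + c)
    `mean_sq_dist` stands for the running average of squared k-NN distances
    and must be positive.
    """
    if mean_sq_dist <= 0.0:
        raise ValueError("mean_sq_dist must be positive")
    # Brute-force search: one distance per memory entry, then keep the k smallest.
    nearest = sorted(squared_distance(embedding, m) for m in memory)[:k]
    similarity = sum(eps / (d / mean_sq_dist + eps) for d in nearest)
    return 1.0 / (math.sqrt(similarity) + c)


if __name__ == "__main__":
    memory = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
    print("visited state:", episodic_reward((0.0, 0.0), memory, k=2, mean_sq_dist=1.0))
    print("distant state:", episodic_reward((100.0, 100.0), memory, k=2, mean_sq_dist=1.0))
```

An embedding that already sits in memory yields a kernel value close to one and therefore a small bonus, while an embedding far from everything stored yields a sum close to zero and a bonus bounded above by 1 / c.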
The space complexity is constant. The number of weights that the network has can be computed from the architecture seen in App. F. Furthermore, for our episodic memory buffer, we pre-allocate memory at the beginning of training, with size detailed in App. F. In cases in which the episode is longer than the size of the memory, the memory acts as a ring buffer, deleting the oldest entries first.
Time complexity is , where is the number of frames, and is the size of our memory. This is due to the fact that we do one forward pass per frame, and we compute the distance from the embeddings produced by the embeddings network to the contents of our memory in order to retrieve the -nearest neighbors.
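The pre-allocated memory with ring-buffer overwrite described above can be sketched as follows. The class and method names are hypothetical, and plain Python lists stand in for whatever array type an actual implementation would use.

```python
class EpisodicMemory:
    """Fixed-capacity store of embeddings that overwrites the oldest entry when full."""

    def __init__(self, capacity: int, dim: int):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        # Pre-allocate every slot up front so the footprint stays constant.
        self._slots = [[0.0] * dim for _ in range(capacity)]
        self._capacity = capacity
        self._next = 0  # index of the slot that will be written next
        self._size = 0  # number of valid entries currently stored

    def add(self, embedding):
        """Store an embedding, replacing the oldest one if the memory is full."""
        self._slots[self._next] = list(embedding)
        self._next = (self._next + 1) % self._capacity
        self._size = min(self._size + 1, self._capacity)

    def contents(self):
        """Return the valid entries (not necessarily in insertion order)."""
        return self._slots[: self._size]

    def reset(self):
        """Forget all entries, for example at an episode boundary."""
        self._next = 0
        self._size = 0


if __name__ == "__main__":
    memory = EpisodicMemory(capacity=3, dim=1)
    for value in range(5):
        memory.add([float(value)])
    print(memory.contents())
```

Because each query compares the new embedding with every valid slot, the per-frame cost of the nearest-neighbour lookup grows linearly with the memory size, which matches the complexity argument above.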
As mentioned in Section 4.2, we here show ablations on the size of the learned controllable states, the clipping factor in (1), and the number of nearest neighbours to use for computing pseudo-counts in (2).
Due to the lack of a pure exploitative mode, as seen in 4.2, NGU(N=1) fails to perform well in dense reward games. Therefore, in order to obtain high signal from these ablations, we analyze the performance of NGU(N=1) on the two most popular sparse reward games: Montezuma’s Revenge and Pitfall!.
In Fig. 9 and Fig. 9 we can see the performance of NGU(N=1) with different sizes of the controllable state on Pitfall! and Montezuma’s Revenge respectively. As we can observe, there is little to no impact on Pitfall!, with scores that sometimes reach more than 25,000 points. On Montezuma’s Revenge is the value that is consistently better than . A size of as the controllable state size sometimes solves the level, but is in general less stable.
We show a similar analysis in Fig. 11 and Fig. 11 regarding the number of nearest neighbors on Pitfall! and Montezuma’s Revenge respectively. As we can see, there are slight gains from using more neighbors on Pitfall!, whereas there is a clear difference in performance in using neighbors in Montezuma’s Revenge when compared to using or neighbors.
Finally, we show the performance of NGU(N=1) in Fig. 13 and Fig. 13 for different values of the clipping factor on Pitfall! and Montezuma’s Revenge respectively. As we can observe, Pitfall! is again robust to the value of this hyperparameter, with marginally worse performance in the case of . This is expected, as RND is generally detrimental to the performance of NGU on Pitfall!, as seen in Section 4.2. On the other hand, the highest value of clipping appears to work best on Montezuma’s Revenge. In our initial investigations, we observed that clipping this value was required on Montezuma’s Revenge to make the algorithm stable. Further analysis is required in order to show the range of values of that are higher than and are detrimental to the performance of NGU(N=1) on this task.
The best score in Montezuma’s Revenge is obtained by using a non-zero Cross Mixture Ratio, even though it is relatively close to the score obtained by NGU().
and have lower average human normalized score on the set of hard exploration games when compared to or . Concretely on the set of hard exploration games of Tab. 3, they only achieve super-human performance on Montezuma’s Revenge.
Even though we have seen that the results of and have lower average on the hard exploration games of Tab. 3, they still individually outperform RND, R2D2, R2D2(Retrace), and R2D2+RND on Pitfall! and Private Eye.
In the case of Private Eye the distance in score might be misleading, as rewards are very sparse and of large value. For instance, after reaching a score of 40k, if we ignore minor rewards, there are only two rewards left to be collected, each of around 30k points. This creates what seem to be large differences in scores.
On Breakout, a high score is achieved without extrinsic reward. This is due to the fact that the exploratory policy learns to survive, which eventually leads to a high score.
|Algorithm||Pong||Qbert||Breakout||Space Invaders||Beam Rider||MR||Pitfall!||PrivateEye|
In Tab. 4 we show further results for the case of and . We compare them to human performance as well as the base NGU(), with .
As we can observe, in this case the difference in terms of relative performance among games is less pronounced than the ones observed on Tab. 3. In fact, results are slightly better for both values of on all games, with a maximum difference of 1.5k points on Solaris between and . We hypothesize that this is due to the nature of these specific games: the policies learnt on these three games seem to focus on exploitation rather than extended exploration of the environment, and in that case, similar to what we see for dense reward games in Sec. 4.2, the method shows less variability with respect to this hyperparameter.
On Tab. 5 we can see a comparison of the computation used between different algorithms.
Computation is still difficult to compare even when taking actor steps and parameter updates into account: the number of actors in distributed setups will affect how much data the learner will be able to consume, but also how off-policy such data is (e.g. in R2D2, if a learner is learning from many actors, the data that is sampled from the replay buffer will be more recent than with fewer actors).
|Algorithm||Number of actors||Total Number of frames|
|R2D2 Kapturowski et al. (2019)||B|
|R2D2 + RND||B|
|DQN + PixelCNN Ostrovski et al. (2017)||M|
|DQN + CTS Bellemare et al. (2016)||M|
|RND** Burda et al. (2018b)||B ( B)|
|PPO + CoEx Choi et al. (2018)||B|
Retrace (Munos et al., 2016) is an off-policy Reinforcement Learning algorithm that can be used for evaluation or control. In the evaluation setting the goal is mainly to estimate the action-value function of a target policy from trajectories drawn from a behaviour policy . In the control setting the target policy, or more precisely the sequence of target policies, depends on the sequence of -functions that will be generated through the process of approximating . To do so, we consider trajectories starting from the state-action couple and then following the behaviour policy of the form:
with , , and . The expectation is over all admissible trajectories generated by the behaviour policy starting in state doing action and then following the behaviour policy .
The general Retrace operator , that depends on and , is:
where the temporal difference is defined as:
and the cutting traces coefficients as:
Theorem 2 of Munos et al. (2016) explains in which conditions the sequence of -functions:
where depends on the policy-couple converges to the optimal -value . In particular one of the conditions is that the sequence of target policies is greedy or -greedy with respect to (more details can be found in Munos et al. (2016)).
In practice, at a given time , we can only consider finite sampled sequences starting from and then following the behaviour policy . Therefore, we define the finite sampled-Retrace operator as:
In addition, we use two neural networks. One target network and an online network . The target network is used to compute the target value that the online network will try to fit:
In the control scenario the policy chosen is greedy or -greedy with respect to the online network . Then, the online network is optimized to minimize the loss:
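The finite-sequence Retrace target can be computed with a single backward pass over a sampled sequence. The sketch below follows the usual formulation of Munos et al. (2016), in which each temporal difference is propagated backwards through the discount and the truncated importance weight of the following step. The argument layout and function names are assumptions made for this example, and the inputs are taken to be values already produced by the target network.

```python
def trace_coefficient(pi_prob: float, mu_prob: float, lam: float = 1.0) -> float:
    """Truncated importance weight: lam * min(1, pi(a|x) / mu(a|x))."""
    return lam * min(1.0, pi_prob / mu_prob)


def retrace_targets(q_taken, expected_next_q, rewards, trace_coeffs, gamma):
    """Retrace targets for one sampled sequence of length T.

    q_taken[t]         : Q(x_t, a_t) from the target network
    expected_next_q[t] : sum_a pi(a | x_{t+1}) * Q(x_{t+1}, a) from the target network
    rewards[t]         : reward received after taking a_t in x_t
    trace_coeffs[t]    : cutting-trace coefficient for step t (entry 0 is unused)
    """
    length = len(q_taken)
    targets = [0.0] * length
    correction = 0.0  # holds (target - Q) of the following step
    for t in reversed(range(length)):
        delta = rewards[t] + gamma * expected_next_q[t] - q_taken[t]
        next_c = trace_coeffs[t + 1] if t + 1 < length else 0.0
        correction = delta + gamma * next_c * correction
        targets[t] = q_taken[t] + correction
    return targets


if __name__ == "__main__":
    targets = retrace_targets(
        q_taken=[1.0, 2.0],
        expected_next_q=[2.0, 0.0],
        rewards=[1.0, 1.0],
        trace_coeffs=[1.0, 1.0],
        gamma=0.5,
    )
    print(targets)
```

When every coefficient is zero the target reduces to the one-step expected backup, and when the behaviour and target policies coincide (coefficients equal to one with λ = 1) it accumulates the full corrected return along the sequence. The online network would then be regressed towards these targets.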
More generally, one can use transformed Retrace operators (Pohlen et al., 2018):
where is a real-function and the temporal difference is defined as:
The role of the function is to reduce (squash) the scale of the action-value function to make it easier to approximate for a neural network without changing the optimal property of the operator . In particular, we use the function :
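As an example of such a squashing function, the sketch below implements the invertible transform commonly used with this family of agents following Pohlen et al. (2018), h(x) = sign(x)(√(|x| + 1) − 1) + εx, together with its closed-form inverse. Whether this is exactly the function intended here is an assumption, and the default value of `eps` is illustrative.

```python
import math


def h(x: float, eps: float = 1e-3) -> float:
    """Squash x: roughly a signed square root for large |x|, near-linear close to 0."""
    return math.copysign(math.sqrt(abs(x) + 1.0) - 1.0, x) + eps * x


def h_inverse(y: float, eps: float = 1e-3) -> float:
    """Closed-form inverse of h, obtained by solving a quadratic in sqrt(|x| + 1)."""
    root = (math.sqrt(1.0 + 4.0 * eps * (abs(y) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return math.copysign(root * root - 1.0, y)


if __name__ == "__main__":
    for value in (-1000.0, -1.0, 0.0, 0.5, 100.0, 10000.0):
        squashed = h(value)
        print(value, squashed, h_inverse(squashed))
```

In a transformed operator of this kind, the bootstrap value is first mapped back to the original scale with `h_inverse`, combined with the reward and discount, and then squashed again with `h`, so that the network only ever has to represent values on the compressed scale.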
In order to select the hyperparameters used for NGU() for all 57 Atari games, which are shown in Tab. 6, we ran a grid search with the ranges shown in Tab. 9. We used seeds on the set of Atari games shown in Tab. 3. Regarding the hyperparameters concerning the kernel (Kernel and the number of neighbors used), we fixed them after determining suitable ranges of the intrinsic reward in our initial experimentation on Atari. After running the grid search with those hyperparameters, we selected the combination with the largest number of games (out of ) that held a score greater than our human benchmark. As one can see in the multiple mixtures ablations in Tab. 3, as well as the single mixture ablations in App. B, the only agent that achieved superhuman performance on the set of games is NGU().
Finally, in order to obtain the R2D2+RND baseline, we ran a sweep over the hyperparameter with values , , and , over the games shown in Tab. 3. Coincidentally, like NGU(), the best value of was determined to be .
The hyperparameters used in all the experiments are shown in Tab. 6. We provide a full list of hyperparameters here for completeness. However, as one can see, the R2D2-related architectural hyperparameters are identical to the original R2D2 hyperparameters.
|Number of Seeds|
|Cross Mixture Ratio|
|Number of mixtures|
|Optimizer||AdamOptimizer (for all losses)|
|Learning rate (R2D2)|
|Learning rate (RND and Action prediction)|
|Adam clip norm|
|R2D2 reward transformation|
|Episodic memory capacity|
|Embeddings memory mode||Ring buffer|
|Intrinsic reward scale|
|Kernel num. neighbors used|
|Kernel cluster distance|
|Kernel pseudo-counts constant|
|Kernel maximum similarity|
|Replay priority exponent|
|Minimum sequences to start replay|
|Actor update period|
|Target Q-network update period|
|Embeddings target update period||once/episode|
|Action prediction network L2 weight|
|RND clipping factor|
Hyperparameters are shown in Tab. 7.
|Episodic memory capacity|
|Learning rate (R2D2 and Action prediction)|
|Intrinsic reward scale|
|Retrace loss transformation|
|Num. action repeats|
|Target Q-network update period|
|Q-network filter sizes|
|Q-network filter strides|
|Q-network num. filters|
|Action prediction network filter sizes|
|Action prediction network filter strides|
|Action prediction network num. filters|
Hyperparameters are shown in Tab. 8.
|Max episode length|
|Num. action repeats|
|Num. stacked frames|
|Zero discount on life loss|
|Random noops range|
|Frames max pooled||3 and 4|
On Tab. 9 we can see the ranges we used to sweep over in our experiments.
|Intrinsic reward scale|
|Number of mixtures|
|Cross Mixture Ratio|
|# Episodes w/o wiping Episodic Memory|
|Game||R2D2(Retrace)||NGU(32) eval beta=0.0||NGU(32) eval beta=0.3|
|up n down||678.8k ± 1.3k||620.1k ± 13.7k||575.2k ± 10.4k|
|wizard of wor||120.2k ± 7.6k||106.2k ± 7.0k||85.1k ± 12.3k|
|kung fu master||220.7k ± 3.9k||212.1k ± 11.2k||203.2k ± 10.8k|
|name this game||70.6k ± 8.3k||23.9k ± 0.5k||15.6k ± 0.3k|
In this section we evaluate properties of the learned controllable states. We further present a study of the performance of the algorithm when having access to oracle controllable states containing only the necessary information. We use Montezuma’s Revenge as a case-study.
As explained in Section 2, we train the embedding network using an inverse dynamics model as done by Pathak et al. (2017). Intuitively, the controllable states should contain the information relevant to the action performed by the agent given two consecutive observations. However, they might contain other types of information as long as it can be easily ignored by our simple classifier.
As noted in Burda et al. (2018b), for this game, one can identify a novel state by using five pieces of information: the position of the player, a room identifier, the level number, and the number of keys held. This information can be easily extracted from the RAM state of the game as described in Section I.3 below. One question that we could ask is whether this information is present (or easily decodable) in the learned controllable state. We attempted to answer this question by training a linear classifier to predict the (x, y) coordinates and the room identifier from the learned controllable state. Importantly, we do not backpropagate the errors to the embedding network. Figure 19 shows the average results over the episodes as the training of the agent progresses. We can see that the squared error in predicting the position of the agent stabilises to a more or less constant value, which suggests that it can successfully generalise to new rooms (we do not observe an increase in the error when new rooms are discovered). The magnitude of the error is of the order of 12 units, which is less than 10% of the range (see Section I.3). This is to be expected, as it is the most important information for predicting which action was taken. It shows that the information is quite accessible and probably has a significant influence on the proposed novelty measure. The room identifier, on the other hand, is information that is not necessary to predict the action taken by the agent. Unlike the previous case, one can see jumps in the error as training progresses and the problem becomes harder. It stabilises around an error slightly above 20%, which is reasonably good considering that the error of random chance is 96%. This means that even if there is nothing specifically encouraging this information to be there, it is still present and in turn can influence the proposed novelty signal.
An avenue of future work is to research alternative methods for learning controllable states that directly aim to retain all relevant information. While very good results can be obtained with the simple alternative of an inverse dynamics model, it is reasonable to think that better results could be attained when using a better crafted one. To inform this question, we investigate in the next section what results we could obtain if we explicitly use as controllable states the quantities that we were trying to predict in this section.
In the previous section we analysed the properties of the learned controllable states. A valid question to ask is: how would NGU work if we had access to an oracle controllable state containing only the relevant information? This analysis is a form of upper bound on performance for a given agent architecture. We ran the NGU(N=1) model with two ablations: without RND and without extrinsic rewards. Instead of resetting the memory after every episode, we do it after a small number of consecutive episodes, which we call a meta-episode. This structure plays an important role when the agent faces irreversible choices. In this setting, approaches using non-episodic exploration bonuses are even more susceptible to the “detachment” problem described in Ecoffet et al. (2019). The agent might switch between alternatives without having exhausted all learning opportunities, rendering the initial option uninteresting from a novelty perspective. The episodic approach with a meta-episode of length one would be forced to make similar choices. However, when run with multiple episodes it can offer an interesting alternative. In the first episode, the agent starts with an empty episodic memory and can choose arbitrarily one of the options. In the second episode, the episodic memory contains all the experience collected in the first episode. The agent is then rewarded for not repeating the strategy followed in the first one, as revisiting those states will lead to lower intrinsic reward. Thus, the agent is encouraged to learn diverse behaviour across episodes without needing to choose between alternatives or being susceptible to the detachment problem. Results are summarized in Fig. 19. We report the average episodic return (left) as well as the average number of visited rooms per meta-episode (right). The model achieves higher scores than the one using learned controllable states (as reported in Section 4.2).
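The meta-episode mechanism amounts to wiping the episodic memory only every few episodes rather than at every episode boundary. A minimal sketch, with hypothetical class and method names, could look as follows:

```python
class MetaEpisodeMemory:
    """Episodic memory that persists across a fixed number of consecutive episodes."""

    def __init__(self, episodes_per_meta_episode: int):
        if episodes_per_meta_episode < 1:
            raise ValueError("a meta-episode needs at least one episode")
        self._episodes_per_meta_episode = episodes_per_meta_episode
        self._episodes_done = 0
        self.embeddings = []

    def add(self, embedding):
        """Record the controllable state visited at the current step."""
        self.embeddings.append(embedding)

    def end_episode(self) -> bool:
        """Call at every episode boundary; returns True if the memory was wiped."""
        self._episodes_done += 1
        if self._episodes_done % self._episodes_per_meta_episode == 0:
            self.embeddings.clear()
            return True
        return False


if __name__ == "__main__":
    memory = MetaEpisodeMemory(episodes_per_meta_episode=3)
    for episode in range(6):
        memory.add(("state seen in episode", episode))
        wiped = memory.end_episode()
        print(episode, wiped, len(memory.embeddings))
```

With a meta-episode of length one this reduces to the usual per-episode reset. With longer meta-episodes, states visited in earlier episodes remain in memory and lower the intrinsic reward for revisiting them, which is what pushes the agent towards a different strategy in each episode.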
Incorporating long-term novelty in the exploration bonus encourages the agent to concentrate on the less explored areas of the environment. Similarly to what we observed with learned controllable states, this provides a boost both in data efficiency and in final performance, obtaining close to 15,000 average return and visiting an average of 25 rooms per episode. In this run, three out of five seeds reach the second level of the game, one of which reaches the third level with an average of fifty different rooms per episode. We also observe that, when running in the absence of extrinsic rewards, the agent remarkably still achieves a very high extrinsic reward. Secondly, the agent is able to consistently reach a large number of rooms and explore more than rooms without any extrinsic guidance.
As noted in Burda et al. (2018b), in Montezuma’s Revenge each level contains doors and keys. If the agent walks through a door holding a key, it receives a reward of consuming the key in the process. In order to clear a level, the agent needs to open two doors located just before the final room. During exploration, the agent needs to hold on to two keys to see what it could do with them later in the episode, sacrificing the immediate reward of opening more accessible doors. Any agent that acts almost greedily will struggle with what looks like a high-level choice. With the right representation and using meta-episodes, our method can handle this problem in an interesting way. When the number of keys held is represented in the controllable state, the agent chooses a different key-door combination on each of the three episodes in which we do not wipe our episodic memory. At the end of training, in the first episode after wiping the episodic memory, our agent shows a score of , while in the third episode the agent shows a score of , exploring on average over 30 rooms and consistently going to the second level (see video of the three episodes at https://sites.google.com/view/nguiclr2020). The agent learns a complex exploratory policy spanning several episodes that can handle irreversible choices and overcome “distractor” rewards. We do not observe different key-door combinations across episodes when using learned controllable states. Presumably the signal of the number of held keys in the learned controllable states is not strong enough to treat them as sufficiently different.
The results described in this section support the idea that significant gains can be obtained by improving the representation of the controllable states, suggesting that the study of learning better representations is an interesting line for future work. Recent works have explored ways of measuring novelty by learning controllable aspects of an environment (Kim et al., 2018; Warde-Farley et al., 2018), and we believe that some of these ideas could also be useful in this setting.
We obtain the hand-crafted features for Montezuma’s Revenge by observing the RAM state of the game at every time step. More concretely:
x and y can be observed at positions 0xAA and 0xAB respectively, represented by integers with a range of .
Room id and level number can be found in positions 0x83 and 0xB9 respectively. We provide this information as a single integer to our agent in the form of where is the room id, and is the level number.
Byte 0xC1 is the player’s inventory. We count the number of keys being held (and provide this information to the agent) by adding the bits , which correspond to the binary slots for keys.
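The extraction described in the three items above can be sketched as a small function over a 128-byte RAM snapshot. Two assumptions are made here and should be checked against the emulator in use: that the listed positions are memory-map addresses offset by 0x80 from the start of the snapshot, and that the inventory bits corresponding to keys are supplied by the caller as a mask (the `key_bit_mask` argument and the mask value in the demo are hypothetical, since the specific bits are not given above). The room id and level number are returned separately rather than combined into a single integer.

```python
RAM_BASE = 0x80  # assumption: address 0x80 maps to index 0 of the RAM snapshot

ADDR_X = 0xAA
ADDR_Y = 0xAB
ADDR_ROOM = 0x83
ADDR_LEVEL = 0xB9
ADDR_INVENTORY = 0xC1


def read_byte(ram, address: int) -> int:
    """Read one byte from a 128-byte RAM snapshot, given a memory-map address."""
    return ram[address - RAM_BASE]


def montezuma_features(ram, key_bit_mask: int) -> dict:
    """Extract position, room, level and number of keys from the RAM state.

    `key_bit_mask` selects the inventory bits that represent key slots; its
    value is not specified in the text and must be provided by the caller.
    """
    inventory = read_byte(ram, ADDR_INVENTORY)
    return {
        "x": read_byte(ram, ADDR_X),
        "y": read_byte(ram, ADDR_Y),
        "room": read_byte(ram, ADDR_ROOM),
        "level": read_byte(ram, ADDR_LEVEL),
        # Count how many of the selected key bits are set.
        "num_keys": bin(inventory & key_bit_mask).count("1"),
    }


if __name__ == "__main__":
    ram = [0] * 128
    ram[ADDR_X - RAM_BASE] = 77
    ram[ADDR_Y - RAM_BASE] = 235
    ram[ADDR_ROOM - RAM_BASE] = 1
    ram[ADDR_INVENTORY - RAM_BASE] = 0b00010100
    # Placeholder mask, used only to exercise the function.
    print(montezuma_features(ram, key_bit_mask=0b00011100))
```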