On Catastrophic Interference in Atari 2600 Games

02/28/2020 · by William Fedus, et al.

Model-free deep reinforcement learning algorithms suffer from poor sample efficiency: learning reliable policies generally requires a vast amount of interaction with the environment. One hypothesis is that catastrophic interference between various segments within the environment is an issue. In this paper, we perform a large-scale empirical study on the presence of catastrophic interference in the Arcade Learning Environment and find that learning particular game segments frequently degrades performance on previously learned segments. In what we term the Memento observation, we show that an identically parameterized agent spawned from a state where the original agent plateaued reliably makes further progress. This phenomenon is general: we find consistent performance boosts across architectures, learning algorithms and environments. Our results indicate that eliminating catastrophic interference can contribute towards improved performance and data efficiency of deep reinforcement learning algorithms.


1 Introduction

Despite many notable successes in deep reinforcement learning (RL), most model-free algorithms require huge numbers of samples to learn effective and useful policies (Mnih et al., 2015; Silver et al., 2016). This inefficiency is a central challenge in deep RL that limits the deployment of these methods. Our most successful algorithms rely on environment simulators that permit an unrestricted number of interactions (Bellemare et al., 2013; Silver et al., 2016), and many algorithms are specifically designed to churn through as much data as possible via parallel actors and hardware accelerators (Nair et al., 2015; Espeholt et al., 2018). The sample efficiency of an algorithm is quantified by the number of interactions with the environment required to achieve a certain level of performance. In this paper, we empirically examine the hypothesis that poor sample efficiency is in part due to catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999).

Catastrophic forgetting is a well-known phenomenon across machine learning (Kirkpatrick et al., 2017; Li and Hoiem, 2017; Lopez-Paz and Ranzato, 2017). It is well established that as deep neural networks learn various tasks, the associated weight changes rapidly impair progress on previous tasks, even if the tasks share no logical connection. However, invoking it as a source of poor sample efficiency is a potentially counter-intuitive explanation because catastrophic forgetting emerges across different tasks. Here we show that it also arises within a single environment as we probe across the Arcade Learning Environment (ALE) (Bellemare et al., 2013).

We present the Memento observation (a reference to the eponymous psychological thriller, in which the protagonist suffers from amnesia and must deduce a reasonable plan with no memory of how he arrived or even of his initial goal) as the first support of this hypothesis. Here an agent is trained and the state(s) associated with its maximum performance is saved. Then we initialize an identical agent with the weights of the original and launch it from this state; in doing so, we find the algorithm reliably makes further progress, even if the original had plateaued. This dynamic consistently holds across architectures and algorithms including DQN (Mnih et al., 2013, 2015), Rainbow (Hessel et al., 2018) and Rainbow with intrinsic motivation (Bellemare et al., 2014, 2016). We refute that this performance boost is simply the result of additional training or of higher model capacity: in both cases, the original agent fails either to achieve the same level of performance or to match the equivalent sample efficiency. Crucially, non-interference between the two learners is identified as the salient difference.

We next establish the concept of a “task” within a game by using the game score as a task contextualization (Jain et al., 2019). Game score, though not a perfect demarcation of task boundaries, is an easily extracted proxy with reasonable properties. Changes in game score directly impact the temporal-difference (TD) error through the reward (Sutton and Barto, 1998) and often represent milestones such as acquiring or manipulating an object, successfully navigating a space, or defeating an enemy. We show that training on certain contexts causes prediction errors in other contexts to change unpredictably. In some environments, learning about one context generalizes to other segments, but in others, the prediction errors worsen unpredictably. These experiments reveal that individual games give rise to learning interference issues more commonly associated with continual and multitask learning (Parisi et al., 2019).

Our paper links catastrophic forgetting to issues of sample efficiency, performance plateaus, and exploration. To do so, we present evidence through the Memento observation and show that it generalizes across architectures, algorithms and games. We refute that this is attributable to longer training duration or additional model capacity. We then probe deeper into these interference dynamics by using the game score as a context. This reveals that learning on particular contexts of the game often catastrophically interferes with prediction errors in others. Cross-context analysis further corroborates our results: in games where the Memento agent (which eliminates interference by construction) provides limited improvement, we also measure lower cross-context catastrophic interference.

2 Background

Reinforcement learning conventionally models the environment as a Markov Decision Process (MDP). We reiterate the details presented by Sutton and Barto (1998) here. The MDP is defined as the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the state space of the environment and $\mathcal{A}$ is the action space, which may be discrete or continuous. The learner is not explicitly given the environment transition probability $P(s_{t+1} \mid s_t, a_t)$ for going from $s_t$ to $s_{t+1}$ given $a_t$, but samples from this distribution are observed. The environment emits a bounded reward $r(s_t, a_t)$ on each transition, and $\gamma \in [0, 1)$ is the discount factor which defines a time-preference for short-term versus long-term rewards and creates the basis for a discounted utility model. Generally, the transition function satisfies the Markov property: the transition to state $s_{t+1}$ is conditionally independent of all previous state-action sequences given knowledge of state $s_t$.

A policy $\pi$ is a mapping from states to actions. The state-action value function $Q^{\pi}(s, a)$ is the expected discounted return after taking action $a$ in state $s$ and then following policy $\pi$ until termination:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right] \qquad (1)$$

where $s_0 = s$ and $a_0 = a$.

Original Q-learning algorithms (Watkins and Dayan, 1992) were proposed in tabular settings. In tabular RL, where each state and action are treated as independent positions, the algorithm generally requires examining each state-action pair in order to build accurate value functions and effective policies. Function approximation was introduced to parameterize value functions so that estimates could generalize, requiring fewer interactions with the environment (Sutton and Barto, 1998).
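
To make the tabular setting concrete, the sketch below implements the standard one-step Q-learning update; the toy state and action indices, step size, and discount are illustrative assumptions rather than settings used in this paper.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the bootstrapped target."""
    td_target = r + gamma * np.max(Q[s_next])   # one-step bootstrapped target
    Q[s, a] += alpha * (td_target - Q[s, a])    # update toward the target by the TD error
    return Q

# Illustrative usage on a toy MDP with 5 states and 2 actions.
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```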

Non-linear, deep function approximation was shown to produce superhuman performance in the Arcade Learning Environment with Deep Q-Networks (DQN) (Mnih et al., 2013, 2015), ushering in the field of deep reinforcement learning. This work combined Q-learning with neural network function approximation to yield a scalable reinforcement learning algorithm. Experience replay (Lin, 1992) is a technique where the agent’s experience is stored in a sliding-window memory buffer. The agent then samples experience either uniformly or with a prioritized scheme (Schaul et al., 2015) for training the learner.
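
As a minimal sketch of the experience replay mechanism described above, the buffer below stores transitions in a sliding window and samples them uniformly; the capacity and batch size are arbitrary illustrative values, and prioritized replay (Schaul et al., 2015) would instead weight samples by TD error.

```python
import random
from collections import deque

class ReplayBuffer:
    """Sliding-window memory of transitions with uniform sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling; a prioritized scheme would bias toward high TD-error transitions.
        return random.sample(self.buffer, batch_size)
```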

3 The Memento Observation

Multi-task research in the ALE often builds upon the assumption that one task is one game and that multi-task learning corresponds to learning multiple games (Kirkpatrick et al., 2017; Fernando et al., 2017) or different game modes (Farebrother et al., 2018). However, we now question this assumption and probe the hypothesis that interference occurs within a single game. We begin with Montezuma’s Revenge, a difficult exploration game (Bellemare et al., 2013, 2016; Ali Taïga et al., 2019) that exhibits a composite objective structure (the learning objective can be broken into separate components) because the agent accrues rewards by navigating a series of rooms with enemies and obstacles. The presence of interference in deep RL algorithms was conjectured for games with composite structure, though not examined, by Schaul et al. (2019).

Limitations of existing algorithms. Ali Taïga et al. (2019) recently revisited the difficult exploration games of the ALE with the Rainbow agent (Hessel et al., 2018) and with recent bonus-based methods. Progress through these environments can be incentivized by providing the agent with exploration bonuses derived from notions of novelty (Bellemare et al., 2016), curiosity (Pathak et al., 2017) or distillation (Burda et al., 2018). Count-based exploration algorithms (Bellemare et al., 2016) maintain visit counts and provide reward bonuses for reaching unfrequented states. These methods demonstrate impressive performance beyond standard RL algorithms on hard-exploration games in the ALE. Ali Taïga et al. (2019) find that environments that previously taxed the DQN agent’s (Mnih et al., 2015) exploration capabilities are better handled when employing recent RL advances and the Adam optimizer (Kingma and Ba, 2014). However, consistent barriers are still observed in certain environments like Montezuma’s Revenge. For the Rainbow agent with an intrinsic motivation reward computed via a CTS model (Bellemare et al., 2014), a notable plateau is observed at a score of 6600 (Figure 2(a)).

(a) Initial position with score of 6600.
(b) Intermediate position with score of 8000.
(c) Maximum recorded score of 14500.
Figure 1: Frame from the start position in Montezuma’s Revenge for the Memento observation, with a game score of 6600 (a). A Rainbow agent with a CTS model does not reliably proceed past this point in the environment. However, a new agent with identical architecture which starts from this position reliably makes further progress from here. Repeated resets yielded maximum scores of 14500 (right).
(a) Rainbow CTS agent in Montezuma’s Revenge. Each black line represents a run from each of the five seeds and the dotted black line is the maximum achieved score of 6600.
(b) Memento observation in Montezuma’s Revenge. Both a fresh agent initialized with random weights (blue) or with weights from the original model (orange) reliably make further progress.
Figure 2: The Memento observation in Montezuma’s Revenge. A fresh and randomly initialized Rainbow CTS agent (blue) makes reliable progress from the previous position of score 6600 which was a barrier. Additionally, we observe that instead launching a new agent with the original parameters (orange) yields no weight transfer benefit over a randomly initialized agent.

Breaking the barrier. However, if a fresh agent with the same initial weights is launched from the state where the last agent left off, it makes reliable further progress: repeated resets recorded maximum scores of up to 14,500 in Montezuma’s Revenge. The second agent begins each of its episodes from the final position of the last. As the weights of the second agent are independent of the original, its learning progress and weight updates do not interfere with the original agent. Furthermore, in Montezuma’s Revenge we find that starting the second agent from a random initialization (so that its weights encode no past context) performs equivalently. We refer to this as the Memento observation and demonstrate in Appendix A.2 that it holds on other hard exploration Atari games including Gravitar, Venture, and Private Eye.

This observation might be hypothesized to simply be the result of longer training or of additional network capacity, but we refute both explanations in Appendix A.1. There, we find that neither longer training nor additional model capacity is sufficient to improve past the previous score plateau of 6600. Instead, we posit that the success of the Memento agent is consistent with catastrophic forgetting interfering with the ability of the original agent to make progress (McCloskey and Cohen, 1989; French, 1999). The original agent’s replay buffer does contain samples exceeding the game score plateau (6600+), which suggests that the exploration policy is not the limiting factor. Rather, the agent is unable to integrate this new information and learn a value function in the new region without degrading performance in the first region, causing these high-reward samples to be lost from the experience replay buffer. Catastrophic interference may therefore be linked with exploration challenges in deep RL: integrating a newly received reward via TD-backups in an environment with a sparse reward signal may interfere with the past policy. In this section we have established evidence for a Rainbow agent with intrinsic rewards on hard exploration games of the ALE using a manually selected reset position. In Section 4 we seek to understand whether this phenomenon applies more broadly.

Experience replay and catastrophic interference. Replay buffers with prioritized experience replay (Schaul et al., 2015), which sample states with higher temporal-difference (TD) error more frequently, can exacerbate catastrophic interference issues. Figure 3 records the training dynamics of the Rainbow agent in Montezuma’s Revenge. Each plot is a separate time series recording which context(s) the agent is learning from as it trains in the environment. The agent clearly iterates through stages of the environment indexed by the score, starting by learning exclusively on context 0, then context 1, and so on. The left column shows the early learning dynamics and the right column shows the resulting oscillations after longer training. This is antithetical to approaches for addressing catastrophic forgetting that suggest shuffling across tasks and contexts (Ratcliff, 1990; Robins, 1995).

Figure 3: We plot how often the first five game contexts of Montezuma’s Revenge are sampled from the replay buffer throughout training. Early in training (left column) the agent almost exclusively trains on the first stage (because no later stages have been discovered). In intermediate stages, the agent is trained on all the contexts at the same time (which can lead to interference), and in late stages, it is trained on only the last stage (which can lead to catastrophic forgetting). After discovering all contexts (right column), the agent oscillates between sampling contexts.

These dynamics are reminiscent of continual learning (in contrast to the standard setting, these score contexts cannot be learned independently, since the agent must pass through earlier contexts to arrive at later ones), where updates from different sections may interfere with each other (Schaul et al., 2019). Prioritized experience replay, a useful algorithm for sifting through past experience and an important element of the Rainbow agent, naturally gives rise to continual learning dynamics. However, removing prioritized experience replay and instead sampling randomly and uniformly (Mnih et al., 2013, 2015) can still manifest as distinct data distribution shifts in the replay buffer. As the agent learns different stages of the game, frontier regions may produce more experience as the agent dithers during exploration.

4 Generalizing Across Agents and Games

To generalize this observation, we first propose a simple algorithm for selecting states associated with plateaus of the last agent. We train an agent and then sample a single trajectory from this agent, $\tau = (s_0, a_0, r_0, s_1, \ldots)$. We compute the return-to-date (game score) for each state, $R_t = \sum_{t' \le t} r_{t'}$, and choose the earliest timestep at which the game score is maximized: $t^* = \min\{t : R_t = \max_{t'} R_{t'}\}$. The state corresponding to this timestep, $s_{t^*}$, becomes the starting position for the Memento agent. In Appendix C we present alternative experiments launching the Memento agent from a diverse set of states consistent with the maximum score and find the observation also holds. Figure 4 is a schematic of this approach.
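
A minimal sketch of this reset heuristic, assuming a trajectory represented as a list of (state, action, reward) tuples; the representation and function name are illustrative, not taken from the paper's implementation.

```python
import itertools

def memento_start_state(trajectory):
    """Return the state at the earliest timestep where the game score is maximized.

    trajectory: list of (state, action, reward) tuples sampled from the trained agent.
    """
    rewards = [r for (_, _, r) in trajectory]
    returns_to_date = list(itertools.accumulate(rewards))  # R_t: cumulative undiscounted score
    t_star = returns_to_date.index(max(returns_to_date))   # earliest timestep attaining the max
    return trajectory[t_star][0]
```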

Figure 4: Process for finding plateau points based on return contextualization. We scan the replay buffer for the trajectory with the maximum game score. We extract this trajectory (states denoted as white circles and actions denoted as black dots) and compute the cumulative undiscounted return to date, labeled $R_t$. For the largest $R_t$ in the trajectory, we grab the corresponding state $s_{t^*}$, which then becomes the launch point for the Memento agent.

This automated reset heuristic allows us to more generally evaluate the Memento observation across other choices of agents and games. We analyze the observation across the ALE suite on two different agents: a Rainbow agent (without the intrinsic motivation reward model) and a DQN agent. The Rainbow agent we use (Castro et al., 2018) uses a prioritized replay buffer (Schaul et al., 2015), n-step returns (Sutton and Barto, 1998) and distributional learning (Bellemare et al., 2017). Figure 5 records the percentage improvement of the new agent over the original agent across the entire ALE suite, finding a significant +25.0% median improvement.

Figure 5: Each bar corresponds to a different game and the height is the percentage increase of the Rainbow Memento agent over the Rainbow baseline. Across the entire ALE suite, the Rainbow Memento agent improves performance on 75.0% of games with a median improvement of +25.0%.

Next, we consider ablating the three algorithmic improvements included in Rainbow and examine the base DQN agent. As with the Rainbow agent, we find that this phenomenon generalizes: the DQN Memento agent achieves a +17.0% median improvement over the baseline agent (see Figure 16(b) for game-by-game detail), supporting that these dynamics hold across different types of value-based agents.

Furthermore, we find that the Memento observation applies across a wide suite of games, including Asterix and Breakout, games that do not share the same compositional narrative structure as the original observation on Montezuma’s Revenge and other hard exploration games. This supports the claim that interference issues are present across agents and across different games. In cases where the Memento agent fails to improve, the cause is typically a poor or inescapable initial state selected by our heuristic algorithm. We did not attempt to further improve this algorithm because our simple criterion already produced sufficient evidence for the interference issues (though further performance could likely be achieved through more sophisticated state selection).

As in Section 3, we refute that training duration and capacity are sufficient to explain the performance boost of the generalized Memento phenomenon. Figure 13(a) rules out training duration as the primary cause of the Memento observation: training the original agent for 400M frames (double the conventional duration) results in only a +3.3% median improvement for the Rainbow agent (versus +25.0% for the Memento observation). Next, in Figure 13(b), we find that increased model capacity is also not the primary cause, as training a double-capacity Rainbow agent yields an improvement of only +7.4% (versus +25.0% for the Memento agent).

Figure 6 shows how the performance gap widens over training duration. The Memento agent is far more sample efficient than training the baseline Rainbow agent longer. The Memento agent achieves the same median performance increase in only 5M frames compared to 200M frames for the longer-training Rainbow agent.

Figure 6: Performance of the Memento agent (blue) versus a standard Rainbow agent trained for 400M frames (red) and a double capacity Rainbow agent trained for 400M frames (green). We find that the Memento agent is considerably more sample efficient. It reaches the peak performance of both longer-training Rainbow agents achieved over 200M frames in <25M frames.

5 Catastrophic Interference

The Memento observation showed improvement in hard exploration games with an intrinsically motivated Rainbow agent (Section 3) and, more generally, across various learning algorithms on a diverse set of games in the ALE suite (Section 4). As the Memento agent is constructed to separate learning updates between the new and original agents, these results support the hypothesis that the Memento agent realizes gains in sample efficiency and performance because it reduces interference between sections of the game. We now investigate this hypothesis at a finer level of detail: specifically, we measure the TD errors at different contexts as the agent learns about particular sections of the game.

As introduced in Mnih et al. (2015), our agents use a target network to produce the Bellman targets for the online network. During training the target network parameters $\theta^{-}$ are held semi-fixed, updating only after every $k$ gradient updates, and the online network always minimizes the TD errors against the semi-fixed target network (in our implementation, $k$ online network updates). The non-distributional TD loss for a sampled experience tuple $(s_t, a_t, r_t, s_{t+1})$ is given by

$$\mathcal{L}(\theta) = \ell_{H}\!\left(r_t + \gamma \max_{a'} Q_{\theta^{-}}(s_{t+1}, a') - Q_{\theta}(s_t, a_t)\right)$$

where $\ell_{H}$ refers to the Huber loss. We refer readers to Bellemare et al. (2017) for the distributional equivalent. In the interim period while the target network is held fixed, the optimization of the online network parameters $\theta$ reduces to a strictly supervised learning setting, eliminating several complications present in RL. The primary complication avoided is that, for any given experience, the Bellman targets do not change as the online network evolves. In the general RL setting with bootstrapping through TD-learning, the targets must necessarily change, otherwise the online network would simply regress to the $Q$-values established with the initial target parameters $\theta^{-}$. We propose analyzing this simplified regime in order to carefully examine the presence of catastrophic interference.
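
A sketch of this loss with a frozen target network, written against a generic PyTorch-style Q-network interface; the tensor shapes and module names are assumptions, and Rainbow's distributional variant differs.

```python
import torch
import torch.nn.functional as F

def td_loss(online_net, target_net, batch, gamma=0.99):
    """Huber (smooth L1) loss on one-step TD errors with semi-fixed targets."""
    states, actions, rewards, next_states, dones = batch
    q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q_theta(s_t, a_t)
    with torch.no_grad():  # targets use the frozen parameters theta^-
        q_next = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * q_next
    return F.smooth_l1_loss(q, targets)
```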

Low Catastrophic Interference.

(a) Train on context 0.
(b) Train on context 5.
(c) Train on context 10.
(d) Train on context 15.
Figure 7: We track the TD errors over the first twenty contexts in Pong for a Rainbow agent when we train on a particular context. The left column shows the absolute change in TD errors and the right column shows the relative change in TD errors.

We begin with an analysis of Pong in Figure 7. While keeping the target network parameters fixed, we measure the TD errors across all contexts after training for $k$ steps on samples from one of the contexts in the agent’s replay buffer. The left column shows the absolute change in TD errors and the right column shows the relative change in TD errors. Each row corresponds to training on samples from a particular context, in this case, contexts 0, 5, 10, and 15. For Pong, a game that does not materially benefit from the Memento observation in either Rainbow or DQN, we find that learning about one context or game score generalizes very well to others (with an occasional aberration).

Figure 8(a) provides a compressed view of the relative changes in TD errors, where the $(i, j)$-th entry corresponds to the relative change in TD error for context $j$ when the model is trained on context $i$. Red regions of the diagram correspond to interference and negative generalization, and blue regions correspond to positive generalization between contexts. For these two games, the diagrams are overwhelmingly blue, indicating positive generalization between most regions.
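
A sketch of how such a cross-context matrix can be assembled, assuming helper callables that (i) measure the mean TD error of a context under the current online network and (ii) run a fixed number of gradient steps on samples from a single context with the target network held fixed, restoring the online parameters between rows; both callables are assumptions for illustration.

```python
import numpy as np

def interference_matrix(contexts, mean_td_error, train_on_context):
    """Relative change in TD error for context j (columns) after training on context i (rows).

    contexts:          list of game-score context ids, e.g. [0, 1, 2, 3, 4].
    mean_td_error:     callable, context -> mean TD error under the current online network.
    train_on_context:  callable, context -> None; trains briefly on that context with the
                       target network fixed (caller resets online parameters between rows).
    """
    n = len(contexts)
    M = np.zeros((n, n))
    for i, ci in enumerate(contexts):
        before = np.array([mean_td_error(cj) for cj in contexts])
        train_on_context(ci)
        after = np.array([mean_td_error(cj) for cj in contexts])
        M[i] = (after - before) / (np.abs(before) + 1e-8)  # < 0: generalization, > 0: interference
    return M
```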

(a) TD errors across game score contexts in Pong
(b) TD errors across game score contexts in QBert.
Figure 8: For Pong and Qbert, learning in one context usually generalizes to other contexts (improved relative TD-errors in blue). We find no strong evidence of catastrophic interference in Pong. Qbert also generalizes across local contexts, however, a red column for the 0th context indicates some interference effects.

We consider another game that fared poorly under the Memento observation for both the Rainbow and DQN agents, Qbert (Figure 16(b)). Given that the Memento observation was not beneficial, a testable hypothesis is that we should observe low degrees of interference. Corroborating this, in Figure 8(b) we find that training on contexts typically generalizes. However, we find a consistent negative impact on the 0th context, suggesting that the beginning of the game (context 0) materially differs from later sections. This is evidenced by the red column showing degradation of performance in this section as others are learned.

High Catastrophic Interference. We have now seen that learning about one portion of Pong or Qbert typically improves the prediction error in other contexts, that is, learning generalizes. However, we now show that this is not generally the case. Figure 9 revisits our running example of Montezuma’s Revenge and continues with the same experiment. We select an agent that has reached the first five game score contexts, corresponding to milestones including collecting the first key and exiting the initial room. In contrast to Pong, we find that learning about certain contexts in Montezuma’s Revenge strictly interferes with the prediction errors in other contexts.

(a) Train on context 0.
(b) Train on context 1.
(c) Train on context 2.
(d) Train on context 3.
(e) Train on context 4.
Figure 9: We track the TD errors over the first five contexts in Montezuma’s Revenge for a Rainbow agent while training on a particular context. The left column shows the absolute change in TD errors and the right column shows the relative change in TD errors. As expected, training on a context reduces the TD error for that context; however, note that learning one context comes at the expense of worse prediction errors on others.

Figure 10(a) shows the corresponding TD-error reduction matrix; as expected, training on samples from a particular context reduces the TD errors of that context, as observed in the blue diagonal. However, the red off-diagonals indicate that training on samples from one context negatively interferes with predictions in all the other contexts. Issues of catastrophic forgetting also extend to games where it might be unexpected, such as Breakout. Figure 10(b) shows that a Rainbow agent trained there interferes unpredictably with other contexts. Interestingly, we also note an asymmetry: training on later contexts produces worse backward transfer (bottom-left), while training on earlier contexts does not degrade forward transfer as severely (top-right) (Lopez-Paz and Ranzato, 2017).

(a) TD errors across game score contexts in Montezuma’s Revenge
(b) TD errors across game score contexts in Breakout.
Figure 10: For both Montezuma’s Revenge and Breakout we find that learning almost always only improves the TD errors on that context (blue diagonals) at the expense of other contexts (red off-diagonals). These analyses suggest a high degree of catastrophic interference.

6 Related Work

Catastrophic Interference:

To control how memories are retained or lost, many works study the synaptic dynamics between different environment contexts. These methods either regularize or expand the parameter space. Regularization has been shown to discourage weight updates from overwriting old memories. Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) employs a quadratic penalty to keep parameters used in successive environments close together, and thus prevents unrelated parameters from being repurposed. Li and Hoiem (2017) impose a knowledge distillation penalty (Hinton et al., 2015) to encourage certain network predictions to remain similar as the parameters evolve. The importance of some memories can be encoded in additional weights, which are then used to scale penalties incurred when those parameters change (Zenke et al., 2017). In general, regularization is effective when a low-loss region exists at the intersection of each context’s parameter space. When these regions are disjoint, however, it can be more effective to freeze weights (Sharif Razavian et al., 2014) or to replicate the entire network and augment it with new features for new settings (Rusu et al., 2016; Yoon et al., 2018; Draelos et al., 2017). Our work studies model expansion in the context of Atari, with the hypothesis that little overlap exists between the low-loss regions of different game contexts. However, some games, such as Pong, exhibit considerable overlap.
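
For concreteness, a sketch of the EWC-style quadratic penalty described above, assuming per-parameter diagonal Fisher estimates and a snapshot of the parameters from the earlier context are available; this illustrates the general idea rather than a configuration used in our experiments.

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=1.0):
    """Quadratic penalty discouraging changes to parameters important for earlier contexts.

    old_params: dict name -> parameter tensor saved after training on the earlier context.
    fisher:     dict name -> diagonal Fisher estimate (per-parameter importance).
    """
    loss = torch.zeros(())
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss  # added to the task loss during training on the new context
```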

Knowledge of the environment context is often a prerequisite to regularization or model expansion. Given a fixed strategy to mitigate interference, some works consider the problem of finding the best contextualization. Rao et al. (2019) proposed a variational method to learn context representations which apply to a set of shared parameters. Similar to the Forget-Me-Not Process (Milan et al., 2016), the model is dynamically expanded as new data is experienced that cannot be explained by the current model. Aljundi et al. (2019) propose a similar approach using regularization. Our work posits a context model that marks the boundaries between environment settings with the game score.

Continual Learning:

Multi-task and continual learning in reinforcement learning is a highly active area of study (Ring, 1994; Silver et al., 2013; Oh et al., 2017), with some works proposing modular architectures; however, multi-task learning within a single environment is a newer area of research. Schaul et al. (2019) describe the problem of ray interference in multi-component objectives for bandits, and show that in this setting, learning exhibits winner-take-all dynamics with performance plateaus and learning constrained to only one component. In the context of diverse initial state distributions, Ghosh et al. (2017) find that policy-gradient methods myopically consider only a subset of the initial-state distribution to improve on when using a single policy. The Memento observation is closely connected to Go-Explore (Ecoffet et al., 2019), which demonstrates the efficacy of resetting to the boundary of the agent’s knowledge and thereafter employing a basic exploration strategy to make progress in difficult exploration games. Finally, recent work in unsupervised continual learning considers a case similar to our setting, where the learning process is effectively multitask but there are no explicit task labels (Rao et al., 2019).

7 Discussion

This research provides evidence of a difficult continual learning problem arising within a single game. The challenge posed by catastrophic interference in the reinforcement learning setting is greater than in standard continual learning for several reasons. First, in this setting we are provided no explicit task labels (Rao et al., 2019). Our analyses used the game score as a proxy label to reveal continual learning dynamics, but alternative task designations may further clarify the nature of these issues. Next, adding a further wrinkle to the problem, the "tasks" exist as part of a single MDP and therefore generally must transmit information to other tasks: TD errors in one portion of the environment directly affect TD errors elsewhere via Bellman backups (Bellman, 1957; Sutton and Barto, 1998). Finally, although experience replay (Mnih et al., 2013, 2015) resembles shuffling task examples, we find that, unlike in the standard setting, it is not sufficient to combat catastrophic forgetting. We posit that this is likely due to complicated feedback effects between data generation and learning.

8 Conclusion

We strengthen connections between catastrophic forgetting and central issues in reinforcement learning, including poor sample efficiency, performance plateaus (Schaul et al., 2019) and exploration challenges (Ali Taïga et al., 2019). Schaul et al. (2019) hypothesized that interference patterns might be observed in a deep RL setting for games like Montezuma’s Revenge. Our empirical studies not only confirm this hypothesis, but also show that the phenomenon is more prevalent than previously conjectured. Both the Memento observation and our analysis of inter-context interference illuminate the nature and severity of interference in deep reinforcement learning. Our findings also suggest that prior beliefs about what constitutes a "task" may be misleading and must therefore be carefully examined. We hope this work provides a clear characterization of the problem and shows that it has far-reaching implications for many fundamental problems in reinforcement learning.

Acknowledgements

This work stemmed from surprising initial results, and our understanding was honed through many insightful conversations at Google and Mila. In particular, the authors would like to thank Marlos Machado, Rishabh Agarwal, Adrien Ali Taïga, Margaret Li, Ryan Sepassi, George Tucker and Mike Mozer for helpful discussions and contributions. We would also like to thank the reviewers at the Biological and Artificial Reinforcement Learning workshop for constructive reviews of an earlier manuscript.

References

  • A. Ali Taïga, W. Fedus, M. C. Machado, A. Courville, and M. G. Bellemare (2019) Benchmarking bonus-based exploration methods on the arcade learning environment. arXiv preprint arXiv:1908.02388. Cited by: §3, §3, §8.
  • R. Aljundi, K. Kelchtermans, and T. Tuytelaars (2019) Task-free continual learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §6.
  • M. G. Bellemare, W. Dabney, and R. Munos (2017) A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887. Cited by: §4, §5.
  • M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: §1, §1, §3.
  • M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479. Cited by: §1, §3, §3.
  • M. Bellemare, J. Veness, and E. Talvitie (2014) Skip context tree switching. In International Conference on Machine Learning, pp. 1458–1466. Cited by: §1, §3.
  • R. Bellman (1957) A markovian decision process. Journal of Mathematics and Mechanics 6 (5), pp. 679–684. Cited by: §7.
  • Y. Burda, H. Edwards, A. Storkey, and O. Klimov (2018) Exploration by random network distillation. arXiv preprint arXiv:1810.12894. Cited by: §3.
  • P. S. Castro, S. Moitra, C. Gelada, S. Kumar, and M. G. Bellemare (2018) Dopamine: a research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110. Cited by: §4.
  • T. J. Draelos, N. E. Miner, C. C. Lamb, J. A. Cox, C. M. Vineyard, K. D. Carlson, W. M. Severa, C. D. James, and J. B. Aimone (2017) Neurogenesis deep learning: extending deep networks to accommodate new classes. 2017 International Joint Conference on Neural Networks (IJCNN), pp. 526–533. Cited by: §6.
  • A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune (2019) Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995. Cited by: §6.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) Impala: scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561. Cited by: §1.
  • J. Farebrother, M. C. Machado, and M. Bowling (2018) Generalization and regularization in dqn. arXiv preprint arXiv:1810.00123. Cited by: §3.
  • C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra (2017) Pathnet: evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734. Cited by: §3.
  • R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3 (4), pp. 128–135. Cited by: §1, §3.
  • D. Ghosh, A. Singh, A. Rajeswaran, V. Kumar, and S. Levine (2017) Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874. Cited by: §6.
  • M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §3.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §6.
  • V. Jain, W. Fedus, H. Larochelle, D. Precup, and M. Bellemare (2019) Algorithmic improvements for deep reinforcement learning applied to interactive fiction. Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). Cited by: §1.
  • D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, pp. 201611835. Cited by: §1, §3, §6.
  • Z. Li and D. Hoiem (2017) Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence 40 (12), pp. 2935–2947. Cited by: §1, §6.
  • L. Lin (1992) Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning 8 (3-4), pp. 293–321. Cited by: §2.
  • D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pp. 6467–6476. Cited by: §1, §5.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of learning and motivation, Vol. 24, pp. 109–165. Cited by: §1, §3.
  • K. Milan, J. Veness, J. Kirkpatrick, M. Bowling, A. Koop, and D. Hassabis (2016) The forget-me-not process. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 3702–3710. External Links: Link Cited by: §6.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1, §2, §3, §7.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1, §1, §2, §3, §3, §5, §7.
  • A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. (2015) Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296. Cited by: §1.
  • J. Oh, S. Singh, H. Lee, and P. Kohli (2017) Zero-shot task generalization with multi-task deep reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2661–2670. Cited by: §6.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual lifelong learning with neural networks: a review. Neural Networks. Cited by: §1.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–17. Cited by: §3.
  • D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, and R. Hadsell (2019) Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, pp. 7645–7655. Cited by: §6, §6, §7.
  • R. Ratcliff (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions.. Psychological review 97 (2), pp. 285. Cited by: §3.
  • M. B. Ring (1994) Continual learning in reinforcement environments. Ph.D. Thesis, University of Texas at Austin Austin, Texas 78712. Cited by: §6.
  • A. Robins (1995) Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7 (2), pp. 123–146. Cited by: §3.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671. Cited by: §6.
  • T. Schaul, D. Borsa, J. Modayil, and R. Pascanu (2019) Ray interference: a source of plateaus in deep reinforcement learning. arXiv preprint arXiv:1904.11455. Cited by: §3, §3, §6, §8.
  • T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §2, §3, §4.
  • A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 806–813. Cited by: §6.
  • D. L. Silver, Q. Yang, and L. Li (2013) Lifelong machine learning systems: beyond learning algorithms. In 2013 AAAI spring symposium series, Cited by: §6.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484. Cited by: §1.
  • R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. Vol. 1, MIT press Cambridge. Cited by: §1, §2, §2, §4, §7.
  • C. J. Watkins and P. Dayan (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §2.
  • J. Yoon, E. Yang, J. Lee, and S. J. Hwang (2018) Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations, External Links: Link Cited by: §6.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. Cited by: §6.

Appendix A Rainbow Agent with CTS Additional Figures

a.1 Additional Training and Model Capacity

We consider training the Rainbow CTS model longer and with higher capacity.

(a) Result of training the Rainbow CTS agent for a longer duration. The black vertical line denotes the typical training duration of 200M frames. We find, for both the original agent (orange) and a sweep over other update horizons (blue, green), that no agent reliably exceeds the baseline.
(b) Result of training a Rainbow CTS agent with increased capacity equal to two separate Rainbow CTS agents, accomplished by increasing the filter widths. We find, for both the original agent (orange) and a sweep over other update horizons (blue, green), that no agent reliably exceeds the baseline.
Figure 11: In Montezuma’s Revenge, neither additional training (Figure 11(a)) nor additional model capacity (Figure 11(b)) leads to improved performance from the base agent. These results hold for various n-step returns or update horizons of 1, 3 (Rainbow default), and 10. The dotted-line is the maximum achieved baseline for the original agent, and all experiments are run with 5 seeds.

a.2 Rainbow with CTS Hard Exploration Games

We demonstrate the Memento observation for three difficult exploration games using the Rainbow + CTS agent.

(a) The Memento observation in Gravitar.
(b) The Memento observation in Venture.
(c) The Memento observation in PrivateEye.
Figure 12: Each black line represents a run from each of the five seeds and the dotted black line is the maximum achieved score. We find that the Memento observation holds across hard exploration games. PrivateEye results in a slight decrease in performance for the new agent (orange) due to the in-game count-down timer; however, we note that it is more stable compared to the baseline (blue) and preserves performance.

Appendix B Rainbow Agent Additional Figures

b.1 Additional Training and Model Capacity

We first present results of training the Rainbow agent for a longer duration and with additional capacity. In both settings, we find these variants fail to achieve the performance of the Memento observation.

(a) A Rainbow agent trained for twice the duration or 400M frames.
(b) A Rainbow agent trained with double network capacity.
Figure 13: A Rainbow agent trained for twice the duration or with twice the network capacity similarly fails to improve materially over the baseline. Training for 400M frames yields a median improvement of +3% and doubling the network capacity yields no median improvement. In contrast, the Memento agent improves over the baseline by +25.0%.

Appendix C Memento State Strategy

We consider the performance when the Memento agent instead starts from a set of states rather than a single state. This is a more difficult problem, demanding more generalization capability from the Memento agent, and thus the reported median performance is lower. We find, however, that the Memento observation also holds in this more challenging setting. Furthermore, we find that a higher fraction of games improve under multiple Memento start states, likely because this avoids the primary failure mode of starting exclusively from an inescapable position.
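
One simple way to assemble such a set of launch states, mirroring the single-state heuristic of Section 4, is sketched below; the trajectory format and selection rule are illustrative assumptions rather than the exact procedure used here.

```python
import itertools

def memento_start_states(trajectories):
    """From each trajectory attaining the global maximum score, take the earliest state at that score.

    trajectories: list of trajectories, each a list of (state, action, reward) tuples.
    """
    def returns_to_date(traj):
        return list(itertools.accumulate(r for (_, _, r) in traj))

    best = max(max(returns_to_date(t)) for t in trajectories)  # global maximum game score
    states = []
    for traj in trajectories:
        rtd = returns_to_date(traj)
        if max(rtd) == best:
            states.append(traj[rtd.index(best)][0])  # earliest step reaching the maximum
    return states
```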

(a) Single Memento state.
(b) Multiple Memento states.
Figure 14: Multiple Memento states. Each bar corresponds to a different game and the height is the percentage increase of the Rainbow Memento agent from a set of start states over the Rainbow baseline. The Rainbow Memento agent launched from a single state results in a 25.0% median improvement (left) which is greater than the 11.9% median improvement of launching from multiple states (right).

Appendix D Weight Transfer

We examine the importance of the original agent’s parameters for transfer to later sections of the game in the Memento observation. Our original observation relied on the original agent's weights. To examine the weight-transfer benefits of these parameters, we consider instead using a random initialization for the Memento agent. If the parameters learned in the early section of the game by the original agent generalize, we should observe faster progress and potentially better asymptotic performance. We measure this for both Rainbow and DQN.

(a) Rainbow Memento agent.
(b) DQN Memento agent.
Figure 15: We examine the impact of the initial parameters for the Rainbow Memento agent. The median improvement over the corresponding original agent is recorded for both random weight initialization (blue) as well as parameters from the original agent (red).

Interestingly, we find opposite conclusions for the Rainbow and DQN agents: the Memento Rainbow agent benefits from the original agent's weights, whilst the Memento DQN agent is better off randomly initialized. We also highlight that the gap between the two variants is not large. This indicates that the parameters learned from the earlier stages of the game may be of limited benefit.

Appendix E Memento in Rainbow versus DQN

(a) Rainbow Memento agent.
(b) DQN Memento agent.
Figure 16: Each bar corresponds to a different game and the height is the percentage increase of the Memento agent over its baseline. Across the entire ALE suite, the Rainbow Memento agent (left) improves performance on 75.0% of games with a median improvement of +25.0%. The DQN Memento agent (right) improves performance on 73.8% of games with a median improvement of +17.0%.

Appendix F Training Curves

Figure 17: Rainbow MEMENTO training curves. Blue is the baseline agent and orange is the MEMENTO agent launched from the best position of the previous (five seeds each).