The field of deep reinforcement learning (RL), while relatively young, has already demonstrated its value, from decision-making from perception (Mnih et al., 2015) to superhuman game-playing (Silver et al., 2016) to robotic manipulation (Levine et al., 2016). While most early RL research focused on illustrating specific aspects of decision-making, a defining characteristic of deep reinforcement learning research has been its emphasis on complex agent-environment interactions: for example, an agent navigating its way out of a video game maze, from vision, is a task difficult even for humans.
As a consequence of this shift towards more complex interactions, writing reusable software for deep RL research has also become more challenging. First, an “agent” is now a whole architecture: for example, OpenAI Baselines’s implementation of Mnih et al.’s Deep Q-Network (DQN) agent is composed of 6 different modules (Dhariwal et al., 2017). Second, there is now a wealth of algorithms to choose from, so being comprehensive in one’s implementation typically requires sacrificing simplicity. Most importantly, the growing diversity in deep RL research makes it difficult to foresee what software needs the next research project might have.
This paper introduces Dopamine (https://github.com/google/dopamine), a new TensorFlow-based framework that aims to support fundamental deep RL research. Dopamine emphasizes being compact rather than comprehensive: the first version is made of 12 Python files. These provide tested implementations of state-of-the-art, value-based agents for the Arcade Learning Environment (Bellemare et al., 2013). The code is designed to be easily understood by newcomers to the field, yet performant enough for research at scale. To further facilitate research, we provide interactive notebooks, trained models, and downloadable training data for all of our agents, including reproductions of previously published learning curves.
Our design choices are guided by the idea that different research objectives have different software needs. We support this point by reviewing some of the major developments in value-based deep RL, in particular those derived from the DQN agent. We identify different research objectives and discuss expectations of code that supports these objectives: architecture research, comprehensive studies, visualization, algorithmic research, and instruction. We have engineered Dopamine with the last two of these objectives in mind, and as such, we believe that it has a unique role to play in the deep RL frameworks ecosystem.
The paper is organized as follows. In Section 2 we identify common research objectives in deep reinforcement learning and discuss their particular software needs. In Section 3 we introduce Dopamine and argue for its effectiveness in addressing the needs surfaced in Section 2. In Section 4 we revisit the guidelines put forth by Machado et al. (2018) and discuss Dopamine’s adherence to them. In Section 5 we discuss related frameworks and provide concluding remarks in Section 6.
2 Software for Deep Reinforcement Learning Research
What makes a piece of code useful to research? The deep learning community has by now identified a number of operations critical to their research goals: component modularity, automatic differentiation, and visualization, to name a few (Abadi et al., 2016). Perhaps because it is a younger field, consensus on software has been elusive in deep RL. In this section we identify different aspects of deep RL research, supporting our categorization with an analysis of recent work. Our aim is not to be exhaustive but to highlight the heterogeneous nature of this research. From this analysis, we argue that different research aims lead to different software considerations.
To narrow the scope of our question, we restrict our attention to a subset of deep RL research:
Fundamental research in deep reinforcement learning,
applied to or evaluated on simulated environments.
The software needs of commercial applications are likely to be significantly different from those of researchers; similarly, real-world environments typically require tailored software infrastructure. In this paper we focus on the Arcade Learning Environment (ALE) as a mature, well-understood environment for this kind of research; we believe this section’s conclusions extend to similar environments, including continuous control (Duan et al., 2016; Tassa et al., 2018), first-person environments (Kempka et al., 2016; Beattie et al., 2016; Johnson et al., 2016), and to some extent real-time strategy games (Tian et al., 2017; Vinyals et al., 2017).
2.1 Case Study: The Deep Q-Network Architecture
We begin by taking a close look at the research genealogy of the deep Q-network agent (DQN). Through our review we identify distinct research objectives that recur in the field. DQN is a natural choice for this purpose: not only is it recognized as a milestone in deep reinforcement learning, but it has since been extensively studied and improved upon, providing us with a diversity of research work to study. We emphasize that the objectives we identify here are not mutually exclusive, and that our survey is by no means exhaustive – we highlight specific results because they are particularly clear examples.
What we call architecture research is concerned with the interaction between components, including network topologies, to create deep RL agents. DQN was innovative in its use of an agent architecture including target network, replay memory, and Atari-specific preprocessing. Since then, it has become commonplace, if not expected, that an agent is composed of multiple interacting components; consider A3C (Mnih et al., 2016), ACER (Wang et al., 2017), Reactor (Gruslys et al., 2018), and IMPALA (Espeholt et al., 2018).
Algorithmic research is concerned with improving the underlying algorithms that drive learning and behaviour in deep RL agents. Using the DQN architecture as a starting point, double DQN (van Hasselt et al., 2016) and gap-increasing methods (Bellemare et al., 2016a) both adapt the Q-Learning rule to be more statistically robust. Prioritized experience replay (Schaul et al., 2016) replaces the uniform sampling rule from the replay memory by one that more frequently samples states with high prediction error. Retrace computes a sum of n-step returns, weighted by a truncated correction ratio derived from policy probabilities (Munos et al., 2016). The dueling algorithm (Wang et al., 2016) separates the estimation of Q-values into advantage and baseline components. Finally, distributional methods (Bellemare et al., 2017; Dabney et al., 2018b) replace the scalar prediction made by DQN with a value distribution, and accordingly introduce new losses to the algorithm. In our taxonomy, algorithmic research is not tied to a specific agent architecture.
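To make the scale of such an algorithmic change concrete, the following is a minimal sketch of the difference between the DQN target and the double DQN target. It is not taken from any published implementation: the function names, the plain-list representation of Q-values, and the numbers are our own illustrative assumptions.

```python
def dqn_target(reward, gamma, next_q_target, terminal):
    """DQN target: bootstrap from the largest target-network Q-value."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, gamma, next_q_online, next_q_target, terminal):
    """Double DQN target: the online network selects the action,
    and the target network evaluates it."""
    if terminal:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for the next state, one entry per action.
next_q_online = [1.0, 2.0, 0.5]
next_q_target = [3.0, 1.5, 0.0]

print(dqn_target(1.0, 0.9, next_q_target, terminal=False))
print(double_dqn_target(1.0, 0.9, next_q_online, next_q_target, terminal=False))
```

In this toy case the DQN target bootstraps from the target network's largest value (action 0), whereas the double DQN target bootstraps from the target network's value for the action preferred by the online network (action 1), which yields a smaller target.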
Comprehensive studies look back at existing research to benchmark or otherwise study how existing methods perform under different conditions or hyperparameter settings. Hessel et al. (2018), as one well-executed example, provide an ablation study of the effect of six algorithmic improvements, among which are double DQN, prioritized experience replay, and distributional learning. This study, performed over 57 games from the ALE, provides definitive conclusions on the relative usefulness of these improvements. Comprehensive studies compare the performance of many agent architectures, or of different established algorithms within a given architecture.
Visualization is concerned with gaining a deeper understanding of deep RL methods through focused interventions, typically with an emphasis away from these methods’ performance. Guo et al. (2014) studied the visual patterns which maximally influenced hidden units within a DQN-like network. Mnih et al. (2015) performed a simple clustering analysis to understand the state dynamics within DQN. Zahavy et al. (2016) also used clustering methods within a simple graphical interface to study emerging state abstractions in DQN networks, and further visualized the saliency of various stimuli. Bellemare et al. (2016b) depicted the exploratory behaviour of intrinsically-motivated agents using maps from the game Montezuma’s Revenge. We note that this kind of research is not restricted to visual analysis per se.
2.2 Different Software for Different Objectives
For each research objective identified above, we now ask: how are these objectives enabled by software? Specifically, we consider 1) the likelihood that the research can be completed using existing code, versus requiring researchers to write new code, 2) the shelf life of code produced during the course of the research, and 3) the value of high-performance code to the research. We shape our analysis along the axis of code complexity: how likely is it that the research objective requires a complex (large, modular, abstract) framework? Conversely, how beneficial is it to pursue this kind of research in a simple (compact, monolithic, readily understood) framework?
Comprehensive studies are naturally implemented as variations within a common agent architecture (each feature of Hessel et al. (2018)’s Rainbow agent can be enabled or disabled). Frameworks supporting a wide range of variations are very likely to be complex, because they unify diverse methods under common interfaces. On the other hand, many studies require no code beyond what the framework already provides, and any new code is likely to have a long shelf life as it plays a role in further studies or in new agents. Performance is critical because these studies are done at scale.
Architecture research. In software, what we call architecture research touches entire branches of code. Frameworks that support architecture research typically provide reusable modules (e.g., a general-purpose replay memory) and common interfaces, and as such are likely to be complex. Code from early iterations may be discarded, but eventually significant time and effort is spent to produce a stable product. This is especially likely when the research focuses on engineering issues, for example scaling up distributed training (Horgan et al., 2018).
As the examples above highlight, visualization is typically intrusive, but benefits from stable implementations of common tools (e.g., Tensorboard). As such, it may be best served by frameworks that strike a balance between simplicity and complexity. Keeping visualization code beyond the project for which it is developed is often onerous, and when visualization modules are reused they usually require additional engineering. Performance is rarely a concern.
Algorithmic research is realized in software at different scales, from a simple change in equation (double DQN) to new data structures (the sum tree in prioritized experience replay). We believe algorithmic research benefits from simple frameworks, by virtue that simple code makes it easier to implement radically different ideas. Algorithmic research typically requires multiple iterations from an initial design, and in our experience this iterative process leads to a significant amount of code being discarded. Performance is less of an issue, as the objective is often to demonstrate the feasibility or value of a new approach, rather than deploy it at scale.
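As an illustration of the "new data structure" end of this spectrum, the following is a minimal sketch of a sum tree of the kind used for prioritized sampling. It is our own simplified version under stated assumptions (a power-of-two capacity, an array-backed binary tree, and the class and method names chosen here), not the implementation of any particular framework.

```python
import random


class SumTree:
    """Binary tree whose leaves hold priorities and whose internal nodes hold sums."""

    def __init__(self, capacity):
        # Assumption: capacity is a power of two, which keeps indexing simple.
        self.capacity = capacity
        # nodes[1] is the root; leaves occupy indices capacity .. 2 * capacity - 1.
        self.nodes = [0.0] * (2 * capacity)

    def set(self, index, priority):
        """Set the priority of leaf `index` and refresh the sums up to the root."""
        node = index + self.capacity
        self.nodes[node] = priority
        node //= 2
        while node >= 1:
            self.nodes[node] = self.nodes[2 * node] + self.nodes[2 * node + 1]
            node //= 2

    def total(self):
        """Return the sum of all priorities."""
        return self.nodes[1]

    def find(self, mass):
        """Return the leaf whose cumulative-priority interval contains `mass`."""
        node = 1
        while node < self.capacity:
            left = 2 * node
            if mass < self.nodes[left]:
                node = left
            else:
                mass -= self.nodes[left]
                node = left + 1
        return node - self.capacity

    def sample(self, rng):
        """Sample a leaf index with probability proportional to its priority."""
        return self.find(rng.random() * self.total())


tree = SumTree(capacity=4)
for index, priority in enumerate([1.0, 2.0, 3.0, 4.0]):
    tree.set(index, priority)

rng = random.Random(0)
counts = [0, 0, 0, 0]
for _ in range(10000):
    counts[tree.sample(rng)] += 1
print(tree.total(), counts)
```

Both updating a priority and sampling take time logarithmic in the capacity, which is what makes this structure attractive for large replay memories.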
Finally, a piece of research software may have an instructional purpose. By this we mean code that is developed not only to produce scientific results, but also to explain the methodology to others. In deep RL, there are relatively few publications with this stated intent; interactive notebooks have played much of the teaching role. We view this objective as benefitting the most from simplicity: teaching research software needs to be stable, trusted, and clear, all of which are facilitated by a small codebase. Teaching-minded code often sacrifices some performance to increase understandability, and may provide additional resources (notebooks, benchmark data, visualizations). This code is usually intended to have a long shelf life.
Conclusions. From the above taxonomy we conclude that there is a natural trade-off between simplicity and complexity in a deep RL research framework. The resulting choices empower some research objectives, possibly at the detriment of others. As we explain below, Dopamine positions itself on the side of simplicity, aiming to facilitate algorithmic research and instructional purposes.
3 Dopamine
In this section we provide specifics of Dopamine, our TensorFlow-based framework. Dopamine is built to satisfy the following design principles:
Self-contained and compact: Self-contained means the core logic is not contained in external, non-standard libraries. A compact framework is one that contains a small number of files and lines of code. Software often satisfies one of these requirements at the expense of the other. Satisfying both results in a low barrier to entry for users, as a compact framework with minimal reliance on external libraries means it is necessary to go through only a small number of files to comprehend the framework’s internals.
Reliable and reproducible: A reliable framework is one that is demonstrably correct; in software engineering this is typically achieved via tests. A reproducible framework is one that facilitates the regeneration of published statistics, and makes novel scientific contributions easily shareable with the community.
Being self-contained and compact helps achieve our goal of providing a simple framework, and in turn supporting algorithmic research and instructional purposes. By putting special emphasis on reliability, we also ensure that the resulting research can be trusted – acknowledging recent concerns on reproducibility in deep reinforcement learning (Islam et al., 2017; Henderson et al., 2017).
The initial offering of Dopamine focuses on value-based reinforcement learning applied to the Arcade Learning Environment. This restricted scope allowed us to make critical design decisions to achieve our goal of designing a simple framework. We intend for future expansions (see Section 6 for details) to follow the guidelines we set here. Although the focus of Dopamine is not computational performance, we provide some statistics in Appendix A.
3.1 Self-contained and Compact
Dopamine’s design is illustrated in Figure 1, highlighting the main components and the lifecycle of an experiment. As indicated in the figure, our complete codebase consists of 12 files containing a little over 2000 lines of Python code.
The Runner class manages the interaction between the agent and the ALE (e.g., taking steps and receiving observations) as well as the bookkeeping (e.g., checkpointing and logging via the Checkpointer and Logger, respectively). The Checkpointer is in charge of regularly saving the experiment state which enables graceful recovery after failure, as well as learned weight re-use. The Logger is in charge of saving experiment statistics (e.g., accumulated training or evaluation rewards) to disk for visualization. We provide Colab interactive notebooks to facilitate the visualization of these statistics.
The different agent classes contain the core logic for inference and learning: receiving Atari 2600 frames from the Runner, returning actions to perform, and performing learning. As in Mnih et al. (2015), the agents make use of a replay memory for the learning process. We provide complete implementations of four established agents: DQN (Mnih et al., 2015) (which is the base class for all agents), C51 (Bellemare et al., 2017), a simplified version of the single-GPU Rainbow agent (Hessel et al., 2018), and IQN (Dabney et al., 2018a). Our version of the Rainbow agent includes the three components identified as most important by Hessel et al. (2018): n-step updates (Mnih et al., 2016), prioritized experience replay (Schaul et al., 2016), and distributional reinforcement learning (Bellemare et al., 2017). In particular, our version of Rainbow does not include double DQN, the dueling architecture, or noisy networks (details in the original work). In our codebase, C51 is a particular parametrization of the Rainbow agent.
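For readers unfamiliar with n-step updates, the following minimal sketch computes an n-step bootstrapped target from a short sequence of rewards. The function name and the example values are illustrative assumptions of ours and do not reflect Dopamine's internals.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Return sum_{k<n} gamma^k * rewards[k] + gamma^n * bootstrap_value,
    where n = len(rewards)."""
    target = bootstrap_value
    # Fold the rewards in from last to first, discounting at each step.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


# Three rewards followed by a bootstrapped value estimate of 10.0.
print(n_step_target([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5))  # 2.75
```

With a single reward (n = 1), this reduces to the usual one-step target.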
3.2 Reliable and Reproducible
We provide a complete suite of tests for all our codebase with code coverage of over 98%. In addition to helping ensure the correctness of the code, these tests provide an alternate form of documentation, complementing the regular documentation and interactive notebooks provided with the framework.
Dopamine makes use of gin-config (github.com/google/gin-config) for configuration of the different modules. Gin-config is a simple scheme for parameter injection, i.e. changing the default parameters of a method dynamically. In Dopamine we specify all parameters of an experiment within a single file. Figure 2 shows a sample of the configuration of the default DQN agent settings (full gin-config files for all agents are provided in Appendix D).
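To illustrate what parameter injection means in practice, the following self-contained sketch mimics the idea with a small decorator. It deliberately does not use the gin-config library: `configurable`, `parse_bindings`, `BINDINGS`, and `make_agent` are hypothetical names of our own, and the bound values are illustrative rather than Dopamine's defaults.

```python
import ast
import functools

BINDINGS = {}  # Maps "function_name.parameter" to an injected value.


def configurable(fn):
    """Hypothetical stand-in for a gin-style decorator: fills in bound
    parameters at call time."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        prefix = fn.__name__ + "."
        injected = {}
        for key, value in BINDINGS.items():
            if key.startswith(prefix):
                injected[key[len(prefix):]] = value
        # Explicit keyword arguments take precedence over bindings.
        injected.update(kwargs)
        return fn(*args, **injected)
    return wrapper


def parse_bindings(text):
    """Read lines of the form `name.parameter = literal`,
    ignoring blank lines and # comments."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            key, value = line.split("=", 1)
            BINDINGS[key.strip()] = ast.literal_eval(value.strip())


@configurable
def make_agent(gamma=0.9, update_period=1, epsilon_train=0.1):
    return {"gamma": gamma, "update_period": update_period,
            "epsilon_train": epsilon_train}


parse_bindings("""
make_agent.gamma = 0.99
make_agent.update_period = 4  # illustrative values only
""")

print(make_agent())           # bound values replace the defaults
print(make_agent(gamma=0.5))  # an explicit argument overrides the binding
```

The appeal of this style is that an experiment's parameters live in one text file, so changing a setting does not require touching the code that consumes it.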
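Table 1: Differences between our default settings (first value column) and the published settings of the four agents (remaining four columns).
| Epis. termination | Game Over | Life Loss | Life Loss | Life Loss | Life Loss |
| ε decay schedule (frames) | 1M | 4M | 4M | 1M | 4M |
| Min. history to learn (frames) | 80K | 200K | 200K | 80K | 200K |
| Target net. update freq. (frames) | 32K | 40K | 40K | 32K | 40K |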
3.3 Baselines for comparison
We also provide a set of pre-trained baselines for the community to benchmark against. We use a uniform set of hyperparameters for all the provided agents, and call these the default settings. These combine Hessel et al. (2018)’s agent hyperparameters with Machado et al. (2018)’s ALE parameters. Our intent is not to provide an optimal set of hyperparameters, but to provide a consistent set as a baseline, while facilitating the process of hyperparameter exploration. Table 1 summarizes the differences between our default experimental setup and published results. For completeness, our experiments also use the minimal action set from ALE, as this is what the Gym interface provides.
For each agent, we include gin-config files for both the default and published settings. In Figure 3 we compare the four agents’ published settings against the default settings on Space Invaders and Seaquest. Note that the change in scale in the y-axis is due to the use of sticky actions. It is interesting to note the changing dynamics between the algorithms in Seaquest when going from their published settings to the default settings. With the former, C51 dominates the other algorithms by a large margin, and Rainbow dominates over IQN by a smaller margin. With the latter, IQN seems to dominate over all algorithms from early on.
To facilitate benchmarking against our default settings we ran 5 independent runs of each agent on all 60 games from the ALE. For each of these runs we provide the TensorFlow checkpoints for all agents, the event files for Tensorboard, the training logs for all of the agents, a set of interactive notebooks facilitating the plotting of new experiments against our baselines, as well as a webpage where one can visually inspect the four agents’ performance across all 60 games.
4 Revisiting the Arcade Learning Environment: A Test Case
Machado et al. (2018) propose a standard methodology for evaluating algorithms within the ALE, and provide empirical evidence that alternate ALE parameter choices can impact research conclusions. In this section we continue the investigation where they left off: we begin with DQN and look ahead to some of the algorithms it inspired; namely, C51 (Bellemare et al., 2017) and Rainbow (Hessel et al., 2018). For legibility we plot DQN against C51 or Rainbow, but not both; the qualitative results, however, remain the same when compared with either. For continuity, we employ the parameter names from Section 3.1 of Machado et al.’s work.
4.1 Episode termination
The ALE considers an episode as finished when a human would normally stop playing: when they have finished the game or run out of lives. We call this termination condition “Game Over”. Mnih et al. (2015) introduced a heuristic, called Life Loss, which adds artificial episode boundaries in the replay memory whenever the player loses a life (e.g., in Montezuma’s Revenge the agent has 5 lives). Both definitions of episode termination have been used in the recent literature. Running this experiment in Dopamine consists of modifying a single gin-config option.
Figure 4 illustrates the difference in reported performance under the two conditions. Although our plots show that the Life Loss heuristic improves performance in some of the simpler games, Bellemare et al. (2016b) pointed out that it hinders performance in others, in particular because the agent cannot learn about the true consequences of losing a life. Following Machado et al.’s guidelines, the Life Loss heuristic is disabled in the default settings.
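The two termination conditions can be contrasted with a small sketch. It is an illustration under simplifying assumptions of our own: the game is taken to end when the life counter reaches zero, and `episode_boundaries` and its `terminal_on_life_loss` argument are hypothetical names used only for this example.

```python
def episode_boundaries(lives_per_step, terminal_on_life_loss):
    """Return the step indices at which an episode boundary is recorded.

    lives_per_step[t] is the number of lives remaining after step t.
    """
    boundaries = []
    previous_lives = None
    for step, lives in enumerate(lives_per_step):
        game_over = lives == 0
        life_lost = previous_lives is not None and lives < previous_lives
        if game_over or (terminal_on_life_loss and life_lost):
            boundaries.append(step)
        previous_lives = lives
    return boundaries


lives = [3, 3, 2, 2, 2, 1, 1, 0]
print(episode_boundaries(lives, terminal_on_life_loss=False))  # [7]
print(episode_boundaries(lives, terminal_on_life_loss=True))   # [2, 5, 7]
```

Under Game Over the agent sees one long episode, whereas under Life Loss the same experience is cut into three shorter ones, which changes what the agent can learn about the consequences of losing a life.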
4.2 Measuring training data and summarizing learning performance
In Dopamine, training data is measured in game frames experienced by the agent, and each iteration consists of a fixed number of frames, rounded up to the nearest episode boundary.
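The rounding rule can be sketched as a loop that only stops at episode boundaries. This is a simplified illustration rather than Dopamine's Runner: `run_iteration` and `run_one_episode` are hypothetical names, and the episode lengths and returns are made up.

```python
def run_iteration(min_frames, run_one_episode):
    """Run whole episodes until at least `min_frames` frames have been seen.

    `run_one_episode` must return (frames_in_episode, episode_return).
    Assumes min_frames > 0, so that at least one episode is run.
    """
    frames, episodes, total_return = 0, 0, 0.0
    while frames < min_frames:
        episode_frames, episode_return = run_one_episode()
        frames += episode_frames
        episodes += 1
        total_return += episode_return
    return frames, episodes, total_return / episodes


# Stand-in for real agent-environment interaction: fixed episode lengths.
episode_lengths = iter([400, 350, 500])


def fake_episode():
    length = next(episode_lengths)
    return length, length / 100.0


frames, episodes, average_return = run_iteration(1000, fake_episode)
print(frames, episodes)  # 1250 3
```

Here a budget of 1000 frames results in 1250 frames of experience, because the third episode is allowed to finish.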
Dopamine supports two schedules for running jobs: train and train_and_eval. The former only measures average score during training, while the latter interleaves training with evaluation runs, where learning is stopped. Machado et al. (2018) advocate for reporting the learning curves during training, necessitating only the train schedule. We report the difference between the training and evaluation scores in Figure 5. This graph suggests that there is little difference between reporting training versus evaluation returns, so restricting to training curves is sufficient.
4.3 Effect of Sticky Actions on Agent Performance
The original ALE has deterministic transitions, which rewards agents that can memorize sequences of actions to achieve high scores (e.g., The Brute in Machado et al., 2018). To mitigate this issue, the most recent version of the ALE implements sticky actions. Sticky actions make use of a stickiness parameter ς, which is the probability that the environment will execute the agent’s previous action, as opposed to the one the agent just selected, effectively implementing a form of action momentum. Running this experiment in Dopamine consisted of modifying a single gin-config option.
In Figure 6 we demonstrate that there are differences in performance when running with or without sticky actions. While in some cases (Rainbow playing Space Invaders) sticky actions seem to improve performance, they typically reduce performance. Nevertheless, they still lead to meaningful learning curves (Rainbow surpassing DQN); hence, and in accordance with the recommendations given by Machado et al. (2018), sticky actions are enabled by default in Dopamine.
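The stickiness mechanism itself is simple to state in code. The sketch below is our own simplification, not the ALE's implementation: it applies stickiness once per agent decision, and the class name `StickyActions`, its methods, and the stickiness value of 0.25 are illustrative assumptions.

```python
import random


class StickyActions:
    """Repeats the previously executed action with probability `stickiness`."""

    def __init__(self, stickiness, seed=0):
        self.stickiness = stickiness
        self.rng = random.Random(seed)
        self.previous_action = None

    def reset(self):
        """Forget the previous action at the start of an episode."""
        self.previous_action = None

    def filter(self, action):
        """Return the action that is actually executed."""
        if (self.previous_action is not None
                and self.rng.random() < self.stickiness):
            action = self.previous_action
        self.previous_action = action
        return action


sticky = StickyActions(stickiness=0.25, seed=0)
selected = [0, 1, 2, 3, 0, 1, 2, 3]
executed = [sticky.filter(action) for action in selected]
print(selected)
print(executed)
```

Because the executed action can differ from the selected one, an agent can no longer rely on replaying a memorized action sequence.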
5 Related work
We acknowledge that the last two years have seen the multiplication of frameworks for deep RL, most targeting fundamental research. This section reviews some of the more popular of these, and provides our perspective on their interplay with the taxonomy of Section 2. From this review, we conclude that Dopamine fills a unique niche in the deep reinforcement learning software ecosystem.
OpenAI baselines (Dhariwal et al., 2017) is a popular library providing comprehensive implementations of deep RL algorithms, with a particular focus on single-threaded, policy-based algorithms. Similar libraries include Coach from Intel (Caspi et al., 2017), Tensorforce (Schaarschmidt et al., 2017), and Keras RL (Plappert, 2016), which provide state-of-the-art algorithms and easy integration with OpenAI Gym. RLLab (Duan et al., 2016) emphasizes the benchmarking of continuous control algorithms, while RLLib (Liang et al., 2018) focuses on defining abstractions to optimize reinforcement learning performance in distributed settings. ELF (Tian et al., 2017) is a distributed RL framework in C++ and Python focusing on real-time strategy games. By virtue of providing implementations for many agents and algorithms, these frameworks are well-positioned to perform architecture research and comprehensive studies.
The field has also benefited from the open-sourcing of individual agents, typically done to facilitate reproducibility and disseminate technological savoir-faire. Of note, DQN was originally open-sourced in Lua and has since been re-implemented countless times; similarly, there are a number of publicly available PPO implementations (e.g. Hafner et al., 2017). More recently, the IMPALA agent was also open-sourced (Espeholt et al., 2018). While still beneficial, these frameworks are typically intended for personal consumption or illustrative purposes.
We conclude by noting that numerous reinforcement learning frameworks have been developed prior and in parallel to the deep RL resurgence, with similar research objectives. RLGlue (Tanner & White, 2009) emphasized benchmarking of value-based algorithms across a variety of canonical tasks, and drove a yearly competition; BURLAP (MacGlashan, 2016) provides object-oriented support; PyBrain (Schaul et al., 2010) supports neural networks, and is perhaps closest to more recent frameworks; RLPy (Geramifard et al., 2015) is a lighter framework written in Python, with similar goals to RLGlue and also emphasizing instructional purposes.
6 Conclusion and Future Work
Dopamine provides a stable, reliable, flexible, and reproducible framework for fundamental deep reinforcement learning research. In this paper we have highlighted some of the challenges the research community faces in making new research reproducible and easy to compare against, and argued that Dopamine addresses many of these issues. Our hope is that providing the community with a stable framework that is not difficult to understand will propel new scientific advances in this continually-growing field of research.
In order to keep our initial offering as simple and compact as possible, we have focused on single-GPU, value-based agents running on the ALE. In the near future, we hope to extend these offerings to policy-based methods as well as other environments. Distributed methods are an avenue that we are considering, although we are approaching it with caution in order to avoid introducing extra complexity into the codebase. Finally, we would like to provide tools for more advanced visualizations such as those discussed in Section 2.2.
- Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In Symposium on Operating Systems Design and Implementation, 2016.
- Beattie et al. (2016) Charles Beattie, Joel Z. Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, Julian Schrittwieser, Keith Anderson, Sarah York, Max Cant, Adam Cain, Adrian Bolton, Stephen Gaffney, Helen King, Demis Hassabis, Shane Legg, and Stig Petersen. Deepmind Lab. CoRR, abs/1612.03801, 2016. URL http://arxiv.org/abs/1612.03801.
- Bellemare et al. (2013) Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013.
- Bellemare et al. (2016a) Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi Munos. Increasing the Action Gap: New Operators for Reinforcement Learning. In Proceedings of the Thirtieth AAAI Conference, 2016a.
- Bellemare et al. (2016b) Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, 2016b.
- Bellemare et al. (2017) Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017.
- Caspi et al. (2017) Itai Caspi, Gal Leibovich, Gal Novik, and Shadi Endrawis. Reinforcement Learning Coach, 2017. URL https://doi.org/10.5281/zenodo.1134899.
- Dabney et al. (2018a) Will Dabney, Georg Ostrovski, David Silver, and Remi Munos. Implicit Quantile Networks for Distributional Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, 2018a.
- Dabney et al. (2018b) Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional Reinforcement Learning with Quantile Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018b.
- Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.
- Duan et al. (2016) Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, 2016.
- Espeholt et al. (2018) Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Geramifard et al. (2015) Alborz Geramifard, Christoph Dann, Robert H. Klein, William Dabney, and Jonathan P. How. RLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research. Journal of Machine Learning Research, 2015.
- Gruslys et al. (2018) Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, and Remi Munos. The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning. In International Conference on Learning Representations, 2018.
- Guo et al. (2014) Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L. Lewis, and Xiaoshi Wang. Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning. In Advances in Neural Information Processing Systems, 2014.
- Hafner et al. (2017) Danijar Hafner, James Davidson, and Vincent Vanhoucke. TensorFlow Agents: Efficient Batched Reinforcement Learning in TensorFlow. arXiv preprint arXiv:1709.02878, 2017.
- Henderson et al. (2017) Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep Reinforcement Learning that Matters. CoRR, abs/1709.06560, 2017. URL http://arxiv.org/abs/1709.06560.
- Hessel et al. (2018) Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining Improvements in Deep Reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- Horgan et al. (2018) Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. In International Conference on Learning Representations, 2018.
- Islam et al. (2017) Riashat Islam, Peter Henderson, Maziar Gomrokchi, and Doina Precup. Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control. In ICML Workshop on Reproducibility in Machine Learning, 2017.
- Johnson et al. (2016) Matthew Johnson, Katja Hofmann, Tim Hutton, and David Bignell. The Malmo Platform for Artificial Intelligence Experimentation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, 2016.
- Kempka et al. (2016) Michal Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. Vizdoom: A Doom-based AI Research Platform for Visual Reinforcement learning. CoRR, abs/1605.02097, 2016. URL http://arxiv.org/abs/1605.02097.
- Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 2016.
- Liang et al. (2018) Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for Distributed Reinforcement Learning. In International Conference on Machine Learning, 2018.
- MacGlashan (2016) James MacGlashan. BURLAP: Brown-UMBC Reinforcement Learning and Planning. http://burlap.cs.brown.edu/, 2016.
- Machado et al. (2018) Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 2018.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
- Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, 2016.
- Munos et al. (2016) Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. Safe and Efficient Off-Policy Reinforcement Learning. In Advances in Neural Information Processing Systems, 2016.
- Plappert (2016) Matthias Plappert. keras-rl. https://github.com/keras-rl/keras-rl, 2016.
- Schaarschmidt et al. (2017) Michael Schaarschmidt, Alexander Kuhnle, and Kai Fricke. TensorForce: A TensorFlow library for applied reinforcement learning. https://github.com/reinforceio/tensorforce, 2017.
- Schaul et al. (2010) Tom Schaul, Justin Bayer, Daan Wierstra, Yi Sun, Martin Felder, Frank Sehnke, Thomas Rückstieß, and Jürgen Schmidhuber. PyBrain. Journal of Machine Learning Research, 2010.
- Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized Experience Replay. In International Conference on Learning Representations, 2016.
- Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.
- Tanner & White (2009) Brian Tanner and Adam White. RL-Glue: Language-Independent Software for Reinforcement-Learning Experiments. Journal of Machine Learning Research, 2009.
- Tassa et al. (2018) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. DeepMind Control Suite. CoRR, abs/1801.00690, 2018.
- Tian et al. (2017) Yuandong Tian, Qucheng Gong, Wenling Shang, Yuxin Wu, and C Lawrence Zitnick. ELF: An Extensive, Lightweight and Flexible Research Platform for Real-time Strategy Games. In Advances in Neural Information Processing Systems, 2017.
- van Hasselt et al. (2016) Hado van Hasselt, Arthur Guez, and David Silver. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
- Vinyals et al. (2017) Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy P. Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, and Rodney Tsing. StarCraft II: A New Challenge for Reinforcement Learning. CoRR, abs/1708.04782, 2017.
- Wang et al. (2016) Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.
- Wang et al. (2017) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample Efficient Actor-Critic with Experience Replay. In Proceedings of the International Conference on Learning Representations, 2017.
- Zahavy et al. (2016) Tom Zahavy, Nir ben Zrihem, and Shie Mannor. Graying the black box: understanding DQNs. In Proceedings of the International Conference on Machine Learning, 2016.
Appendix A Performance statistics for Dopamine
Although the focus of Dopamine is not computational performance, we provide some performance statistics below.
Runtime performance is normally measured in the number of Atari frames processed per second (fps). This performance will vary from agent to agent and from game to game, but we provide two extremes of this spectrum. Both were measured while training on a Tesla P100 GPU.
DQN on Pong: Runtime is around 800 fps.
IQN on Asterix: Runtime is around 371 fps.
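To put these figures in perspective, the sketch below converts a throughput into wall-clock training time. The 200-million-frame budget is an assumption made purely for illustration (it is a common budget in the Atari literature), and `training_hours` is a hypothetical helper, not part of Dopamine.

```python
def training_hours(total_frames, fps):
    """Wall-clock hours needed to process total_frames at a steady fps."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return total_frames / fps / 3600.0


# Assumed budget, for illustration only: 200 million frames per run.
ASSUMED_FRAME_BUDGET = 200_000_000

for label, fps in [("DQN on Pong", 800), ("IQN on Asterix", 371)]:
    hours = training_hours(ASSUMED_FRAME_BUDGET, fps)
    print(f"{label}: about {hours:.0f} hours ({hours / 24:.1f} days)")
```

Under this assumption, the slower configuration takes a little more than twice as long as the faster one.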
As mentioned in the main body of the text, Dopamine performs regular checkpointing to allow for graceful recovery from failures. We perform garbage collection, so checkpoints are maintained only for the last few iterations. The largest consumer of disk space is the replay buffer, which is configured by default to hold 1 million frames. When checkpointing, we store these frames as compressed numpy objects. Once again, the footprint varies not only from agent to agent and from game to game, but also from one run to the next, as the complexity of the stored frames depends on the policy learned by the agent. We provide two extremes of this spectrum:
DQN on Pong: The compressed replay buffer can be as low as 4.3Kb. Pong frames have few “active” pixels per frame, so this is not surprising. See screenshot below:
IQN on YarsRevenge: The compressed replay buffer can be as high as 1.2Gb. YarsRevenge has a pseudo-random strip running down the screen, so this is also not surprising. See screenshot below:
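The gap between these two extremes can be reproduced in miniature. The following sketch is an illustration under stated assumptions, not Dopamine's checkpointing code: it uses `gzip` on raw bytes instead of compressed numpy objects, assumes 84 × 84 single-byte frames, and compares a buffer of mostly blank frames against a buffer of pseudo-random frames. All function names are hypothetical.

```python
import gzip
import random

FRAME_BYTES = 84 * 84  # assumed frame size, one byte per pixel
NUM_FRAMES = 50        # tiny buffer so the sketch runs quickly


def sparse_frame(rng, active_pixels=20):
    """Mostly blank frame with a handful of lit pixels (Pong-like)."""
    frame = bytearray(FRAME_BYTES)
    for _ in range(active_pixels):
        frame[rng.randrange(FRAME_BYTES)] = 255
    return bytes(frame)


def noisy_frame(rng):
    """Frame in which every pixel is pseudo-random."""
    return bytes(rng.randrange(256) for _ in range(FRAME_BYTES))


def compressed_size(frames):
    """Size in bytes of the gzip-compressed concatenation of the frames."""
    return len(gzip.compress(b"".join(frames)))


rng = random.Random(0)
sparse_size = compressed_size([sparse_frame(rng) for _ in range(NUM_FRAMES)])
noisy_size = compressed_size([noisy_frame(rng) for _ in range(NUM_FRAMES)])
raw_size = FRAME_BYTES * NUM_FRAMES
print(f"raw: {raw_size} B, sparse: {sparse_size} B, noisy: {noisy_size} B")
```

Frames with few active pixels compress to a small fraction of their raw size, whereas pseudo-random content is essentially incompressible, which is consistent with the difference between the Pong and YarsRevenge footprints.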
Appendix B Modifying DQN to select random actions
In this section we demonstrate how one can create a new agent by inheriting from one of the provided agents. This code is meant solely for illustrative purposes, as the created agent will perform quite poorly.
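A minimal sketch of this pattern follows. To stay self-contained, it subclasses a stand-in `BaseAgent` rather than Dopamine's DQN agent class; the class names, the `step(reward, observation)` signature, and the `num_actions` attribute are assumptions made for illustration.

```python
import random


class BaseAgent:
    """Hypothetical stand-in for one of the provided agents (e.g. DQN)."""

    def __init__(self, num_actions):
        self.num_actions = num_actions
        self.steps_seen = 0

    def step(self, reward, observation):
        # A real agent would store the transition and train here.
        self.steps_seen += 1
        return 0  # placeholder for the agent's own action choice


class RandomActionAgent(BaseAgent):
    """Inherits everything from the parent but acts uniformly at random."""

    def step(self, reward, observation):
        # Let the parent do its bookkeeping, then discard its action.
        _ = super().step(reward, observation)
        return random.randrange(self.num_actions)


agent = RandomActionAgent(num_actions=4)
actions = [agent.step(reward=0.0, observation=None) for _ in range(100)]
print(sorted(set(actions)), agent.steps_seen)
```

Only `step` is overridden, so the parent's bookkeeping still runs on every call while its action is ignored.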
This code snippet is also provided in one of our interactive notebooks, where users can train and visualize the agent’s performance against the trained baselines as follows:
Resulting in the following plot:
Appendix C Creating a new agent from scratch
In this section we demonstrate how to create a new agent from scratch by providing the minimum functionality expected by the Runner. Again, this code is meant solely for illustrative purposes.
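As a hedged sketch of what such an agent might look like, the class below picks a random action and sticks to it, switching with a small probability. The method names (`begin_episode`, `step`, `end_episode`, `bundle_and_checkpoint`, `unbundle`) and the `eval_mode` flag are assumptions about the interface the Runner expects; the repository remains the authoritative reference.

```python
import random


class StickyAgent:
    """Picks a random action and keeps it, switching with prob. switch_prob."""

    def __init__(self, num_actions, switch_prob=0.1, seed=None):
        self._num_actions = num_actions
        self._switch_prob = switch_prob
        self._rng = random.Random(seed)
        self._last_action = self._rng.randrange(num_actions)
        self.eval_mode = False  # assumed flag toggled by the runner

    def _choose_action(self):
        # Occasionally resample; otherwise repeat the previous action.
        if self._rng.random() < self._switch_prob:
            self._last_action = self._rng.randrange(self._num_actions)
        return self._last_action

    # Assumed runner-facing interface (names are illustrative).
    def begin_episode(self, observation):
        return self._choose_action()

    def step(self, reward, observation):
        return self._choose_action()

    def end_episode(self, reward):
        pass  # nothing to learn from the final reward

    def bundle_and_checkpoint(self, checkpoint_dir, iteration_number):
        return {}  # this agent has no state worth saving

    def unbundle(self, checkpoint_dir, iteration_number, bundle_dictionary):
        return True  # nothing to restore


sticky = StickyAgent(num_actions=4, switch_prob=0.0, seed=1)
first_action = sticky.begin_episode(observation=None)
repeated = [sticky.step(0.0, None) for _ in range(50)]
sticky.end_episode(0.0)
```

With `switch_prob=0.0` the agent never changes its action, which makes the behaviour easy to check by hand.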
This code snippet is also provided in one of our interactive notebooks, where users can train and visualize the agent’s performance against the trained baselines as follows:
Resulting in the following plot:
Appendix D Gin Config files for all agents
In this section we list the gin config files provided for each of our agents. Note that these are the gin config files of the first version; they may change over time, so the best reference is always github.com/google/dopamine.
D.1 DQN
D.1.1 Default settings
Default settings used in Dopamine:
D.1.2 Nature settings
Settings used in Mnih et al. (2015):
D.1.3 ICML settings
Settings used in Bellemare et al. (2017):
D.2 C51
D.2.1 Default settings
Default settings used in Dopamine:
D.2.2 ICML settings
Settings used in Bellemare et al. (2017):
D.3 Rainbow
D.3.1 Default settings
Default settings used in Dopamine:
D.3.2 AAAI settings
Settings used in Hessel et al. (2018):
D.4 IQN
D.4.1 Default settings
Default settings used in Dopamine:
D.4.2 ICML settings
Settings used in Dabney et al. (2018a):