Since its introduction, the Arcade Learning Environment (ALE; Bellemare et al. (2013)) has been an important testbed in reinforcement learning (RL). ALE enables easily evaluating RL algorithms on more than 50 emulated Atari games spanning diverse gameplay styles, providing a window on such algorithms’ generality. Indeed, surprisingly strong results in ALE with deep neural networks (DNNs; LeCun et al. (2015)), published in Nature (Mnih et al., 2015), greatly contributed to the current popularity of deep reinforcement learning (DRL).
As with other machine learning benchmarks, much effort aims to quantitatively improve state-of-the-art (SOTA) scores. As the DRL community grows, a paper pushing SOTA is likely to attract significant interest and accumulate citations. While improving performance is important, it is equally important to understand what DRL algorithms learn, how they process and represent information, and what their properties, strengths, and weaknesses are. These questions cannot be answered through simple quantitative measurements of performance across the ALE suite of games.
In comparison to pushing SOTA, much less work has focused on understanding, interpreting, and visualizing the products of DRL; in particular, there is a dearth of research that compares DRL algorithms across dimensions other than performance. The main focus of this paper is to help alleviate the considerable friction for those looking to rigorously understand the qualitative behavior of DRL agents. Three main sources of such friction are: (1) the significant computational resources required to run DRL at scale, (2) the logistical tedium of plumbing the products of different DRL algorithms into a common interface, and (3) the wasted effort in re-implementing standard analysis pipelines (like t-SNE embeddings of the state space Maaten and Hinton (2008), or activation maximization for visualizing what neurons in a model represent Erhan et al. (2009); Olah et al. (2018); Nguyen et al. (2017); Simonyan et al. (2013); Yosinski et al. (2015); Nguyen et al. (2016a, b); Mahendran and Vedaldi (2016)). To address these frictions, this paper introduces the Atari Zoo, a release of trained models spanning major families of DRL algorithms, and an accompanying open-source software package (software available at: http://t.uber.com/atarizoo) that enables their easy analysis, comparison, and visualization (and similar analysis of future models). In particular, this package enables easily downloading particular frozen models of interest from the zoo on-demand, further evaluating them in their training environment or modified environments, generating visualizations of their neural activity, exploring compressed visual representations of their behavior, and creating synthetic input patterns that reveal what particular neurons most respond to.
To demonstrate the promise of this model zoo and software, this paper presents an initial analysis of the products of six DRL algorithms spanning policy gradient, value-based, and evolutionary methods: A2C (policy-gradient; Mnih et al. (2016)), DQN (value-based; Mnih et al. (2015)), Rainbow (value-based; Hessel et al. (2017)), Ape-X (value-based; Horgan et al. (2018)), ES (evolutionary; Salimans et al. (2017)), and Deep GA (evolutionary; Such et al. (2017)). The analysis illuminates differences in learned policies across methods that are independent of raw score performance, highlighting the benefit of going beyond simple quantitative measures and of having a unifying software framework that enables analyses with multiple, different, complementary techniques and applying them across many RL algorithms.
2.1 Visualizing Deep Networks
One line of DNN research focuses on visualizing the internal dynamics of a DNN Yosinski et al. (2015) or examines what particular neurons detect or respond to Erhan et al. (2009); Zeiler and Fergus (2014); Olah et al. (2018); Nguyen et al. (2017); Simonyan et al. (2013); Yosinski et al. (2015); Mahendran and Vedaldi (2016). The hope is to gain more insight into how DNNs are representing information, motivated both to enable more interpretable decisions from these models Olah et al. (2018) and to illuminate previously unknown properties about DNNs Yosinski et al. (2015). For example, through live visualization of all activations in a vision network responding to different images, Yosinski et al. (2015) highlighted that representations were often surprisingly local (as opposed to distributed), e.g. one convolutional filter proved to be a reliable face detector. One practical value of such insights is that they can catalyze future research Li et al. (2015). The Atari Zoo enables animations in the spirit of Yosinski et al. (2015) that show an agent’s activations as it interacts with a game, and also enables creating synthetic inputs via activation maximization Erhan et al. (2009); Zeiler and Fergus (2014); Olah et al. (2018); Nguyen et al. (2017); Simonyan et al. (2013); Yosinski et al. (2015); Mahendran and Vedaldi (2016), specifically by connecting DRL agents to the visualization package from Olah et al. (2018).
2.2 Understanding Deep RL
While much more visualization and understanding work has been done for vision models than for DRL, a few papers directly focus on understanding DRL agents (Greydanus et al., 2017; Zahavy et al., 2016), and many others feature some analysis of DRL agent behavior (often in the form of t-SNE diagrams of the state space; see Mnih et al. (2015)). One approach to understanding DRL agents is to investigate the learned features of models (Greydanus et al., 2017; Zahavy et al., 2016). For example, Zahavy et al. (2016) visualize what pixels are most important to an agent’s decision by using gradients of decisions with respect to pixels. Another approach is to modify the DNN architecture or training procedure such that a trained model will have more interpretable features (Iyer et al., 2018; Annasamy and Sycara, 2018). For example, Annasamy and Sycara (2018) augment a model with an attention mechanism and a reconstruction loss, hoping to produce interpretable explanations as a result.
The software package released here is in the spirit of the first paradigm. It facilitates understanding the most commonly applied architectures instead of changing them, although it is also designed to accommodate importing new vision-based DRL models, and could thus also be used to analyze agents explicitly engineered to be interpretable. In particular, the package enables re-exploring, at scale and across algorithms, many past DRL analysis techniques that were previously applied to only one algorithm and across only a handful of hand-selected games.
2.3 Model Zoos
A useful mechanism for reducing friction for analyzing and building upon models is the idea of a model zoo, i.e. a repository of pre-trained models that can easily be further investigated, fine-tuned, and/or compared (e.g. by looking at how their high-level representations differ). For example, the Caffe website includes a model zoo with many popular vision models Jia et al. (2014), as do TensorFlow Silberman and Guadarrama (2016), Keras Chollet et al. (2015), and PyTorch Paszke et al. (2017). The idea is that training large-scale vision networks (e.g. on the ImageNet dataset) can take weeks with powerful GPUs, and that there is little reason to constantly reduplicate the effort of training. Pre-trained word-embedding models are often released with similar motivation, e.g. for Word2Vec Mikolov et al. (2013) or GloVe Pennington et al. (2014). However, such a practice is much less common in the space of DRL; one reason is that so far, unlike with vision models and word-embedding models, there are few other downstream tasks for which Atari DRL agents provide obvious value. But, if the goal is to better understand these models and algorithms, both to improve them and to use them safely, then there is value in their release.
One notable DRL model release accompanied the recent Dopamine software package for reproducible DRL Bellemare et al. (2018); Dopamine includes final checkpoints of models trained by several DQN variants across ALE games. However, in general it is non-trivial to extract TensorFlow models from their original context for visualization purposes, and to directly compare agent behavior across DRL algorithms in the same software framework (e.g. due to slight differences in image preprocessing), or to explore dynamics that take place over learning, i.e. from intermediate checkpoints. To remedy this, for this paper and its accompanying software release, the Dopamine checkpoints were distilled into frozen models that can be easily loaded into the Atari Zoo framework.
3 Generating the Zoo
The approach in this paper is to run several validated implementations of DRL algorithms and to collect and standardize the models and results, such that they can then be easily used for downstream analysis and synthesis. There are many algorithms (and implementations of them) and different ways that those implementations could be run (e.g. different hyperparameters, architectures, input representations, stopping criteria, etc.). These choices influence the kind of post-hoc analysis that is possible. For example, Rainbow most often outperforms DQN, and if only final models are released, it is impossible to explore scientific questions where it is important to control for performance.
To navigate these many degrees of freedom, we adopted the high-level principle that the Atari Zoo should hold as many elements of architecture and experimental design as possible constant across algorithms (e.g. DNN structure, input representation), should enable as many types of downstream analysis as possible (e.g. by releasing checkpoints across training time), and should make reasonable allowances for the particularities of each algorithm (e.g. ensuring hyperparameters are well-fit to the algorithm, and allowing for differences in how policies are encoded or sampled from). The next paragraphs describe specific design choices.
3.1 Frozen Model Selection Criteria
To enable the platform to facilitate a variety of explorations, we release multiple frozen models for each run, according to different criteria that may be useful to control for when comparing learned policies. The idea is that, depending on the desired analysis, controlling for samples, for wall-clock time, or for performance (i.e. comparing similarly-performing policies) will be more appropriate. In particular, in addition to releasing the final model for each run, additional models are taken over training time (at one, two, four, six, and ten hours); over game frame samples (400 million and 1 billion frames); over scores (if an algorithm reaches human-level performance); and also before any training, to enable analysis of how weights change from their random initialization. The hope is that these frozen models will cover a wide spectrum of possible use cases.
3.2 Algorithm Choice
Because one focus of the Atari Zoo is to compare learned agents across different DRL algorithms, one important design consideration is the choice of particular algorithms to run. The main families of DRL algorithms that have been applied to the ALE are policy-gradient methods like A2C Mnih et al. (2016), value-based methods like DQN Mnih et al. (2015), and black-box optimization methods like ES Salimans et al. (2017) and Deep GA Such et al. (2017). Based on available and trusted implementations, and the authors’ familiarity in running various algorithms at scale, the particular algorithms chosen to train included one policy-gradient algorithm (A2C; Mnih et al. (2016)), two evolutionary algorithms (ES Salimans et al. (2017) and Deep GA Such et al. (2017)), and one value-function-based algorithm (a high-performing DQN variant, Ape-X; Horgan et al. (2018)). Additionally, models are also imported from the Dopamine release Bellemare et al. (2018), which includes DQN Mnih et al. (2015) and a sophisticated variant of it called Rainbow Hessel et al. (2017). Note that from the Dopamine models, only final models are currently available. Hyperparameters and training details for all algorithms are available in supplemental material section S3. We hope to include models from additional algorithms in future releases.
3.3 Network Architecture and Input Representation
All algorithms are run with the DNN architecture from Mnih et al. (2015), which consists of three convolutional layers (with filter sizes 8x8, 4x4, and 3x3), followed by a fully-connected layer. For most of the explored algorithms, the fully-connected layer connects to an output layer with one neuron per action available in the underlying Atari game. However, A2C has an additional output that approximates the state value function; Ape-X’s architecture features dueling DQN Wang et al. (2015), which has two separate fully-connected streams; and Rainbow’s architecture includes C51 Bellemare et al. (2017), which uses many outputs to approximate the distribution of expected Q-values.
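As a rough check on the architecture described above, the following sketch computes the spatial size of each convolutional layer's output for an 84x84 input. The filter sizes come from the text; the strides (4, 2, 1), the filter counts (32, 64, 64), and the absence of padding are assumptions based on the commonly cited configuration of Mnih et al. (2015) and are not stated in this section.

```python
def conv_output_size(input_size, kernel, stride):
    """Output width of an unpadded ('valid') convolution."""
    return (input_size - kernel) // stride + 1


# (kernel, stride, filters): kernel sizes are from the text; strides and
# filter counts are assumed, following Mnih et al. (2015).
LAYERS = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]


def layer_shapes(input_size=84):
    """Return (height, width, filters) for each convolutional layer's output."""
    shapes = []
    size = input_size
    for kernel, stride, filters in LAYERS:
        size = conv_output_size(size, kernel, stride)
        shapes.append((size, size, filters))
    return shapes


if __name__ == "__main__":
    for i, shape in enumerate(layer_shapes(), start=1):
        print(f"conv{i}: {shape}")
```

Under these assumptions the third convolutional layer produces a 7x7 map per filter, which is the map later flattened into the fully-connected layer.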
The input to the network is a tensor consisting of the four most recent observation frames, grayscaled and downsampled to 84x84 (figure 1b). By including some previous frames, the aim is to make the game more fully observable; this is useful for the feed-forward architectures used here, which are currently the most common in Atari research (although recurrent architectures offer possible improvements Hausknecht and Stone (2015); Mnih et al. (2016); Espeholt et al. (2018)). One useful Atari representation that is applied in post-training analysis in this paper is the Atari RAM state, which is only 1024 bits long but encompasses the true underlying state information in the game (figure 1c).
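The two representations described above can be sketched in a few lines. This is an illustrative sketch only: `FrameStack` and `ram_to_bits` are hypothetical helper names, grayscaling and downsampling are omitted, and repeating the first frame at the start of an episode is an assumption rather than a documented detail of the Atari Zoo.

```python
from collections import deque


class FrameStack:
    """Keeps the k most recent frames (hypothetical helper for illustration)."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Assumption: at episode start the first frame is repeated k times.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def push(self, frame):
        # The deque drops the oldest frame automatically once it is full.
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        """Frames ordered from oldest to newest."""
        return list(self.frames)


def ram_to_bits(ram_bytes):
    """Unpack RAM bytes into a flat list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in ram_bytes for shift in range(7, -1, -1)]


if __name__ == "__main__":
    stack = FrameStack()
    stack.reset("frame0")
    print(stack.push("frame1"))
    print(len(ram_to_bits(bytes(128))))  # 128 bytes of RAM give 1024 bits
```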
3.4 Data Collection
All algorithms are run across 55 Atari games, for at least three independent random weight initializations. Regular checkpoints were taken during training; after training, the checkpoints that best fit each of the desired criteria (e.g. 400 million frames or human-level performance) were frozen and included in the zoo. The advantage of this post-hoc culling is that additional criteria can be added in the future, e.g. if Atari Zoo users introduce a new use case, because the original checkpoints are archived. Log files were stored and converted into a common format that will also be released with the models, to aid future performance curve comparisons for other researchers. Each frozen model was run post-hoc in ALE for timesteps to generate cached behavior of policies in their training environment, which includes the raw game frames, the processed four-frame observations, RAM states, and high-level representations (e.g. neural representations at hidden layers). As a result, it is possible to do meaningful analysis without ever running the models themselves.
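The post-hoc culling described above amounts to picking, from the archived checkpoints, the one that best matches each criterion. The sketch below illustrates one way this could work; the record fields (`frames`, `hours`, `score`), the helper names, and the toy values are all hypothetical and do not reflect the actual Atari Zoo log format or results.

```python
def closest_checkpoint(checkpoints, key, target):
    """Return the checkpoint whose `key` value is nearest to `target`."""
    return min(checkpoints, key=lambda ckpt: abs(ckpt[key] - target))


def first_reaching(checkpoints, key, threshold):
    """Return the earliest checkpoint whose `key` value meets `threshold`, or None."""
    for ckpt in sorted(checkpoints, key=lambda c: c["frames"]):
        if ckpt[key] >= threshold:
            return ckpt
    return None


# Toy checkpoint records, invented purely for illustration.
CHECKPOINTS = [
    {"frames": 0, "hours": 0.0, "score": 0.0},
    {"frames": 150_000_000, "hours": 1.1, "score": 300.0},
    {"frames": 390_000_000, "hours": 2.9, "score": 900.0},
    {"frames": 1_000_000_000, "hours": 7.5, "score": 2500.0},
]

if __name__ == "__main__":
    print(closest_checkpoint(CHECKPOINTS, "frames", 400_000_000))
    print(closest_checkpoint(CHECKPOINTS, "hours", 1.0))
    print(first_reaching(CHECKPOINTS, "score", 1000.0))
```

Because the selection is a pure function of the archived records, a new criterion only requires another call over the same list, which matches the motivation given for archiving the original checkpoints.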
4 Quantitative Analysis
The open-source software package released with the acceptance of this work provides an interface to the Atari Zoo dataset, and implements several common modes of analysis. Models can be downloaded with a single line of code; and other single-line invocations interface directly with ALE and return the behavioral outcome of executing a model’s policy, or create movies of agents superimposed with neural activation, or access convolutional weight tensors. In this section, we demonstrate analyses the Atari Zoo software can facilitate, and highlight some of its built-in features. For many of the analyses below, for computational simplicity we study results in a representative subset of 13 ALE games used by prior research Such et al. (2017), which we refer to here as the analysis subset of games.
4.1 Convolutional Filter Analysis
While reasoning about what a DNN has learned by looking at its weights is difficult, weights directly connected to the input can often be interpreted. For example, from visualizing the weights of the first convolutional layer in a vision model, Gabor-like edge detection filters are nearly always present Yosinski et al. (2014). An interesting question is whether Gabor-like filters also arise when DRL algorithms are trained from pixel input, as in the ALE. In visualizing filters across games and DRL algorithms, edge-detector-like features sometimes arise in the gradient-based methods, but they are seemingly never as crisp as in vision models; this may be because ALE lacks the visual complexity of natural images. In contrast, the filters in the evolutionary models appear to have less regularity. Representative examples across games and algorithms are shown in supplemental figure S1.
Learned filters commonly appear to be tiled similarly across time (i.e. across the four DNN input frames), with past frames having lower-intensity weights. One explanation is that reward gradients are more strongly influenced by present observations. To explore this more systematically, across games and algorithms we examined the absolute magnitude of filter weights connected to the present frame versus the past. Interestingly, in contrast to the gradient-based methods, the evolutionary methods show no discernible preference across time (supplemental figure S2), again suggesting that their learning is qualitatively different from the gradient-based methods. A rigorous information-theoretic approximation of memory usage is explored in the context of DQN in Dann et al. (2016); our measure correlates well with theirs despite the relative simplicity of exploring only filter weight strength (supplemental section S1.1).
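The weight-magnitude comparison across input frames can be sketched as follows. This is a minimal illustration, not the paper's analysis code: the nested-list layout `[row][col][frame][filter]`, the function name, and the toy weights are assumptions, as is the convention that the last frame index is the most recent frame.

```python
def mean_abs_weight_per_frame(weights):
    """Mean absolute first-layer weight for each input frame.

    `weights` is a nested list indexed as [row][col][frame][filter];
    this layout is assumed for illustration.
    """
    n_frames = len(weights[0][0])
    totals = [0.0] * n_frames
    counts = [0] * n_frames
    for row in weights:
        for cell in row:
            for frame_idx, per_filter in enumerate(cell):
                totals[frame_idx] += sum(abs(w) for w in per_filter)
                counts[frame_idx] += len(per_filter)
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    # Toy 1x1 kernel over four frames and two filters, invented for illustration.
    toy = [[[[0.1, -0.1], [0.2, -0.2], [0.4, -0.4], [0.8, -0.8]]]]
    print(mean_abs_weight_per_frame(toy))
```

A profile that rises toward the most recent frame would correspond to the preference for present observations described above, while a flat profile would correspond to no preference across time.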
4.2 Robustness to Observation Noise
An important property is how agents perform in slightly out-of-distribution (OOD) situations; ideally they would not catastrophically fail in the face of nominal change. While it is not easy to flexibly alter the dynamics of the environment in ALE (without learning how to program in 6502 assembly code), it is possible to systematically distort observations. Here we explore one simple OOD change to observations by adding increasingly severe noise to the four-frame observations input to DNN-based agents, and observe how their evaluated performance in ALE degrades. The motivation is to discover whether some learning algorithms are learning more robust policies than others. The results show that with some caveats, methods with a direct representation of the policy may be more robust to observation noise (supplemental figure S4). A similar study was conducted for robustness to parameter noise (supplemental section S1.2), but there was no clear trend in the data.
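The noise experiment can be sketched as a wrapper that perturbs each observation before the policy sees it, swept over increasing noise levels. The text does not specify the noise distribution, so zero-mean Gaussian noise with clipping to [0, 1] is an assumption here; `policy` and `evaluate` are hypothetical callables standing in for a trained agent and an ALE evaluation loop.

```python
import random


def add_noise(observation, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel and clip to [0, 1]."""
    return [min(1.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in observation]


def with_observation_noise(policy, sigma, rng):
    """Wrap `policy` so that it only ever sees noisy observations."""
    def noisy_policy(observation):
        return policy(add_noise(observation, sigma, rng))
    return noisy_policy


def robustness_curve(policy, evaluate, sigmas, seed=0):
    """Map each noise level to the score returned by `evaluate` (hypothetical)."""
    rng = random.Random(seed)
    return {sigma: evaluate(with_observation_noise(policy, sigma, rng)) for sigma in sigmas}


if __name__ == "__main__":
    # Toy stand-ins: the policy thresholds the pixel sum, and the evaluation
    # queries the policy on a single flattened observation.
    toy_policy = lambda obs: int(sum(obs) > 1.0)
    toy_evaluate = lambda pol: pol([0.5, 0.5, 0.5])
    print(robustness_curve(toy_policy, toy_evaluate, [0.0, 0.1, 0.5]))
```

Keeping the perturbation in a wrapper leaves the agent and the environment untouched, so the same sweep can be applied unchanged to models from different algorithms.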
4.3 Distinctiveness of Policies Learned by Algorithms
To explore the distinctive signature of solutions discovered by different DRL algorithms, we train image classifiers to identify the generating DRL algorithm given states sampled from independent runs of each algorithm (details are in supplemental section S1.3). Supplemental figure S7 shows the confusion matrix for Seaquest, wherein the policy search methods (A2C, ES, and GA) have the most inter-class confusion, reflecting (as confirmed in later sections) that these algorithms tend to converge to the same sub-optimal behavior in this game; results are qualitatively similar when tabulated across the analysis subset of games (supplemental figure S8), and quantitatively reveal that Skiing is a particularly idiosyncratic game (supplemental table S1).
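For readers unfamiliar with the tabulation, the sketch below shows how a confusion matrix over algorithm labels could be computed from a classifier's predictions. The labels and predictions are toy values invented for illustration and are not results from the paper; the classifier itself is out of scope here.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are the true generating algorithm, columns the predicted one."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


if __name__ == "__main__":
    classes = ["A2C", "ES", "GA"]
    # Toy labels, invented purely for illustration.
    truth = ["A2C", "A2C", "ES", "GA"]
    predicted = ["A2C", "ES", "ES", "ES"]
    for name, row in zip(classes, confusion_matrix(truth, predicted, classes)):
        print(name, row)
```

Large off-diagonal counts between two algorithms indicate that the classifier cannot tell their visited states apart, which is the sense of inter-class confusion used above.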
This section highlights the Atari Zoo’s visualization capabilities, which enable quickly and systematically exploring how policies vary across runs and algorithms for a given game. The tools can be split into three broad categories: direct policy visualization, dimensionality reduction, and neuron activation maximization. In the future we will add additional tools, e.g. saliency maps Greydanus et al. (2017); Zahavy et al. (2016).
5.1 Inspecting Policy Behavior and Neural Activations through Animations in ALE
To quickly survey the kinds of solutions being learned, our software can generate grids of videos, where one axis in the grid covers different DRL algorithms, and the other axis covers independent runs of the algorithm. Such videos can highlight when different algorithms are converging to the same local optimum (e.g. supplemental figure S9 shows a situation where this is the case for A2C, ES, and the GA; video: http://t.uber.com/atarizoo_rolloutgrid).
To enable investigating the internal workings of the DNN, our software generates movies that display activations of all neurons alongside animated frames of the agent acting in the game. This approach is inspired by the deep visualization toolbox Yosinski et al. (2015), but put into a DRL context. Supplemental figure S10 shows how this tool can lead to recognizing the functionality of particular high-level features (video: http://t.uber.com/atarizoo_activation); in particular, it helped to identify a submarine-detecting neuron on the third convolution layer of an Ape-X agent. Note that for ES and GA, no such specialized neuron was found; activations seemed qualitatively more distributed for those methods.
5.2 Finding Image Patches from Observations that Maximally Excite Convolution Filters
One automated technique to discover what information is relevant to a particular convolutional filter is to find which image patches evoke the highest magnitude activations from it. Given a trained DRL agent and a target convolution filter to analyze, observations from the agent interacting with its ALE training environment are input to the agent’s DNN, and resulting maps of activations from the filter of interest are stored. These maps are sorted by the single maximum activation within them, and the geometric location within the map of that maximum activation is recorded. Then, for each of these top-most activations, the specific image patch from the observation that generated it is identified and displayed, by taking the receptive field of the filter into account (i.e. modulated by both the stride and size of the convolutional layers). As a sanity check, we validate that the neuron identified in the previous section does indeed maximally fire for submarines (figure 2).
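The mapping from an activation's location back to an image patch can be sketched with standard receptive-field arithmetic for unpadded convolutions. The function names are hypothetical, and the strides (4, 2, 1) used in the example are an assumption based on Mnih et al. (2015) rather than something stated in this section.

```python
def argmax_2d(activation_map):
    """Return (row, col) of the maximum value in a 2D list."""
    best = max(
        (value, r, c)
        for r, row in enumerate(activation_map)
        for c, value in enumerate(row)
    )
    return best[1], best[2]


def receptive_field(layers):
    """Return (size, jump) in input pixels after a stack of unpadded convolutions.

    `layers` is a list of (kernel, stride) pairs ordered from input to output.
    `jump` is the input-pixel distance between adjacent units in the output map.
    """
    size, jump = 1, 1
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
    return size, jump


def patch_bounds(row, col, layers):
    """Input bounds (top, left, bottom, right) for one unit; bottom/right exclusive."""
    size, jump = receptive_field(layers)
    top, left = row * jump, col * jump
    return top, left, top + size, left + size


if __name__ == "__main__":
    # Kernel sizes from the text; strides assumed from Mnih et al. (2015).
    conv_stack = [(8, 4), (4, 2), (3, 1)]
    row, col = argmax_2d([[0.1, 0.9], [0.3, 0.2]])
    print(receptive_field(conv_stack))
    print(patch_bounds(row, col, conv_stack))
```

Under these assumptions a third-layer unit covers a 36x36 region of the 84x84 observation, so the displayed patch is a crop of that size at the computed bounds.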
5.3 Dimensionality Reduction
Dimensionality reduction provides another view on DRL agent behavior; in DRL research it is common to generate t-SNE plots Maaten and Hinton (2008) of agent DNN representations that summarize evaluation in the domain Mnih et al. (2015). Our software includes such an implementation (supplemental figure S12).
However, such an approach relies on embedding the high-level representation of one agent; it is unclear how to apply it to create an embedding appropriate for comparisons of different independent runs of the same algorithm, or runs from different DRL algorithms. As an initial approach, we implement an embedding based on the Atari RAM representation (which is the same across algorithms and runs, but distinct between games). Like the grid view of agent behaviors and the state-distinguishing classifier, this t-SNE tool provides high-level information from which to compare runs of or between different algorithms (figure 3); details of this approach are provided in supplemental section S2.1.
5.4 Generating Synthetic Inputs to Understand Neurons
While the previous sections explore DNN activations in the context of an agent’s training environment, another approach is to optimize synthetic input images that stimulate particular DNN neurons. Variations on this approach have yielded striking results in vision models Nguyen et al. (2017, 2016a); Olah et al. (2018); Simonyan et al. (2013); the hope is that these techniques could yield an additional view on DRL agents’ neural representations. To enable this analysis, we leverage the Lucid visualization library Luc (2018); in particular, we create wrapper classes that enable easy integration of Atari Zoo models into Lucid, and release Jupyter notebooks that generate synthetic inputs for different DRL models.
We now present a series of synthetic inputs generated by the Lucid library across a handful of games that highlight the potential of these kinds of visualizations for DRL understanding (further details of the technique used are described in supplemental section S2.2). We first explore the kinds of features learned across depth. Supplemental figure S13 supports what was learned by visualizing the first-layer filter weights for value-based networks (section 4.1; i.e. showing that first convolution layers in the value-based networks appear to be learning edge-detector features). The activation videos of section 5.1 and the patch-based approach of section 5.2 help to provide grounding, showing that in the context of the game, some first-layer filters detect the edges of the screen, in effect serving as location anchors, while others encode concepts like blinking objects (see figure S11). Supplemental figure S14 explores visualizing later-layer convolution filters, and figure 4 shows inputs synthesized to maximize output neurons, which sometimes yields interpretable features.
Such visualizations can also reveal that critical features are being attended to (figure 5 and supplemental figure S15). Overall, these visualizations demonstrate the potential of this kind of technique, and we believe that many useful further insights may result from a more systematic application and investigation of this and many of the other interesting visualization techniques implemented by Lucid, which can now easily be applied to Atari Zoo models. Also promising would be to further explore regularization to constrain the space of synthetic inputs, e.g. a generative model of Atari frames in the spirit of Nguyen et al. (2017) or similar works.
6 Discussion and Conclusions
There are many follow-up extensions that the initial explorations of the zoo raise. One natural extension is to include more DRL algorithms; we would like to include agents from IMPALA Espeholt et al. (2018) or other popular policy-gradients algorithms (like TRPO Schulman et al. (2015) or PPO Schulman et al. (2017)), to balance out the distribution of DRL algorithms. Beyond algorithms, there are many alternate architectures that might have interesting effects on representation and decision-making, for example recurrent architectures Hausknecht and Stone (2015), or architectures that exploit attention Sorokin et al. (2015). Also intriguing is examining the effect of the incentive driving search: Do auxiliary or substitute objectives qualitatively change DRL representations, e.g. as in UNREAL Jaderberg et al. (2016), curiosity-driven exploration Pathak et al. (2017), or novelty search Conti et al. (2017)? How do the representations and features of meta-learning agents such as RL Wang et al. (2016) or MAML Finn et al. (2017) change as they learn a new task? Finally, there are other analysis tools that could be implemented, which might illuminate other interesting properties of DRL algorithms and learned representation, e.g. the image perturbation analysis of Greydanus et al. (2017) or a variety of sophisticated neuron visualization techniques Nguyen et al. (2017, 2016b). We welcome community contributions for these algorithms, models, architectures, incentives, and tools.
While the main motivation for the zoo was to reduce friction for research into understanding and visualizing the behavior of DRL agents, it can also serve as a platform for other research questions. For example, having a zoo of agents trained on individual games, for different amounts of data, would also reduce friction for exploring transfer learning within Atari Sobol et al. (2018), i.e. whether experience gained on one game can quickly benefit learning on another game. Also, by providing a huge library of cached rollouts for agents across algorithms, the zoo may be interesting in the context of learning from demonstrations Hester et al. (2017), or for creating generative models of games Oh et al. (2015). In conclusion, we look forward to seeing how this dataset will be used by the community at large.
- Luc  Lucid: A collection of infrastructure and tools for research in neural network interpretability. https://github.com/tensorflow/lucid, 2018.
- Annasamy and Sycara  Raghuram Mandyam Annasamy and Katia Sycara. Towards better interpretability in deep q-networks. arXiv preprint arXiv:1809.05630, 2018.
- Bellemare et al.  Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Bellemare et al.  Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
- Bellemare et al.  Marc G. Bellemare, Pablo Samuel Castro, Carles Gelada, Saurabh Kumar, and Subhodeep Moitra. Dopamine. GitHub, GitHub repository, 2018. URL https://github.com/google/dopamine.
- Chollet et al.  François Chollet et al. Keras applications. https://keras.io/applications/, 2015.
- Conti et al.  Edoardo Conti, Vashisht Madhavan, Felipe Petroski Such, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv preprint arXiv:1712.06560, 2017.
- Dann et al.  Christoph Dann, Katja Hofmann, and Sebastian Nowozin. Memory lens: How much memory does an agent use? arXiv preprint arXiv:1611.06928, 2016.
- Dhariwal et al.  Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Openai baselines. GitHub, GitHub repository, 2017.
- Erhan et al.  Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.
- Espeholt et al.  Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
- Finn et al.  Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
- Greydanus et al.  Sam Greydanus, Anurag Koul, Jonathan Dodge, and Alan Fern. Visualizing and understanding atari agents. arXiv preprint arXiv:1711.00138, 2017.
- Hausknecht and Stone  Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. CoRR, abs/1507.06527, 7(1), 2015.
- Hessel et al.  Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
- Hester et al.  Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.
- Horgan et al.  Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
- Iyer et al.  Rahul Iyer, Yuezhang Li, Huao Li, Michael Lewis, Ramitha Sundar, and Katia Sycara. Transparency and explanation in deep reinforcement learning neural networks. 2018.
- Jaderberg et al.  Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
- Jia et al.  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
- LeCun et al.  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
- Lehman et al.  Joel Lehman, Jay Chen, Jeff Clune, and Kenneth O Stanley. Es is more than just a traditional finite-difference approximator. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 450–457. ACM, 2018.
- Li et al.  Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John E Hopcroft. Convergent learning: Do different neural networks learn the same representations? In FE@ NIPS, pages 196–212, 2015.
- Maaten and Hinton  Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
- Mahendran and Vedaldi  Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120(3):233–255, 2016.
- Mikolov et al.  Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
- Mnih et al.  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
- Mordvintsev et al.  Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Deepdream-a code example for visualizing neural networks. Google Research, 2:5, 2015.
- Mordvintsev et al.  Alexander Mordvintsev, Nicola Pezzotti, Ludwig Schubert, and Chris Olah. Differentiable image parameterizations. Distill, 3(7):e12, 2018.
- Nguyen et al. [2016a] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems, pages 3387–3395, 2016a.
- Nguyen et al. [2016b] Anh Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. arXiv preprint arXiv:1602.03616, 2016b.
- Nguyen et al.  Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, volume 2, page 7, 2017.
- Oh et al.  Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pages 2863–2871, 2015.
- Olah et al.  Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.
- Olah et al.  Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The building blocks of interpretability. Distill, 3(3):e10, 2018.
- Paszke et al.  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Torchvision models. https://pytorch.org/docs/stable/torchvision/models.html, 2017.
- Pathak et al.  Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
- Pennington et al.  Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
- Raghu et al.  Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In Advances in Neural Information Processing Systems, pages 6076–6085, 2017.
- Salimans et al.  Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
- Schulman et al.  John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
- Schulman et al.  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Silberman and Guadarrama  Nathan Silberman and Sergio Guadarrama. Tensorflow-slim image classification model library. https://github.com/tensorflow/models/tree/master/research/slim, 2016.
- Simonyan et al.  Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Sobol et al.  Doron Sobol, Lior Wolf, and Yaniv Taigman. Visual analogies between atari games for studying transfer learning in rl. 2018.
- Sorokin et al.  Ivan Sorokin, Alexey Seleznev, Mikhail Pavlov, Aleksandr Fedorov, and Anastasiia Ignateva. Deep attention recurrent q-network. arXiv preprint arXiv:1512.01693, 2015.
- Such et al.  Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Deep neuroevolution: genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.
- Wang et al.  Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Wang et al.  Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
- Yosinski et al.  Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
- Yosinski et al.  Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579, 2015.
- Zahavy et al.  Tom Zahavy, Nir Ben-Zrihem, and Shie Mannor. Graying the black box: Understanding dqns. In International Conference on Machine Learning, pages 1899–1908, 2016.
- Zeiler and Fergus  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
The following sections contain additional figures, and describe in more detail the experimental setups applied in the paper’s experiments.
S1 Quantitative Analysis Details
S1.1 Further Study of Temporal Bias in DQN
As an exploration of the connection between the information theoretic measure of memory-dependent action in Dann et al.  and the pattern highlighted in this paper (i.e. the strength of filter weights in the first layer of convolutions may highlight a network’s reliance on the past), we examined first-layer filters in DQN across all 55 games. A simple metric of present-focus is the ratio of average weight magnitudes for the past three frames to the present frame. When sorted by this metric (see figure S3), there is high agreement with the 8 games identified by Dann et al.  that use memory. In particular, three out of the top four games identified by our metric align with theirs; as do six out of the top twelve games, considered among the games that overlap between their 49 and our 55.
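The ratio described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used for the paper: the function name `past_to_present_ratio`, the weight layout `(kernel_h, kernel_w, 4, n_filters)`, and the convention that the last input channel holds the present frame are all assumptions.

```python
import numpy as np


def past_to_present_ratio(first_layer_weights, present_channel=-1):
    """Mean |weight| over the three past frames divided by mean |weight| for the present frame.

    Assumes weights are laid out as (kernel_h, kernel_w, 4, n_filters) and that
    `present_channel` indexes the most recent frame in the four-frame stack.
    """
    w = np.abs(np.asarray(first_layer_weights, dtype=np.float64))
    if w.ndim != 4 or w.shape[2] != 4:
        raise ValueError("expected weights of shape (kernel_h, kernel_w, 4, n_filters)")
    # Average magnitude per input frame, pooled over spatial positions and filters.
    per_frame = w.mean(axis=(0, 1, 3))
    present = per_frame[present_channel]
    past = np.delete(per_frame, present_channel).mean()
    return float(past / present)
```

Given a hypothetical dictionary `ratios` mapping game names to this value, `sorted(ratios, key=ratios.get, reverse=True)` would list the games whose first-layer filters weight past frames most heavily first.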
S1.2 Observation and Parameter Noise Details
The figure shows how the performance of trained policies degrades with increasingly severe normally-distributed noise, averaged over three independent runs across the analysis subset of games. Degradation is shown (a) relative to the baseline performance of that algorithm on each game, and (b) relative to the best performance of any algorithm on each game. Zero performance in this chart represents random play. The conclusion is that the policy search algorithms show less steep degradation relative to (a) their own best performance, although this is confounded by (b) the overall better absolute performance of the value-based methods. Follow-up analysis will control for performance by using the Atari Zoo human-performance frozen models.
Figure S4 shows robustness to observation noise for games in the analysis subset. Beyond observation noise, another interesting property of learning algorithms is the kind of local optimum they find in the parameter space, i.e. whether the learned function is smooth or not in the area of a found solution. One gross tool for examining this property is to test the robustness of policies to parameter perturbations. It is plausible that the evolutionary methods would be more robust in this way, given that they are trained through parameter perturbations. To measure this, we perturb the convolutional weights of each DRL agent with increasingly severe normally-distributed noise. We perturbed the convolutional weights only, because that part of the DNN is identical across agents, whereas the fully-connected layers sometimes vary in structure across DRL algorithms (e.g. value-based algorithms like Rainbow or Ape-X that include features that require post-convolutional architectural changes). Figure S5 shows the results across games and algorithms; however, no strong trend is evident.
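A minimal sketch of such a parameter perturbation is given below, assuming the agent's parameters are available as a dictionary of NumPy arrays. The function name `perturb_conv_weights` and the convention that convolutional parameters have names starting with `"conv"` are hypothetical and not taken from the Atari Zoo package; whether biases are perturbed along with kernels is likewise an assumption of this sketch.

```python
import numpy as np


def perturb_conv_weights(params, sigma, rng, conv_prefix="conv"):
    """Return a copy of `params` with Gaussian noise added to convolutional parameters only.

    `params` maps parameter names to arrays. Names starting with `conv_prefix`
    (a hypothetical naming convention) are perturbed with N(0, sigma^2) noise;
    all other parameters, e.g. fully-connected layers, are copied unchanged.
    """
    noisy = {}
    for name, value in params.items():
        value = np.asarray(value, dtype=np.float64)
        if name.startswith(conv_prefix):
            noisy[name] = value + rng.normal(0.0, sigma, size=value.shape)
        else:
            noisy[name] = value.copy()
    return noisy
```

Evaluating the perturbed policy for a range of increasing `sigma` values, each with a fresh `np.random.default_rng(seed)`, would trace out a degradation curve of the kind the robustness figures report.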
S1.3 Distinctiveness of Policies Learned by Algorithms
We use only the “present” channel of each gray-scale observation frame (i.e. without the complete four-frame stack) to train a classifier for each game. The classifier consists of two convolution layers and two fully connected layers, and is trained with early stopping to avoid overfitting. For each game, 2501 frames are collected from multiple evaluations by each model. The reported classification results use 20% of the frames as test set. Figure S6 summarizes classification performance across models and games, while table S1 shows performance averaged across games (highlighting that Skiing is an outlier in terms of classifier performance).
S2 Visualization Details
This section provides more details and figures for the visualization portion of the paper’s analysis. Figure S9 shows one frame of a collage of simultaneous videos that give a quick high-level comparison of how different algorithms and runs are solving an ALE environment. Figure S9 shows one frame of a video that simultaneously shows a DNN agent acting in an ALE environment and all of the activations of its DNN.
Figure S11 shows a second example of how the image-patch finder can help ground out what particular DNN neurons are learning.
S2.1 t-SNE Details
To visualize RAM states and high-level DNN representations in 2D, principal component analysis (PCA) is first applied to reduce the number of dimensions to 50, followed by 3000 t-SNE iterations with perplexity of 30. The dimensionality reduction of RAM states is applied across all available runs of multiple algorithms, while that of high-level DNN representations is with respect to a specific model of a given algorithm.
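The PCA preprocessing stage of this pipeline can be sketched with NumPy alone. This is an illustrative sketch rather than the paper's implementation: `pca_reduce` is a hypothetical helper, and the `TSNE_SETTINGS` dictionary simply records the values stated above under made-up key names that are not tied to any particular t-SNE library.

```python
import numpy as np

# Settings stated in the text; the key names are illustrative only.
TSNE_SETTINGS = {"output_dims": 2, "perplexity": 30, "iterations": 3000}


def pca_reduce(states, n_components=50):
    """Project row vectors (e.g. RAM states or DNN features) onto their top principal components."""
    x = np.asarray(states, dtype=np.float64)
    # Center each feature so the SVD yields principal axes of the covariance.
    x = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    k = min(n_components, vt.shape[0])
    return x @ vt[:k].T
```

The reduced array would then be passed to a t-SNE implementation of choice, configured with the perplexity and iteration count recorded in `TSNE_SETTINGS`.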
S2.2 Synthetic Input Generation Details
We use the lucid library Luc  to visualize what types of inputs maximize neuron activations throughout the agents’ networks. This study used the trained checkpoints provided by Dopamine Bellemare et al.  for DQN and Rainbow (although it could be applied to any of the DRL algorithms in the Atari Zoo). These frozen graphs are then loaded as part of a Lucid model and an optimization objective is created.
An input pattern to the network (consisting of a stack of four 84x84 pixel screens) is optimized to maximize the activations of the desired neurons. The four 84x84 frames are initialized with random noise. Ideally, optimization yields visualizations that reveal qualitatively what features the neurons have learned to capture. As recommended in Olah et al.  and Mahendran and Vedaldi , we apply some regularization to produce clearer results; for most images we use only image jitter (i.e. randomly offsetting the input image by a few pixels to encourage local translation invariance). For some images, we found it helpful to add total variation regularization (to encourage local smoothness; see Mahendran and Vedaldi ) and L1 regularization (to encourage pixels that are not contributing to the objective to become zero) on the optimized image.
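The three regularizers mentioned above can be written down compactly. The sketch below is not Lucid's implementation: the function names are hypothetical, the jitter is simplified to a wrap-around shift, the total variation is the anisotropic (absolute-difference) form, and the penalty weights are free parameters whose values the text does not specify.

```python
import numpy as np


def jitter(frames, max_shift, rng):
    """Randomly offset a (4, 84, 84) frame stack by up to `max_shift` pixels per spatial axis.

    Simplification: pixels shifted past one edge wrap around to the other.
    """
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(frames, shift=(int(dy), int(dx)), axis=(-2, -1))


def total_variation(frames):
    """Sum of absolute differences between vertically and horizontally adjacent pixels."""
    dh = np.abs(np.diff(frames, axis=-2)).sum()
    dw = np.abs(np.diff(frames, axis=-1)).sum()
    return float(dh + dw)


def l1_penalty(frames):
    """Sum of absolute pixel values; pushes non-contributing pixels toward zero."""
    return float(np.abs(frames).sum())


def regularized_objective(activation, frames, tv_weight=0.0, l1_weight=0.0):
    """Quantity to maximize: neuron activation minus the weighted penalties on the image."""
    return activation - tv_weight * total_variation(frames) - l1_weight * l1_penalty(frames)
```

In an optimization loop of this kind, the jittered stack would be fed to the network to obtain `activation`, while the penalties are computed on the image being optimized.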
S3 DRL Algorithm Details and Hyperparameters
This section describes the implementations and hyperparameters used for training the models released with the zoo. The DQN and Rainbow models come from the Dopamine model release Bellemare et al. (the hyperparameters and training details can be found at https://github.com/google/dopamine/tree/master/baselines/). The following sections describe the algorithms for the newly-trained models released with this paper.
The implementation of A2C Mnih et al.  that generated the models in this paper was derived from the OpenAI baselines software package Dhariwal et al. . It ran with 20 parallel worker threads for 400 million frames; checkpoints occurred every 4 million frames. Hyperparameters are listed in table S2.
Table S2: A2C hyperparameters.
|Hyperparameter|Value|
|---|---|
|Value function loss coefficient|0.5|
|Entropy loss coefficient|0.01|
The implementation of Ape-X used to generate the models in this paper can be found here: https://github.com/uber-research/ape-x. The hyperparameters are reported in Table S3.
Table S3: Ape-X hyperparameters.
|Hyperparameter|Value|
|---|---|
|Number of actors|384|
|Target network period|2500|
|Prioritized replay|(0.6, 0.4)|
|Adam learning rate|0.00025 / 4|
The implementation of GA used to generate the models in this paper can be found here: https://github.com/uber-research/deep-neuroevolution. The hyperparameters are reported in Table S4 and were found through random search.
The implementation of ES used to generate the models in this paper can be found here: https://github.com/uber-research/deep-neuroevolution. The hyperparameters reported in Table S5 were found via preliminary search and are similar to those reported in Conti et al. .