An Atari Model Zoo for Analyzing, Visualizing, and Comparing Deep Reinforcement Learning Agents

12/17/2018 · Felipe Petroski Such, et al. · Uber

Much human and computational effort has aimed to improve how deep reinforcement learning algorithms perform on benchmarks such as the Atari Learning Environment. Comparatively less effort has focused on understanding what has been learned by such methods, and investigating and comparing the representations learned by different families of reinforcement learning (RL) algorithms. Sources of friction include the onerous computational requirements, and general logistical and architectural complications for running Deep RL algorithms at scale. We lessen this friction by (1) training several algorithms at scale and releasing trained models, (2) integrating with a previous Deep RL model release, and (3) releasing code that makes it easy for anyone to load, visualize, and analyze such models. This paper introduces the Atari Zoo framework, which contains models trained across benchmark Atari games, in an easy-to-use format, as well as code that implements common modes of analysis and connects such models to a popular neural network visualization library. Further, to demonstrate the potential of this dataset and software package, we show initial quantitative and qualitative comparisons between the performance and representations of several deep RL algorithms, highlighting interesting and previously unknown distinctions between them.


1 Introduction

Since its introduction the Atari Learning Environment (ALE; Bellemare et al. (2013)) has been an important testbed in reinforcement learning (RL). ALE enables easily evaluating RL algorithms on more than 50 emulated Atari games spanning diverse gameplay styles, providing a window on such algorithms’ generality. Indeed, surprisingly strong results in ALE with deep neural networks (DNNs; LeCun et al. (2015)), published in Nature (Mnih et al., 2015), greatly contributed to the current popularity of deep reinforcement learning (DRL).

Like other machine learning benchmarks, much effort aims to quantitatively improve state-of-the-art (SOTA) scores. As the DRL community grows, a paper pushing SOTA is likely to attract significant interest and accumulate citations. While improving performance is important, it is equally important to understand what DRL algorithms learn, how they process and represent information, and what are their properties, strengths, and weaknesses. These questions cannot be answered through simple quantitative measurements of performance across the ALE suite of games.

In comparison to pushing SOTA, much less work has focused on understanding, interpreting, and visualizing the products of DRL; in particular, there is a dearth of research that compares DRL algorithms across dimensions other than performance. The main focus of this paper is to help alleviate the considerable friction for those looking to rigorously understand the qualitative behavior of DRL agents. Three main sources of such friction are: (1) the significant computational resources required to run DRL at scale, (2) the logistical tedium of plumbing the products of different DRL algorithms into a common interface, and (3) the wasted effort in re-implementing standard analysis pipelines (like t-SNE embeddings of the state space Maaten and Hinton (2008), or activation maximization for visualizing what neurons in a model represent Erhan et al. (2009); Olah et al. (2018); Nguyen et al. (2017); Simonyan et al. (2013); Yosinski et al. (2015); Nguyen et al. (2016a, b); Mahendran and Vedaldi (2016)). To address these frictions, this paper introduces the Atari Zoo, a release of trained models spanning major families of DRL algorithms, and an accompanying open-source software package (available at: http://t.uber.com/atarizoo) that enables their easy analysis, comparison, and visualization (and similar analysis of future models). In particular, this package enables easily downloading particular frozen models of interest from the zoo on-demand, further evaluating them in their training environment or modified environments, generating visualizations of their neural activity, exploring compressed visual representations of their behavior, and creating synthetic input patterns that reveal what particular neurons most respond to.

To demonstrate the promise of this model zoo and software, this paper presents an initial analysis of the products of six DRL algorithms spanning policy gradient, value-based, and evolutionary methods: A2C (policy-gradient; Mnih et al. (2016)), DQN (value-based; Mnih et al. (2015)), Rainbow (value-based; Hessel et al. (2017)), Ape-X (value-based; Horgan et al. (2018)), ES (evolutionary; Salimans et al. (2017)), and Deep GA (evolutionary; Such et al. (2017)). The analysis illuminates differences in learned policies across methods that are independent of raw score performance, highlighting the benefit of going beyond simple quantitative measures and of having a unifying software framework that enables analyses with multiple, different, complementary techniques and applying them across many RL algorithms.

2 Background

2.1 Visualizing Deep Networks

One line of DNN research focuses on visualizing the internal dynamics of a DNN Yosinski et al. (2015) or examines what particular neurons detect or respond to Erhan et al. (2009); Zeiler and Fergus (2014); Olah et al. (2018); Nguyen et al. (2017); Simonyan et al. (2013); Yosinski et al. (2015); Mahendran and Vedaldi (2016). The hope is to gain more insight into how DNNs are representing information, motivated both to enable more interpretable decisions from these models Olah et al. (2018) and to illuminate previously unknown properties about DNNs Yosinski et al. (2015). For example, through live visualization of all activations in a vision network responding to different images, Yosinski et al. (2015) highlighted that representations were often surprisingly local (as opposed to distributed), e.g. one convolutional filter proved to be a reliable face detector. One practical value of such insights is that they can catalyze future research Li et al. (2015). The Atari Zoo enables animations in the spirit of Yosinski et al. (2015) that show an agent’s activations as it interacts with a game, and also enables creating synthetic inputs via activation maximization Erhan et al. (2009); Zeiler and Fergus (2014); Olah et al. (2018); Nguyen et al. (2017); Simonyan et al. (2013); Yosinski et al. (2015); Mahendran and Vedaldi (2016), specifically by connecting DRL agents to the visualization package from Olah et al. (2018).

2.2 Understanding Deep RL

While much more visualization and understanding work has been done for vision models than for DRL, a few papers directly focus on understanding DRL agents (Greydanus et al., 2017; Zahavy et al., 2016), and many others feature some analysis of DRL agent behavior (often in the form of t-SNE diagrams of the state space; see Mnih et al. (2015)). One approach to understanding DRL agents is to investigate the learned features of models (Greydanus et al., 2017; Zahavy et al., 2016). For example, Zahavy et al. (2016) visualize what pixels are most important to an agent’s decision by using gradients of decisions with respect to pixels. Another approach is to modify the DNN architecture or training procedure such that a trained model will have more interpretable features (Iyer et al., 2018; Annasamy and Sycara, 2018). For example, Annasamy and Sycara (2018) augment a model with an attention mechanism and a reconstruction loss, hoping to produce interpretable explanations as a result.

The software package released here is in the spirit of the first paradigm. It facilitates understanding the most commonly applied architectures instead of changing them, although it is designed also to accommodate importing in new vision-based DRL models, and could thus be also used to analyze agents explicitly engineered to be interpretable. In particular, the package enables re-exploring many past DRL analysis techniques at scale, and across algorithms, which were previously applied only for one algorithm and across only a handful of hand-selected games.

2.3 Model Zoos

A useful mechanism for reducing friction for analyzing and building upon models is the idea of a model zoo, i.e. a repository of pre-trained models that can easily be further investigated, fine-tuned, and/or compared (e.g. by looking at how their high-level representations differ). For example, the Caffe website includes a model zoo with many popular vision models Jia et al. (2014), as do TensorFlow Silberman and Guadarrama (2016), Keras Chollet et al. (2015), and PyTorch Paszke et al. (2017). The idea is that training large-scale vision networks (e.g. on the ImageNet dataset) can take weeks with powerful GPUs, and that there is little reason to constantly reduplicate the effort of training. Pre-trained word-embedding models are often released with similar motivation, e.g. for Word2Vec Mikolov et al. (2013) or GloVe Pennington et al. (2014). However, such a practice is much less common in the space of DRL; one reason is that so far, unlike with vision models and word-embedding models, there are few down-stream tasks for which Atari DRL agents provide obvious value. But, if the goal is to better understand these models and algorithms, both to improve them and to use them safely, then there is value in their release.

One notable DRL model release accompanied the recent Dopamine software package for reproducible DRL Bellemare et al. (2018); Dopamine includes final checkpoints of models trained by several DQN variants across ALE games. However, in general it is non-trivial to extract TensorFlow models from their original context for visualization purposes, and to directly compare agent behavior across DRL algorithms in the same software framework (e.g. due to slight differences in image preprocessing), or to explore dynamics that take place over learning, i.e. from intermediate checkpoints. To remedy this, for this paper and its accompanying software release, the Dopamine checkpoints were distilled into frozen models that can be easily loaded into the Atari Zoo framework.
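To make concrete what loading such a frozen model involves, below is a minimal sketch of loading a frozen TensorFlow GraphDef and running a forward pass. The file path and tensor names are placeholders for illustration only, not the actual names used in the Atari Zoo or Dopamine releases.

```python
import numpy as np
import tensorflow as tf  # TF1-style graph loading via tf.compat.v1

def load_frozen_graph(pb_path: str) -> tf.Graph:
    """Load a frozen GraphDef (.pb) into a fresh tf.Graph."""
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())
    graph = tf.Graph()
    with graph.as_default():
        tf.compat.v1.import_graph_def(graph_def, name="")
    return graph

# Placeholder path and tensor names -- consult the actual release for the real ones.
graph = load_frozen_graph("models/seaquest_apex_final.pb")
obs_t = graph.get_tensor_by_name("observation:0")
out_t = graph.get_tensor_by_name("q_values:0")

with tf.compat.v1.Session(graph=graph) as sess:
    dummy_obs = np.zeros((1, 84, 84, 4), dtype=np.float32)  # one processed observation
    print(sess.run(out_t, feed_dict={obs_t: dummy_obs}).shape)
```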

3 Generating the Zoo

The approach in this paper is to run several validated implementations of DRL algorithms and to collect and standardize the models and results, such that they can then be easily used for downstream analysis and synthesis. There are many algorithms – and implementations of them – and different ways that those implementations could be run (e.g. different hyperparameters, architectures, input representations, stopping criteria, etc.). These choices influence the kind of post-hoc analysis that is possible. For example, Rainbow most often outperforms DQN, and if only final models are released, it is impossible to explore scientific questions where it is important to control for performance.

To navigate these many degrees of freedom, we adopted the high level principle that the Atari Zoo should hold as many elements of architecture and experimental design constant across algorithms (e.g. DNN structure, input representation), should enable as many types of downstream analysis as possible (e.g. by releasing checkpoints across training time), and should make reasonable allowances for the particularities of each algorithm (e.g. ensuring hyperparameters are well-fit to the algorithm, and allowing for differences in how policies are encoded or sampled from). The next paragraphs describe specific design choices.

3.1 Frozen Model Selection Criteria

To enable the platform to facilitate a variety of explorations, we release multiple frozen models for each run, according to different criteria that may be useful to control for when comparing learned policies. The idea is that, depending on the desired analysis, controlling for samples, for wall-clock time, or for performance (i.e. comparing similarly-performing policies) will be more appropriate. In particular, in addition to releasing the final model for each run, additional models are taken over training time (at one, two, four, six, and ten hours); over game frame samples (400 million, and 1 billion frames); over scores (if an algorithm reaches human-level performance); and also a model before any training, to enable analysis of how weights change from their random initialization. The hope is that these frozen models will cover a wide spectrum of possible use cases.

3.2 Algorithm Choice

Because one focus of the Atari Zoo is to compare learned agents across different DRL algorithms, one important design consideration is the choice of particular algorithms to run. The main families of DRL algorithms that have been applied to the ALE are policy gradient methods like A2C Mnih et al. (2016), value-based methods like DQN Mnih et al. (2015), and black-box optimization methods like ES Salimans et al. (2017) and Deep GA Such et al. (2017). Based on available and trusted implementations, and the authors’ familiarity in running various algorithms at scale, the particular algorithms chosen for training included one policy gradient algorithm (A2C; Mnih et al. (2016)), two evolutionary algorithms (ES Salimans et al. (2017) and Deep GA Such et al. (2017)), and one value-function-based algorithm (a high-performing DQN variant, Ape-X; Horgan et al. (2018)). Additionally, models are also imported from the Dopamine release Bellemare et al. (2018), which includes DQN Mnih et al. (2015) and a sophisticated variant of it called Rainbow Hessel et al. (2017). Note that for the Dopamine models, only final models are currently available. Hyperparameters and training details for all algorithms are available in supplemental material section S3. We hope to include models from additional algorithms in future releases.

3.3 Network Architecture and Input Representation

All algorithms are run with the DNN architecture from Mnih et al. (2015), which consists of three convolutional layers (with filter sizes 8x8, 4x4, and 3x3), followed by a fully-connected layer. For most of the explored algorithms, the fully-connected layer connects to an output layer with one neuron per action available in the underlying Atari game. However, A2C has an additional output that approximates the state value function; Ape-X’s architecture features dueling DQN Wang et al. (2015), which has two separate fully-connected streams; and Rainbow’s architecture includes C51 Bellemare et al. (2017), which uses many outputs to approximate the distribution of expected Q-values.
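For concreteness, a minimal Keras sketch of that base architecture is shown below. The filter counts (32, 64, 64), strides (4, 2, 1), and 512-unit hidden layer follow Mnih et al. (2015); the plain one-output-per-action head shown here omits the A2C, dueling, and C51 variations described above.

```python
import tensorflow as tf

def nature_dqn(num_actions: int) -> tf.keras.Model:
    """Nature-DQN convolutional torso plus a plain per-action output head."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(84, 84, 4)),            # four stacked grayscale frames
        tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=1, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),        # fully-connected layer
        tf.keras.layers.Dense(num_actions),                   # one output per Atari action
    ])

model = nature_dqn(num_actions=18)  # 18 is the full Atari action set
model.summary()
```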

Raw Atari frames are color images with 210x160 resolution (see figure 1a); the canonical DRL representation, as introduced by Mnih et al. (2015), is a tensor consisting of the four most recent observation frames, grayscaled and downsampled to 84x84 (figure 1b). By including some previous frames, the aim is to make the game more fully observable, which is useful for the feed-forward architectures used here, which are currently most common in Atari research (although recurrent architectures offer possible improvements Hausknecht and Stone (2015); Mnih et al. (2016); Espeholt et al. (2018)). One useful Atari representation that is applied in post-training analysis in this paper is the Atari RAM state, which is only 1024 bits long but encompasses the true underlying state information in the game (figure 1c).

(a) RGB frame from emulator
(b) Processed observation
(c) RAM representation
Figure 1: Input and RAM Representation. (a) One RGB frame of emulated Atari gameplay is shown, which is (b) preprocessed and concatenated with previous frames before being fed into the DNN controller. A compressed representation of a 2000-step ALE simulation is shown in (c), i.e. the 1024-bit RAM state (horizontal axis) unfurled over frames (vertical axis).
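A minimal sketch of this canonical preprocessing is shown below, using OpenCV for grayscaling and resizing and a deque for the four-frame stack. Implementation details such as frame-skipping, max-pooling over consecutive frames, and cropping vary between codebases and are omitted here.

```python
from collections import deque
import cv2
import numpy as np

class FrameStacker:
    """Grayscale + downsample raw 210x160 RGB frames and keep the last k of them."""

    def __init__(self, k: int = 4):
        self.frames = deque(maxlen=k)

    def reset(self, frame: np.ndarray) -> np.ndarray:
        processed = self._process(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)      # pad the stack with the first frame
        return self.observation()

    def step(self, frame: np.ndarray) -> np.ndarray:
        self.frames.append(self._process(frame))
        return self.observation()

    @staticmethod
    def _process(frame: np.ndarray) -> np.ndarray:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

    def observation(self) -> np.ndarray:
        # Shape (84, 84, 4): the DNN input described above, oldest frame first.
        return np.stack(self.frames, axis=-1)
```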

3.4 Data Collection

All algorithms are run across 55 Atari games, for at least three independent random weight initializations. Regular checkpoints were taken during training; after training, the checkpoints that best fit each of the desired criteria (e.g. 400 million frames or human-level performance) were frozen and included in the zoo. The advantage of this post-hoc culling is that additional criteria can be added in the future, e.g. if Atari Zoo users introduce a new use case, because the original checkpoints are archived. Log files were stored and converted into a common format that will also be released with the models, to aid future performance curve comparisons for other researchers. Each frozen model was run post-hoc in ALE for a fixed number of timesteps to generate cached behavior of policies in their training environment, which includes the raw game frames, the processed four-frame observations, RAM states, and high-level representations (e.g. neural representations at hidden layers). As a result, it is possible to do meaningful analysis without ever running the models themselves.

4 Quantitative Analysis

The open-source software package released with the acceptance of this work provides an interface to the Atari Zoo dataset, and implements several common modes of analysis. Models can be downloaded with a single line of code; and other single-line invocations interface directly with ALE and return the behavioral outcome of executing a model’s policy, or create movies of agents superimposed with neural activation, or access convolutional weight tensors. In this section, we demonstrate analyses the Atari Zoo software can facilitate, and highlight some of its built-in features. For many of the analyses below, for computational simplicity we study results in a representative subset of 13 ALE games used by prior research Such et al. (2017), which we refer to here as the analysis subset of games.

4.1 Convolutional Filter Analysis

While reasoning about what a DNN has learned by looking at its weights is difficult, weights directly connected to the input can often be interpreted. For example, when visualizing the weights of the first convolutional layer in a vision model, Gabor-like edge detection filters are nearly always present Yosinski et al. (2014). An interesting question is whether Gabor-like filters also arise when DRL algorithms are trained from pixel input, as in the ALE. In visualizing filters across games and DRL algorithms, edge-detector-like features sometimes arise in the gradient-based methods, but they are seemingly never as crisp as in vision models; this may be because ALE lacks the visual complexity of natural images. In contrast, the filters in the evolutionary models appear to have less regularity. Representative examples across games and algorithms are shown in supplemental figure S1.

Learned filters commonly appear to be tiled similarly across time (i.e. across the four DNN input frames), with past frames receiving lower-intensity weights. One explanation is that reward gradients are more strongly influenced by present observations. To explore this more systematically, across games and algorithms we examined the absolute magnitude of filter weights connected to the present frame versus those connected to the past. Interestingly, in contrast to the gradient-based methods, the evolutionary methods show no discernible preference across time (supplemental figure S2), again suggesting that their learning is qualitatively different from that of the gradient-based methods. Interestingly, a rigorous information-theoretic approximation of memory usage is explored in the context of DQN in Dann et al. (2016); our measure correlates well with theirs despite the relative simplicity of examining only filter weight strength (supplemental section S1.1).
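A sketch of this measurement is shown below for a first-layer convolution kernel of shape (height, width, 4, filters); it assumes the most recent frame occupies the last input channel, which depends on how a given implementation orders the frame stack.

```python
import numpy as np

def per_frame_weight_magnitude(first_layer_kernel: np.ndarray) -> np.ndarray:
    """Mean |weight| per input frame, normalized so the present frame equals 1.0.

    first_layer_kernel: array of shape (kh, kw, 4, num_filters), where the
    4 input channels are the stacked observation frames (oldest .. newest).
    """
    mags = np.abs(first_layer_kernel).mean(axis=(0, 1, 3))   # one value per input frame
    return mags / mags[-1]                                    # anchor present frame at 1.0

# Example with a random kernel standing in for a trained one.
kernel = np.random.randn(8, 8, 4, 32)
rel = per_frame_weight_magnitude(kernel)
print("relative magnitude, oldest -> present:", rel)
print("past/present ratio:", rel[:3].mean())                  # metric used in section S1.1
```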

4.2 Robustness to Observation Noise

An important property is how agents perform in slightly out-of-distribution (OOD) situations; ideally they would not catastrophically fail in the face of nominal change. While it is not easy to flexibly alter the dynamics of the environment in ALE (without learning how to program in 6502 assembly code), it is possible to systematically distort observations. Here we explore one simple OOD change to observations by adding increasingly severe noise to the four-frame observations input to DNN-based agents, and observe how their evaluated performance in ALE degrades. The motivation is to discover whether some learning algorithms are learning more robust policies than others. The results show that with some caveats, methods with a direct representation of the policy may be more robust to observation noise (supplemental figure S4). A similar study was conducted for robustness to parameter noise (supplemental section S1.2), but there was no clear trend in the data.
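A sketch of the kind of sweep used here is shown below; it assumes a classic gym-style environment that already emits the processed 84x84x4 observations and an arbitrary `policy(obs) -> action` callable, both of which are placeholders rather than the Atari Zoo's actual interface.

```python
import numpy as np

def noisy_score(env, policy, sigma, episodes=3, seed=0):
    """Average episode return when i.i.d. Gaussian noise of std `sigma`
    is added to each observation before it reaches the policy."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            corrupted = obs + rng.normal(0.0, sigma, size=obs.shape)
            obs, reward, done, _ = env.step(policy(corrupted))
            total += reward
        returns.append(total)
    return float(np.mean(returns))

# Sweep increasingly severe noise and compare degradation curves across algorithms:
# for sigma in [0.0, 0.05, 0.1, 0.2, 0.4]:
#     print(sigma, noisy_score(env, policy, sigma))
```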

4.3 Distinctiveness of Policies Learned by Algorithms

To explore the distinctive signature of solutions discovered by different DRL algorithms, we train image classifiers to identify the generating DRL algorithm given states sampled from independent runs of each algorithm (details are in supplemental section S1.3). Supplemental figure S7 shows the confusion matrix for Seaquest, wherein the policy search methods (A2C, ES, and GA) have the most inter-class confusion, reflecting (as confirmed in later sections) that these algorithms tend to converge to the same sub-optimal behavior in this game; results are qualitatively similar when tabulated across the analysis subset of games (supplemental figure S8), and quantitatively reveal that Skiing is a particularly idiosyncratic game (supplemental table S1).

5 Visualization

This section highlights the Atari Zoo’s visualization capabilities, which enable quickly and systematically exploring how policies vary across runs and algorithms for a given game. The tools can be split into three broad categories: direct policy visualization, dimensionality reduction, and neuron activation maximization. In the future we will add additional tools, e.g. saliency maps Greydanus et al. (2017); Zahavy et al. (2016).

5.1 Inspecting Policy Behavior and Neural Activations through Animations in ALE

To quickly survey the kinds of solutions being learned, our software can generate grids of videos, where one axis in the grid covers different DRL algorithms, and the other axis covers independent runs of the algorithm. Such videos can highlight when different algorithms are converging to the same local optimum (e.g. supplemental figure S9 shows a situation where this is the case for A2C, ES, and the GA; video: http://t.uber.com/atarizoo_rolloutgrid).

To enable investigating the internal workings of the DNN, our software generates movies that display activations of all neurons alongside animated frames of the agent acting in game. This approach is inspired by the deep visualization toolbox Yosinski et al. (2015), but put into a DRL context. Supplemental figure S10 shows how this tool can lead to recognizing the functionality of particular high-level features (video: http://t.uber.com/atarizoo_activation); in particular, it helped to identify a submarine detecting neuron on the third convolution layer of an Ape-X agent. Note that for ES and GA, no such specialized neuron was found; activations seemed qualitatively more distributed for those methods.

5.2 Finding Image Patches from Observations that Maximally Excite Convolution Filters

One automated technique to discover what information is relevant to a particular convolutional filter is to find which image patches evoke the highest-magnitude activations from it. Given a trained DRL agent and a target convolution filter to analyze, observations from the agent interacting with its ALE training environment are input to the agent’s DNN, and the resulting maps of activations from the filter of interest are stored. These maps are sorted by the single maximum activation within them, and the geometric location within the map of that maximum activation is recorded. Then, for each of these top-most activations, the specific image patch from the observation that generated it is identified and displayed, by taking the receptive field of the filter into account (i.e. modulated by both the stride and size of the convolutional layers). As a sanity check, we validate that the neuron identified in the previous section does indeed maximally fire for submarines (figure 2).

Figure 2: A sub-detecting neuron in Seaquest. Each image represents an observation from an Ape-X agent playing Seaquest. The red square indicates which image patch highly-activated the sub-detecting neuron on the third convolutional layer of the DNN. Having multiple tools (such as this image patch finder, or the activation movies which identified this neuron of interest) enables more easily triangulating and verifying hypotheses about the internals of a DRL agent’s neural network.
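A sketch of this patch-extraction logic for a first-layer filter is shown below; mapping an activation coordinate back to the input is approximately `stride * position` with a window of the kernel size, while deeper layers require composing the strides and kernel sizes of all preceding layers (padding conventions are glossed over here).

```python
import numpy as np

def top_patches(observations, activation_maps, filter_idx, kernel=8, stride=4, k=5):
    """Return the k input patches whose activations of one filter are largest.

    observations:    (N, 84, 84, 4) processed observations fed to the network.
    activation_maps: (N, H, W, F) first-layer conv activations for those observations.
    """
    acts = activation_maps[..., filter_idx]                 # (N, H, W)
    flat = acts.reshape(len(acts), -1)
    best_per_obs = flat.max(axis=1)                         # max activation per observation
    order = np.argsort(best_per_obs)[::-1][:k]              # observations with top-k maxima
    patches = []
    for n in order:
        y, x = np.unravel_index(acts[n].argmax(), acts[n].shape)
        y0, x0 = y * stride, x * stride                     # receptive-field origin in input
        patches.append(observations[n, y0:y0 + kernel, x0:x0 + kernel, -1])  # present frame
    return patches
```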

5.3 Dimensionality Reduction

Dimensionality reduction provides another view on DRL agent behavior; in DRL research it is common to generate t-SNE plots Maaten and Hinton (2008) of agent DNN representations that summarize evaluation in the domain Mnih et al. (2015). Our software includes such an implementation (supplemental figure S12).

However, such an approach relies on embedding the high-level representation of one agent; it is unclear how to apply it to create an embedding appropriate for comparisons of different independent runs of the same algorithm, or of runs from different DRL algorithms. As an initial approach, we implement an embedding based on the Atari RAM representation (which is the same across algorithms and runs, but distinct between games). Like the grid view of agent behaviors and the state-distinguishing classifier, this t-SNE tool provides high-level information with which to compare runs of the same algorithm or of different algorithms (figure 3); details of this approach are provided in supplemental section S2.1.

Figure 3: Multiple runs of multiple algorithms sharing the same RAM-space embedding in Seaquest. This plot shows one ALE evaluation per model for A2C, ES, and Ape-X, visualized in the same underlying RAM t-SNE embedding. Each dot represents a separate frame from each agent, colored by score (darker color indicates higher score). The plot highlights that in this game, A2C and ES visit similar distributions of states (corresponding to the same sub-optimal behavior), while Ape-X visits a distinct part of the state space, matching what could manually be distilled from watching the policy movies shown in supplemental figure S9. The interface allows clicking on points to observe the corresponding RGB frame, and toggling different runs of different algorithms for visualization.

5.4 Generating Synthetic Inputs to Understand Neurons

While the previous sections explore DNN activations in the context of an agent’s training environment, another approach is to optimize synthetic input images that stimulate particular DNN neurons. Variations on this approach have yielded striking results in vision models Nguyen et al. (2017, 2016a); Olah et al. (2018); Simonyan et al. (2013); the hope is that these techniques could yield an additional view on DRL agents’ neural representations. To enable this analysis, we leverage the Lucid visualization library Luc (2018); in particular, we create wrapper classes that enable easy integration of Atari Zoo models into Lucid, and release Jupyter notebooks that generate synthetic inputs for different DRL models.

We now present a series of synthetic inputs generated by the Lucid library across a handful of games that highlight the potential of these kinds of visualizations for DRL understanding (further details of the technique used are described in supplemental section S2.2). We first explore the kinds of features learned across depth. Supplemental figure S13 supports what was learned by visualizing the first-layer filter weights for value-based networks (section 4.1), i.e. showing that the first convolution layers in the value-based networks appear to be learning edge-detector features. The activation videos of section 5.1 and the patch-based approach of section 5.2 help to provide grounding, showing that in the context of the game, some first-layer filters detect the edges of the screen, in effect serving as location anchors, while others encode concepts like blinking objects (see figure S11). Supplemental figure S14 explores visualizing later-layer convolution filters, and figure 4 shows inputs synthesized to maximize output neurons, which sometimes yields interpretable features.

Figure 4: Synthesized inputs for output layer neurons in Seaquest.

For a representative run of Rainbow and DQN, inputs are shown optimized to maximize the activation of the first neuron in the output layer of a Seaquest network. Because Rainbow includes C51, its image is in effect optimized to maximize the probability of a low-reward scenario; this neuron appears to be learning interpretable features such as submarine location and the seabed. When maximizing (or minimizing) DQN Q-value outputs (one example shown on left), this qualitative outcome of interpretability was not observed.

Figure 5: Synthesized inputs for fully-connected layer neurons in Freeway. Inputs synthesized to maximize activations of the first three neurons in the first fully-connected layer are shown for a representative DQN and Rainbow DNN. One of the Rainbow neurons (in red rectangle) appears to be capturing lane features.

Such visualizations can also reveal that critical features are being attended to (figure 5 and supplemental figure S15). Overall, these visualizations demonstrate the potential of this kind of technique, and we believe that many useful further insights may result from a more systematic application and investigation of this and many of the other interesting visualization techniques implemented by Lucid, which can now easily be applied to Atari Zoo models. Also promising would be to further explore regularization to constrain the space of synthetic inputs, e.g. a generative model of Atari frames in the spirit of Nguyen et al. (2017) or similar works.

6 Discussion and Conclusions

There are many follow-up extensions that the initial explorations of the zoo raise. One natural extension is to include more DRL algorithms; we would like to include agents from IMPALA Espeholt et al. (2018) or other popular policy-gradient algorithms (like TRPO Schulman et al. (2015) or PPO Schulman et al. (2017)), to balance out the distribution of DRL algorithms. Beyond algorithms, there are many alternate architectures that might have interesting effects on representation and decision-making, for example recurrent architectures Hausknecht and Stone (2015), or architectures that exploit attention Sorokin et al. (2015). Also intriguing is examining the effect of the incentive driving search: Do auxiliary or substitute objectives qualitatively change DRL representations, e.g. as in UNREAL Jaderberg et al. (2016), curiosity-driven exploration Pathak et al. (2017), or novelty search Conti et al. (2017)? How do the representations and features of meta-learning agents such as those of Wang et al. (2016) or MAML Finn et al. (2017) change as they learn a new task? Finally, there are other analysis tools that could be implemented, which might illuminate other interesting properties of DRL algorithms and learned representations, e.g. the image perturbation analysis of Greydanus et al. (2017) or a variety of sophisticated neuron visualization techniques Nguyen et al. (2017, 2016b). We welcome community contributions of such algorithms, models, architectures, incentives, and tools.

While the main motivation for the zoo was to reduce friction for research into understanding and visualizing the behavior of DRL agents, it can also serve as a platform for other research questions. For example, having a zoo of agents trained on individual games, for different amounts of data, also reduces friction for exploring transfer learning within Atari Sobol et al. (2018), i.e. whether experience learned on one game can quickly benefit learning on another game. Also, by providing a huge library of cached rollouts for agents across algorithms, the zoo may be interesting in the context of learning from demonstrations Hester et al. (2017), or for creating generative models of games Oh et al. (2015). In conclusion, we look forward to seeing how this dataset will be used by the community at large.

References

Supplementary Material

The following sections contain additional figures, and describe in more detail the experimental setups applied in the paper’s experiments.

S1 Quantitative Analysis Details

Figure S1 shows a sampling of first-layer convolutional filters from final trained models, and figure S2 highlights that such filters often differentially attend to the present over the past.

(a) Seaquest
(b) Venture
Figure S1: Learned Convolutional Filters. In games in which they exceed random performance, filters for the gradient-based algorithms often have spatial structure and sometimes resemble edge detectors, and the intensity of weights often degrades into the past (i.e. the left-most patches). This can be seen for all gradient-based methods in (a) Seaquest; when gradient-based methods fail to learn, as DQN and A2C often do in (b) Venture, their filters then appear more random (this effect is consistent across runs). Filters for the evolutionary algorithms appear less regular, even when their performance is competitive with the gradient-based methods.
Figure S2: Significance of Time Across Models. Filter weight magnitudes across input patches are shown averaged across a representative sample of ALE games, with multiple independent runs each. Before averaging, past-frame weight magnitudes are normalized by that of the weights attending to the most recent observation (i.e. the present magnitudes are anchored to 1.0). For the gradient-based DRL algorithms (Ape-X, Rainbow, DQN, & A2C), filter weights are stronger when connected to the current frame than to historical frames. Interestingly, such a trend is not seen for the evolutionary algorithms; note that ES includes L2 regularization, so this effect is not merely an artifact of weight decay being present in the gradient-based methods only. The effect is also present when looking at individual games (data not shown).

S1.1 Further Study of Temporal Bias in DQN

As an exploration of the connection between the information-theoretic measure of memory-dependent action in Dann et al. [2016] and the pattern highlighted in this paper (i.e. that the strength of filter weights in the first layer of convolutions may indicate a network’s reliance on the past), we examined first-layer filters in DQN across all 55 games. A simple metric of present-focus is the ratio of average weight magnitudes for the past three frames to that for the present frame. When games are sorted by this metric (see figure S3), there is high agreement with the 8 games identified by Dann et al. [2016] as using memory. In particular, three of the top four games identified by our metric align with theirs, as do six of the top twelve, when considering only the games that overlap between their 49 and our 55.

Figure S3: Attention to the past in DQN. DQN’s tendency to focus on the present relative to the past (as measured by filter magnitudes from different input frames), is shown across 55 ALE games. From left to right, the amount of present-bias increases, e.g. the games at the left seemingly have greater use for information stored in the past three frames relative to the games on the right.

S1.2 Observation and Parameter Noise Details

(a) Normalized by algorithm best
(b) Normalized by best over all algorithms
Figure S4: Robustness to Observation Noise. How the performance of trained policies degrades with increasingly severe normally-distributed observation noise is shown, averaged over three independent runs across the analysis subset of games. The figure shows performance degradation (a) relative to the baseline performance of that algorithm on each game, and (b) relative to the best performance of any algorithm on each game. Zero performance in this chart represents random play. The conclusion is that the policy search algorithms show less steep degradation relative to (a) their own best performance, although this is confounded by (b) the overall better absolute performance of the value-based methods. Follow-up analysis will control for performance by using the Atari Zoo human-performance frozen models.

Figure S4 shows robustness to observation noise for games in the analysis subset. Beyond observation noise, another interesting property of learning algorithms is the kind of local optimum they find in the parameter space, i.e. whether the learned function is smooth or not in the area of a found solution. One gross tool for examining this property is to test the robustness of policies to parameter perturbations. It is plausible that the evolutionary methods would be more robust in this way, given that they are trained through parameter perturbations. To measure this, we perturb the convolutional weights of each DRL agent with increasingly severe normally-distributed noise. We perturbed the convolutional weights only, because that part of the DNN is identical across agents, whereas the fully-connected layers sometimes vary in structure across DRL algorithms (e.g. value-based algorithms like Rainbow or Ape-X that include features that require post-convolutional architectural changes). Figure S5 shows the results across games and algorithms; however, no strong trend is evident.

(a) Performance relative to algorithm best
(b) Performance relative to best over algorithms
Figure S5: Robustness to Parameter Noise. How the performance of trained policies degrades with increasingly severe normally-distributed parameter noise is shown, averaged over three independent runs across the analysis subset of games. The figure shows performance degradation (a) relative to the baseline performance of that algorithm on each game, and (b) relative to the best performance of any algorithm on each game. Zero performance in this chart represents random play. Interestingly, no clear trends emerge from the data; our prior hypothesis was that the evolutionary algorithms might exhibit higher robustness by this measure (given that they are trained with parameter perturbations).
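A sketch of the weight-perturbation step is shown below, using a Keras-style model standing in for an agent's network; how layers are enumerated is an assumption about the surrounding evaluation code rather than a description of the released tooling.

```python
import numpy as np
import tensorflow as tf

def perturb_conv_weights(model: tf.keras.Model, sigma: float, seed: int = 0) -> None:
    """Add N(0, sigma^2) noise to every Conv2D kernel, leaving biases and
    fully-connected layers untouched (as described above)."""
    rng = np.random.default_rng(seed)
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.Conv2D):
            weights = layer.get_weights()                 # [kernel] or [kernel, bias]
            weights[0] = weights[0] + rng.normal(0.0, sigma, size=weights[0].shape)
            layer.set_weights(weights)

# Evaluate the perturbed policy as usual afterwards, sweeping sigma as in the
# observation-noise experiment of figure S4.
```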

S1.3 Distinctiveness of Policies Learned by Algorithms

We use only the “present” channel of each gray-scale observation frame (i.e. without the complete four-frame stack) to train a classifier for each game. The classifier consists of two convolutional layers and two fully-connected layers, and is trained with early stopping to avoid overfitting. For each game, 2501 frames are collected from multiple evaluations of each model. The reported classification results use 20% of the frames as a test set. Figure S6 summarizes classification performance across models and games, while table S1 shows performance averaged across games (highlighting that Skiing is an outlier in terms of classifier performance).
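A sketch matching that description is shown below (two convolutional layers, two fully-connected layers, early stopping, 20% held-out data); the exact filter counts, hidden sizes, and optimizer settings are assumptions, not the ones used for the reported results.

```python
import tensorflow as tf

NUM_ALGORITHMS = 6  # A2C, DQN, Rainbow, Ape-X, ES, GA

def frame_classifier() -> tf.keras.Model:
    """Classify which DRL algorithm generated a single 84x84 grayscale frame."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(84, 84, 1)),
        tf.keras.layers.Conv2D(32, 8, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(64, 4, strides=2, activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(NUM_ALGORITHMS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# frames: (N, 84, 84, 1) float array; labels: (N,) integer algorithm ids.
# model = frame_classifier()
# model.fit(frames, labels, validation_split=0.2,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)])
```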

Figure S6: F1 scores for frame classification. The F1 score is defined as $F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. We observe that the classifier distinguishes each algorithm in all environments with at least a 0.6 F1 score, except for Skiing.
Game Mean F1
Amidar 0.95
Assault 0.86
Asterix 0.94
Asteroids 0.95
Atlantis 0.7
Enduro 0.93
Frostbite 0.93
Gravitar 0.9
Kangaroo 0.92
Seaquest 0.86
Skiing 0.05
Venture 0.95
Zaxxon 0.95
Table S1: Average F1 scores by game. The score is an unweighted average of F1 scores across all algorithms.
Figure S7: Confusion matrix for Seaquest frame classification. The cell in the i-th row and the j-th column denotes the number of frames generated by the algorithm in row i that are predicted to have been generated by the algorithm in column j. The conclusion is that in this game, there is a cluster of confusion among the direct policy search algorithms (ES, A2C, and GA), suggesting that they are converging to similarly sub-optimal behaviors.
Figure S8: Confusion matrix for all games. The cell in the i-th row and the j-th column denotes the total number of frames from rollouts of the algorithm in row i that are predicted to be from rollouts of the algorithm in column j. The true-positive counts are reset to 0 to highlight the false positives. Skiing is excluded from this chart because classification works so poorly on it that it alone greatly skews the aggregate view this chart provides.

S2 Visualization Details

This section provides more details and figures for the visualization portion of the paper’s analysis. Figure S9 shows one frame of a collage of simultaneous videos that gives a quick high-level comparison of how different algorithms and runs are solving an ALE environment. Figure S10 shows one frame of a video that simultaneously shows a DNN agent acting in an ALE environment and all of the activations of its DNN.

Figure S9: Grid of Rollout Videos in Seaquest. The vertical axis represents different independently-trained models, while the horizontal axis represents the DRL algorithms included in the Atari Zoo. In Seaquest, one objective is to control a submarine to shoot fish without getting hit by them, and another is to avoid running out of oxygen by intermittently resurfacing. All three independent runs of A2C, GA, and ES converge to the same sub-optimal behavior: They dive to the bottom of the ocean, and shoot fish until they run out of oxygen. The value-function based methods exhibit more sophisticated behavior, highlighting that in this game, greedy policy searches may often converge to sub-optimal solutions, while learning the value of state-action pairs can avoid this pathology. Video is available at: http://t.uber.com/atarizoo_rolloutgrid
Figure S10: Policy and activation visualization. The figure shows a still frame from a video of an Ape-X agent acting in the Seaquest environment (full video can be accessed at http://t.uber.com/atarizoo_activation). On the left, the RGB frame is shown, while from top to bottom on the right are: the processed observations, and then the activations for the convolutional layers, the fully connected layer, and finally, the Q-value outputs. From watching the video, it is apparent that the brightest neuron in the third convolutional layer tracks the position of the submarine. This shows that like in vision DNNs, sometimes important features are represented in a local, rather than distributed fashion Yosinski et al. [2015].

Figure S11 shows a second example of how the image-patch finder can help ground out what particular DNN neurons are learning.

Figure S11: Location-anchor and oxygen-detector in a Rainbow agent in Seaquest. The top three images show image patches (red square) that highly activate a first-layer convolution filter of a Rainbow agent; this filter always activates maximally in the same geometric location, potentially serving as a geometric anchor for localization by down-stream filters. The bottom three images show image patches that highly activate a separate first-layer filter in the same agent. It detects blinking objects; the submarine can blink before it runs out of oxygen, and the oxygen meter itself blinks when it is running low.

S2.1 t-SNE Details

Figure S12: Comparing high-level DNN representations through separate t-SNE embeddings. The figure shows separate t-SNE embeddings of high-level representations for DNNs trained to play Seaquest by A2C and Ape-X. Each dot corresponds to a specific frame in a rollout, and darker shades indicate higher scores. Embeddings that represent similar frames cluster together, distinguishing states with different positions of the submarine and with objects of various numbers, categories, and colors. Representative frames for selected clusters are displayed. For example, in the left figure (A2C), the top-left cluster represents terminated states and the bottom-left cluster corresponds to oxygen depletion, while in the right figure (Ape-X), the bottom-right cluster corresponds to a repeated series of actions that the agent takes to surface and refill its oxygen.

To visualize RAM states and high-level DNN representations in 2D, principal component analysis (PCA) is first applied to reduce the number of dimensions to 50, followed by 3000 t-SNE iterations with perplexity of 30. The dimensionality reduction of RAM states is applied across all available runs of multiple algorithms, while that of high-level DNN representations is with respect to a specific model of a given algorithm.
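A sketch of that pipeline with scikit-learn is shown below; the 50-dimensional PCA step, perplexity of 30, and 3000 iterations follow the description above, while the input handling is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def embed_states(states: np.ndarray) -> np.ndarray:
    """states: (N, D) array of per-frame features, e.g. the RAM representation
    (readable in gym's Atari wrappers via env.unwrapped.ale.getRAM()) or a
    high-level DNN layer. Returns (N, 2) t-SNE coordinates."""
    reduced = PCA(n_components=50).fit_transform(states.astype(np.float32))
    return TSNE(n_components=2, perplexity=30, n_iter=3000).fit_transform(reduced)

# coords = embed_states(ram_states)   # then scatter-plot, colored by score
```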

S2.2 Synthetic Input Generation Details

We use the Lucid library Luc [2018] to visualize what types of inputs maximize neuron activations throughout the agents’ networks. This study used the trained checkpoints provided by Dopamine Bellemare et al. [2018] for DQN and Rainbow (although it could be applied to any of the DRL algorithms in the Atari Zoo). These frozen graphs are loaded as part of a Lucid model, and an optimization objective is created.

An input pattern to the network (consisting of a stack of four 84x84 pixel screens) is optimized to maximize the activations of the desired neurons. Initially, the four 84x84 frames are initialized with random noise. The result of optimization ideally yields visualizations that reveal qualitatively what features the neurons have learned to capture. As recommended in Olah et al. [2017] and Mahendran and Vedaldi [2016], we apply some regularization to produce clearer results; for most images we use only image jitter (i.e. randomly offsetting the input image by a few pixels to encourage local translation invariance). For some images, we found it helpful to add total variation regularization (to encourage local smoothness; see Mahendran and Vedaldi [2016]) and L1 regularization (to encourage pixels that do not contribute to the objective to become zero) on the optimized image.
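The released notebooks use Lucid for this; the sketch below only illustrates the underlying idea with plain gradient ascent on the input, combining the jitter, total-variation, and L1 terms described above. The Keras feature extractor stands in for the frozen agent graph, and all coefficients and step counts are arbitrary assumptions.

```python
import tensorflow as tf

def maximize_activation(model, layer_name, unit, steps=256, lr=0.05,
                        tv_coef=1e-4, l1_coef=1e-5, jitter=4):
    """Gradient-ascend a (1, 84, 84, 4) input to excite one unit of one layer."""
    feature = tf.keras.Model(model.input, model.get_layer(layer_name).output)
    image = tf.Variable(tf.random.normal((1, 84, 84, 4), stddev=0.1))  # random-noise init
    opt = tf.keras.optimizers.Adam(lr)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            # Jitter: random translation encourages locally translation-invariant patterns.
            dx, dy = tf.random.uniform((2,), -jitter, jitter + 1, dtype=tf.int32)
            shifted = tf.roll(image, shift=[dy, dx], axis=[1, 2])
            act = feature(shifted)[..., unit]
            loss = (-tf.reduce_mean(act)                                        # maximize unit
                    + tv_coef * tf.reduce_sum(tf.image.total_variation(shifted))  # smoothness
                    + l1_coef * tf.reduce_sum(tf.abs(shifted)))                  # sparsity
        grads = tape.gradient(loss, [image])
        opt.apply_gradients(zip(grads, [image]))
    return image.numpy()
```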

Figure S13: Synthesized inputs for neurons in the first convolutional layer in Seaquest. Inputs optimized to activate the first three neurons in the first convolutional layer are shown for representative runs of DQN and Rainbow. These neurons appear to be learning ‘edge-detector’ style features.
Figure S14: Synthesized inputs for neurons in the third convolutional layer in Seaquest. Inputs optimized to activate the first four neurons in the last (third) convolutional layer in Seaquest are shown for a representative run of DQN and Rainbow (hyperparameters for regularization, e.g. a total variation penalty, were optimized by hand to improve image quality). Both networks appear to focus on particular styles of objects, combinations of them, and animation-related features such as blinking. Some synthetic inputs make sense in the context of other investigatory tools; e.g. Rainbow’s first neuron’s synthetic input includes objects that blink between frames, and when explored with the patch activation technique, the neuron is seen responding most intensely when the sub is blinking and about to explode from running out of oxygen. However, for some features it is unclear how the synthetic input should be interpreted without further investigation; e.g. the patch activation technique shows that Rainbow’s third neuron responds most when the sub is nearing the top border of the water. Further experimentation with regularization within Lucid, or employing more sophisticated techniques, may help to improve these initial results.
Figure S15: Synthesized inputs for neurons in the third convolutional layer in Pong. Inputs optimized to activate the first four neurons in the last (third) convolutional layer in Pong are shown for a representative run of DQN and Rainbow. Both networks seem to learn qualitatively similar features, with images featuring vertical lines reminiscent of patterns and smaller objects reminiscent of balls. Further exploration is needed to ground out these evocative appearances.

S3 DRL Algorithm Details and Hyperparameters

This section describes the implementations and hyperparameters used for training the models released with the zoo. The DQN and Rainbow models come from the Dopamine model release Bellemare et al. [2018] (the hyperparameters and training details can be found at https://github.com/google/dopamine/tree/master/baselines/). The following sections describe the algorithms for the newly-trained models released with this paper.

S3.1 A2C

The implementation of A2C Mnih et al. [2016] that generated the models in this paper was derived from the OpenAI baselines software package Dhariwal et al. [2017]. It ran with 20 parallel worker threads for 400 million frames; checkpoints occurred every 4 million frames. Hyperparameters are listed in table S2.

Hyperparameter Setting
Learning Rate 7e-5
1.0
Value Function Loss Coefficient 0.5
Entropy Loss Coefficient 0.01
Discount factor 0.99
Table S2: A2C Hyperparameters. Many of the unusual numbers were found via preliminary hyperparameter searches in other domains.

S3.2 Ape-X

The implementation of Ape-X used to generate the models in this paper can be found here: https://github.com/uber-research/ape-x. The hyperparameters are reported in Table S3.

Hyperparameter Setting
Buffer Size
Number of Actors 384
Batch Size 512
n-step 3
gamma 0.99
gradient clipping 40
target network period 2500
Prioritized replay (0.6, 0.4)
Adam Learning rate 0.00025 / 4
Table S3: Ape-X Hyperparameters. For more details on what these parameters signify, see Horgan et al. [2018].

S3.3 GA

The implementation of GA used to generate the models in this paper can be found here: https://github.com/uber-research/deep-neuroevolution. The hyperparameters are reported in Table S4 and were found through random search.

Hyperparameter Setting
σ (Mutation Power) 0.002
Population Size 1000
Truncation Size 20
Table S4: GA Hyperparameters. For more details on what these parameters signify, see Such et al. [2017].

S3.4 ES

The implementation of ES used to generate the models in this paper can be found here: https://github.com/uber-research/deep-neuroevolution. The hyperparameters reported in Table S5 were found via preliminary search and are similar to those reported in Conti et al. [2017].

Hyperparameter Setting
σ (Mutation Power) 0.02
Virtual Batch Size 128
Population Size 5000
Learning Rate 0.01
Optimizer Adam
L2 Regularization Coefficient 0.005
Table S5: ES Hyperparameters. For more details on what these parameters signify, see Salimans et al. [2017], Conti et al. [2017].