Learn to Interpret Atari Agents

December 29, 2018 · Zhao Yang et al.

Deep reinforcement learning (DeepRL) models surpass human-level performance in a multitude of tasks. Standing in stark contrast to this stellar performance is the obscure nature of the learned policies: the direct mapping from states to actions makes it hard to interpret the rationale behind an agent's decisions. In contrast to previous a-posteriori methods for visualizing DeepRL policies, we propose an end-to-end trainable framework based on Rainbow, a representative deep Q-network (DQN) agent. Our method automatically detects important regions in the input domain, which enables characterization of the general strategy and explanation of non-intuitive behaviors. Hence, we call it Region Sensitive Rainbow (RS-Rainbow). RS-Rainbow uses a simple yet effective mechanism to build innate visualization ability into the learning model, not only improving interpretability but also letting the agent leverage enhanced state representations for improved performance. Without extra supervision, specialized feature detectors focusing on distinct aspects of gameplay can be learned. Extensive experiments on the challenging Atari 2600 platform demonstrate the superiority of RS-Rainbow. In particular, our agent achieves state-of-the-art results with just 25% of the training frames and without massively parallel training.


1 Introduction

Understanding deep neural networks (DNNs) has been a long-standing goal of the machine learning community. Many efforts exploit the class-discriminative nature of CNN-based classification models (Krizhevsky et al., 2012) to produce human-interpretable visual explanations (Simonyan et al., 2014; Zeiler & Fergus, 2014; Springenberg et al., 2015; Shrikumar et al., 2017; Fong & Vedaldi, 2017; Dabkowski & Gal, 2017).

With the advent of deep reinforcement learning (DeepRL) (Mnih et al., 2013, 2015), there is increasing interest in understanding DeepRL models. Combining deep learning techniques with reinforcement learning algorithms, DeepRL leverages the strong representation capacity and approximation power of DNNs for return estimation and policy optimization (Sutton & Barto, 1998). In modern applications where a state is defined by high-dimensional input, e.g., Atari 2600 (Bellemare et al., 2013), the task of DeepRL divides into two essential sub-tasks: generating (low-dimensional) representations of states, and learning a policy on top of those representations.

As DeepRL does not optimize a class-discriminative objective, interpretation methods developed for classification models are not readily applicable to DeepRL models. The approximation of the optimal state value or action distribution not only operates in a black-box manner, but also incorporates temporal information and environment dynamics. This black-box, sequential nature makes DeepRL models inherently difficult to understand.

Although interpreting DeepRL models is challenging, recent years have seen efforts devoted to studying the behaviors of these complex models. Most existing interpretation methods (Mnih et al., 2015; Wang et al., 2016; Zahavy et al., 2016; Greydanus et al., 2018) are a-posteriori, explaining a model after it has been trained. For instance, t-SNE-based methods (Mnih et al., 2015; Zahavy et al., 2016) rely on game-specific human intuition and RL expert knowledge. Other vision-inspired methods (Wang et al., 2016) adopt traditional saliency techniques. The representative work of Greydanus et al. (2018) takes a data-driven approach, illustrating policy responses to a fixed input masking function at the cost of hundreds of forward passes per frame. As a common limitation, these a-posteriori methods cannot feed the deduced knowledge back into training.

In this work, we approach the problem from a learning perspective and propose Region Sensitive Rainbow (RS-Rainbow) to improve both the interpretability and the performance of a DeepRL model. To this end, RS-Rainbow leverages a region-sensitive module to estimate the importance of different sub-regions of the screen, which guides policy learning in end-to-end training. Specifically, a sub-region containing a distinctive pattern or objects useful for policy learning is assigned high importance. A combination of important sub-regions replaces the original unweighted screen as the representation of a state. Throughout an episode, the focus points of a pattern detector change with the game dynamics and lead to policy variations; each pattern detector therefore illustrates a distinct line of reasoning by the agent. With the region-sensitive module, we produce intuitive visualizations (see Fig. 1) in a single backward pass, without human intervention or repetitive, costly passes through the network.

The primary contribution of this work is to provide, to the best of our knowledge, the first learning-based approach for automatically interpreting DeepRL models. It requires no extra supervision and is end-to-end trainable. Moreover, it possesses three advantages:

1) In contrast to previous methods (Zahavy et al., 2016; Greydanus et al., 2018), RS-Rainbow illustrates the actual rationale used during inference in an intuitive manner, without human intervention.

2) Besides supporting innate interpretation, quantitative experiments on the Atari 2600 platform (Bellemare et al., 2013) demonstrate that RS-Rainbow effectively improves policy learning. In comparison, previous a-posteriori methods are unable to bring performance enhancements.

3) The region-sensitive module, the core component of RS-Rainbow, is a simple and efficient plug-in. It can be potentially applied to many DQN-based models for performance gains and a built-in visualization advantage.

The rest of the paper is organized as follows. We provide a brief overview of background knowledge in Sec. 2 and present the details of the proposed RS-Rainbow in Sec. 3. Sec. 4 demonstrates the interpretability of RS-Rainbow and Sec. 5 gives the quantitative evaluation of RS-Rainbow on Atari games. Conclusions are given in Sec. 6.

2 Background

Figure 2: The architecture of the proposed RS-Rainbow.

2.1 DQN and Rainbow

As an RL algorithm, DQN seeks a policy that maximizes the long-term return of an agent acting in an environment, with a convergence guarantee provided by the Bellman equation. DQN combines deep learning with the traditional off-policy, value-based Q-learning algorithm by employing a DNN as the value approximation function and mean-squared error minimization as an alternative to classic temporal-difference updates (Sutton, 1988; Tesauro, 1995). The target network and experience replay are two key engineering techniques that stabilize training. In DQN, the Q value refers to the expected discounted return for executing a particular action in a given state and following the current policy thereafter. Given optimal Q values, the optimal policy follows by taking the action with the highest Q value.
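For reference, the standard one-step DQN target and loss (Mnih et al., 2015) can be written as

$$y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-), \qquad L(\theta) = \mathbb{E}_{(s_t, a_t, r_{t+1}, s_{t+1}) \sim \mathcal{D}} \big[ (y_t - Q(s_t, a_t; \theta))^2 \big],$$

where $\theta^-$ denotes the parameters of the target network and $\mathcal{D}$ the replay memory; Rainbow replaces this squared error with the distributional Kullback-Leibler loss described below.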

Rainbow (Hessel et al., 2018) incorporates many extensions over the original DQN (Mnih et al., 2013, 2015), each of which enhances a different aspect of the model. These extensions include double DQN (van Hasselt et al., 2016), dueling DQN (Wang et al., 2016), prioritized experience replay (Schaul et al., 2016), multi-step learning (Sutton, 1988), distributional RL (Bellemare et al., 2017), and noisy nets (Fortunato et al., 2018). Double DQN addresses the over-estimation of Q in the target function. Dueling DQN decomposes the estimation of Q into separate estimates of a state value and an action advantage. Prioritized experience replay samples training data with higher learning potential at higher frequency. Multi-step learning looks multiple steps ahead by replacing one-step rewards and states with their multi-step counterparts. Noisy nets inject learnable noise into linear-layer outputs to induce state-dependent exploration. In distributional RL, Q is modeled as a random variable whose distribution is learned over a fixed support of discrete values; the resulting Kullback-Leibler divergence loss enjoys a convergence guarantee because the return distributions satisfy a distributional Bellman equation.

2.2 Understanding DeepRL

Interpreting RL systems traditionally involves language generation via first-order logic (Dodson et al., 2011; Elizalde et al., 2008; Khan et al., 2009; Hayes & Shah, 2017). These approaches rely on small state spaces and high-level state variables with interpretable semantics. As such, they are not applicable to most modern DeepRL applications, such as vision-based Atari 2600 tasks (Bellemare et al., 2013).

In the context of DeepRL, Mnih et al. (2015) and Zahavy et al. (2016) propose to interpret DQN policies in the t-SNE (Maaten & Hinton, 2008) embedding space. Zahavy et al. (2016) propose the Semi-Aggregated Markov Decision Process (SAMDP), which visualizes hierarchical spatio-temporal abstractions in a policy using game-specific attributes. The manual selection of suitable attributes makes SAMDP partly reliant on human intuition for good performance. Moreover, extracting these attributes from simple emulators like Atari is laborious without interface support. While high-level abstractions are informative to RL experts, users without the relevant theoretical background may find them hard to understand.

The work of Greydanus et al. (2018) adopts perturbation-based saliency (Shrikumar et al., 2017) to visualize pixel importance in an asynchronous advantage actor-critic (A3C) model (Mnih et al., 2016). It applies a masking function at fixed, dense locations on the input frame and observes the impact on the target output, measured by the Euclidean distance. Such methods are computationally inefficient, as each perturbation requires a separate forward pass through the network; hundreds of forward passes are needed to compute saliency for a single frame. Shrikumar et al. (2017) further point out that gradient-based saliency (Springenberg et al., 2015) tends to underestimate feature importance. Finally, as discussed in Sec. 1, the usefulness of saliency may be fundamentally limited by the optimization objectives of DeepRL, i.e., value estimation or policy optimization.

3 Proposed Approach

In this section, we introduce our motivation in Sec. 3.1, then describe the architecture of RS-Rainbow in Sec. 3.2, and finally present its capability for visualization in Sec. 3.3.

3.1 Motivation

There are three main considerations in our motivation for RS-Rainbow. First, by definition, pixels on the screen do not all contain useful information for value prediction. For example, functional objects are critical while the background is less relevant. Second, the relevance of an object depends on the specific state. For instance, an unimportant background object may become important in some states when it is associated with reward signals, which can happen due to environment determinism. Third, humans tend to play a game by looking at sub-regions with high strategic values on the screen rather than considering all information on the entire screen.

Thus we are interested in the following questions. Will exploiting the relevance of objects in an environment benefit policy learning in DeepRL, given that such information can potentially improve state representations? If so, how can we learn the relevance information without extra supervision? Once learned, can object relevance shed light on the inference process of a DeepRL agent? In the next section, we describe our approach to exploring the answers.

3.2 Architecture

We present an end-to-end learning architecture for addressing the above questions. The complete architecture of RS-Rainbow is illustrated in Fig. 2, which consists of an image encoder, the region-sensitive module, and policy layers with a value stream and an advantage stream.

As in Rainbow (Hessel et al., 2018), our image encoder is a three-layer CNN interleaved with ReLU nonlinearities (Nair & Hinton, 2010). At each time step $t$, a stack of four consecutive frames $S$ is drawn from the replay memory. The image encoder takes $S$ as input and outputs the image embedding $I \in \mathbb{R}^{c \times h \times w}$, where $c$ denotes the channel dimension and $h$ and $w$ denote the height and width of the feature map. We normalize $I$ along the channel dimension to ensure scale invariance.

In the region-sensitive module, we employ two convolution layers with ELU activations (Clevert et al., 2016). The module takes $I$ as input and outputs score maps $A \in \mathbb{R}^{n \times h \times w}$, where $n$ is the number of score maps, each of size $h \times w$. Each element of a score map corresponds to a spatial location on $I$ and describes the importance of the image feature vector at that location. The score maps $A$ are then passed to a normalization layer to produce meaningful probability distributions. In our experiments, we implement the normalization layer with either the softmax function or the sigmoid function. The resulting distributions are denoted as $P = \{P_i\}_{i=1}^{n}$, where $P_i$ is the $i$-th ($1 \le i \le n$) normalized score map.

Each $P_i$ highlights a unique criterion of the agent in selecting important regions. As discussed in Sec. 1 and Sec. 4, each $P_i$ assigns high importance to a distinct pattern in the game; the most important area according to $P_i$ thus contains the most salient visual features for decision making. During training, $P$ learns to assign importance to a diverse set of patterns that complement each other and together form a holistic view of the state. Note that no extra supervision is provided for learning $P$.

For each $P_i$, we generate a corresponding image embedding $I_i$ as a unique representation of the state. $I_i$ is defined as the element-wise product of $P_i$ and $I$, with $P_i$ broadcast along the channel dimension, so $I_i$ has the same shape as $I$. To obtain the final state representation, we aggregate the $n$ weighted embeddings $I_1, \dots, I_n$. In summary, the original image embedding $I$ is scaled at each spatial location by the corresponding estimated importance, and the $n$ independent estimates are aggregated to form the final representation of the state.
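The following sketch shows how the region-sensitive module and the aggregation could be wired up; the 1x1 kernel sizes, the hidden width, and the sum aggregation of the weighted embeddings are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionSensitiveModule(nn.Module):
    """Two conv layers with ELU produce n score maps over the (h, w) grid of the
    image embedding I; normalized maps weight I and the results are aggregated.
    Kernel sizes, hidden width, and sum aggregation are assumptions."""
    def __init__(self, in_channels=64, hidden=128, n_maps=2, use_softmax=True):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden, n_maps, kernel_size=1)
        self.use_softmax = use_softmax

    def forward(self, I):
        # I: (batch, c, h, w) normalized image embedding
        a = self.conv2(F.elu(self.conv1(I)))               # score maps A: (batch, n, h, w)
        b, n, h, w = a.shape
        if self.use_softmax:
            p = F.softmax(a.view(b, n, h * w), dim=-1).view(b, n, h, w)
        else:
            p = torch.sigmoid(a)                           # per-location importance in [0, 1]
        # Broadcast each P_i along the channel dimension, weight I, and aggregate.
        weighted = p.unsqueeze(2) * I.unsqueeze(1)         # (batch, n, c, h, w)
        return weighted.sum(dim=1), p                      # state representation and P
```

Setting n_maps=2 corresponds to the two-gaze configuration analyzed in Sec. 4.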

The region-sensitive module is related to the broader concept of attention popularized by Bahdanau et al. (2015) and Vaswani et al. (2017) in neural machine translation and further extended to areas such as visual question answering (Yang et al., 2016) and image caption generation (Xu et al., 2015). Unlike attention, our region-sensitive module does not assume the role of a mapping function from a query and a key-value pair to an aggregated output.

Finally, the policy layers consist of an advantage stream and a value stream, whose outputs are aggregated to estimate the state-action value Q. Each stream is implemented with two noisy linear layers (Fortunato et al., 2018) and ReLU (Nair & Hinton, 2010). A noisy linear layer introduces learnable noise into a linear function, inducing state-dependent exploration that replaces ε-greedy exploration. Q values are calculated as the mean of a learned distribution over a fixed support of discrete return values, and the policy is derived from these Q values.
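For reference, the standard dueling aggregation of Wang et al. (2016) that combines the two streams is

$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'),$$

where $V$ is the value-stream output and $A$ the advantage-stream output; in Rainbow's distributional setting this aggregation is applied per support atom before a softmax over atoms.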

Figure 3: Three alternatives for visualization. (a) Weights overlay. (b) Soft saliency mask. (c) Binary saliency mask.
Figure 4: Visualizing enduro. (a)-(c) correspond to the general strategy. (d)-(f) represent the special stage of counting down. (g), (h) and (i) illustrate the stages of slacking, prepping, and smog, respectively.

3.3 Visualization

Based on the region-sensitive module, we explore how to visualize and interpret learned salient regions, which are most important to decision making.

The first alternative (see Fig. 3(a)) is to directly overlay the upsampled $P_i$ onto the original screen, with intensity corresponding to the importance weight. As $P_i$ is of size $h \times w$, this alternative effectively treats the original screen as an $h \times w$ grid and incorrectly assumes that the receptive field of each element in $P_i$ corresponds to a grid cell. The most prominent issue is that localization is highly inaccurate.

In the second and third alternatives, we apply a soft and a binary saliency mask to the original screen, respectively. We first calculate the gradient-based saliency (Simonyan et al., 2014) of the largest importance score in each score map, i.e., the gradient of $\max_{x} A_i(x)$ with respect to the input frames, where $x$ indexes spatial locations in $A_i$. We take the absolute value of the gradient and normalize it between 0 and 1 to obtain the saliency. The original saliency corresponds to a soft mask, and we also binarize it to generate a binary mask.

As shown in Fig. 3(b) and Fig. 3(c), we multiply the soft and binary saliency masks with the original frame, respectively; both approaches accurately locate the salient object. In principle there is little difference between them, but in practice the soft saliency mask appears fuzzy and uneven, while the binary saliency mask produces clear and intuitive visualizations.
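As a rough sketch of how the binary-mask variant could be computed, the snippet below reuses the hypothetical encoder and region_module interfaces from the earlier sketches; the 0.5 threshold and the max over the stacked input frames are illustrative choices, not taken from the paper.

```python
import torch

def binary_saliency_mask(encoder, region_module, frames, i, threshold=0.5):
    """Gradient-based saliency of the largest score in map i, binarized.
    `frames` is a (1, 4, 84, 84) input stack; interfaces are assumptions."""
    frames = frames.detach().clone().requires_grad_(True)
    I = encoder(frames)
    _, scores = region_module(I)                  # normalized score maps P
    scores[0, i].max().backward()                 # gradient of the largest importance score
    sal = frames.grad.abs().max(dim=1)[0]         # combine gradients over the frame stack
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)   # normalize to [0, 1]
    return (sal > threshold).float()              # binary mask over the screen
```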

Based on the above analysis, we adopt the binary saliency approach of Fig. 3(c) in the following interpretations of challenging Atari 2600 games. Note that our visualization is learned automatically, which differs from existing a-posteriori methods; interested readers can refer to (Zahavy et al., 2016; Greydanus et al., 2018) for details of the latter.

4 Atari Analysis

4.1 Enduro

Figure 5: Visualizing ms_pacman. (a)-(c) Detecting moving objects: ghosts, vulnerable ghosts, and fruits. (d)-(f) Travelling through a warp tunnel. (g)-(i) Eating the last pellet in the maze.

In this racing game, the total return is the total number of cars that the player has passed. On each day, the player has to pass a minimum number of cars to qualify for the next day. Passing more cars than the minimum does not bring extra return. Variations in weather and time add extra difficulty for avoiding collisions.

By setting the number of score maps to $n = 2$ in RS-Rainbow, we obtain two individual gazes of the agent. A gaze is the region assigned the highest importance, which contributes the most to the Q value estimation. We first describe the most common patterns appearing in the two gazes throughout the game and use them to characterize the general policy. We then focus on special cases where the gazes shift to new patterns, which we find explain interesting changes in the inference rationale.

General strategy. As shown in Fig. 4(a) to Fig. 4(c), both the left and right gazes attend to the race track, yet with different focuses. We discover that the left gaze focuses on different segments of the race track at different times, e.g., the far, the intermediate, and the near segments, whereas the right gaze consistently follows the player, which can be seen as a player tracker. Importantly, the locations discovered by the left gaze correspond to distant cars that are potential collision targets, and the player tracker also closely monitors upcoming cars that are imminent collision threats.

The general inference rationale of RS-Rainbow can be summarized as follows. At a high level, the agent considers the race track the most important region, so features from this region contribute the most to Q value predictions. Specifically, the agent distinguishes between two categories of objects on the race track, i.e., cars and the player. On the one hand, the agent locates the player and its local surroundings to avoid immediate collisions. On the other hand, it locates the next potential collision targets at various distances. The agent first recognizes the player and approaching cars separately, and then combines the two when making decisions.

We highlight three properties of our interpretations. First, the gazes are automatically learned during end-to-end training without extra supervision. Second, our interpretations are not a-posteriori analysis as in (Greydanus et al., 2018) and (Zahavy et al., 2016). Instead, we illustrate the prominent patterns that contribute the most to decision making. Third, the interpretations are also the reasons for the performance improvements observed in Sec. 5.

Counting down. Near the completion of the current level, the agent “celebrates” in advance. As shown in Fig. 4(d) to Fig. 4(f), the left gaze loses its focus on cars and diverts to the mileage board once only a few cars remain to be passed. We draw an analogy between counting down and the premature celebration of a runner in a race: in both cases, victory signs greatly influence the evaluation of states. We observe a normally functioning player tracker in the right gaze, and there is no noticeable policy shift at this stage. We thus uncover an insight about the internal decision making of RS-Rainbow that cannot be revealed by policy outputs alone.

Slacking. Upon reaching the goal, the agent does not receive reward signals until the next day starts. During this period, the agent learns to output no-op actions, corresponding to not playing the game. We refer to this stage as “slacking.” We are interested in what leads to the decision of not playing. Fig. 4(g) shows that when slacking happens, both gazes fixate on the mileage board, where flags are displayed indicating task completion. As such, the agent no longer considers the race track as important, and relies the most on the flags to make a decision. The recognition of the flags as a sign of zero return leads to the no-op policy.

Prepping. Near the start of a new race, the agent terminates slacking early and starts driving in advance to get a head start. The flags are still up and there are still no rewards for playing, so it is intriguing why the decision making has changed. As shown in Fig. 4(h), the left gaze focuses on an inconspicuous region in the background, i.e., some mountains and the sky. As it turns out, the agent recognizes dawn (the time near a new race start) from the unique colors of the light gray sky and the orange mountains. Since dawn indicates forthcoming rewards, the normally unimportant mountains and sky become important features for value prediction. In a way, the agent resembles an advanced human player who exploits inconspicuous details and determinism in the game to earn higher rewards.

Smog. When smog partially blocks the front view, the left gaze cannot find car targets and strays off the road into empty fields. The distracted left gaze results in a minor performance decrease. This indicates the importance of localizing collision targets in advance, which is a reasonable rule according to human intuition.

In this game, we discover that under the general setting, RS-Rainbow differentiates the player and approaching cars, while also combining them for decision making. In special stages of the game, the agent employs specific visual cues for making decisions. Surprisingly, we find some of these insights reasonably intuitive.

4.2 Ms_pacman

Figure 6: Visualizing frostbite. (a) corresponds to jumping over ice blocks. (b) corresponds to checking the construction progress of the igloo. (c) corresponds to entering the igloo.

In this game, ms_pacman accumulates points by collecting pellets while avoiding ghosts in a maze. Eating power pellets makes the ghosts vulnerable. Eating fruits and vulnerable ghosts adds bonus points. Therefore, the moving objects, ghosts, vulnerable ghosts, and fruits are essential for high return. Ms_pacman proceeds to the next level after eating all pellets.

Fig. 5 illustrates the learned gazes of RS-Rainbow in this game. The right gaze stays focused on ms_pacman to track its position and detect nearby threats and pellets. The left gaze attends to different moving objects and locations in different states. Next we interpret specific game strategies via visualizations.

Moving objects detection. In Fig. 5(a), the left gaze detects two ghosts on the upper-right corner of the maze. Therefore ms_pacman, located by the right gaze, stays in the mid-left region to safely collect dense rewards. In Fig. 5(b), as ms_pacman chases after vulnerable ghosts, the left gaze locks in on three vulnerable ghosts in the mid-right region. In Fig. 5(c), the left gaze detects a new cherry near the lower-left warp tunnel entrance. Therefore, ms_pacman enters the closest warp tunnel from the right side, to be transported toward the cherry.

Travelling through a warp tunnel. In Fig. 5(d), the right gaze locates ms_pacman entering the upper-right tunnel. The left gaze predicts the exiting upper-left tunnel. In Fig. 5(e) and Fig. 5(f), we observe the same patterns, where the left gaze predicts the destination of a warp tunnel transportation.

Eating the last pellet. As shown in Fig. 5(g), the left gaze locates the last pellet near the bottom of the maze as ms_pacman moves towards this pellet. A few frames later in Fig. 5(h), a red ghost appears near the pellet and is captured by the left gaze, while ms_pacman diverts to the right to avoid the ghost. In Fig. 5(i), the ghosts besiege ms_pacman from all different directions. Even though the agent detects all ghosts (they appear in the gazes), ms_pacman has no route to escape.

4.3 Frostbite

Method beam_rider breakout enduro frostbite ms_pacman pong seaquest space_invaders
DQN 8,627.5 385.5 729.0 797.4 3,085.6 19.5 5,860.6 1,692.3
DDQN 13,772.8 418.5 1,211.8 1,683.3 2,711.4 20.9 16,452.7 2,525.5
Prior. DDQN 22,430.7 381.5 2,155.0 3,421.6 4,751.2 20.7 44,417.4 7,696.9
Duel. DDQN 12,164.0 345.3 2,258.2 4,672.8 6,283.5 21.0 50,254.2 6,427.3
Dist. DQN 13,213.4 612.5 2,259.3 3,938.2 3,769.2 20.8 4,754.4 6,869.1
Noisy DQN 12,534.0 459.1 1,129.2 583.6 2,501.6 21.0 2,495.4 2,145.5
Rainbow 16,850.2 417.5 2,125.9 9,590.5 5,380.4 20.9 15,898.9 18,789.0
Rainbow* 17,656.8 370.7 2,283.6 11,298.3 6,686.3 20.9 73,601.4 3,001.2
RS-Rainbow (ours) 26,722.3 434.2 2,329.1 12,902.0 7,219.3 20.9 245,307.3 19,670.0
Table 1: Comparison of performance with other state-of-the-art methods under the no-op testing condition. Rainbow* denotes our re-implementation of Rainbow. The results of other methods are quoted from the respective original papers.

In Frostbite, the player must build an igloo and enter it, before the temperature drops to zero. The mechanism for building is jumping on uncollected (white) floating ice blocks. The ice blocks, remaining degrees of temperature, and fish all provide rewards. Falling into water, temperature dropping to zero, and contacting the bear, clams, and birds all cost the player a life.

We discover that while the right gaze consistently locates the player, the left gaze takes on the role of a generic target detector specific to different strategies in this game. We identify three general types of targets detected by the left gaze, each of which defines a key strategy in the game. Next we interpret the rationale of the agent in each of these strategies using visualizations in Fig. 6.

Jumping. As shown in Fig. 6(a), the left gaze visualizes the next jumping destination, i.e., white ice blocks at the top, while the right gaze locates the player. Jumping onto white ice blocks is the most important skill in this game, as it both provides immediate rewards and builds the igloo for level completion. Subsequently, the most common pattern detected by the left gaze is white ice blocks that are the next jumping destination. We summarize the rationale of the agent as simultaneously locating the jumping target and the player to complete the jumping action.

Inspecting progress. As shown in Fig. 6(b), when the agent is close to finishing the igloo, the left gaze looks at the igloo in advance. The right gaze locates the player as usual. Throughout the game, the agent frequently inspects the completion status of the igloo, in preparation for timely entrance, which is reflected in the localization of the igloo by the left gaze.

Entering the igloo. Fig. 6(c) shows a bear chasing after the player as it enters the completed igloo. The right gaze still locates the player in front of the igloo, while the left gaze captures the malicious bear. After igloo completion, the player must jump onto land and run for the igloo while avoiding the bear. From the left gaze shown in Fig. 6(c), the agent learns the significance of the bear, and tracks the status of the bear when making decisions about jumping onto land and running for the igloo.

5 Quantitative Evaluation

As emphasized above, RS-Rainbow can lead to performance improvements owing to a better policy learning paradigm. In this section, we give a quantitative evaluation of RS-Rainbow on Atari 2600.

5.1 Testing Environment and Preprocessing

A suite of Atari 2600 games from the Arcade Learning Environment (Bellemare et al., 2013) is a standard benchmark for DeepRL algorithms. We compare against other state-of-the-art methods on eight games: beam_rider, breakout, enduro, frostbite, ms_pacman, pong, seaquest, and space_invaders.

For preprocessing, we follow (Wang et al., 2016; Schaul et al., 2016; Hessel et al., 2018). Each frame is converted from RGB to single-channel grayscale and downsampled from the native 210x160 resolution to 84x84 via bilinear interpolation. At each time step, the input is four consecutive preprocessed frames stacked along the channel dimension.
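A small sketch of this preprocessing step, assuming OpenCV for the image operations; scaling pixel values to [0, 1] is a common convention and an assumption here.

```python
import numpy as np
import cv2  # OpenCV for grayscale conversion and resizing

def preprocess_frame(rgb_frame):
    """Convert one RGB Atari frame (210x160x3) to an 84x84 grayscale image
    via bilinear interpolation."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_LINEAR)

def stack_frames(last_four_frames):
    """Stack the four most recent preprocessed frames along the channel axis."""
    return np.stack(last_four_frames, axis=0).astype(np.float32) / 255.0  # (4, 84, 84)
```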

5.2 Implementation Details

We use the publicly available code for Rainbow with the same hyperparameters and model details as in (Hessel et al., 2018). For the normalization layer in the region-sensitive module, we employ the sigmoid function in breakout, space_invaders, and seaquest, and the softmax function in the remaining games. In both training and testing, we cap the episode length at 108K frames and use an action repeat of 4. During training, rewards are clipped to the range [-1, 1]. Exploration is handled by the noisy linear layers. At regular intervals during training, we suspend training, evaluate the agent for a fixed number of episodes, and keep the snapshot with the highest average score for testing. During testing, an ε-greedy policy with a small ε is used, as is standard practice. We evaluate the agents under the no-op random start condition: at the beginning of each test episode, a random number (up to 30) of no-op actions is executed (as is done in training) before the agent starts playing. For the final performance evaluation, we report the average score over the test episodes.

5.3 Comparison with the State-of-the-art

We compare the performance of RS-Rainbow with several other state-of-the-art methods in Table 1. The selected methods include Rainbow (Hessel et al., 2018), Distributional DQN (Bellemare et al., 2017), Noisy DQN (Fortunato et al., 2018), Dueling DDQN (Wang et al., 2016), Prioritized DDQN (Schaul et al., 2016), DDQN (van Hasselt et al., 2016), and DQN (Mnih et al., 2015).

For the performance of Rainbow, we report both the original scores quoted from (Hessel et al., 2018) and those reproduced by us; we denote our re-implementation as Rainbow* in Table 1. Note that the published performances of the DQN variants (including Rainbow) are obtained after training for 200 million environment frames, whereas our reported results for RS-Rainbow and Rainbow* are obtained with only a fraction of that training budget, due to limited computational resources. The only exception is frostbite, where we train for longer.

Table 1 shows that RS-Rainbow outperforms Rainbow by a clear margin in 7 out of 8 games. Compared with our Rainbow* baseline, RS-Rainbow improves on beam_rider, breakout, enduro, frostbite, ms_pacman, seaquest, and space_invaders, with especially large gains on beam_rider, seaquest, and space_invaders. On pong, RS-Rainbow matches Rainbow with a near-perfect score of 20.9, with 21 being the maximum achievable score.

Compared with the other state-of-the-art methods, RS-Rainbow outperforms the best-performing models on 6 out of 8 games by solid margins: beam_rider, enduro, frostbite, ms_pacman, seaquest, and space_invaders. For instance, it improves over prioritized DDQN on beam_rider, over dueling DDQN on ms_pacman and seaquest, and over Rainbow on frostbite and space_invaders. On the remaining two games, breakout and pong, RS-Rainbow also reports competitive scores. These results are especially encouraging, as RS-Rainbow is trained with far fewer training frames, as described above. Further gains can be expected when RS-Rainbow is trained with more frames or on a massively distributed computing platform (Nair et al., 2015; Horgan et al., 2018).

6 Conclusion

We approach the problem of interpreting DeepRL models from a learning perspective. Our proposed RS-Rainbow embeds innate interpretability into the learning model, leading to both clear visualizations and superior performance. It will be interesting to integrate our region-sensitive module with other DeepRL models, such as A3C (Mnih et al., 2016) and proximal policy optimization (Schulman et al., 2017). We leave these directions to future work.

References

  • Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
  • Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In ICML, 2017.
  • Clevert et al. (2016) Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). In ICLR, 2016.
  • Dabkowski & Gal (2017) Dabkowski, P. and Gal, Y. Real time image saliency for black box classifiers. In NeurIPS, 2017.
  • Dodson et al. (2011) Dodson, T., Mattei, N., and Goldsmith, J. A natural language argumentation interface for explanation generation in markov decision processes. In Algorithmic Decision Theory, 2011.
  • Elizalde et al. (2008) Elizalde, F., Sucar, L., Luque, M., Díez, F., and Reyes Ballesteros, A. Policy explanation in factored markov decision processes. In European Workshop on Probabilistic Graphical Models, 2008.
  • Fong & Vedaldi (2017) Fong, R. and Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. In ICCV, 2017.
  • Fortunato et al. (2018) Fortunato, M., Azar, M. G., Piot, B., Menick, J., Hessel, M., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., and Legg, S. Noisy networks for exploration. In ICLR, 2018.
  • Greydanus et al. (2018) Greydanus, S., Koul, A., Dodge, J., and Fern, A. Visualizing and understanding Atari agents. In ICML, 2018.
  • Hayes & Shah (2017) Hayes, B. and Shah, J. A. Improving robot controller transparency through autonomous policy explanation. In ACM/IEEE International Conference on Human-robot Interaction, 2017.
  • Hessel et al. (2018) Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.
  • Horgan et al. (2018) Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. Distributed prioritized experience replay. In ICLR, 2018.
  • Khan et al. (2009) Khan, O. Z., Poupart, P., and Black, J. P. Minimal sufficient explanations for factored markov decision processes. In International Conference on Automated Planning and Scheduling, 2009.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
  • Maaten & Hinton (2008) Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. In NeurIPS Deep Learning Workshop. 2013.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
  • Nair et al. (2015) Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning. In ICML Deep Learning Workshop. 2015.
  • Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
  • Schaul et al. (2016) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In ICLR, 2016.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • Shrikumar et al. (2017) Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In ICML, 2017.
  • Simonyan et al. (2014) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR, 2014.
  • Springenberg et al. (2015) Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In ICLR, 2015.
  • Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
  • Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, 1st edition, 1998.
  • Tesauro (1995) Tesauro, G. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
  • van Hasselt et al. (2016) van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI, 2016.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.
  • Wang et al. (2016) Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. Dueling network architectures for deep reinforcement learning. In ICML, 2016.
  • Xu et al. (2015) Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
  • Yang et al. (2016) Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. J. Stacked attention networks for image question answering. In CVPR, 2016.
  • Zahavy et al. (2016) Zahavy, T., Ben-Zrihem, N., and Mannor, S. Graying the black box: Understanding dqns. In ICML, 2016.
  • Zeiler & Fergus (2014) Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 2014.