In recent years, there has been increasing interest in understanding DeepRL models. Combining deep learning techniques with reinforcement learning algorithms, DeepRL leverages the strong representation capacity and approximation power of DNNs for return estimation and policy optimization (Sutton & Barto, 1998). In modern applications where a state is defined by high-dimensional input, e.g., Atari 2600 frames (Bellemare et al., 2013), the task of DeepRL divides into two essential sub-tasks: generating (low-dimensional) representations of states, and learning a policy from those representations.
As DeepRL does not optimize for class discriminative objectives, previous interpretation methods developed for classification models are not readily applicable to DeepRL models. The approximation of the optimal state value or action distribution not only operates in a black-box manner, but incorporates temporal information and environment dynamics. The black-box and sequential nature of DeepRL models makes them inherently difficult to understand.
Although interpreting DeepRL models is challenging, some efforts have been devoted in recent years to studying the behaviors of these complex models. Most existing interpretation methods (Mnih et al., 2015; Wang et al., 2016; Zahavy et al., 2016; Greydanus et al., 2018) are a-posteriori, explaining a model after it has been trained. For instance, some t-SNE-based methods (Mnih et al., 2015; Zahavy et al., 2016) employ game-specific human intuitions and expert knowledge in RL. Other vision-inspired methods (Wang et al., 2016) adopt traditional saliency methods. The representative method of Greydanus et al. (2018) adopts a data-driven approach, illustrating policy responses to a fixed input masking function at the cost of hundreds of forward passes per frame. As a common limitation, these a-posteriori methods cannot improve training with the deduced knowledge.
In this work, we approach the problem from a learning perspective, and propose Region Sensitive Rainbow (RS-Rainbow) to improve both the interpretability and performance of a DeepRL model. To this end, RS-Rainbow leverages a region-sensitive module to estimate the importance of different sub-regions on the screen, which is used to guide policy learning in end-to-end training. Specifically, a sub-region containing a distinctive pattern or objects useful for policy learning is assigned high importance. A combination of important sub-regions replaces the original unweighted screen as the representation of a state. Throughout an episode, the focus points of a pattern detector change as a result of game dynamics, leading to policy variations. Therefore, each pattern detector illustrates a distinct line of reasoning by the agent. With the region-sensitive module, we produce intuitive visualizations (see Fig. 1) in a single backward pass, without human intervention or repetitive, costly passes through the network.
The primary contribution of this work is to provide, to the best of our knowledge, the first learning-based approach for automatically interpreting DeepRL models. It requires no extra supervision and is end-to-end trainable. Moreover, it possesses three advantages:
1) In contrast to previous methods (Zahavy et al., 2016; Greydanus et al., 2018), RS-Rainbow illustrates the actual rationale used in inference for decision making, in an intuitive manner without human interventions.
2) Besides supporting innate interpretation, quantitative experiments on the Atari 2600 platform (Bellemare et al., 2013) demonstrate that RS-Rainbow effectively improves policy learning. In comparison, previous a-posteriori methods are unable to bring performance enhancements.
3) The region-sensitive module, the core component of RS-Rainbow, is a simple and efficient plug-in. It can be potentially applied to many DQN-based models for performance gains and a built-in visualization advantage.
The rest of the paper is organized as follows. We provide a brief overview of background knowledge in Sec. 2 and present the details of the proposed RS-Rainbow in Sec. 3. Sec. 4 demonstrates the interpretability of RS-Rainbow and Sec. 5 gives the quantitative evaluation of RS-Rainbow on Atari games. Conclusions are given in Sec. 6.
2 Background

2.1 DQN and Rainbow
As an RL algorithm, DQN seeks a policy that maximizes the long-term return of an agent acting in an environment, with convergence guarantees provided by the Bellman equation. DQN combines deep learning with the traditional off-policy, value-based Q-learning algorithm by employing a DNN as a value approximation function and mean-squared error minimization as an alternative to temporal difference updating (Sutton, 1988; Tesauro, 1995). The target network and experience replay are two key engineering techniques for stabilizing training. In DQN, the Q value refers to the expected discounted return for executing a particular action in a given state and following the current policy thereafter. Given optimal Q values, the optimal policy follows by taking the action with the highest Q value.
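As a minimal sketch (not the paper's implementation), the one-step TD target and the mean-squared error objective described above can be written as:

```python
import numpy as np

def dqn_target(reward, next_q_values, done, gamma=0.99):
    """One-step TD target used in DQN: y = r + gamma * max_a' Q_target(s', a').

    next_q_values: target-network Q values for each action in the next state s'.
    done: True if s' is terminal (no bootstrapping).
    """
    if done:
        return reward
    return reward + gamma * np.max(next_q_values)

def dqn_loss(q_pred, target):
    """Mean-squared error between the predicted Q value and the TD target."""
    return (q_pred - target) ** 2

# The agent bootstraps from the target network's best action value.
y = dqn_target(reward=1.0, next_q_values=np.array([0.5, 2.0, 1.0]), done=False)
# y = 1.0 + 0.99 * 2.0 = 2.98
```

In practice the maximization is over the action dimension of a batched network output, and the gradient flows only through `q_pred`, not the target.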
Rainbow (Hessel et al., 2018) incorporates many extensions over the original DQN (Mnih et al., 2013, 2015), each of which enhances a different aspect of the model. Such extensions include double DQN (van Hasselt et al., 2016), dueling DQN (Wang et al., 2016), prioritized experience replay (Schaul et al., 2016), multi-step learning (Sutton, 1988), distributional RL (Bellemare et al., 2017), and noisy nets (Fortunato et al., 2018). Double DQN addresses the over-estimation of Q in the target function. Dueling DQN decomposes the estimation of Q into separate estimations of a state value and an action advantage. Prioritized experience replay samples training data of higher learning potential with higher frequency. Multi-step learning looks multiple steps ahead by replacing one-step rewards and states with their multi-step counterparts. Noisy nets inject adaptable noise into linear layer outputs to introduce state-dependent exploration. In distributional RL, Q is modeled as a random variable whose distribution is learned over a fixed support set of discrete values. The resulting Kullback-Leibler divergence loss enjoys convergence guarantees, as the return distributions satisfy a Bellman equation.
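Two of these extensions, the double DQN target and the multi-step return, can be sketched as follows; the toy values and function names are ours, not the cited papers':

```python
import numpy as np

def vanilla_target(reward, next_q_target, gamma=0.99):
    # The same (target) network both selects and evaluates the next action,
    # which is the source of Q over-estimation.
    return reward + gamma * np.max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99):
    # Double DQN: the online network selects the action,
    # the target network evaluates it.
    best_action = int(np.argmax(next_q_online))
    return reward + gamma * next_q_target[best_action]

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    # Multi-step learning: accumulate n discounted rewards, then bootstrap
    # from the value estimate n steps ahead.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g + gamma ** len(rewards) * bootstrap_value
```

When the online and target networks disagree on the best action, the double DQN target is bounded by the vanilla one, which is the intended bias reduction.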
2.2 Understanding DeepRL
Interpreting RL systems traditionally involves language generation via first-order logic (Dodson et al., 2011; Elizalde et al., 2008; Khan et al., 2009; Hayes & Shah, 2017). These approaches rely on small state spaces and high-level state variables with interpretable semantics. As such, they are not applicable to most modern DeepRL applications, such as vision-based Atari 2600 tasks (Bellemare et al., 2013).
Zahavy et al. (2016) propose the Semi-Aggregated Markov Decision Process (SAMDP), which visualizes hierarchical spatio-temporal abstractions in a policy using game-specific attributes. The manual selection of suitable attributes makes SAMDP moderately reliant on human intuition for good performance. Moreover, extracting these attributes from simple emulators like Atari is particularly laborious without interface support. While high-level abstractions are informative to RL experts, a user without the relevant theoretical background may find them hard to understand.
The work of Greydanus et al. (2018) adopts perturbation-based saliency (Shrikumar et al., 2017) to visualize pixel importance in an asynchronous advantage actor-critic (A3C) model (Mnih et al., 2016). It applies a masking function at fixed, dense locations on the input frame and observes the impact on the target output, measured by the Euclidean distance. Such methods can be computationally inefficient, as each perturbation requires a separate forward pass through the network; hundreds of forward passes are therefore required to compute saliency on a single frame. Some work (Shrikumar et al., 2017) points out that saliency methods (Springenberg et al., 2015) tend to underestimate feature importance. Finally, as analyzed in Sec. 1, the prowess of saliency may be fundamentally limited by the optimization objectives of DeepRL, i.e., value estimation or policy optimization.
3 Proposed Approach
There are three main considerations in our motivation for RS-Rainbow. First, by definition, pixels on the screen do not all contain useful information for value prediction. For example, functional objects are critical while the background is less relevant. Second, the relevance of an object depends on the specific state. For instance, an unimportant background object may become important in some states when it is associated with reward signals, which can happen due to environment determinism. Third, humans tend to play a game by looking at sub-regions with high strategic values on the screen rather than considering all information on the entire screen.
Thus we are interested in the following questions. Will exploiting the relevance of objects in an environment benefit policy learning in DeepRL, given that such information can potentially improve state representations? If so, how can we learn the relevance information without extra supervision? Once learned, can object relevance shed light on the inference process of a DeepRL agent? In the next section, we describe our approach to exploring the answers.
We present an end-to-end learning architecture for addressing the above questions. The complete architecture of RS-Rainbow is illustrated in Fig. 2, which consists of an image encoder, the region-sensitive module, and policy layers with a value stream and an advantage stream.
As in Rainbow (Hessel et al., 2018), our image encoder is a three-layer CNN interleaved with ReLU nonlinearities (Nair & Hinton, 2010). At each time step, a stack of four consecutive frames S of shape 4 × 84 × 84 is drawn from the replay memory. The image encoder takes S as input and outputs the image embedding I of shape C × H × W, where C denotes the size of the channel dimension and H and W denote the sizes of the height and width dimensions. We normalize I along the channel dimension to ensure scale invariance.
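A shape-level sketch of the channel normalization follows; the 64 × 7 × 7 output size is assumed from the standard DQN-style encoder, and L2 is an assumed choice for the (otherwise unspecified) channel normalization:

```python
import numpy as np

# Assumed encoder output size: a standard DQN-style three-layer CNN maps a
# 4 x 84 x 84 input stack to a 64 x 7 x 7 feature volume.
C, H, W = 64, 7, 7
rng = np.random.default_rng(0)
I = rng.standard_normal((C, H, W))

# Normalize each spatial location's C-dimensional feature vector to unit L2
# norm, so downstream importance weights are invariant to feature scale.
norms = np.linalg.norm(I, axis=0, keepdims=True)  # shape (1, H, W)
I_normed = I / (norms + 1e-8)
```

After this step, every one of the H × W feature vectors has (approximately) unit length, so the region-sensitive weights compare locations on content rather than magnitude.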
In the region-sensitive module, we employ two layers of convolutions with ELU activation (Clevert et al., 2016). The region-sensitive module takes I as input and outputs score maps A of shape n × H × W, where n is the number of score maps, each of size H × W. Each element on a score map corresponds to a spatial location on I, and describes the importance of the image feature vector at that location. The score maps A are then passed to a normalization layer to generate meaningful probability distributions. In our experiments, we implement the normalization layer using the softmax function or the sigmoid function. The final probability distributions after normalizing A are denoted as P = {P_1, ..., P_n}, where P_k is the k-th (k = 1, ..., n) learned probability distribution.
Each P_k highlights a unique criterion of the agent in selecting important regions. As discussed in Sec. 1 and Sec. 4, each P_k assigns high importance to a unique pattern in the game. Therefore, the most important area according to P_k contains the most salient visual features for decision making. During training, P learns to assign importance to a diverse set of patterns that complement each other; together, they form a holistic view of the state. Note that no extra supervision is provided for learning P.
For each P_k, we generate a corresponding image embedding I_k as a unique representation of the state. I_k is defined as the element-wise product of P_k and I, with P_k broadcast along the channel dimension; hence, I_k is of the same shape as I. To obtain the final state representation, we aggregate I_1, ..., I_n. In summary, the original image embedding I is scaled at each spatial location by the corresponding estimated importance, and the n independent estimations are aggregated to form the final representation of the state.
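A minimal sketch of the region-sensitive weighting, assuming softmax spatial normalization and summation as the aggregation operator (the aggregation choice is our assumption):

```python
import numpy as np

def spatial_softmax(scores):
    # Normalize one H x W score map into a probability distribution over locations.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def region_sensitive(I, A):
    """I: (C, H, W) image embedding; A: (n, H, W) raw score maps.
    Each P_k reweights I spatially; the n weighted embeddings I_k are summed."""
    out = np.zeros_like(I)
    for k in range(A.shape[0]):
        P_k = spatial_softmax(A[k])          # (H, W), sums to 1
        out += P_k[None, :, :] * I           # broadcast over the channel dimension
    return out

rng = np.random.default_rng(1)
I = rng.standard_normal((64, 7, 7))
A = rng.standard_normal((2, 7, 7))           # n = 2 score maps, as in Sec. 4
I_agg = region_sensitive(I, A)               # same shape as I
```

The module is purely multiplicative, so gradients through `I_agg` train both the score maps and the encoder end-to-end with no extra supervision.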
The region-sensitive module is reminiscent of the attention mechanism, first proposed in the task of neural machine translation (Bahdanau et al., 2015) and further extended to areas such as visual question answering (Yang et al., 2016) and image caption generation (Xu et al., 2015). Different from attention, our region-sensitive module does not assume the role of a mapping function from a query and a key-value pair to an aggregated output.
Finally, the policy layers consist of an advantage stream and a value stream, the outputs of which are aggregated to estimate the state-action value Q. Each stream is implemented by two noisy linear layers (Fortunato et al., 2018) with ReLU (Nair & Hinton, 2010). A noisy linear layer introduces learnable noise into a linear function, thereby inducing state-dependent exploration which replaces ε-greedy exploration. Q values are then calculated as the mean of a learned distribution over a fixed support set of discrete return values, and are used to derive the policy.
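The dueling aggregation and the distributional expectation can be sketched as follows; the 51-atom support is an assumed C51-style choice, not a value stated here:

```python
import numpy as np

def dueling_q(value, advantages):
    # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
    # Subtracting the mean advantage makes the decomposition identifiable.
    return value + advantages - advantages.mean()

def distributional_q(probs, support):
    # Distributional RL: Q is the expectation of the learned return
    # distribution over a fixed support of discrete values.
    return float(np.dot(probs, support))

support = np.linspace(-10.0, 10.0, 51)   # C51-style support (values assumed)
probs = np.full(51, 1.0 / 51.0)          # a uniform toy distribution
q = distributional_q(probs, support)     # expectation of the toy distribution
```

In the full model, each action has its own probability vector over the support, and the action with the highest expectation is taken.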
Based on the region-sensitive module, we explore how to visualize and interpret learned salient regions, which are most important to decision making.
The first alternative (see Fig. 3) is to directly overlay the upsampled P_k onto the original screen, with intensity corresponding to the importance weight. As P_k has the same spatial size H × W as I, this alternative effectively treats the original screen as an H × W grid, and incorrectly assumes that the receptive field of each element in P_k corresponds to a grid cell. The most prominent issue is that localization is highly inaccurate.
In the second and third alternatives, we apply soft and binary saliency masks to the original screen, respectively. We first calculate the gradient-based saliency (Simonyan et al., 2014) of the largest importance score from each score map, as G_k = ∂ max_i A_k(i) / ∂ S, where i indexes spatial locations in A_k. We take the absolute value of G_k and normalize it between 0 and 1 as saliency. The original saliency corresponds to a soft mask, and we also binarize it to generate a binary mask.
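A sketch of the mask construction, with an assumed fixed binarization threshold of 0.5:

```python
import numpy as np

def saliency_masks(grad, threshold=0.5):
    """grad: raw gradient of the top importance score w.r.t. the input screen.
    Returns (soft_mask, binary_mask), both in [0, 1]. The fixed 0.5 threshold
    is an assumption for illustration."""
    sal = np.abs(grad)
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)  # rescale to [0, 1]
    binary = (sal >= threshold).astype(np.float64)
    return sal, binary

grad = np.array([[0.2, -0.8], [0.1, 0.4]])
soft, binary = saliency_masks(grad)
```

Multiplying either mask with the original frame then keeps only the pixels the score map is most sensitive to.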
As shown in Fig. 3, we multiply the soft saliency mask and the binary saliency mask with the original frame, respectively; both approaches accurately locate the salient object. In principle there is little difference between them; in practice, however, we observe that the soft saliency mask is fuzzy and uneven, while the binary saliency mask produces clear and intuitive visualizations.
Based on the above analysis, we adopt the binary saliency approach shown in Fig. 3 in our following interpretations of the challenging games of Atari 2600. Note that our visualization is automatically learned, which is different from existing a-posteriori methods. Interested readers can refer to (Zahavy et al., 2016; Greydanus et al., 2018) for more details.
4 Atari Analysis
In the racing game enduro, the total return is the total number of cars that the player has passed. Each day, the player must pass a minimum number of cars to qualify for the next day; passing more cars than the minimum brings no extra return. Variations in weather and time of day add extra difficulty to avoiding collisions.
By setting the number of score maps to two in RS-Rainbow, we obtain two individual gazes of the agent. A gaze is the region assigned the highest importance, which contributes the most to the Q value estimation. We first describe the most common patterns appearing in the two gazes throughout the game, and use them to characterize the general policy. Then we focus on special cases where the gazes shift to new patterns, which we discover explain interesting changes in the inference rationale.
General strategy. As shown in Fig. 4(a) to Fig. 4(c), both the left and right gazes attend to the race track, yet with different focuses. We discover that the left gaze focuses on different segments of the race track at different times, e.g., the far, the intermediate, and the near segments, whereas the right gaze consistently follows the player, which can be seen as a player tracker. Importantly, the locations discovered by the left gaze correspond to distant cars that are potential collision targets, and the player tracker also closely monitors upcoming cars that are imminent collision threats.
The general inference rationale of RS-Rainbow is summarized as follows. On a high level, the agent considers the race track the most important region, and features from this region subsequently contribute the most to Q value predictions. Specifically, the agent distinguishes between two categories of objects on the race track, i.e., cars and the player. On the one hand, the agent locates the player and the local area around it for avoiding immediate collisions. On the other hand, the agent locates the next potential collision targets at various distances. The agent first separately recognizes the player and approaching cars, and then combines the two when making decisions.
We highlight three properties of our interpretations. First, the gazes are automatically learned during end-to-end training without extra supervision. Second, our interpretations are not a-posteriori analysis as in (Greydanus et al., 2018) and (Zahavy et al., 2016). Instead, we illustrate the prominent patterns that contribute the most to decision making. Third, the interpretations are also the reasons for the performance improvements observed in Sec. 5.
Counting down. Near the completion of the current level, the agent “celebrates” in advance. As shown from Fig. 4(d) to Fig. 4(f), the left gaze loses its focus on cars and diverts to the mileage board once only a few cars are left to pass. We draw an analogy between counting down and the premature celebration of a runner in a race. In both cases, victory signs greatly influence the evaluation of states. We observe a normally functioning player tracker in the right gaze, and there is no noticeable policy shift in this stage. Therefore, we discover an insight about the internal decision making of RS-Rainbow that cannot be revealed by policy outputs alone.
Slacking. Upon reaching the goal, the agent does not receive reward signals until the next day starts. During this period, the agent learns to output no-op actions, corresponding to not playing the game. We refer to this stage as “slacking.” We are interested in what leads to the decision of not playing. Fig. 4(g) shows that when slacking happens, both gazes fixate on the mileage board, where flags are displayed indicating task completion. As such, the agent no longer considers the race track as important, and relies the most on the flags to make a decision. The recognition of the flags as a sign of zero return leads to the no-op policy.
Prepping. Near the start of a new race, the agent terminates slacking early, and starts driving in advance to get a head start. The flags are still up and there are still no rewards for playing. It is intriguing in this case why decision making has changed. As shown in Fig. 4(h), the left gaze focuses on an inconspicuous region in the background, i.e., some mountains and the sky. As it turns out, the agent recognizes dawn (time near a new race start) from the unique colours of the light gray sky and the orange mountains. Since dawn indicates forthcoming rewards, the normally unimportant mountains and the sky become important features for value prediction. In a way, the agent resembles an advanced human player who can exploit inconspicuous details and determinism in the game for earning higher rewards.
Smog. When smog partially blocks the front view, the left gaze cannot find car targets and strays off the road into empty fields. The distracted left gaze results in minor performance decrease. This indicates the importance of localizing collision targets in advance, which is a reasonable rule according to human intuition.
In this game, we discover that under the general setting, RS-Rainbow differentiates the player and approaching cars, while also combining them for decision making. In special stages of the game, the agent employs specific visual cues for making decisions. Surprisingly, we find some of these insights reasonably intuitive.
In this game, ms_pacman accumulates points by collecting pellets while avoiding ghosts in a maze. Eating power pellets makes the ghosts vulnerable. Eating fruits and vulnerable ghosts adds bonus points. Therefore, the moving objects, ghosts, vulnerable ghosts, and fruits are essential for high return. Ms_pacman proceeds to the next level after eating all pellets.
Fig. 5 illustrates the learned gazes of RS-Rainbow in this game. The right gaze stays focused on ms_pacman to track its position and detect nearby threats and pellets. The left gaze attends to different moving objects and locations in different states. Next we interpret specific game strategies via visualizations.
Moving objects detection. In Fig. 5(a), the left gaze detects two ghosts on the upper-right corner of the maze. Therefore ms_pacman, located by the right gaze, stays in the mid-left region to safely collect dense rewards. In Fig. 5(b), as ms_pacman chases after vulnerable ghosts, the left gaze locks in on three vulnerable ghosts in the mid-right region. In Fig. 5(c), the left gaze detects a new cherry near the lower-left warp tunnel entrance. Therefore, ms_pacman enters the closest warp tunnel from the right side, to be transported toward the cherry.
Travelling through a warp tunnel. In Fig. 5(d), the right gaze locates ms_pacman entering the upper-right tunnel. The left gaze predicts the exiting upper-left tunnel. In Fig. 5(e) and Fig. 5(f), we observe the same patterns, where the left gaze predicts the destination of a warp tunnel transportation.
Eating the last pellet. As shown in Fig. 5(g), the left gaze locates the last pellet near the bottom of the maze as ms_pacman moves towards this pellet. A few frames later in Fig. 5(h), a red ghost appears near the pellet and is captured by the left gaze, while ms_pacman diverts to the right to avoid the ghost. In Fig. 5(i), the ghosts besiege ms_pacman from all different directions. Even though the agent detects all ghosts (they appear in the gazes), ms_pacman has no route to escape.
In Frostbite, the player must build an igloo and enter it, before the temperature drops to zero. The mechanism for building is jumping on uncollected (white) floating ice blocks. The ice blocks, remaining degrees of temperature, and fish all provide rewards. Falling into water, temperature dropping to zero, and contacting the bear, clams, and birds all cost the player a life.
We discover that while the right gaze consistently locates the player, the left gaze takes on the role of a generic target detector specific to different strategies in this game. We identify three general types of targets detected by the left gaze, each of which defines a key strategy in the game. Next we interpret the rationale of the agent in each of these strategies using visualizations in Fig. 6.
Jumping. As shown in Fig. 6(a), the left gaze visualizes the next jumping destination, i.e., white ice blocks at the top, while the right gaze locates the player. Jumping onto white ice blocks is the most important skill in this game, as it both provides immediate rewards and builds the igloo for level completion. Subsequently, the most common pattern detected by the left gaze is white ice blocks that are the next jumping destination. We summarize the rationale of the agent as simultaneously locating the jumping target and the player to complete the jumping action.
Inspecting progress. As shown in Fig. 6(b), when the agent is close to finishing the igloo, the left gaze looks at the igloo in advance. The right gaze locates the player as usual. Throughout the game, the agent frequently inspects the completion status of the igloo, in preparation for timely entrance, which is reflected in the localization of the igloo by the left gaze.
Entering the igloo. Fig. 6(c) shows a bear chasing after the player as it enters the completed igloo. The right gaze still locates the player in front of the igloo, while the left gaze captures the malicious bear. After igloo completion, the player must jump onto land and run for the igloo while avoiding the bear. From the left gaze shown in Fig. 6(c), the agent learns the significance of the bear, and tracks the status of the bear when making decisions about jumping onto land and running for the igloo.
5 Quantitative Evaluation
As emphasized above, RS-Rainbow can lead to performance improvements due to a better policy learning paradigm. In this section, we give a quantitative evaluation of RS-Rainbow on Atari 2600.
5.1 Testing Environment and Preprocessing
A suite of Atari 2600 games from the Arcade Learning Environment (Bellemare et al., 2013) serves as the benchmark testbed for DeepRL algorithms. We compare against other state-of-the-art methods on eight games: beam_rider, breakout, enduro, frostbite, ms_pacman, pong, seaquest, and space_invaders.
For preprocessing, we follow (Wang et al., 2016; Schaul et al., 2016; Hessel et al., 2018): each frame is converted from RGB into single-channel grayscale and downsampled from a resolution of 210 × 160 to 84 × 84 via bilinear interpolation. At each time step, the input is four consecutive preprocessed frames stacked along the channel dimension.
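The preprocessing pipeline can be sketched as follows; nearest-neighbour subsampling stands in for the bilinear interpolation to keep the sketch dependency-free, and the helper names are ours:

```python
import numpy as np

def preprocess(frame_rgb):
    """(210, 160, 3) uint8 RGB frame -> (84, 84) grayscale in [0, 1].
    Nearest-neighbour subsampling approximates the bilinear resize."""
    gray = frame_rgb.astype(np.float32) @ np.array([0.299, 0.587, 0.114], np.float32)
    rows = np.linspace(0, gray.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, 84).astype(int)
    return gray[np.ix_(rows, cols)] / 255.0

def make_state(last_four_frames):
    """Stack four consecutive preprocessed frames along the channel dimension."""
    return np.stack([preprocess(f) for f in last_four_frames], axis=0)

state = make_state([np.zeros((210, 160, 3), np.uint8)] * 4)  # shape (4, 84, 84)
```

The resulting 4 × 84 × 84 stack is exactly the input shape consumed by the image encoder in Sec. 3.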
5.2 Implementation Details
We use the publicly available code for Rainbow, with the same hyperparameters and model details as in (Hessel et al., 2018). For the normalization layer in the region-sensitive module, we employ the sigmoid function in breakout, space_invaders, and seaquest, and the softmax function in the remaining games. In both training and testing, we cap the episode length at 108K frames and adopt an action repeat of 4. During training, rewards are clipped to the range [-1, 1]. Exploration is provided by the noisy linear layers. Periodically, we suspend training and evaluate the agent for a fixed number of episodes, and use the snapshot with the highest average score for testing. During testing, an ε-greedy policy with a small ε is used, as is standard practice. We evaluate the agents under the no-op random start condition: at the beginning of each test episode, a random number (up to 30) of no-ops are executed (as is done in training) before the agent starts playing. For the final performance evaluation, we report the average score across test episodes.
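Two small pieces of this protocol, reward clipping and no-op random starts, can be sketched as follows (the helper names and the dummy environment interface are ours):

```python
import numpy as np

def clip_reward(r):
    # Training rewards are clipped to [-1, 1], as in the DQN line of work.
    return float(np.clip(r, -1.0, 1.0))

def noop_start(env_reset, env_step, max_noops=30, rng=None):
    """Execute a random number (up to max_noops) of no-op actions after reset,
    so each test episode starts from a slightly different state. Action 0 is
    conventionally the no-op in the Arcade Learning Environment."""
    rng = rng or np.random.default_rng()
    obs = env_reset()
    for _ in range(int(rng.integers(1, max_noops + 1))):
        obs = env_step(0)
    return obs
```

Clipping keeps gradient magnitudes comparable across games, while random no-op starts prevent the agent from exploiting a single deterministic start state during evaluation.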
5.3 Comparison with the State-of-the-art
We compare the performance of RS-Rainbow with several other state-of-the-art methods in Table 1. The selected methods include Rainbow (Hessel et al., 2018), Distributional DQN (Bellemare et al., 2017), Noisy DQN (Fortunato et al., 2018), Duelling DDQN (Wang et al., 2016), Prioritized DDQN (Schaul et al., 2016), DDQN (van Hasselt et al., 2016), and DQN (Mnih et al., 2015).
For the performance of Rainbow, we report both the original scores quoted from (Hessel et al., 2018) and the scores reproduced by us; our re-implementation is denoted separately in Table 1. Note that the published performances of the DQN variants (including Rainbow) were obtained with substantially longer training, while our reported results for RS-Rainbow and the reproduced Rainbow are obtained with far fewer environment steps, due to limited computational resources. The only exception is frostbite, where we train for longer.
Table 1 shows that RS-Rainbow outperforms Rainbow by a large margin on 7 out of the 8 games, achieving improvements on beam_rider, breakout, enduro, frostbite, ms_pacman, seaquest, and space_invaders (exact scores in Table 1). On pong, RS-Rainbow matches Rainbow with a nearly perfect score, 21 being the maximum achievable score.
Compared with the state-of-the-art methods, RS-Rainbow outperforms the best-performing models on 6 out of the 8 games by solid margins: beam_rider, enduro, frostbite, ms_pacman, seaquest, and space_invaders. For instance, it improves over prioritized DDQN on beam_rider, over duelling DDQN on ms_pacman and seaquest, and over Rainbow on frostbite and space_invaders. On the remaining two games, breakout and pong, RS-Rainbow also reports competitive scores. These results are especially encouraging, as RS-Rainbow is trained with far fewer training frames, as described above. More performance gains can be expected when RS-Rainbow is trained with more frames or on a massively distributed computing platform (Nair et al., 2015; Horgan et al., 2018).
6 Conclusion

We approach the problem of interpreting DeepRL models from a learning perspective. Our proposed RS-Rainbow embeds innate interpretability into the learning model, leading to both clear visualizations and superior performance. It will be interesting to integrate our region-sensitive module with other DeepRL models, such as A3C (Mnih et al., 2016) and proximal policy optimization (Schulman et al., 2017). We leave these directions to future work.
- Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
- Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In ICML, 2017.
- Clevert et al. (2016) Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). In ICLR, 2016.
- Dabkowski & Gal (2017) Dabkowski, P. and Gal, Y. Real time image saliency for black box classifiers. In NeurIPS, 2017.
- Dodson et al. (2011) Dodson, T., Mattei, N., and Goldsmith, J. A natural language argumentation interface for explanation generation in markov decision processes. In Algorithmic Decision Theory, 2011.
- Elizalde et al. (2008) Elizalde, F., Sucar, L., Luque, M., Díez, F., and Reyes Ballesteros, A. Policy explanation in factored markov decision processes. In European Workshop on Probabilistic Graphical Models, 2008.
- Fong & Vedaldi (2017) Fong, R. and Vedaldi, A. Interpretable explanations of black boxes by meaningful perturbation. In ICCV, 2017.
- Fortunato et al. (2018) Fortunato, M., Azar, M. G., Piot, B., Menick, J., Hessel, M., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., Blundell, C., and Legg, S. Noisy networks for exploration. In ICLR, 2018.
- Greydanus et al. (2018) Greydanus, S., Koul, A., Dodge, J., and Fern, A. Visualizing and understanding Atari agents. In ICML, 2018.
- Hayes & Shah (2017) Hayes, B. and Shah, J. A. Improving robot controller transparency through autonomous policy explanation. In ACM/IEEE International Conference on Human-robot Interaction, 2017.
- Hessel et al. (2018) Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.
- Horgan et al. (2018) Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., van Hasselt, H., and Silver, D. Distributed prioritized experience replay. In ICLR, 2018.
- Khan et al. (2009) Khan, O. Z., Poupart, P., and Black, J. P. Minimal sufficient explanations for factored markov decision processes. In International Conference on Automated Planning and Scheduling, 2009.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012.
- Maaten & Hinton (2008) Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579–2605, 2008.
- Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. In NeurIPS Deep Learning Workshop, 2013.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
- Nair et al. (2015) Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., De Maria, A., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K., and Silver, D. Massively parallel methods for deep reinforcement learning. In ICML Deep Learning Workshop, 2015.
- Nair & Hinton (2010) Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
- Schaul et al. (2016) Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. In ICLR, 2016.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
- Shrikumar et al. (2017) Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In ICML, 2017.
- Simonyan et al. (2014) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR, 2014.
- Springenberg et al. (2015) Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In ICLR, 2015.
- Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
- Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, 1st edition, 1998.
- Tesauro (1995) Tesauro, G. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68, 1995.
- van Hasselt et al. (2016) van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI, 2016.
- Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In NeurIPS, 2017.
- Wang et al. (2016) Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. Dueling network architectures for deep reinforcement learning. In ICML, 2016.
- Xu et al. (2015) Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
- Yang et al. (2016) Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. J. Stacked attention networks for image question answering. In CVPR, 2016.
- Zahavy et al. (2016) Zahavy, T., Ben-Zrihem, N., and Mannor, S. Graying the black box: Understanding dqns. In ICML, 2016.
- Zeiler & Fergus (2014) Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 2014.