Towards Better Interpretability in Deep Q-Networks

09/15/2018 ∙ by Raghuram Mandyam Annasamy, et al. ∙ Carnegie Mellon University

Deep reinforcement learning techniques have demonstrated superior performance in a wide variety of environments. While improvements in training algorithms continue at a brisk pace, theoretical and empirical studies of what these networks actually learn lag far behind. In this paper we propose an interpretable neural network architecture for Q-learning which provides a global explanation of the model's behavior using key-value memories, attention and reconstructible embeddings. With a directed exploration strategy, our model can reach training rewards comparable to the state-of-the-art deep Q-learning models. However, our results suggest that the features extracted by the neural network are extremely shallow, and subsequent testing on out-of-sample examples shows that the agent can easily overfit to trajectories seen during training.





The last few years have witnessed a rapid growth of research and interest in the domain of deep Reinforcement Learning (RL) due to the significant progress in solving RL problems [Arulkumaran et al.2017]. Deep RL has been applied to a wide variety of disciplines ranging from game playing, robotics and systems to natural language processing and even biological data [Silver et al.2017, Mnih et al.2015, Levine et al.2016, Kraska et al.2018, Williams, Asadi, and Zweig2017, Choi et al.2017]. However, most applications treat neural networks as a black box, and understanding and interpreting deep learning models remains a hard problem. This is even more understudied in the context of deep reinforcement learning and has only recently started to receive attention. Commonly used visualization methods for deep learning such as saliency maps and t-SNE plots of embeddings have been applied to deep RL models [Greydanus et al.2017, Zahavy, Ben-Zrihem, and Mannor2016, Mnih et al.2015]. However, there are questions over the reliability of saliency methods including, as an example, sensitivity to simple transformations of the input [Kindermans et al.2017]. The problem of generalization and memorization with deep RL models is also important. Recent findings suggest that deep RL agents can easily memorize large amounts of training data with drastically varying test performance and are vulnerable to adversarial attacks [Zhang et al.2018, Zhang, Ballas, and Pineau2018, Huang et al.2017].

In this paper, we propose a neural network architecture for Q-learning using key-value stores, attention and constrained embeddings, that is easier to study than the traditional deep Q-network architectures. This is inspired by some of the recent work on Neural Episodic Control (NEC) [Pritzel et al.2017] and distributional perspectives on RL [Bellemare, Dabney, and Munos2017]. We call this model i-DQN for Interpretable DQN and study latent representations learned by the model on standard Atari environments from Open AI gym [Brockman et al.2016]. Most current work around interpretability in deep learning is based on local explanations i.e. explaining network predictions for specific input examples [Lipton2016]. For example, saliency maps can highlight important regions of the input that influence the output of the neural network. In contrast, global explanations attempt to understand the mapping learned by a neural network regardless of the input. We achieve this by constraining the latent space to be reconstructible and inverting embeddings of representative elements in the latent space (keys). This helps us understand aspects of the input space (images) that are captured in the latent space across inputs. Our visualizations suggest that the features extracted by the convolutional layers are extremely shallow and can easily overfit to trajectories seen during training. This is in line with the results of [Zhang et al.2018] and [Zhang, Ballas, and Pineau2018]. Although our main focus is to understand learned models, it is important that the models we analyze perform well on the task at hand. To this end, we show our model achieves training rewards comparable to Q-learning models like Distributional DQN [Bellemare, Dabney, and Munos2017]. Our contribution in this work is threefold:

  • We explore a different neural network architecture with key-value stores, constrained embeddings and an explicit soft-assignment step that separates representation learning and Q-value learning (state aggregation).

  • We show that such a model can improve interpretability in terms of visualizations of the learned keys (cluster), attention maps and saliency maps. Our method attempts to provide a global explanation of the model’s behavior instead of explaining specific input examples (local explanations). We also develop a few examples to test the generalization behavior.

  • We show that the model’s uncertainty can be used to drive exploration that reaches reasonably high rewards with reduced sample complexity (training examples) on some of the Atari environments.

Figure 1: Model Architecture- Interpretable DQN (i-DQN)

Related Work

Many attempts have been made to tackle the problem of interpretability with deep learning, largely in the supervised learning case. [Zhang and Zhu2018] carry out an in-depth survey on interpretability with Convolutional Neural Networks (CNNs). Our approach to visualizing embeddings is in principle similar to the work of [Dosovitskiy and Brox2016] on inverting visual representations. They train a neural network with deconvolution layers using HOG, SIFT and AlexNet embeddings as input and their corresponding real images as ground truth (the sole purpose of this network being visualization). Saliency maps are another popular type of method; they generate local explanations and generally use gradient-like information to identify salient parts of the image. The different ways of computing saliency maps are covered exhaustively in [Zhang and Zhu2018]. A few of these have been applied in the context of deep reinforcement learning. [Zahavy, Ben-Zrihem, and Mannor2016] use the Jacobian of the network to compute saliency maps on a Q-value network. Perturbation-based saliency maps using a continuous mask across the image, and also using object-segmentation-based masks, have been studied in the context of deep RL [Greydanus et al.2017, Iyer et al.2018]. In contrast to these approaches, our method is based on a global view of the network. Given a particular action and expected returns, we invert the corresponding key to try to understand the visual aspects being captured by the embedding regardless of the input state. More recently, [Verma et al.2018] introduce a new method that finds interpretable programs that can best explain the policy learned by a neural network; these programs can also be treated as global explanations.

Architecturally, our network is similar to the network first proposed by [Pritzel et al.2017]. The authors describe their motivation as speeding up the learning process using a semi-tabular representation with Q-value calculations similar to the tabular Q-learning case. This is to avoid the inherent slowness of gradient descent and reward propagation. Their model learns to attend over a subset of states that are similar to the current state by tracking all the states recently seen (up to half a million states) using a k-d tree. However, their method does not have any notion of clustering or fixed Q-values. Our proposed method is also similar to the work of [Bellemare, Dabney, and Munos2017] on categorical/distributional DQN. The difference is that in our model the cluster embeddings (keys) for different Q-values are freely accessible (for analysis and visualization) because of the explicit soft-assignment step, whereas it is almost impossible to find such representations with fully-connected layers as in [Bellemare, Dabney, and Munos2017]. Although we do not employ any iterative procedure (like refining keys; we train fully using backpropagation), works on combining deep embeddings with unsupervised clustering methods [Xie, Girshick, and Farhadi2016, Chang et al.2017] (joint optimization/iterative refinement) have started to pick up pace and show better performance compared to traditional clustering methods.

Another important direction that is relevant to our work is the generalization behavior of neural networks in the reinforcement learning setting. [Henderson et al.2017] discuss in detail the general problems of deep RL research and the evaluation metrics used for reporting. [Zhang et al.2018, Zhang, Ballas, and Pineau2018] perform systematic experimental studies on various factors affecting generalization behavior such as diversity in training seeds and randomness in environment rewards. They conclude that deep RL models can easily overfit to random reward structures or when there is insufficient training diversity, and that careful evaluation techniques (such as isolated training and testing seeds) are needed.

Proposed Method

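We follow the usual RL setting and assume the environment can be modelled as a Markov Decision Process (MDP) represented by the 5-tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $T(s_{t+1} \mid s_t, a_t)$ is the state transition probability function, $R(s_t, a_t)$ is the reward function and $\gamma \in [0, 1)$ is the discount factor. A policy $\pi$ maps every state to a distribution over actions. The value function $V^{\pi}(s_t)$ is the expected discounted sum of rewards by following policy $\pi$ from state $s_t$ at time $t$, $V^{\pi}(s_t) = \mathbb{E}\left[\sum_{i \geq 0} \gamma^{i} r_{t+i}\right]$. Similarly, the Q-value (action-value) $Q^{\pi}(s_t, a_t)$ is the expected return starting from state $s_t$, taking action $a_t$ and then following $\pi$. The Q-value function can be recursively estimated using the Bellman equation

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[ r_t + \gamma \, \mathbb{E}_{a \sim \pi} \, Q^{\pi}(s_{t+1}, a) \right]$$

and $\pi^{*}$ is the optimal policy which achieves the highest $Q^{\pi}(s_t, a_t)$ over all policies $\pi$.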
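
As a small worked illustration of the quantities defined above, the following sketch computes a discounted return and a one-step Bellman target for a greedy policy. The function names, the toy rewards and the discount value are illustrative assumptions and are not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of rewards: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    # Accumulate from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def bellman_target(reward, next_q_values, gamma, terminal=False):
    """One-step target r + gamma * max_a Q(s', a), assuming a greedy policy."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


# Toy numbers (hypothetical): three unit rewards, discount 0.5.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
example_target = bellman_target(1.0, [2.0, 4.0], gamma=0.5)
```

Here the return is 1 + 0.5 + 0.25 and the target is 1 + 0.5 * 4, which makes the recursive structure of the Bellman equation easy to check by hand.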

Similar to the traditional DQN architecture [Mnih et al.2015], any state $s_t$ (a concatenated set of input frames) is encoded using a series of convolutional layers, each followed by a non-linearity, and finally a fully-connected layer at the end, which yields an embedding $h(s_t)$. This would usually be followed by some non-linearity and a fully-connected layer that outputs Q-values. Instead, we introduce a restricted key-value store over which the network learns to attend as shown in Figure 1. Intuitively, the model is trying to learn two things. First, it learns a latent representation $h(s_t)$ that captures important visual aspects of the input images. At the same time, the model also attempts to learn an association between embeddings of states $h(s_t)$ and embeddings of keys in the key-value store. This helps in clustering the states (around the keys) based on the scale of expected returns (Q-values) from that state. We can think of the different keys $h^a_i$ (for a given action $a$) weakly as the cluster centers for the corresponding Q-values $v_i$, and of the attention weights $w(s_t)^a_i$ as a soft assignment between the embedding $h(s_t)$ for the current state and the embeddings $h^a_i$ for the different Q-values $v_i$. This explicit association step helps us in understanding the model in terms of attention maps and visualizations of the cluster centers (keys).

The key-value store is restricted in terms of size and values of the store. Each action $a$ has a fixed number of key-value pairs (say $N$) and the value associated with every key is also held constant. The $N$ values $v_1, \dots, v_N$ are sampled uniformly at random from $(V_{\min}, V_{\max})$ once and the same set of values is used for all the actions. All of the keys ($h^a_i$) are also initialized randomly. To compute attention, the embedding $h(s_t)$ for a state $s_t$ is compared to all the keys in the store for a particular action $a$ using a softmax over their dot products,

$$w(s_t)^a_i = \frac{\exp\left(h(s_t) \cdot h^a_i\right)}{\sum_{j=1}^{N} \exp\left(h(s_t) \cdot h^a_j\right)}$$

These attention weights over keys and their corresponding value terms are then used to calculate the Q-values,

$$Q(s_t, a) = \sum_{i=1}^{N} w(s_t)^a_i \, v_i$$
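
The attention step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the Q-value is the attention-weighted sum of the fixed values, and the two-dimensional embedding, the three keys and the values are made-up toy numbers.

```python
import math


def attention_weights(embedding, keys):
    """Softmax over dot products between a state embedding and one action's keys."""
    scores = [sum(e * k for e, k in zip(embedding, key)) for key in keys]
    # Subtract the maximum score for numerical stability; the softmax is unchanged.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [x / total for x in exps]


def q_value(weights, values):
    """Attention-weighted sum of the fixed values (assumed form of the Q-value)."""
    return sum(w * v for w, v in zip(weights, values))


# Toy example (hypothetical numbers): 2-d embedding, 3 keys, 3 fixed values.
embedding = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [25.0, 0.0, -25.0]

w = attention_weights(embedding, keys)
q = q_value(w, values)
```

Because the embedding is closest to the first key, most of the attention mass falls on the value 25 and the resulting Q-value is positive.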

Now that we have Q-values, we can define the different losses that can be used to train the network,

  • Bellman Error ($L_{bellman}$): The usual value function estimation error,

    $$L_{bellman}(\theta) = \left( Q(s_t, a_t) - Y_t \right)^2, \qquad Y_t = r_t + \gamma \max_{a} Q(s_{t+1}, a)$$

    where the target $Y_t$ is treated as constant with respect to $\theta$.

  • Distributive Bellman Error ($L_{distrib}$): We force the distributive constraint on attention weights between current and next states similar to [Bellemare, Dabney, and Munos2017], using the values $v_i$ as supports of the distribution. The distributive loss is defined as the KL divergence between $\Phi \hat{T} w(s_{t+1})^{a^*}$ and $w(s_t)^{a_t}$, where $\hat{T}$ is the distributional Bellman operator, $\Phi$ is the projection operator and $a^*$ is the best action at state $s_{t+1}$, i.e. $a^* = \arg\max_a Q(s_{t+1}, a)$.

    $$L_{distrib}(\theta) = - \sum_{i=1}^{N} \left[ \Phi \hat{T} w(s_{t+1})^{a^*} \right]_i \log w(s_t)^{a_t}_i$$

    This is simply the cross entropy loss (assuming $\Phi \hat{T} w(s_{t+1})^{a^*}$ to be constant with respect to $\theta$, similar to the assumption for $Y_t$ in the Bellman error).

  • Reconstruction Error ($L_{reconstruct}$): We also constrain the embedding $h(s_t)$ for any state to be reconstructible. This is done by transforming $h(s_t)$ using a fully-connected layer followed by a series of non-linearities and deconvolution layers, which gives a reconstructed image $g(s_t)$. The mean squared error between the reconstructed image and the original image is used,

    $$L_{reconstruct}(\theta) = \left\| g(s_t) - s_t \right\|_2^2$$

  • Diversity Error ($L_{diversity}$): The diversity error forces attention over different keys in a batch. This is important because training can collapse early, with the network learning to focus on very few specific keys (because both the keys and the attention weights are being learned together). We could use the KL divergence between the attention weights, but [Lin et al.2017] develop an elegant solution to this in their work,

    $$L_{diversity}(\theta) = \left\| A A^{\top} - I \right\|_F^2$$

    where $A$ is a 2D matrix of size (batch size, $N$) and each row of $A$ is the attention weight vector $w(s_t)^{a}$. It drives $A A^{\top}$ to be diagonal (no overlap between keys attended to within a batch) and the $\ell_2$ norm of $w$ to be 1. Because of the softmax, the $\ell_1$ norm is also 1, so ideally the attention must peak at exactly one key; in practice it spreads over as few keys as possible.

Finally, the model is trained to minimize a weighted linear combination of all four losses,

$$L_{final}(\theta) = \lambda_1 L_{bellman}(\theta) + \lambda_2 L_{distrib}(\theta) + \lambda_3 L_{reconstruct}(\theta) + \lambda_4 L_{diversity}(\theta)$$
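
To make the diversity term and the final weighted combination more concrete, here is a minimal sketch. It assumes the penalty takes the squared Frobenius-norm form of [Lin et al.2017] applied to a (batch, keys) attention matrix; the helper names, the toy attention matrices and the loss weights are illustrative and are not values from the paper.

```python
def diversity_penalty(attention):
    """Squared Frobenius norm of (A A^T - I) for a (batch, keys) attention matrix A."""
    batch = len(attention)
    total = 0.0
    for i in range(batch):
        for j in range(batch):
            # Entry (i, j) of A A^T is the dot product of rows i and j.
            gram = sum(x * y for x, y in zip(attention[i], attention[j]))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return total


def combined_loss(losses, weights):
    """Weighted linear combination of the individual loss values."""
    return sum(weight * loss for weight, loss in zip(weights, losses))


# Two samples attending to different keys: no penalty.
peaked = diversity_penalty([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# Two samples collapsing onto the same key: off-diagonal overlap is penalized.
collapsed = diversity_penalty([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
# Uniform attention: penalized both for overlap and for not being peaked.
uniform = diversity_penalty([[1 / 3] * 3, [1 / 3] * 3])

# Hypothetical loss values and weights (bellman, distrib, reconstruct, diversity).
final = combined_loss([1.0, 2.0, 3.0, 4.0], [1.0, 0.5, 0.0, 0.25])
```

The three toy cases show why the penalty discourages early collapse: only attention that is both peaked and spread across different keys within the batch reaches zero.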

| Environment | DDQN (10M frames) | i-DQN (10M frames) | DDQN (reported, final) | Distrib. DQN (reported, final) | Q-ensemble (reported, final) |
|---|---|---|---|---|---|
| Alien | 1,533.45 | 2,380.72 | 3,747.7 | 4,055.8 | 2,817.6 |
| Freeway | 22.5 | 28.79 | 33.3 | 33.6 | 33.96 |
| Frostbite | 754.48 | 3,968.45 | 1,683.3 | 3,938.2 | 1,903.0 |
| MsPacman | 2,251.43 | 6,132.21 | 2,711.4 | 3,769.2 | 3,425.4 |
| Qbert | 10,226.93 | 19,137.6 | 15,088.5 | 16,956.0 | 14,198.25 |
| Space Invaders | 563.2 | 979.45 | 2,525.5 | 6,869.1 | 2,626.55 |

Table 1: Training scores (averaged over 100 episodes, 3 seeds). DDQN and Distributional DQN agents are trained for 200M frames [Hessel et al.2017], UCB with Q-ensembles for 40M frames [Chen et al.2018]
Figure 2: Visualizing keys and state embeddings using t-SNE. i-DQN, Q-value 25: (a) SpaceInvaders, (b) Qbert, (c) MsPacman. Double DQN: (d) MsPacman.

Experiments and Discussions

We report the performance of our model on six Atari environments [Brockman et al.2016]- MsPacman, Frostbite, Qbert, Freeway, Alien and SpaceInvaders, in Table 1. The training setup and hyper-parameters are listed in the supplemental section. Since our focus is on interpretability, we do not carry out an exhaustive performance comparison. We simply show that training rewards achieved by i-DQN model are comparable to some of the state-of-the-art models.

Directed exploration

We use the uncertainty in attention weights to drive exploration during training. A term $U(s_t, a)$, computed from the attention weights, is an approximate upper confidence on the Q-values. Similar to [Chen et al.2018], we select the action maximizing a UCB-style confidence interval,

$$a_t = \arg\max_{a} \left[ Q(s_t, a) + \lambda_{exp} \, U(s_t, a) \right]$$

where $\lambda_{exp}$ is the exploration factor. Table 1 compares i-DQN's performance (with directed exploration) against a baseline Double DQN implementation (which uses epsilon-greedy exploration) at 10M frames, the final scores of Double DQN (DDQN) and Distributional DQN (Distrib. DQN) as reported in [Hessel et al.2017] (200M frames), and directed exploration using Q-ensembles as reported by [Chen et al.2018] (40M frames). We see that on some of the games, our model reaches higher training rewards within 10M frames than the DDQN and Distrib. DQN models. Compared to our Double DQN implementation, the training process is roughly 2x slower because of the multiple loss functions.
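
A UCB-style selection rule of this kind can be sketched as follows. The form of the uncertainty term is an assumption made for illustration: the sketch uses the standard deviation of the fixed values under the attention distribution, which may differ from the exact quantity used in the paper. The function names, the values and the attention weights are hypothetical.

```python
import math


def q_and_uncertainty(weights, values):
    """Mean and standard deviation of the values under the attention distribution.

    Using the standard deviation as the confidence term is an assumption.
    """
    mean = sum(w * v for w, v in zip(weights, values))
    second_moment = sum(w * v * v for w, v in zip(weights, values))
    # Clamp at zero to guard against small negative values from rounding.
    return mean, math.sqrt(max(second_moment - mean * mean, 0.0))


def select_action(weights_per_action, values, exploration_factor):
    """Pick the action maximizing Q + exploration_factor * uncertainty."""
    best_action, best_score = None, float("-inf")
    for action, weights in enumerate(weights_per_action):
        q, u = q_and_uncertainty(weights, values)
        score = q + exploration_factor * u
        if score > best_score:
            best_action, best_score = action, score
    return best_action


values = [-25.0, 0.0, 25.0]
weights_per_action = [
    [0.0, 1.0, 0.0],  # action 0: confident, Q = 0
    [0.5, 0.0, 0.5],  # action 1: very uncertain, Q = 0
    [0.0, 0.9, 0.1],  # action 2: fairly confident, Q = 2.5
]

greedy_action = select_action(weights_per_action, values, exploration_factor=0.0)
exploring_action = select_action(weights_per_action, values, exploration_factor=1.0)
```

With the exploration factor set to zero the rule is purely greedy and picks action 2; with a factor of one the spread-out attention of action 1 earns a large bonus and it is chosen instead.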

What do the keys represent?

The keys are latent embeddings (randomly initialized) that behave like cluster centers for the particular action-return pairs (the latent space being $\mathbb{R}^{d}$, with $d$ the embedding size). Instead of training using unsupervised methods like K-means or mixture models, we use the neural network itself to find these points using gradient descent. For example, the key for action right and Q-value 25 (Figure 2(c)) is a cluster center that represents the latent embeddings for all states where the agent expects a return of 25 by selecting action right. These keys partition the latent space into well-formed clusters as shown in Figure 2, suggesting that embeddings also contain action-specific information crucial for an RL agent. On the other hand, Figure 2(d) shows embeddings for DDQN which are not easily separable (similar to the visualizations in [Mnih et al.2015]). Since we use a simple dot-product based distance for attention, keys and state embeddings must lie in a similar space, and this can be seen in the t-SNE visualization, i.e. keys (square boxes) lie within the state embeddings (Figure 2). The fact that the keys lie close to their state embeddings is essential to interpretability because state embeddings satisfy reconstructability constraints.

Figure 3: MsPacman, Inverting keys for Q-value 25: (a) Down, (b) Downleft, (c) Upleft, (d) Right

Figure 4: SpaceInvaders, Inverting keys for Q-value 25: (a) Right, (b) R-Fire, (c) Left, (d) L-Fire, (e) Fire

Figure 5: MsPacman, examples where agent's decision agrees with the reconstructed image. Top row: input states (a)-(e). Bottom row: selected (action, return) pairs (a) (Downleft, 25), (b) (Right, 25), (c) (Downleft, 25), (d) (Upleft, 25), (e) (Up, 25)

Inversion of keys

Although keys act like cluster centers for action-return pairs, it is difficult to interpret them in the latent space. By inverting keys, we attempt to find important aspects of the input space (images) that influence the agent to choose a particular action-return pair ($(a, v_i)$). These are 'global explanations' because inverting keys is independent of the input. For example, in MsPacman, reconstructing keys for different actions (fixing a return of 25) indicates yellow blobs at many different places for each action (Figure 3). We hypothesize that these correspond to the Pacman object itself and that the model memorizes its different positions to make its decision, i.e. the yellow blobs in Figure 3(d) correspond to different locations of Pacman and for any input state where Pacman is in one of those positions, the agent selects action right expecting a return of 25. Figure 5 shows such examples where the agent's action-return selection agrees with the reconstructed key (red boxes indicate Pacman's location). Similarly, in SpaceInvaders, the agent seems to be looking at specific combinations of shooter and alien ship positions that were observed during training (Figure 4).

The keys have never been observed by the deconvolution network during training and so the reconstructions depend upon its generalizability. Interestingly, reconstructions for action-return pairs that are seen more often tend to be less noisy with fewer artifacts. This can be observed in Figure 3 for Q-value 25, where actions Right, Downleft and Upleft account for nearly 65% of all actions taken by the agent. We also look at the effect of different reconstruction techniques keeping the action-return pair fixed (Figure 6). A variational autoencoder with $\beta$ set to 0 yields sharper looking images, but increasing $\beta$, which is supposed to bring out disentanglement in the embeddings, yields reconstructions with almost no objects. Dense VAE is a slightly deeper network similar to [Oh et al.2015] and seems to reconstruct slightly clearer shapes of ghosts and Pacman.

| Agreement | AE | VAE ($\beta = 0$) | VAE (increased $\beta$) | Dense VAE |
|---|---|---|---|---|
| MsPacman (Color) | 30.76 | 29.78 | 16.8 | 23.73 |
| MsPacman (Gray, | 19.87 | 18.14 | 10.97 | 14.56 |

Table 2: Evaluating visualizations: Agreement scores

Figure 6: Reconstruction: AE, VAE, VAE, Dense VAE

Figure 7: Adversarial examples for MsPacman (a)-(b) and SpaceInvaders (c)-(d): (a) Adversarial example, (b) Trajectory during training, (c) Adversarial example, (d) Trajectory during training

Evaluating the reconstructions

To understand the effectiveness of these visualizations, we design a quantitative metric that measures the agreement between actions taken by the agent and the actions suggested using the reconstructed images. With a fully trained model, we reconstruct images from the keys $h^a_i$ for all actions $a$. In every state $s_t$, we induce another distribution $w'(s_t)^a$ on the Q-values using the cosine similarity in the image space between $s_t$ and the reconstructed key images, similar to $w(s_t)^a$ (which is also a distribution over Q-values but in the latent space). Using $w'(s_t)^a$, we can compute $Q'(s_t, a)$ and $U'(s_t, a)$ as before and select an action $a'_t$. Using $a_t$ and $a'_t$, we define our metric of agreeability as

$$\text{Agreement} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[ a_t = a'_t \right]$$

where $\mathbb{1}$ is the indicator function and $T$ is the number of steps in a rollout. We measure this across multiple rollouts (5) using $a_t$ and average them. In Table 2, we report Agreement as a percentage for different encoder-decoder models. Unfortunately, the best agreement between the actions selected using the distributions in the image space and latent space is around 30% for the unscaled color version of MsPacman. In MsPacman, the agent has 9 different actions and a random strategy would expect to have an Agreement of about 11% ($1/9$). However, if the agreement scores were high (80-90%), that would suggest that the Q-network is indeed learning to memorize configurations of objects seen during training. One explanation for the gap is that reconstructions rely heavily on generalizability to unseen keys.
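
The agreement score itself is a simple matching rate, and a short sketch makes the comparison with the chance level explicit. The rollouts below are made-up action sequences, and the helper name is hypothetical; in the paper the second sequence would come from the image-space distribution described above.

```python
def agreement(agent_actions, image_space_actions):
    """Percentage of steps on which the two action sequences coincide."""
    if len(agent_actions) != len(image_space_actions):
        raise ValueError("both action sequences must have the same length")
    matches = sum(1 for a, b in zip(agent_actions, image_space_actions) if a == b)
    return 100.0 * matches / len(agent_actions)


# Hypothetical rollouts: (actions taken by the agent, actions suggested in image space).
rollouts = [
    ([0, 1, 2, 3], [0, 1, 0, 3]),
    ([4, 4, 5], [4, 5, 5]),
]
scores = [agreement(taken, suggested) for taken, suggested in rollouts]
mean_agreement = sum(scores) / len(scores)

# Chance level for a uniformly random strategy over MsPacman's 9 actions.
chance_level = 100.0 / 9
```

Averaging the per-rollout scores mirrors the reporting in Table 2, and the chance level of roughly 11% gives the baseline against which those percentages should be read.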

Adversarial examples that show memorization

Looking at the visualizations and rollouts of a fully trained agent, we hand-craft a few out-of-sample environment states to examine the agent's generalization behavior. For example, in MsPacman, since visualizations suggest that the agent may be memorizing Pacman's positions (and possibly also ghosts and other objects), we simply add an extra pellet adjacent to a trajectory seen during training (Figure 7(a)). The agent does not clear the additional pellet and simply continues to execute actions performed during training (Figure 7(b)). Most importantly, the agent is extremely confident in taking actions initially (seen in the attention maps in Figure 7(a)), which suggests that the extra pellet was probably not even captured by the embeddings. Similarly, in the case of SpaceInvaders, the agent has a strong bias towards shooting from the leftmost end (seen in Figure 4). This helps in clearing the triangle-like shape and moving to the next level (Figure 7(d)). However, when the triangular positions of the spaceships are inverted, the agent repeats the same strategy of trying to shoot from the left and fails to clear the ships (Figure 7(c)). These examples indicate that the features extracted by the convolutional channels seem to be shallow. The agent does not really model interactions between objects. For example, in MsPacman, after observing 10M frames, it does not know general relationships between Pacman and pellets or ghosts. Even if optimal Q-values were known, there is no incentive for the network to model these higher order dependencies when it can work with situational features extracted from finite training examples. [Zhang et al.2018] also report similar results on simple mazes where an agent trained on insufficient environments tends to repeat training trajectories on unseen mazes.


In this paper, we propose an interpretable deep Q-network model (i-DQN) that can be studied using a variety of tools including the usual saliency maps, attention maps and reconstructions of key embeddings that attempt to provide global explanations of the model’s behavior. We also show that the uncertainty in soft cluster assignment can be used to drive exploration effectively and achieve high training rewards comparable to other models. Although the reconstructions do not explain the agent’s decisions perfectly, they provide a better insight into the kind of features extracted by convolutional layers. This can be used to design interesting adversarial examples with slight modifications to the state of the environment where the agent fails to adapt and instead repeats action sequences that were performed during training. This is the general problem of overfitting in machine learning but is more acute in the case of reinforcement learning because the process of collecting training examples depends largely on the agent’s biases (exploration). There are many interesting directions for future work. For example, we know that the reconstruction method largely affects the visualizations and other methods such as generative adversarial networks (GANs) can model latent spaces more smoothly and could generalize better to unseen embeddings. Another direction is to see if we can automatically detect the biases learned by the agent and design meaningful adversarial examples instead of manually crafting test cases.


This work was funded by AFOSR award FA9550-15-1-0442.


Supplemental Material

In this supplemental material, we include details of the experiment setup, hyperparameters and experiment results in terms of training curves. We present additional visualizations for the other games discussed in the paper and a few more adversarial examples for MsPacman. We also include an ablation study on the effect of the different loss functions.

Experimental Setup

The i-DQN training algorithm is described in Algorithm 1 (including the directed exploration strategy and visualization step). As discussed in the paper, the i-DQN model attempts to minimize a weighted linear combination of multiple losses,

$$L_{final}(\theta) = \lambda_1 L_{bellman}(\theta) + \lambda_2 L_{distrib}(\theta) + \lambda_3 L_{reconstruct}(\theta) + \lambda_4 L_{diversity}(\theta)$$

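// initialization
Sample $N$ independent values $v_1, \dots, v_N$ uniformly at random from $(V_{\min}, V_{\max})$
Initialize network parameters $\theta$:
  Conv/Deconv layers using Xavier initialization
  Linear layers and keys $h^a_i$
for $t = 1$ to $T$ do
    // action selection
    Compute embedding $h(s_t)$ for state $s_t$
    for action $a = 1$ to $|\mathcal{A}|$ do
        Compare $h(s_t)$ to keys $h^a_i$: compute attention weights $w(s_t)^a_i$ (softmax over dot products)
    end for
    Compute $Q(s_t, a)$ and $U(s_t, a)$
    Select $a_t = \arg\max_a [Q(s_t, a) + \lambda_{exp} U(s_t, a)]$ and store transition
    // update network
    Sample random mini-batch
    Compute losses $L_{bellman}$, $L_{distrib}$, $L_{reconstruct}$ and $L_{diversity}$ and perform gradient update step on network parameters $\theta$
end for
// visualization
To visualize: pick action $a$, value index $i$ and invert the key $h^a_i$ through the deconvolution layers
Algorithm 1: Training the i-DQN model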

Table 3 lists all the hyperparameters (including network architecture), values and their description. Most of the training procedure and hyperparameter values are similar to the settings in [Van Hasselt, Guez, and Silver2016] and [Bellemare, Dabney, and Munos2017]. Figure 8 shows the training curves for i-DQN and Double DQN models over three random seeds for six different Atari environments: Alien, Freeway, Frostbite, MsPacman, Qbert and SpaceInvaders. The results reported in the paper are averaged over the best results for individual runs for each seed. Although training i-DQN is slow, the directed exploration strategy helps to quickly reach scores competitive with the state-of-the-art deep Q-learning based models.

| Hyperparameter | Value | Description |
|---|---|---|
| Number of Keys per Action ($N$) | 20 | Total keys in key-value store for actions |
| Range of Values | | The range from which the values of key-value store are sampled |
| Exploration Factor | | Controls the UCB-style action selection using confidence intervals |
| Embedding Size | 256 | Dimensions of latent space where keys and state embeddings lie |
| Network Channels | | Convolution layers (Channels, Filter, Stride) |
| Network Activations | ReLU | |
| Discount Factor | 0.99 | |
| Batch Size | 32 | |
| Optimizer | Adam | learning rate, beta1, beta2, weight decay |
| Gradient Clipping | 10 | Using gradient norm |
| Replay Buffer Size | 10K | |
| Training Frequency | Every 4 steps | |
| Target Network Sync Frequency | 1000 | Fully replaced with weights from source/training network |
| Frame Preprocessing | Standard | Grayscaling, Downsampling (84, 84), Frames stacked: 4, Repeat Actions: 4 |
| Reward Clipping | No | |
| Loss Factors | | The weights on each loss function |

Table 3: List of hyper-parameters and their values
Figure 8: Training curves comparing average rewards (over 100 episodes) for i-DQN and Double DQN (the shaded region indicates variance over 3 random seeds) on (a) Alien, (b) Freeway, (c) Frostbite, (d) MsPacman, (e) Qbert and (f) SpaceInvaders. The blue curves are the i-DQN model using a directed exploration strategy, while the green curves are our implementation of the Double DQN model using epsilon-greedy exploration.

Attention Maps: During training, the i-DQN model learns to attend over keys in the key-value store using a dot-product attention,

$$w(s_t)^a_i = \frac{\exp\left(h(s_t) \cdot h^a_i\right)}{\sum_{j} \exp\left(h(s_t) \cdot h^a_j\right)}$$

where $h(s_t)$ is the embedding of state $s_t$ and $h^a_i$ is the $i$-th key for action $a$. We can visualize these weights in terms of attention heat-maps where each row is a distribution over Q-values (because of the softmax) for a particular action. Darker cells imply higher attention weights; for example, in Figure 9(c), the agent expects a negative return of -25 with very high probability for action 'Left'.

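Saliency Maps: They usually highlight the most sensitive regions around the input that can cause the neural network to change its predictions. Following [Greydanus et al.2017], we define saliency maps as follows,

$$\Phi(s_t, i, j) = s_t \odot \left(1 - M(i, j)\right) + A(s_t) \odot M(i, j)$$

$$S_Q(s_t, a, i, j) = \frac{1}{2} \left\| Q(s_t, a) - Q\left(\Phi(s_t, i, j), a\right) \right\|^2$$

$\Phi(s_t, i, j)$ defines the perturbed image for state $s_t$ at pixel location $(i, j)$, $M(i, j)$ is the saliency mask, usually a Gaussian centered at $(i, j)$ defining the sensitive region, $A(s_t)$ is a Gaussian blur of $s_t$ and $S_Q(s_t, a, i, j)$ denotes the sensitivity of $Q(s_t, a)$ to pixel location $(i, j)$.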
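
A perturbation-based saliency score in the style of [Greydanus et al.2017] can be sketched on a tiny grid. This is an illustration under assumptions rather than the paper's code: the perturbed image blends the original with a blurred copy under a Gaussian mask, and the score is half the squared change in the Q-value. The toy Q-function, the 3x3 image and the constant image standing in for a real Gaussian blur are all hypothetical.

```python
import math


def gaussian_mask(height, width, center, sigma):
    """Gaussian bump with peak value 1 at `center`."""
    ci, cj = center
    return [
        [math.exp(-((i - ci) ** 2 + (j - cj) ** 2) / (2 * sigma ** 2)) for j in range(width)]
        for i in range(height)
    ]


def perturb(image, blurred, mask):
    """Blend the image with its blurred copy: image * (1 - M) + blurred * M."""
    return [
        [image[i][j] * (1 - mask[i][j]) + blurred[i][j] * mask[i][j] for j in range(len(image[0]))]
        for i in range(len(image))
    ]


def saliency(q_fn, image, blurred, center, sigma):
    """Half the squared change in the Q-value when the region around `center` is blurred."""
    mask = gaussian_mask(len(image), len(image[0]), center, sigma)
    return 0.5 * (q_fn(image) - q_fn(perturb(image, blurred, mask))) ** 2


def toy_q(image):
    """Hypothetical Q-function that only looks at the top-left pixel."""
    return 10.0 * image[0][0]


image = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Stand-in for a Gaussian blur: a constant image with the same mean intensity.
blurred = [[1.0 / 9.0] * 3 for _ in range(3)]

near = saliency(toy_q, image, blurred, center=(0, 0), sigma=1.0)
far = saliency(toy_q, image, blurred, center=(2, 2), sigma=1.0)
```

Blurring around the pixel the toy Q-function depends on changes its output far more than blurring the opposite corner, which is the behavior a saliency map is meant to expose.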

Analysing Loss Functions

Figure 9(a) shows the effect of different loss functions. We keep the diversity error constant and incrementally add each of the loss functions. Figure 9(b) and Figure 9(c) show the effect of the diversity loss. Without the diversity loss, the network can learn to attend over a small set of the keys for each action. The diversity loss, however, encourages attention across elements in a batch to be diverse. Ideally, the attention weights should concentrate on a single key (because of the $\ell_1$- and $\ell_2$-norm constraints), but in practice they concentrate over very few keys.

Figure 9: Understanding the effect of different losses (PongNoFrameskip-v4 environment). (a) Ablation study with training rewards, keeping the diversity error constant (red curve is diversity + bellman losses, green curve is diversity + bellman + distributional losses and finally the blue curve is diversity + bellman + distributional + reconstruction losses). (b) Attention Maps: Without diversity error. (c) Attention Maps: With diversity error.
Figure 10: MsPacman, Inverting keys for different actions with Q-value 25: (a) Down, (b) Downleft, (c) Downright, (d) Left, (e) Up, (f) Upleft, (g) Upright, (h) Right

Figure 11: Inverting keys for action Right, Q-values

Figure 12: Inverting Keys for Pong, Qbert. (a) Pong, Inverting keys for different actions with Q-value: 25 (actions Right, Right-Fire, Left, Left-Fire). (b) Qbert, Inverting keys for different actions with Q-value: 25 (actions Right, Up, Left, Down).

Inversion of Keys

Figures 3 and 10 show the reconstructed images by inverting keys in MsPacman (some of these have been discussed in the paper). Figures 12(a) and 12(b) show the reconstructed images by inverting keys in Pong and Qbert. Most of these images are noisy, especially the ones in gray-scale, but seem to indicate configurations of different objects in the game. Figure 16 shows the reconstructions by interpolating in the hidden space starting from a key embedding and moving towards a specific training state. We see that most of the hidden space can be reconstructed, but the information about shapes and colors of the objects is not always clear while reconstructing the key embeddings.

Adversarial Examples for MsPacman

In Figure 13, we show some more examples for MsPacman where the agent does not learn to clear a pellet or goes around repeating action sequences performed during training. Note that the OpenAI Gym environment does not provide access to the state space, and the required out-of-sample state needs to be generated via human gameplay. Figures 14 and 15 present an interesting adversarial example created by taking the pixel-wise maximum over real training states. Even though this state is not a valid state, we see in the attention maps that all four of the selected actions are strongly highlighted. In fact, by computing the saliency maps with respect to each of these actions, we can almost get back the same regions that are highlighted in Figure 14. This suggests that the key embeddings seem to independently capture specific regions of the board where Pacman is present without modeling any high level dependencies among objects.

Figure 13: Adversarial Examples: (a) MsPacman, Adversarial example, (b) MsPacman, Trajectory during training, (c) MsPacman, Adversarial example, (d) MsPacman, Adversarial example for Double DQN model

Figure 14: MsPacman, Samples from training trajectory (with attention and saliency maps). Actions selected: (a) Downleft, (b) Upleft, (c) Upright, (d) Up

Figure 15: MsPacman, Adversarial example generated by a pixelwise-max operation over the samples in Figure 14. Such a state, although unseen during training, is probably meaningless. But the saliency maps can still recover the same regions almost perfectly (with extra noise) by taking gradients with respect to the particular actions.

Figure 16: MsPacman, Interpolating in the hidden space; start embeddings are keys for Downleft, Upleft, Upright, Up (First Column); final embeddings are the same states as those in Figure 14 with channel-wise max over 4 frames instead of a single frame (Last Column). We use a linear interpolation scheme as follows, $h_\alpha = (1 - \alpha) \, h^a_i + \alpha \, h(s_t)$ with $\alpha \in [0, 1]$, and invert each $h_\alpha$.
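
The interpolation between a key and a state embedding can be sketched as below. The sketch assumes a plain linear scheme with evenly spaced steps; the function name, the two-dimensional vectors and the number of steps are illustrative choices, and in the paper each interpolated embedding would then be passed through the deconvolution layers.

```python
def interpolate(key, state_embedding, steps):
    """Evenly spaced points on the line from `key` to `state_embedding`, endpoints included."""
    if steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    path = []
    for k in range(steps):
        alpha = k / (steps - 1)  # 0 at the key, 1 at the state embedding
        path.append([(1 - alpha) * a + alpha * b for a, b in zip(key, state_embedding)])
    return path


# Toy 2-d embeddings (hypothetical); real embeddings have the size listed in Table 3.
path = interpolate(key=[0.0, 2.0], state_embedding=[4.0, 0.0], steps=5)
```

The first and last points reproduce the key and the state embedding exactly, so the reconstructions along the path show how the decoded image changes between the two.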