Explain Your Move: Understanding Agent Actions Using Focused Feature Saliency

Piyush Gupta, et al. · University of California, Irvine · Adobe · IIT Kharagpur · December 23, 2019

As deep reinforcement learning (RL) is applied to more tasks, there is a need to visualize and understand the behavior of learned agents. Saliency maps explain agent behavior by highlighting the features of the input state that are most relevant for the agent in taking an action. Existing perturbation-based approaches to compute saliency often highlight regions of the input that are not relevant to the action taken by the agent. Our approach generates more focused saliency maps by balancing two aspects (specificity and relevance) that capture different desiderata of saliency. The first captures the impact of perturbation on the relative expected reward of the action to be explained. The second downweights irrelevant features that alter the relative expected rewards of actions other than the action to be explained. We compare our approach with existing approaches on agents trained to play board games (Chess and Go) and Atari games (Breakout, Pong and Space Invaders). We show through illustrative examples (Chess, Atari, Go), human studies (Chess), and automated evaluation methods (Chess) that our approach generates saliency maps that are more interpretable for humans than existing approaches.


1 Introduction

Deep learning has achieved success in various domains such as image classification (He et al., 2016; Krizhevsky et al., 2012), machine translation (Mikolov et al., 2010), image captioning (Karpathy et al., 2015), and deep Reinforcement Learning (RL) (Mnih et al., 2015; Silver et al., 2017). To explain and interpret the predictions made by these complex, "black-box"-like systems, various gradient- and perturbation-based techniques have been introduced for image classification (Simonyan et al., 2013; Zeiler and Fergus, 2014; Fong and Vedaldi, 2017) and deep sequential models (Karpathy et al., 2015). However, interpretability for RL-based agents has received significantly less attention. Interpreting the strategies learned by RL agents can help users better understand the problem that the agent is trained to solve. For instance, interpreting the actions of a chess-playing agent in a position could provide useful information about aspects of the position. Interpretation of RL agents is also an important step before deploying such models to solve real-world problems.

Inspired by the popularity of saliency maps for interpreting models in computer vision, a number of existing approaches have proposed similar methods for reinforcement learning-based agents. Greydanus et al. (2018) derive saliency maps that explain RL agent behavior by applying a Gaussian blur to different parts of the input image, and generate saliency maps using differences in the value function and policy vector between the original and perturbed states. They achieve promising results on agents trained to play Atari games.

Iyer et al. (2018) compute saliency maps using the difference in the action-value $Q(s, a)$ between the original and perturbed state.

There are two primary limitations to these approaches. The first is that they highlight features whose perturbation affects actions apart from the one we are explaining. This is illustrated in Figure 1, which shows a chess position (it is white's turn). Stockfish (https://stockfishchess.org/) plays the move Bb6 in this position, which traps the black rook (a5) and queen (c7). (We follow the coordinate convention in which columns are labeled 'a'-'h' from left to right, rows are labeled '8'-'1' from top to bottom, and pieces are denoted by the first letter of their name in upper case, e.g. 'B' denotes the bishop. A move consists of the piece and the square it moves to, e.g. 'Bb6' indicates that the bishop moves to b6.) The knight on a4 protects the white bishop once it reaches b6, and hence the move works. In this position, if we consider the saliency of the white queen (square d1), it is apparent that the queen is not involved in the tactic and hence its saliency should be low. However, perturbing the state (by removing the queen) leads to a state with substantially different values of $V(s)$ and $Q(s, a)$. Therefore, existing approaches (Greydanus et al., 2018; Iyer et al., 2018) mark the queen as salient. The second limitation is that they highlight features that are not relevant to the action to be explained. In Figure 1(c), perturbing the state by removing the black pawn on c6 alters the expected rewards of actions other than the one to be explained; it therefore alters the policy vector and is marked salient. However, the pawn is not relevant to explaining the move played in the position (Bb6).

In this work, we propose a perturbation-based approach for generating saliency maps for black-box agents that builds on two desired properties of action-focused saliency. The first, specificity, captures the impact of the perturbation only on the Q-value of the action to be explained. In the above example, this term downweights features, such as the white queen, that impact the expected rewards of all actions equally. The second, relevance, downweights irrelevant features that alter the expected rewards of actions other than the action to be explained. It removes features such as the black pawn on c6 that increase the expected reward of other actions (in this case, Bb4). By combining these aspects, we generate a saliency map that highlights features of the input state that are relevant for the action to be explained. Figure 1 illustrates how the saliency map generated by our approach highlights only the pieces relevant to the move, unlike existing approaches.

We use our approach to explain the actions taken by agents for board games (Chess and Go) and for Atari games (Breakout, Pong and Space Invaders). Using a number of illustrative examples, we show that our proposed approach produces more focused and accurate interpretations in all of these setups when compared to Greydanus et al. (2018) and Iyer et al. (2018). We also demonstrate that our approach is more effective at identifying important pieces in chess puzzles and, further, at aiding skilled chess players in solving chess puzzles: it improves the accuracy of solving them and reduces the time taken, compared to existing approaches.

(a) Original Position
(b) Iyer et al. (2018)
(c) Greydanus et al. (2018)
(d) Our Approach
Figure 1: Saliency maps generated by different approaches for a sample chess position

2 Approach

We are given a black-box agent $A$ operating on a state space $\mathcal{S}$, with $A_s$ the set of legal actions in state $s \in \mathcal{S}$, and a $Q$-value function $Q(s, a)$ defined for $s \in \mathcal{S}$, $a \in A_s$. Following a greedy policy, let the action selected by the agent at state $s$ be $\hat{a}$, i.e. $\hat{a} = \arg\max_{a \in A_s} Q(s, a)$. The states are parameterized in terms of state-features $\{f_1, \ldots, f_n\}$. For instance, in a board game such as chess, the features are the 64 squares; for Atari games, the features are the pixels of the input frame. We are interested in identifying which features of the state $s$ are important for the agent in taking action $\hat{a}$. We assume that the agent is in the exploitation phase and therefore plays the action with the highest expected reward. This feature importance is described by an importance score or saliency for each feature, denoted $S[i]$, where $S[i]$ is the saliency of the $i$th feature of $s$ for the agent taking action $\hat{a}$. A higher value indicates that the $i$th feature of $s$ is more important for the agent when taking action $\hat{a}$.

Perturbation-based Saliency Maps

The general outline of perturbation-based saliency approaches is as follows. For each feature $f_i$, first perturb $s$ to get $s'$; for instance, in chess we can perturb the board position by removing the piece on the $i$th square, while in Atari, Greydanus et al. (2018) perturb the input image by adding a Gaussian blur centered on the $i$th pixel. Second, query the agent to obtain $Q(s', a)$. We take the intersection of $A_s$ and $A_{s'}$ to handle the case where some actions may be legal in $s$ but not in $s'$ and vice versa; for instance, when we remove a piece in chess, actions that were legal earlier may no longer be legal. In the rest of this section, “all actions” means all actions that are legal in both $s$ and $s'$. Finally, compute $S[i]$ based on how different $Q(s', \cdot)$ and $Q(s, \cdot)$ are, i.e. intuitively, $S[i]$ should be higher if $Q(s', \cdot)$ is significantly different from $Q(s, \cdot)$. Greydanus et al. (2018) compute the saliency map using the differences in the value function and the policy vector between the original and perturbed states, while Iyer et al. (2018) use $Q(s, \hat{a}) - Q(s', \hat{a})$. In this work, we propose an alternative way to compute $S[i]$.
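
The sketch below illustrates this outline. It is a minimal, hypothetical Python skeleton: the helpers `agent_q`, `perturb`, `legal_actions`, and `score` are our own stand-ins for the black-box agent's Q-function, the domain-specific perturbation (e.g. removing the piece on square $i$), the legal-action set, and a scoring rule such as Equation 4 below.

```python
# Minimal sketch of the perturbation-saliency outline above (helper names are assumptions).

def saliency_map(state, features, agent_q, perturb, legal_actions, score):
    """Return a dict mapping each feature index i to its saliency S[i]."""
    a_hat = max(legal_actions(state), key=lambda a: agent_q(state, a))  # greedy action
    saliency = {}
    for i in features:
        s_prime = perturb(state, i)                       # e.g. remove the piece on square i
        # Only compare actions that are legal in both the original and perturbed state.
        common = set(legal_actions(state)) & set(legal_actions(s_prime))
        q_orig = {a: agent_q(state, a) for a in common}
        q_pert = {a: agent_q(s_prime, a) for a in common}
        saliency[i] = score(q_orig, q_pert, a_hat)        # e.g. Equation 4 below
    return saliency
```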

Properties

We define two desired properties of an accurate saliency map for policy-based agents:

  1. Specificity: Saliency should focus on the effect of the perturbation specifically on the action being explained, $\hat{a}$, i.e. it should be high if perturbing the $i$th feature of the state reduces the relative expected reward of the selected action. Stated another way, $S[i]$ should be high if the drop in $Q$ for $\hat{a}$, $\Delta Q(\hat{a}) = Q(s, \hat{a}) - Q(s', \hat{a})$, is substantially higher than $\Delta Q(a)$ for the other actions. For instance, in Figure 1, removing pieces such as the white queen impacts all actions uniformly ($\Delta Q(a)$ is roughly equal for all actions $a$); such pieces should not be salient for explaining $\hat{a}$. On the other hand, removing pieces such as the white knight on a4 specifically impacts the move (Bb6) we are trying to explain ($\Delta Q(\hat{a}) \gg \Delta Q(a)$ for the other actions $a$); such pieces should be salient for $\hat{a}$.

  2. Relevance: Since the $Q$-values represent expected returns, two states $s$ and $s'$ can have substantially different $Q$-values for all actions, i.e. $Q(s, a)$ may be higher than $Q(s', a)$ for every action $a$ if $s$ is simply a better state. A saliency map for a specific action in $s$ should ignore such differences, i.e. a feature should contribute to the saliency only if its effects are relevant to $\hat{a}$. In other words, $S[i]$ should be low if perturbing the $i$th feature of the state alters the expected rewards of actions other than $\hat{a}$. For instance, in Figure 1, removing the black pawn on c6 increases the expected reward of other actions (in this case, Bb4). However, it does not affect the expected reward of the action to be explained (Bb6). Therefore, the pawn is not salient for explaining the move. In general, such features that are irrelevant to $\hat{a}$ should not be salient.

Existing approaches to saliency maps do not capture these properties in how they compute the saliency. Both saliency measures used by Greydanus et al. (2018), i.e. the change in the value function and the change in the policy vector, do not focus on action-specific effects, since they aggregate the change over all actions. Although the saliency computation of Iyer et al. (2018), $Q(s, \hat{a}) - Q(s', \hat{a})$, is somewhat more specific to the action, it ignores whether the effect is relevant only to $\hat{a}$ or affects the other actions as well. This is illustrated in Figure 1.

Identifying Specific Changes

To focus on the effect of the change on the action, we are interested in whether the relative return of $\hat{a}$ changes with the perturbation. Using $Q(s, \hat{a}) - Q(s', \hat{a})$ directly, as in Iyer et al. (2018), does not capture the relative changes of $Q$ for the other actions. To support specificity, we use the softmax over $Q$-values to normalize them (as is also used in softmax action selection):

$$P(s, a) = \frac{\exp\big(Q(s, a)\big)}{\sum_{a'} \exp\big(Q(s, a')\big)} \qquad (1)$$

and compute $\Delta p = P(s, \hat{a}) - P(s', \hat{a})$, the difference in the relative expected reward of the action to be explained between the original and the perturbed state. A high value of $\Delta p$ thus implies that the feature is important for the specific choice of action by the agent, while a low value indicates that the effect is not specific to that action.

Identifying Relevant Changes

Apart from focusing on the change in $P(s, \hat{a})$, we also want to ensure that the perturbation has minimal effect on the relative expected returns of the other actions. To capture this intuition, we compute the relative returns of all the other actions and compute saliency in proportion to their similarity. Specifically, we normalize the $Q$-values using a softmax over all actions apart from the selected action $\hat{a}$:

$$P_{\text{rem}}(s, a) = \frac{\exp\big(Q(s, a)\big)}{\sum_{a' \neq \hat{a}} \exp\big(Q(s, a')\big)}, \quad a \neq \hat{a} \qquad (2)$$

We use the KL divergence $D_{KL} = D_{KL}\big(P_{\text{rem}}(s, \cdot) \,\|\, P_{\text{rem}}(s', \cdot)\big)$ to measure the difference between $P_{\text{rem}}(s, \cdot)$ and $P_{\text{rem}}(s', \cdot)$. A high $D_{KL}$ indicates that the relative expected reward of taking some action other than the original action changes significantly between $s$ and $s'$; in other words, the effect of the feature is spread over other actions, i.e. the feature may not be relevant for the selected action $\hat{a}$.

Computing the Saliency

To compute the saliency $S[i]$, we need to combine $\Delta p$ and $D_{KL}$. If $D_{KL}$ is high, $S[i]$ should be low regardless of whether $\Delta p$ is high, since the perturbation affects many other actions; conversely, when $D_{KL}$ is low, $S[i]$ should depend on $\Delta p$. To compare these properties on a similar scale, we define a normalized measure of distribution similarity using $D_{KL}$:

$$K = \frac{1}{1 + D_{KL}} \qquad (3)$$

As $D_{KL}$ goes from $0$ to $\infty$, $K$ goes from $1$ to $0$. Thus, $S[i]$ should be low if either $\Delta p$ is low or $K$ is low. The harmonic mean provides this desired effect in a robust, smooth manner, and therefore we define $S[i]$ to be the harmonic mean of $\Delta p$ and $K$:

$$S[i] = \frac{2 K \Delta p}{K + \Delta p} \qquad (4)$$

Equation 4 captures our desired properties of saliency maps. If perturbing the $i$th feature affects the expected rewards of all actions uniformly, then $\Delta p$ is low and consequently $S[i]$ is low; this captures the property of specificity defined above. If perturbing the $i$th feature of the state affects the rewards of actions other than the action to be explained, then $D_{KL}$ is high, $K$ is low, and $S[i]$ is low; this captures the property of relevance defined above.
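
The following is a minimal Python sketch of this computation for a single perturbed state, combining Equations 1-4. It is our own illustration rather than the released implementation; the function names are ours, and clamping the score to zero when $\Delta p \le 0$ is an assumption made to keep the harmonic mean well defined.

```python
import math

def softmax(q):
    """Normalize a dict of Q-values into a probability distribution."""
    if not q:
        return {}
    m = max(q.values())                               # shift for numerical stability
    exp = {a: math.exp(v - m) for a, v in q.items()}
    z = sum(exp.values())
    return {a: e / z for a, e in exp.items()}

def saliency_score(q_orig, q_pert, a_hat, eps=1e-12):
    """Harmonic mean of specificity (delta_p) and relevance (K), Eqs. 1-4.

    q_orig / q_pert map each action legal in both states to its Q-value in the
    original / perturbed state; a_hat is the action being explained.
    """
    # Specificity (Eq. 1): drop in the relative expected reward of a_hat.
    delta_p = softmax(q_orig)[a_hat] - softmax(q_pert)[a_hat]

    # Relevance (Eqs. 2-3): the distribution over the remaining actions should be stable.
    rem_orig = softmax({a: v for a, v in q_orig.items() if a != a_hat})
    rem_pert = softmax({a: v for a, v in q_pert.items() if a != a_hat})
    d_kl = sum(p * math.log((p + eps) / (rem_pert[a] + eps)) for a, p in rem_orig.items())
    k = 1.0 / (1.0 + d_kl)

    # Eq. 4: harmonic mean of the two terms; zero when the perturbation does not hurt a_hat
    # (an assumption for this sketch).
    if delta_p <= 0:
        return 0.0
    return 2.0 * k * delta_p / (k + delta_p)
```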

3 Results

To show that our approach produces more meaningful saliency maps than existing approaches, we use sample positions from Chess, Atari (Breakout, Pong and Space Invaders) and Go (Section 3.1). To show that our approach generates saliency maps that provide useful information to humans, we conduct human studies on problem-solving for chess puzzles (Section 3.2). To automatically compare the saliency maps generated by different perturbation-based approaches, we introduce a Chess saliency dataset (Section 3.3). We use the dataset to show how our approach is better than existing approaches in identifying chess pieces that humans deem relevant in several positions. In Section 3.4, we show how our approach can be used to understand common tactical ideas in chess by interpreting the action of a trained agent.

To show that our approach works for black-box agents, regardless of whether they are trained using reinforcement learning, we use a variety of agents. We only assume access to the agent's $Q(s, a)$ function for all experiments. For experiments on chess, we use the Stockfish agent (https://stockfishchess.org/). For experiments on Go, we use the pre-trained MiniGo RL agent (https://github.com/tensorflow/minigo). For experiments on Atari agents and for generating saliency maps for Greydanus et al. (2018), we use their code and pre-trained RL agents (https://github.com/greydanus/visualize_atari). For generating saliency maps using Iyer et al. (2018), we use our own implementation. All of our code and more detailed results are available in our Github repository (https://github.com/rl-interpretation/understandingRL).

3.1 Illustrative Examples

In this section, we provide examples of generated saliency maps to highlight the qualitative differences between our approach that is action-focused and existing approaches that are not.

Chess

Figure 1 shows sample positions where our approach produces more meaningful saliency maps than existing approaches for a chess-playing agent (Stockfish). Greydanus et al. (2018) and Iyer et al. (2018) generate saliency maps that highlight pieces that are not relevant to the move played by the agent. This is because they use differences in the action-value $Q(s, \hat{a})$, the value function, or the norm of the policy vector between the original and perturbed states to calculate the saliency maps. Therefore, pieces such as the white queen that affect the value estimate of the state are marked salient. In contrast, the saliency map generated by our approach highlights only the pieces relevant to the move.

(a) Our Approach
(b) Greydanus et al. (2018)
(c) Our Approach
(d) Greydanus et al. (2018)
Figure 2: Comparing saliency of RL agents trained to play Breakout

Atari

To show that our approach generates saliency maps that are more focused than those generated by Greydanus et al. (2018), we compare the approaches on three Atari games: Breakout, Pong, and Space Invaders. Figures 2, 3, and 4 show the results. Our approach highlights regions of the input image more precisely, while the approach of Greydanus et al. (2018) highlights several regions of the input image that are not relevant to explaining the action taken by the agent.

(a) Our Approach
(b) Greydanus et al. (2018)
(c) Our Approach
(d) Greydanus et al. (2018)
Figure 3: Comparing saliency of RL agents trained to play Atari Pong

Go

Figure 5 shows a board position in Go. It is black’s turn. The four white stones threaten the three black stones that are in one row at the top left corner of the board. To save those three black stones, black looks at the three white stones that are directly below the three black ones. Due to another white stone below the three white stones, the continuous row of three white stones cannot be captured easily. Therefore black moves to place a black stone below that single white stone in an attempt to start capturing the four white stones. It takes the next few turns to surround the structure of four white stones with black ones, thereby saving its pieces. The method described in Greydanus et al. (2018) generates a saliency map that highlights almost all the pieces on the board. Therefore, it reveals little about the pieces that the agent thinks are important. On the other hand, the map produced by Iyer et al. (2018) highlights only a few pieces. The saliency map generated by our approach correctly highlights the structure of four white stones and the black stones already present around them that may be involved in capturing them.

(a) Our Approach
(b) Greydanus et al. (2018)
(c) Our Approach
(d) Greydanus et al. (2018)
Figure 4: Comparing saliency of RL agents trained to play Space Invaders
(a) Original Position
(b) Our Approach
(c) Iyer et al. (2018)
(d) Greydanus et al. (2018)
Figure 5: Comparing saliency maps generated by different approaches for the MiniGo agent

3.2 Human Studies: Chess

To show that our approach generates saliency maps that provide useful information to humans, we conduct human studies on problem-solving for chess puzzles. We show fifteen chess players (ELO 1600-2000) ten chess puzzles from https://www.chess.com (average difficulty ELO 1800). For each puzzle, we show either the puzzle without a saliency map, or the puzzle with a saliency map generated by our approach, Greydanus et al. (2018), or Iyer et al. (2018). The player is then asked to solve the puzzle. We measure the accuracy (number of puzzles correctly solved) and the average time taken to solve the puzzle, shown in Table 1. The saliency maps generated by our approach are more helpful for humans when solving puzzles than those generated by other approaches. We observed that the saliency maps generated by Greydanus et al. (2018) often confuse humans, because they highlight several pieces unrelated to the tactic. The maps generated by Iyer et al. (2018) highlight few pieces and therefore are marginally better than showing no saliency maps for solving puzzles.

No Saliency Our Approach Greydanus et al. Iyer et al.
Accuracy 56.67% 72.41% 37.48% 59.51%
Average time taken 77.53 sec 52.95 sec 60.23 sec 59.75 sec
Table 1: Results of Human Studies for solving chess puzzles

3.3 Chess Saliency Dataset

To automatically compare the saliency maps generated by different perturbation-based approaches, we introduce a Chess saliency dataset. The dataset consists of 100 chess puzzles, each with a single correct move. For each puzzle, we ask three human experts (ELO 2200) to mark the pieces that are important for playing the correct move, and take a majority vote of the three experts to obtain a list of pieces that are important for the move played in the position. The complete dataset is available in our Github repository (https://github.com/rl-interpretation/understandingRL). We use this dataset to compare our approach to existing approaches (Greydanus et al., 2018; Iyer et al., 2018). Each approach generates a list of squares and a score that indicates how salient the piece on that square is for a particular move. We scale the scores between 0 and 1 to generate ROC curves. Figure 6(a) shows the results. Our approach generates saliency maps that are better than existing approaches at identifying the chess pieces that humans deem relevant in these positions.
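
As an illustration, the following sketch computes the ROC curve and AUC for this evaluation with scikit-learn. The `puzzles` list is a hypothetical container of (saliency, important_squares) pairs, where `saliency` maps each square to a score in [0, 1] and `important_squares` is the majority-vote set of expert-labeled squares.

```python
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate(puzzles):
    """Pool per-square saliency scores across puzzles and compare to expert labels."""
    y_true, y_score = [], []
    for saliency, important_squares in puzzles:
        for square, score in saliency.items():
            y_true.append(1 if square in important_squares else 0)
            y_score.append(score)
    fpr, tpr, _ = roc_curve(y_true, y_score)            # points of the ROC curve
    return fpr, tpr, roc_auc_score(y_true, y_score)     # area under the curve
```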

(a) ROC curves for different approaches
(b) ROC curves for Ablation Studies
Figure 6: ROC curves comparing approaches on the chess saliency dataset

To evaluate the relative importance of the two components in our saliency computation ($\Delta p$ and $K$; Equation 4), we compute saliency maps and ROC curves using each component individually, i.e. using only $\Delta p$ or only $K$, and compare the harmonic mean to other ways of combining them, i.e. the average, geometric mean, and minimum of $\Delta p$ and $K$. Figure 6(b) shows the results. Combining the two properties via the harmonic mean leads to more accurate saliency maps than the alternatives.
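
For reference, a sketch of the combination functions compared in this ablation (the names are ours), each mapping the relevance term $K$ and the specificity term $\Delta p$ to a saliency value:

```python
# Combination functions compared in the ablation (a sketch; names are assumptions).
combinations = {
    "harmonic_mean":  lambda k, dp: 2 * k * dp / (k + dp) if (k + dp) > 0 else 0.0,
    "average":        lambda k, dp: (k + dp) / 2.0,
    "geometric_mean": lambda k, dp: (k * dp) ** 0.5 if k * dp > 0 else 0.0,
    "minimum":        lambda k, dp: min(k, dp),
    "only_delta_p":   lambda k, dp: dp,   # specificity alone
    "only_k":         lambda k, dp: k,    # relevance alone
}
```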

3.4 Explaining Tactical Motifs in Chess

In this section, we show how our approach can be used to understand common tactical ideas in chess by interpreting the action of a trained agent. Figure 7 illustrates common tactical positions in chess. The corresponding saliency maps are generated by interpreting the moves played by the Stockfish agent in these positions.

In Figure 7(a), it is white to move. The surprising Rxd6 is the move played by Stockfish. Figure 7(d) shows the saliency map generated by our approach, which illustrates the key idea in the position: once black's rook recaptures white's rook, white's bishop pins it to the black king, so white can increase the number of attackers on the rook. The additional attacker is the pawn on e4 highlighted by the saliency map.

In Figure 7(b), it is white to move. Stockfish plays Qxh7, a queen sacrifice. Figure 7(e) shows the saliency map, which highlights the white rook and bishop along with the queen. The key idea is that once black captures the queen with his king (a forced move), the white rook moves to h5 with checkmate. This checkmate is possible because the white bishop guards the important escape square g6. The saliency map highlights both pieces.

In Figure 7(c), it is black to move. Stockfish plays the sacrifice Rxd4. The saliency map in Figure 7(f) illustrates several key aspects of the position. The black queen and light-squared bishop are threatening mate on g2, and the white queen protects g2. The white rook on a5 is unguarded. Therefore, once white recaptures the sacrificed rook with the pawn on c3, black can attack both the white rook and queen with the move bishop to b4. The idea is that the white queen is “overworked” or “overloaded” on d2, having to guard both the g2 pawn and the a5 rook against black's double attack.

(a) Pin
(b) Mate in 2
(c) Overloading
(d) Saliency
(e) Saliency
(f) Saliency
Figure 7: Saliency maps generated by our approach that demonstrate common tactical motifs in chess

3.5 Robustness to Perturbations

We are also interested in evaluating the robustness of the generated saliency maps: does the saliency change if non-salient changes are made to the state? To evaluate this, we perform two irrelevant perturbations to the positions in the chess saliency dataset. First, we pick a random piece among the ones labeled non-salient by human experts in a position and remove it from the board; we repeat this for each puzzle in the dataset to generate a perturbed saliency dataset. Second, we remove a random piece among the ones labeled non-salient by our approach for each puzzle, creating another perturbed saliency dataset. To evaluate the effect of these non-salient perturbations on our generated saliency maps, we compute AUC values, as above, on the perturbed datasets. Since we remove only non-salient pieces, we expect the saliency maps, and consequently the AUC values, to be similar to those on the original dataset. For both perturbations we obtain an AUC value of 0.92, the same as on the unperturbed dataset, confirming the robustness of our saliency maps to these irrelevant perturbations.

4 Related Work

Since understanding RL agents is important both for deploying RL agents to the real world and for gaining insights about the tasks, a number of different kinds of interpretations have been introduced.

A number of approaches generate natural language explanations of RL agents (Dodson et al., 2011; Elizalde et al., 2008; Khan et al., 2009). They assume access to an exact MDP model and policies that map interpretable, high-level state features to actions. More recently, Hayes and Shah (2017) analyze execution traces of an agent to extract explanations. A shortcoming of this approach is that it explains policies in terms of hand-crafted state representations that are semantically meaningful to humans, which is often not practical for board games or Atari games where agents learn from raw board or visual input. Zahavy et al. (2016) apply t-SNE (Maaten and Hinton, 2008) to the last layer of a deep Q-network (DQN) to cluster states by the agent's behavior. They use Semi-Aggregated Markov Decision Processes (SAMDPs) to approximate the black-box RL policies, and use the more interpretable SAMDPs to gain insight into the agent's policy. An issue with these explanations is that they emphasize t-SNE clusters, which are difficult for non-experts to understand. To build user trust and increase adoption, insight into agent behavior should be in a form that is interpretable to the untrained eye and obtained from the original policy rather than a distilled one.

Most relevant to our approach are the visual interpretable explanations of deep networks using saliency maps. Methods for computing saliency can be classified broadly into two categories.

Gradient-based methods identify input features that are most salient to the trained DNN by using the gradient to estimate their influence on the output. Simonyan et al. (2013) use gradient-magnitude heatmaps, which were expanded upon by more sophisticated methods that address their shortcomings, such as guided backpropagation (Springenberg et al., 2014), excitation backpropagation (Zhang et al., 2018), DeepLIFT (Shrikumar et al., 2017), GradCAM (Selvaraju et al., 2017), and GradCAM++ (Chattopadhay et al., 2018). Integrated gradients (Sundararajan et al., 2017) introduces two axioms that further characterize the shortcomings of these approaches, sensitivity (relative to a baseline) and implementation invariance, and uses them to derive an approach. Nonetheless, all gradient-based approaches still depend on the shape of the function in the immediate neighborhood of a few points and, conceptually, use perturbations that lack physical meaning, making them difficult to use and vulnerable to adversarial attacks in the form of imperceptible noise (Ghorbani et al., 2019). Further, they are not applicable in scenarios with only black-box access to the agent, and even with white-box access to model internals, they are not applicable when agents are not fully differentiable, such as Stockfish for chess.

We are more interested in perturbation-based methods for black-box agents: methods that compute the importance of an input feature by removing, altering, or masking the feature in a domain-aware manner and observing the change in output. It is important to choose a perturbation that removes information without introducing any new information. As a simple example, Fong and Vedaldi (2017) consider a classifier that predicts 'True' if a certain input image contains a bird and 'False' otherwise. Removing information from the part of the image which contains the bird should change the classifier's prediction, whereas removing information from other areas should not. Several kinds of perturbations have been explored, e.g. Zeiler and Fergus (2014); Ribeiro et al. (2016) remove information by replacing a part of the input with a gray square. Note that these approaches are implementation invariant by definition, and are sensitive with respect to the perturbation function.

Existing perturbation-based approaches for RL (Greydanus et al., 2018; Iyer et al., 2018), however, by focusing on changes to the full value function, policy vector, or Q-values, tend to produce saliency maps that are not specific to the action of interest. Our proposed approach addresses this by measuring the impact only on the action being selected, resulting in more focused and useful saliency maps, as we show in our experiments.

5 Limitations and Future Work

Saliency maps focus on visualizing the dependence between the model's input and output, essentially identifying a situation-specific explanation for the decision. Although such local explanations have applications in understanding, debugging, and developing trust with machine learning systems, they do not provide any direct insight into the general behavior of the model, nor any guarantee that the explanation applies to a different scenario. Attempts to provide a more general understanding of the model include carefully selecting prototype explanations to show to the user (van der Linden et al., 2019) and crafting explanations that are precise and actionable (Ribeiro et al., 2018). We will explore such ideas for the RL setting in future work, to provide explanations that accurately characterize the behavior of the policy function in a precise, testable, and intuitive manner.

There are a number of limitations of our current implementation for generating saliency maps. First, we perturb the state by removing information (removing pieces in Chess/Go, blurring pixels in Atari). Therefore, our approach cannot highlight the importance of the absence of certain attributes, i.e. the saliency of certain positions being empty. In games such as Chess and Go, an empty square or file (a collection of empty squares) can often be important for a particular move. Future work will explore perturbation functions that add information to the state (for instance, by adding pieces in Chess/Go); such functions, combined with our approach, could be used to calculate the importance of empty squares. Second, perturbations may produce states that lie outside the valid state manifold. For example, unless explicitly prohibited, as we do, our approach would compute the saliency of the kings by removing them, which is not allowed in the game, or remove the paddle from Pong. In future work, we will explore strategies that take the valid state space into account when computing the saliency. Last, we estimate the saliency of each feature independently, ignoring feature dependencies and correlations, which may lead to incorrect saliency maps. We will investigate approaches that perturb multiple features to quantify the importance of each feature (Ribeiro et al., 2016; Lundberg and Lee, 2017), and combine them with our approach to explaining black-box policy-based agents.

6 Conclusion

We presented a perturbation-based approach that generates more focused saliency maps than existing approaches by balancing two aspects (specificity and relevance) that capture different desired characteristics of saliency. We showed through illustrative examples (Chess, Atari, Go), human studies (Chess), and automated evaluation methods (Chess) that our approach generates saliency maps that are more interpretable for humans than existing approaches. The results of our technique show that saliency can provide meaningful insights into a black-box RL agent’s behavior.

References

  • A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian (2018) Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847.
  • T. Dodson, N. Mattei, and J. Goldsmith (2011) A natural language argumentation interface for explanation generation in Markov decision processes. In International Conference on Algorithmic Decision Theory, pp. 42–55.
  • F. Elizalde, L. E. Sucar, M. Luque, J. Diez, and A. Reyes (2008) Policy explanation in factored Markov decision processes. In Proceedings of the 4th European Workshop on Probabilistic Graphical Models (PGM 2008), pp. 97–104.
  • R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437.
  • A. Ghorbani, A. Abid, and J. Zou (2019) Interpretation of neural networks is fragile. Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 3681–3688.
  • S. Greydanus, A. Koul, J. Dodge, and A. Fern (2018) Visualizing and understanding Atari agents. In International Conference on Machine Learning, pp. 1787–1796.
  • B. Hayes and J. A. Shah (2017) Improving robot controller transparency through autonomous policy explanation. In 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 303–312.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • R. Iyer, Y. Li, H. Li, M. Lewis, R. Sundar, and K. P. Sycara (2018) Transparency and explanation in deep reinforcement learning neural networks. CoRR abs/1809.06061.
  • A. Karpathy, J. Johnson, and L. Fei-Fei (2015) Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.
  • O. Z. Khan, P. Poupart, and J. P. Black (2009) Minimal sufficient explanations for factored Markov decision processes. In Nineteenth International Conference on Automated Planning and Scheduling.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  • S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Neural Information Processing Systems (NIPS), pp. 4765–4774.
  • L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
  • T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2018) Anchors: high-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence (AAAI).
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, pp. 3145–3153.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  • J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 3319–3328.
  • I. van der Linden, H. Haned, and E. Kanoulas (2019) Global aggregations of local explanations for black box models. In SIGIR Workshop on FACTS-IR: Fairness, Accountability, Confidentiality, Transparency, and Safety.
  • T. Zahavy, N. Ben-Zrihem, and S. Mannor (2016) Graying the black box: understanding DQNs. In International Conference on Machine Learning, pp. 1899–1908.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
  • J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff (2018) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126 (10), pp. 1084–1102.

Appendix A Experimental Details

For experiments on chess, we use the latest version of the Stockfish agent (https://stockfishchess.org/). Stockfish uses a heuristic-based evaluation of each state along with alpha-beta pruning to search over the state space.

For experiments on Go, we use the pre-trained MiniGo RL agent (https://github.com/tensorflow/minigo). This agent was trained using the AlphaGo algorithm (Silver et al., 2016), and also incorporates features and architecture changes from the AlphaZero algorithm (Silver et al., 2017).

For experiments on Atari agents and for generating saliency maps for Greydanus et al. (2018), we use their code and pre-trained RL agents available at https://github.com/greydanus/visualize_atari. These agents are trained using the Asynchronous Advantage Actor-Critic (A3C) algorithm (Mnih et al., 2016).

For generating saliency maps using Iyer et al. (2018), we use our own implementation. All of our code and more detailed results are available in our Github repository: https://github.com/rl-interpretation/understandingRL.

For chess and Go, we perturb the board position by removing one piece at a time. We do not remove a piece if the resulting position is illegal; for instance, in chess, we do not remove the kings. For Atari, we use the perturbation technique described in Greydanus et al. (2018), which perturbs the input image by adding a Gaussian blur localized around a pixel. The blur is constructed using the Hadamard product to interpolate between the original input image and a Gaussian-blurred copy of it. The saliency maps for Atari agents have been computed on the frames provided by Greydanus et al. (2018) in their code repository.
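
A minimal sketch of this perturbation follows. It is our own illustration rather than the exact code from their repository; the frame is assumed to be a 2-D grayscale numpy array, and the width parameters `blur_sigma` and `mask_sigma` are placeholders.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def perturb_frame(frame, i, j, blur_sigma=3.0, mask_sigma=5.0):
    """Blur the frame locally around pixel (i, j) via Hadamard-product interpolation."""
    blurred = gaussian_filter(frame, sigma=blur_sigma)   # globally blurred copy
    mask = np.zeros_like(frame, dtype=float)
    mask[i, j] = 1.0
    mask = gaussian_filter(mask, sigma=mask_sigma)       # Gaussian mask centered on (i, j)
    mask /= mask.max()                                   # peak value of 1 at (i, j)
    # Blend: blurred near (i, j), original elsewhere.
    return frame * (1.0 - mask) + blurred * mask
```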

The puzzles for conducting the Chess human studies, creating the Chess Saliency Dataset, and providing illustrative examples have been taken from Lichess: https://database.lichess.org/. The puzzles for illustrative examples on Go have been taken from OnlineGo: https://online-go.com/puzzles.

Appendix B Saliency Maps for top 3 moves

Figure 8 shows the saliency maps generated by our approach for the top 3 moves in a chess position. The maps highlight the different pieces that are salient for each move. For instance, Figure 8(a) shows that for the move Qd4, the pawn on g7 is important, because the queen move protects that pawn. In the saliency maps for the other two moves (Figures 8(b) and 8(c)), the pawn on g7 is not highlighted.

(a) Move Qd4
(b) Move Rf1
(c) Move Bb5
Figure 8: Saliency Maps generated by our approach for the top 3 moves in a chess position

Appendix C Saliency Maps for LeelaZero

To show that our approach generates meaningful saliency maps in chess for deep RL agents, we interpret the LeelaZero deep RL agent (https://github.com/leela-zero/leela-zero). Figure 9 shows the results. As discussed in Section 1, the saliency maps generated by Greydanus et al. (2018) and Iyer et al. (2018) highlight several pieces that are not relevant to the move being explained, whereas the saliency maps generated by our approach highlight the pieces relevant to the move.

(a) Original Position
(b) Iyer et al. (2018)
(c) Greydanus et al. (2018)
(d) Our Approach
(e) Original Position
(f) Iyer et al. (2018)
(g) Greydanus et al. (2018)
(h) Our Approach
Figure 9: Saliency maps generated by different approaches for the LeelaZero Deep Reinforcement Learning Agent