In spite of their impressive performance across a wide variety of tasks, deep RL agents are often criticized as black boxes that lack interpretability, which has become an increasingly pressing concern in deep RL. Moreover, while deep RL benefits substantially from the powerful function approximation capability of deep neural networks (DNNs), the poor interpretability of DNNs further exacerbates these concerns. Hence, the ability to understand an agent's underlying decision-making process is crucial before deep RL is used to solve real-world problems where reliability and robustness are critical.
In machine learning, there has been considerable interest in explaining the decisions of black-box systems [8, 34, 26, 18]. Popular methods provide visual explanations for DNNs, such as LIME, LRP, DeepLIFT, Grad-CAM, Kernel-SHAP and network dissection [4, 60]. However, these methods generally depend on class information and cannot be directly adapted to continuous RL tasks. In the context of vision-based RL, one feasible explanation approach is to learn t-Distributed Stochastic Neighbor Embedding (t-SNE) maps [33, 54, 2], but such maps are difficult for non-experts to understand. Moreover, a number of works apply gradient-based [52, 54] and perturbation-based approaches to visualize important features for an RL agent's decisions, but the generated saliency maps are usually coarse and offer only limited quantitative evaluation. Another promising approach incorporates attention mechanisms into the actor network to explain the agent's decisions [30, 35, 59, 58]. However, these methods are not applicable to pretrained agent models whose internal structure can no longer be changed; in addition, some of them depend on human demonstration datasets [59, 58].
This paper aims to render causal explanations for vision-based RL, where the agent's states are color images. To overcome the limitations of the above methods, we propose a self-supervised interpretable framework that discovers causal features, making it easy to understand what information is task-relevant and where to look in the state. Answering these questions provides valuable causal explanations about how decisions are made by the agent and why the agent performs well or badly. The main idea underlying our framework is novel and simple. Specifically, for a pretrained policy that needs to be explained, our framework learns to predict an attention mask that highlights the information (or features) that may be task-relevant in the state. If the generated actions are consistent when the policy takes as input the state and the attention-overlaid state respectively, the features highlighted by our framework are considered task-relevant and constitute the main evidence for the agent's decisions.
In this paper, the kernel module of our framework, a self-supervised interpretable network (SSINet), is first presented for vision-based RL agents based on two properties: maximum behavior resemblance and minimum region retaining. These two properties force the SSINet to provide believable and easy-to-understand explanations for humans. After its validity is empirically verified, the SSINet is applied to causally explain RL agents from two facets: decision-making and performance. While the former focuses on explaining how the agent makes decisions, the latter emphasizes explaining why the agent performs well or badly. More concretely, the agent's decisions are explained by understanding basic attention patterns, identifying the relative importance of features and analyzing failure cases. Moreover, to explain the agent's performance, such as its robustness when transferred to novel scenes, two mask metrics are introduced to evaluate the attention masks generated by the SSINet, and then how the agent's attention influences performance is explained quantitatively.
We conduct comprehensive experiments on several Atari 2600 games as well as Duckietown, a challenging self-driving car simulator environment. Empirical results verify the effectiveness of our method and demonstrate that the SSINet produces high-resolution, sharp attention masks that highlight the task-relevant information constituting the main evidence for the agent's decisions. In other words, our method discovers causal features that explain how the agent makes decisions and why it performs well or badly. Overall, our method takes a significant step towards causally interpreting vision-based RL.
It is worth noting that our whole training procedure can be seen as self-supervised, because the data for training the SSINet is collected by the pretrained RL agent itself. Generally, self-supervised learning is challenging due to the lack of labelled data, and it is not well understood why humans excel at it; for example, a child has never been supervised at the pixel level, yet can still perform highly precise segmentation tasks. Our method reveals a self-supervised way to learn high-quality masks by directly interacting with the environment, which may shed light on new paradigms for label-free vision learning such as self-supervised segmentation and detection.
The remainder of this paper is organised as follows. In the following two sections, we review related work and introduce the relevant RL background. In Section 4, we present our self-supervised interpretable framework for vision-based RL. In Section 5, empirical results are provided to verify the effectiveness of our method. In Sections 6 and 7, our method is applied to causally explain how the agent makes decisions and why the agent performs well or badly, respectively. In the last section, we draw conclusions and outline future work.
2 Related Work
2.1 Explaining Traditional RL Agents
A number of earlier works have focused on explaining traditional RL agents. For example, based on the assumption that an exact Markov Decision Process (MDP) model is readily accessible, natural language explanations and logic-based explanations have been given for RL agents respectively. More recently, the execution traces of an agent have been analyzed to extract explanations. However, these methods rely heavily on interpretable, high-level or hand-crafted state features, which is impractical for vision-based applications.
Other explanation methods include decision trees [19, 3, 40] and structural causal MDPs [50, 29]. While decision trees can be represented graphically and thus aid human understanding, a reasonably-sized tree with explainable attributes is difficult to construct, especially in the vision-based domain. Structural causal MDP methods are designed for specific MDP models and thus provide only limited explanations.
2.2 Explaining Vision-Based RL Agents
Explaining the decision-making process of RL agents has been a particular area of interest for recent works. Here we review prior works that aim to explain how inputs influence sequential decisions in vision-based RL. Broadly speaking, existing methods can be partitioned into four categories: embedding-based methods, gradient-based methods, perturbation-based methods and attention-based methods. In addition to those works that focus on the explanation of vision-based RL, some popular and relevant works for visual explanations of DNNs will also be reviewed.
Embedding-based methods. The main idea underlying embedding-based methods for interpreting vision-based RL is to visualize high-dimensional data with t-SNE, a commonly used non-linear dimensionality reduction method. The simplest approach is to directly map the representations of perceptually similar states to nearby points [33, 54, 2]. Each state is represented as a point in the t-SNE map, and the color of the points is set manually using global features or specific hand-crafted features. In addition, some work attempts to learn an embedded map in which the distance between any two states is related to the transition probabilities between them. However, an issue with these methods is that they emphasize t-SNE clusters or state transition statistics, which are uninformative to those without a machine learning background.
Gradient-based methods. Methods in this category aim to understand which features of an input are most salient to its output by performing only one or a few backward passes through the network. The prototypical work is Jacobian saliency maps, where attributions are computed as the Jacobian with respect to the output of interest. Furthermore, several works generate Jacobian saliency maps and present them over the input state to understand which pixels in the state affect the value or action prediction the most [52, 54]. Moreover, several other works modify gradients to obtain saliency maps for explanations of DNNs, such as Excitation Backpropagation, Grad-CAM, LRP, DeepLIFT and SmoothGrad. Unfortunately, although Jacobian saliency maps are efficient to compute and have clear semantics, they may be difficult to interpret due to the change of manifold.
Perturbation-based methods. This category includes methods that measure the variation of a model's output when some of the input information is removed or perturbed [15, 61]. The simplest perturbation approach computes attributions by replacing part of an input image with a gray square or region. An issue with this approach is that replacing pixels with a constant color introduces undesirable information. Recently, a Gaussian perturbation approach has been applied to visualize Atari agents using masked interpolations between the original state and a Gaussian blur, but Gaussian perturbation fails to capture the shape of features and results in coarse saliency maps. A particular example of perturbation-based methods is Shapley values, whose exact computation is NP-hard; hence recent works apply perturbation approaches to approximate Shapley values for explanations of DNNs, such as LIME, Kernel-SHAP and DASP. Moreover, these gradient-based and perturbation-based methods [52, 54, 17] for RL provide only limited quantitative evaluation.
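The occlusion idea behind the simplest perturbation approaches can be sketched in a few lines. This is a minimal toy illustration, not any cited method's exact procedure; the image, patch size and scoring function below are all hypothetical:

```python
def occlusion_saliency(model, image, patch=2, fill=0.0):
    """Toy perturbation-based saliency: slide a constant-color patch
    over the image and record how much the model's output changes."""
    h, w = len(image), len(image[0])
    saliency = [[0.0] * w for _ in range(h)]
    base = model(image)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            # Copy the image and blank out one patch.
            perturbed = [row[:] for row in image]
            for di in range(i, min(i + patch, h)):
                for dj in range(j, min(j + patch, w)):
                    perturbed[di][dj] = fill
            # Attribution = output change caused by removing this patch.
            delta = abs(base - model(perturbed))
            for di in range(i, min(i + patch, h)):
                for dj in range(j, min(j + patch, w)):
                    saliency[di][dj] = delta
    return saliency

# Hypothetical "model": responds only to the top-left 2x2 corner.
model = lambda img: img[0][0] + img[0][1] + img[1][0] + img[1][1]
image = [[1.0] * 4 for _ in range(4)]
sal = occlusion_saliency(model, image)
```

As the surrounding text notes, replacing pixels with a constant color can itself introduce information, which is why Gaussian-blur interpolations were proposed as an alternative.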
Attention-based methods. Another branch of development incorporates various attention mechanisms into vision-based RL agents. Learning attention to generate saliency maps for understanding internal decision patterns is one of the most popular methods in the deep learning community, and there is already a considerable number of works in the direction of interpretable RL. A mainstream approach is to augment the actor network (or agent) with customized self-attention modules [48, 36, 53, 30, 37], which learn to focus attention on semantically relevant areas for making decisions. Another notable approach implements the key-value structure of attention to learn explainable policies by sequentially querying the agent's view of the environment [11, 2, 35]. However, these methods generally need to change the agent's internal structure and thus cannot explain agent models that have already been trained. Moreover, some works attempt to build a human Atari-playing attention dataset and use it to learn an explainable policy via imitation learning [59, 58], but the cost of collecting such data can be prohibitive, and it is impractical to do so for every vision-based RL task.
3 Background
We consider a standard RL setup consisting of an agent interacting with an environment in discrete timesteps. Specifically, at each timestep the agent takes an action $a_t$ in a state $s_t$ and receives a scalar reward $r_t$; meanwhile, the environment changes its state to $s_{t+1}$. We model the RL task as a Markov decision process with state space $\mathcal{S}$, action space $\mathcal{A}$, initial state distribution $\rho_0$, transition dynamics $p(s_{t+1} \mid s_t, a_t)$, and reward function $r(s_t, a_t)$. In all the tasks considered here, the actions are real-valued.
An agent's behaviour is defined by a policy $\pi$, which maps states to a probability distribution over the actions. The return from a state is defined as the sum of discounted future rewards computed over a horizon $T$, i.e., $R_t = \sum_{i=t}^{T} \gamma^{\,i-t} r(s_i, a_i)$ with a discounting factor $\gamma \in [0, 1]$. Note that the return depends on the actions selected, and therefore on the policy $\pi$. The goal of an agent is to learn a policy which maximizes the expected return from the start distribution, $J(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\, a_t \sim \pi}\left[ R_0 \right]$.
In this paper, we pretrain the agent with three model-free RL algorithms: proximal policy optimization (PPO), soft actor-critic (SAC) and twin delayed deep deterministic policy gradient (TD3). As an on-policy method, PPO uses a trust region update to improve a general stochastic policy with gradient descent. Both SAC and TD3 are off-policy and based on the actor-critic architecture. While SAC leads to a maximum entropy policy capturing multiple modes of near-optimal behaviour, TD3 learns a deterministic policy by building on double Q-learning and deep deterministic policy gradient.
In this section, we first present the main idea underlying causal explanations for vision-based RL. Then, a self-supervised interpretable framework is proposed for a general RL agent. Finally, a two-stage training procedure is given for this framework.
4.1 Causal Explanations for Vision-Based RL
Consider a general setting where an expert policy is obtained by pretraining an actor network (agent) $\pi$, which takes as input an image to predict an action. To provide causal explanations, our goal is to train a separate explanation model $f_\theta$ that produces a mask $m_t$ corresponding to the importance assigned to each pixel of state $s_t$. In general, the mask can be interpreted as a kind of soft attention showing where the agent “looks” to make its decision. In the context of vision-based RL, the explanation model should satisfy two properties, namely maximum behavior resemblance and minimum region retaining.
Property 1 (Maximum behavior resemblance).
For an actor network $\pi$ and an explanation model $f_\theta$, suppose $\hat{s}_t$ is the attention-overlaid state corresponding to a specific state $s_t$, i.e., $\hat{s}_t = f_\theta(s_t) \odot s_t$, then
$$\pi(\hat{s}_t) \approx \pi(s_t), \quad \forall s_t \in \tau,$$
where $\odot$ denotes element-wise multiplication, and $\tau$ is an episode generated with $\pi$.
Property 2 (Minimum region retaining).
For a parameterized explanation model $f_\theta$ and a specific state $s_t$, the retained region is required to be minimal after overlaying the state with the corresponding attention. That is,
$$\min_\theta \, \| f_\theta(s_t) \|_1,$$
where $\| \cdot \|_1$ denotes the $\ell_1$-norm, and $\theta$ is the parameters of the explanation model $f_\theta$.
Remark. Property 1 requires the agent's behavior to remain as consistent as possible with the original after the states are overlaid with the attention generated by $f_\theta$. Property 2 requires $f_\theta$ to attend to as little information as possible, enabling humans to understand the decision-making easily.
In addition to the above properties, we emphasize that an explanation model for vision-based RL should be able to provide causal explanations from two facets:
Interpretability of decision-making. In order for an agent to be interpretable, it must not only suggest informative explanations that make sense to those without a machine learning background, but also ensure these explanations accurately represent the intrinsic reasons for the agent's decision-making. Concretely, it should be easy to understand how decisions are made, how an agent's current state affects its action, what information is used and where to look. Once these questions are answered, the underlying decision-making process of the RL agent is partially uncovered. Note that this type of analysis does not rely on an optimal policy; if an agent takes a suboptimal or even bad action but the reasons for it can be explained faithfully, we still consider the agent interpretable.
Interpretability of performance. In the context of RL, transferability is whether the agent can generalize its policy across different scenes, and robustness is the ability of an agent to resist unknown external disturbances such as unexpected objects and new situations in novel scenes. In practice, it is meaningful and instructive to explain the performance of interest, especially when transferring the agent to novel scenes. More concretely: How does the agent's attention influence performance? What factors affect the performance? Do the RL algorithm and actor network architecture play a major role? Answering these questions can help explain why deep RL agents perform well or badly.
4.2 Self-Supervised Interpretable Framework
In this section, we present a self-supervised interpretable framework for the explanation model $f_\theta$. As outlined in Fig. 1, for an RL agent modelled by an actor network, we integrate a self-supervised interpretable network (SSINet) in front of the actor network. While the agent receives a state to predict an action at each time step, the SSINet produces an attention mask that highlights the task-relevant information for making the decision, without any external supervised signal. To that end, the SSINet must learn which parts of the state the agent considers important.
SSINet. Most recent approaches to dense prediction tasks such as semantic segmentation and scene depth estimation adopt an encoder-decoder structure. In order to make our masks sharp and precise, we build our SSINet by directly adapting a U-Net architecture with only minor changes, the exact details of which are described in Appendix A of the supplemental material. As depicted in Fig. 1, our SSINet includes a feature extractor $f_e$ and a mask decoder $f_d$. Specifically, a state $s_t$ at time step $t$ (here a frame of height $H$, width $W$ and channels $C$) is encoded through $f_e$ to obtain a low-resolution feature map, which is then taken as input by $f_d$ and upsampled to produce an attention mask $m_t$:
$$m_t = \sigma\left( f_d\left( f_e(s_t) \right) \right),$$
where $\sigma$ is the sigmoid nonlinearity. Afterwards, the attention mask $m_t$ is broadcast along the channel dimension of the state $s_t$ and element-wise multiplied with it to form a masked (or attention-overlaid) state $\hat{s}_t$:
$$\hat{s}_t = m_t \odot s_t,$$
where $\odot$ denotes element-wise multiplication.
Actor Network. Generally, we use an actor network to model the policy of an RL agent. As shown in Fig. 1, the actor network includes a feature extractor $f_e$ and an action predictor $f_a$. The feature extractor takes as input a masked state $\hat{s}_t$ (or a state $s_t$) and outputs a low-resolution feature map. The action predictor is a simple two-layer perceptron, which uses the flattened feature map to predict an action $a_t$:
$$a_t = \phi\left( f_a\left( f_e(\hat{s}_t) \right) \right),$$
where $\phi$ is the tanh nonlinearity for continuous RL tasks or the softmax nonlinearity for discrete RL tasks. Note that to generate interpretable attention masks that provide access to the task-relevant information used for making decisions, the feature extractor of the actor network is shared with the SSINet. For clarity, we refer to the actor network taking $s_t$ as input as the expert policy, and to the actor network taking $\hat{s}_t$ as input as the mask policy.
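The encode-decode-overlay-predict pipeline can be sketched end to end with toy stand-ins. Everything below is hypothetical (a 4x4 grayscale "state", average pooling as the encoder, nearest-neighbour upsampling as the decoder, a linear tanh head as the actor), not the paper's U-Net or actor architecture:

```python
import math

def feature_extractor(state):
    """f_e: HxW -> (H/2)x(W/2) via 2x2 average pooling."""
    h, w = len(state), len(state[0])
    return [[(state[i][j] + state[i][j+1] + state[i+1][j] + state[i+1][j+1]) / 4.0
             for j in range(0, w, 2)] for i in range(0, h, 2)]

def mask_decoder(feat):
    """f_d: nearest-neighbour upsample back to HxW, then sigmoid."""
    up = [[v for v in row for _ in (0, 1)] for row in feat for _ in (0, 1)]
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    return [[sig(v) for v in row] for row in up]

def overlay(mask, state):
    """Attention-overlaid state: element-wise product of mask and state."""
    return [[m * s for m, s in zip(mr, sr)] for mr, sr in zip(mask, state)]

def action_predictor(feat):
    """f_a: flatten the feature map and apply a tanh head."""
    return math.tanh(sum(sum(row) for row in feat))

state = [[0.0, 0.0, 1.0, 1.0]] * 4
mask = mask_decoder(feature_extractor(state))               # SSINet forward pass
masked_state = overlay(mask, state)                         # attention-overlaid state
action = action_predictor(feature_extractor(masked_state))  # mask policy action
```

Note that the same `feature_extractor` is called by both the SSINet and the actor, mirroring the shared feature extractor described above.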
There are several advantages to our self-supervised interpretable framework. First, it is applicable to any RL model that takes visual images as input. Second, the SSINet learns to predict task-relevant information, which provides intuitive and valuable explanations for the agent's decisions. Finally, we emphasize that the SSINet is a flexible explanatory module which can be adapted to other vision-based decision-making systems.
4.3 Training Procedure
Our training procedure includes two stages. The first stage aims to obtain an RL agent and use its expert policy to generate state-action pairs, which are used for training the SSINet in the second stage. The objective of the second stage is to learn interpretable attention masks for explaining the agent's behaviour. Overall, the whole training procedure is self-supervised, because no external labelled data is required.
In the first stage, we switch to the expert-policy path (as shown in Fig. 1) and pretrain the feature extractor and action predictor with the PPO algorithm. After training, the resulting expert policy is used to collect data by generating state-action pairs $(s_t, a_t)$.
In the second stage, we switch to the mask-policy path and train the SSINet. Based on Property 1, our goal is to learn attention masks such that there is minimum variation between the predicted actions (6) after changing the input from $s_t$ to $\hat{s}_t$. Moreover, Property 2 requires the learned mask to attend to as little information as possible. These considerations lead to the following mask loss function:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{t=1}^{N} \left( \left\| \pi(\hat{s}_t) - a_t \right\|^2 + \lambda \, \| m_t \|_1 \right),$$
where $\| \cdot \|_1$ denotes the $\ell_1$-norm, $\lambda$ is a positive scalar controlling the sparseness of the mask, and $N$ is the batch size. The first term ensures that the agent's behaviour does not change much after overlaying the state with the corresponding attention mask, while the second term is a sparse regularization that pushes for better visual explanations for humans.
One point worth noting is that only the mask decoder is trained in the second stage, because the feature extractor of the SSINet is shared with the actor network pretrained in the first stage. The pseudo-code of our training procedure is summarized in Algorithm 1.
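To make the second stage concrete, the following toy sketch optimizes a per-element mask under a behavior-consistency term plus an l1-style sparsity penalty, with a hand-derived gradient. Everything here (the 4-dimensional "state", the fixed linear "policy", the hyperparameters) is a hypothetical miniature of the procedure, not the paper's implementation:

```python
import math

sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# Fixed pretrained "expert policy": a linear map that only uses dims 0 and 3.
v = [1.0, 0.0, 0.0, 1.0]
policy = lambda s: sum(vi * si for vi, si in zip(v, s))

# State-action pairs collected with the expert policy (stage 1).
states = [[1.0, 0.8, 1.2, 0.6], [0.5, 1.5, 0.7, 1.0], [1.2, 0.9, 1.4, 0.8]]
actions = [policy(s) for s in states]

# Stage 2: learn mask logits w so that policy(m * s) matches policy(s)
# while keeping the mask sparse (weight lam on the sum of mask values).
w = [0.0] * 4
lam, lr = 0.05, 0.5
for _ in range(2000):
    grad = [0.0] * 4
    for s, a in zip(states, actions):
        m = [sigmoid(wj) for wj in w]
        err = policy([mj * sj for mj, sj in zip(m, s)]) - a
        for j in range(4):
            # d/dw_j of (err^2 + lam * sum(m)):
            grad[j] += (2 * err * v[j] * s[j] + lam) * m[j] * (1 - m[j])
    w = [wj - lr * gj / len(states) for wj, gj in zip(w, grad)]

mask = [sigmoid(wj) for wj in w]
# The mask stays near 1 on the dims the policy actually uses (0 and 3)
# and collapses toward 0 on the unused dims (1 and 2).
```

The same trade-off appears here as in the mask loss: the consistency term keeps useful inputs unmasked, while the sparsity term prunes everything the policy never relies on.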
5 Validity of Our Method
Before applying the proposed method to provide causal explanations for vision-based RL, we first verify its effectiveness through performance evaluation and comparative evaluation in this section. Our method is then applied to provide causal explanations and empirical evidence about how the agent makes decisions in Section 6 and why the agent performs well or badly in Section 7.
We verify and evaluate the performance of our method on several Atari 2600 games and the Duckietown environment (see below for details). All experimental details are given in Appendix C of the supplemental material. Note that during data collection, the expert policy is used to generate the state-action pairs for training the SSINet.
Duckietown. Duckietown is a self-driving car simulator environment for OpenAI Gym. It places the agent, a Duckiebot, inside an instance of a Duckietown: a loop of roads with turns, intersections, obstacles, and so on. Specifically, the states are color images resized from single camera images, and the actions consist of two continuous normalized numbers corresponding to forward velocity and steering angle respectively. The goal of the agent is to drive forward along the right lane; hence the agent is rewarded for staying as close as possible to the center line of the lane, and also for facing the same direction as the lane's tangent. In our experiments, we evaluate our method on the Lane-following task, and the empty map is chosen as the training scene. In addition to the empty and zigzag maps provided by the official environment, another eight customized maps (empty-city, zigzag-city, corner, corner-city, U-turn, U-turn-city, S-turn and S-turn-city) are designed only for evaluation. These ten maps differ from each other mainly in background and driving route; detailed descriptions are given in Appendix B of the supplemental material.
Atari 2600. In addition to Duckietown, we also perform experiments in the Arcade Learning Environment (ALE), a commonly used benchmark environment for discrete RL tasks. We select the Assault and Tennis games for our experiments. Assault is a fixed shooter game in which the player has to destroy the small ships continually deployed by an alien mother ship while avoiding being shot. Tennis is a singles tennis game which follows standard tennis rules and allows players to hit assorted forehand and backhand shots to any location on the court. In each game, the agent receives stacked grayscale images as inputs, following common practice.
To demonstrate the effectiveness of our proposed method, we verify the consistency between the RL agent's mask policy and expert policy from two aspects: average return and behavior matching. Average return represents the long-term rewards of the two policies, while behavior matching characterizes their behavioural similarity. We note that similar metrics are also used for attention-guided learning in recent work.
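Behavior matching can be computed as a simple action-agreement rate over collected states. The sketch below is a hedged illustration; the agreement tolerance and the toy action sequences are hypothetical, not the paper's exact metric:

```python
def behavior_matching(expert_actions, mask_actions, tol=0.05):
    """Fraction of timesteps where the mask policy's action stays
    within `tol` of the expert policy's action (continuous tasks).
    For discrete tasks one would compare argmax actions instead."""
    assert len(expert_actions) == len(mask_actions)
    matches = sum(1 for a, b in zip(expert_actions, mask_actions)
                  if abs(a - b) <= tol)
    return matches / len(expert_actions)

expert = [0.10, -0.32, 0.55, 0.90]
masked = [0.11, -0.30, 0.54, 0.40]   # last action diverges
rate = behavior_matching(expert, masked)
```

A rate close to 1 indicates the mask policy reproduces the expert's behaviour from the attention-overlaid state alone.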
Fig. 2 shows the results of our SSINet on three tasks in terms of behavior matching. We observe that the mask policy makes decisions using only partial information (the bright areas in the heatmaps), while the expert policy uses all information in the state; nevertheless, as expected, the mask policy predicts almost the same actions as the expert policy. This observation verifies that the attention masks produced by the SSINet accurately highlight the task-relevant information constituting the main evidence for the expert policy's behaviour.
Fig. 4 compares the performance of the expert policy and the mask policy in terms of average return. Fig. 5 visualizes the performance on the empty, empty-city, zigzag and zigzag-city maps; more visualization results are given in Appendix D of the supplemental material. It can be seen in Fig. 4 that the mask policy consistently achieves greater long-term rewards than the expert policy on all maps except S-turn-city. As stated in Section 4.2, the expert policy and mask policy take as input the original state and the attention-overlaid state, respectively. Therefore, we can conclude that the attention masks produced by the SSINet retain task-relevant information and filter out task-irrelevant information, which verifies the effectiveness of our method.
Comparative evaluation. We compare our method against several popular explanation methods, including the Jacobian-based saliency method (Jacobian-Saliency), the Gaussian perturbation-based saliency method (Perturbation-Saliency), the attention augmented agent model (A3M) and the sparse free-lunch saliency via attention method (Sparse FLS). These methods focus on the interpretability of vision-based RL and were briefly reviewed in Section 2. Fig. 3 visualizes the saliency maps generated by our method and the other four methods. The results show that, overall, our method produces higher-resolution and sharper saliency maps than the others. Moreover, it is worth noting that the saliency maps generated by our method reflect the relative importance of different features through the depth of color, which is further discussed in the next section.
6 Interpreting the Decision-Making of Agents
In this section, our method is applied to explain how the RL agent makes decisions from three aspects. First, basic attention patterns for making decisions are recognized and understood. Second, the relative importance of different task-relevant features is identified for easy understanding of the agent's decision-making process. Third, some failure cases are analyzed from the viewpoint of attention shift.
6.1 Basic Attention Patterns for Making Decisions
Here we explain how a vision-based RL agent makes decisions by visualizing and understanding the agent's basic attention patterns. As can be observed in Fig. 2, the most dominant pattern is that the agent focuses on only small regions that are strongly task-relevant, while other regions are very “blurry” and can be ignored. In other words, the state is not treated as a primitive; the agent learns what information is important for making decisions and where to look at each time step. For example, the task-relevant features are the white edge line and yellow dashed line on the Lane-following task, enemies and health points on the Assault shooting task, and the players and ball on the Tennis task. In fact, this conclusion is consistent with the human gaze-action pattern, one characteristic of which is that humans tend to focus attention selectively on parts of the visual space to acquire task-relevant information when and where it is needed.
6.2 Relative Importance of Task-Relevant Features
In addition to making clear what information is used and where to look, understanding the relative importance of different task-relevant features is also crucial for explaining the agent's decision-making process. Although the value of the attention mask (or the depth of color in the heatmap) intuitively indicates the relative importance of different features in the state, it is not strictly verifiable on its own.
In this section, we seek to identify the relative importance of task-relevant features in a more interpretable way. We observe that a greater regularization scale $\lambda$ in the mask loss (7) imposes a more severe penalty on the agent for attending to task-irrelevant regions. Based on this observation, we propose to assess the relative importance of task-relevant features by comparing multiple attention masks trained with different values of $\lambda$. To that end, we perform a fine search to visualize the evolving process of attention masks. Fig. 6 shows the evolution of the attention masks in the form of heatmaps as the regularization scale $\lambda$ varies.
As can be seen in Fig. 6, as the regularization scale increases, the “attended” regions gradually narrow down to the most important information, as expected. Concretely, the inner side of both the yellow dashed line and the white edge line is considered more important than the outer side for making decisions, and nearby lines are considered more important than distant lines. In fact, this conclusion is consistent with the human gaze system, in which limited visual sensor resources are assigned to the most important information.
6.3 Analysis of Failure Case
In practice, it is critical to ensure that a trained RL agent can be directly transferred to novel scenes different from the training scene. However, robustness is not always guaranteed. Taking the Lane-following task as an example, as can be seen in Fig. 4, there is significant performance degradation when transferring the agent trained on the empty map to other maps, such as S-turn-city, zigzag and zigzag-city. This robustness problem can be explained intuitively from the point of view of attention shift.
In those failure cases, we notice that the agent is prone to divert its attention from task-relevant information to the background when facing novel situations; we call this phenomenon attention shift. Fig. 7 visualizes a common problematic situation leading to poor robustness on the S-turn-city and zigzag-city maps, in which the agent needs to turn left at a corner surrounded by grassland and a lake. This novel situation was never encountered on the empty map during training, so it may be difficult for the agent to judge which features are important for making decisions in the current situation. As a result, the agent gradually loses attention to the task-relevant information (the white edge line) and mistakenly attends to the background (the lake and grassland), and it then gets stuck due to catastrophic cumulative attention shift.
7 Interpreting the Performance of Agents
In this section, our method is applied to quantitatively explain why the agent performs well or badly, especially when transferred to novel scenes. To that end, we start by introducing two evaluation metrics to assess the attention masks generated by the SSINet. These metrics allow us to give a quantitative explanation of how the agent's attention influences its performance. They are then further used to explain the performance of RL agents trained with different algorithms and actor architectures. Finally, a potential extension of our method to self-supervised learning is briefly discussed.
7.1 Mask Evaluation Metrics
To interpret the performance of RL agents, two evaluation metrics are introduced to assess the quality of generated attention masks. Specifically, feature overlapping rate and background elimination rate are defined as:
Feature Overlapping Rate (FOR) - the overlapping ratio between the area of true mask and learned mask.
Background Elimination Rate (BER) - the ratio of eliminated background area by the mask to the whole background area.
For a specific state $s_t$, the mask metrics FOR and BER are calculated as follows:
$$\mathrm{FOR} = \frac{\mathrm{Area}(A_F \cap A_M)}{\mathrm{Area}(A_F \cup A_M)}, \qquad \mathrm{BER} = 1 - \frac{\mathrm{Area}(A_B \cap A_M)}{\mathrm{Area}(A_B)},$$
where $\cup$ and $\cap$ are the union and intersection operators respectively, and $A_M$, $A_F$ and $A_B$ represent the area of extracted features, “true” task-relevant features and “true” background respectively, as shown in Fig. 8. Note that the “true” background is the area outside the “true” features in Fig. 8(b). In general, FOR indicates how well the agent extracts useful information from the state, and BER indicates how well the SSINet eliminates task-irrelevant information in the state.
Fig. 8. An example of calculating the mask metrics. (a) is a specific state from Duckietown; the red area in (b) is the “true” task-relevant features that we expect to learn; and the unmasked area in (c) is the features extracted by our SSINet.
To compute FOR and BER, the "true" features are annotated manually. Note that in this section, Duckietown is chosen as the main experimental environment for two reasons. First, as described in Section 5.1, Duckietown is a self-driving car simulator environment whose task-relevant features are clear and easy to identify (i.e., the white edge line and the yellow dashed line). Second, Duckietown is a highly customizable environment: the background and driving route of each task can be varied to satisfy the researcher's needs, which is important for evaluating the generalization of RL agents. In our experiments, we use the averaged mask metrics $\overline{FOR}$ and $\overline{BER}$ on the training map to characterize the attention masks generated by our SSINet.
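Under the definitions above, both metrics reduce to simple pixel-count ratios over binary masks. The following is a minimal sketch, assuming the learned mask and the manually annotated "true" feature mask are given as boolean arrays (function and argument names are illustrative, not the paper's API):

```python
import numpy as np

def mask_metrics(learned_mask, true_feature_mask):
    """Compute FOR and BER for a single state.

    learned_mask:      boolean array, True where SSINet keeps a pixel.
    true_feature_mask: boolean array, True on manually annotated
                       task-relevant pixels; all other pixels are background.
    """
    true_background = ~true_feature_mask
    # FOR: fraction of the "true" feature area covered by the learned mask.
    overlap = np.logical_and(learned_mask, true_feature_mask).sum()
    FOR = overlap / true_feature_mask.sum()
    # BER: fraction of the "true" background area eliminated by the mask.
    kept_background = np.logical_and(learned_mask, true_background).sum()
    BER = 1.0 - kept_background / true_background.sum()
    return FOR, BER
```

Averaging these per-state values over states from the training map would then yield the reported $\overline{FOR}$ and $\overline{BER}$.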
7.2 How the Agent’s Attention Influences Performance
In order to quantitatively analyze how a vision-based RL agent's attention influences its performance, especially when transferred to novel scenes, we compare the average returns of multiple mask policies. These mask policies are all trained to interpret the same RL agent but produce different attention masks for the same state. In our experiments, we train SSINet under different regularization scales for the same actor network. Fig. 9 shows how the average mask metrics ($\overline{FOR}$ and $\overline{BER}$) influence the average return on four maps.
As can be seen in Fig. 9, when evaluated on the same map, the agent performs differently under different mask metrics. Only when both $\overline{FOR}$ and $\overline{BER}$ are high can the best performance be achieved. In other words, the agent cannot perform well enough if it neglects task-relevant information or attends too much to the background, reflected by a small $\overline{FOR}$ or a small $\overline{BER}$ respectively.
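Each point in such a comparison comes from rolling out the pretrained policy on attention-overlaid states and recording the return. A minimal sketch of that evaluation loop, assuming a classic gym-style environment and treating `policy` and `mask_net` as plain callables (these names are illustrative, not the paper's API):

```python
import numpy as np

def masked_return(env, policy, mask_net, episodes=10):
    """Average return when the pretrained policy acts on
    attention-overlaid states (mask applied pixel-wise)."""
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            mask = mask_net(obs)          # attention mask with values in [0, 1]
            action = policy(mask * obs)   # policy only sees the kept pixels
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))
```

Pairing each mask policy's ($\overline{FOR}$, $\overline{BER}$) with its `masked_return` on a given map reproduces the kind of comparison shown in Fig. 9.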
7.3 Explaining the Performance of Different Agents
Generally, RL agents may exhibit different performance even on simple tasks. To explain why the agent performs well or badly, especially in terms of stability and robustness when transferred to novel scenes, the above mask metrics are utilized to analyze the behaviour of multiple RL algorithms and actor architectures. Such an analysis can provide explainable basis for the selection of models and actor architectures.
Case 1: RL algorithm. To explain the performance difference of RL algorithms, we analyze the average return of three popular RL algorithms (PPO, SAC and TD3) with the above mask metrics. As shown in Fig. 11, PPO consistently outperforms both SAC and TD3 on all maps for the lane-following task. The reason is that background information has an adverse effect on the agent's performance, and our mask metrics can help quantify it. Specifically, although the $\overline{FOR}$ of SAC and TD3 is close to one, indicating that almost all task-relevant information is identified, a large amount of background information is also mistakenly attended to, reflected by their small $\overline{BER}$. In contrast, PPO focuses on the main task-relevant information while masking most background information. These conclusions are illustrated and further verified by Fig. 12, which visualizes the performance of PPO, SAC and TD3. Moreover, we observe that PPO shows better stability than SAC and TD3. Concretely, while the PPO agent tends to drive smoothly along the center line of the right lane, both the SAC and TD3 agents have obvious lateral deviation and drive unsteadily.
Case 2: Actor architecture. To understand how the actor architecture affects the agent's performance, we analyze the average return of four popular semantic segmentation architectures (U-Net, RefineNet, FC-DenseNet and DeepLab-v3) with the above mask metrics. Note that RefineNet-1 and RefineNet-2 are identical except that they use ResNet and MobileNet as the backbone network, respectively. Detailed descriptions of these architectures can be found in Appendix A of the supplemental material. As shown in Fig. 10 and Fig. 13, the performance differences among the five agents result from their diverse attention behaviours, as indicated by $\overline{FOR}$ and $\overline{BER}$. Losing much task-relevant information (a small $\overline{FOR}$, as for FC-DenseNet) or mistakenly attending to much background (a small $\overline{BER}$, as for RefineNet-1 and RefineNet-2) leads to poor performance. Moreover, we conclude that a complex actor architecture does not necessarily lead to good performance. In fact, task-relevant features are relatively simple in the context of RL, hence most DNN architectures can extract them well. In practice, small DNNs are generally capable of learning such representations and are preferred, allowing the training algorithm to focus on the credit assignment problem.
7.4 Potential Extension to Self-Supervised Learning
Self-supervised learning is a relatively recent technique in machine learning that addresses the challenge of learning without labelled data. In our work, we presented a self-supervised interpretable framework for vision-based RL, with a two-stage training procedure to train the SSINet in a self-supervised manner. The learning signal is acquired through the direct interaction between the RL agent and the environment, and the whole training process is completely label-free. Empirical results in Fig. 2 and Fig. 5 demonstrate that our method is capable of learning high-quality masks through direct interaction with the environment, without any external supervised signal. In summary, our work may shed light on new paradigms for label-free vision learning such as self-supervised segmentation and detection.
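The label-free learning signal described above rests on action consistency: the pretrained policy should produce (nearly) the same action on the raw state and on the attention-overlaid state, while a regularizer keeps the mask sparse. The following is an illustrative sketch of such an objective, not the paper's exact loss; `reg_scale` stands in for the regularization scale varied in Section 7.2:

```python
import numpy as np

def ssinet_loss(policy, mask, state, reg_scale=0.1):
    """Self-supervised training signal (sketch).

    policy: pretrained actor, maps a state to an action vector.
    mask:   attention mask predicted by SSINet for `state`, values in [0, 1].
    """
    action_full = policy(state)            # action on the raw state
    action_masked = policy(mask * state)   # action on the attention-overlaid state
    # Consistency term: the two actions should agree if the mask keeps
    # all task-relevant information.
    consistency = np.mean((action_full - action_masked) ** 2)
    # Sparsity term: penalize large masks so only causal features survive.
    sparsity = np.mean(mask)
    return consistency + reg_scale * sparsity
```

No label enters this objective: both terms are computed from the agent's own outputs, which is why the whole procedure remains self-supervised.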
8 Conclusion
In this paper, we addressed the growing demand for human-interpretable vision-based RL from a fresh perspective. To that end, we proposed a general self-supervised interpretable framework, which can discover causal features for easily explaining the agent's decision-making process. Concretely, a self-supervised interpretable network (SSINet) was employed to produce high-resolution and sharp attention masks that highlight task-relevant information, which constitutes most of the evidence for the agent's decisions. Then, our method was applied to provide causal explanations and empirical evidence about how the agent makes decisions and why the agent performs well or badly, especially when transferred to novel scenes. Overall, our work takes a significant step towards interpretable vision-based RL. Moreover, our method exhibits several appealing benefits. First, our interpretable framework is applicable to any RL model that takes visual images as input. Second, our method does not use any external labelled data. Finally, we emphasize that our method demonstrates the possibility of learning high-quality masks in a self-supervised manner, which provides an exciting avenue for applying RL to automatic labelling and label-free vision learning such as self-supervised segmentation and detection.
- (2019) Explaining deep neural networks with a polynomial time algorithm for Shapley values approximation. In International Conference on Machine Learning.
- (2019) Towards better interpretability in deep Q-networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4561–4569.
- (2018) Verifiable reinforcement learning via policy extraction. In Advances in Neural Information Processing Systems, pp. 2494–2504.
- (2017) Network dissection: quantifying interpretability of deep visual representations. pp. 6541–6549.
- (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
- (2016) Layer-wise relevance propagation for neural networks with local renormalization layers. In International Conference on Artificial Neural Networks, pp. 63–71.
- (2016) OpenAI Gym.
- (2019) Interpretable visual question answering by reasoning on dependency trees. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
- (2018) Duckietown environments for OpenAI Gym. GitHub. Note: https://github.com/duckietown/gym-duckietown
- (2017) Multi-focus attention network for efficient deep reinforcement learning. In Workshops at the Thirty-First AAAI Conference on Artificial Intelligence.
- (2011) A natural language argumentation interface for explanation generation in Markov decision processes. In International Conference on Algorithmic Decision Theory, pp. 42–55.
- (2008) Policy explanation in factored Markov decision processes. In Proceedings of the 4th European Workshop on Probabilistic Graphical Models (PGM 2008), pp. 97–104.
- (2001) Learning embedded maps of Markov processes. In International Conference on Machine Learning.
- (2017) Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437.
- (2018) Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning.
- (2018) Visualizing and understanding Atari agents. In International Conference on Machine Learning.
- (2019) A survey of methods for explaining black box models. ACM Computing Surveys (CSUR) 51 (5), pp. 93.
- (2015) Policy tree: adaptive representation for policy gradient. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
- (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning.
- (2017) Improving robot controller transparency through autonomous policy explanation. In 2017 12th ACM/IEEE International Conference on Human-Robot Interaction, pp. 303–312.
- (2017) The one hundred layers tiramisu: fully convolutional DenseNets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 11–19.
- (2009) Vision, eye movements, and natural behavior. Visual Neuroscience 26 (1), pp. 51–62.
- (2016) Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations.
- (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1925–1934.
- (2019) What is tabby? Interpretable model decisions by learning attribute-based classification criteria. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774.
- (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
- (2019) Explainable reinforcement learning through a causal lens. arXiv preprint arXiv:1905.10958.
- (2019) Reinforcement learning with attention that works: a self-supervised approach. arXiv preprint arXiv:1904.03367.
- (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048.
- (2017) Learning to navigate in complex environments. In Proceedings of the International Conference on Learning Representations.
- (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
- (2019) Moments in time dataset: one million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2), pp. 502–508.
- (2019) Towards interpretable reinforcement learning using attention augmented agents. In Advances in Neural Information Processing Systems, pp. 12350–12359.
- (2016) Learning to predict where to look in interactive environments using deep recurrent Q-learning. arXiv preprint arXiv:1612.05753.
- (2019) Free-lunch saliency via attention in Atari agents. arXiv preprint arXiv:1908.02511.
- (2016) "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- (2019) Conservative Q-improvement: reinforcement learning for an interpretable decision-tree policy. arXiv preprint arXiv:1907.01180.
- (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
- (1953) A value for n-person games. Contributions to the Theory of Games 2 (28), pp. 307–317.
- (2016) Not just a black box: learning important features through propagating activation differences. arXiv preprint arXiv:1605.01713.
- (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
- (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. In Workshop Proceedings of the International Conference on Learning Representations.
- (2017) SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
- (2015) Deep attention recurrent Q-network. arXiv preprint arXiv:1512.01693.
- (2016) Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
- (2018) Contrastive explanations for reinforcement learning in terms of expected consequences. In Proceedings of the Workshop on Explainable AI at the IJCAI Conference.
- (2020) Paying attention to video object pattern understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2016) Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, Vol. 48, pp. 1995–2003.
- (2018) Learn to interpret Atari agents. arXiv preprint arXiv:1812.11276.
- (2016) Graying the black box: understanding DQNs. In International Conference on Machine Learning, pp. 1899–1908.
- (2014) Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, pp. 818–833.
- (2018) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126 (10), pp. 1084–1102.
- (2016) Learning deep neural network policies with continuous memory states. In 2016 IEEE International Conference on Robotics and Automation, pp. 520–527.
- (2019) Atari-HEAD: Atari human eye-tracking and demonstration dataset. arXiv preprint arXiv:1903.06754.
- (2018) AGIL: learning attention from human for visuomotor tasks. In Proceedings of the European Conference on Computer Vision, pp. 663–679.
- (2018) Interpreting deep visual representations via network dissection. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (9), pp. 2131–2145.
- (2017) Visualizing deep neural network decisions: prediction difference analysis. arXiv preprint arXiv:1702.04595.