Reinforcement Explanation Learning

11/26/2021
by   Siddhant Agarwal, et al.

Deep Learning has become overly complicated and has enjoyed stellar success in solving several classical problems like image classification, object detection, etc. Several methods for explaining these decisions have been proposed. Black-box methods to generate saliency maps are particularly interesting because they do not utilize the internals of the model to explain the decision. Most black-box methods perturb the input and observe the changes in the output. We formulate saliency map generation as a sequential search problem and leverage Reinforcement Learning (RL) to accumulate evidence from input images that most strongly supports decisions made by a classifier. Such a strategy encourages the agent to search intelligently for the perturbations that will lead to high-quality explanations. While successful black-box explanation approaches need to rely on heavy computations and suffer from small-sample approximation, the deterministic policy learned by our method makes it a lot more efficient during inference. Experiments on three benchmark datasets demonstrate the superiority of the proposed approach in inference time over the state-of-the-art without hurting the performance. Project Page: https://cvir.github.io/projects/rexl.html


1 Introduction

Deep Learning methods are enjoying enormous success in computer vision, natural language and robotics research. Over the years, the models have become overly complex and large with millions of parameters. For example, GPT-3 Brown et al. (2020) has about 175 billion parameters. Yet, it remains largely unclear how the system comes to a decision, how certain the model is about its decision, and if and when it can be trusted or when it has to be corrected. Explainable AI (XAI) is being extensively researched Fong et al. (2019); Fong and Vedaldi (2017); Petsiuk et al. (2018); Selvaraju et al. (2017); Springenberg et al. (2015b); Zeiler and Fergus (2013); Zhang et al. (2018); Zhou et al. (2016) to get better insight into AI-made decisions, not only to improve the models but also to instill accountability in them. The explanations can benefit users in many ways, such as improving safety and fairness when relying on AI decisions, especially in safety-critical applications, e.g., healthcare Dave et al. (2020); Pawar et al. (2020), autonomous driving Shen et al. (2020) or criminal justice Larson et al. (2016).

XAI strategies for deep visual models offer model-specific Andreas et al. (2015); Ribeiro et al. (2016) or saliency map based explanations Fong and Vedaldi (2017); Petsiuk et al. (2018); Rebuffi et al. (2019); Selvaraju et al. (2017); Zhou et al. (2016). This work subscribes to the latter, where saliency maps highlight relevant regions in an input, e.g., an image, video or natural language text. Saliency maps are obtained following either a black box or a white/grey box technique. White/grey box approaches Selvaraju et al. (2017); Zhang et al. (2018); Zhou et al. (2016) either assume access to the internals of the base model (the model whose predictions are explained) or modify the base model, which can cost the base model its performance Zhou et al. (2016). This makes them unsuitable to be generalized over all architectures.

Black box methods Fong et al. (2019); Fong and Vedaldi (2017); Petsiuk et al. (2018); Zeiler and Fergus (2013), on the contrary, do not touch the internals of the base model. Instead, the input image pixels are perturbed and passed through the base model. The output prediction resulting from the perturbed image differentiates between important and non-important regions, as perturbing important or relevant regions affects the output score more than perturbing non-important image regions. A popular technique to achieve this is to view saliency maps as weighted masks. The perturbations to the input are performed by masking portions of the image using these weighted masks. Different methods differ in the way these masks are constructed. Fong et al. Fong et al. (2019); Fong and Vedaldi (2017) solve an optimization objective to construct optimal masks, while RISE Petsiuk et al. (2018) uses a large number of random masks for an image. A good explanation is associated with a low deletion score Petsiuk et al. (2018), which measures the drop in probability of a class as pixels are gradually removed according to the importance given by the saliency map.

Figure 1: Approach Overview: At the start, the base model (the classifier whose decision is explained) provides the probability of a class (here aeroplane) in the unperturbed image. Subsequently, at every step, the agent learns to choose important regions of the image to be masked such that, on passing the masked image through the base model, the probability drops. The drop in the probability gives the relevance of the masked region to the base model's decision and is provided by the color-coded weight corresponding to the masked region in the saliency map (last row). The weight value increases from blue to red. Though the initial importance of an early chosen region may not be high, our approach provides due credit to it via a cumulating factor for driving the agent to choose regions leading to greater drops subsequently. The final explanation is obtained by smoothing the discretized saliency map for better visualization. (Best viewed in color.)

In this paper, we address explainability as a sequential search problem where a sequence of image patches or regions is investigated in order to obtain the mask giving the saliency map. Specifically, an agent starting with an unperturbed image learns to choose the best image region to perturb given the choices made earlier. The agent chooses the sequence of regions from the currently unperturbed portions of the image so that perturbing the region maximally drops the class probability, which in turn results in a low deletion score. An approximation of the exhaustive search is performed by RISE Petsiuk et al. (2018) and is shown to achieve a good deletion score albeit being prohibitively expensive both in time and memory requirements. Our approach is fundamentally different: RISE does not make any conscious effort to get a low deletion score, whereas we learn to intelligently search for the mask with the goal of achieving as low a deletion score as quickly as possible. The goal-oriented learning offered by the proposed method not only makes the inference fast but also makes it deterministic.

The core of the problem is to find a policy that assesses the image perturbed by the present mask and decides to perturb a new relevant image region. Figure (1) illustrates the decision making process of such an agent. At first, the unperturbed image is passed through the base classifier to get the probability of the class present in the image (an 'aeroplane' in Figure (1)). The agent then starts by choosing a patch from the image and replacing it with uniform random noise. The perturbed image is assessed by the base model to get the change in probability of the same class. Subsequently, the agent decides to pick the most important region so that perturbing it decreases the probability of the class further, until the agent decides to stop exploring or the maximum number of patches is perturbed. As the perturbed regions contribute to a drop of class probability, these regions are pivotal evidence for the classifier to classify the image. The significance of the perturbed regions behind the base classifier's decision is manifested by the drop in probability. Thus at every step, the chosen region gets a weight proportional to the drop of the class probability. These weighted regions provide a discrete weighted mask that, after smoothing, provides the saliency map explaining the decision of the classifier. As region perturbation is sequential and its effect on the base model is cumulative, we introduce a cumulating factor that partially credits regions perturbed earlier which, instead of an immediate drop, may have resulted in a significant drop in class probability later.
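To make this loop concrete, the Python sketch below mirrors the description above. It is only an illustration: `base_model`, `policy` and `perturb_fn` are hypothetical placeholders for the classifier, the learned agent and the masking operation, and the geometric back-credit used here for the cumulating factor is an assumption standing in for the exact crediting rule, which is described only qualitatively.

```python
import numpy as np

def explain_sequentially(image, class_idx, base_model, policy, perturb_fn,
                         n_regions, beta=0.5):
    """Illustrative rollout of the sequential perturb-and-score idea.

    `base_model(img)` is assumed to return class probabilities,
    `policy(img)` to pick the next grid region, and `perturb_fn(img, region)`
    to replace that region with uniform noise; all three are placeholders.
    """
    weights = np.zeros(n_regions)
    masked = image.copy()
    prev_prob = base_model(masked)[class_idx]
    chosen = []                                    # regions masked so far, in order

    for _ in range(n_regions):
        region = policy(masked)                    # agent picks a region to perturb
        masked = perturb_fn(masked, region)        # mask it with noise
        prob = base_model(masked)[class_idx]
        drop = prev_prob - prob                    # evidence strength of this region
        weights[region] += drop
        # cumulating factor: partially credit regions chosen earlier,
        # since they may have enabled this larger drop (illustrative rule only)
        for k, earlier in enumerate(reversed(chosen), start=1):
            weights[earlier] += (beta ** k) * drop
        chosen.append(region)
        prev_prob = prob

    return weights / (np.abs(weights).sum() + 1e-8)  # normalized region weights
```

The returned vector holds one weight per grid region; turning it into a smooth full-resolution map is described in Section 3.3.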

Learning such a dynamic agent, capable of performing sequential actions towards a long-term goal by actively interacting with the environment in the absence of any supervision, is performed by Reinforcement Learning (RL). Motivated by its recent success Su et al. (2017); Weisz et al. (2018) coupled with its ease of use, we use ACER Wang et al. (2016) to learn an RL agent giving the final saliency map. Our approach is equally capable of learning the optimal perturbing sequence on a single image, a class of images or a whole dataset of images. This makes our approach flexible in terms of scalability and generalization. To the best of our knowledge, we are the first to propose an RL based strategy for explaining AI-made decisions, and we term our approach Reinforcement Explanation Learning (RExL).

We perform experiments on three popular benchmarks in Computer Vision: ImageNet Deng et al. (2009), PASCAL VOC Everingham et al. (2010) and MSCOCO Lin et al. (2014). For each of these datasets, we use VGG16 Simonyan and Zisserman (2014) and ResNet50 He et al. (2015) as the base models, i.e. the classifiers whose decisions are to be explained. We compare our results with recent black box and white box methods. We show that RExL does not hurt the performance and provides competitive results; in fact, it performs better than the baselines in a few cases. Finally, we also show the speedup given by RExL compared to other black box techniques.

2 Related Work

There has been a lot of work in Explainable AI Simonyan et al. (2013); Springenberg et al. (2015a); Bang et al. (2019); Petsiuk et al. (2018); Fong et al. (2019); Selvaraju et al. (2017) recently. These works use different techniques to explain the decisions taken by Deep Learning models. Some of them Bang et al. (2019); Chen et al. (2018) leverage Information Theory to maximize the mutual information between the compressed representation and the decision. Bang et al. (2019), in addition, also reduces the mutual information between the explanation and the input to ensure that the explanation chooses the minimal subset of input features. These involve complex objective functions which are difficult to use.

Backpropagation Based Methods generate the importance measure of a pixel by backpropagating the output of a deep neural network back to the input space using gradients or their variants. Some of them Baehrens et al. (2010); Simonyan et al. (2013) use the derivatives of a class score with respect to the image as an importance score. Guided Backprop Springenberg et al. (2015a) and DeConvNet Zeiler and Fergus (2013) modify the backpropagation scheme for ReLU, while Excitation Backprop Zhang et al. (2018) ensures that the sum of the attribution signal is unity. Integrated gradients Qi et al. (2019) additionally accumulate gradients along a path from a base image to the input image. Activation based methods like CAM Zhou et al. (2016) (and its variants like Grad-CAM Selvaraju et al. (2017) and NormGrad Rebuffi et al. (2019)) use a linear combination of activation values across the convolutional layers. Grad-CAM and NormGrad use gradient information to weigh the maps. These methods can generate saliency maps for any activation layer in the network, but it is found that later layers, i.e. layers closer to the output, generate more informative saliency maps.

These techniques achieve interpretability by making changes to a white-box model or utilising its internals. Thus, they cannot always be generalised over all model architectures. This creates a need for black box techniques that can be used with any base model. LIME Ribeiro et al. (2016) is a black box approach which draws random samples around each instance to be explained and fits an approximate local linear decision model in the vicinity of the input. For complex non-linear classifiers, LIME fails to achieve good performance, and its dependence on superpixels leads to inferior saliency maps. Perturbation based methods perturb the inputs (e.g., blur some pixels or add some noise) and measure the response of the model. Wagner et al. (2019) is such a method that also fine-grains gradients to avoid adversarial examples. RISE Petsiuk et al. (2018) is a perturbation based method that generates a large number of random masks and applies them to the image. It then generates the saliency map as the weighted sum of these masks, where the weights are the class scores predicted by the model after each mask is applied. It has been successful in generating decent saliency maps, but it uses a large number of masks, which makes it computationally expensive and time consuming. Meaningful perturbations Fong and Vedaldi (2017) and Extremal perturbations Fong et al. (2019) solve an optimization objective on single images to generate explanations, but these involve several iterations for every image and take time.

Reinforcement Learning is increasingly being adopted in standard computer vision tasks like Object Detection Uzkent et al. (2019); Caicedo and Lazebnik (2015); Mathe et al. (2016), Visual Question Answering Liu et al. (2018), Action Recognition Gowda et al. (2021), etc. TRPO Schulman et al. (2015), PPO Schulman et al. (2017) and DQN Mnih et al. (2013) are the most commonly used RL algorithms. TRPO and PPO are on-policy while DQN is off-policy. There have been many developments in Actor Critic methods Mnih et al. (2016); Wang et al. (2016); Haarnoja et al. (2018), including both off-policy algorithms Haarnoja et al. (2018) and on-policy methods Mnih et al. (2016). There has been some work Raman et al. (2020) where RL agents are used to perturb the inputs to study model behaviour. In most cases, the agent chooses a particular perturbation on the input and then gets rewards from the outputs of a downstream model for the perturbed input. We follow a perturbation based explanation technique that uses RL to learn an optimal mask and generate accurate saliency maps quickly.

3 Proposed Approach

We now describe the proposed method, which leverages Reinforcement Learning to obtain a saliency based explanation of a convnet classifier where, at any time, the most convincing evidence is collected depending on what evidence has already been collected.

3.1 Problem Definition

Formally, we deal with a dataset of images where each image can contain one or more objects. The base model, i.e. the model we want to explain, provides the probabilities indicating the presence of at least one object of class $c$ among $C$ classes. For each object in the image, the explanation generation task is to output importance values for each pixel (or sub-image/patch) of the image. An explanation is good if gradually removing pixels from the image in order of decreasing importance score confuses the base model. This is manifested by a sharp drop in the probability of the object compared to the case when all pixels were present.

We cast explanation generation as a Reinforcement Learning (RL) problem in a Markov Decision Process (MDP) setting. An MDP has a set of actions $\mathcal{A}$, a set of states $\mathcal{S}$ and a reward function $r(s, a)$. At any time instant $t$, the agent selects an action $a_t \in \mathcal{A}$, executes it and, as a result, the state changes from $s_t$ to $s_{t+1}$. From the environment, the agent receives 1) a scalar reward $r_t$ and 2) the next state $s_{t+1}$ in the form of an observation. The action to be taken by the agent is governed by a policy $\pi(a|s)$ that takes in the environment state at any point of time and outputs a probability distribution over the set of actions. The goal of an RL agent is to learn an optimal policy $\pi^*$ such that the sum of rewards through an episode is maximized. Next we outline the states, actions and rewards of the proposed RL setup.

States: For an image of size $H \times W$, the number of possible masks is naturally $2^{HW}$, as each pixel can either be masked or not. The state at any time instant is the masked image where masked pixels are replaced with random noise sampled uniformly in the range of the pixel intensity values. For a reasonably sized image, the state-space, and thus the search-space, of size $2^{HW}$ can be huge. We therefore relax the state-space size by discretizing the image into an $N \times N$ grid.
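As a concrete illustration of how a state is constructed, the sketch below masks one cell of an $N \times N$ grid with uniform random noise. The default grid size of 7 and the noise range [0, 1] are placeholders, not the paper's exact settings; the image is assumed to be an HxWxC array.

```python
import numpy as np

def mask_grid_cell(image, cell_idx, n=7, low=0.0, high=1.0, rng=None):
    """Replace one cell of an n x n grid with uniform random noise.

    The noise is sampled uniformly over an assumed pixel-intensity range
    [low, high]; `cell_idx` indexes the grid cells row by row.
    """
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    row, col = divmod(cell_idx, n)
    r0, r1 = row * h // n, (row + 1) * h // n
    c0, c1 = col * w // n, (col + 1) * w // n
    out = image.copy()
    out[r0:r1, c0:c1] = rng.uniform(low, high, size=out[r0:r1, c0:c1].shape)
    return out
```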

Actions: Denoting the probability given by the base model for the image to belong to class $c$ as $P_c$, the actions taken by the agent aim to reduce $P_c$ to a minimum. The agent chooses regions to construct the mask sequentially so that the base model provides a low probability or a low score for the masked image to belong to that class. It should be noted that, theoretically, the agent can also choose an already masked region. However, a well trained agent would find no incentive in performing such an action, as the probability value will not reduce much (if at all) as a result of masking an already masked region. For an image divided into an $N \times N$ grid, the number of possible actions is naturally $N^2$. There seems to be an incentive in terms of training computation to stop the search early by limiting the agent to take only a preset number of actions. However, this may restrict the exploration ability of the agent. Thus, to strike a balance, we allow the agent to take as many actions as there are grid cells, i.e., $N^2$.

Rewards: The reward function quantifies the worth of the agent's effort in successfully masking important image regions. A successful masking operation, in turn, provides a good saliency map explaining the decision behind predicting the presence of a class in the image. This is manifested by the drop in probability of the class present in it. As a natural choice, the reward is thus taken to be the negative of the probability given by the base model for the class when the perturbed image is passed through the base model. It is given by,

$r(s_t, a_t) = -P_c(s_{t+1})$   (1)

where $s_t$ is the state and $a_t$ is the action executed at timestep $t$, and $P_c(s_{t+1})$ is the probability of the image belonging to class $c$ when the masked image at timestep $t+1$ is passed through the base model. To maximize the discounted return $R_t = \sum_{k \geq 0} \gamma^k r_{t+k}$, the agent has to choose the regions such that the classification score of the class drops as quickly as possible, thereby minimizing the deletion score. The discount factor $\gamma$ determines the present value of future rewards. As $\gamma \to 0$, the agent tries to maximize only the immediate rewards, while a value close to $1$ means the agent becomes farsighted.
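Putting states, actions and rewards together, a stripped-down environment could look like the sketch below. It reuses the hypothetical `mask_grid_cell` helper sketched under States and assumes `base_model` returns class probabilities; a real implementation would subclass `gym.Env` and declare a `Discrete(N*N)` action space and an image observation space.

```python
class SaliencyEnv:
    """Minimal sketch of the MDP: states are partially masked images, actions
    are grid cells to mask, and the reward is the negative class probability
    as in Equation (1)."""

    def __init__(self, image, class_idx, base_model, n=7):
        self.image, self.class_idx = image, class_idx
        self.base_model, self.n = base_model, n

    def reset(self):
        self.state = self.image.copy()
        self.steps = 0
        return self.state

    def step(self, action):
        # executing an action masks one grid cell of the current state
        self.state = mask_grid_cell(self.state, action, n=self.n)
        prob = self.base_model(self.state)[self.class_idx]
        reward = -float(prob)                  # Equation (1): r_t = -P_c(s_{t+1})
        self.steps += 1
        done = self.steps >= self.n * self.n   # episode length = number of cells
        return self.state, reward, done, {}
```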

3.2 Solution Strategy

The goal of the agent is to choose relevant regions of the input image to mask such that the probability or the score of the masked image drops as quickly as possible during the interactions with the environment until termination. The core of the problem is to find the policy function mapping states to actions that guides the agent's decision making process. We use the recently proposed Actor-Critic with Experience Replay (ACER) Wang et al. (2016), an improvement over the traditional actor-critic algorithm that is more sample efficient.

The policy function of the agent is represented by a multi-layer neural network with learnable parameters $\theta$. An episode in an RL setting is composed of all the state-action pairs from the start to the terminal action and is denoted as $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$, where $T$ is the number of steps taken before the termination of the episode. The agent learns the optimal policy by maximizing the expected return over the possible trajectories,

$J(\theta) = \mathbb{E}_{\tau \sim \pi}\big[\sum_{t} \gamma^t \, r(s_t, a_t)\big]$   (2)

where $\pi_\theta$ is abbreviated as $\pi$ above to reduce clutter. The objective function in Equation (2) can be directly optimized using standard optimization techniques like SGD, but it suffers from high variance, especially under small samples, resulting in very fragile convergence. With the help of the state and action value functions, Actor Critic algorithms Konda and Tsitsiklis (2000) overcome this limitation. While the state value function $V^\pi(s)$ gives the expected return given that the agent is at state $s$ and follows the policy $\pi$, the action value function $Q^\pi(s, a)$ is the expected return given that the agent performs action $a$ at state $s$ and follows the policy $\pi$ thereafter. The modified objective in terms of the advantage $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ is given by,

$\nabla_\theta J(\theta) = \mathbb{E}\big[\sum_{t} \nabla_\theta \log \pi(a_t|s_t)\, A^\pi(s_t, a_t)\big]$   (3)

The basic objective function (Equation (3)) optimized by Actor-Critic methods is inherently on-policy. ACER's objective contains both off-policy and on-policy components. For the off-policy component, it uses the Retrace algorithm Munos et al. (2016) to estimate the action value function and further uses importance sampling. The on-policy part is very similar to the basic objective shown in Equation (3).
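For intuition, the on-policy part of this objective reduces to the familiar advantage-weighted policy gradient. The PyTorch sketch below illustrates only that simplified term; it deliberately omits ACER's Retrace-based off-policy correction, importance sampling and trust-region update.

```python
import torch

def actor_critic_loss(logits, actions, returns, values):
    """Simplified on-policy actor-critic loss in the spirit of Equation (3).

    logits:  policy outputs over actions, shape (T, num_actions)
    actions: actions taken, long tensor of shape (T,)
    returns: sampled (discounted) returns, shape (T,)
    values:  critic estimates V(s_t), shape (T,)
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = returns - values                     # estimate of A(s_t, a_t)
    policy_loss = -(chosen * advantage.detach()).mean()
    value_loss = advantage.pow(2).mean()             # critic regression term
    return policy_loss + value_loss
```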

Figure 2: Training the Reinforcement Learning Agent for Explaining an Image Classification Model. The state $s_t$ at time $t$ is a partially masked image. The RL agent, which consists of a fixed feature extractor and a learnable MLP, chooses an action $a_t$. The action is the region of the image that needs to be masked. The result of the action is the next state, $s_{t+1}$. The base model evaluates $s_{t+1}$ to provide the probability of the class being explained, which is used to construct the reward $r_t$.

An illustration of our proposed approach RExL is provided in Figure (2). We provide a saliency map based explanation for the classification decision made by the base model on an image. The image is given as the input. The RL agent of RExL consists of a pretrained feature extractor and a multi-layer perceptron (MLP). The feature extractor remains fixed during training, and the learnable MLP is the policy network that predicts the action. At every step, the agent observes the perturbed image, which constitutes its state, and predicts the next action. The action is a specific region in the image to mask. Once the image is perturbed, i.e. the action is executed, the base model gives the classification score for the class to be explained. This confidence score is used as the reward during training (ref. Equation (1)) and also to generate the explanations as saliency maps. The perturbed image becomes the next state and the loop continues until termination.

RExL gives the flexibility to train a single agent to explain all images of a dataset (dataset specific), multiple agents each expertly trained to explain a single class (class specific), or extremely fine-grained agents that are experts in explaining a single image (image specific). We train the dataset specific agents using the whole training partition of the dataset on which the base model was trained. A class specific agent is trained on a single class of training images. Finally, an image specific agent is trained only on the image to be explained. The explainability of image specific agents is very good, but this is not scalable. Although a dataset specific agent is scalable and performs well for images containing mostly single objects, the performance may degrade for a complex dataset with images having multiple instances from possibly multiple classes. As our experiments show, the best trade-off between performance and scalability is achieved by class specific agents.

3.3 Inference

After an agent is trained, it is used to generate saliency maps. Given an image, the agent sequentially deletes a block of the grid from it and incrementally builds the saliency map. Let a block of the image be perturbed as a result of the action taken by the agent at time $t$, and let the change in classification confidence from the previous state be $\Delta P_t$. The saliency weight assigned to that block is computed from the confidence drops using a cumulating factor that gives due credit to previous deletions that may have resulted in a greater drop in probability at a later stage. At one extreme value of the cumulating factor, we get the trivial case where only the grid cell that is deleted most recently gets all the credit. Finally, the map is normalized so that the saliency weights sum to 1.
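As a sketch of this final step, the per-cell weights can be upsampled to image resolution and normalized. Bilinear interpolation is used here only as a stand-in for the smoothing, whose exact form is not specified in this copy; `weights` is the length-$N^2$ vector of accumulated drops.

```python
import torch
import torch.nn.functional as F

def grid_weights_to_saliency(weights, n, image_hw):
    """Turn per-cell saliency weights (length n*n) into a smooth HxW map."""
    grid = torch.tensor(weights, dtype=torch.float32).reshape(1, 1, n, n)
    sal = F.interpolate(grid, size=image_hw, mode="bilinear", align_corners=False)
    sal = sal.squeeze().numpy()
    return sal / (sal.sum() + 1e-8)   # normalize so the weights sum to 1
```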

4 Experiments

Datasets and Evaluation Metrics: We evaluate the performance of our proposed method on three benchmark datasets, namely ImageNet Deng et al. (2009), PASCAL VOC Everingham et al. (2010) and MSCOCO Lin et al. (2014). Following standard practice, we report results on the val, test and val splits respectively. For all three datasets, we use pretrained VGG16 Simonyan and Zisserman (2014) and ResNet50 He et al. (2015) as base models. For ImageNet, we use the trained models provided by the PyTorch Model Zoo (https://pytorch.org/vision/stable/models.html). For PASCAL VOC and MSCOCO, we use the same multilabel models as RISE, made public by Zhang et al. (2018).

We follow Petsiuk et al. (2018) and use the causal deletion and insertion metrics to compare the performance of different approaches Petsiuk et al. (2018); Fong et al. (2019); Rebuffi et al. (2019); Selvaraju et al. (2017). While GradCAM Selvaraju et al. (2017) and NormGrad Rebuffi et al. (2019) are white box techniques that make use of the intermediate activations and gradients of the network, RISE Petsiuk et al. (2018) and Extremal Perturbations Fong et al. (2019) are black box techniques that only study the input-output relationship of the model. RISE is a sampling based method: it samples thousands of random masks and weighs them based on their performance. Extremal Perturbations, on the other hand, is a learning based method which solves an optimization objective for each image to obtain the corresponding mask.

The deletion metric removes pixels from the image gradually according to the importance weights given by the saliency map and measures the classification score. The deletion score is the area under the curve (AUC) of these classification scores versus the percentage of pixels removed. Similarly, the insertion score evaluates the performance inversely, i.e., by uncovering highly blurred image regions gradually according to the importance weights and computing the AUC. While a low deletion score is indicative of a better explanation, for the insertion score the opposite is true. Though RExL follows the deletion philosophy during training, its insertion score is also competitive. We posit that an agent trained with an insertion strategy (i.e., starting with a blank canvas and learning to choose relevant regions to uncover) would be equally useful, and we keep it as a possible future work. While our primary aim is to improve the inference time of black box models, we first show that RExL does not hurt the performance; in fact, in some cases it performs better than even white box methods. Thereafter, we provide a runtime analysis showing the significant speedup given by our approach compared to similar black box XAI strategies.
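A minimal sketch of the deletion metric is given below (insertion is analogous, gradually uncovering a blurred copy instead). The per-step pixel budget of 50 steps and the zero fill used for removed pixels are assumptions, not necessarily the settings used by Petsiuk et al. (2018).

```python
import numpy as np

def deletion_score(image, saliency, class_idx, base_model, n_steps=50):
    """AUC of the class probability as pixels are removed in decreasing
    saliency order. `base_model` is assumed to return class probabilities,
    `image` is HxWxC and `saliency` is HxW."""
    order = np.argsort(saliency.reshape(-1))[::-1]          # most salient first
    work = image.copy().reshape(-1, image.shape[-1])
    per_step = int(np.ceil(len(order) / n_steps))
    probs = [base_model(work.reshape(image.shape))[class_idx]]
    for i in range(n_steps):
        idx = order[i * per_step:(i + 1) * per_step]
        work[idx] = 0.0                                      # remove these pixels
        probs.append(base_model(work.reshape(image.shape))[class_idx])
    return float(np.trapz(probs, dx=1.0 / n_steps))          # AUC over fraction removed
```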

Implementation Details: The policy feature extractor can be any good pretrained feature extractor and is independent of the base model. We use a ResNet pretrained on the dataset the RL agent is being trained on as the feature extractor for the policy. The learnable MLP consists of two hidden layers. We fix the grid size $N$, and the length of an episode is thus $N^2$. We trained ACER (implementation: https://stable-baselines.readthedocs.io/en/master/modules/acer.html) for a preset number of steps in the class specific and dataset specific settings, and used a discount factor of 1.0 throughout our experiments. For all the inferences, we use the same cumulating factor unless explicitly specified. As the dataset specific setting uses a single agent to explain all the classes in the dataset, information about the class to be explained is passed as a one-hot encoding along with the image during training. We used NVIDIA GeForce RTX Ti GPUs for training all our models. Next we discuss the experimental results for all three variants of RExL, i.e., dataset specific (RExL-DS), class specific (RExL-CS) and image specific (RExL-IS). Due to limitations in space, we provide the comparison over the deletion metric only. Additional results including insertion scores, more implementation details and source code are provided in the appendix.
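A rough sketch of how training could be wired up with the stable-baselines ACER implementation referenced above is shown below. `SaliencyEnv`, `image`, `class_idx` and `base_model` are placeholders (the environment must be a proper `gym.Env`), the total step count is illustrative, and only the discount factor of 1.0 is taken from the hyperparameter table in the appendix; everything else is a library default or an assumption.

```python
from stable_baselines import ACER
from stable_baselines.common.vec_env import DummyVecEnv

# SaliencyEnv is the masking MDP sketched in Section 3.1, wrapped as a gym.Env
# with a Discrete(N*N) action space and an image observation space.
env = DummyVecEnv([lambda: SaliencyEnv(image, class_idx, base_model, n=7)])

# gamma=1.0 follows the appendix; a custom policy class would be supplied to
# realize the frozen feature extractor + MLP head described in the text.
model = ACER("CnnPolicy", env, gamma=1.0, verbose=1)
model.learn(total_timesteps=100_000)   # illustrative step count only
```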

4.1 Results and Discussions

Figure 3: Comparison of the saliency maps: The top three rows show saliency maps for random samples from VOC when the RExL agent is applied to the VGG base model. Likewise, the bottom rows correspond to the ResNet base model. The leftmost column shows the original image; the rest of the column headings show the approach used to generate the respective saliency maps. The results show the superiority of the proposed approach over the black box and white box approaches, as well as the greater specificity of the maps produced by RExL-CS compared to RExL-DS.
Method ImageNet PASCAL VOC 2007 COCO 2014
VGG16 ResNet50 VGG16 ResNet50 VGG16 ResNet50
GradCAM Selvaraju et al. (2017) 0.109 0.123 0.275 0.310 0.225 0.261
NormGrad Rebuffi et al. (2019) 0.086 0.126 0.285 0.349 0.232 0.303
RISE Petsiuk et al. (2018) 0.098 0.108 0.275 0.290 0.240 0.227
Extremal Fong et al. (2019) 0.125 0.126 0.355 0.357 0.292 0.299
RExL-DS 0.130 0.122 0.342 0.243 0.284 0.200
Table 1: Comparison of deletion scores on ImageNet, PASCAL VOC 2007 and COCO 2014 datasets. RExL-DS outperforms the white box methods on VOC and COCO for the ResNet base model. For the weaker VGG base model, RExL-DS is competitive.
Method VGG16 ResNet50
RExL-DS 0.342 0.243
RExL-CS 0.292 0.216
Table 2: Comparison of the Deletion scores of RExL-DS and RExL-CS on VOC2007 test split. The deletion score is improved for the more specific RExL-CS agent compared to RExL-DS for both the base models.

Table (1) shows a comparative evaluation of the deletion scores on the three benchmark datasets with two different base models explaining the classes present in the images. RExL-DS performs better when explaining a stronger model like ResNet. For the PASCAL VOC and COCO datasets, RExL-DS outperforms all the baselines (including the white box approaches GradCAM and NormGrad), while for ImageNet its performance is at par (slightly better than most of the other baselines). With VGG, the performance is at par with the competing methods. The comparatively higher deletion scores of RExL-DS for VGG are intuitive: since this is an inferior base model, the reward for a right choice of image region to delete may not always be right.

As we move to class specific RExL agents (ref. Table (2)), our approach becomes more competitive for the VGG base model, while further improvement is seen in the case of ResNet, making RExL-CS the new state of the art for PASCAL VOC. In the case of ImageNet, due to the presence of a large number of classes, we focus on explaining the top 10 classes for which the VGG16 and ResNet50 classifiers give the best classification accuracy. For the ResNet50 base model, the average deletion score of the RExL-CS agents trained on these classes is lower than that of RISE, the closest competitor. Likewise, for VGG16, the proposed approach outperforms RISE, again the closest competitor. The improvement over RExL-DS shows the flexibility of our approach as well as the expertise imbibed by the class specific models compared to a single agent trained to explain all the classes in the dataset. Additional results on both ImageNet and PASCAL VOC are provided in the appendix.

Figure (3) shows a comparison between the saliency maps generated by different baseline methods and the two variants of RExL. The targeted saliency and much lower noise of the RExL maps make them qualitatively better than the others. While extremal perturbation Fong et al. (2019) generates less noisy maps, its dependence on a manually set area of relevant regions can hurt if the object of interest does not conform to this preset value (pronounced in the top row for VGG and the penultimate row for ResNet). This is especially detrimental for RISE, where the probability of masking regions in the image is likewise fixed. RExL, on the other hand, learns to mask regions that are important for the base model's decision on the class without requiring the area to be prespecified. Improvement by our method is also observed when the specificity of the agent increases and it becomes more expert, i.e., when we go from dataset-specific to class-specific agents. This is especially prominent for the image of the bicycle in the case of VGG and for the image of the cow in the case of ResNet.

Although RExL-IS is the most specific, the time to train image specific models for all images is also high. As a result, we do not evaluate RExL-IS quantitatively on all images. Instead, we provide a qualitative comparison with existing approaches on a few images along with a comparison of the deletion scores. Due to space limitations, we provide additional qualitative examples, including RExL-IS comparisons, in the appendix.

4.2 Comparison of Inference Time

The major advantage of, and motivation for, RExL is the targeted fast search of relevant image regions without degrading the quality of the explanations. In this section, we compare the inference time of RExL with that of state-of-the-art black box explanation techniques, e.g., RISE and the extremal perturbation approach. We do not compare the running time of RExL with white box techniques like GradCAM and NormGrad because they are much quicker but come at the cost of peeking into the model. Our aim is to bring about a considerable speedup in the inference time of a black box explanation technique without hurting the performance.

For comparing the running times, we use a computer with one NVIDIA GeForce GTX Ti GPU, a Ryzen CPU and DDR RAM. Table (3) shows the average running times. RExL provides roughly an order-of-magnitude speedup over both approaches for the VGG base model. For the ResNet base model, the improvement in speed is roughly 5x over RISE and roughly 10x over extremal perturbation. RExL-CS is comparatively faster than RExL-DS, as RExL-DS gets an additional class encoding as input. The faster running time with comparable or better performance shows the merit of our RL based explainable AI approach.

Method ImageNet PASCAL VOC MSCOCO
VGG16 ResNet50 VGG16 ResNet50 VGG16 ResNet50
RExL-CS 1.461 1.494 1.748 1.488 1.738 1.528
RExL-DS 1.869 1.872 1.754 1.532 1.800 1.519
RISE Petsiuk et al. (2018) 14.618 8.853 20.408 8.370 20.397 8.392
Extremal Fong et al. (2019) 13.722 15.975 19.114 16.298 19.118 16.298
Table 3: Comparison of running times (in seconds) while generating saliency maps. RExL-DS and RExL-CS are compared with the related black box methods (RISE and Extremal Perturbations). Both variants of RExL outperform the competing methods by a large margin.
Figure 4: Comparison between the saliency maps produced by RExL-DS for different values of the cumulating factor (values shown on top). The leftmost column shows the original image from PASCAL VOC.
Inference (cumulating factor) VGG16 ResNet50
0.362 0.290
0.349 0.266
0.347 0.260
0.342 0.243
Table 4: Comparison of the deletion scores for different values of the cumulating factor during inference on the VOC test split using RExL-DS.

4.3 Ablation on Cumulating Factor

To see the effect of the cumulating factor while assigning credit to the chosen image regions, we experimented with different values of it for RExL-DS on the PASCAL VOC dataset. Table (4) shows the comparison. Though the values are close, one important observation is that giving the entire credit at a given instant to the region that is deleted most recently hurts the performance. We use the best-performing value of the cumulating factor for all the datasets. Figure (4) shows sample saliency maps with different values applied to the same image. For a large object, naturally, a large portion of the image is responsible for the base model's decision, and this is rightly captured when credit assignment is delayed instead of immediate.

5 Conclusions

We have presented a reinforcement learning based model for explaining deep classifiers. In contrast to methods that approximate an exhaustive search, we propose to formulate XAI as an intelligent search and direct an agent to efficiently choose and accumulate evidence responsible for AI-made decisions. Our model can be dataset, category or image specific, striking a balance between model specificity and training burden. Ours is a plug-and-play black box approach that is applicable to any classifier without requiring modification of its internals. We report encouraging results on three benchmark datasets, achieving significant speed-up over related black box XAI approaches.

References

  • [1] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2015) Deep compositional question answering with neural module networks. CoRR abs/1511.02799. External Links: Link, 1511.02799 Cited by: §1.
  • [2] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K. Müller (2010-08) How to explain individual classification decisions. J. Mach. Learn. Res. 11, pp. 1803–1831. External Links: ISSN 1532-4435 Cited by: §2.
  • [3] S. Bang, P. Xie, W. Wu, and E. P. Xing (2019) Explaining a black-box using deep variational information bottleneck approach. CoRR abs/1902.06918. External Links: Link, 1902.06918 Cited by: §2.
  • [4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language Models are Few-Shot Learners. CoRR abs/2005.14165. External Links: Link, 2005.14165 Cited by: §1.
  • [5] J. C. Caicedo and S. Lazebnik (2015) Active object localization with deep reinforcement learning. CoRR abs/1511.06015. External Links: Link, 1511.06015 Cited by: §2.
  • [6] J. Chen, L. Song, M. J. Wainwright, and M. I. Jordan (2018) Learning to explain: an information-theoretic perspective on model interpretation. CoRR abs/1802.07814. External Links: Link, 1802.07814 Cited by: §2.
  • [7] D. Dave, H. Naik, S. Singhal, and P. Patel (2020) Explainable AI meets healthcare: A study on heart disease dataset. CoRR abs/2011.03195. External Links: Link, 2011.03195 Cited by: §1.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §1, §4.
  • [9] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §1, §4.
  • [10] R. C. Fong and A. Vedaldi (2017-10) Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §A.1, §1, §1, §1, §2.
  • [11] R. Fong, M. Patrick, and A. Vedaldi (2019-10) Understanding deep networks via extremal perturbations and smooth masks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §A.1, Table 6, §1, §1, §2, §2, §4.1, Table 1, Table 3, §4.
  • [12] S. N. Gowda, L. Sevilla-Lara, F. Keller, and M. Rohrbach (2021) CLASTER: clustering with reinforcement learning for zero-shot action recognition. CoRR abs/2101.07042. External Links: Link, 2101.07042 Cited by: §2.
  • [13] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR abs/1801.01290. External Links: Link, 1801.01290 Cited by: §2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §1, §4.
  • [15] V. R. Konda and J. N. Tsitsiklis (2000) Actor-critic algorithms. In Neural Information Processing Systems, pp. 1008–1014. Cited by: §3.2.
  • [16] J. Larson, S. Mattu, L. Kirchner, and J. Angwin (2016) How we analyzed the compas recidivism algorithm. ProPublica. External Links: Link Cited by: §1.
  • [17] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. CoRR abs/1405.0312. External Links: Link, 1405.0312 Cited by: §1, §4.
  • [18] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, and C. Sun (2018) Inverse visual question answering: A new benchmark and VQA diagnosis tool. CoRR abs/1803.06936. External Links: Link, 1803.06936 Cited by: §2.
  • [19] S. Mathe, A. Pirinen, and C. Sminchisescu (2016-06) Reinforcement learning for visual object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. CoRR abs/1602.01783. External Links: Link, 1602.01783 Cited by: §2.
  • [21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller (2013) Playing atari with deep reinforcement learning. CoRR abs/1312.5602. External Links: Link, 1312.5602 Cited by: §2.
  • [22] R. Munos, T. Stepleton, A. Harutyunyan, and M. G. Bellemare (2016) Safe and efficient off-policy reinforcement learning. CoRR abs/1606.02647. External Links: Link, 1606.02647 Cited by: §3.2.
  • [23] U. Pawar, D. O’Shea, S. Rea, and R. O’Reilly (2020) Explainable ai in healthcare. In 2020 International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), Vol. , pp. 1–2. External Links: Document Cited by: §1.
  • [24] V. Petsiuk, A. Das, and K. Saenko (2018-09) RISE: Randomized Input Sampling for Explanation of Black-box Models. In British Machine Vision Conference, Cited by: §A.1, §A.3, Table 6, §1, §1, §1, §1, §2, §2, Table 1, Table 3, §4.
  • [25] Z. Qi, S. Khorram, and F. Li (2019) Visualizing Deep Networks by Optimizing with Integrated Gradients. CoRR abs/1905.00954. External Links: Link, 1905.00954 Cited by: §2.
  • [26] M. Raman, S. Agarwal, P. Wang, A. Chan, H. Wang, S. Kim, R. A. Rossi, H. Zhao, N. Lipka, and X. Ren (2020) Learning to deceive knowledge graph augmented models via targeted perturbation. CoRR abs/2010.12872. External Links: Link, 2010.12872 Cited by: §2.
  • [27] S. Rebuffi, R. Fong, X. Ji, H. Bilen, and A. Vedaldi (2019) NormGrad: Finding the Pixels that Matter for Training. CoRR abs/1910.08823. External Links: Link, 1910.08823 Cited by: Table 6, §1, §2, Table 1, §4.
  • [28] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. CoRR abs/1602.04938. External Links: Link, 1602.04938 Cited by: §1, §2.
  • [29] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015) Trust region policy optimization. CoRR abs/1502.05477. External Links: Link, 1502.05477 Cited by: §2.
  • [30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: Link, 1707.06347 Cited by: §2.
  • [31] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017-10) Grad-cam: visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Table 6, §1, §1, §2, §2, Table 1, §4.
  • [32] Y. Shen, S. Jiang, Y. Chen, E. Yang, X. Jin, Y. Fan, and K. D. Campbell (2020) To explain or not to explain: A study on the necessity of explanations for autonomous vehicles. CoRR abs/2006.11684. External Links: Link, 2006.11684 Cited by: §1.
  • [33] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps.. CoRR abs/1312.6034. External Links: Link Cited by: §2, §2.
  • [34] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. External Links: Link Cited by: §1, §4.
  • [35] J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2015) Striving for simplicity: the all convolutional net. In ICLR (workshop track), External Links: Link Cited by: §2, §2.
  • [36] J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2015) Striving for simplicity: the all convolutional net. In ICLR (workshop track), External Links: Link Cited by: §1.
  • [37] P. Su, P. Budzianowski, S. Ultes, M. Gasic, and S. J. Young (2017) Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. CoRR abs/1707.00130. External Links: Link, 1707.00130 Cited by: §1.
  • [38] B. Uzkent, C. Yeh, and S. Ermon (2019) Efficient object detection in large images using deep reinforcement learning. CoRR abs/1912.03966. External Links: Link, 1912.03966 Cited by: §2.
  • [39] J. Wagner, J. Mathias Kohler, T. Gindele, L. Hetzel, J. Thaddaus Wiedemer, and S. Behnke (2019-06) Interpretable and fine-grained visual explanations for convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [40] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas (2016) Sample efficient actor-critic with experience replay. CoRR abs/1611.01224. External Links: Link, 1611.01224 Cited by: §1, §2, §3.2.
  • [41] G. Weisz, P. Budzianowski, P. Su, and M. Gasic (2018) Sample efficient deep reinforcement learning for dialogue systems with large action spaces. CoRR abs/1802.03753. External Links: Link, 1802.03753 Cited by: §1.
  • [42] M. D. Zeiler and R. Fergus (2013) Visualizing and Understanding Convolutional Networks. CoRR abs/1311.2901. External Links: Link, 1311.2901 Cited by: §1, §1, §2.
  • [43] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff (2018-10) Top-down neural attention by excitation backprop. Int. J. Comput. Vision 126 (10), pp. 1084–1102. External Links: ISSN 0920-5691, Link, Document Cited by: §A.1, §A.1, §A.1, §1, §1, §2, §4.
  • [44] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR, Cited by: §1, §1, §2.

Appendix

Appendix A Implementation Details

In this section, we discuss all the details that are needed to reproduce our results as well as the scores obtained for the baseline methods. These parameters can be categorized as follows:

A.1 Environment and Agent Parameters

The mask is discretized into an $N \times N$ grid as discussed in Section 3 of the main paper. Naturally, the number of steps in an episode will be $N^2$. For VOC and COCO, we used the VGG16 and ResNet50 models trained by [43]. These models are trained on BGR images loaded with pixel values in the original intensity range and then mean corrected. For ImageNet, we use the pretrained models provided by the PyTorch model zoo (https://pytorch.org/vision/stable/models.html) and the standard normalization practices described there. The random noise that is used to mask the image is sampled from the same distribution as that of the image; for example, in the case of VOC or COCO, the noise has the same mean as the mean-corrected images.

The RL agent consists of two parts: a feature extractor, which is fixed during training, and a learnable MLP. The feature extractor used is a pretrained ResNet; for VOC, we use the ResNet module that was pretrained on VOC by [43]. The learnable MLP consists of two hidden layers with ReLU activations. The output layer predicts the categorical distribution over actions and hence contains $N^2$ output neurons.
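A sketch of such a policy network in PyTorch is given below. The frozen torchvision ResNet50 backbone and the hidden width of 256 are assumptions, since the exact hidden-layer sizes are not given in this copy; only the two-hidden-layer ReLU MLP and the $N^2$-way categorical output follow the description above.

```python
import torch.nn as nn
from torchvision import models

class RExLPolicy(nn.Module):
    """Frozen ResNet feature extractor + small MLP head over N*N grid actions."""

    def __init__(self, n_actions, hidden=256):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        for p in self.features.parameters():
            p.requires_grad = False           # the feature extractor stays fixed
        self.head = nn.Sequential(
            nn.Linear(2048, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),     # categorical logits over grid cells
        )

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.head(f)                   # softmax over these gives pi(a|s)
```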

Parameter Value
discount factor, 1.0
Number of environment steps per update 490
weight for loss of Q value 1.0
RMS prop alpha 0.9
learning rate 0.0001
Table 5: RL training hyperparameters.

During inference, as reported in Section 4.3 of the main paper, a particular value of the cumulating factor gives the best performance, so we report the results using that value. RExL agents are trained on the respective partitions of the train split and, following standard practice [24, 10, 11, 43], tested on the val, test and val splits of ImageNet, VOC and COCO respectively.

A.2 RL Training Parameters

We used ACER (https://stable-baselines.readthedocs.io/en/master/_modules/stable_baselines/acer/acer_simple.html#ACER) to train the policy. The parameters that we used during training are given in Table (5). For the rest of the parameters, we used the default hyperparameters mentioned in the official documentation (https://stable-baselines.readthedocs.io/en/master/modules/acer.html).

A.3 Baselines

For RISE, we used the official implementation (https://github.com/eclique/RISE) with the hyperparameters discussed in [24]. For GradCAM and NormGrad, we used the publicly available implementation (https://github.com/ruthcfong/TorchRay/tree/normgrad). Finally, for Extremal Perturbations, we used the implementation in the official repository (https://github.com/facebookresearch/TorchRay), using contrastive rewards and a fixed area parameter while generating explanations.

Method ImageNet PASCAL VOC 2007 COCO 2014
VGG16 ResNet50 VGG16 ResNet50 VGG16 ResNet50
GradCAM [31] 0.615 0.677 0.850 0.850 0.713 0.740
NormGrad [27] 0.517 0.455 0.795 0.807 0.640 0.690
RISE [24] 0.666 0.727 0.764 0.782 0.631 0.701
Extremal [11] 0.641 0.690 0.871 0.870 0.718 0.750
RExL-DS 0.464 0.529 0.755 0.755 0.592 0.678
Table 6: Comparison of Insertion scores on Imagenet, PASCAL VOC 2007 and COCO 2014 Datasets.

Appendix B Insertion Scores

RExL gets competitive insertion scores on all the datasets even though the process of insertion is complementary to our training process. A method that optimizes deletion scores will not necessarily also give good insertion scores. For the deletion score to be low, the agent learns a mask that removes features which increase the model's confusion for that class. But inserting the same features may not necessarily lead to a high probability for that class.

Method VGG16 ResNet50
RExL-DS 0.755 0.755
RExL-CS 0.778 0.777
Table 7: Comparison of the insertion scores of RExL-DS and RExL-CS on the VOC2007 test split. The insertion score is improved for the more specific RExL-CS agent compared to RExL-DS for both base models.

For example, ImageNet has many classes pertaining to different kinds of dogs. There will be some features (like some portion of the face) whose removal confuses the base model among all the other dog classes. But if we insert such a feature on a blank image, the base model will not necessarily classify it correctly.

Naturally, this effect is more pronounced where the classes are very similar to each other. As a result (ref. Table (6)), RExL does not do as well in terms of insertion scores on ImageNet, but does very well on VOC, where the classes are more distinct. Similar to the deletion score, the insertion score for RExL-CS is better than that of RExL-DS, as seen from Table (7). RExL-CS results are further analyzed in the next sections.

Inference (cumulating factor) VGG16 ResNet50
0.783 0.785
0.771 0.769
0.768 0.765
0.755 0.755
Table 8: Comparison of the insertion scores for the different values of the cumulating factor during inference on VOC test split using RExL-DS.

One interesting observation from Table (8) is the general trend that insertion scores are better for lower values of the cumulating factor. This is counterintuitive given the trend seen for deletion scores (Table 4 in the main paper). For an object, lower values of the cumulating factor credit the later actions more, while larger values credit the earlier actions. The results show that the later actions are more important for insertion scores. These are the actions that are probably removing the significant portions of the object, while the earlier actions aim to remove the features that separate similar classes. We posit that using an insertion-style training can lead to agents that do better in terms of insertion scores too. This is a possible future work.

Appendix C RExL-CS on VOC2007

For VOC, we train RExL-CS on all 20 classes. Saliency maps generated by RExL-CS are compared with the baseline methods for every class. For quantitative comparisons, we use both the deletion scores and the insertion scores. The maps generated by RExL-CS and the baselines for every class are also compared qualitatively.

C.1 VGG16

Figures (5) and (6) show the comparison of class-wise deletion and insertion scores among the different methods. Moreover, Figures (7) and (8) qualitatively compare the saliency maps generated by RExL-CS with the baselines for every class. VGG16 gives competitive results for both deletion and insertion scores. The saliency maps produced are also more accurate and concise.

Figure 5: Comparison of deletion scores (the lower the better) for VGG on all classes of VOC: RExL-CS gets competitive deletion scores for all the classes. It outperforms all the baselines for 5 classes.
Figure 6: Comparison of insertion scores (the higher the better) for VGG on all classes of VOC: Insertion scores of RExL-CS are at par with the baselines. On average it has a better insertion score than RISE.
Figure 7: Qualitative comparison of saliency maps for VGG on classes of VOC: RExL-CS always generates better maps than RISE. Extremal Perturbations gives almost equal importance to the entire object, making it difficult to identify the salient features of the object. RExL-CS is at par with the white box methods in most cases and clearly better in some cases like "aeroplane" and "cat".
Figure 8: Qualitative comparison of saliency maps for VGG on classes of VOC: Similar to Figure (7), RExL-CS generates better maps than both RISE and Extremal Perturbations. RExL-CS outperforms all the baselines for classes like "plant" and "tvmonitor".

C.2 ResNet50

Similarly, RExL-CS agents are trained for all the classes using ResNet as the base model. Figure (9) and Figure (10) respectively show the comparison of deletion and insertion scores for every class. As seen in all the results, RExL-CS improves considerably when a stronger base model like ResNet is used. Finally, Figure (11) and Figure (12) contain the saliency maps generated by the different methods for qualitative comparison.

Figure 9: Comparison of deletion scores (the lower the better) for ResNet on all classes of VOC: RExL-CS outperforms the baselines in most of the classes. It performs exceptionally well on some classes like "aeroplane", which is further highlighted in the saliency maps in Figure (11).
Figure 10: Comparison of insertion scores (the higher the better) for ResNet on all classes of VOC: RExL-CS is at par with all the baselines and on average better than RISE.
Figure 11: Qualitative comparison of saliency maps for ResNet on classes of VOC: As expected, the saliency map for "aeroplane" is better than those of the baseline methods. Moreover, for the classes "bicycle", "bus", "chair" and "cow", the maps are much better than the four baselines. For the same "boat" input, RExL performs far better using ResNet than it did using VGG.
Figure 12: Qualitative comparison of saliency maps for ResNet on classes of VOC: Like the previous figure, RExL-CS outperforms the baseline methods in most of the classes. It almost always generates better explanations than the two black box methods and is always at par with (and sometimes better than) the white box methods.

Appendix D RExL-CS on ImageNet

We also perform a class-wise comparison of the saliency maps generated by RExL-CS with the baseline methods on ImageNet. Instead of all 1000 classes, we show results only on the top 10 classes (as per the classification accuracy). Similar to VOC, we use deletion and insertion scores to compare quantitatively, while qualitatively we show the saliency maps produced for a random example from each of the 10 classes. Again, we perform the experiments for both base models.

D.1 VGG16

Figures (13) and (14) show the comparison of class-wise deletion and insertion scores among the different methods respectively, and Figure (15) qualitatively compares the saliency maps generated by RExL-CS with the baselines for every class. For VGG16, the scores are at par with the baselines, outperforming them a few times.

Figure 13: Comparison of deletion scores (the lower the better) for VGG on top 10 classes of ImageNet: RExL-CS gets competitive deletion scores for all the classes and outperforms all the baselines in 4 classes.
Figure 14: Comparison of insertion scores (the higher the better) for VGG on top 10 classes of ImageNet: RExL-CS gets competitive insertion scores for all the classes.
Figure 15: Qualitative comparison of saliency maps for VGG on top 10 classes of ImageNet: RExL-CS always generates better maps than Extremal. In most cases, RExL-CS outperforms RISE as well. GradCAM does not perform well on some classes, which is also reflected in its poor scores on these classes. In several classes like "giant panda", "leonberg" and "proboscis monkey", RExL-CS generates better maps than all the baselines, including the white box methods.

D.2 ResNet50

We perform similar comparisons using ResNet as the base model. Figures (16) and (17) compare the class-wise deletion and insertion scores among the different methods respectively, and Figure (18) qualitatively compares the saliency maps generated by RExL-CS with the baselines for every class. As expected, RExL performs better using ResNet as the base model, which is illustrated by both the scores and the saliency maps.

Figure 16: Comparison of deletion scores (the lower the better) for ResNet on top 10 classes of ImageNet: RExL-CS outperforms the baselines on most classes. Wherever it fails to be the best, it falls behind by a very small margin and is almost always the second best.
Figure 17: Comparison of insertion scores (the higher the better) for ResNet on top 10 classes of ImageNet: RExL-CS gets competitive insertion scores for all the classes. On average, it is better than three of the baselines.
Figure 18: Qualitative comparison of saliency maps for ResNet on top 10 classes of ImageNet: RExL-CS is again always better than Extremal and at par with (if not better than) RISE. It is also better than the two white box methods, which give a large blob of importance rather than providing any fine-grained information. For the classes "black swan", "proboscis monkey", "hummingbird", "leonberg" and "ruddy turnstone", RExL by far outperforms the baselines.

Appendix E RExL-IS

These agents are trained on single images and explain the decisions made by the base model on that image only. They are hence not scalable to practical datasets, but if the base model needs to be explained very accurately for a small number of images, RExL-IS will provide the best results. A comparison between the different baselines and RExL-IS can be seen in Figure (19). These images are taken from the VOC test split.

Figure 19: Qualitative comparison of saliency maps for RExL-IS using the ResNet base model on VOC: For all the images, RExL-IS is significantly better than all four baselines. While Extremal Perturbation provides almost equal importance to the entire object, the white box methods (GradCAM and NormGrad) also simply assign a large region of importance on the image instead of looking for specific features. This is clearly highlighted in the "bicycle" image, where GradCAM and NormGrad give large importance to the entire region containing the human as well.