PyTorch implementation of Interpretable Explanations of Black Boxes by Meaningful Perturbation
As machine learning algorithms are increasingly applied to high impact yet high risk tasks, e.g. problems in health, it is critical that researchers can explain how such algorithms arrived at their predictions. In recent years, a number of image saliency methods have been developed to summarize where highly complex neural networks "look" in an image for evidence for their predictions. However, these techniques are limited by their heuristic nature and architectural constraints. In this paper, we make two main contributions: First, we propose a general framework for learning different kinds of explanations for any black box algorithm. Second, we introduce a paradigm that learns the minimally salient part of an image by directly editing it and learning from the corresponding changes to its output. Unlike previous works, our method is model-agnostic and testable because it is grounded in replicable image perturbations.
Given the powerful but often opaque nature of modern black box predictors such as deep neural networks [4, 5], there is considerable interest in explaining and understanding predictors a posteriori, after they have been learned. This remains largely an open problem. One reason is that we lack a formal understanding of what it means to explain a classifier. Most existing approaches [19, 16, 8, 7, 9] produce intuitive visualizations; however, since such visualizations are primarily heuristic, their meaning remains unclear.
In this paper, we revisit the concept of “explanation” at a formal level, with the goal of developing principles and methods to explain any black box function $f$, e.g. a neural network object classifier. Since such a function is learned automatically from data, we would like to understand what $f$ has learned to do and how it does it. Answering the “what” question means determining the properties of the map. The “how” question investigates the internal mechanisms that allow the map to achieve these properties. We focus mainly on the “what” question and argue that it can be answered by providing interpretable rules that describe the input-output relationship captured by $f$. For example, one rule could be that $f$ is rotation invariant, in the sense that “$f(x) = f(x')$ whenever images $x$ and $x'$ are related by a rotation”.
In this paper, we make several contributions. First, we propose the general framework of explanations as meta-predictors (sec. 2), extending prior work. Second, we identify several pitfalls in designing automatic explanation systems. We show in particular that neural network artifacts are a major attractor for explanations. While artifacts are informative since they explain part of the network behavior, characterizing other properties of the network requires careful calibration of the generality and interpretability of explanations. Third, we reinterpret network saliency in our framework. We show that this provides a natural generalization of the gradient-based saliency technique by integrating information over several rounds of backpropagation in order to learn an explanation. We also compare this technique to other methods [15, 16, 20, 14, 19] in terms of their meaning and obtained results.
Our work builds on the gradient-based saliency method, which backpropagates the gradient for a class label to the image layer. Other backpropagation methods include DeConvNet and Guided Backprop [16, 8], which builds on DeConvNet and the gradient method to produce sharper visualizations.
Another set of techniques incorporates network activations into their visualizations: Class Activation Mapping (CAM) and its relaxed generalization Grad-CAM visualize the linear combination of a late layer’s activations and class-specific weights (or gradients, in the case of Grad-CAM), while Layer-Wise Relevance Propagation (LRP) and Excitation Backprop backpropagate a class-specific error signal through a network while multiplying it with each convolutional layer’s activations.
With the exception of the gradient method, the above techniques introduce different backpropagation heuristics, which result in aesthetically pleasing but heuristic notions of image saliency. They are also not model-agnostic: most are limited to neural networks (all except [15, 1]), and many require architectural modifications [19, 16, 8, 22] and/or access to intermediate layers [22, 14, 1, 20].
A few techniques examine the relationship between inputs and outputs by editing an input image and observing its effect on the output. These include greedily graying out segments of an image until it is misclassified and visualizing the classification score drop when an image is occluded at fixed regions. However, these techniques are limited by their approximate nature; we introduce a differentiable method that allows for the effect of the joint inclusion/exclusion of different image regions to be considered.
Our research also builds on the work of [18, 12, 2]. The idea of explanations as predictors is inspired by earlier work, which we generalize to new types of explanations, from classification to invariance.
The Local Interpretable Model-Agnostic Explanation (LIME) framework is relevant to our local explanation paradigm and saliency method (sections 3.2, 4) in that both use a function’s output with respect to inputs from a neighborhood around an input, generated by perturbing the image. However, their method takes much longer to converge than ours and produces a coarse heatmap defined by fixed super-pixels.
Similar to how our paradigm aims to learn an image perturbation mask that minimizes a class score, feedback networks learn gating masks after every ReLU in a network to maximize a class score. However, our masks are plainly interpretable as they directly edit the image, while the feedback networks’ ReLU gates are not and cannot be directly used as a visual explanation; furthermore, their method requires architectural modification and may yield different results for different networks, while ours is model-agnostic.
A black box is a map $f : \mathcal{X} \to \mathcal{Y}$ from an input space $\mathcal{X}$ to an output space $\mathcal{Y}$, typically obtained from an opaque learning process. To make the discussion more concrete, consider as input color images $x : \Lambda \to \mathbb{R}^3$, where $\Lambda = \{1, \dots, H\} \times \{1, \dots, W\}$ is a discrete domain. The output $y \in \mathcal{Y}$ can be a boolean telling whether the image contains an object of a certain type (e.g. a robin), the probability of such an event, or some other interpretation of the image content.
An explanation is a rule that predicts the response of a black box $f$ to certain inputs. For example, we can explain a behavior of a robin classifier by the rule $Q_1(x; f) = \{x \in \mathcal{X}_c \Leftrightarrow f(x) = +1\}$, where $\mathcal{X}_c \subset \mathcal{X}$ is the subset of all the robin images. Since $f$ is imperfect, any such rule applies only approximately. We can measure the faithfulness of the explanation as its expected prediction error: $\mathcal{L}_1 = \mathbb{E}[1 - \delta_{Q_1(x; f)}]$, where $\delta_Q$ is the indicator function of event $Q$. Note that $Q_1$ implicitly requires a distribution $p(x)$ over possible images $\mathcal{X}$. Note also that $\mathcal{L}_1$ is simply the expected prediction error of the classifier. Unless we did not know that $f$ was trained as a robin classifier, $Q_1$ is not very insightful, but it is interpretable since $\mathcal{X}_c$ is.
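As a minimal sketch of measuring faithfulness as prediction error, the snippet below estimates the failure rate of a rule of the form "x is a robin if and only if the black box outputs +1" on a handful of samples. The one-dimensional "images", the toy black box `f`, the ground-truth predicate `is_robin`, and the helper `faithfulness_loss` are all hypothetical and chosen only for illustration.

```python
# Toy illustration (all names hypothetical): faithfulness as expected prediction error.

def faithfulness_loss(rule, samples):
    """Empirical estimate of E[1 - indicator(rule holds)] over the samples."""
    failures = sum(0 if rule(s) else 1 for s in samples)
    return failures / len(samples)

def f(x):
    """Hypothetical black box on 1-D 'images': predicts +1 when x exceeds 0.5."""
    return +1 if x > 0.5 else -1

def is_robin(x):
    """Assumed ground truth: the 'robin' subset is every x above 0.4."""
    return x > 0.4

def q1(x):
    """Rule: x is a robin  <=>  f(x) = +1."""
    return is_robin(x) == (f(x) == +1)

samples = [i / 10 for i in range(10)]      # 0.0, 0.1, ..., 0.9
loss_q1 = faithfulness_loss(q1, samples)   # the rule fails only at x = 0.5
print(loss_q1)
```

The estimate depends on how the samples are drawn, which mirrors the remark that the rule implicitly requires a distribution over images.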
Explanations can also make relative statements about black box outcomes. For example, a black box $f$ could be rotation invariant: $Q_2(x, x'; f) = \{x \sim_{\mathrm{rot}} x' \Rightarrow f(x) = f(x')\}$, where $x \sim_{\mathrm{rot}} x'$ means that $x$ and $x'$ are related by a rotation. Just like before, we can measure the faithfulness of this explanation as $\mathcal{L}_2 = \mathbb{E}[1 - \delta_{Q_2(x, x'; f)} \mid x \sim x']$. (For rotation invariance we condition on $x \sim x'$ because the probability of independently sampling rotated $x$ and $x'$ is zero, so that, without conditioning, $Q_2$ would be true with probability 1.) This rule is interpretable because the relation $\sim_{\mathrm{rot}}$ is.
A significant advantage of formulating explanations as meta-predictors is that their faithfulness can be measured as prediction accuracy. Furthermore, machine learning algorithms can be used to discover explanations automatically, by finding explanatory rules $Q$ that apply to a certain classifier $f$ out of a large pool of possible rules $\mathcal{Q}$.
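In particular, finding the most accurate explanation $Q$ is similar to a traditional learning problem and can be formulated computationally as a regularized empirical risk minimization such as:

$$Q^* = \operatorname*{argmin}_{Q \in \mathcal{Q}} \; \lambda \mathcal{R}(Q) + \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(Q(x_i; f), x_i, f), \qquad x_i \sim p(x). \tag{1}$$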
Here, the regularizer $\mathcal{R}(Q)$ has two goals: to allow the explanation $Q$ to generalize beyond the $n$ samples $x_1, \dots, x_n$ considered in the optimization and to pick an explanation $Q$ which is simple and thus, hopefully, more interpretable.
Simplicity and interpretability are often not sufficient to find good explanations and must be paired with informativeness. Consider the following variant of rule $Q_2$: $Q_3(x, x'; f, \theta) = \{x \sim_\theta x' \Rightarrow f(x) = f(x')\}$, where $x \sim_\theta x'$ means that $x$ and $x'$ are related by a rotation of an angle $\leq \theta$. Explanations for larger angles imply the ones for smaller ones, with $\theta = 0$ being trivially satisfied. The regularizer $\mathcal{R}(Q_3(\cdot; \theta)) = -\theta$ can then be used to select a maximal angle and thus find an explanation that is as informative as possible. (Naively, strict invariance for any $\theta > 0$ implies invariance to arbitrary rotations, as small rotations compose into larger ones. However, the formulation can still be used to describe rotation insensitivity, when $f$ varies slowly with rotation, or the meaning of $\sim_\theta$ can be changed to indicate rotation with respect to a canonical “upright” direction for certain object classes, etc.)
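The trade-off between faithfulness and informativeness can be sketched on a toy problem. The snippet below assumes a hypothetical black box `f` that takes a rotation angle (in degrees) as its input and recognizes the object only within 30 degrees of upright; it then scores candidate angles with a regularized empirical risk whose regularizer is the negative angle. The weight `lam`, the grid of angles, and all function names are illustrative assumptions, not values from the paper.

```python
# Toy sketch (hypothetical names and values): choosing the most informative angle.

def f(angle_deg):
    """Hypothetical black box: recognizes the object only within 30 degrees of upright."""
    return +1 if abs(angle_deg) <= 30 else -1

# Pairs (x, x') = (upright image, the same image rotated by delta degrees).
deltas = list(range(-90, 91, 5))

def empirical_loss(theta):
    """Fraction of pairs related by a rotation <= theta on which f changes its output."""
    related = [d for d in deltas if abs(d) <= theta]
    failures = sum(1 for d in related if f(0) != f(d))
    return failures / len(related)

def objective(theta, lam=0.01):
    """Regularized risk; the regularizer -theta rewards larger (more informative) angles."""
    return lam * (-theta) + empirical_loss(theta)

candidates = list(range(0, 91, 5))
best_theta = min(candidates, key=objective)
print(best_theta)
```

With these settings the selected angle coincides with the largest angle for which the rule never fails; a different `lam` would trade a few failures for a larger angle.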
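A local explanation is a rule $Q(x; f, x_0)$ that predicts the response of $f$ in a neighborhood of a certain point $x_0$. If $f$ is smooth at $x_0$, it is natural to construct $Q$ by using the first-order Taylor expansion of $f$:

$$f(x) \approx Q(x; f, x_0) = f(x_0) + \langle \nabla f(x_0), x - x_0 \rangle. \tag{2}$$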
This formulation provides an interpretation of gradient-based saliency maps, which visualize the gradient $S_1(x_0) = \nabla f(x_0)$ as an indication of salient image regions. The argument is that large values of the gradient identify pixels that strongly affect the network output. However, an issue is that this interpretation breaks for a linear classifier: if $f(x) = \langle w, x \rangle + b$, then $S_1(x_0) = w$ is independent of the image $x_0$ and hence cannot be interpreted as saliency.
The reason for this failure is that eq. 2 studies the variation of $f$ for arbitrary displacements $\Delta x = x - x_0$ from $x_0$ and, for a linear classifier, the change is the same regardless of the starting point $x_0$. For a non-linear black box $f$ such as a neural network, this problem is reduced but not eliminated, and can explain why the saliency map $S_1$ is rather diffuse, with strong responses even where no obvious information can be found in the image (fig. 3).
We argue that the meaning of explanations depends in large part on the meaning of varying the input $x$ to the black box. For example, explanations in sec. 3.1 are based on letting $x$ vary in image category or in rotation. For saliency, one is interested in finding image regions that impact $f$’s output. Thus, it is natural to consider perturbations $x$ obtained by deleting subregions of $x_0$. If we model deletion by multiplying $x_0$ point-wise by a mask $m$, this amounts to studying the function $f(x_0 \odot m)$, where $\odot$ is the Hadamard or element-wise product of vectors. The Taylor expansion of $f$ at $m = (1, 1, \dots, 1)$ is $S_2(x_0) = \left. \frac{d f(x_0 \odot m)}{d m} \right|_{m = (1, \dots, 1)} = \nabla f(x_0) \odot x_0$. For a linear classifier $f$, this results in the saliency $S_2(x_0) = w \odot x_0$, which is large for pixels for which $x_0$ and $w$ are large simultaneously. We refine this idea for non-linear classifiers in the next section.
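The contrast between the two notions of saliency for a linear classifier can be checked numerically. The sketch below uses a hypothetical four-pixel linear classifier with made-up weights and estimates both gradients by central finite differences: the gradient with respect to the image (identical for every image) and the gradient with respect to a multiplicative mask evaluated at the all-ones mask (which depends on the image).

```python
# Numerical check on a hypothetical linear classifier (weights are made up).

def f(x, w=(2.0, -1.0, 0.0, 3.0), b=0.5):
    """Linear classifier f(x) = <w, x> + b on a 4-pixel image."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def numerical_gradient(fn, point, eps=1e-6):
    """Central finite differences of fn at point."""
    grad = []
    for i in range(len(point)):
        hi = list(point)
        lo = list(point)
        hi[i] += eps
        lo[i] -= eps
        grad.append((fn(hi) - fn(lo)) / (2 * eps))
    return grad

def masked_saliency(x0):
    """Gradient of f(x0 * m) with respect to the mask m, evaluated at m = (1, ..., 1)."""
    ones = [1.0] * len(x0)
    return numerical_gradient(lambda m: f([mi * xi for mi, xi in zip(m, x0)]), ones)

x_a = [1.0, 1.0, 1.0, 0.0]
x_b = [0.0, 5.0, 2.0, 1.0]

s1_a = numerical_gradient(f, x_a)   # approximately w
s1_b = numerical_gradient(f, x_b)   # approximately w again: image-independent
s2_a = masked_saliency(x_a)         # approximately w * x_a
s2_b = masked_saliency(x_b)         # approximately w * x_b
```

Finite differences are used only to keep the sketch dependency-free; an automatic differentiation library would give the same quantities exactly.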
In order to define an explanatory rule for a black box $f(x)$, one must start by specifying which variations of the input $x$ will be used to study $f$. The aim of saliency is to identify which regions of an image $x_0$ are used by the black box to produce the output value $f(x_0)$. We can do so by observing how the value of $f(x)$ changes as $x$ is obtained by “deleting” different regions $R$ of $x_0$. For example, if $f(x_0) = +1$ denotes a robin image, we expect that $f(x) = +1$ as well unless the choice of $R$ deletes the robin from the image. Given that $x$ is a perturbation of $x_0$, this is a local explanation (sec. 3.2) and we expect the explanation to characterize the relationship between $f$ and $x_0$.
While conceptually simple, there are several problems with this idea. The first one is to specify what it means to “delete” information. As discussed in detail in sec. 4.3, we are generally interested in simulating naturalistic or plausible imaging effects, leading to more meaningful perturbations and hence explanations. Since we do not have access to the image generation process, we consider three obvious proxies: replacing the region with a constant value, injecting noise, and blurring the image (fig. 4).
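Formally, let $m : \Lambda \to [0, 1]$ be a mask, associating each pixel $u \in \Lambda$ with a scalar value $m(u)$. Then the perturbation operator is defined as

$$[\Phi(x_0; m)](u) = \begin{cases} m(u)\, x_0(u) + (1 - m(u))\, \mu_0, & \text{constant}, \\ m(u)\, x_0(u) + (1 - m(u))\, \eta(u), & \text{noise}, \\ \int g_{\sigma_0 (1 - m(u))}(v - u)\, x_0(v)\, dv, & \text{blur}, \end{cases}$$

where $\mu_0$ is an average color, $\eta(u)$ are i.i.d. Gaussian noise samples for each pixel, $g_\sigma$ is a Gaussian blur kernel, and $\sigma_0$ is its maximum isotropic standard deviation (we use a value of $\sigma_0$ that yields a significantly blurred image). In all three cases $m(u) = 1$ leaves the pixel untouched and $m(u) = 0$ deletes it.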
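The three deletion proxies (constant, noise, blur) can be sketched on a tiny grayscale image stored as nested lists. This is an illustration under stated assumptions rather than the paper's implementation: a mask value of 1 keeps a pixel and 0 deletes it, the blur uses a per-pixel standard deviation of `sigma0 * (1 - m)` with a normalized discrete Gaussian, and the function names, image, and parameter values are hypothetical.

```python
import math
import random

def perturb_constant(x0, m, mu0):
    """Blend each pixel with a constant value mu0 where the mask is below 1."""
    return [[mv * xv + (1.0 - mv) * mu0 for xv, mv in zip(xr, mr)]
            for xr, mr in zip(x0, m)]

def perturb_noise(x0, m, rng, noise_std=1.0):
    """Blend each pixel with an independent Gaussian noise sample."""
    return [[mv * xv + (1.0 - mv) * rng.gauss(0.0, noise_std)
             for xv, mv in zip(xr, mr)]
            for xr, mr in zip(x0, m)]

def perturb_blur(x0, m, sigma0):
    """Blur each pixel with std sigma0 * (1 - m); this convention is an assumption."""
    h, w = len(x0), len(x0[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            sigma = sigma0 * (1.0 - m[i][j])
            if sigma < 1e-8:              # mask value 1: keep the pixel untouched
                out[i][j] = x0[i][j]
                continue
            num = den = 0.0
            for a in range(h):
                for b in range(w):
                    g = math.exp(-((a - i) ** 2 + (b - j) ** 2) / (2.0 * sigma ** 2))
                    num += g * x0[a][b]
                    den += g
            out[i][j] = num / den         # normalized discrete Gaussian average
    return out

x0 = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
keep_all = [[1.0] * 3 for _ in range(3)]
delete_center = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

rng = random.Random(0)
const_img = perturb_constant(x0, delete_center, mu0=1.0)
noise_img = perturb_noise(x0, delete_center, rng)
blur_img = perturb_blur(x0, delete_center, sigma0=2.0)
```

The quadratic-time blur is only meant for tiny examples; a practical implementation would blur the whole image once and blend, or use a convolution routine from an array library.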
Given an image $x_0$, our goal is to summarize compactly the effect of deleting image regions in order to explain the behavior of the black box. One approach to this problem is to find deletion regions that are maximally informative.
In order to simplify the discussion, in the rest of the paper we consider black boxes $f(x) \in \mathbb{R}^C$ that generate a vector of scores for different hypotheses about the content of the image (e.g. as a softmax probability layer in a neural network). Then, we consider a “deletion game” where the goal is to find the smallest deletion mask $m$ that causes the score to drop significantly, $f_c(\Phi(x_0; m)) \ll f_c(x_0)$, where $c$ is the target class. Finding $m$ can be formulated as the following learning problem:

$$m^* = \operatorname*{argmin}_{m \in [0, 1]^\Lambda} \; \lambda \|\mathbf{1} - m\|_1 + f_c(\Phi(x_0; m)), \tag{3}$$

where $\lambda$ encourages most of the mask to be turned off (hence deleting a small subset of $x_0$). In this manner, we can find a highly informative region for the network.
One can also play a symmetric “preservation game”, where the goal is to find the smallest subset of the image that must be retained to preserve the score $f_c(\Phi(x_0; m)) \geq f_c(x_0)$: $m^* = \operatorname*{argmin}_m \lambda \|m\|_1 - f_c(\Phi(x_0; m))$. The main difference is that the deletion game removes enough evidence to prevent the network from recognizing the object in the image, whereas the preservation game finds a minimal subset of sufficient evidence.
Both optimization problems are solved by local search using gradient descent methods. In this manner, our method extracts information from the black box $f$ by computing its gradient, similar to gradient-based saliency. However, it differs in that it extracts this information progressively, over several gradient evaluations, accumulating increasingly more information over time.
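Both games can be sketched on a toy problem where the gradient is available in closed form. The snippet below assumes a hypothetical linear class score `<w, x>`, the constant perturbation with fill value `mu0`, a sparsity weight `lam`, and plain projected gradient descent onto the interval [0, 1]; the deletion game minimizes `lam * sum(1 - m)` plus the score, and the preservation game minimizes `lam * sum(m)` minus the score. The step size, iteration count, and weights are illustrative assumptions, and a real black box would need automatic differentiation instead of the analytic gradient.

```python
# Toy sketch (hypothetical linear score and settings): deletion vs. preservation game.

def play_game(x0, w, mode, mu0=0.0, lam=0.5, lr=0.1, steps=200):
    """Projected gradient descent on the mask m for the score f_c(x) = <w, x>.

    Perturbed pixel: phi_i = m_i * x0_i + (1 - m_i) * mu0 (constant perturbation).
    """
    m = [1.0] * len(x0)                        # start with nothing deleted
    for _ in range(steps):
        for i in range(len(m)):
            dscore = w[i] * (x0[i] - mu0)      # d f_c(phi) / d m_i
            if mode == "deletion":             # lam * sum(1 - m) + f_c(phi)
                grad = -lam + dscore
            elif mode == "preservation":       # lam * sum(m) - f_c(phi)
                grad = lam - dscore
            else:
                raise ValueError(mode)
            m[i] = min(1.0, max(0.0, m[i] - lr * grad))   # project onto [0, 1]
    return m

def score(x0, w, m, mu0=0.0):
    """Class score of the perturbed image."""
    return sum(wi * (mi * xi + (1.0 - mi) * mu0) for wi, xi, mi in zip(w, x0, m))

x0 = [1.0, 1.0, 1.0, 1.0]
w = [3.0, 0.2, -1.0, 2.0]
deletion_mask = play_game(x0, w, "deletion")
preservation_mask = play_game(x0, w, "preservation")
```

In this linear toy case the two masks are exact complements: the deletion game removes the two pixels whose evidence outweighs the sparsity penalty, and the preservation game keeps exactly those pixels.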
By committing to finding a single representative perturbation, our approach incurs the risk of triggering artifacts of the black box. Neural networks, in particular, are known to be affected by surprising artifacts [5, 10, 7]; these works demonstrate that it is possible to find particular inputs that can drive the neural network to generate nonsensical or unexpected outputs. This is not entirely surprising since neural networks are trained discriminatively on natural image statistics. While not all artifacts look “unnatural”, nevertheless they form a subset of images that is sampled with negligible probability when the network is operated normally.
Although the existence and characterization of artifacts is an interesting problem per se, we wish to characterize the behavior of black boxes under normal operating conditions. Unfortunately, as illustrated in fig. 5, objectives such as eq. 3 are strongly attracted by such artifacts, and naively learn subtly-structured deletion masks that trigger them. This is particularly true for the noise and constant perturbations as they can more easily than blur create artifacts using sharp color contrasts (fig. 5, bottom row).
We suggest two approaches to avoid such artifacts in generating explanations. The first one is that powerful explanations should, just like any predictor, generalize as much as possible. For the deletion game, this means not relying on the details of a singly-learned mask $m$. Hence, we reformulate the problem to apply the mask $m$ stochastically, up to small random jitter.
Second, we argue that masks co-adapted with network artifacts are not representative of natural perturbations. As noted before, the meaning of an explanation depends on the meaning of the changes applied to the input $x$; to obtain a mask more representative of natural perturbations we can encourage it to have a simple, regular structure which cannot be co-adapted to artifacts. We do so by regularizing $m$ in total-variation (TV) norm and upsampling it from a low resolution version.
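With these two modifications, eq. 3 becomes:

$$\min_{m \in [0, 1]^\Lambda} \; \lambda_1 \|\mathbf{1} - m\|_1 + \lambda_2 \sum_{u \in \Lambda} \|\nabla m(u)\|_\beta^\beta + \mathbb{E}_\tau \left[ f_c(\Phi(x_0(\cdot - \tau), M)) \right], \tag{4}$$

where $M(v) = \sum_u g_{\sigma_m}(v / s - u)\, m(u)$ is the upsampled mask and $g_{\sigma_m}$ is a 2D Gaussian kernel. Equation 4 can be optimized using stochastic gradient descent.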
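The two structural ingredients, a total-variation penalty on the mask and upsampling from a low-resolution mask with a Gaussian kernel, can be sketched as follows. The snippet is an illustration under assumptions: the TV term sums the absolute vertical and horizontal finite differences raised to the power `beta`, the upsampling weights are normalized to sum to one (so a constant mask stays constant), and the names `tv_norm` and `upsample` and all numeric values are hypothetical.

```python
import math

def tv_norm(m, beta):
    """Sum over pixels of |vertical diff|^beta + |horizontal diff|^beta."""
    h, w = len(m), len(m[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                total += abs(m[i + 1][j] - m[i][j]) ** beta
            if j + 1 < w:
                total += abs(m[i][j + 1] - m[i][j]) ** beta
    return total

def upsample(m, s, sigma):
    """M(v) = sum_u g_sigma(v / s - u) m(u), with weights normalized to sum to 1."""
    h, w = len(m), len(m[0])
    out = []
    for vi in range(h * s):
        row = []
        for vj in range(w * s):
            num = den = 0.0
            for ui in range(h):
                for uj in range(w):
                    d2 = (vi / s - ui) ** 2 + (vj / s - uj) ** 2
                    g = math.exp(-d2 / (2.0 * sigma ** 2))
                    num += g * m[ui][uj]
                    den += g
            row.append(num / den)
        out.append(row)
    return out

low_res = [[1.0, 1.0], [1.0, 0.0]]
penalty = tv_norm(low_res, beta=3)          # two unit jumps next to the deleted pixel
big = upsample(low_res, s=2, sigma=0.5)     # smooth 4x4 mask with values in [0, 1]
```

The expectation over jitter is not shown; one way to approximate it, under the same assumptions, is to average the class score over a few randomly shifted copies of the image at every optimization step.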
Unless otherwise specified, the visualizations shown were generated using Adam to minimize GoogLeNet’s softmax probability of the target class with the blur perturbation. The remaining settings are the learning rate, the number of iterations, the size of the low-resolution mask and its upsampling factor, the Gaussian kernel used to blur the upsampled mask, and the jitter, which is an integer drawn from a discrete uniform distribution. We initialize the mask as the smallest centered circular mask that suppresses the score of the original image by a set fraction when compared to that of the fully perturbed image, i.e. a fully blurred image.
An advantage of the proposed framework is that the generated visualizations are clearly interpretable. For example, the deletion game produces a minimal mask that prevents the network from recognizing the object.
When compared to other techniques (fig. 2), this method can pinpoint the reason why a certain object is recognized without highlighting non-essential evidence. This can be noted in fig. 2 for the CD player (row 7), where other visualizations also emphasize the neighboring speakers, and similarly for the cliff (row 3), the street sign (row 4), and the sunglasses (row 8). Sometimes this shows that only a part of an object is essential: the face of the Pekinese dog (row 2), the upper half of the truck (row 6), and the spoon on the chocolate sauce plate (row 1) are all found to be minimally sufficient parts.
While contrastive excitation backprop generated heatmaps that were most similar to our masks, our method introduces a quantitative criterion (i.e., maximally suppressing a target class score), and its verifiable nature (i.e., direct edits to an image) allows us to compare differing proposed saliency explanations and demonstrate that our learned masks are better on this metric. In fig. 6, row 2, we show that applying a bounded perturbation informed by our learned mask significantly suppresses the truck softmax score, whereas a boxed perturbation on the truck’s back bumper, which is highlighted by contrastive excitation backprop in fig. 2, row 6, actually increases the score.
The principled interpretability of our method also allows us to identify instances when an algorithm may have learned the wrong association. In the case of the chocolate sauce in fig. 6, row 1, it is surprising that the spoon is highlighted by our learned mask, as one might expect the sauce-filled jar to be more salient. However, manually perturbing the image reveals that indeed the spoon is more suppressive than the jar. One explanation is that the ImageNet “chocolate sauce” images contain more spoons than jars, which appears to be true upon examining some images. More generally, our method allows us to diagnose highly-predictive yet non-intuitive and possibly misleading correlations identified by machine learning algorithms in the data.
To test that our learned masks are generalizable and robust against artifacts, we simplify our masks by further blurring them and then slicing them into binary masks by thresholding the smoothed masks at a threshold $\alpha$ (fig. 7, top; a suitable threshold tends to cover the salient part identified by the learned mask). We then use these simplified masks to edit a set of 5,000 ImageNet images with constant, noise, and blur perturbations. Using GoogLeNet, we compute normalized softmax probabilities, which rescale the masked image’s score using the original and fully blurred images’ scores (fig. 7, bottom). The fact that these simplified masks quickly suppress scores as $\alpha$ increases for all three perturbations gives confidence that the learned masks are identifying the right regions to perturb and are generalizable to a set of extracted masks and other perturbations that they were not trained on.
In this experiment we assess the ability of our method to correctly identify a minimal region that suppresses the object. Given the output saliency map, we normalize its intensities, threshold it at a range of values, and fit the tightest bounding box around the resulting heatmap. We then blur the image in the box and compute the normalized target softmax probability from GoogLeNet of the partially blurred image.
From these bounding boxes and normalized scores, for a given amount of score suppression, we find the smallest bounding box that achieves that amount of suppression. Figure 8 shows that, on average, our method yields the smallest minimal bounding boxes across the suppression levels considered. These results show that our method finds a small salient area that strongly impacts the network.
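The box-fitting and selection steps can be sketched as follows. The snippet assumes min-max normalization of the heatmap, inclusive box coordinates `(top, left, bottom, right)`, and suppression levels that have already been measured by blurring each box and querying the network; the numbers in `records` are made up for illustration and all function names are hypothetical.

```python
# Toy sketch (hypothetical names and numbers): smallest box reaching a suppression level.

def tightest_bbox(heatmap, threshold):
    """Inclusive box (top, left, bottom, right) around normalized values >= threshold."""
    lo = min(min(row) for row in heatmap)
    hi = max(max(row) for row in heatmap)
    scale = (hi - lo) or 1.0                  # avoid division by zero on flat maps
    hits = [(i, j) for i, row in enumerate(heatmap)
            for j, v in enumerate(row) if (v - lo) / scale >= threshold]
    if not hits:
        return None
    rows = [i for i, _ in hits]
    cols = [j for _, j in hits]
    return (min(rows), min(cols), max(rows), max(cols))

def box_area(box):
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)

def smallest_box_for(records, target_suppression):
    """records: (box, suppression) pairs; return the smallest box meeting the target."""
    ok = [box for box, supp in records if supp >= target_suppression]
    return min(ok, key=box_area) if ok else None

heatmap = [[0.0, 0.0, 0.0, 0.0],
           [0.0, 8.0, 4.0, 0.0],
           [0.0, 4.0, 2.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
box_high = tightest_bbox(heatmap, 0.9)        # only the peak survives
box_low = tightest_bbox(heatmap, 0.2)         # the whole 2x2 blob survives

# Hypothetical suppression levels measured after blurring each box.
records = [(box_high, 0.6), (box_low, 0.95)]
chosen = smallest_box_for(records, target_suppression=0.9)
```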
From qualitatively examining learned masks for different animal images, we noticed that faces appeared to be more salient than appendages like feet. Because we produce dense heatmaps, we can test this hypothesis. From an annotated subset of the ImageNet dataset that identifies the keypoint locations of non-occluded eyes and feet of vertebrate animals, we select images from classes that have at least 10 images which each contain at least one eye and foot annotation, resulting in a set of 3558 images from 76 animal classes (fig. 9). For every keypoint, we calculate the average heatmap intensity of a window around the keypoint. For all 76 classes, the mean average intensity of eyes was lower, and thus eyes were more salient, than that of feet (see supplementary materials for class-specific results).
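The keypoint statistic can be sketched as the mean heatmap intensity in a square window centered on each annotated keypoint, clipped at the image border. The window half-width, the toy mask, and the keypoint coordinates below are hypothetical; lower values mean stronger deletion and therefore higher saliency, as in the text.

```python
# Toy sketch (hypothetical mask and keypoints): mean mask intensity around keypoints.

def window_mean(heatmap, row, col, half_width=1):
    """Mean intensity in a (2 * half_width + 1)-wide window, clipped at the borders."""
    h, w = len(heatmap), len(heatmap[0])
    values = [heatmap[i][j]
              for i in range(max(0, row - half_width), min(h, row + half_width + 1))
              for j in range(max(0, col - half_width), min(w, col + half_width + 1))]
    return sum(values) / len(values)

# Learned deletion mask: low values (top-left) mark the region that was deleted.
heatmap = [[0.1, 0.1, 0.9, 0.9],
           [0.1, 0.1, 0.9, 0.9],
           [0.9, 0.9, 0.9, 0.9],
           [0.9, 0.9, 0.9, 0.9]]
eye = (0, 0)     # hypothetical eye keypoint (row, col)
foot = (3, 3)    # hypothetical foot keypoint (row, col)

eye_mean = window_mean(heatmap, *eye)
foot_mean = window_mean(heatmap, *foot)
eye_more_salient = eye_mean < foot_mean
```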
Adversarial examples are often generated using an optimization procedure complementary to our method, which learns an imperceptible pattern of noise that causes an image to be misclassified when added to it. Using our re-implementation of the highly effective one-step iterative method to generate adversarial examples, our method yielded visually distinct, abnormal masks compared to those produced on natural images (fig. 10, left). We train an AlexNet classifier to distinguish between clean and adversarial images by using a given heatmap visualization with respect to the top predicted class on the clean and adversarial images (fig. 10, right); our method greatly outperforms the other methods in discriminating accuracy.
Lastly, when our learned masks are applied back to their corresponding adversarial images, they not only minimize the adversarial label but often allow the original, predicted label from the clean image to rise back as the top predicted class. Our method recovers the original label predicted on the clean image 40.64% of the time and the ground truth label 37.32% of the time. Moreover, 100% of the time the original, predicted label was recovered as one of the top-5 predicted labels in the “mask+adversarial” setting. To our knowledge, this is the first work that is able to recover originally predicted labels without any modification to the training set-up and/or network architecture.
Saliency methods are often assessed in terms of weakly-supervised localization and a pointing game, which tests how discriminative a heatmap method is by calculating the precision with which a heatmap’s maximum point lies on an instance of a given object class, on harder datasets like COCO. Because the deletion game is meant to discover minimal salient parts and/or spurious correlations, we do not expect it to be particularly competitive on localization and pointing but tested them for completeness.
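A stripped-down version of the pointing game can be sketched as follows: for each image, take the location of the heatmap maximum and count a hit if it falls on the object. The snippet assumes that larger heatmap values mean higher saliency and represents each object by a boolean mask; the toy heatmaps and masks are hypothetical, and any tolerance radius or per-class averaging used in practice is omitted.

```python
# Toy sketch (hypothetical data): precision of the heatmap maximum landing on the object.

def argmax_2d(heatmap):
    """Return (row, col) of the first maximum of a 2-D list."""
    best = (0, 0)
    for i, row in enumerate(heatmap):
        for j, v in enumerate(row):
            if v > heatmap[best[0]][best[1]]:
                best = (i, j)
    return best

def pointing_precision(examples):
    """examples: (heatmap, object_mask) pairs; object_mask[i][j] is True on the object."""
    hits = 0
    for heatmap, object_mask in examples:
        i, j = argmax_2d(heatmap)
        if object_mask[i][j]:
            hits += 1
    return hits / len(examples)

object_mask = [[True, True, False],
               [True, True, False],
               [False, False, False]]
on_object = [[0.2, 0.9, 0.1],
             [0.3, 0.4, 0.1],
             [0.0, 0.0, 0.0]]
off_object = [[0.2, 0.1, 0.1],
              [0.3, 0.4, 0.1],
              [0.0, 0.0, 0.8]]

precision = pointing_precision([(on_object, object_mask), (off_object, object_mask)])
```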
For localization, similar to [20, 2], we predict a bounding box for the most dominant object in each of 50k ImageNet validation images and employ three simple thresholding methods for fitting bounding boxes. First, for value thresholding, we normalize heatmaps and then threshold them by their value with a threshold $\alpha$. Second, for energy thresholding, we threshold heatmaps by the percentage of energy their most salient subset covers, again controlled by $\alpha$. Finally, with mean thresholding, we threshold a heatmap by $\tau = \alpha \mu_I$, where $\mu_I$ is the mean intensity of the heatmap and $\alpha$ is a scaling parameter. For each thresholding method, we search for the optimal $\alpha$ value on a held-out set. Localization error was calculated using a thresholded IOU criterion.
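The three thresholding rules and the IOU criterion can be sketched on a small non-negative heatmap. The concrete readings below are assumptions: value thresholding min-max normalizes the heatmap and keeps pixels at or above `alpha`; energy thresholding keeps the most salient pixels until they account for a fraction `alpha` of the total intensity; mean thresholding keeps pixels at or above `alpha` times the mean. Boxes use inclusive `(top, left, bottom, right)` coordinates, and all names and numbers are hypothetical.

```python
# Toy sketch (assumed readings of the three thresholding rules, plus box IOU).

def value_threshold(heatmap, alpha):
    """Pixels whose min-max normalized value is at least alpha."""
    lo = min(min(r) for r in heatmap)
    hi = max(max(r) for r in heatmap)
    scale = (hi - lo) or 1.0
    return {(i, j) for i, r in enumerate(heatmap)
            for j, v in enumerate(r) if (v - lo) / scale >= alpha}

def energy_threshold(heatmap, alpha):
    """Most salient pixels that together hold a fraction alpha of the total energy."""
    pixels = sorted(((v, i, j) for i, r in enumerate(heatmap)
                     for j, v in enumerate(r)), reverse=True)
    total = sum(v for v, _, _ in pixels)
    kept, acc = set(), 0.0
    for v, i, j in pixels:
        if acc >= alpha * total:
            break
        kept.add((i, j))
        acc += v
    return kept

def mean_threshold(heatmap, alpha):
    """Pixels whose value is at least alpha times the mean intensity."""
    values = [v for r in heatmap for v in r]
    tau = alpha * sum(values) / len(values)
    return {(i, j) for i, r in enumerate(heatmap)
            for j, v in enumerate(r) if v >= tau}

def box_iou(a, b):
    """Intersection over union of inclusive boxes (top, left, bottom, right)."""
    def area(t, l, bt, rt):
        return max(0, bt - t + 1) * max(0, rt - l + 1)
    inter = area(max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    union = area(*a) + area(*b) - inter
    return inter / union

heatmap = [[0.0, 0.0, 0.0, 0.0],
           [0.0, 8.0, 4.0, 0.0],
           [0.0, 4.0, 2.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
by_value = value_threshold(heatmap, 0.5)
by_energy = energy_threshold(heatmap, 0.4)
by_mean = mean_threshold(heatmap, 2.0)
overlap = box_iou((1, 1, 2, 2), (1, 1, 1, 1))
```

A predicted box would then be the tightest box around the kept pixels, compared against the ground-truth box with `box_iou`.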
Table 1 confirms that our method performs reasonably and shows that the three thresholding techniques affect each method differently. Non-contrastive excitation backprop performs best when using energy and mean thresholding; however, our method performs best with value thresholding and is competitive when using the other methods: it beats gradient and guided backprop when using energy thresholding; beats LRP, CAM, and contrastive excitation backprop when using mean thresholding (recall from fig. 2 that the contrastive method is visually most similar to our masks); and outperforms Grad-CAM and occlusion for all thresholding methods.
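Table 1. Optimal thresholds $\alpha$ and localization error (%) for value, energy, and mean thresholding.
|Method|Value $\alpha$|Err (%)|Energy $\alpha$|Err (%)|Mean $\alpha$|Err (%)|
|Guid [16, 8]|0.05|50.2|0.30|47.0|4.5|42.0|
|C Exc|-|-|-|-|0.0|57.0|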
For pointing, table 2 shows that our method outperforms the center baseline, gradient, and guided backprop methods and beats Grad-CAM on the set of difficult images (images for which 1) the total area of the target category occupies less than a set fraction of the image and 2) there are at least two different object classes). We noticed qualitatively that our method did not produce salient heatmaps when objects were very small. This is due to L1 and TV regularization, which yield well-formed masks for easily visible objects. We test two variants of occlusion, blur and variable occlusion, to interrogate whether 1) the blur perturbation with smoothed masks is most effective, and 2) using the smallest, highly suppressive mask is sufficient (Occ and V-Occ in table 2 respectively). Blur occlusion outperforms all methods except contrastive excitation backprop, while variable occlusion outperforms all except contrastive excitation backprop and the other occlusion methods, suggesting that our perturbation choice of blur and our principle of identifying the smallest, highly suppressive mask are sound even if our implementation struggles on this task (see supplementary materials for examples and implementation details).
We propose a comprehensive, formal framework for learning explanations as meta-predictors. We also present a novel image saliency paradigm that learns where an algorithm looks by discovering which parts of an image most affect its output score when perturbed. Unlike many saliency techniques, our method explicitly edits the image, making it interpretable and testable. We demonstrate numerous applications of our method, and contribute new insights into the fragility of neural networks and their susceptibility to artifacts.
Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2956–2964, 2015.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196, 2015.
Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.