Regularized adversarial examples for model interpretability

11/18/2018 ∙ by Yoel Shoshan, et al. ∙ IBM

As machine learning algorithms continue to improve, there is an increasing need to explain why a model produces a certain prediction for a certain input. In recent years, several methods for model interpretability have been developed, aiming to explain which regions of the model input are the main reason for the model prediction. In parallel, a significant research community effort has gone in recent years into developing adversarial example generation methods for fooling models while not altering the true label of the input, as it would have been classified by a human annotator. In this paper, we bridge the gap between adversarial example generation and model interpretability, and introduce a modification to the adversarial example generation process which encourages better interpretability. We analyze the proposed method on a public medical imaging dataset, both quantitatively and qualitatively, and show that it significantly outperforms the leading known alternative method. Our suggested method is simple to implement, and can be easily plugged into most common adversarial example generation frameworks. Additionally, we propose an explanation quality metric - APE - "Adversarial Perturbative Explanation", which measures how well an explanation describes model decisions.




1 Introduction

As machine learning algorithms continue to improve, there is an increasing need to explain why a given model produces a certain prediction. The benefits of such an explanation fall roughly into two categories: explaining the end result to the user, and analysis of the network by the researcher. An example of the former is providing localized information on medical image classification, to understand which areas in a given image led a model to classify the image as malignant. An example of the latter is that, by seeing what caused the decisions of a model, one can understand what needs to be improved in the algorithm.

In recent years, several methods for detecting saliency have been developed, aiming to explain which regions of the model input are the main reason for the model prediction. Some of these methods [ribeiro2016should, 3, 1] focus on introducing perturbations into the input of a model and then analyzing the modified model prediction.

In parallel, significant effort has been put in recent years into developing adversarial example generation methods for fooling models, usually aiming to keep the “true label” of the input, as it would be classified by a human reader [8, 4]. The goal of such adversarial attacks is to identify unwanted behavior of a network and exploit it.

In this paper, we bridge the gap between the domains of network explanation and adversarial examples. We introduce a modification to the adversarial example generation process which aims to maximize interpretability. Similarly to the domain of adversarial attacks, we change the input in a way that affects the output of a model. Unlike previous works on model explanation, our method does not require any reference “deletion” image and does not require training an additional NN model. Additionally, our method provides explanation in the full resolution of the original input.

We analyze the proposed method on a public medical imaging dataset, both quantitatively and qualitatively, and show that it significantly outperforms the leading known alternative method.

This modification is simple to implement, and can be easily plugged into most common adversarial example generation frameworks. The resulting method is also applicable to non-image-related tasks.

The paper is organized as follows: Section 2 reviews related work; Section 3 introduces a new metric that we use to measure performance; Section 4 describes the proposed method; Section 5 presents experimental results; Section 6 discusses key differences from existing methods; and Section 7 concludes this work.

2 Related Work

When analyzing a model-based classifier (e.g. a neural network), one of the questions that arises is: given an input and a classification result, which parts of the input most affect the result? This gave rise to several methods of deriving saliency maps. In particular, several gradient-based techniques were proposed, such as [11], which computes the gradient of the classification loss with respect to the image and uses the gradient magnitude as a measure of saliency. However, this resulted in highly irregular saliency maps. This issue was addressed by Fong and Vedaldi [3]. There, the authors define perturbations of the input as a cross-fade between the original image and an image that represents deletion of data. Regularization terms were introduced when deriving the cross-fade “mask”, which was then used as a saliency map. Chang et al. [1] suggest an alternative method in which a generative model (Variational Auto-Encoder) attempts to “delete” a region by inpainting. Dabkowski and Gal [2] achieve real-time performance by training a network to directly generate saliency masks from images, with the training data still dependent on a mask and a reference image. Rey-de-Castro and Rabitz [9] suggest adding a perturbation generator network, a differentiable neural network component that learns to distort the input image.

An important related field of study, which has been gaining popularity in recent years, is adversarial attacks [4]. There, the original image is perturbed to modify a classifier prediction, in a way that will not be noticed by a human reader. Usually, small changes of the input are acceptable, as long as the “true label” (the label that a human would assign to this input) remains unchanged.

3 Proposed Metric

3.1 Saliency

Defining image saliency in the context of a neural network is a non-trivial matter. Informally, given a model, an input, and a model prediction based on the given input, the saliency map should explain how different parts of the input influenced the prediction. There is strong motivation to find saliency generation methods and metrics. A meaningful saliency map should help us see through the “black box” of the model. It may help us discover situations in which a classifier produced the right answers for the wrong reasons. A well-known example is the story of a classifier that was trained to detect camouflaged tanks and performed very well when looking only at classification results. When the results were analyzed, it was discovered that all of the camouflaged-tank images were taken on cloudy days, while the tank-less images were taken on sunny days. A proper saliency method would clearly have revealed this bias, and shown that the classifier did not really learn to detect the tanks at all. An additional example is explaining the decision of a medical imaging classifier. Such an explanation delivers more localized information on the decision, which may provide more information to the human reader, increase trust in the model, and even lead to the discovery of new features that humans have not yet noticed.

3.2 Proposed APE metric - “Adversarial Perturbative Explanation”

Ground truth (GT) based metrics, by themselves, do not represent explanation performance well. Firstly, we are explaining the prediction behavior of a model that may be flawed to begin with. In such a case it is desirable that the explanation reveals the weakness of the model. Secondly, there may be evidence outside the annotated GT object that may legitimately influence a model decision.

Therefore, to be able to quantitatively compare explanation methods, we formulate the APE (Adversarial Perturbative Explanation) metrics.

Perturbation-based explanation describes regions that affect the decision of a model, given a modified (perturbed) input. When creating a formal metric, we considered two main aspects. Requirement 1: In the spirit of Ockham’s razor, we want the explanation to be as “simple” as possible. Requirement 2: Additionally, it should suppress (or alternatively, maintain) class evidence, depending on whether an SDR (smallest destroying region) or an SSR (smallest sufficient region) [2], respectively, is required. We define two metrics corresponding to SDR and SSR, $APE_{SDR}$ and $APE_{SSR}$ respectively.

Note: when deriving this metric we were inspired by the loss formulation of Fong and Vedaldi [3], with the key difference that in one of the terms we use $L_0$ directly and not an approximation of it.


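3.2.1 $APE_{SDR}$

Let us consider a model $M$, an input $x$, a perturbed input $x'$ and a class index $c$. Firstly, we define $\hat{x}$ to be the clipped version of $x'$, constraining it to remain within the valid input value range. Let binarization be defined (element-wise) as:

$$B(v) = \begin{cases} 1 & v \neq 0 \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

We propose a saliency metric composed of the following terms:

  1. Classification term:

    $$M_c(\hat{x}) \tag{2}$$

    $M_c(\hat{x})$ being the prediction of the model given input $\hat{x}$ w.r.t. class index $c$. This term expresses the “destructiveness” of the saliency region w.r.t. class $c$.

  2. Sparsity term:

    $$\frac{1}{N} \|\hat{x} - x\|_0 = \frac{1}{N} \sum_i B(\hat{x}_i - x_i) \tag{3}$$

  3. Smoothness term:

    $$\frac{1}{N} \mathrm{TV}\big(B(\hat{x} - x)\big) \tag{4}$$

The overall proposed saliency metric is defined as follows:

$$APE_{SDR} = M_c(\hat{x}) + \frac{1}{N} \|\hat{x} - x\|_0 + \frac{1}{N} \mathrm{TV}\big(B(\hat{x} - x)\big) \tag{5}$$

where $N$ is the number of elements in $x$ and $\mathrm{TV}$ denotes total variation.

The classification term (Eq. 2) addresses requirement 2. The sparsity and smoothness terms (Eqs. 3 and 4, combined) address requirement 1.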
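The three SDR terms (classification, sparsity, smoothness) and their sum can be illustrated with a minimal sketch. It is not the authors' implementation: the total variation is assumed to be the sum of absolute differences between 4-neighbours, the valid input range is assumed to be [0, 1], and `toy_model`, `ape_sdr_terms` and the other names are hypothetical.

```python
def binarize(delta):
    """Support indicator: 1 where the perturbation is non-zero, else 0."""
    return [[1 if v != 0 else 0 for v in row] for row in delta]


def total_variation(mask):
    """Assumed TV form: sum of absolute differences between 4-neighbours."""
    h, w = len(mask), len(mask[0])
    tv = 0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(mask[i + 1][j] - mask[i][j])
            if j + 1 < w:
                tv += abs(mask[i][j + 1] - mask[i][j])
    return tv


def ape_sdr_terms(model, x, x_pert, lo=0.0, hi=1.0):
    """Return (classification, sparsity, smoothness) for a 2-D input."""
    # Clip the perturbed input to the (assumed) valid range [lo, hi].
    clipped = [[min(max(v, lo), hi) for v in row] for row in x_pert]
    delta = [[c - o for c, o in zip(c_row, o_row)]
             for c_row, o_row in zip(clipped, x)]
    mask = binarize(delta)
    n = len(x) * len(x[0])
    classification = model(clipped)
    sparsity = sum(sum(row) for row in mask) / n
    smoothness = total_variation(mask) / n
    return classification, sparsity, smoothness


def toy_model(img):
    """Stand-in classifier: the 'class probability' is the mean intensity."""
    return sum(sum(row) for row in img) / (len(img) * len(img[0]))


x = [[0.5] * 4 for _ in range(4)]
x_pert = [row[:] for row in x]
for i in (1, 2):
    for j in (1, 2):
        x_pert[i][j] = 0.1  # darken a 2x2 block in the centre

cls_term, sparsity, smoothness = ape_sdr_terms(toy_model, x, x_pert)
ape = cls_term + sparsity + smoothness
print(cls_term, sparsity, smoothness, ape)
```

In this toy case 4 of 16 elements change (sparsity 0.25) and the block boundary crosses 8 neighbour pairs (smoothness 0.5). With real floating-point perturbations, an exact `!= 0` test is fragile, so a small tolerance may be preferable.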


3.2.2 $APE_{SSR}$

Similarly to the SDR version, we define $\hat{x}$ to be the clipped version of the perturbed input $x'$, constraining it to remain within the valid input value range.

  1. Classification term:


    $M_c(\hat{x})$ being the prediction of the model given input $\hat{x}$ w.r.t. class index $c$. This term expresses how well the original model classification is preserved.

  2. Sparsity term:

  3. Smoothness term:


The overall proposed saliency metric is defined as follows:


$N$ = number of elements in $x$

3.2.3 Term weights

In some settings, it may prove useful to define coefficients that provide different weights to each term.

We formalize such term weighting in Eq. 10, with $\lambda_1$, $\lambda_2$ and $\lambda_3$ being the sparsity, smoothness and classification weights, respectively. We formalize it for $APE_{SDR}$; it can be symmetrically formalized for $APE_{SSR}$, which we omit for brevity.

$$APE_{SDR}^{\lambda} = \lambda_1 \cdot (\text{sparsity term, Eq. 3}) + \lambda_2 \cdot (\text{smoothness term, Eq. 4}) + \lambda_3 \cdot (\text{classification term, Eq. 2}) \tag{10}$$

In this paper, when measuring $APE_{SDR}$ based performance (Table 1), we only use $\lambda_1 = \lambda_2 = \lambda_3 = 1$ (Eq. 10), as we feel that this is appropriate for the examined domain and task. However, it is possible that in a different domain or task, different metric coefficients will be more suitable. For example, if it is known in advance that objects in the examined domain or task are relatively big, $\lambda_2$ can be increased and possibly $\lambda_1$ can be reduced. Such a modification will strongly favor smooth connected-component explanations and discourage overly sparse explanations.
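The effect of term weighting can be illustrated with a short sketch. The weight names and the term values below are hypothetical and chosen only to show how changing the weights can reverse which explanation scores better (lower is better).

```python
def weighted_ape(classification, sparsity, smoothness,
                 w_sparsity=1.0, w_smoothness=1.0, w_classification=1.0):
    """Weighted sum of the three metric terms (lower is better)."""
    return (w_sparsity * sparsity
            + w_smoothness * smoothness
            + w_classification * classification)


# Hypothetical term values for two explanations of the same prediction.
compact_blob = dict(classification=0.05, sparsity=0.20, smoothness=0.01)
scattered_pixels = dict(classification=0.05, sparsity=0.04, smoothness=0.10)

# Unit weights: the scattered explanation scores lower (0.19 vs 0.26).
unit = (weighted_ape(**compact_blob), weighted_ape(**scattered_pixels))

# Heavier smoothness weight, lighter sparsity weight: the blob now wins.
smooth_heavy = (
    weighted_ape(**compact_blob, w_sparsity=0.5, w_smoothness=4.0),
    weighted_ape(**scattered_pixels, w_sparsity=0.5, w_smoothness=4.0),
)
print(unit, smooth_heavy)
```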

4 Proposed Method

We focus only on optimizing w.r.t. $APE_{SDR}$, as we believe that in certain domains, such as medical imaging, SDR is preferred, since it explores less drastic modifications to the input and allows the model to consider a larger context.

Let $x'$ be the perturbed input. Since the support function is not continuously differentiable, we approximate the binarization of Eq. 1 by the following:


The result approximates a smoothed step function which receives negative values below (we found to achieve good results).
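As an illustration of such a smoothed step, here is a minimal sketch assuming a shifted tanh. The functional form, the parameter names `alpha` and `threshold`, and their default values are assumptions made for this example and are not taken from Eq. 11.

```python
import math


def smooth_binarize(delta, alpha=50.0, threshold=0.05):
    """Hypothetical smooth step on the perturbation magnitude.

    Negative for |delta| < threshold, zero at the threshold, and close to
    +1 for |delta| well above it. `alpha` controls the steepness.
    """
    return math.tanh(alpha * (abs(delta) - threshold))


for d in (0.0, 0.01, 0.05, 0.1, 0.5):
    print(d, round(smooth_binarize(d), 4))
```

Because the function is negative for small perturbations, summing it over all elements rewards near-zero perturbations, and thresholding it at zero separates perturbed from unperturbed elements.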

Inspired by Kurakin et al. [7], we formulate an optimization problem which aims to find the perturbed input $x'$ that minimizes the proposed metric (Eq. 5).

Phase 1: We initialize $x'$ to the original input $x$. Then, we iteratively modify $x'$ using the gradient of the loss w.r.t. the input, while keeping the model frozen.

Any gradient-based optimizer may be used, for example SGD [10], Adam [6] or AdaDelta [13]. For brevity we describe the SGD update step, while in practice we use the Adam [6] optimizer.


The loss is a smooth version of the saliency metric:


The first term reduces the classification value of the examined class. The second term approximates the size of the support of the perturbation, with two important differences. Firstly, as was mentioned previously, it is smooth w.r.t. the perturbed input. Secondly, very small values of the perturbation result in negative values of the smoothed binarization, decreasing the overall value of the second term. This should encourage close-to-zero perturbations over most of the image. This will also be useful later when we derive the saliency mask. The third term encourages smoothness of the perturbation, preferring continuous regions of non-zero values over scattered individual elements (pixels in the case of images).

On each iteration, after computing the update step (Eq. 12), we constrain the perturbed input to remain in the original valid value range, from which the input is sampled, by clipping its values. After either completing a defined number of iterations or reaching convergence, we derive the saliency mask by thresholding the smoothed binarization (Eq. 11) at zero:


We found that this is sufficient for the purpose of explanation. However, zeroing out some regions of the perturbation may increase the classification term, increasing the overall $APE_{SDR}$ score. We therefore introduce phase 2, which finds the smallest achievable classification probability (for the examined class), given that perturbations are only allowed within the mask derived in Eq. 14.
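Phase 1 can be sketched on a toy problem as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: a logistic model with a hand-written gradient stands in for the neural network, plain SGD is used (the paper uses Adam in practice), the smooth step is an assumed tanh with hypothetical parameters `alpha` and `thr`, the valid range is assumed to be [0, 1], and the smoothness term is omitted for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model_prob(x, w, b):
    """Toy frozen model: logistic regression over a 1-D input."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def phase1(x, w, b, lam=0.02, alpha=20.0, thr=0.05,
           lr=0.1, steps=200, lo=0.0, hi=1.0):
    """Minimise class probability plus a smooth sparsity penalty via SGD."""
    n = len(x)
    x_adv = list(x)  # initialise the perturbed input to the original input
    for _ in range(steps):
        p = model_prob(x_adv, w, b)
        for i in range(n):
            delta = x_adv[i] - x[i]
            # Gradient of the class probability w.r.t. input element i.
            grad_cls = p * (1.0 - p) * w[i]
            # Gradient of (lam / n) * tanh(alpha * (|delta| - thr)).
            sign = (delta > 0) - (delta < 0)
            t = math.tanh(alpha * (abs(delta) - thr))
            grad_sparse = lam / n * alpha * (1.0 - t * t) * sign
            # SGD step followed by clipping to the valid input range.
            step = x_adv[i] - lr * (grad_cls + grad_sparse)
            x_adv[i] = min(max(step, lo), hi)
    # Saliency mask: threshold the smooth step at zero.
    mask = [1 if math.tanh(alpha * (abs(a - o) - thr)) > 0 else 0
            for a, o in zip(x_adv, x)]
    return x_adv, mask


x = [0.9, 0.9, 0.5, 0.5, 0.5, 0.5]
w = [4.0, 4.0, 0.01, 0.01, 0.01, 0.01]  # only the first two inputs matter
b = -4.0
x_adv, mask = phase1(x, w, b)
print(mask, model_prob(x, w, b), model_prob(x_adv, w, b))
```

In this toy setting the two influential elements are driven to the lower clipping bound, while the sparsity penalty keeps the perturbation of the remaining elements below the threshold, so they are excluded from the mask.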

Phase 2: Small (below-threshold) perturbations of the input may also affect classification (and the classification term), and eliminating those perturbations in Eq. 14 may increase the overall loss. In order to guarantee that the classification loss remains small for the derived saliency mask, we introduce the second phase of the algorithm. The purpose of this stage is to minimize the classification term while allowing perturbations only within the mask derived in Eq. 14. For this purpose we find a perturbed image, similarly to Eq. 12, whose perturbation is non-zero only inside the mask:


This time, we only minimize the classification term.


In this second phase we drop both the sparsity and smoothness terms to allow unregulated changes to occur within the mask regions, compensating for omitting the out-of-mask changes.
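Phase 2 can be sketched in the same toy setting. The sketch assumes a logistic stand-in model, plain SGD, a valid range of [0, 1], and a restart from the original input; the starting point and all names are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model_prob(x, w, b):
    """Toy frozen model: logistic regression over a 1-D input."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def phase2(x, mask, w, b, lr=0.1, steps=200, lo=0.0, hi=1.0):
    """Minimise only the class probability, updating masked elements only."""
    x_adv = list(x)  # assumed starting point: the original input
    for _ in range(steps):
        p = model_prob(x_adv, w, b)
        for i in range(len(x)):
            if mask[i]:
                grad_cls = p * (1.0 - p) * w[i]
                x_adv[i] = min(max(x_adv[i] - lr * grad_cls, lo), hi)
    return x_adv


x = [0.9, 0.9, 0.5, 0.5]
w = [4.0, 4.0, 1.0, 1.0]
b = -4.0
mask = [1, 1, 0, 0]  # e.g. a mask obtained from phase 1
x_final = phase2(x, mask, w, b)
print(model_prob(x, w, b), model_prob(x_final, w, b), x_final)
```

Elements outside the mask keep their original values exactly, so any drop in the class probability is attributable to the masked region alone.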

5 Experiments

Figure 1: Column a: ground truth with the malignant lesion delineated in green; column b: meaningful perturbations mask [3]; column c: mask produced by the proposed method. As can be seen in Table 1, the proposed method changes much smaller parts of the input, while reaching a better classification term.

We compare our method with Fong and Vedaldi [3] both qualitatively and quantitatively. Quantitative comparison is based on two metrics. One is the metric discussed in Section 3 ($APE_{SDR}$). The other measures a ground truth (GT) match by counting the portion of the explanation mask CCs (Connected Components) that intersect with the tested object. The proposed method performs significantly better w.r.t. both metrics (Table 1).

To evaluate our method we selected the DDSM dataset [5], a digital database for screening mammography. The main reasons are: a. The dataset is publicly available; b. The images are high resolution, averaging around 6000x4000 pixels; c. The objects are of varying sizes, ranging from pixel-sized micro-calcifications to large tumors; d. Malignancy classification is a hard and non-trivial task, representing an interesting setting in which to explore explainability.

5.1 Model and experiment setting

For the purpose of the experiment we train a single inception-resnet-v3 [12] model on the per-image classification task of malignant vs. non-malignant images. The model architecture is modified to accept single-channel (grayscale) inputs. Feature extraction layers up to 2 layers after the “mixed_6a” layer are kept, followed by a fully connected layer of size 256, and finally a fully connected layer of size 2, followed by a softmax operation over the two classes “malignant” vs. “non-malignant”. This results in ~8M trainable parameters. Standard batch normalization was used. It is important to note that, during training, the model is never exposed to localization information, and is trained w.r.t. the overall image malignancy label (the presence of at least a single malignant finding).

We randomly split the dataset into an 80% train set (~8000 images) and a 20% validation set (~2000 images), making sure that no patient appears in both the train and validation sets. The model is tested on the validation set, on which it achieves a ROC AUC of ~0.8 in the mentioned classification task. The qualitative and quantitative (Table 1) results are calculated on the validation set.
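A patient-level split of this kind can be sketched as follows. The exact splitting procedure is not described in the text, so this is only one plausible approach: patients (rather than images) are shuffled and assigned to the validation set, and the mapping `image_to_patient` and the function name are hypothetical.

```python
import random


def split_by_patient(image_to_patient, val_fraction=0.2, seed=0):
    """Split image ids so that no patient appears in both sets."""
    patients = sorted(set(image_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_val = max(1, round(val_fraction * len(patients)))
    val_patients = set(patients[:n_val])
    train = [img for img, p in image_to_patient.items()
             if p not in val_patients]
    val = [img for img, p in image_to_patient.items()
           if p in val_patients]
    return train, val


# Hypothetical data: 10 patients with 4 images each.
image_to_patient = {f"img_{p}_{k}": f"patient_{p}"
                    for p in range(10) for k in range(4)}
train, val = split_by_patient(image_to_patient)
print(len(train), len(val))
```

Splitting at the patient level yields the requested image proportions only approximately when patients have different numbers of images.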

We explore several setups (Table 1) of both methods w.r.t. the $L_0$ approximation coefficient, the Total Variation (TV) coefficient and the Total Variation $\beta$ exponent [3]. Masks for Fong and Vedaldi [3] are thresholded with the T value that provides the best $APE_{SDR}$ score over the entire validation set, scanned at steps of 1e-6. Masks for our proposed method are calculated by thresholding the smoothed binarization (Eq. 11) at 0 (Eq. 14). We do not scan for additional values of the parameters of Eq. 11, as our initial “guess” worked well.

5.2 GT localization metric

The DDSM dataset contains localized GT (ground truth) information delineating malignant lesions. Since it is important to know whether the perturbation method managed to “fool” the model in nonsensical ways, especially in the context of adversarial example generation, we examine how well the explanation masks correlate with the actual findings in the images. For this purpose, we take each generated mask and measure what percentage of its CCs (connected components) have a non-zero intersection with GT lesions/objects. See Table 1.
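The CC hit rate can be sketched in a few lines. The sketch assumes 4-connectivity and binary masks given as nested lists; the connectivity choice and the function names are assumptions, since the text does not specify them.

```python
from collections import deque


def connected_components(mask):
    """4-connected components of a binary 2-D mask, as sets of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = set()
    components = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and (r, c) not in seen:
                comp = set()
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    i, j = queue.popleft()
                    comp.add((i, j))
                    for ni, nj in ((i + 1, j), (i - 1, j),
                                   (i, j + 1), (i, j - 1)):
                        if (0 <= ni < h and 0 <= nj < w
                                and mask[ni][nj] and (ni, nj) not in seen):
                            seen.add((ni, nj))
                            queue.append((ni, nj))
                components.append(comp)
    return components


def cc_hit_rate(mask, gt):
    """Percentage of mask components that intersect the GT region."""
    comps = connected_components(mask)
    if not comps:
        return 0.0
    hits = sum(1 for comp in comps if any(gt[i][j] for i, j in comp))
    return 100.0 * hits / len(comps)


mask = [[1, 1, 0, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0]]
gt = [[0, 1, 0, 0, 0],
      [0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0],
      [0, 0, 0, 0, 0]]
print(cc_hit_rate(mask, gt))
```

Here the mask has two components and only one of them touches the GT region, so the hit rate is 50%.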

Figure 2: Column a: ground truth with the malignant lesion delineated in green; column b: meaningful perturbations [3], smoothing perturbation; column c: meaningful perturbations [3], reduction-to-zero perturbation; column d: proposed method, adversarial perturbation. It is interesting to note that while our method generates adversarial noise-like patterns, they are mostly formed within a well-localized adversarial “patch”.

5.3 Results

As demonstrated in Table 1, our method achieves the best performance on both the $APE_{SDR}$ and GT localization metrics. Furthermore, we achieve top results on each separate component. Figure 1 shows 4 representative example images, demonstrating GT malignant lesions, the explanation mask of our proposed method and the explanation mask of an alternative method. Figure 2 shows what the adversarial perturbation images look like for the compared methods, zoomed in at the malignant lesion (the object that the model should search for). It can be seen how Fong and Vedaldi [3] either darkens or blurs regions localized at the lesion area, depending on the reference deletion image, while our proposed method generates localized adversarial “patches”. It can also be seen that our method generates significantly fewer false positive mask locations when compared with the GT, which explains why our proposed method achieved, quantitatively, better GT localization results (“GT localization results” column in Table 1).

| Method | $L_0$ approx. coeff | TV coeff | TV $\beta$ | $L_0$ | TV | classification | $APE_{SDR}$ | GT localization results: CCs hit rate (%) |
|---|---|---|---|---|---|---|---|---|
| MP min | 0.01 | 0.2 | 1 | 0.9947 | 0.0019 | 0.0044 | 1.001 | |
| MP min | 0.01 | 0.2 | 3 | 0.5254 | 0.0059 | 0.0027 | 0.5340 | |
| MP min | 6 | 120 | 1 | 0.6699 | 0.0002 | 0.0883 | 0.7585 | |
| MP min | 40 | 120 | 1 | 0.0606 | 0.0002 | 0.0881 | 0.1489 | |
| MP blur | 0.01 | 0.2 | 1 | 0.9958 | 0.0042 | 0.0137 | 1.0137 | |
| MP blur | 0.01 | 0.2 | 3 | 0.5487 | 0.0100 | 0.009 | 0.5680 | |
| MP blur | 6 | 120 | 1 | 0.7240 | 0.0130 | 0.0927 | 0.8297 | |
| MP blur | 40 | 120 | 1 | 0.191 | 0.0009 | 0.0760 | 0.2679 | |
| ours | 0.01 | 0.2 | - | 0.0203 | 0.0027 | 0.0003 | 0.0234 | |
| ours | 6 | 120 | - | 0.0004 | 0.0001 | 0.0286 | 0.0291 | |
| ours | 40 | 120 | - | 0.0001 | 0.0001 | 0.0174 | 0.0176 | |

Table 1: The first three value columns give the experiment setting; the remaining columns give the results. MP min = Meaningful Perturbation when using a reference deletion image containing a constant value equal to the minimal value in the original image. MP blur = Meaningful Perturbation when using a reference deletion image containing a blurred version of the original image.

5.4 Other methods

In addition to Fong and Vedaldi [3], we tried to obtain performance results for Rey-de-Castro and Rabitz [9]. However, no matter which hyperparameters we tried, we could not make it generate meaningful explanations: we got either an almost full or a completely empty mask. We believe that this is due to the relatively small capacity of the perturbation generator architecture and its inability to capture complex objects because its context is too local. We believe this can also be seen in the visualizations of Rey-de-Castro and Rabitz [9], where it appears that the perturbation generator, due to lack of capacity, resorted to distorting most edges in the image, often regardless of correlation with the relevant GT object. This can be seen, for example, in Figure 3 in [9], where non-cat-related edges are highlighted in addition to the cat-related edges. Medical images, however, are rich with edges, and the vast majority of them are irrelevant to potential malignancy. We tried to increase the model capacity by increasing the number of layers and the number of input/output channels (excluding the first and the last layers, which consist of a single output channel), but it did not help the model converge. We tried between 1 and 30 layers, and between 1 and 128 convolution filters per layer. It is possible that a larger model and/or a different architecture for the perturbation generator would be able to converge; however, we did not manage to find one.

6 Difference from existing methods

In this section we will compare the proposed method with some of the state-of-the-art methods in the field.

In [3], Fong and Vedaldi describe a loss function which is similar to ours (namely, classification, smoothness and sparsity approximation). However, there are several main differences between the methods:

  1. The requirement to provide a “deletion” image, for example a blurred version of the input. It is not always clear what the best way to delete information is, especially in domains outside natural images. A dark or blurry region in a medical imaging setting does not always equal a lack of evidence or objects. Our proposed method does not require providing any reference “deletion” image. Additionally, since in practice their described method performs an element-wise cross-fade between the original input and the reference “deletion” image, the possible pixel values are limited. As an extreme example, if both the input and the reference “deletion” image have the same value at some element, their method will not be able to provide an explanation that perturbs this element at all. Furthermore, non-visual features such as clinical data records or stock market values may have complex relations between them, and it is not clear what the reference “deletion” values should be.

  2. In some settings it is better to provide explanations in the full resolution of the original input. An example is calcifications (accumulations of calcium salts in body tissue) as seen on X-ray, which may be as small as a single pixel, but provide valuable medical evidence. Our proposed method converges well at full resolution and does not seem to overfit, unlike what Fong and Vedaldi describe.
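The value-range limitation of the cross-fade described in the first item can be illustrated with a small sketch. It assumes the convention that a mask value of 1 keeps the original element and 0 replaces it with the reference; the function name and the toy values are hypothetical.

```python
def cross_fade(x, reference, mask):
    """Element-wise blend: mask 1 keeps x, mask 0 uses the reference."""
    return [m * xi + (1.0 - m) * ri
            for xi, ri, m in zip(x, reference, mask)]


x = [0.25, 0.75]
reference = [0.25, 0.0]  # first element equals the input value

outputs = [cross_fade(x, reference, [m, m]) for m in (0.0, 0.25, 0.5, 1.0)]
for out in outputs:
    print(out)
```

Whatever mask value is chosen, the first element stays at 0.25 because the input and the reference coincide there, and the second element can only move between 0.0 and 0.75. A direct perturbation of the input has no such restriction beyond the valid value range.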

In [1], Chang et al. introduce a different way to suppress the information within a region of an image. They train a generative model (Variational Auto-Encoder) to impute information inside a mask. Training a generative model (such as a GAN or VAE) is a non-trivial matter, and our proposed method does not require it.

Rey-de-Castro and Rabitz [9] propose to train an additional model to generate image perturbations. While there are several similarities to our proposed method, there are key differences:

  1. Their method requires an additional perturbation generator model. Choosing an appropriate architecture may prove difficult. On one hand, an overly simplistic model may not be sufficient to capture the desired behavior, due to low capacity and/or lack of large enough context. On the other hand, a complex model may be difficult to train. In contrast, our proposed method does not require constructing any additional model architecture, as we perturb the values of the input directly. It is worth noting that our modification of the input is non-linear, as it originates from gradient propagation through a non-linear function (the original model).

  2. The loss function proposed in [9] lacks a total variation term, and therefore does not encourage “simple” explanations.

  3. Rey-de-Castro and Rabitz optimize for $L_1$, while we minimize a closer approximation of $L_0$ (Eq. 11).

Additionally, as we describe in section 5.4 it proved difficult, in practice, to make the perturbation generator provide meaningful explanations that go beyond perturbing most edges in the image, regardless of their relation to any object.

7 Conclusions

In this paper we presented a new method for explaining the decision of a given model on a specified input. The method is based on adversarial example generation, which, when constrained to simple changes, is shown to provide well-localized, meaningful explanations. We test the method on a public dataset of breast mammogram images, and show that it significantly outperforms the current state of the art in the field, both quantitatively, using heuristic metrics and ground-truth-based comparison, and qualitatively.

While the analyzed network is never exposed to localization information, the proposed explanation method also extracts meaningful local cues. Extending this functionality within the framework of weakly-supervised segmentation is part of currently ongoing work.