GANMEX: One-vs-One Attributions using GAN-based Model Explainability

by Sheng-Min Shih, et al.

Attribution methods have been shown to be promising approaches for identifying key features that led to learned model predictions. While most existing attribution methods rely on a baseline input for performing feature perturbations, limited research has been conducted to address the baseline selection issue. Poor choices of baselines limit the ability to provide one-vs-one explanations for multi-class classifiers, i.e., the attribution methods cannot explain why an input belongs to its original class and not another specified target class. Achieving one-vs-one explanations is crucial when certain classes are more similar than others, e.g. two bird types among multiple animals, because it focuses on key differentiating features rather than features shared across classes. In this paper, we present GANMEX, a novel approach that applies Generative Adversarial Networks (GANs) by incorporating the to-be-explained classifier as part of the adversarial networks. Our approach effectively selects the baseline as the closest realistic sample belonging to the target class, which allows attribution methods to provide true one-vs-one explanations. We show that GANMEX baselines improve the saliency maps and lead to stronger performance on perturbation-based evaluation metrics than existing baselines. Existing attribution results are known to be insensitive to model randomization, and we demonstrate that GANMEX baselines lead to better outcomes under cascading randomization of the model.



1 Introduction

Modern Deep Neural Network (DNN) designs have been advancing the state-of-the-art performance of numerous machine learning tasks with the help of increasing model complexity, which at the same time reduces model transparency. The need for explainable decisions is crucial for earning the trust of decision makers, is required for regulatory purposes (Goodman and Flaxman (2017)), and is extremely useful for development and maintainability.

Consequently, various attribution methods were developed to explain DNN decisions by attributing an importance weight to each input feature. At a high level, most attribution methods, such as Integrated Gradient (IG) (Sundararajan et al. (2017)), DeepSHAP (Lundberg and Lee (2017)), DeepLIFT (Shrikumar et al. (2017)) and Occlusion (Zeiler and Fergus (2013)), alter the features between the original values and the values of some baseline instance, and accordingly highlight the features that impact the model's decision. While extensive research has been conducted on the attribution algorithms, research regarding the selection of baselines is rather limited, and it is typically treated as an afterthought. Most existing methodologies by default apply a uniform-value baseline, which can dramatically impact the validity of the feature attributions (Sturmfels et al. (2020)); as a result, existing attribution methods show rather unperturbed output even after complete randomization of the DNN (Adebayo et al. (2018)).

In a multi-class classification setting, existing baseline choices do not allow specifying a target class, and this has limited the ability to provide a class-targeted or one-vs-one explanation, i.e., explaining why the input belongs to class A and not a specific class B. These explanations are crucial when certain classes are more similar than others, as often happens when the classes have a hierarchy among them. For example, in a classification task of apples, oranges and bananas, a model decision between apple and orange should be based on color rather than shape, since both an apple and an orange are round. This would intuitively only happen when asking for an explanation of 'why apple and not orange' rather than 'why apple'.

In this paper, we present GAN-based Model EXplainability (GANMEX), a novel methodology for generating one-vs-one explanations by leveraging GANs. In a nutshell, we use GANs to produce a baseline image that is a realistic instance of a target class resembling the original instance. A naive use of GANs can be problematic because the explanation generated would not be specific to the to-be-explained DNN. We lay out a well-tuned recipe that avoids these problems by incorporating the classifier as a static part of the adversarial networks and adding a similarity loss function for guiding the generator. We show in the ablation study that both swapping in the DNN and adding the similarity loss are critical for producing correct explanations. To the best of our knowledge, GANMEX is the first method to apply GANs to one-vs-one explanations of DNN decisions.

We show that GANMEX baselines can be used with a variety of attribution methods, including IG, DeepLIFT, DeepSHAP and Occlusion, to produce one-vs-one attributions superior to those of existing approaches. GANMEX outperformed the existing baseline choices on perturbation-based evaluation metrics and showed more desirable behavior under the sanity checks of randomizing DNNs. Beyond its obvious advantage for one-vs-one explanations, we show that by replacing only the baselines, without changing the attribution algorithms, GANMEX greatly improves the saliency maps for binary classifiers, where one-vs-one and one-vs-all are equivalent.

2 Related Works

2.1 Attribution Methods and Saliency Maps

Attribution methods and their visual form, saliency maps, have been commonly used for explaining DNNs. Given an input x and the model output F_c(x) for class c, an attribution method assigns a contribution value R_i(x) to each pixel x_i. There are two major attribution method families: local attribution methods based on infinitesimal feature perturbations, such as gradient saliency (Simonyan et al. (2014)) and gradient*input (Shrikumar et al. (2016)), and global attribution methods based on feature perturbation with respect to a baseline input (Ancona et al. (2018)). We focus on global attribution methods since they tackle the gradient discontinuity issue of local attribution methods, and they are known to be more effective at explaining the marginal effect of a feature's existence (Ancona et al. (2018)). We discuss five popular global attribution methods below:

Integrated Gradient (IG) (Sundararajan et al. (2017)) calculates a path integral of the model gradient from a baseline image x' to the input image x: IG_i(x) = (x_i − x'_i) ∫₀¹ ∂F(x' + α(x − x'))/∂x_i dα. The baseline is commonly chosen to be the zero input, and the integration path is the straight line between the baseline and the input.
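The path integral above is typically approximated with a Riemann sum along the interpolation path. A minimal NumPy sketch (the function name and the `grad_fn` callable are our own illustrative choices, not from the paper):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    """IG_i(x) = (x_i - x'_i) * integral_0^1 dF/dx_i(x' + a(x - x')) da,
    approximated with a Riemann sum along the straight path from the
    baseline x' to the input x. grad_fn returns dF/dx at a given point."""
    total = np.zeros_like(x, dtype=float)
    for a in np.linspace(0.0, 1.0, steps):
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * (total / steps)
```

For a linear model the averaged gradient is exact, and the attributions satisfy the completeness property: they sum to F(x) − F(x').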

DeepLIFT (Shrikumar et al. (2017)) addresses the discontinuity issue by performing backpropagation and assigning a score C_{Δx_i Δt} to each neuron in the network, based on the difference Δx_i of its input from that of the baseline and the difference Δt of its activation from that of the baseline, satisfying the summation-to-delta property Σ_i C_{Δx_i Δt} = Δt.

Occlusion (Zeiler and Fergus (2013); Ancona et al. (2018)) applies full-feature perturbations by removing each feature and calculating the impact on the DNN output. The feature removal is performed by replacing its value with zero, meaning an all-zero input is implicitly used as the baseline.
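As a sketch of this full-feature perturbation, the following NumPy snippet (names are ours) scores each feature by the drop in model output when that feature is set to the baseline value:

```python
import numpy as np

def occlusion_attribution(predict, x, baseline_value=0.0):
    """Score each feature by the drop in model output when it is
    replaced by the baseline value (an implicit zero baseline)."""
    base_score = predict(x)
    attributions = np.zeros(x.shape, dtype=float)
    for i in range(x.size):
        perturbed = x.astype(float).copy()
        perturbed.flat[i] = baseline_value
        attributions.flat[i] = base_score - predict(perturbed)
    return attributions
```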

DeepSHAP (Chen et al. (2019); Lundberg and Lee (2017); Shrikumar et al. (2017)) was built upon the framework of DeepLIFT but connects the multipliers of the attribution rule (rescale rule) to SHAP values, which are computed by 'erasing features'. The operation of erasing one or more features requires the notion of a background, which is defined by either a distribution (e.g. a uniform distribution over the training set) or a single baseline instance. For practical reasons, it is common to choose a single baseline instance to avoid having to store the entire training set in memory.

Expected Gradient (Erion et al. (2019)) is a variant of IG that calculates the expected attribution over a prior distribution of baseline inputs, usually approximated by the training set: EG_i(x) = E_{x'∼D}[IG_i(x; x')], where D is the uniform distribution over the training samples. In other words, the fixed baseline of IG is replaced with a uniform distribution over the samples in the training set.
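Expected Gradient can be sketched as IG averaged over baselines drawn from the training set. The snippet below (names ours; it averages over all supplied samples rather than Monte Carlo sampling, for clarity):

```python
import numpy as np

def expected_gradient(grad_fn, x, baselines, steps=50):
    """EG_i(x) = E_{x'~D}[IG_i(x; x')]: Integrated Gradients attributions
    averaged over baseline samples drawn from the training set."""
    total = np.zeros_like(x, dtype=float)
    for b in baselines:
        # Riemann-sum IG for this baseline
        path_grads = [grad_fn(b + a * (x - b))
                      for a in np.linspace(0.0, 1.0, steps)]
        total += (x - b) * np.mean(path_grads, axis=0)
    return total / len(baselines)
```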

A crucial property of the above methods is their need for a baseline, which is either explicitly or implicitly defined. In what follows we show that these methods are greatly improved by modifying their baseline to that chosen by GANMEX.

2.2 The Baseline Selection Problem

Limited research has been done on the problem of baseline selection so far. A simple "most natural input", such as zero values for all numerical features, is commonly chosen as the baseline. For image inputs, uniform images with all pixels set to the max/min/median value are commonly chosen. These static baselines frequently cause the attribution to focus on, or even overly highlight, areas where the feature values differ from the baseline values, and to hide the feature importance where the input values are close to the baseline values (Sundararajan and Taly (2018); Adebayo et al. (2018); Kindermans et al. (2017); Sturmfels et al. (2020)).

Several non-static baselines have been proposed in the past, but each suffers from its own downsides (Sturmfels et al. (2020)). Fong and Vedaldi (2017) used blurred images as baselines, but the results are biased toward highlighting high-frequency information in the input. Bach et al. (2015) make use of the training samples by finding the training example belonging to the target class that is closest to the input in Euclidean distance. Even though the concept of minimum distance is highly desirable, in practice nearest-neighbor selection in a high-dimensional space can frequently lead to poor outcomes, and most of the nearest neighbors are rather distant from the original input.
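The nearest-neighbor baseline of Bach et al. (2015), referred to later in the paper as MDTS, can be sketched in a few lines of NumPy (function name ours):

```python
import numpy as np

def mdts_baseline(x, train_X, train_y, target_class):
    """Minimum-distance training sample (MDTS): the training example
    of the target class closest to x in Euclidean distance."""
    candidates = train_X[train_y == target_class]
    dists = np.linalg.norm(candidates - x, axis=1)
    return candidates[np.argmin(dists)]
```

In high-dimensional image spaces even this nearest neighbor is typically far from the input, which motivates generating a baseline instead of retrieving one.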

Along the same concept, expected gradient simply samples over all training instances instead of identifying the closest instance (Erion et al. (2019)). Expected gradient benefits from ensembling in a way similar to that of SmoothGrad, which averages over multiple saliency maps produced by imposing Gaussian noise on the original image (Smilkov et al. (2017); Hooker et al. (2019)). We claim however that averaging over the training set does not solve the issue; for example, due to the foreground being located in different sections of the images, the average image would often resemble a uniform baseline.

2.3 One-vs-One and One-vs-All Attribution

In multi-class settings, while a one-vs-all explanation is designed to explain why the input x belongs to its original class c_s and not the others, a one-vs-one explanation aims to provide an attribution that explains why x belongs to c_s and not a specified target class c_t. Most existing attribution methods were primarily designed for one-vs-all explanation, but it has been proposed to extend them to one-vs-one by simply calculating the attribution with respect to the difference of the original class score and the target class score, F_{c_s}(x) − F_{c_t}(x) (Bach et al. (2015); Shrikumar et al. (2017)).

It is easy to think of examples where this somewhat naive formulation will not provide a true one-vs-one explanation. Taking the example of fruit classification, for both apples and oranges the explanation could easily be the round shape, and taking the difference between those will result in an arbitrary attribution. We claim that without a class-targeted baseline, the modified attributions do not provide a true one-vs-one explanation. Take IG for example, computed over F_{c_s} − F_{c_t}. With a zero baseline, the target class score F_{c_t} and its gradient will likely stay close to zero along the straight path from the input to the zero baseline, because the interpolated instances never belong to the target class. With this in mind, the one-vs-one explanation is not very informative with respect to the target class c_t.

A few class-targeted baselines have been proposed in the past. The minimum distance training sample (MDTS) described in Section 2.2 is class-targeted, as the sample is selected from the designated target class. While the original expected gradient was defined for one-vs-all explanation only, we extended the method to one-vs-one by sampling the baselines only from the target class. However, as mentioned in Section 2.2, MDTS is frequently hindered by the sparsity of the training set in the high-dimensional space, and expected gradient suffers from undesired effects caused by uncorrelated training samples. The problem of baseline selection, especially for one-vs-one explainability, poses a challenge, because the ideal baseline choice can simply be absent from the training set.

2.4 GAN and Image-to-Image Translation

Image-to-image translation is a family of GANs originally introduced by Isola et al. (2017) for creating mappings between two domains of data. While corresponding pairs of images are rare in most real-world datasets, Zhu et al. (2017) made the idea widely applicable by introducing a reconstruction loss to tackle tasks with unpaired training datasets. Since then, more efficient and better-performing approaches have been developed to improve few-shot performance (Liu et al. (2019)) and output diversity (Choi et al. (2020)). Nevertheless, we found the StarGAN variant proposed by Choi et al. (2017) specifically applicable to the baseline selection problem because of its standalone class discriminator in the adversarial networks as well as its deterministic mapping that preserves the styles of the translated images (Choi et al. (2020)). To the best of our knowledge, GANs have not previously been applied to explaining DNNs.

Prior to our work, Chang et al. (2018) proposed the fill-in the dropout region (FIDO) method and suggested generators including CA-GAN (Yu et al. (2018)) for filling in the masked area. However, the CA-GAN generation was designed for calculating the smallest sufficient region and smallest destroying region (Dabkowski and Gal (2017)), which only produce one-vs-all explanations. FIDO is computationally expensive, as an optimization task is required for each attribution map. The fill-in method requires an unmasked area for reference, and hence works only for a small subset of attribution methods. More importantly, FIDO is highly dependent on the generator's capability of recreating the image based on partially masked features. With pre-trained generators like CA-GAN, we argue that the resulting saliency map is more associated with the pre-trained generator than with the classifier itself.

3 GAN-based Model Explainability

It has been previously established that attribution methods are less sensitive to features where the input values are the same as the baseline values, and more sensitive to those where the input values and the baseline values differ (Adebayo et al. (2018); Sundararajan and Taly (2018)). Therefore, we expect a well-chosen baseline to differ from the input only on the key features. A good candidate for achieving this is a sample in the target class with minimum distance to the input.

Formally, for a one-vs-one attribution problem (x, c_s, c_t), we define the class-targeted baseline to be the closest point in the input space (not limited to the training set) that belongs to the target class:

x* = argmin_{x' ∈ S_t} d(x, x')

Here, S_t is the set of realistic examples in the target class, and d is the Euclidean distance. By using this baseline, the attribution provides an explanation of why input x belongs to its original class and not class c_t. Now, since it is not realistic to optimize within the actual set S_t, we work with a softer version of Equation 1: x* = argmin_{x'} d(x, x') − λ log P(x' ∈ S_t), where P(x' ∈ S_t) represents the probability of x' belonging to the target class. Given a classifier C, we have the estimate C_t(x') of the probability of a realistic image x' to be in class c_t. In order to make use of this, we decompose P(x' ∈ S_t) = P_real(x') · C_t(x'), where P_real(x') indicates the probability of x' being a realistic image. We end up with the following objective for the baseline instance:

x* = argmin_{x'} d(x, x') − λ log P_real(x') − λ log C_t(x')
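Conceptually, the soft objective could be approached by gradient descent on the baseline candidate. The toy sketch below is entirely illustrative: `target_prob`/`target_grad` are hypothetical callables standing in for a differentiable class-probability estimate and its gradient, and the realism term is omitted:

```python
import numpy as np

def soft_baseline_search(x, target_prob, target_grad, lam=1.0,
                         lr=0.1, steps=200):
    """Toy gradient descent on min_z ||x - z||^2 - lam * log p_t(z),
    where p_t is a differentiable stand-in for the target-class
    probability. target_prob(z) returns p_t(z); target_grad(z)
    returns its gradient. The realism term is omitted here."""
    z = x.astype(float).copy()
    for _ in range(steps):
        # d/dz of the squared distance, minus d/dz of lam * log p_t(z)
        grad = 2.0 * (z - x) - lam * target_grad(z) / max(target_prob(z), 1e-8)
        z = z - lr * grad
    return z
```

In the paper this optimization is not run per-instance; the GAN generator amortizes it, which is the subject of the next section.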

Figure 1:

Intuition for using GANs to generate class-targeted baselines on the SVHN dataset. Without GANs, a closest target class sample can easily be unrealistic (a), while the GAN helps confine the sample to the realistic sample space (b). The MDTS baseline and other training samples used in expected gradient can be very different from the input (c). (d) shows the zero baseline, which is the most commonly used.

3.1 Applying StarGAN to the Class-Targeted Baseline

Here we introduce GAN-based Model EXplainability (GANMEX), which uses a GAN to generate the class-targeted baselines. Given an input x and a target class c_t, GANMEX aims to generate a class-targeted baseline that achieves the following three objectives:

  1. The baseline belongs to the target class (with respect to the classifier).

  2. The baseline is a realistic sample.

  3. The baseline is close to the input.

To further explain the need for all three objectives, we point the reader to Figure 1. The GANMEX baseline represents the "closest realistic target class baseline". Without the assistance of GANs, the selected baseline can easily fall into the domain of unrealistic images. A naive fix would be to choose a realistic image from the training set, but that image will not be close to the input. Finally, for true one-vs-one explainability we need the baseline to belong to a specific target class.

We chose StarGAN (Choi et al. (2017)) as the method for computing the above baseline, or rather the generator function that produces it. Although many image-to-image translation methods could be applied to do so, StarGAN inherently works with multi-class problems and allows for a natural way of using the already-trained classifier as a discriminator, rather than having us train a different discriminator.

StarGAN provides a scalable image-to-image translation approach by introducing (1) a single generator G accepting an instance x and a class c, and producing a realistic example G(x, c) in the target class, and (2) two separate discriminators: D_src for distinguishing between real and fake images, and D_cls for determining which class an image belongs to. It introduces the following loss functions:

L_adv = E_x[log D_src(x)] + E_{x,c}[log(1 − D_src(G(x, c)))]
L_cls_real = E_{x,c'}[−log D_cls(c' | x)]
L_cls_fake = E_{x,c}[−log D_cls(c | G(x, c))]
L_rec = E_{x,c,c'}[|| x − G(G(x, c), c') ||_1]

Here, E denotes the average over the variables in its subscript, where x is an example in the training set, c' is its original class, and c is a target class. L_adv is the standard adversarial loss function between the generator and the discriminators, L_cls_real and L_cls_fake are domain classification loss functions for real images and fake images, respectively, and L_rec is the reconstruction loss commonly used for unpaired image-to-image translation to make sure that two opposite generation actions lead back to the original input. The combined loss functions for the generator and the discriminator are

L_D = −L_adv + λ_cls L_cls_real
L_G = L_adv + λ_cls L_cls_fake + λ_rec L_rec

Figure 2:

Saliency maps for multi-class datasets (MNIST and SVHN) generated with various baselines, including zero baseline (Zero), MDTS and GANMEX.

The optimization procedure for StarGAN alternates between modifying the discriminators D_src and D_cls to minimize L_D, and the generator G to minimize L_G.

Equation 8 is almost analogous to Equation 2. The domain classification term corresponds to −log C_t(x'), and the adversarial term corresponds to −log P_real(x'). There is a mismatch between the reconstruction term L_rec and the distance term d(x, x'): one forces the generator to be invertible, while the other forces the generated image to be close to the original. We found that the L_rec term is useful for encouraging the convergence of the GAN. However, a similarity term is also needed in order for the baseline image to be close to the original input; this allows for better explainability. We show in what follows (Figure 5.B, Appendix B) that without this similarity term, the generated image can indeed be farther away from the original. Other than the added similarity term, for GANMEX we replace the class discriminator D_cls with the classifier C, since, as mentioned above, this way the generator provides a baseline adapted to our classifier. Concluding, we optimize the following term for the generator


L_G = L_adv + λ_cls L_cls_fake(C) + λ_rec L_rec + λ_sim E_{x,c}[|| x − G(x, c) ||_1]

where L_cls_fake(C) is short for the domain classification loss computed with the classifier C in place of D_cls. Notice that we use the L1 distance rather than L2 for the similarity loss, because the L2 distance leads to blurry outputs for image-to-image translation algorithms (Isola et al. (2017)). Other image-to-image translation approaches could potentially select baselines satisfying criteria (2) and (3) above, but they lack the replaceable class discriminator component, which is crucial for explaining the already-trained classifier. We provide several ablation studies in Appendix B, where we show that without incorporating the to-be-explained classifier into the adversarial networks, the GAN-generated baselines fail the randomization sanity checks. We provide more implementation details, including hyper-parameters, in Appendix A.2.
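For illustration, the combined generator objective can be sketched as follows. This is a simplified NumPy stand-in, not the paper's implementation: the adversarial term uses a plain cross-entropy form, `source` is a hypothetical parameter naming the reconstruction direction, and the λ weights are placeholders:

```python
import numpy as np

def ganmex_generator_loss(x, target, generator, d_src, clf, source=0,
                          lam_cls=1.0, lam_rec=10.0, lam_sim=10.0):
    """Sketch of the GANMEX generator objective: an adversarial term
    (cross-entropy stand-in), a classification term from the frozen
    to-be-explained classifier, the StarGAN reconstruction term, and
    the added L1 similarity term. All weights are placeholder values."""
    fake = generator(x, target)
    adv = -np.log(max(d_src(fake), 1e-8))        # fool the real/fake critic
    cls = -np.log(max(clf(fake, target), 1e-8))  # land in the target class
    rec = float(np.abs(generator(fake, source) - x).mean())  # invertibility
    sim = float(np.abs(fake - x).mean())         # stay close to the input
    return adv + lam_cls * cls + lam_rec * rec + lam_sim * sim
```

Note how the classifier `clf` is queried but never updated, mirroring the paper's use of the trained classifier as a static component of the adversarial networks.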

4 Experiments

In what follows we experiment with the datasets MNIST (LeCun and Cortes (2010)), Street-View House Numbers (SVHN) (Netzer et al. (2011)), and apple2orange (Zhu et al. (2017)). Further details about the datasets and classifiers are given in Appendix A.1.

Our techniques are designed to improve any global attribution method by providing an improved baseline. In our experiments we consider four attribution methods - IG, DeepLIFT, Occlusion, and DeepSHAP. The baselines we consider include the zero baseline (the default baseline in all four methods), minimum distance training sample (MDTS), and the GANMEX baseline. We also compared our results with a modified version of expected gradient aimed to provide 1-vs-1 explanations, which runs IG over a randomly chosen target class image from the training set, as opposed to a random image from the training set.

4.1 One-vs-one attribution for Multi-class Classifiers

We tested one-vs-one attribution on two multi-class datasets, MNIST and SVHN. As shown in Figure 2, the GANMEX baseline successfully identified the closest transformed image in the target class as the baseline. Take explaining 'why 0 and not 6' for example: the ideal baseline would keep the 'C'-shaped part unchanged and only erase the top-right corner and complete the lower circle, which is exactly what GANMEX achieved. Limited by the training set, MDTS baselines were generally more different from the input image. Therefore, the explanations made with respect to GANMEX baselines were more focused on the key features compared to those of the MDTS baseline and expected gradient. We observed the same trends across more digits, where GANMEX helps IG, DeepLIFT, Occlusion and DeepSHAP disregard the common strokes between the original and targeted digits and focus only on the key differences. The outperformance of GANMEX was even more obvious on the SVHN dataset, where the numbers can have any font, color, and background. Notice that both the zero baseline and the training set baseline cause the explanation to place a lot of focus on the background. In contrast, the GANMEX explanation focuses only on the digit and, furthermore, only on the key features that would cause the digit to change.

Zero baselines, on the other hand, were generally unsuccessful at producing one-vs-one explanations. The attributions on MNIST look similar to the original input and ignore everything in the background, and the attributions on SVHN were rather noisy. As shown in Figure 6, attributions based on zero baselines changed only marginally with different target classes. This shows that purposely designed class-targeted baselines are required for meaningful one-vs-one explanations.

Perturbation-based evaluation

We followed the perturbation-based evaluation suggested by Bach et al. (2015), which flips input features starting from the ones with the highest saliency values and evaluates the cumulative impact on the score delta, as proposed by Shrikumar et al. (2017). Flipping a feature means replacing its value v with 1 − v, assuming all features are normalized to [0, 1]. A desired behavior of the attribution map is that the score delta decreases as rapidly as possible as we flip the features one by one. We provide in Figure 3.A-B the perturbation curves for both MNIST and SVHN, plotting the score delta as a function of the number of flipped features. It is clear that by using a GANMEX baseline rather than the alternative zero baseline, the descent of the curve is much faster, meaning that we successfully capture the most important features using GANMEX. This holds true for all attribution methods. As a side note, notice that in SVHN, when we flip all features, the score delta goes back to where it was in the beginning rather than going down to zero. This is due to the fact that once all features are flipped, we are back to having the same digit as before.
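The perturbation curve can be sketched as follows (NumPy, names ours): features are flipped in descending saliency order and the model score is recorded after each flip:

```python
import numpy as np

def perturbation_curve(score_fn, x, saliency):
    """Flip features in descending saliency order (v -> 1 - v, for
    inputs normalized to [0, 1]) and record the score after each flip."""
    order = np.argsort(-saliency.ravel())
    z = x.astype(float).copy()
    curve = [score_fn(z)]
    for i in order:
        z.flat[i] = 1.0 - z.flat[i]
        curve.append(score_fn(z))
    return np.array(curve)
```

A good attribution map makes this curve drop quickly: the first few flips hit the features the model relies on most.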

Figure 3: (A-B) Perturbation-based evaluation plots for MNIST and SVHN, respectively. The dashed lines represent the non-class-targeted baselines and the solid lines represent class-targeted baselines. (C-D) Gini indices, with the yellow bars representing saliency maps with zero baselines and the green bars representing those with GANMEX baselines. (E) Sanity checks showing the original saliency maps (Orig) and saliency maps under cascading randomization over the four layers: output layer (Output), fully connected (FC), and two CNN layers (CNN1, CNN2).


We calculated the Gini index, representing the sparseness of the saliency maps, as proposed by Chalasani et al. (2018), where a larger score means a sparser saliency map, which is a desired property. Figure 3.C-D shows our experiments comparing the different techniques with a zero baseline vs. GANMEX. Other than IG/SVHN, where GANMEX has a visible advantage, the results are roughly the same; we suspect that the sparseness of the zero baseline attribution benefited from incorrectly hiding key features, as shown in Figures 2 and 3.A-B. Expected gradient, on the other hand, consistently underperforms its counterpart (IG+GANMEX) on both datasets.
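A common way to compute the Gini index of a saliency map sorts the absolute attribution values and applies the standard formula; a small sketch, assuming this is the sparseness measure intended (our reading of Chalasani et al. (2018)):

```python
import numpy as np

def gini_index(saliency):
    """Gini index of absolute attribution values: 0 for a perfectly
    uniform map, approaching 1 as attribution concentrates on few
    features. Uses the standard sorted-values formula."""
    v = np.sort(np.abs(saliency).ravel())
    n = v.size
    total = v.sum()
    if total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float((2 * ranks - n - 1) @ v) / (n * total)
```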

Cascading randomization

We performed the sanity checks proposed by Adebayo et al. (2018), which perform cascading randomization from the top to the bottom layers of the DNN and observe the changes in the saliency maps. Specifically, layer by layer we replace the model weights with Gaussian random variables scaled to have the same norm. For meaningful model explanations, we would expect the attributions to be gradually randomized during the cascading randomization process. In contrast, unperturbed saliency maps during the model randomization would suggest that the attributions were based on general features of the input and not specifically on the trained model.
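The cascading randomization procedure can be sketched as follows (NumPy, names ours): starting from the top layer, each layer's weights are replaced with Gaussian noise rescaled to the original norm, yielding one model snapshot per stage:

```python
import numpy as np

def cascading_randomization(weights, seed=0):
    """Yield weight lists with layers re-randomized cumulatively from
    the top (output) layer down; each replacement is Gaussian noise
    scaled to match the original layer's norm."""
    rng = np.random.default_rng(seed)
    randomized = [w.copy() for w in weights]
    for idx in reversed(range(len(weights))):
        noise = rng.standard_normal(weights[idx].shape)
        scale = np.linalg.norm(weights[idx]) / np.linalg.norm(noise)
        randomized[idx] = noise * scale
        yield [w.copy() for w in randomized]
```

Each yielded snapshot would then be loaded into the model and the saliency map recomputed; a trustworthy explanation should degrade as the randomization cascades downward.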

Figure 3.E shows the experiment on MNIST data with the network layers named (input to output) CNN1, CNN2, FC, Output. It shows that even though the saliency maps generated by the original IG, DeepLIFT and Occlusion were rather unperturbed (still showing the shape of the digit) after the model randomization, with the help of GANMEX, both the baselines and the saliency maps were perturbed over the cascading randomization. Expected gradient, while showing more randomization compared to the zero baseline saliency maps, still roughly shows the shape of the digit throughout the sanity check.

4.2 Attribution for Binary Classifiers

In addition to the one-vs-one aspect, GANMEX generally improves the quality of the saliency maps compared with the existing baselines, and this can be tested on binary datasets, where one-vs-one and one-vs-all explanations are equivalent. For the apple2orange dataset, conceptually apples and oranges both have round shapes but different colors, so we would expect the saliency maps of a reasonably performing classifier to highlight the colors of the fruit, but not the shapes, and definitely not the background.

In Figure 4 and Figure 7 we compare the saliency maps generated by DeepLIFT and IG with the zero input, max input, and blurred image baselines against those generated with the GANMEX baselines. With all non-GANMEX baselines (Zero, Max, Blur), we commonly observe one of two errors in the saliency map. The first consists of highlighting the background. The second highlights only the edge of the apple(s), providing the false indication that the model bases its decision on the shape of the object rather than its color. It is quite clear that neither of these errors occurs when using the GANMEX baselines, as the background is never present and the full shape of the apple(s)/orange(s) is highlighted.

Figure 4: Saliency maps for the classifier on the apple2orange dataset with four baseline choices: zero baseline (Zero), maximum value baseline (Max), blurred baseline (Blur), and GANMEX baseline (GANMEX).

5 Conclusion and future work

We have proposed GAN-based model explainability, a novel approach for generating one-vs-one explanation baselines without being constrained by the training set. We used the GANMEX baselines in conjunction with IG, DeepLIFT, DeepSHAP, and Occlusion, and to our surprise, the baseline replacement was all it took to address the common downsides of the existing attribution methods (blindness to certain input values and failure to randomize under model randomization) and significantly improve one-vs-one explainability. The outperformance was demonstrated through perturbation-based evaluation, sparseness measures, and cascading randomization sanity checks. The one-vs-one explanations achieved by GANMEX open up possibilities for obtaining more insights into how DNNs differentiate similar classes.

While GANMEX showed promising results on explaining binary classifiers, where one-vs-all and one-vs-one explanations are directly comparable, open questions remain on how to apply GANMEX to one-vs-all explainability for multi-class classifiers, and how to best optimize the GAN component to effectively generate baselines for classification tasks with a large number of classes.


  • J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim (2018) Sanity checks for saliency maps. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9505–9515. External Links: Link Cited by: §1, §2.2, §3, §4.1.
  • M. Ancona, E. Ceolini, C. Öztireli, and M. Gross (2018) Towards better understanding of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §2.1, §2.1.
  • S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10 (7), pp. 1–46. External Links: Link, Document Cited by: §2.2, §2.3, §4.1.
  • P. Chalasani, J. Chen, A. R. Chowdhury, S. Jha, and X. Wu (2018) Concise explanations of neural networks using adversarial training. CoRR abs/1810.06583. External Links: Link, 1810.06583 Cited by: §4.1.
  • C. Chang, E. Creager, A. Goldenberg, and D. Duvenaud (2018) Explaining image classifiers by adaptive dropout and generative in-filling. CoRR abs/1807.08024. External Links: Link, 1807.08024 Cited by: §2.4.
  • H. Chen, S. Lundberg, and S. Lee (2019) Explaining models by propagating shapley values of local components. External Links: 1911.11888 Cited by: §2.1.
  • Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2017) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. CoRR abs/1711.09020. External Links: Link, 1711.09020 Cited by: §A.2, §2.4, §3.1.
  • Y. Choi, Y. Uh, J. Yoo, and J. Ha (2020) StarGAN v2: diverse image synthesis for multiple domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §2.4.
  • P. Dabkowski and Y. Gal (2017) Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. 6967–6976. External Links: Link Cited by: §2.4.
  • G. G. Erion, J. D. Janizek, P. Sturmfels, S. Lundberg, and S. Lee (2019) Learning explainable models using attribution priors. CoRR abs/1906.10670. External Links: Link, 1906.10670 Cited by: §A.3, §2.1, §2.2.
  • R. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. CoRR abs/1704.03296. External Links: Link, 1704.03296 Cited by: §2.2.
  • B. Goodman and S. Flaxman (2017) European union regulations on algorithmic decision-making and a “right to explanation”. AI magazine 38 (3), pp. 50–57. Cited by: §1.
  • S. Hooker, D. Erhan, P. Kindermans, and B. Kim (2019) A benchmark for interpretability methods in deep neural networks. External Links: 1806.10758 Cited by: §2.2.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861. External Links: Link, 1704.04861 Cited by: §A.1.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 5967–5976. Cited by: §2.4, §3.1.
  • P. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, and B. Kim (2017) The (un) reliability of saliency methods. arXiv preprint arXiv:1711.00867. Cited by: §2.2.
  • Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Note: External Links: Link Cited by: §A.1, §4.
  • M. Liu, X. Huang, A. Mallya, T. Karras, T. Aila, J. Lehtinen, and J. Kautz (2019) Few-shot unsupervised image-to-image translation. CoRR abs/1905.01723. External Links: Link, 1905.01723 Cited by: §2.4.
  • S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774. External Links: Link Cited by: §1, §2.1.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §A.1, §4.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. CoRR abs/1704.02685. External Links: Link, 1704.02685 Cited by: §1, §2.1, §2.1, §2.3, §4.1.
  • A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje (2016) Not just a black box: learning important features through propagating activation differences. CoRR abs/1605.01713. External Links: Link, 1605.01713 Cited by: §2.1.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. CoRR abs/1312.6034. External Links: Link, 1312.6034 Cited by: §2.1.
  • D. Smilkov, N. Thorat, B. Kim, F. B. Viégas, and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. CoRR abs/1706.03825. External Links: Link, 1706.03825 Cited by: §2.2.
  • J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2015) Striving for simplicity: the all convolutional net. In ICLR (workshop track), External Links: Link Cited by: §A.1.
  • P. Sturmfels, S. Lundberg, and S. Lee (2020) Visualizing the impact of feature attribution baselines. Distill. Note: External Links: Document Cited by: §1, §2.2, §2.2.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. CoRR abs/1703.01365. External Links: Link, 1703.01365 Cited by: §1, §2.1.
  • M. Sundararajan and A. Taly (2018) A note about: local explanation methods for deep neural networks lack sensitivity to parameter values. CoRR abs/1806.04205. External Links: Link, 1806.04205 Cited by: §2.2, §3.
  • J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. External Links: 1801.07892 Cited by: §2.4.
  • M. D. Zeiler and R. Fergus (2013) Visualizing and understanding convolutional networks. CoRR abs/1311.2901. External Links: Link, 1311.2901 Cited by: §1, §2.1.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. CoRR abs/1703.10593. External Links: Link, 1703.10593 Cited by: §A.1, §2.4, §4.

Appendix A Implementation Details

a.1 Datasets and Classifiers

MNIST (LeCun and Cortes (2010)) The classifier consists of two 6x6 CNN layers with a stride of 2, followed by a 256-unit fully connected layer, a dropout layer, and the 10 output neurons. As shown in Springenberg et al. (2015), CNNs with stride greater than 1 achieve performance comparable to that of pooling layers. The classifier was trained for 50 epochs and achieved a test accuracy of 99.3%.
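The spatial dimensions flowing through the two stride-2 convolutions can be traced with the standard convolution output formula; the 'valid' (no-padding) scheme below is an assumption, as the source does not state the padding:

```python
def conv_out(size, kernel, stride):
    """Output spatial size of a 'valid' (no padding) convolution."""
    return (size - kernel) // stride + 1

# Trace a 28x28 MNIST image through the two 6x6, stride-2 CNN layers.
h1 = conv_out(28, kernel=6, stride=2)  # after first conv
h2 = conv_out(h1, kernel=6, stride=2)  # after second conv
print(h1, h2)  # 12 4
```

Under this padding assumption, the resulting 4x4 feature maps are flattened before the 256-unit fully connected layer.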

Street-View House Numbers (SVHN) (Netzer et al. (2011)) We tested our models on the cropped version of SVHN, used the same model architecture as for MNIST, and achieved a test accuracy of 90.3% after 50 epochs of training.

apple2orange (Zhu et al. (2017)) We trained a classifier taking the original 256x256 image as input. The classifier was constructed by adding a global average pooling layer on top of MobileNet (Howard et al. (2017)), followed by a dense layer of 1024 neurons and a dropout layer before the output neurons. The classifier was trained for 50 epochs and achieved a test accuracy of 87.7%.
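Global average pooling reduces MobileNet's final feature tensor to one value per channel by averaging over the spatial axes. A minimal numpy sketch; the 8x8x1024 feature shape is an assumption based on MobileNet's usual stride-32 downsampling of a 256x256 input:

```python
import numpy as np

def global_average_pool(features):
    """Average each channel over the spatial axes: (H, W, C) -> (C,)."""
    return features.mean(axis=(0, 1))

# MobileNet downsamples 256x256 inputs by a factor of 32 -> 8x8 feature maps.
features = np.random.rand(8, 8, 1024)
pooled = global_average_pool(features)
print(pooled.shape)  # (1024,)
```

The pooled 1024-vector then feeds the 1024-unit dense layer, dropout, and output neurons described above.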

a.2 Baseline Generation with GANMEX

Our baseline generation process is based on StarGAN (Choi et al. (2017)). We used the TensorFlow-GAN implementation and made the following two modifications (Equation 9):

  1. The class discriminator is replaced by the target classifier to be explained.

  2. A similarity loss is added to the training objective function.

We train the GANMEX model for 100k steps for the MNIST and apple2orange datasets, and 300k steps for the SVHN dataset. Only the train split is used for training, and the attribution results and evaluation were done on the test split of the dataset.
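With the two modifications above, the generator objective combines an adversarial term, a classification term scored by the frozen to-be-explained classifier, and the added similarity loss. The sketch below is illustrative only: the L1 form of the similarity term and the loss weights are assumptions, not the exact terms of Equation 9:

```python
import numpy as np

def similarity_loss(x, baseline):
    """L1 distance keeping the generated baseline close to the input."""
    return np.abs(x - baseline).mean()

def generator_loss(adv_loss, cls_loss, x, baseline,
                   lambda_cls=1.0, lambda_sim=10.0):
    """GANMEX-style generator objective (weights are illustrative)."""
    return adv_loss + lambda_cls * cls_loss + lambda_sim * similarity_loss(x, baseline)

x = np.zeros((4, 4))
baseline = np.ones((4, 4)) * 0.5   # stands in for G(x, target_class)
loss = generator_loss(adv_loss=0.2, cls_loss=0.1, x=x, baseline=baseline)
print(round(loss, 3))  # 0.2 + 0.1 + 10 * 0.5 = 5.3
```

The classification term is what ties the generated baseline to the specific classifier being explained, which is why swapping in a jointly trained discriminator (as in the ablation of Appendix B) loses classifier specificity.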

a.3 Attribution Methods

We used DeepExplain for generating saliency maps with IG, DeepLIFT, and Occlusion. We modified the code base to use the score delta instead of the original class score and to allow replacing the zero baseline (see Section 2.1) with custom baselines from GANMEX and MDTS. Expected Gradients was separately implemented according to the formulation in Erion et al. (2019). We set the number of sampling steps to 200 for both IG and Expected Gradients, and used Occlusion-1, which only perturbs the pixel itself (as opposed to perturbing the whole neighboring patch of pixels).
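Integrated gradients with a custom (e.g. GANMEX) baseline averages the gradient of the score along the straight path from the baseline to the input. A minimal numpy sketch, using an illustrative linear score function (not the paper's classifiers) for which IG's completeness axiom holds exactly:

```python
import numpy as np

def integrated_gradients(grad, x, baseline, steps=200):
    """Riemann approximation of IG along the baseline-to-input path."""
    diff = x - baseline
    total = np.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * diff
        total += grad(point)
    return diff * total / steps

# Illustrative linear score S(x) = w . x, whose gradient is constant.
w = np.array([1.0, -2.0, 3.0])
score = lambda x: w @ x
grad = lambda x: w

x = np.array([0.5, 0.5, 0.5])
baseline = np.array([0.1, 0.9, 0.3])   # stands in for a GANMEX baseline
attr = integrated_gradients(grad, x, baseline)

# Completeness: attributions sum to the score delta S(x) - S(baseline).
print(np.isclose(attr.sum(), score(x) - score(baseline)))  # True
```

Replacing the class score with the score delta between the original and target classes, as done in our modified DeepExplain, amounts to running the same procedure on the difference of two such score functions.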

The DeepSHAP saliency maps were calculated using SHAP. We made similar modifications to replace the original class score with the score delta and to feed in the custom baseline instances.

In all saliency maps shown in the paper, blue indicates positive values and red indicates negative values. We skipped Occlusion for large images (apple2orange) and also skipped SHAP for full-dataset evaluations due to computational resource constraints.

Appendix B Ablation Studies

Here we analyze the possibility of using other GAN models. Unlike StarGAN, most other image-to-image translation algorithms do not have a stand-alone class discriminator that can be swapped out for a trained classifier. To simulate such restrictions, we trained a similar GAN model but with the class discriminator trained jointly with the generator from scratch. Figure 5.A shows that while the stand-alone GAN yields baselines similar to GANMEX's, both the baselines and the saliency maps of the stand-alone GAN remain unperturbed under cascading randomization of the model. This indicates that the class-wise explanations provided by the stand-alone GAN were not specific to the to-be-explained classifier.

The importance of the similarity loss in Equation 9 can be demonstrated on a colored-MNIST dataset, where we randomly assigned each digit one of three colors, with the labels of the instances unchanged from the original MNIST labels. The classifier was trained with the same model architecture and training process as for MNIST.

The dataset exhibits modes (colors in this case) that are irrelevant to the labels, and we would expect the class-targeted baseline to be another instance with the same color as the input. Figure 5.B shows that the similarity loss is the crucial component for ensuring that the baseline has the same color as the input. Without the similarity loss, the generated baseline instance can easily have a different color from the original image. The reconstruction loss by itself does not provide the same-mode constraint, because a generator that consistently swaps colors remains reversible and therefore does not get penalized by the reconstruction loss. While the reconstruction loss was not required for GANMEX or the same-mode constraint, we observed that some degree of reconstruction loss helps GANs converge faster.
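The colored-MNIST construction can be sketched as follows; the particular palette is hypothetical, since the source does not list the three colors:

```python
import numpy as np

# Hypothetical three-color palette (RGB); the actual colors are not specified.
PALETTE = np.array([[1.0, 0.0, 0.0],   # red
                    [0.0, 1.0, 0.0],   # green
                    [0.0, 0.0, 1.0]])  # blue

def colorize(digit, rng):
    """Tint a grayscale digit (H, W) with a random palette color -> (H, W, 3).

    Only an irrelevant mode (color) is added; the label stays unchanged.
    """
    color = PALETTE[rng.integers(len(PALETTE))]
    return digit[:, :, None] * color[None, None, :]

rng = np.random.default_rng(0)
digit = np.ones((28, 28))          # stand-in for an MNIST digit
colored = colorize(digit, rng)
print(colored.shape)  # (28, 28, 3)
```

Because color is independent of the label, any same-color pairing of input and baseline must come from a loss term such as the similarity loss, not from the classification objective.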

Figure 5: (A) Cascading randomization on baselines generated by a stand-alone GAN leads to little randomization of the saliency maps. (B) Colored-MNIST dataset. GAN baselines generated with both similarity loss and reconstruction loss (S+R), similarity loss only (S), reconstruction loss only (R), and neither (NA). Only S+R and S successfully constrained the baselines to the same modes (colors) as the inputs.

Appendix C Additional Figures

Figure 6: One-vs-one saliency maps using class-targeted baselines (GANMEX) vs. non-class-targeted baselines (zero baselines). One-vs-one saliency maps generated using zero baselines show almost the same attributions regardless of the target class. GANMEX baselines corrected this behavior for IG, DeepLIFT, and DeepSHAP, producing different attributions depending on the target class.
Figure 7: Additional examples of saliency maps for the classifier on the apple2orange dataset with four baseline choices: zero baseline (Zero), maximum value baseline (Max), blurred baseline (Blur), and GANMEX baseline (GANMEX).