Black Box Explanation by Learning Image Exemplars in the Latent Feature Space

We present an approach to explain the decisions of black box models for image classification. While using the black box to label images, our explanation method exploits the latent feature space learned through an adversarial autoencoder. The proposed method first generates exemplar images in the latent feature space and learns a decision tree classifier. Then, it selects and decodes exemplars respecting local decision rules. Finally, it visualizes them in a manner that shows the user how the exemplars can be modified to either stay within their class or become counter-factuals by "morphing" into another class. Since we focus on black box decision systems for image classification, the explanation obtained from the exemplars also provides a saliency map highlighting the areas of the image that contribute to its classification and the areas that push it into another class. We present the results of an experimental evaluation on three datasets and two black box models. Besides providing the most useful and interpretable explanations, we show that the proposed method outperforms existing explainers in terms of fidelity, relevance, coherence, and stability.




1 Introduction

Automated decision systems based on machine learning techniques are widely used for classification, recognition and prediction tasks. These systems try to capture the relationships between the input instances and the target to be predicted. Input attributes can be of any type, as long as it is possible to find a convenient representation for them. For instance, we can represent images by matrices of pixels, or by a set of features that correspond to specific areas or patterns of the image. Many automated decision systems are based on very accurate classifiers such as deep neural networks. They are regarded as “black box” models because of their opaque, hidden internal structure, whose complexity makes them very difficult for humans to comprehend [5]. Thus, there is increasing interest in the scientific community in deriving explanations able to describe the behavior of a black box [5, 22, 12, 6], or in explainable-by-design approaches [19, 18]. Moreover, the General Data Protection Regulation of the European Union became applicable in May 2018. This law gives individuals the right to request “…meaningful information of the logic involved” when automated decision-making takes place with “legal or similarly relevant effects” on individuals. Without a technology able to explain, in a manner easily understandable to a human, how a black box takes its decisions, this right will remain a utopia, or it will result in prohibiting the use of opaque but highly effective machine learning methods in socially sensitive domains.

In this paper, we investigate the problem of black box explanation for image classification (Section 3). Explaining the reasons for a certain decision can be particularly important. For example, when dealing with medical images for diagnosis, how can we validate that a very accurate image classifier built to recognize cancer actually bases its decisions on the malignant areas and not on the background?

In the literature (Section 2), the problem is addressed by producing explanations through different approaches. On the one hand, gradient and perturbation-based attribution methods [27, 25] produce saliency maps highlighting the parts of the image that most contribute to its classification. However, these methods are model specific and can be employed only to explain specific deep neural networks. On the other hand, model-agnostic approaches can explain, again through a saliency map, the outcome of any black box [24, 11]. Agnostic methods may generate a local neighborhood of the instance to explain and mimic the behavior of the black box using an interpretable classifier. However, these methods exhibit drawbacks that may negatively impact the reliability of the explanations. First, they do not take into account existing relationships between features (or pixels) during the neighborhood generation. Second, the neighborhood generation does not produce “meaningful” images since, e.g., some areas of the image to explain are obscured in [24], while in [11] they are replaced with pixels of other images. Finally, transparent-by-design approaches produce prototypes from which it should be clear to the user why a certain decision is taken by the model [18, 19]. Nevertheless, these approaches cannot be used to explain a trained black box: the transparent model has to be directly adopted as a classifier, possibly with limitations on the accuracy achieved.

We propose abele, an Adversarial Black box Explainer generating Latent Exemplars (Section 5). abele is a local, model-agnostic explanation method able to overcome the existing limitations of the local approaches by exploiting the latent feature space, learned through an adversarial autoencoder [20] (Section 4), for the neighborhood generation process. Given an image classified by a given black box model, abele explains the reasons for that classification. The explanation consists of two parts: (i) a set of exemplar and counter-exemplar images illustrating, respectively, instances classified with the same label and with a different label than the instance to explain, which may be visually analyzed to understand the reasons for the classification, and (ii) a saliency map highlighting the areas of the image to explain that contribute to its classification, and the areas that push it towards another label.

We present an extensive experimental evaluation (Section 6) on three image datasets, i.e., mnist, fashion and cifar10, and two black box models. We show empirically that abele outperforms state-of-the-art methods based on saliency maps or on prototype selection by providing relevant, coherent, stable and faithful explanations. Finally, we summarize our contribution, its limitations, and future research directions (Section 7).

2 Related Work

Research on black box explanation methods has recently received much attention [5, 22, 12, 6]. These methods can be characterized as model-specific vs model-agnostic, and local vs global. The proposed explanation method abele is the next step in the line of research on local, model-agnostic methods originated with [24] and extended in different directions by [9] and by [13, 11, 23].

In image classification, typical explanations are saliency maps, i.e., images that show each pixel’s positive (or negative) contribution to the black box outcome. Saliency maps are efficiently built by gradient-based [27, 25, 30, 1] and perturbation-based [33, 7] attribution methods by finding, through backpropagation and differences in the neuron activations, the pixels of the image that maximize an approximation of a linear model of the black box classification outcome. Unfortunately, these approaches are specifically designed for deep neural networks. They cannot be employed for explaining other image classifiers, like tree ensembles or hybrid image classification processes [12]. Model-agnostic explainers, such as lime [24] and similar methods [11], can be employed to explain the classification of any image classifier. They are based on the generation of a local neighborhood around the image to explain, and on the training of an interpretable classifier on this neighborhood. Unlike the global distillation methods [17], they do not consider (often non-linear) relationships between features (e.g. pixel proximity), and thus their neighborhoods do not contain “meaningful” images.

Our proposed method abele overcomes the limitations of both saliency-based and local model-agnostic explainers by using AAEs, local distillation, and exemplars. As abele includes and extends lore [13], an innovation w.r.t. state of the art explainers for image classifiers is the usage of counter-factuals. Counter-factuals are generated from “positive” instances by a minimal perturbation that pushes them to be classified with a different label [31]. In line with this approach, abele generates counter-factual rules in the latent feature space and exploits them to derive counter-exemplars in the original feature space.

As the explanations returned by abele are based on exemplars, we need to clarify the relationship between exemplars and prototypes. Both are used as a foundation of representation of a category, or a concept [8]. In the prototype view, a concept is the representation of a specific instance of this concept. In the exemplar view, the concept is represented by means of a set of typical examples, or exemplars. abele uses exemplars to represent a concept. In recent works [19, 4], image prototypes are used as the foundation of the concept for interpretability [2]. In [19], an explainable-by-design method, similarly to abele, generates prototypes in the latent feature space learned with an autoencoder. However, it is not aimed at explaining a trained black box model. In [4], a convolutional neural network is adopted to provide features from which the prototypes are selected. abele differs from these approaches because it is model agnostic and because the adversarial component ensures the similarity of feature and class distributions.

3 Problem Formulation

In this paper we address the black box outcome explanation problem [12]. Given a black box model $b$ and an instance $x$ classified by $b$, i.e., $b(x) = y$, our aim is to provide an explanation $e$ for the decision $b(x) = y$. More formally:

Definition 1

Let $b$ be a black box, and $x$ an instance whose decision $b(x)$ has to be explained. The black box outcome explanation problem consists in finding an explanation $e \in E$ belonging to a human-interpretable domain $E$.

We focus on the black box outcome explanation problem for image classification, where the instance $x$ is an image mapped by $b$ to a class label $y$. In the following, we use the notation $b(X) = Y$ as a shorthand for $\{b(x) \mid x \in X\} = Y$. We denote by $b$ a black box image classifier, whose internals are either unknown to the observer or known but uninterpretable by humans. Examples are neural networks and ensemble classifiers. We assume that a black box $b$ is a function that can be queried at will.

We tackle the above problem by deriving an explanation from the understanding of the behavior of the black box in the local neighborhood of the instance to explain [12]. To overcome the state of the art limitations, we exploit adversarial autoencoders [20] for generating, encoding and decoding the local neighborhood.

Figure 1: Left: Adversarial Autoencoder architecture: the encoder turns the image $x$ into its latent representation $z$, the decoder re-builds an approximation $\tilde{x}$ of $x$ from $z$, and the discriminator identifies whether a randomly generated latent instance $h$ can be considered valid or not. Right: Discriminator and Decoder module: the input is a randomly generated latent instance $h$ and, if it is considered valid by the discriminator, the module returns it together with its decompressed version $\tilde{h}$.

4 Adversarial Autoencoders

An important issue in the use of synthetic instances for black box explanation is ensuring that the generated examples follow the same distribution as the original examples. We approach this issue by using an Adversarial Autoencoder (AAE) [20], which combines a Generative Adversarial Network (GAN) [10] with the autoencoder representation learning algorithm. Another reason for using an AAE is that, as demonstrated in [29], autoencoders enhance the robustness of deep neural network classifiers against malicious examples.

AAEs are probabilistic autoencoders that aim at generating new random items that are highly similar to the training data. They are regularized by matching the aggregated posterior distribution of the latent representation of the input data to an arbitrary prior distribution. The AAE architecture (Fig. 1-left) includes an encoder $\mathbb{R}^n \to \mathbb{R}^k$, a decoder $\mathbb{R}^k \to \mathbb{R}^n$ and a discriminator $\mathbb{R}^k \to [0, 1]$, where $n$ is the number of pixels in an image and $k$ is the number of latent features. Let $x$ be an instance of the training data; we name $z$ the corresponding latent data representation obtained by the encoder. We can describe the AAE with the following distributions [20]: the prior distribution $p(z)$ to be imposed on $z$, the data distribution $p_d(x)$, the model distribution $p(x)$, and the encoding and decoding distributions $q(z|x)$ and $p(x|z)$, respectively. The encoding function $q(z|x)$ defines an aggregated posterior distribution $q(z)$ on the latent feature space: $q(z) = \int_x q(z|x)\, p_d(x)\, dx$. The AAE guarantees that the aggregated posterior distribution $q(z)$ matches the prior distribution $p(z)$, through the latent instances and by minimizing the reconstruction error. The AAE generator corresponds to the encoder $q(z|x)$ and ensures that the aggregated posterior distribution can confuse the discriminator in deciding whether the latent instance $z$ comes from the true distribution $p(z)$.

The AAE learning involves two phases: the reconstruction phase, aimed at training the encoder and the decoder to minimize the reconstruction loss, and the regularization phase, aimed at training the discriminator using training data and encoded values. After the learning, the decoder defines a generative model mapping $p(z)$ to $p_d(x)$.

Figure 2: Latent Local Rules Extractor. It takes as input the image $x$ to explain and the black box $b$. With the encoder trained by the AAE, it turns $x$ into its latent representation $z$. Then, the neighborhood generation module uses $z$ and $b$ to generate the latent local neighborhood $H$. The valid instances are decoded into $\tilde{H}$ by the Discriminator and Decoder module. Images in $\tilde{H}$ are labeled with the black box, $Y = b(\tilde{H})$. $H$ and $Y$ are used to learn a decision tree classifier. At last, a decision rule $r$ and the counter-factual rules $\Phi$ for $z$ are returned.

5 Adversarial Black Box Explainer

abele (Adversarial Black box Explainer generating Latent Exemplars) is a local model-agnostic explainer for image classifiers solving the outcome explanation problem. Given an image $x$ to explain and a black box $b$, the explanation provided by abele is composed of (i) a set of exemplars and counter-exemplars and (ii) a saliency map. Exemplars and counter-exemplars show instances classified with the same outcome as $x$ and with a different one, respectively. They can be visually analyzed to understand the reasons for the decision. The saliency map highlights the areas of $x$ that contribute to its classification and the areas that push it into another class.

The explanation process involves the following steps. First, abele generates a neighborhood in the latent feature space exploiting the AAE (Sec. 4). Then, it learns a decision tree on that latent neighborhood providing local decision and counter-factual rules. Finally, abele selects and decodes exemplars and counter-exemplars satisfying these rules and extracts from them a saliency map.

5.1 Encoding

The image $x \in \mathbb{R}^n$ to be explained is passed as input to the AAE, where the encoder returns the latent representation $z \in \mathbb{R}^k$ using $k$ latent features with $k \ll n$. The number $k$ is kept low by construction, avoiding high dimensionality problems.

5.2 Neighborhood Generation

abele generates a set $H$ of $N$ instances in the latent feature space, with characteristics close to those of $z$. Since the goal is to learn a predictor on $H$ able to simulate the local behavior of $b$, the neighborhood includes instances with both decisions, i.e., $H = H_{=} \cup H_{\neq}$, where instances $h \in H_{=}$ are such that $b(\tilde{h}) = b(x)$, and instances $h \in H_{\neq}$ are such that $b(\tilde{h}) \neq b(x)$. We name $\tilde{h} \in \mathbb{R}^n$ the decoded version of an instance $h \in \mathbb{R}^k$ in the latent feature space. The generation of $H$ (neighborhood generation module in Fig. 2) may be accomplished using different strategies, ranging from a pure random strategy using a given distribution to a genetic approach maximizing a fitness function [13]. In our experiments we adopt the latter strategy. After the generation process, for any instance $h \in H$, abele exploits the Discriminator and Decoder module (Fig. 1-right) both for checking the validity of $h$ by querying the discriminator and for decoding it into $\tilde{h}$. (In the experiments we use the default validity threshold of the discriminator to distinguish between real and fake exemplars. This value can be increased to admit only more reliable exemplars, or decreased to speed up the generation process.) Then, abele queries the black box $b$ with $\tilde{h}$ to get the class $y$, i.e., $b(\tilde{h}) = y$.
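The generate, validate, decode and label loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the pure random strategy (Gaussian perturbation of the latent point) rather than the genetic approach adopted in the experiments, and the callables `decode`, `discriminate` and `black_box`, the `scale` parameter and the `threshold` value of 0.5 are all hypothetical stand-ins chosen for the example.

```python
import random


def generate_latent_neighborhood(z, n_samples, decode, discriminate, black_box,
                                 scale=0.5, threshold=0.5, seed=0,
                                 max_tries=10_000):
    """Sample latent points near z, keep those the discriminator accepts,
    decode them, and label the decoded images with the black box.

    All callables and default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    latent, decoded, labels = [], [], []
    tries = 0
    while len(latent) < n_samples and tries < max_tries:
        tries += 1
        # Candidate latent instance: Gaussian perturbation of z.
        h = [z_i + rng.gauss(0.0, scale) for z_i in z]
        # Validity check: discard candidates the discriminator rejects.
        if discriminate(h) < threshold:
            continue
        h_decoded = decode(h)
        latent.append(h)
        decoded.append(h_decoded)
        # The black box is only ever queried on decoded images.
        labels.append(black_box(h_decoded))
    return latent, decoded, labels


# Toy stand-ins so the sketch runs end to end.
def toy_decode(h):
    return [2.0 * v for v in h]


def toy_discriminate(h):
    return 1.0 if all(abs(v) < 3.0 for v in h) else 0.0


def toy_black_box(image):
    return int(sum(image) > 0.0)


H, H_decoded, Y = generate_latent_neighborhood(
    [0.1, -0.2], 50, toy_decode, toy_discriminate, toy_black_box)
```

Raising `threshold` in this sketch admits fewer candidates per try, which mirrors the trade-off between exemplar reliability and generation speed mentioned in the text.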

5.3 Local Classifier Rule Extraction

Given the local neighborhood $H$, abele builds a decision tree classifier $c$ trained on the instances $H$ labeled with the black box decision $b(\tilde{H})$. Such a predictor is intended to locally mimic the behavior of $b$ in the neighborhood $H$. The decision tree extracts the decision rule $r$ and the counter-factual rules $\Phi$ enabling the generation of exemplars and counter-exemplars. abele considers decision tree classifiers because: (i) decision rules can naturally be derived from a root-leaf path in a decision tree; and (ii) counter-factual rules can be extracted by symbolic reasoning over a decision tree. The premise $p$ of a decision rule $r = p \to y$ is the conjunction of the splitting conditions in the nodes of the path from the root to the leaf that is satisfied by the latent representation $z$ of the instance to explain $x$, with $y = c(z)$. For the counter-factual rules $\Phi$, abele selects the closest rules in terms of splitting conditions leading to a label $\hat{y}$ different from $y$, i.e., the rules $\{q \to \hat{y}\}$ such that $q$ is the conjunction of splitting conditions for a path from the root to a leaf labeling an instance $h_c$ with $c(h_c) = \hat{y}$, and minimizing the number of splitting conditions falsified w.r.t. the premise $p$ of the rule $r$. Fig. 2 shows the process that, starting from the image to be explained, leads to the decision tree learning and to the extraction of the decision and counter-factual rules. We refer to this module as the Latent Local Rules Extractor, a variant of lore [13] operating in the latent feature space.
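To make the rule extraction concrete, the sketch below walks a small hand-written tree. The nested-dictionary tree format, the function names and the toy labels are assumptions made for illustration; a real pipeline would read the same information from a trained decision tree. Counting the conditions of a candidate path that the latent point violates is one plausible reading of "minimizing the number of splitting conditions falsified".

```python
def leaf_paths(node, path=()):
    """Yield (conditions, label) for every root-to-leaf path.

    A condition (feature, threshold, goes_left) stands for
    h[feature] <= threshold when goes_left is True, and
    h[feature] > threshold otherwise.
    """
    if "label" in node:
        yield list(path), node["label"]
        return
    f, t = node["feature"], node["threshold"]
    yield from leaf_paths(node["left"], path + ((f, t, True),))
    yield from leaf_paths(node["right"], path + ((f, t, False),))


def holds(condition, h):
    f, t, goes_left = condition
    return (h[f] <= t) == goes_left


def decision_rule(tree, z):
    """Return the (premise, label) of the path that z satisfies."""
    for conditions, label in leaf_paths(tree):
        if all(holds(c, z) for c in conditions):
            return conditions, label
    raise ValueError("malformed tree: no path matches z")


def counterfactual_rules(tree, z):
    """Return the rules with a different label that z violates the least."""
    _, y = decision_rule(tree, z)
    scored = [(sum(not holds(c, z) for c in conditions), conditions, label)
              for conditions, label in leaf_paths(tree) if label != y]
    if not scored:
        return []
    best = min(score for score, _, _ in scored)
    return [(conditions, label)
            for score, conditions, label in scored if score == best]


# Hypothetical two-feature latent tree.
toy_tree = {
    "feature": 0, "threshold": 0.0,
    "left": {"label": "four"},
    "right": {
        "feature": 1, "threshold": 1.0,
        "left": {"label": "nine"},
        "right": {"label": "seven"},
    },
}
z = [0.5, 0.2]
rule = decision_rule(toy_tree, z)
counter = counterfactual_rules(toy_tree, z)
```

Here `rule` reads "feature 0 > 0.0 and feature 1 <= 1.0 implies nine", and both alternative leaves are returned as counter-factual rules because each requires flipping exactly one condition.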

Figure 3: Left: (Counter-)Exemplar Generator: it takes a decision rule $r$ and a randomly generated latent instance $h$, checks whether $h$ satisfies $r$, and applies the Discriminator and Decoder module (Fig. 1-right) to decode it. Right: abele architecture. It takes as input the image $x$ for which we require an explanation and the black box $b$. It extracts the decision rule $r$ and the counter-factual rules $\Phi$ with the Latent Local Rules Extractor. Then, it generates a set of latent instances $H$, which are used as input, together with $r$ and $\Phi$, for the (Counter-)Exemplar Generator (Fig. 3-left) to generate exemplars $\tilde{H}_e$ and counter-exemplars $\tilde{H}_c$. Finally, $x$ and $\tilde{H}_e$ are used by the saliency extractor module for calculating the saliency map $s$ and returning the final explanation $e$.

5.4 Explanation Extraction

Often, e.g. in medical or managerial decision making, people explain their decisions by pointing to exemplars with the same (or different) decision outcome [8, 4]. We follow this approach and we model the explanation of an image $x$ returned by abele as a triple $e = \langle \tilde{H}_e, \tilde{H}_c, s \rangle$ composed of exemplars $\tilde{H}_e$, counter-exemplars $\tilde{H}_c$ and a saliency map $s$. Exemplars and counter-exemplars are images representing instances similar to $x$, leading to an outcome equal to or different from $b(x)$. They are generated by abele exploiting the (Counter-)Exemplar Generator (Fig. 3-left). It first generates a set of latent instances $H$ satisfying the decision rule $r$ (or a set of counter-factual rules $\Phi$), as shown in Fig. 2. Then, it validates and decodes them into exemplars $\tilde{H}_e$ (or counter-exemplars $\tilde{H}_c$) using the Discriminator and Decoder module. The saliency map $s$ highlights the areas of $x$ that contribute to its outcome and the areas that push it into another class. The map is obtained by the saliency extractor module (Fig. 3-right), which first computes the pixel-to-pixel difference between $x$ and each exemplar in the set $\tilde{H}_e$, and then assigns to each pixel of the saliency map $s$ the median value of all differences calculated for that pixel. Thus, formally, for each pixel $i$ of the saliency map $s$ we have:

$$s[i] = \operatorname{median}_{\tilde{h}_e \in \tilde{H}_e} \left( x[i] - \tilde{h}_e[i] \right)$$

In summary, abele (Fig. 3-right) takes as input the instance to explain $x$ and a black box $b$, and returns an explanation $e$ according to the following steps. First, it adopts the Latent Local Rules Extractor, a variant of lore [13], to extract the decision rule $r$ and the counter-factual rules $\Phi$. These rules, together with a set of latent random instances $H$, are the input of the (Counter-)Exemplar Generator, which returns exemplars and counter-exemplars. Lastly, the saliency extractor module extracts the saliency map starting from the image $x$ and its exemplars.
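The saliency extraction step (pixel-to-pixel difference against each exemplar, then a per-pixel median) is simple enough to sketch directly. The example below uses nested Python lists for a tiny 2x2 grayscale image to stay dependency-free; the function name and the toy pixel values are illustrative assumptions.

```python
from statistics import median


def saliency_map(x, exemplars):
    """Per-pixel median of the differences between x and each exemplar.

    x is a 2-D list of pixel values; every exemplar has the same shape.
    """
    rows, cols = len(x), len(x[0])
    return [[median(x[r][c] - e[r][c] for e in exemplars)
             for c in range(cols)]
            for r in range(rows)]


# Toy 2x2 image and three exemplars.
x = [[0.0, 1.0],
     [1.0, 0.0]]
exemplars = [
    [[0.0, 1.0], [0.5, 0.0]],
    [[0.0, 1.0], [1.0, 0.5]],
    [[0.0, 0.5], [1.0, 1.0]],
]
s = saliency_map(x, exemplars)
```

In this toy case only the bottom-right pixel gets a non-zero value: most exemplars are brighter than the image there, so the median difference is negative, while the remaining pixels agree with the majority of exemplars and stay at zero.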

6 Experiments

Table 1: Dataset resolution, color type, train and test set sizes, and black box model accuracy. Columns: dataset, resolution, rgb, train, test, RF, DNN. Rows: mnist, fashion, cifar10.
Table 2: AAE reconstruction error in terms of RMSE. Columns: dataset, train, test. Rows: mnist, fashion, cifar10.

We experimented with the proposed approach on three open source datasets (details in Table 1): the mnist dataset of handwritten digit grayscale images, the fashion mnist dataset, a collection of grayscale images of Zalando’s articles (e.g. shirt, shoes, bag, etc.), and the cifar10 dataset of colored images of airplanes, cars, birds, cats, etc. Each dataset has ten different labels.

We trained and explained the following black box classifiers: Random Forest [3] (RF) as implemented by the scikit-learn Python library, and Deep Neural Networks (DNN) implemented with the keras library. For mnist and fashion we used a three-layer CNN, while for cifar10 we used the ResNet20 v1 network described in [16]. Classification performance is reported in Table 1.

For mnist and fashion we trained AAEs with sequential three-layer encoder, decoder and discriminator. For cifar10 we adopted a four-layer CNN for the encoder and the decoder, and a sequential discriminator. We used 80% of the test sets for training the adversarial autoencoders. (The encoding distribution of the AAE is defined as a Gaussian distribution whose mean and variance are predicted by the encoder itself [20].) The number of latent features $k$ was set separately for mnist, fashion, and cifar10. In Table 2 we report the reconstruction error of the AAE in terms of Root Mean Square Error (RMSE) between the original and reconstructed images. We employed the remaining 20% for evaluating the quality of the explanations.

We compare abele against lime and a set of saliency-based explainers collected in the DeepExplain package: Saliency (sal) [27], GradInput (grad) [25], IntGrad (intg) [30], ε-lrp (elrp) [1], and Occlusion (occ) [33]. We refer to the set of tested DeepExplain methods as dex. We also compare the exemplars and counter-exemplars generated by abele against the prototypes and criticisms selected by mmd and k-medoids [18]. (Criticisms are images not well-explained by prototypes with a regularized kernel function [18].) mmd exploits the maximum mean discrepancy and a kernel function for selecting the best prototypes and criticisms.

Figure 4: Explanation by saliency map on mnist.
Figure 5: Exemplars and counter-exemplars on mnist.
Figure 6: Explanation by saliency map on fashion.
Figure 7: Exemplars and counter-exemplars on fashion.

6.1 Saliency Map, Exemplars and Counter-Exemplars

Before quantitatively assessing the effectiveness of the compared methods, we visually analyze their outcomes. We report explanations of the DNNs for the mnist and fashion datasets in Fig. 5 and Fig. 7, respectively. (Best viewed in color. Black lines are not part of the explanation; they only highlight borders. We do not report explanations for cifar10 and for RF for the sake of space.) The first column contains the image to explain $x$ together with the label provided by the black box $b$, while the second column contains the saliency maps provided by abele. Since they are derived from the difference between the image $x$ and its exemplars, we indicate in yellow the areas that are common between $x$ and the exemplars $\tilde{H}_e$, in red the areas contained only in the exemplars, and in blue the areas contained only in $x$. This means that yellow areas must remain unchanged to obtain the same label $b(x)$, while red and blue areas can change without impacting the black box decision. In particular, with respect to $x$, an image obtaining the same label can be darker in blue areas and lighter in red areas. In other words, blue and red areas express the boundaries that can be varied while the class remains unchanged. For example, with this type of saliency map we can understand that a nine may have a more compact circle, a zero may be more inclined (Fig. 5), a coat may have no space between the sleeves and the body, and a boot may have a higher neck (Fig. 7). Moreover, we can notice how, besides the background, there are some “essential” yellow areas within the main figure that cannot be different from $x$: e.g. the leg of the nine, the crossed lines of the four, the space between the two trouser legs.

The rest of the columns in Fig. 5 and 7 contain the explanations of the competitors: red areas contribute positively to the black box outcome, blue areas contribute negatively. For lime’s explanations, nearly all the content of the image is part of the saliency areas. (This effect is probably due to the figure segmentation performed by lime.) In addition, the areas have either completely positive or completely negative contributions. These aspects may not be very convincing for a lime user. On the other hand, the dex methods return scattered red and blue points which can also be very close to each other and are not clustered into areas. It is not clear how a user could understand the black box outcome decision process from this kind of explanation.

Figure 8: Interpolation from the image to explain $x$ to one of its counter-exemplars $\tilde{h}_c$.

Since abele’s explanations also provide exemplars and counter-exemplars, they can also be visually analyzed by a user to understand which similar instances lead to the same outcome or to a different one. For each instance explained in Fig. 5 and 7, we show three exemplars and two counter-exemplars for the mnist and fashion datasets, respectively. Observing these images, we can notice how the label nine is assigned to images very close to a four (Fig. 5) but, as long as the upper part of the circle remains connected, the image is still classified as a nine. On the other hand, looking at counter-exemplars, if the upper part of the circle has a hole or the lower part is not thick enough, then the black box labels them as a four and a seven, respectively. We highlight similar phenomena for other instances: e.g. a boot with a neck not well defined is labeled as a sneaker (Fig. 7).

To gain further insights on the counter-exemplars, inspired by [28], we exploit the latent representations to visually understand how the black box labeling changes w.r.t. real images. In Fig. 8 we show, for some instances previously analyzed, how they can be changed to move from the original label to the counter-factual label. We realize this change in the class through the latent representations $z$ and $h_c$ of the image to explain $x$ and of the counter-exemplar $\tilde{h}_c$, respectively. Given $z$ and $h_c$, we generate through linear interpolation in the latent feature space intermediate latent representations between $z$ and $h_c$ respecting the latent decision or counter-factual rules. Finally, using the decoder, we obtain the intermediate images. This convincing and useful explanation analysis is achieved thanks to abele’s ability to deal with both real and latent feature spaces, and to the application of latent rules to real images, which yields human-understandable and clear exemplar-based explanations.
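The interpolation step can be sketched in a few lines. This minimal example only produces the evenly spaced latent points between the two endpoints; checking each point against the latent rules and decoding it into an image are left out, and the function name and toy vectors are illustrative assumptions.

```python
def interpolate_latent(z, h_c, steps):
    """Return `steps` evenly spaced latent points from z to h_c (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")
    fractions = [i / (steps - 1) for i in range(steps)]
    return [[(1 - t) * a + t * b for a, b in zip(z, h_c)] for t in fractions]


# Toy latent vectors with two features.
path = interpolate_latent([0.0, 2.0], [1.0, 0.0], steps=5)
```

Each element of `path` would then be passed to the decoder to obtain one frame of the morphing sequence from the original image towards its counter-exemplar.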

Lastly, we observe that prototype selector methods, like mmd [18] and k-medoids, cannot be used for the same type of analysis because they lack any link with either the black box or the latent space. In fact, they propose as prototypes (or criticisms) existing images of a given dataset. abele, on the other hand, generates (counter-)exemplars respecting rules rather than selecting them.

Figure 9: Box plots of fidelity. Numbers on top: mean values (the higher the better).

6.2 Interpretable Classifier Fidelity

We compare abele and lime in terms of fidelity [13, 5], i.e., the ability of the local interpretable classifier $c$ (a decision tree for abele and a linear lasso model for lime) to mimic the behavior of a black box $b$ in the local neighborhood $H$, measured as the agreement between the predictions of $c$ and the labels assigned by $b$ on the neighborhood. We report the fidelity as box plots in Fig. 9. The results show that on all datasets abele outperforms lime with respect to the RF black box classifier. For the DNN, the interpretable classifier of lime is slightly more faithful. However, for both RF and DNN, abele has a fidelity variance markedly lower than lime, i.e., more compact box plots, also without any outlier. (These results confirm the experiments reported in [13].) Since these fidelity results are statistically significant, we observe that the local interpretable classifier of abele is more faithful than the one of lime.
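As a rough illustration of what a fidelity score measures, the sketch below computes the fraction of neighborhood instances on which the interpretable classifier and the black box agree. Using plain agreement is an assumption made here for simplicity; the exact score used in the experiments may be defined differently.

```python
def fidelity(black_box_labels, surrogate_labels):
    """Fraction of neighborhood instances where both classifiers agree."""
    if not black_box_labels or len(black_box_labels) != len(surrogate_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    agreements = sum(b == c for b, c in zip(black_box_labels, surrogate_labels))
    return agreements / len(black_box_labels)


# Toy labels for four neighborhood instances.
score = fidelity([1, 0, 1, 1], [1, 0, 0, 1])
```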

Figure 10: 1-NN exemplar classifier accuracy varying the number of (counter-)exemplars.

6.3 Nearest Exemplar Classifier

The goal of abele is to provide useful exemplars and counter-exemplars as explanations. However, since we could not validate them with an experiment involving humans, inspired by [18], we tested their effectiveness by adopting memory-based machine learning techniques such as the k-nearest neighbor classifier [2] (k-NN). This kind of experiment provides an objective and indirect evaluation of the quality of exemplars and counter-exemplars. In the following experiment we generated exemplars and counter-exemplars with abele, and we selected prototypes and criticisms using mmd [18] and k-medoids [2]. Then, we employed a 1-NN model to classify unseen instances using these exemplars and prototypes. The classification accuracy of the 1-NN models trained with exemplars and counter-exemplars generated to explain the DNN, reported in Fig. 10, is comparable among the various methods. (abele achieves similar results for RF, not reported due to lack of space.) In particular, we observe that when the number of exemplars is low, abele outperforms mmd and k-medoids. This effect reveals that, on the one hand, just a few exemplars and counter-exemplars generated by abele are good for recognizing the real label, but if the number increases the 1-NN classifier gets confused. On the other hand, mmd is more effective when the number of prototypes and criticisms is higher: it selects a good set of images for the 1-NN classifier.
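The evaluation protocol can be sketched as a 1-NN classifier whose "training set" is just the labeled (counter-)exemplars. The example assumes flattened images, Euclidean distance and toy data; none of these choices is confirmed by the text, which does not state the distance used.

```python
import math


def nearest_exemplar_label(image, exemplars, labels):
    """Label of the exemplar closest to `image` (Euclidean distance)."""
    distances = [math.dist(image, e) for e in exemplars]
    return labels[distances.index(min(distances))]


def one_nn_accuracy(test_images, test_labels, exemplars, labels):
    """Accuracy of the 1-NN classifier built on the given exemplars."""
    hits = sum(nearest_exemplar_label(img, exemplars, labels) == y
               for img, y in zip(test_images, test_labels))
    return hits / len(test_images)


# Toy flattened "images" with two pixels each.
exemplars = [[0.0, 0.0], [1.0, 1.0]]
labels = ["zero", "one"]
accuracy = one_nn_accuracy(
    [[0.1, 0.2], [0.9, 0.8], [0.4, 0.3]],
    ["zero", "one", "one"],
    exemplars, labels)
```

The third toy image is closer to the "zero" exemplar but labeled "one", so two of the three predictions are correct.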

Figure 11: Relevance analysis varying the percentile threshold (the higher the better).
Figure 12: Images masked with black, gray and white having pixels with saliency for DNN lower than for the explanations of four and trouser in Fig. 5 and 7.

6.4 Relevance Evaluation

We evaluate the effectiveness of abele by partly masking the image to explain $x$. According to [15], although a part of $x$ is masked, $b(x)$ should remain unchanged as long as relevant parts of $x$ remain unmasked. To quantitatively measure this aspect, we define the relevance metric as the ratio of images in $X$ for which the masking of relevant parts does not impact the black box decision. Let $E = \{e_1, \dots, e_n\}$ be the set of explanations for the instances $X = \{x_1, \dots, x_n\}$. We identify with $x_m^{\{e,\tau\}}$ the masked version of $x$ with respect to the explanation $e$ and a threshold mask $\tau$. Then, the explanation relevance is defined as: $relevance_\tau(X, E) = |\{x \mid b(x) = b(x_m^{\{e,\tau\}})\ \forall \langle x, e \rangle \in \langle X, E \rangle\}| \,/\, |X|$. The masking is obtained by changing the pixels of $x$ having a value in the saliency map smaller than the $\tau$ percentile of the set of values in the saliency map itself. These pixels are substituted with the color 0, 127 or 255, i.e. black, gray or white. A low number of black box outcome changes means that the explainer successfully identifies relevant parts of the images, i.e., parts having a high relevance. Fig. 11 shows the relevance for the DNN, varying the percentile of the threshold from 0 to 100. (abele achieves similar results for RF, not reported due to lack of space.) abele is the most resistant to image masking on cifar10 regardless of the color used. For the other datasets we observe a different behavior depending on the masking color used: abele is among the best performers if the masking color is white or gray, while when the mask color is black, abele’s relevance is in line with that of the competitors for fashion and it is not good for mnist. This effect depends on the masking color but also on the different definitions of saliency map. Indeed, as previously discussed, depending on the explainer, a saliency map can provide different knowledge. However, we can state that abele successfully identifies relevant parts of the image contributing to the classification.
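The masking procedure and the relevance ratio can be sketched as follows. The sketch assumes flattened images, a nearest-rank percentile (the text does not say which percentile convention is used) and a toy black box; all names and values are illustrative.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of `values` for q in [0, 100]."""
    ordered = sorted(values)
    if q <= 0:
        return ordered[0]
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[rank - 1]


def mask_image(x, s, tau, fill=0):
    """Replace pixels whose saliency is below the tau-th percentile."""
    cut = percentile(s, tau)
    return [fill if s_i < cut else x_i for x_i, s_i in zip(x, s)]


def relevance(images, saliency_maps, black_box, tau, fill=0):
    """Share of images whose label survives masking of low-saliency pixels."""
    unchanged = sum(black_box(x) == black_box(mask_image(x, s, tau, fill))
                    for x, s in zip(images, saliency_maps))
    return unchanged / len(images)


# Toy black box that only looks at the first pixel.
def toy_black_box(image):
    return int(image[0] > 100)


images = [[200, 50, 50, 50], [200, 10, 10, 10]]
# The first map marks the decisive pixel as salient, the second does not.
saliency_maps = [[0.9, 0.1, 0.2, 0.3], [0.1, 0.9, 0.8, 0.7]]
score = relevance(images, saliency_maps, toy_black_box, tau=50, fill=0)
```

With these toy inputs the first explanation keeps the decisive pixel unmasked and preserves the label, while the second masks it and flips the label, giving a relevance of one half. The `fill` argument corresponds to the black, gray or white masking colors (0, 127, 255).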

For each method and for each masking color, Fig. 12 shows the effect of the masking on a sample from mnist and another from fashion. It is interesting to notice that for the sal approach a large part of the image is quite relevant, so that masking causes a different black box outcome (reported on the top of each image). As already observed, a peculiarity of abele is that the saliency areas are more connected and larger than those of the other methods. Therefore, given a percentile threshold $\tau$, the masking operation tends to mask larger and more contiguous areas of the image while maintaining the same black box labeling.

Table 3: Coherence analysis for DNN classifier (the lower the better). Columns: dataset, abele, elrp, grad, intg, lime, occ, sal.
Table 4: Stability analysis for DNN classifier (the lower the better). Columns: dataset, abele, elrp, grad, intg, lime, occ, sal.
Table 5: Coherence (left) and stability (right) for RF classifier (the lower the better). Columns: dataset, abele, lime; rows: cifar10, fashion, mnist.

6.5 Robustness Assessment

To gain the trust of the user, it is crucial to analyze the stability of interpretable classifiers and explainers [14], since the stability of explanations is an important requirement for interpretability [21]. Let $E$ be the set of explanations for $X$, and $S$ the corresponding saliency maps. We assess the robustness through the local Lipschitz estimation [21]:

$$\mathit{robustness}(x) = \max_{x_i \in N_x} \frac{\lVert s_i - s \rVert_2}{\lVert x_i - x \rVert_2}$$

with $N_x = \{x_j \in X \mid \lVert x_j - x \rVert_2 \le \epsilon\}$. Here $x$ is the image to explain and $s$ is the saliency map of its explanation $e$. We name coherence the explainer's ability to return similar explanations for instances labeled with the same black box outcome, i.e., similar instances. We name stability, often also called sensitivity, the capacity of an explainer to leave an explanation unchanged in the presence of noise on the explained instance. Therefore, for coherence the set $X$ in the robustness formula is formed by real instances, while for stability $X$ is formed by the instances to explain modified with random noise (as in [21]; in our experiments $N_x$ uses a fixed radius $\epsilon$ and the noise is salt and pepper noise).
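The local Lipschitz estimation can be illustrated with a short NumPy sketch. It assumes the estimate is the largest ratio between the distance of two saliency maps and the distance of the corresponding images, taken over the neighbors within a radius `eps` of the image to explain. The function name `local_lipschitz`, the toy vectors and the value of `eps` are hypothetical and chosen only for illustration.

```python
import numpy as np

def local_lipschitz(x, s, neighbors, neighbor_maps, eps):
    """Largest ratio ||s_i - s||_2 / ||x_i - x||_2 over neighbors within eps of x."""
    ratios = []
    for x_i, s_i in zip(neighbors, neighbor_maps):
        dist = np.linalg.norm((x_i - x).ravel())
        # Skip x itself (dist == 0) and instances outside the neighborhood.
        if 0 < dist <= eps:
            ratios.append(np.linalg.norm((s_i - s).ravel()) / dist)
    return max(ratios) if ratios else 0.0

# Toy data (assumption): 4-dimensional "images" and hand-written saliency maps.
x = np.zeros(4)
s = np.zeros(4)
neighbors = [
    np.array([0.1, 0.0, 0.0, 0.0]),  # distance 0.1, inside the neighborhood
    np.array([0.0, 0.2, 0.0, 0.0]),  # distance 0.2, inside the neighborhood
    np.array([1.0, 1.0, 1.0, 1.0]),  # distance 2.0, outside for eps = 0.5
]
neighbor_maps = [
    np.array([0.2, 0.0, 0.0, 0.0]),          # ratio 0.2 / 0.1 = 2
    np.array([0.0, 0.6, 0.0, 0.0]),          # ratio 0.6 / 0.2 = 3
    np.array([100.0, 100.0, 100.0, 100.0]),  # ignored: too far from x
]

estimate = local_lipschitz(x, s, neighbors, neighbor_maps, eps=0.5)
print(estimate)
```

For coherence, `neighbors` would hold real instances; for stability, it would hold noisy copies of the instance to explain. A lower value means that small changes in the input produce small changes in the saliency map.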

Tables 3 and 4 report the mean and standard deviation of the local Lipschitz estimations of the explainers' robustness in terms of coherence and stability, respectively. Consistent with [21], our results confirm that lime does not provide robust explanations, that grad and intg are the best performers, and that abele's performance is comparable to theirs in terms of both coherence and stability. This high resilience of abele is due to the usage of the AAE, a type of model that is also adopted for image denoising [32]. Table 5 shows the robustness in terms of coherence and stability for the model-agnostic explainers abele and lime with respect to the RF. Again, abele exhibits a more robust behavior than lime. Fig. 13 and 14 compare the saliency maps of selected images from mnist and fashion labeled with the DNN. Numbers on the top represent the ratio in the robustness formula. Although there is no change in the black box outcome, we can see that for some of the other explainers, such as lime, elrp, and grad, the saliency maps vary considerably. On the other hand, abele's explanations remain coherent and stable. We observe that in both nines and boots the yellow fundamental area does not change, especially within the image's edges. Also the red and blue parts, which can be varied without impacting the classification, are almost identical, e.g. the boots' neck and the sole in Fig. 13, or the top left of the zero in Fig. 14.

Figure 13: Saliency maps for mnist (left) and fashion (right) comparing two images with the same DNN outcome; numbers on the top are the coherence (the lower the better).
Figure 14: Saliency maps for mnist (left) and fashion (right) comparing the original image in the first row and the modified version with salt and pepper noise but with the same DNN outcome; numbers on the top are the stability (the lower the better).
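As a minimal sketch of the stability setting, the snippet below corrupts an image with salt and pepper noise and computes the ratio between the change in the saliency map and the change in the image. The noise amount, the helper names `salt_and_pepper` and `stability_ratio`, and the `toy_explainer` are assumptions made for illustration; the paper does not specify them here.

```python
import numpy as np

def salt_and_pepper(image, amount, rng):
    """Set a random fraction `amount` of pixels to 0 (pepper) or 255 (salt)."""
    noisy = image.copy()
    n_corrupt = int(round(amount * image.size))
    idx = rng.choice(image.size, size=n_corrupt, replace=False)
    noisy.flat[idx] = rng.choice([0, 255], size=n_corrupt)
    return noisy

def stability_ratio(image, noisy, explainer):
    """||s_noisy - s||_2 / ||x_noisy - x||_2 for one noisy copy of the image."""
    # Cast to float first: subtracting uint8 arrays would wrap around.
    x = image.astype(float)
    x_noisy = noisy.astype(float)
    s = explainer(image)
    s_noisy = explainer(noisy)
    return np.linalg.norm((s_noisy - s).ravel()) / np.linalg.norm((x_noisy - x).ravel())

# Hypothetical explainer: the "saliency map" is just the rescaled image.
def toy_explainer(image):
    return image.astype(float) / 255.0

rng = np.random.default_rng(0)
image = np.full((28, 28), 128, dtype=np.uint8)
noisy = salt_and_pepper(image, amount=0.1, rng=rng)

print(int((noisy != image).sum()), stability_ratio(image, noisy, toy_explainer))
```

With a real explainer in place of `toy_explainer`, repeating this over several noisy copies and taking the largest ratio would give a stability estimate for one instance.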

7 Conclusion

We have presented abele, a local model-agnostic explainer using the latent feature space learned through an adversarial autoencoder for the neighborhood generation process. The explanation returned by abele consists of exemplar and counter-exemplar images, labeled with the class identical to, and different from, the class of the image to explain, and of a saliency map highlighting the importance of the areas of the image contributing to its classification. An extensive experimental comparison with state-of-the-art methods shows that abele addresses their deficiencies and outperforms them by returning coherent, stable and faithful explanations.

The method has some limitations: it is constrained to image data and does not enable causal or logical reasoning. Several extensions and future work are possible. First, we would like to investigate the effect on the explanations of changing some aspects of the AAE: (i) the number of latent dimensions, (ii) the rigidity of the discriminator in admitting latent instances, (iii) the type of autoencoder (e.g. variational autoencoders [26]). Second, we would like to extend abele to make it work on tabular data and on text. Third, we would like to employ abele in a case study generating exemplars and counter-exemplars for explaining medical imaging tasks, e.g. radiography and fMRI images. Lastly, we would like to conduct an extrinsic interpretability evaluation of abele: human decision-making in a specific task (e.g. multiple-choice question answering) would be driven by abele explanations, and these decisions could be objectively and quantitatively evaluated.

Acknowledgements. This work is partially supported by the EC H2020 programme under the funding schemes: Research Infrastructures G.A. 654024 SoBigData, G.A. 78835 Pro-Res, G.A. 825619 AI4EU and G.A. 780754 Track&Know. The third author acknowledges the support of the Natural Sciences and Engineering Research Council of Canada and of the Ocean Frontiers Institute.


References

  • [1] S. Bach, A. Binder, et al. (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7), pp. e0130140.
  • [2] J. Bien et al. (2011) Prototype selection for interpretable classification. AOAS.
  • [3] L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32.
  • [4] C. Chen, O. Li, A. Barnett, J. Su, and C. Rudin (2018) This looks like that: deep learning for interpretable image recognition. arXiv:1806.10574.
  • [5] F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv:1702.08608.
  • [6] H. J. Escalante, S. Escalera, I. Guyon, et al. (2018) Explainable and interpretable models in computer vision and machine learning. Springer.
  • [7] R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In ICCV, pp. 3429–3437.
  • [8] M. Frixione et al. (2012) Prototypes vs exemplars in concept representation. In KEOD.
  • [9] N. Frosst et al. (2017) Distilling a neural network into a soft decision tree. arXiv:1711.09784.
  • [10] I. Goodfellow et al. (2014) Generative adversarial nets. In NIPS.
  • [11] R. Guidotti, A. Monreale, and L. Cariaggi (2019) Investigating neighborhood generation for explanations of image classifiers. In PAKDD.
  • [12] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, et al. (2018) A survey of methods for explaining black box models. ACM CSUR 51 (5), pp. 93:1–42.
  • [13] R. Guidotti et al. (2018) Local rule-based explanations of black box decision systems. arXiv:1805.10820.
  • [14] R. Guidotti and S. Ruggieri (2019) On the stability of interpretable models. In IJCNN.
  • [15] S. Hara et al. (2018) Maximally invariant data perturbation as explanation. arXiv:1806.07004.
  • [16] K. He et al. (2016) Deep residual learning for image recognition. In CVPR.
  • [17] G. Hinton et al. (2015) Distilling the knowledge in a neural network. arXiv:1503.02531.
  • [18] B. Kim et al. (2016) Examples are not enough, learn to criticize!. In NIPS.
  • [19] O. Li, H. Liu, C. Chen, and C. Rudin (2018) Deep learning for case-based reasoning through prototypes: a neural network that explains its predictions. In AAAI.
  • [20] A. Makhzani, J. Shlens, et al. (2015) Adversarial autoencoders. arXiv:1511.05644.
  • [21] D. A. Melis and T. Jaakkola (2018) Towards robust interpretability with self-explaining neural networks. In NIPS.
  • [22] C. Molnar (2018) Interpretable machine learning. LeanPub.
  • [23] C. Panigutti, R. Guidotti, A. Monreale, and D. Pedreschi (2019) Explaining multi-label black-box classifiers for health applications. In W3PHIAI.
  • [24] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) Why should i trust you?: explaining the predictions of any classifier. In KDD, pp. 1135–1144.
  • [25] A. Shrikumar et al. (2016) Not just a black box: learning important features through propagating activation differences. arXiv:1605.01713.
  • [26] N. Siddharth, B. Paige, A. Desmaison, V. de Meent, et al. (2016) Inducing interpretable representations with variational autoencoders. arXiv:1611.07492.
  • [27] K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv:1312.6034.
  • [28] T. Spinner et al. (2018) Towards an interpretable latent space: an intuitive comparison of autoencoders with variational autoencoders. In IEEE VIS.
  • [29] K. Sun, Z. Zhu, and Z. Lin (2019) Enhancing the robustness of deep neural networks by boundary conditional gan. arXiv:1902.11029.
  • [30] M. Sundararajan et al. (2017) Axiomatic attribution for dnn. In ICML.
  • [31] J. van der Waa et al. (2018) Contrastive explanations with local foil trees. arXiv:1806.07470.
  • [32] J. Xie et al. (2012) Image denoising with deep neural networks. In NIPS.
  • [33] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833.