Generative Counterfactual Introspection for Explainable Deep Learning

In this work, we propose an introspection technique for deep neural networks that relies on a generative model to instigate salient editing of the input image for model interpretation. Such modification provides the fundamental interventional operation that allows us to obtain answers to counterfactual inquiries, i.e., what meaningful change can be made to the input image in order to alter the prediction. We demonstrate how to reveal interesting properties of the given classifiers by applying the proposed introspection approach to both the MNIST and the CelebA datasets.


1 Introduction

The recent success of deep neural networks has led to many breakthroughs in various application domains [1, 2, 3]. However, these advances have also introduced increasingly complex and opaque models with decision boundaries that are extremely hard to understand. Despite many recent developments in explainable AI, there are still enormous challenges in explaining deep neural networks. Most existing model introspection approaches [4, 5, 6] focus on studying the correlation between inputs and outputs (or predictions), e.g., by identifying regions of the input image that contributed most to the final model decision. However, these methods do not consider alternative decisions or identify changes to the input that could result in different outcomes, i.e., they are neither discriminative nor counterfactual [7]. To reliably address some of the most important introspection questions, the ability to reason about causal relationships beyond correlation is necessary.

Knowing the causal reasoning behind a prediction is vital in fields such as drug or material discovery [8], where the aim is to map a known value from the output (i.e., property) space back to a set of input experimental parameters. More importantly, for a given input and output data pair, it is useful to understand how the input data could be changed to produce an output closer to a desired target. The necessary edits to the input data, in the form of actionable knobs (implicit or explicit attribute changes), can provide a better understanding of complex decision boundaries.

A promising technique for investigating the decision boundaries of a model is the prototype- and criticism-based explanation approach [9]. In this approach, given a query sample, a prototype is defined as a quintessential data sample that best represents the class to which the query sample belongs, while a criticism is a data sample from a different target class that lies closest to the decision boundary. Explainable AI can take advantage of these relationships, as both prototype and criticism examples help build an intuitive understanding of a model and elucidate the changes in the input space needed to achieve different responses. However, current prototype- and criticism-based explanation approaches are not counterfactual in nature and cannot provide actionable feedback. Existing counterfactual explanation techniques [10] are limited to generating criticisms by intervening in the original data space. Specifically, they generate criticisms by replacing part of the query image with specific regions of a ‘distractor’ image that the classifier assigns to a different class. However, making changes in the original data space (e.g., square tiles of the image) is unlikely to provide actionable feedback, which is essential for many use cases, e.g., experimental knobs in a scientific application. Furthermore, such changes may not be semantically meaningful, and the solution space of potential explanations is restricted by the number of semantically meaningful changes available in the original data space.

To overcome these limitations, in this work, we develop a generative counterfactual introspection framework to produce inherently interpretable and actionable counterfactual visual explanations in the form of prototypes and criticisms. The counterfactual explanation generation problem is given as follows:

Given a ‘query’ image for which a classifier predicts some class, a counterfactual visual explanation identifies which aspects (or attributes) of the query image should be changed so that the classifier either outputs a different target class for the modified image (i.e., the criticism) or assigns the modified image to the original class with higher confidence (i.e., the prototype).

To solve this problem, we propose to employ powerful generative models along with an attribute (or actionable latent feature) editing mechanism [11] to develop a generative and actionable counterfactual explanation generation framework (see Figure 1). To the best of our knowledge, this is the first approach that explores the decision boundaries between classes and their relationship to the input data by providing actionable feedback and generating counterfactual prototype- and criticism-based explanations.

Figure 1: The illustration of the generative counterfactual introspection concept.

2 Related Work

Recently, quite a few model introspection methods have been proposed to allow for interpretability of a given prediction. Many CNN interpretation methods [4, 12, 5, 6, 13], such as GradCAM [12], utilize backpropagation to conduct sensitivity analysis by attributing the prediction to the input domain (e.g., image pixels). Alternatively, we can build a simpler localized model to approximate the complex nonlinear model [14, 15]. In the LIME [14] work, the authors create a linear model to approximate the neural network around a specific prediction in order to directly attribute the prediction result to the input domain. As proposed in [15], the decision process of the neural network can also be modeled as a partition tree in the feature space. To understand how components of the network work, a variety of methods have been introduced to visualize the feature (or pattern) that a given neuron or layer aims to capture [16, 17, 18] or to examine the representation of high-level concepts in the latent representations [19].

With the pressing need to obtain a causal understanding of model behavior, interpretation approaches [20, 21, 22, 23] focusing on counterfactual reasoning have been proposed. In [20], the counterfactual query is utilized as the fundamental tool for evaluating the fairness of high-impact social applications. In the counterfactual visual explanation [22] work, a patch-based edit of the input image is optimized in order to satisfy the intended change in the prediction. In the grounding visual explanation [23] work, text-based explanations are generated to provide counterfactual explanations for an image classification task. Besides the causal interpretation methods, as demonstrated in [9], examining the relationship between the trained model and the training dataset can also help interpret model behavior.

The safety of deep neural nets has been challenged by the existence of adversarial samples [24, 25, 26], in which small but intentionally worst-case perturbations lead to a change in the prediction. Several specialized optimization approaches, such as the fast gradient sign method [24], have been proposed to find such perturbations. Conceptually, an adversarial example can also be considered an answer to a counterfactual query, as it reveals a modification to the input that leads to a change in the prediction. However, as the adversarial changes are imperceptible, they cannot reveal potential bias to humans. We address this problem by utilizing generative adversarial networks (GANs) [11, 27] to generate modifications of the input, which ensures a meaningfully edited image rather than an adversarial example.

3 Method

In order to explain a query image with respect to the decision boundaries of a classifier trained on an image set, we aim to produce counterfactual prototypes and criticisms. Next, we formalize this problem and then present our solution.

3.1 Minimal Change Counterfactual Example Generation

Given a query image for which the classifier predicts some class, we seek to identify the key attribute changes such that applying them to the query image would lead the network either to change its decision to the target class (i.e., criticism) or to become more confident about the query class (i.e., prototype). We consider two cases: 1) the attributes are known and given for the image set, or 2) the attributes are unknown, in which case they are learned from the image set. Furthermore, these attributes are expected to be actionable, i.e., we should be able to change these attributes and generate corresponding changes in the query image. To enable this, we employ a powerful generative machine learning model, the generative adversarial network (GAN) [27]. GANs transform vectors of generated noise (or latent factors) into synthetic samples resembling the data in the training set. GANs (and the corresponding latent space) are learned in an adversarial manner, a concept taken from game theory that assumes two competing networks: a discriminator (differentiating real from synthetic samples) and a generator (learning to produce realistic synthetic samples by transforming latent factors). This adversarial learning has been shown to capture salient attributes of the data in an unsupervised manner, and these attributes can later be manipulated using the generator. GANs can also be used for simultaneously generating and manipulating images with known and desired attributes [11]. We use both of these formulations in our framework depending on whether actionable attributes are known or unknown, where the latter uses the latent representations as our attributes.

We use two forms of generative editing model, depending on whether actionable attributes are known or unknown. The goal is to manipulate single or multiple attributes of an image, i.e., to generate a new image with the desired attributes while preserving other details, or to manipulate a latent vector in a similar fashion. Given these generative editing mechanisms, we formulate the minimal-change counterfactual explanation generation problem for a given image, its attribute vector, and a target attribute vector (with the latent vector and a target latent vector taking the place of the attribute vectors when attributes are unknown) as follows:


where the target class is the criticism class. When the goal is to generate prototypes, we set the target class to the original class label of the query image and formulate an alternative loss function that promotes solutions maximizing class confidence, rather than admitting the trivial solution in which the query image is left unchanged.


3.2 Approximate Solution

Most deep neural network based models make formulation (1) non-linear and non-convex, so a closed-form solution is hard to find. Thus, we formulate a relaxed version of this optimization problem which can be solved efficiently using gradient descent algorithms. The proposed approach relaxes the optimization problem (1) as follows:
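One form consistent with the surrounding description can be written in assumed notation: query image $I$, attribute-conditioned generator $G$, edited image $I(a') = G(I, a')$, classifier $C$, target class $c'$, and trade-off weight $\lambda$. None of these symbols, the squared norm, or the placement of $\lambda$ is confirmed by the text, so this is a sketch rather than the exact objective:

```latex
\min_{a'} \; \mathrm{loss}_{C,\,c'}\bigl(I(a')\bigr) \;+\; \lambda \,\bigl\lVert I - I(a') \bigr\rVert_2^2,
\qquad I(a') = G(I, a')
```

When attributes are unknown, the same form would apply with a latent vector in place of the attribute vector.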


where the loss is the cross-entropy loss for predicting the target label from the edited image using the classifier. Note that both the classifier and the generator are differentiable. The gradient of the objective function is computed by back-propagation, and the minimal-change counterfactual example generation problem is solved using gradient descent. Furthermore, to generate an explanation with minimum change, one can repeatedly solve this optimization problem using gradient descent, updating the trade-off weight between the two terms using bisection search or any other method for one-dimensional optimization.
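The procedure described above can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the generator and classifier are toy linear maps (`W_gen`, `W_clf`) rather than a GAN and a deep network, the gradient is written by hand instead of back-propagated, the distance term is a squared L2 norm weighted by `lam`, and the bisection assumes that the target class is reached for small `lam` and not for large `lam`.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def predict(z, W_gen, W_clf):
    """Class assigned by the toy classifier to the generated sample G(z)."""
    return int(np.argmax(W_clf @ (W_gen @ z)))

def counterfactual_search(z0, W_gen, W_clf, target, lam, lr=0.1, steps=500):
    """Gradient descent on z for CE(C(G(z)), target) + lam * ||G(z) - G(z0)||^2."""
    x0 = W_gen @ z0                          # original (query) sample
    onehot = np.eye(W_clf.shape[0])[target]
    z = z0.astype(float).copy()
    for _ in range(steps):
        x = W_gen @ z
        p = softmax(W_clf @ x)
        # d(cross-entropy)/d(logits) = p - onehot; chain rule back to x, then to z
        grad_x = W_clf.T @ (p - onehot) + 2.0 * lam * (x - x0)
        z -= lr * (W_gen.T @ grad_x)
    return z

def largest_flipping_weight(z0, W_gen, W_clf, target, lo=0.0, hi=1.0, iters=20):
    """Bisection on lam: keep the largest weight that still reaches the target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        z = counterfactual_search(z0, W_gen, W_clf, target, lam=mid)
        if predict(z, W_gen, W_clf) == target:
            lo = mid                         # target reached: try a stronger penalty
        else:
            hi = mid                         # penalty too strong: relax it
    return lo

# Toy setup (hypothetical values): 2-D latent, 3-D "image", 2 classes.
W_gen = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W_clf = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
z0 = np.array([2.0, 0.0])                    # classified as class 0

z_cf = counterfactual_search(z0, W_gen, W_clf, target=1, lam=0.05)
lam_star = largest_flipping_weight(z0, W_gen, W_clf, target=1)
print(predict(z0, W_gen, W_clf), predict(z_cf, W_gen, W_clf), lam_star)
```

Setting `target` to the class already predicted for the query turns the same loop into a search for a more confident sample, which is one way to read the prototype case; the paper's own prototype loss may differ.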

4 Experiments

Here we demonstrate the effectiveness of the proposed counterfactual explanation generation approach on two datasets (one with known attributes and one with unknown attributes). The proposed method outputs modified images that satisfy counterfactual queries, along with the actionable attribute values needed to achieve these results, in turn providing a comprehensive understanding of the decision boundaries of the classifier.

4.1 MNIST dataset

In this experiment, we consider the problem of classifying a given image of a handwritten digit into one of 10 classes (0 to 9). We use the MNIST dataset [28] which contains 60,000 training and 10,000 test images of handwritten digits. The classifier [29] is trained on MNIST training set and achieves accuracy on the test set. We utilize a pretrained DCGAN architecture [30] (with a 10D latent space) as our image generator. Given a 10D latent vector, the generator produces a digit image. The proposed optimization method updates the latent vector to generate a meaningful modification of the image that answers the counterfactual query.

Figure 2: Finding criticism of the digit class.

As shown in Figure 2, we illustrate meaningful changes to the image of a digit that alter its prediction. We start from the same image in each row and illustrate the optimization path from the original image to the image whose classified label matches a predefined target label. Compared to a direct optimization in the image space [24], which leads to an adversarial example, the use of a GAN keeps the search on the “manifold” of meaningful images that the generator can produce. As a result, these edits provide us with valuable insights regarding the classifier's decision boundary, i.e., what the boundary image patterns between different classes of digits are, and what kinds of changes are most likely to alter the prediction. Interestingly, we see that for certain target labels, the image first changes to a different, intermediate digit before morphing into the target digit. Alternatively, as shown in Figure 3, we also utilize a similar optimization to find the prototype for each digit by “walking” toward the center of the class on the digit image manifold. We can see the starting digits morph into a more “regular” handwriting style, which is easier for humans to recognize. These observations not only help reveal the inherent structure of the digit image manifold but also indicate the preference of the classifier regarding similarity between digits.

Figure 3: Finding prototypes of different digits.
Figure 4: The effect of regularization on the optimization path for finding criticism of 9 in the direction of 7.

For better interpretability, it is desirable to make sure the modification of the image is relatively small and consistent. As discussed in Section 3, we include a regularization term that measures the distance between the original image and the edited ones. In Figure 4, we can see that this regularization ensures the optimization process is smooth and the modifications to the images are kept to a minimum.

4.2 CelebA dataset

In several cases, attributes are known explicitly; thus, optimization can be carried out in the attribute space that a generator is conditioned on, where explicitly defined physical attributes can provide actionable feedback. We use the CelebFaces Attributes dataset (CelebA) [31], which is a large-scale face attributes dataset with more than 200K celebrity images, each with explicit attribute annotations. We consider the problem of classifying a celebrity face image in the CelebA dataset as young or old. The classifier [11] is trained on CelebA dataset and achieves average accuracy of on CelebA testing set. Next, we use AttGAN [11] as our generative editing method to generate modifications to the query face image. AttGAN can make edits to the original query image based on additional attributes (e.g., hair color, eyeglasses, bangs, baldness). As shown in Figure 5(a) and (c), we can make the image in (a) look older by setting the old/young attribute when generating the new image (c), where features such as wrinkles are added to make the subject appear older. Such a generator allows us to change only a given person’s superficial appearance without altering facial features and identity, which also allows us to obtain more meaningful counterfactual explanations for probing the behavior of the given classifier. In the following experiments, we focus on exploring the behavior of a classifier trained to predict whether a person is young or old based on the given image.

Figure 5: Illustration of different editing schemes for the input image. The original image is shown in (a). In (c), we show the edited image that is predicted as "old" after altering the "old/young" attribute of the AttGAN. In (b), we show the modification of the same image driven by the preference of the classifier (without modifying the "old/young" attribute).

Since the AttGAN generator has a young/old input attribute, a direct optimization over the entire attribute space will likely lead to a degenerate case, in which the young/old attribute itself is used to edit the image (to make it appear older to the classifier). Therefore, in our experiment, we fix the young/old attribute to the original label and only change the rest of the attributes (12 in total). In other words, we ask what kind of attribute changes (besides the young/old attribute) will make a given image appear older or younger to the given classifier.
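Holding one attribute fixed while optimizing the others can be done by masking the gradient. The sketch below is illustrative only: a logistic score on the attribute vector (`w`, `b`) stands in for the classifier applied to the generated image, the attribute values and the [-1, 1] range are hypothetical, and only four attributes are used instead of the full attribute set.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def edit_attributes(a0, w, b, frozen, lr=0.1, steps=200):
    """Gradient ascent on log sigmoid(w.a + b), leaving the frozen indices untouched."""
    mask = np.ones_like(a0)
    mask[list(frozen)] = 0.0                 # zero the update for frozen attributes
    a = a0.copy()
    for _ in range(steps):
        p = sigmoid(w @ a + b)
        grad = (1.0 - p) * w                 # gradient of log sigmoid(w.a + b) w.r.t. a
        a += lr * mask * grad
        a = np.clip(a, -1.0, 1.0)            # assumed valid attribute range
    return a

# Hypothetical attribute order: [age attribute, eyeglasses, bangs, hair color]
a0 = np.array([-1.0, -1.0, 0.5, 0.0])
w = np.array([3.0, 2.0, -1.0, 0.5])          # toy stand-in for the classifier's sensitivity
b = 0.0

a_edit = edit_attributes(a0, w, b, frozen=[0])
p_before = sigmoid(w @ a0 + b)
p_after = sigmoid(w @ a_edit + b)
print(a_edit, p_before, p_after)
```

Because the age attribute at index 0 is masked out, any increase in the "old" score has to come from the remaining attributes, which is the question the experiment asks.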

In Figure 6, we have three female celebrity faces (query images) which are classified as “young”. Here, we show the optimization path that eventually leads to an “old” classification. The rightmost column shows the top five most changed attributes and their relative changes. This result is particularly interesting, as all three examples show eyeglasses in the modified images that result in an “old” classification. One possible explanation for such an observation is that the classifier learns these patterns from the training data. To investigate this hypothesis, we explore the distributions of attributes across the training data. As shown in Figure 7, we can see a clear difference in eyeglass frequency between the young and old populations. This result demonstrates that counterfactual queries can be a very powerful tool to reveal unexpected behaviors of classifiers and highlight potential bias in the training data.
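The dataset check described above amounts to comparing how often an attribute occurs within each label group. A minimal sketch follows; the ten annotation values are made up for illustration and are not CelebA statistics.

```python
import numpy as np

def attribute_rate_by_label(attribute, label):
    """Return (rate where label is True, rate where label is False) for a binary attribute."""
    attribute = np.asarray(attribute, dtype=bool)
    label = np.asarray(label, dtype=bool)
    return float(attribute[label].mean()), float(attribute[~label].mean())

# Hypothetical annotations for ten images (1 = attribute present).
eyeglasses = [1, 0, 0, 0, 0, 0, 1, 1, 0, 1]
young      = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

rate_young, rate_old = attribute_rate_by_label(eyeglasses, young)
print(f"eyeglasses among young: {rate_young:.2f}, among old: {rate_old:.2f}")
```

A large gap between the two rates suggests that the attribute is correlated with the label in the data, which the classifier may have picked up on.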

Figure 6: Illustration of attribute changes (besides the young/old attribute) that make the images appear older to the given classifier. The leftmost column shows the original image. The rightmost column shows the top five most changed attributes and their relative changes.
Figure 7: The potential bias in the CelebA dataset. The percentage of people wearing eyeglasses is much higher in the population labeled as “old”.
Figure 8: Prototype and criticism for the images with ground truth label “old”. The leftmost column shows the original image. The rightmost column shows the top five most changed attributes and their relative changes.

To further illustrate how counterfactual examples help explain the behavior of the classifier, in Figure 8 we investigate the prototype and criticism examples for two male celebrity faces that both have the ground truth label “old”. When searching for the prototypes (i.e., making them older), we see minimal changes for the first face (second row) while observing a significant change for the second face (fourth row). This distinction indicates that the first person has a prototypical look for the “old” class, whereas the second person does not. For the criticisms (rows one and three), the opposite holds true, which indicates that the image of the second person is an outlier among “old” samples and is closer to a typical “young” image. Finally, the rightmost column provides “actionable insights” for achieving these changes. The top five most changed attributes are reasonable, with hair features being the most important factors in discriminating between the age groups.
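The "top five most changed attributes" summary can be produced by ranking attributes by the absolute difference between the original and optimized attribute vectors. The sketch below assumes that this is how the ranking is defined; the attribute names follow CelebA naming, and the numeric values are invented for illustration.

```python
import numpy as np

def top_changed_attributes(names, a_before, a_after, k=5):
    """Rank attributes by |a_after - a_before| and return the k largest with signed change."""
    delta = np.asarray(a_after, dtype=float) - np.asarray(a_before, dtype=float)
    order = np.argsort(-np.abs(delta), kind="stable")[:k]
    return [(names[i], float(delta[i])) for i in order]

names = ["Bald", "Bangs", "Blond_Hair", "Eyeglasses", "Mustache", "Pale_Skin"]
a_before = np.zeros(6)                                   # hypothetical starting attributes
a_after = np.array([0.25, -0.5, 0.125, 1.0, 0.0, -0.75])  # hypothetical optimized attributes

for name, change in top_changed_attributes(names, a_before, a_after):
    print(f"{name:12s} {change:+.3f}")
```

Keeping the sign of each change shows the direction in which an attribute would need to move, which is what makes the summary actionable.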

5 Discussion and Future Work

In this work, we present preliminary results on utilizing generative models to obtain counterfactual explanations for a given classifier. Despite the simplicity of the optimization, we demonstrate the effectiveness of the proposed approach for revealing insights regarding the behavior of deep neural network models. For future directions, we plan to explore the potential application of such interpretation methods to scientific applications, where explainability is essential for model validation and domain discovery.


This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.