Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation

Fan Yang, et al.
Texas A&M University

With the wide use of deep neural networks (DNN), model interpretability has become a critical concern, since explainable decisions are preferred in high-stake scenarios. Current interpretation techniques mainly focus on feature attribution, which is limited in indicating why and how particular explanations are related to the prediction. To this end, an intriguing class of explanations, named counterfactuals, has been developed to further explore the "what-if" circumstances for interpretation, and enables reasoning on black-box models. However, generating counterfactuals for raw data instances (i.e., text and image) is still in an early stage due to the challenges of high data dimensionality and unsemantic raw features. In this paper, we design a framework to generate counterfactuals specifically for raw data instances with the proposed Attribute-Informed Perturbation (AIP). By utilizing generative models conditioned on different attributes, counterfactuals with desired labels can be obtained effectively and efficiently. Instead of directly modifying instances in the data space, we iteratively optimize the constructed attribute-informed latent space, where features are more robust and semantic. Experimental results on real-world texts and images demonstrate the effectiveness, sample quality, and efficiency of the designed framework, and show its superiority over other alternatives. Besides, we also introduce some practical applications based on our framework, indicating its potential beyond the model interpretability aspect.



1. Introduction

The past decade has witnessed the success of deep neural networks (DNN) in a wide range of application domains (Pouyanfar et al., 2018). Despite their superior performance, DNN models have been increasingly criticized for their black-box nature (Doshi-Velez and Kim, 2017). Interpretable machine learning techniques (Du et al., 2020) are thus becoming vital, especially in high-stake scenarios such as medical diagnosis. To effectively interpret black-box DNNs, most approaches investigate the feature attributions between input instances and output predictions through correlation analysis, so that humans can get a sense of which parts of the instance contribute most to the model decision. A typical example is the heatmap employed for image classification (Selvaraju et al., 2017), where saliency scores indicate the feature importance for one particular prediction label.

However, existing correlation-based explanations are neither discriminative nor counterfactual (Pearl and others, 2009), since they cannot help us understand why and how particular explanations are relevant to model decisions. Thus, to further explore the decision boundaries of black-box DNNs, counterfactuals have gradually come to the attention of researchers as an emerging technique for model interpretability. Counterfactuals are essentially synthetic samples within the data distribution that can flip the model prediction. With counterfactuals, humans can understand how input changes affect the model and conduct reasoning under "what-if" circumstances. Take a loan applicant whose application was rejected, for instance. Correlation-based explanations may simply indicate the features that contributed most to the rejection (e.g., income and credit), while counterfactuals can show how the application could be accepted with certain changes (e.g., by increasing the monthly income).

Recent work has already made some initial attempts at such counterfactual analysis. The first line of research (Kim et al., 2016; Chen et al., 2019) employed prototype and criticism samples in the training set as the raw ingredients for counterfactual analysis, even though those selected samples are not counterfactuals in nature. Prototypes indicate the set of data samples that best represent the original prediction label, while criticisms are the samples with the desired prediction label that are close to the decision boundary. Other work (Goyal et al., 2019; Agarwal et al., 2019) further utilized feature replacement techniques to create hypothetical instances as counterfactuals, where a query instance and a distractor instance are typically needed for generation. The key to this kind of methodology lies in effective feature extraction and an efficient replacement algorithm. Besides, contrastive intervention (Dhurandhar et al., 2018; White and Garcez, 2019) on the query instance is another common way to generate counterfactuals with regard to the desired label. By reasonably perturbing input features, counterfactuals can be obtained in the form of modified data samples.

Despite these efforts, generating valid counterfactuals for raw data instances remains challenging for the following reasons. First, effective counterfactuals for a certain label are not guaranteed to exist in the training set, so the selected prototypes and criticisms are not always sufficient for counterfactual analysis. The related sample selection algorithms are likely to select some "unexpected" instances due to data constraints (Kim et al., 2016), which would largely limit the reasoning on model behaviors. Second, efficient feature replacement for raw data instances can be very hard and time-consuming (Goyal et al., 2019). Also, relevant distractor instances for replacement may not be available in particular scenarios, such as loan applications, considering privacy and security issues. Third, modifying query samples with interventions only works on limited types of data, such as tabular data (White and Garcez, 2019) and simple image data (Dhurandhar et al., 2018). For general raw data like real-world texts or images, intervention in the data space can be extremely complicated and intractable, which makes it difficult to use in practice.

To handle the aforementioned challenges of counterfactual generation for raw instances, the high-dimensional data space and unsemantic raw features are the two obstacles ahead. To this end, in this paper, we design a framework to generate counterfactuals specifically for raw data instances with the proposed Attribute-Informed Perturbation (AIP) method. By utilizing the power of generative models, we can obtain useful hypothetical instances within the data distribution for counterfactual analysis. Essentially, the proposed AIP method guides a well-trained generative model to generate valid counterfactuals by updating its parameters in the attribute-informed latent space, which is a joint embedding space for both raw features and data attributes. Compared with the original input space, the attribute-informed latent space has two significant merits for counterfactual generation: (1) raw features are embedded as low-dimensional ones, which are more robust and efficient for generation; (2) data attributes are modeled as joint latent features, which are more semantic for conditional generation. To construct the attribute-informed latent space, we employ two types of losses to train the generative models, where the reconstruction loss guarantees the quality of the raw feature embedding and the discrimination loss ensures correct attribute embedding. Through gradient-based optimization, the proposed AIP method can iteratively derive valid generative counterfactuals that flip the prediction of the target model. In the experiments, although we simply consider a DNN as the target prediction model, due to its generally good performance on raw data instances, our proposed framework can also be easily applied to other prediction models. The main contributions of this paper are summarized as follows:


  • We design a general framework to derive counterfactuals for raw data instances by employing generative models, aiming to facilitate reasoning on the behaviors of black-box DNNs;

  • We develop AIP to iteratively update the parameters of generative models in the attribute-informed latent space, according to the counterfactual loss with regard to the desired prediction label;

  • We evaluate the designed framework with AIP on several real-world datasets including raw texts and images, and demonstrate the superiority both quantitatively and qualitatively.

2. Preliminaries

In this section, we briefly introduce some related contexts to our problem, as well as some basics of the employed techniques.

2.1. Counterfactual Explanation

Counterfactual explanation is essentially a natural extension of example-based reasoning (Rissland, 1991), where particular data samples are provided to promote the understanding of model behaviors. Nevertheless, counterfactuals are not common examples for model interpretation, since they are typically generated under "what-if" circumstances which may not actually exist. According to the theory proposed by J. Pearl (Pearl and Mackenzie, 2018), three distinct levels of cognitive ability are needed to fully master the behaviors of a particular model, i.e., seeing, doing, and imagining, from the easiest to the hardest. In fact, counterfactual explanation is raised precisely to meet the imagining-level cognition for model interpretation.

Within the context of this paper, we only discuss counterfactuals under the assumption of the "closest possible world" (Wachter et al., 2017), where desired outcomes can be obtained through the smallest changes to the world. To be specific and simple without loss of generality, consider a binary classification model $f: \mathcal{X} \rightarrow \{0, 1\}$, where $0$ and $1$ respectively indicate the undesired and desired output. The model input is further assumed to be sampled from the data distribution $\mathcal{P}_X$. Then, given a query instance $x$ with the undesired model output (i.e., $f(x) = 0$), the corresponding counterfactual $x^{*}$ can be mathematically represented as:

$$x^{*} = \arg\min_{x'} \; d(x, x') \quad \text{s.t.} \quad f(x') = 1, \;\; \mathcal{P}_X(x') \geq \delta, \tag{1}$$

where $d(\cdot, \cdot)$ indicates a distance measure defined in the input space, and $\delta$ denotes the threshold which quantifies how likely the sample is under the distribution $\mathcal{P}_X$. The obtained counterfactual $x^{*}$ is regarded as valid if it can effectively flip the target classifier $f$ to the desired prediction.

Although finding counterfactuals is somewhat similar to generating adversarial examples (in that both tasks aim to flip the model decision by minimally perturbing the input instance), the two are essentially different in nature. Following the previous settings, the adversarial sample $x_{\text{adv}}$ for model $f$, with query instance $x$, can be generally indicated by:

$$x_{\text{adv}} = x + \Delta x, \quad \Delta x = \arg\min_{\delta} \; \|\delta\|_{p} \quad \text{s.t.} \quad f(x + \delta) \neq f(x), \tag{2}$$

where $\Delta x$ denotes the adversarial perturbation on the query, and $\|\cdot\|_{p}$ represents the $\ell_p$-norm operation. Comparing with Eq. 1, we note that a counterfactual example has two significant differences from an adversarial sample. First, the counterfactual generation process is subject to the original data distribution, while adversarial samples are not constrained by the distribution. This difference brings about the fact that counterfactuals are all in-distribution samples, while adversarial examples are mostly out-of-distribution (OOD) samples. Second, counterfactual changes on the query need to be human-perceptible, while adversarial perturbations are usually inconspicuous (Sen et al., 2019). Therefore, the key problem of counterfactual explanation lies in how to generate such an in-distribution sample, with human-perceptible changes on the query, that flips the model decision as desired.
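To make the "closest possible world" idea concrete, the sketch below brute-forces Eq. 1 for a toy 2-D logistic classifier; the classifier `f`, its weights, and the grid-search radius are all hypothetical stand-ins for the black-box DNN and its optimizer, and the in-distribution constraint is dropped for brevity.

```python
import numpy as np

# Toy logistic classifier standing in for the black-box DNN:
# predicts 1 when sigmoid(W·x + B) > 0.5, i.e., when W·x + B > 0.
W, B = np.array([2.0, -1.0]), -0.5

def f(x):
    return int(1.0 / (1.0 + np.exp(-(W @ x + B))) > 0.5)

def counterfactual_search(x, desired=1, grid=121, radius=3.0):
    # brute-force the "closest possible world": the nearest x'
    # (in L2 distance) whose prediction flips to the desired label
    best, best_d = None, np.inf
    for dx in np.linspace(-radius, radius, grid):
        for dy in np.linspace(-radius, radius, grid):
            xp = x + np.array([dx, dy])
            if f(xp) == desired:
                d = np.linalg.norm(xp - x)
                if d < best_d:
                    best, best_d = xp, d
    return best

x = np.array([-1.0, 0.5])        # query with the undesired output f(x) = 0
x_cf = counterfactual_search(x)  # minimally perturbed, flipped sample
```

In higher dimensions this exhaustive search is intractable, which is exactly why the paper moves the optimization into a low-dimensional latent space.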

2.2. Generative Modeling

Generative modeling is a typical task under the paradigm of unsupervised learning. Different from discriminative modeling, which involves discriminating input samples across classes, generative modeling aims to summarize the data distribution of the input variables and further create new samples that plausibly fit into that distribution (Murphy, 2012). In practice, a well-trained generative model is capable of generating new examples that are not only reasonable, but also indistinguishable from real examples in the problem domain. Conventional examples of generative modeling include Latent Dirichlet Allocation (LDA) and the Gaussian Mixture Model (GMM).

As emerging families of generative models, the Generative Adversarial Network (GAN) (Goodfellow et al., 2014) and the Variational Auto-Encoder (VAE) (Kingma and Welling, 2013) have attracted a lot of attention due to their exceptional performance in a myriad of applications, especially image and text generation (Van den Oord et al., 2016; Hu et al., 2017). By taking full advantage of their power on high-dimensional raw data, we are able to better investigate how those data samples were created in the first place, which benefits the generation of hypothetical examples. To this end, we specifically employ such advanced generative models (i.e., GAN and VAE) to study counterfactual explanation for black-box DNNs on raw data instances, providing effective generative counterfactuals for better model understanding.

3. Counterfactual Generation

In this section, we first introduce the designed generative counterfactual framework for raw data instances. Then, we present how to specifically construct the attribute-informed latent space with generative models. Finally, we show the details of our proposed AIP method on how to effectively obtain such counterfactuals.

3.1. Generative Counterfactual Framework

Figure 1. Designed framework for counterfactual generation.

We design a framework to create counterfactual samples for raw data instances, as illustrated in Fig. 1. To effectively handle the high dimensionality and unsemantic features, we utilize generative modeling techniques to aid the counterfactual generation process. Consider a target DNN $f: \mathcal{X} \rightarrow \mathcal{Y}$, which is the black-box model for counterfactual analysis, where $\mathcal{X}$ is the input data space and $\mathcal{Y}$ denotes the model prediction space. Given a query instance $x \in \mathcal{X}$, $f$ outputs a one-hot prediction vector. To effectively generate a valid counterfactual sample $x^{*}$ that flips the decision to the desired label $y^{*}$, a generative model is trained within the framework. The applied generative modeling plays two important roles in the counterfactual generation process: (1) it guarantees that all created instances are in-distribution samples, since it can be regarded as a stochastic procedure that generates samples under the particular data distribution $\mathcal{P}_X$; (2) it generally assumes that underlying latent variables can be mapped to the data space under certain circumstances, which ensures sufficient feasibility for hypothetical examples. Thus, a well-trained generative model is the basis for high-quality counterfactuals within the designed framework.

The employed generative model specifically serves two sub-tasks for counterfactual generation, i.e., data encoding and decoding. For raw data instances like images, the input space can be extremely large, which makes it difficult and inefficient to directly create counterfactuals for the query. In our designed framework, data encoding maps the input data space to a low-dimensional attribute-informed latent space, which is formulated as a joint embedding space for both raw features and data attributes. In this way, each data sample can be effectively encoded through the encoder function $E: \mathcal{X} \rightarrow \mathcal{Z} \oplus \mathcal{A}$, where $\mathcal{Z}$ is the latent space for raw feature embeddings, $\mathcal{A}$ indicates the data attribute space, and $\oplus$ represents a concatenation operator. Conversely, decoding maps the attribute-informed latent space back to the original data space, and the decoder function can be similarly indicated by $G: \mathcal{Z} \oplus \mathcal{A} \rightarrow \mathcal{X}$. Although the encoder and decoder typically have two different focuses, they are jointly trained as a whole generative model in an end-to-end manner. How to derive them will be discussed in Sec. 3.2.
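The encoding and decoding interfaces can be sketched with a pair of toy linear maps. The dimensions, weights, and names `E`/`G` here are illustrative assumptions, not the paper's actual architecture, and for simplicity the attributes are passed in directly rather than inferred by the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64-d raw input x, 8-d feature embedding z,
# and 3 binary attributes a; E and G mirror the encoder/decoder roles.
D_X, D_Z, D_A = 64, 8, 3
W_enc = rng.normal(size=(D_Z, D_X)) / np.sqrt(D_X)
W_dec = rng.normal(size=(D_X, D_Z + D_A)) / np.sqrt(D_Z + D_A)

def E(x, a):
    # encode raw features, then concatenate the attribute vector,
    # yielding the attribute-informed latent code
    z = np.tanh(W_enc @ x)
    return np.concatenate([z, a])

def G(za):
    # decode the joint code back to the input space
    return W_dec @ za

x = rng.normal(size=D_X)
a = np.array([1.0, 0.0, 1.0])   # e.g., binary attribute annotations
za = E(x, a)                    # 11-d attribute-informed code
x_hat = G(za)                   # reconstruction in the input space
```

The key structural point is that the attribute block sits as an explicit, semantic slice of the latent code, so it can later be perturbed independently of the raw feature embedding.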

To finally obtain the counterfactual sample for model $f$ with query $x$, we further need to perturb the attribute-informed latent code produced by the deployed generative model. Specifically, we use the proposed AIP method to update the attribute-informed latent vector of $x$, according to the calculated counterfactual loss. Assuming $E(x) = z_0 \oplus a_0$, the AIP method jointly updates $z$ and $a$ so as to minimize the corresponding loss counterfactually. The overall counterfactual loss consists of two parts, i.e., a prediction loss and a perturbation loss. The prediction loss ensures the flip of the model decision, and the perturbation loss guarantees the "closest possible" changes on the query; both are indispensable for counterfactual generation. For the prediction loss, we simply follow the common cross-entropy term, expressed as $\mathrm{CE}(f(G(z \oplus a)), y^{*})$. For the perturbation loss, we employ two norms respectively on $z$ and $a$, indicated by $\|z - z_0\| + \|a - a_0\|$, to restrain the query changes, which can also be regarded as a regularization term. The overall counterfactual loss can thus be represented as follows:

$$\mathcal{L}(z, a) = \mathrm{CE}\big(f(G(z \oplus a)), \, y^{*}\big) + \lambda \, \big(\|z - z_0\| + \|a - a_0\|\big), \tag{3}$$

where $\lambda$ is a balance coefficient between the two loss terms. With the proposed AIP method, the designed framework can generate the valid counterfactual example from the optimized code through the decoder function (i.e., $x^{*} = G(z^{*} \oplus a^{*})$). The details of the proposed AIP method will be introduced in Sec. 3.3.
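A minimal sketch of this counterfactual loss follows. The desired label is assumed binary, plain Euclidean norms are assumed for both perturbation terms, and the balance coefficient (`lam` below) is a placeholder value, since the text leaves these choices unspecified.

```python
import numpy as np

def cross_entropy(p, y):
    # binary cross-entropy for the target model's predicted probability p
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def counterfactual_loss(p_desired, z, a, z0, a0, lam=0.1):
    # prediction term: push the probability of the desired label toward 1;
    # perturbation term: keep (z, a) close to the query's code (z0, a0)
    pred = cross_entropy(p_desired, 1.0)
    pert = np.linalg.norm(z - z0) + np.linalg.norm(a - a0)
    return pred + lam * pert

z0, a0 = np.zeros(4), np.zeros(2)
close = counterfactual_loss(0.9, z0, a0, z0, a0)            # confident flip, no drift
far = counterfactual_loss(0.5, z0 + 1.0, a0 + 1.0, z0, a0)  # weak flip, large drift
```

The two toy evaluations illustrate the intended trade-off: a code that flips the model confidently while staying near the query incurs a lower loss than one that drifts far away without flipping decisively.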

3.2. Attribute-Informed Latent Space

Figure 2. General illustration of the attribute-informed latent space in generative models. Particularly, blue arrows indicate the forward flow of computations, while orange arrows indicate the back-propagation flow of gradients. The dash lines denote the losses for generative model training.

Constructing an appropriate attribute-informed latent space is the key part of generative modeling in our designed framework, and it directly influences the quality of the generated counterfactuals. To achieve this, we need to train a generative model that captures the raw data features as well as the relevant data attributes, where the embedded features provide more robust bases for counterfactual analysis, and the incorporated attributes provide more semantics for conditional generation. Here, the data attributes mainly indicate the extra information from humans that accompanies raw instances, such as annotations or labels, which can usually be represented as one-hot vectors.

In practice, it is common that different generative models are employed for different tasks or data. Since different models typically involve disparate architectures, their training schemes can differ substantially from each other. Take GAN and VAE for example: a GAN is usually trained to reach an equilibrium between a generator and a discriminator function, while a VAE is typically trained to maximize a variational lower bound of the data log-likelihood. Therefore, to introduce how to specifically construct the attribute-informed latent space with generative models, we present a general illustration in Fig. 2, although it may not be fully representative of all kinds of models.

We generally introduce the modeling process with an encoder-decoder structure, which corresponds to the data encoding and decoding in our designed framework. Essentially, the attribute-informed latent space can be regarded as an extended code space of auto-encoders. By concatenating the attribute vector $a$ to the raw feature embedding $z$, the decoder function $G$ aims to achieve conditional generation based on $z \oplus a$. To ensure the attribute consistency between the original sample $x$ and the generated sample $\hat{x} = G(z \oplus a)$, a discriminator $D$ is particularly employed, which is trained separately and used to classify the attributes of $\hat{x}$. To effectively train such a generative model, two basic loss terms are required: the reconstruction loss and the discrimination loss. The overall training can be indicated by:

$$\mathcal{L}_{\text{train}} = \|x - G(z \oplus a)\| \; - \; \sum_{i} \big[ a_i \log D_i(\hat{x}) + (1 - a_i) \log \big(1 - D_i(\hat{x})\big) \big], \tag{4}$$

where $a_i$ denotes the $i$-th attribute in $a$, and $D_i(\hat{x})$ indicates the prediction of $D$ on the $i$-th attribute. After sufficient training, the encoder and decoder can be effectively obtained, and the attribute-informed latent space can be further constructed with the aid of the encoder. For specific tasks and architectures, the generative modeling process can be further enhanced with more specific losses or other advanced tricks.
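The two training terms can be sketched as follows, assuming an L1 reconstruction and a per-attribute binary cross-entropy for the discrimination loss; the weighting `beta` between the terms is an added assumption.

```python
import numpy as np

EPS = 1e-8  # numerical floor for the logarithms

def recon_loss(x, x_hat):
    # reconstruction term: decoded samples should stay close to the input
    return np.mean(np.abs(x - x_hat))

def disc_loss(a, a_pred):
    # discrimination term: per-attribute binary cross-entropy, so that
    # generated samples keep the intended attribute values
    return -np.mean(a * np.log(a_pred + EPS)
                    + (1 - a) * np.log(1 - a_pred + EPS))

def train_loss(x, x_hat, a, a_pred, beta=1.0):
    return recon_loss(x, x_hat) + beta * disc_loss(a, a_pred)

x = np.ones(4)
a = np.array([1.0, 0.0])
good = train_loss(x, x, a, np.array([0.999, 0.001]))   # near-perfect model
bad = train_loss(x, x + 0.5, a, np.array([0.5, 0.5]))  # poor model
```

A model that reconstructs its input and predicts the attributes correctly drives both terms toward zero, which is exactly the property the attribute-informed latent space relies on.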

3.3. Attribute-Informed Perturbation

With the obtained encoder and decoder for generative modeling, we then introduce the proposed AIP method to finally derive the counterfactual $x^{*}$ for the target DNN $f$ with the query $x$. To guarantee the quality of the generated counterfactuals, AIP needs to find the latent code that minimizes the counterfactual loss indicated by Eq. 3. Under the "closest possible world" assumption, the corresponding counterfactual code can be denoted as:

$$(z^{*}, a^{*}) = \arg\min_{z, a} \; \mathcal{L}(z, a). \tag{5}$$
To effectively solve Eq. 5, the proposed AIP method utilizes an iterative gradient-based optimization algorithm with dynamic step sizes (controlled by a decaying factor $\gamma$), which helps the iteration process converge faster. In each iteration, the updated $z$ and $a$ can be derived as follows:

$$z^{(t+1)} = z^{(t)} - \alpha_{z} \nabla_{z} \mathcal{L}\big(z^{(t)}, a^{(t)}\big), \qquad a^{(t+1)} = a^{(t)} - \alpha_{a} \nabla_{a} \mathcal{L}\big(z^{(t)}, a^{(t)}\big), \tag{6}$$

where $t$ indicates the iteration index, and $\alpha_z$ and $\alpha_a$ respectively denote the step sizes of the updates on $z$ and $a$. The proposed AIP method is summarized in Algorithm 1. It is important to note that AIP only optimizes the latent code, and does not update the parameters of the target model, encoder, or decoder. Thus, the proposed AIP method is less time-consuming and more easily deployed for the counterfactual generation task, compared with generative frameworks that require extra model training (Samangouei et al., 2018; Singla et al., 2020).

Input: target model $f$, query $x$, desired label $y^{*}$, encoder $E$, decoder $G$, coefficient $\lambda$, decaying factor $\gamma$, step sizes $\alpha_z$, $\alpha_a$, maximum iteration $T$
Output: Counterfactual sample $x^{*}$
1  Initialize $z \oplus a = E(x)$;
2  Initialize iteration index $t = 0$;
3  Construct the latent space with $E$;
4  while $t < T$ do
5      Update $z$ and $a$ according to Eq. 6;
6      Update step sizes with $\alpha_z \leftarrow \gamma \alpha_z$ and $\alpha_a \leftarrow \gamma \alpha_a$;
7      $t \leftarrow t + 1$;
8  Reconstruct the optimized sample $x^{*} = G(z^{*} \oplus a^{*})$;
9  if $f(x^{*}) = y^{*}$ then
10     Return $x^{*}$ as the counterfactual for $f$ with query $x$;
11 else
12     Return None -- no valid counterfactual found;
Algorithm 1 Attribute-Informed Perturbation (AIP)
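A self-contained sketch of this iterative procedure on toy components follows: the linear decoder `G`, logistic target `f_prob`, the loss weights, and the finite-difference gradients are all stand-ins for the trained generative model, the target DNN, the paper's settings, and autograd, respectively.

```python
import numpy as np

# Toy stand-ins for the trained components (assumptions for this sketch):
W_dec = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, -1.0]])       # decoder: (z concat a) -> x
w_f, b_f = np.array([1.0, 1.0]), -1.0      # target classifier parameters

def G(za):
    return W_dec @ za

def f_prob(x):
    # probability of the desired label under the toy target model
    return 1.0 / (1.0 + np.exp(-(w_f @ x + b_f)))

def loss(za, za0, lam=0.05):
    # Eq. 3 style objective: cross-entropy toward the desired label,
    # plus a norm penalty keeping the code near the query's code
    return -np.log(f_prob(G(za)) + 1e-12) + lam * np.linalg.norm(za - za0)

def aip(za0, alpha=0.5, gamma=0.95, T=200, eps=1e-4):
    # Iterative updates on the attribute-informed code with step size
    # decayed by gamma; gradients via central finite differences.
    za = za0.copy()
    for _ in range(T):
        grad = np.zeros_like(za)
        for i in range(len(za)):
            d = np.zeros_like(za)
            d[i] = eps
            grad[i] = (loss(za + d, za0) - loss(za - d, za0)) / (2 * eps)
        za -= alpha * grad
        alpha *= gamma                      # decaying step size
    return za

za0 = np.array([-1.0, -1.0, 0.0])   # query code: f(G(za0)) predicts label 0
za_cf = aip(za0)                    # perturbed code after the loop
x_cf = G(za_cf)                     # decoded counterfactual sample
```

Note that only the latent code is updated; the decoder and target weights stay frozen throughout, matching the observation that no extra model training is needed.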

4. Experiments

In this section, we evaluate the designed counterfactual generation framework with the proposed AIP method on several real-world datasets, both quantitatively and qualitatively. Overall, we mainly conduct two sets of experiments respectively on text and image counterfactual generation, by utilizing different data modeling techniques. With all conducted experiments, we aim to answer the following four key research questions:

  • How effective is the designed framework in generating counterfactuals with AIP for different types of raw data?

  • How is the quality of the counterfactuals created by our designed framework with AIP, compared with other methods?

  • How efficient is counterfactual generation under the designed framework with AIP, compared with other potential approaches?

  • Can the counterfactuals generated from our designed framework with the AIP method further benefit other practical tasks?

4.1. Experimental Settings

4.1.1. Real-World Datasets.

Throughout the experiments, we employ three real-world datasets, including both raw texts and images, to evaluate the performance of the designed framework with the AIP method. The relevant data attributes depend on the particular tasks, and are collected either from labels or annotations. The statistics of the involved datasets are shown in Table 1.

Datasets #Instance #Attribute Type Domain
Yelp 1 Raw Texts Sentiment
Amazon 1 Raw Texts Sentiment
CelebA 13 Raw Images Face Feature
Table 1. Dataset statistics in experiments.

  • Yelp User Review Dataset (Asghar, 2016): This dataset consists of user reviews from Yelp associated with relevant rating scores. We use a tailored and modified version of this data for our experiments on text counterfactual generation. Specifically, we consider reviews with ratings higher than three as positive samples and the others as negative ones, and we further use these sentiment labels as the relevant attribute for data modeling. The vocabulary of the involved Yelp data contains more than words, and the average review length is around words.

  • Amazon Product Review Dataset (He and McAuley, 2016): This dataset is also involved as a raw textual dataset for our experiments. Similar to the Yelp data, we map the original rating information of reviews into sentiment categories (i.e., positive and negative), and further model these labels as a sentiment attribute of the raw textual reviews. The Amazon dataset has more than words in its vocabulary, and the average review length is around words.

  • CelebFaces Attributes (CelebA) Dataset (Liu et al., 2015): This is a large-scale face attribute dataset, containing a large number of raw face images with human annotations. We employ this dataset for our experiments on image counterfactual generation, and select thirteen representative face attributes (out of 40) for data modeling along with the raw face images. The involved attributes include: Male, Young, Blond_Hair, Pale_Skin, Bangs, Mustache, etc.

4.1.2. Target Model for Interpretation.

Since we mainly discuss counterfactuals for raw data instances, a DNN is a natural choice as our target model. For the target DNNs in our experiments, we employ standard architectures for the corresponding tasks. For text sentiment classification, we use the common convolutional architecture in (Kim, 2014) to pre-train a DNN classifier for further counterfactual analysis. For the image attribute classification task, we similarly utilize a simple convolutional network (Guo et al., 2017) to prepare a target classifier, where the model is trained with one of the attributes as the label. During the evaluations on counterfactual generation, the target DNNs are fixed without further training.

4.1.3. Employed Generative Modeling Techniques.

In our experiments, different data modeling techniques are employed for different types of data. In particular, we use different generative models to construct the corresponding attribute-informed latent space for text and image data. For textual reviews (Yelp, Amazon), we utilize the modeling techniques introduced in (Hu et al., 2017), and build a transformer-based VAE to effectively formulate the relevant attribute-informed latent space. For face images (CelebA), we mainly follow the modeling method of AttGAN (He et al., 2019), where more complicated training schemes are employed, compared with the general one shown in Sec. 3.2, for better visual quality of the generated images. Both generative models need to be well-trained on the corresponding datasets before the counterfactual generation process, so as to guarantee the high quality of the generated counterfactuals.

4.1.4. Alternative Methods and Baselines.

To effectively evaluate the performance of the designed framework with AIP, we incorporate the following alternative methods and baselines for comparison.


  • TextBugger (Li et al., 2018): This is a general method for adversarial text generation, which is built based on the word attribution and bug selection. The created text samples can effectively flip the prediction of the target classifier. We employ this method as a baseline to specifically compare with our generated text counterfactuals.

  • DeepWordBug (Gao et al., 2018): This is another method focusing on the adversarial text generation, where a token scoring strategy is utilized to guide the character-level adversarial perturbation. This method is employed as a baseline for text counterfactuals as well.

  • FGSM (Goodfellow et al., 2015): Fast gradient sign method is a common way to generate image adversarial samples, by using the gradients of the loss with respect to the input. The sample created by this method can effectively maximize the loss, so as to flip the original prediction. We employ this method as a baseline specifically for our generated image counterfactuals.

  • Counter_Vis (Goyal et al., 2019): This is a recent method in generating image counterfactuals, where particular image regions are replaced to flip the model decision. We employ this method as an alternative method for image counterfactual generation.

  • CADEX (Moore et al., 2019): This is a state-of-the-art method for counterfactual generation, where a gradient-based method is directly applied to modify the input space of the query. This method was originally proposed for tabular data; due to the discrete nature of texts, we adapt it simply as an alternative for image counterfactuals.

  • xGEMs (Joshi et al., 2018): This is a state-of-the-art method for generating counterfactuals, which also employs the generative modeling technique for sample generation. This method only involves the latent space modeling and cannot achieve the conditional generation with semantic attributes. We employ this method as an important alternative for both text and image counterfactuals.

  • AIP_R: This is the random version of our proposed AIP method, which updates all parameters in a random way.

4.2. Implementations

In this part, we introduce the implementation details for our experiments on text and image counterfactual generation, corresponding to the following Sec. 4.3 and Sec. 4.4.

4.2.1. Details for Text Counterfactuals


We implement all related algorithms and models with the PyTorch framework. Specifically for text counterfactuals, we set the balance coefficient in Eq. 3 during the main evaluations. The decaying factor is set as , and the maximum iteration is set as . As for the initial step sizes for optimization, we set and in the related experiments.

Target DNN model for texts.

We train two CNN classifiers, respectively on the Yelp and Amazon datasets, as our target models, following the architectures introduced in (Kim, 2014). In particular, both CNNs employ the CNN-non-static version of training, as illustrated in that paper. The deployed target CNN for Yelp data has testing accuracy, and the deployed CNN for Amazon has testing accuracy.

Generative model for texts.

We employ a transformer-based VAE to conduct the relevant data modeling. Specifically, the overall structure of the employed encoder and decoder is illustrated in Tab. 2. The sizes of the transformer and the latent space are both set as . As for the training phase, we use another classifier, trained separately from the VAE, as the discriminator. The relevant batch size is , the embedding dropout rate is , and the learning rate is .

Encoder (from top to down) Decoder (from top to down)
Embedding Layer Multi-Head Attention Layers
Multi-Head Attention Layers Addition & Normalization
Addition & Normalization Dense Layer
Dense Layer Addition & Normalization
Addition & Normalization Multi-Head Attention Layers
Multi-Head Attention Layers Fully-Connected Layer
GRU Layer Softmax
Summation /
Table 2. Structure for the employed transformer-based VAE.

4.2.2. Details for Image Counterfactuals


Similar to the text counterfactuals, we also use PyTorch to implement the relevant models and algorithms. In particular, we fix the balance coefficient in Eq. 3, set the decaying factor and maximum iteration number in Algorithm 1, and choose the initial optimization step sizes accordingly.

Target DNN model for images.

We train a basic CNN on the CelebA dataset as the target, following the architecture in (Guo et al., 2017) and specifically using the “Male” annotation as the label for training. Essentially, the target CNN is a binary gender classifier that predicts an input image as either “Male” or “Female”, and the deployed CNN reaches high testing accuracy.

Generative model for images.

The generative modeling for the CelebA data largely follows the training schemes in (He et al., 2019). Our modeling attributes include: “Pale_Skin”, “Bangs”, “Black_Hair”, “Blond_Hair”, “No_Beard”, “Brown_Hair”, “Bushy_Eyebrows”, “Male”, “Eyeglasses”, “Young”, “Mustache”, “Bald”, “Mouth_Slightly_Open”. We present the employed structure of the Encoder and Decoder in Tab. 3, where DeConvolution indicates the transposed convolution operation, and a triple such as (64,4,2) denotes the channel dimension, kernel size and stride, respectively. The latent size follows from this structure. For the training phase, we fix the batch size and learning rate, and the discriminator is a multi-class classifier trained separately from the generative model.

Encoder (from top to down) Decoder (from top to down)
Convolution Layer (64,4,2) DeConvolution Layer (1024,4,2)
Normalization Normalization
Convolution Layer (128,4,2) DeConvolution Layer(512,4,2)
Normalization Normalization
Convolution Layer (256,4,2) DeConvolution Layer(256,4,2)
Normalization Normalization
Convolution Layer (512,4,2) DeConvolution Layer(128,4,2)
Normalization Normalization
Convolution Layer (1024,4,2) DeConvolution Layer(3,4,2)
Normalization /
Table 3. Structure for the employed AttGAN.

4.3. Text Counterfactual Evaluations

In this part, we evaluate the experimental results of the designed framework with AIP in generating text counterfactuals, regarding a convolutional neural network (CNN) built for sentiment classification. The raw texts for the target DNN come from the user/product reviews in the Yelp and Amazon datasets, which are split into training, development and testing sets.

4.3.1. Effectiveness Evaluation.

Figure 3. Effectiveness evaluation for text counterfactuals.

In order to evaluate the effectiveness of text counterfactuals, we employ the Flipping Ratio (FR) metric, which reflects how likely the generated text samples are to flip the model decision to the desired label. Specifically, FR can be calculated as follows:

FR = |S_flip| / |S_all|,

where S_flip indicates the set of query samples for which new flipping instances can be generated by a particular method, and S_all denotes the set of all testing queries. In our experiments, the testing queries are randomly selected from the test set. Fig. 3 illustrates our experimental results on both the Yelp and Amazon datasets. According to the numerical results, our designed framework with AIP works well on both datasets and achieves competitive performance among all alternatives and baselines, although the adversarial method TextBugger achieves the highest FR score. Besides, we also observe that AIP_R does not effectively generate flipping samples, which indicates that random optimization in the attribute-informed latent space does not help counterfactual sample generation.
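As a small self-contained sketch, the FR computation amounts to counting flipped queries; the boolean list below is illustrative data, not an experimental result.

```python
# FR = (number of queries with a successful flipping sample) / (all queries)

def flipping_ratio(flipped):
    """`flipped` holds one boolean per testing query: True if the method
    produced a sample that flips the model decision for that query."""
    return sum(flipped) / len(flipped)

# illustrative: 3 of 4 queries flipped -> FR = 0.75
fr = flipping_ratio([True, True, False, True])
```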

4.3.2. Quality Evaluation.

Figure 4. Quality evaluation for text counterfactuals.

As for the quality assessment of counterfactual samples, we employ the Latent Perturbation Ratio (LPR) metric to measure the latent closeness between the generated sample and the original query instance. Since high-quality counterfactual samples typically need to ensure sparse changes in the robust feature space, the smaller the LPR, the better the counterfactual. To be specific, LPR can be calculated by:

LPR = ||z_c − z_q|| / ||z_q||,

where ||·|| indicates the norm operation, and z_q, z_c are the latent embeddings respectively for the query instance and the generated sample. To make a fair comparison, we use the same encoder function for all generated samples to obtain the corresponding latent representation vectors, and the final LPR value for each method is recorded as the average over the testing queries. The relevant numerical results are presented in Fig. 4. From the experiments, xGEMs and the proposed AIP method significantly outperform the other baselines, indicating that their generated samples maintain more of the robust features of the query. Furthermore, the proposed AIP is slightly better than xGEMs, which may partially result from the conditional generation brought by the attribute vector. This set of results also validates the fact that adversarial samples typically exploit artifacts to flip model decisions, instead of relying on robust features.
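A minimal sketch of the LPR computation, assuming the metric is the norm of the latent perturbation relative to the norm of the query's latent vector; the vectors below are illustrative placeholders standing in for the shared encoder's outputs.

```python
import numpy as np

def lpr(z_query, z_counterfactual):
    # norm of the latent-space change, normalized by the query's norm
    return np.linalg.norm(z_counterfactual - z_query) / np.linalg.norm(z_query)

# placeholder latent vectors standing in for real encoder embeddings
z_q = np.array([1.0, 0.0, 0.0, 0.0])
z_c = np.array([1.0, 0.3, 0.0, 0.0])
value = lpr(z_q, z_c)
```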

4.3.3. Efficiency Evaluation.

Figure 5. Efficiency evaluation for text counterfactuals.

To compare efficiency, we record the time consumption of each method over the testing queries in the generation phase on the same machine. Specifically, we calculate the average time cost per query and employ it as the metric to assess the efficiency of each method. Fig. 5 shows the relevant experimental results. Based on the statistics, adversarial methods (i.e., TextBugger and DeepWordBug) typically consume less time per query on average than the counterfactual generation methods, mainly because adversarial methods do not need to conduct encoding computations before sample generation. As for our proposed AIP method, its time efficiency is roughly the same as that of xGEMs, and it is significantly better than its random version AIP_R, which needs more iterations to converge.
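The timing protocol reduces to dividing total wall-clock generation time by the query count; the sketch below uses a dummy workload in place of a real generation method.

```python
import time

def avg_time_per_query(generate, queries):
    # total wall-clock time over all queries, averaged per query
    start = time.perf_counter()
    for q in queries:
        generate(q)
    return (time.perf_counter() - start) / len(queries)

# dummy workload standing in for a generation method
cost = avg_time_per_query(lambda q: sum(range(1000)), range(100))
```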

4.3.4. Qualitative Case Studies.

Counterfactual on Negative sentiment (Yelp)
Query: this is the worst walmart neighborhood market out of any of them
TextBugger: this is the worst wa1mart neighborho0d market out of a ny of them
DeepWordBug: this id the wosrt walmart neighobrhood market out of any of htem
xGEMs: that is good walmart market out of any neighborhood
AIP: this is the best walmart neighborhood market for all of them
Counterfactual on Positive sentiment (Amazon)
Query: this item works just as i thought it would
TextBugger: this item w0rks just as i tho ught it wou1d
DeepWordBug: this item wroks just ae i thought it wolud
xGEMs: this item works out poorly just as i thought disappointed
AIP: this item works bad just as i thought it would not play
Table 4. Case studies on generated text samples.
Figure 6. Evaluations for image counterfactual generation.
Figure 7. Qualitative case studies on generated image samples.

Here, we present several representative case studies from different methods, shown in Tab. 4, to provide a qualitative comparison of the generated text samples. From Tab. 4, we can see that adversarial texts typically provide limited insight for humans in counterfactual analysis, since they mainly exploit model artifacts to flip the prediction. In contrast, with the samples generated by xGEMs and AIP, we can easily observe sentiment variation with respect to the query instance, which sheds light on model behaviors and facilitates further human reasoning about black-box models. Moreover, compared with xGEMs, the proposed AIP method usually generates more sensible counterfactuals with the aid of attribute conditions.

4.4. Image Counterfactual Evaluations

In this part, we evaluate the designed framework with AIP on image counterfactual generation. Instead of considering only one attribute for conditional generation as in texts, we take multiple attributes into account for image counterfactuals. In this set of experiments, our target DNN follows a common CNN architecture and is trained as a gender classifier, which classifies an input image as Male or Female. All raw images for the target DNN come from the CelebA dataset, which we split into training, development and testing sets. The relevant quantitative results are all illustrated in Fig. 6.

4.4.1. Effectiveness Evaluation.

For the effectiveness assessment, we still use the FR metric indicated by Eq. 7. In the experiments, we test how many of the selected testing queries can be effectively flipped by each method. Fig. 6(a) illustrates the relevant numerical results, where the adversarial method FGSM performs best on FR and can flip nearly every testing query. The proposed AIP method ranks second and outperforms the other counterfactual generation methods. Besides, CADEX and AIP_R perform relatively poorly on the image counterfactual task within the allotted iterations, even though CADEX has been shown to work well for tabular instances (Moore et al., 2019).

4.4.2. Quality Evaluation.

Similar to the text counterfactual scenario, we employ the LPR metric, shown as Eq. 8, to measure the quality of the generated image counterfactuals. The corresponding LPR for each method is recorded as the average over the testing queries. Relevant experimental results are shown in Fig. 6(b). Based on the LPR comparison, we note that the samples generated by FGSM and CADEX change considerably in the latent feature space, because both methods rely directly on input perturbation for sample generation. The proposed AIP achieves the lowest LPR among all alternatives and baselines, and is significantly better than its random version AIP_R.

4.4.3. Efficiency Evaluation.

We similarly employ the average time consumption per query to evaluate efficiency for image counterfactual generation. Specifically, the average time is computed over testing queries randomly selected from the test set. Fig. 6(c) shows the relevant experimental results. According to the statistics, FGSM is the most efficient, and xGEMs consumes the least time on average among the counterfactual-based methods. The proposed AIP shows competitive efficiency, remarkably superior to that of Counter_Vis, CADEX and AIP_R.

4.4.4. Qualitative Case Studies.

To facilitate a qualitative comparison among different methods, we show some case studies in Fig. 7. We select several query instances whose model predictions are female, and then employ different methods to generate image samples that flip the model decision for counterfactual purposes. According to the results, the samples generated by FGSM and CADEX show no salient visual changes with respect to the query instances, which largely limits human reasoning about model behaviors. Among the other methods, the proposed AIP generates counterfactuals with better visual quality, presenting much smoother transitions from female to male.

4.5. Influence of Hyper-parameter

In this part, we show some additional results on the influence of the balance hyper-parameter in Eq. 3, with all other experimental settings unchanged. The relevant results are shown in Fig. 8.

Figure 8. Influence of the balance coefficient on FR and LPR metrics.

Based on the results, we observe that the balance coefficient serves as a knob controlling the effectiveness and sample quality of the designed framework. To select an appropriate value, we need to strike a balance between FR and LPR: the larger the coefficient, the lower the effectiveness and the higher the sample quality. Different data types may also have different trade-off curves.

4.6. Applications

In this part, we focus on some practical scenarios which may benefit from the counterfactual samples generated by our designed framework. In particular, we show the applications of the framework respectively on feature interaction and data augmentation.

4.6.1. Feature Interaction

Figure 9. Feature interactions for the decision change.

Understanding feature interaction can be very important in many real-world domains. A typical example is the bias detection task, where humans aim to find a set of related features that significantly influence the correctness or fairness of model decisions. Our designed framework for counterfactual analysis can partially support this task: by observing the perturbation scale on the attribute vector of the generated counterfactual, humans can get a sense of which semantic features contribute significantly to flipping the model decision. To illustrate this point, we show another case result from the designed framework with AIP in Fig. 9. Here, we train an age classifier on the CelebA dataset as the target DNN, and analyze the feature interaction for a query predicted as “Old”. Based on the attribute perturbations of the generated sample, the top semantic attributes, besides the target attribute, are “Male”, “Bushy_Eyebrows”, “Black_Hair” and “Bangs”. This result demonstrates that the “Male” attribute has a strong interaction with the predicted attribute for this particular query, and that the target DNN exhibits potential gender bias in its age predictions.
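The readout described here — ranking semantic attributes by how much the counterfactual perturbs them — can be sketched as below; the attribute names and values are illustrative, not taken from the actual CelebA experiment.

```python
import numpy as np

def top_perturbed_attributes(names, a_query, a_counterfactual, k=4):
    # rank attributes by the absolute change between the query's
    # attribute vector and the counterfactual's
    deltas = np.abs(np.asarray(a_counterfactual) - np.asarray(a_query))
    order = np.argsort(-deltas)
    return [names[i] for i in order[:k]]

names = ["Male", "Bangs", "Black_Hair", "Eyeglasses", "Young"]
top = top_perturbed_attributes(names,
                               a_query=[0.1, 0.2, 0.5, 0.0, 0.9],
                               a_counterfactual=[0.9, 0.65, 0.1, 0.1, 0.85])
```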

4.6.2. Data Augmentation

Dataset CNN (Kim, 2014) VDCNN (Conneau et al., 2017)
Yelp Initial 82.33% (± 0.61%) 88.79% (± 0.53%)
Yelp Augmented 83.16% (± 0.57%) 89.95% (± 0.46%)
Amazon Initial 81.96% (± 0.52%) 88.55% (± 0.63%)
Amazon Augmented 82.41% (± 0.49%) 88.76% (± 0.55%)
Dataset CNN (Guo et al., 2017) ResNet (He et al., 2016)
CelebA Initial 87.32% (± 0.22%) 90.96% (± 0.27%)
CelebA Augmented 88.85% (± 0.21%) 91.35% (± 0.25%)
Table 5. Model performance with data augmentation.

Another application of the designed framework is data augmentation for model training. By taking full advantage of the generated counterfactual samples as new training instances, we aim to obtain DNN models with higher performance and robustness. Specifically, to test the improvement, we train several DNN models on relatively small training sets, which are subsets of the original data. For the sentiment classifiers on Yelp and Amazon, the initial training set contains both positive and negative reviews, and the extra counterfactual training samples are generated from queries randomly selected from the initial training set. For the binary age classifier on CelebA, we employ a similar training setting, where each class includes a number of initial samples and generated counterfactual samples are further incorporated for augmentation. Relevant experimental results are shown in Tab. 5. Based on the statistics, we note that augmented training with counterfactual samples typically achieves higher classification accuracy with smaller variance, which can also be observed under more advanced DNN structures.
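The augmentation scheme amounts to appending generated counterfactuals, with flipped labels, to the original training set. The sketch below uses a hypothetical identity generator and toy data purely for illustration.

```python
import random

def augment(train_set, generate_counterfactual, flip_label, n_extra, seed=0):
    # sample queries from the training set, generate a counterfactual for
    # each, and append them with their flipped labels
    rng = random.Random(seed)
    queries = rng.sample(train_set, n_extra)
    extra = [(generate_counterfactual(x), flip_label(y)) for x, y in queries]
    return train_set + extra

train = [("good movie", 1), ("bad movie", 0),
         ("great food", 1), ("awful food", 0)]
augmented = augment(train,
                    generate_counterfactual=lambda x: x,  # placeholder
                    flip_label=lambda y: 1 - y,
                    n_extra=2)
```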

5. Related Work

Generating counterfactual explanations is just one of many interpretation approaches for black-box models, which broadly belongs to the family of interpretable machine learning. According to the particular problems they focus on, interpretation methods can generally be divided into the following three categories.

The first category of methods aims to answer the “What”-type questions, i.e., what part of the input contributes most to the model prediction. A representative work in this category is LIME (Ribeiro et al., 2016), where the authors employ linear models to approximate the local decision boundary and formulate explanation selection as a sub-modular optimization problem. The feature importance in LIME is essentially obtained by observing the prediction changes after perturbing input samples. Similar methods can be found in Anchors (Ribeiro et al., 2018) and SHAP (Lundberg and Lee, 2017). Another common methodology under this category is to utilize model gradient information, where gradients are typically regarded as an indicator of perturbation sensitivity. Related methods can be found in GradCAM (Selvaraju et al., 2017), Integrated Gradients (Sundararajan et al., 2017), and SmoothGrad (Smilkov et al., 2017).

The second category aims to answer the “Why”-type questions, i.e., why the input is predicted as label A instead of B. The methods in this category can be quite different from the previous ones, since they need to consider two labels simultaneously. Several methodologies have been proposed for this problem. For example, the authors in (Dhurandhar et al., 2018) design a contrastive perturbation method to derive the related positive and negative features of the input with respect to the concerned label. Besides, a general method based on structural causal models is proposed in (Miller, 2018) to tackle the problem in classification and planning scenarios. Also, a generative framework, CDeepEx, is designed in (Feghahati et al., 2018) to investigate this problem for images by utilizing a GAN.

The third category lies in the “How”-type questions, i.e., how to modify the input so as to flip the model prediction to the preferred label. This problem is a natural extension of the “Why”-type, and it can to some extent be handled by the second category of methods in simple scenarios. However, for problems with high-dimensional spaces, those methods typically fail due to the intractable computation for sample modification. Several methods have been proposed to address this issue. For example, the authors in (Goyal et al., 2019) propose a straightforward solution based on image region replacement, which is essentially a feature replacement process for the input with the aid of a distractor. In (Agarwal et al., 2019), the authors instead use the input itself as the distractor for feature replacement by utilizing a GAN for inpainting. Besides, generative modeling is another route to this problem, and related methods can be found in (Singla et al., 2020; Joshi et al., 2018; Liu et al., 2019). Our work belongs to this branch of methodology.

6. Conclusion And Future Work

In this paper, we design a framework to generate counterfactual explanations for black-box DNN models, specifically for raw data instances. By taking advantage of generative modeling techniques, we effectively construct an attribute-informed latent space for the data and further utilize this space for counterfactual generation. To guarantee the validity of the generated samples, we propose the AIP method to iteratively optimize the attribute-informed latent vectors according to the counterfactual loss term, from which the counterfactuals are finally obtained through data reconstruction. We evaluate the designed framework with AIP on several real-world datasets, including both texts and images, and demonstrate its effectiveness, sample quality and efficiency. Future extensions of this work may include investigation under the “close possible worlds” assumption, where the goal is to find an optimal set of counterfactuals for a query instead of a single sample. Besides, employing causal models for counterfactual generation is another promising direction to explore.


  • C. Agarwal, D. Schonfeld, and A. Nguyen (2019) Removing input features via a generative model to explain their attributions to classifier’s decisions. arXiv preprint arXiv:1910.04256. Cited by: §1, §5.
  • N. Asghar (2016) Yelp dataset challenge: review rating prediction. arXiv preprint arXiv:1605.05362. Cited by: 1st item.
  • C. Chen, O. Li, D. Tao, A. Barnett, C. Rudin, and J. K. Su (2019) This looks like that: deep learning for interpretable image recognition. In Advances in Neural Information Processing Systems, pp. 8928–8939. Cited by: §1.
  • A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun (2017) Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1107–1116. Cited by: Table 5.
  • A. Dhurandhar, P. Chen, R. Luss, C. Tu, P. Ting, K. Shanmugam, and P. Das (2018) Explanations based on the missing: towards contrastive explanations with pertinent negatives. In Advances in neural information processing systems, pp. 592–603. Cited by: §1, §1, §5.
  • F. Doshi-Velez and B. Kim (2017) Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608. Cited by: §1.
  • M. Du, N. Liu, and X. Hu (2020) Techniques for interpretable machine learning. Commun. ACM 63 (1), pp. 68–77. Cited by: §1.
  • A. Feghahati, C. R. Shelton, M. J. Pazzani, and K. Tang (2018) CDeepEx: contrastive deep explanations. Cited by: §5.
  • J. Gao, J. Lanchantin, M. L. Soffa, and Y. Qi (2018) Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 50–56. Cited by: 2nd item.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR, Cited by: 3rd item.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.2.
  • Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee (2019) Counterfactual visual explanations. In International Conference on Machine Learning, pp. 2376–2384. Cited by: §1, §1, 4th item, §5.
  • T. Guo, J. Dong, H. Li, and Y. Gao (2017) Simple convolutional neural network on image classification. In 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), pp. 721–724. Cited by: §4.1.2, §4.2.2, Table 5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Table 5.
  • R. He and J. McAuley (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pp. 507–517. Cited by: 2nd item.
  • Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen (2019) Attgan: facial attribute editing by only changing what you want. IEEE Transactions on Image Processing 28 (11), pp. 5464–5478. Cited by: §4.1.3, §4.2.2.
  • Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, pp. 1587–1596. Cited by: §2.2, §4.1.3.
  • S. Joshi, O. Koyejo, B. Kim, and J. Ghosh (2018) XGEMs: generating examplars to explain black-box models. arXiv preprint arXiv:1806.08867. Cited by: 6th item, §5.
  • B. Kim, R. Khanna, and O. O. Koyejo (2016) Examples are not enough, learn to criticize! criticism for interpretability. In Advances in neural information processing systems, pp. 2280–2288. Cited by: §1, §1.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the Conference on EMNLP, pp. 1746–1751. Cited by: §4.1.2, §4.2.1, Table 5.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.2.
  • J. Li, S. Ji, T. Du, B. Li, and T. Wang (2018) Textbugger: generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271. Cited by: 1st item.
  • S. Liu, B. Kailkhura, D. Loveland, and H. Yong (2019) Generative counterfactual introspection for explainable deep learning. Technical report, Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States). Cited by: §5.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738. Cited by: 3rd item.
  • S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in neural information processing systems, pp. 4765–4774. Cited by: §5.
  • T. Miller (2018) Contrastive explanation: a structural-model approach. arXiv preprint arXiv:1811.03163. Cited by: §5.
  • J. Moore, N. Hammerla, and C. Watkins (2019) Explaining deep learning models with constrained adversarial examples. In Pacific Rim International Conference on Artificial Intelligence, pp. 43–56. Cited by: 5th item, §4.4.1.
  • K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: §2.2.
  • J. Pearl and D. Mackenzie (2018) The book of why: the new science of cause and effect. Basic Books. Cited by: §2.1.
  • J. Pearl et al. (2009) Causal inference in statistics: an overview. Statistics surveys 3, pp. 96–146. Cited by: §1.
  • S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M. Shyu, S. Chen, and S. Iyengar (2018) A survey on deep learning: algorithms, techniques, and applications. ACM Computing Surveys (CSUR) 51 (5), pp. 1–36. Cited by: §1.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §5.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2018) Anchors: high-precision model-agnostic explanations. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §5.
  • E. L. Rissland (1991) Example-based reasoning. Informal reasoning in education, pp. 187–208. Cited by: §2.1.
  • P. Samangouei, A. Saeedi, L. Nakagawa, and N. Silberman (2018) ExplainGAN: model explanation via decision boundary crossing transformations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 666–681. Cited by: §3.3.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §1, §5.
  • A. Sen, X. Zhu, L. Marshall, and R. Nowak (2019) Should adversarial attacks use pixel p-norm?. arXiv preprint arXiv:1906.02439. Cited by: §2.1.
  • S. Singla, B. Pollack, J. Chen, and K. Batmanghelich (2020) Explanation by progressive exaggeration. In 8th International Conference on Learning Representations, ICLR 2020, Cited by: §3.3, §5.
  • D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825. Cited by: §5.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3319–3328. Cited by: §5.
  • A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pp. 4790–4798. Cited by: §2.2.
  • S. Wachter, B. Mittelstadt, and C. Russell (2017) Counterfactual explanations without opening the black box: automated decisions and the gdpr. Harv. JL & Tech. 31, pp. 841. Cited by: §2.1.
  • A. White and A. d. Garcez (2019) Measurable counterfactual local explanations for any classifier. arXiv preprint arXiv:1908.03020. Cited by: §1, §1.