Code for "Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties"
Counterfactual explanations (CEs) are a practical tool for demonstrating why machine learning classifiers make particular decisions. For CEs to be useful, it is important that they are easy for users to interpret. Existing methods for generating interpretable CEs rely on auxiliary generative models, which may not be suitable for complex datasets, and incur engineering overhead. We introduce a simple and fast method for generating interpretable CEs in a white-box setting without an auxiliary model, by using the predictive uncertainty of the classifier. Our experiments show that our proposed algorithm generates more interpretable CEs, according to IM1 scores, than existing methods. Additionally, our approach allows us to estimate the uncertainty of a CE, which may be important in safety-critical applications, such as those in the medical domain.
The growing number of decisions influenced by machine learning models drives the need for explanations of why a system makes a particular prediction (Sartor and Lagioia, 2020). Explanations are necessary for users to understand what factors influence a decision and what changes they could make to alter it. One important application for such explanations is recourse, where the explanations allow users to understand what adjustments they could make to the input to change the classification given by the model (Spangher et al., 2018).
A common approach is to generate a counterfactual explanation (CE) of the form “If X had not occurred, then Y would not have occurred” (Wachter et al., 2017). Consider the following binary classification problem: “Given the current specifications of my house (e.g., location, number of bedrooms, etc.), am I likely or unlikely to sell it for $300,000?”. On inputting the details of their house, the user might receive the classification “unlikely”. In this example, a CE could be the same house with upgraded furnishings to increase its desirability, resulting in the classification “likely”.
Methods for generating CEs focus on finding an alternate input that is close to the original input, but with the desired classification (Molnar, 2019). However, this highlights a fundamental difficulty in designing CEs, namely their similarity to adversarial examples. Both CEs and adversarial examples search for a minimal perturbation to add to the original input that changes the classification. The distinguishing conceptual feature is interpretability: while CEs should be interpretable, adversarial examples need not be.[1] However, interpretability is an ambiguous term, with varying definitions in the existing literature (Lipton, 2018).

[1] Although there is a common conception that adversarial attacks generate imperceptible changes, the term ‘perceptible’ is ill-defined, and many adversarial perturbations are visible to the human eye (see e.g. Papernot et al., 2016; Sharif et al., 2016; Brown et al., 2017).
We propose defining an interpretable CE as one that is realistic, i.e., a likely scenario for the user in question, and unambiguous, i.e., not a pathological ‘borderline’ case. Figure 1 provides an illustration of these two properties for an MNIST image (LeCun et al., 2010). Here, we want to find a minimal change that alters the class of the original image. Second from the left is an example of a CE that is not realistic – it does not resemble a “normal” instance of the target digit. Third from the left is an example of an ambiguous counterfactual; it is unclear which of two digits it depicts. The final image is a CE that is both realistic and unambiguous, which is clearly preferable. We give a more extensive definition of realism and unambiguity in Section 2.

Existing work largely focuses on generating realistic CEs, and does not consider ambiguity (Wachter et al., 2017; Dhurandhar et al., 2018; Joshi et al., 2019). Additionally, many of these approaches rely on an auxiliary generative model, in addition to the classifier, either to generate realistic CEs or to evaluate the realism of CEs in order to guide a search process. This may impose a bottleneck, as generative models are ill-suited for some datasets, and incurs engineering and maintenance overhead.
In this work, we propose capturing realism and ambiguity using the predictive uncertainty of the classifier. We consider two types of uncertainty: epistemic and aleatoric uncertainty (Kendall and Gal, 2017). Epistemic uncertainty is uncertainty due to a lack of knowledge, stemming from observing only a subset of all possible data points. We propose that CEs for which the classifier has low epistemic uncertainty are more realistic, because they are more likely under the data distribution. Aleatoric uncertainty captures inherent stochasticity in the dataset, for example due to points that lie on the decision boundary between two classes. Therefore, CEs with lower aleatoric uncertainty will have lower ambiguity. In Section 3 we discuss both concepts in more depth.
Based on these insights, we introduce a novel method for generating interpretable CEs by using a classifier that offers estimates of epistemic and aleatoric uncertainty. This method does not require an auxiliary generative model and requires less hyperparameter tuning than existing methods. Existing neural network classifiers can be easily extended to represent uncertainty, for example by using Monte Carlo dropout (Gal and Ghahramani, 2016), so this approach has a low engineering cost. Additionally, for many applications where it is necessary to offer an explanation, it may also be essential to quantify the uncertainty in the predictions. Thus, uncertainty estimates might already be available and could readily be used for generating CEs.

Our contributions are that we:
link the concepts of aleatoric and epistemic uncertainty to the concepts of unambiguous and realistic CEs (Section 3),
introduce a new method for generating interpretable CEs based on implicit minimisation of both epistemic and aleatoric uncertainty (Section 3),
demonstrate empirically, from both a qualitative and quantitative perspective, that our method generates more interpretable CEs than existing methods, despite not requiring an auxiliary model (Section 4.3).
We release an implementation of our algorithm, and the experiments, at github.com/oscarkey/explanationsbyminimizinguncertainty.
In this section we define the desirable properties of CEs, including those which make a CE interpretable.
Before doing this, we clarify the term ‘counterfactual explanation’. Consider an initial input x which is to be explained. We can write the alternative input found as the explanation as x' = x + δ, where δ is the minimal change. From here on we will use counterfactual explanation to refer to x', and counterfactual perturbation (CP) to refer to δ.
Explanation desiderata are subjective, and some are not mentioned below. Our goal is not to define a complete list of all possible desiderata, but simply to make explicit the framework and targets we consider in this work. If interested, the reader can refer to Lipton (2018) for a more in-depth discussion.
For each desideratum below, we illustrate it using the example given in the introduction: a landlord has a two-bedroom, one-bathroom house in Boston with a garage and a small garden. A classifier answers the question “Is this property likely to sell for $300,000?” with False. The goal is to generate explanations of the form “If the property had X, then the classifier would return True”.
The CE should be as similar as possible to the original instance, i.e. there should be as few changes as possible between the original input and the explanation (Huysmans et al., 2011; Wachter et al., 2017; Molnar, 2019; Laugel et al., 2019; Van Looveren and Klaise, 2019). By making as few changes as possible, we produce concise explanations that are more interpretable and avoid information overload (Lahav et al., 2018). For example, consider the following two CPs that both change the classification of the aforementioned problem to True:
repainting the kitchen
repainting both the kitchen and the bathroom
As both obtain the desired outcome, the first CP is more desirable as it is more concise.
The suggested explanation must be from a “possible world” (Wachter et al., 2017). This is important because the explanation must represent a concept that the user understands in order for it to be informative to them. For example, the explanation “if the garage was rebuilt into small rooms, then it is likely the house could be sold for $300,000” is clearly unrealistic and not informative to the user. In comparison, “if the garage was rebuilt into an en-suite bedroom, then it is likely the house could be sold for $300,000” would be a reasonable explanation. In addition, the feature values must be realistic when considered together (Joshi et al., 2019). For example, a one-bedroom house with multiple bathrooms would not be a realistic explanation, as most real houses have a higher bedroom to bathroom ratio.
CEs should be unambiguous to be informative. In this context, we take informative to mean explanations that humans can understand and learn from. For example, doctors may be interested in informative explanations from a breast cancer detection model.
Ambiguous inputs may be classified with a low confidence score; they are ‘borderline’ cases or inputs that resemble multiple classes. For example, an ‘ambiguous’ house specification is one that one buyer might value over $300,000, but another buyer might value under $300,000. For a visual example, see Figure 1, where the input resembles two different digits.
It must be possible for the user to apply the suggested CP in practice. While the ‘Realistic Explanation’ property ensures that the explanation is a possible instance, it will only provide the user with recourse if it is possible for them to apply the suggested perturbation to transition from their original input to the explanation. For example, while having an identical house to the original but in New York City would be a realistic counterfactual, it is not an actionable perturbation because the user cannot move their house to a different city.
The algorithm must generate CEs sufficiently quickly for the use case (Van Looveren and Klaise, 2019). While other computational properties of the algorithm, such as memory usage, are also important, we highlight run time because recourse is often offered in a user-facing application, so the algorithm must be able to generate CEs sufficiently quickly for this interactive setting. Many generation algorithms involve non-convex optimisation and repeated evaluations of a potentially expensive model, thus run time is a significant concern.
In our approach, we will target all of the above desiderata. We explicitly target the desiderata unambiguous and realistic through our design of the loss function. We believe these desiderata are particularly important as they distinguish CEs from adversarial examples. The remaining desiderata are targeted implicitly through the design of the optimisation procedure of our CE generation algorithm. In the next section, we will introduce our method and discuss how each desideratum is addressed.
In this section we introduce a method for generating interpretable CEs. In particular, we introduce and motivate using epistemic and aleatoric uncertainty to capture realism and unambiguity. Next, we show that minimizing both types of uncertainty can be implemented efficiently by minimizing the cross-entropy objective of specific model classes. Based on these insights, we present a fast, greedy algorithm that generates minimal perturbations that minimize both types of uncertainty, resulting in interpretable explanations. Note that our method is post-hoc: it is applied to trained classifiers to generate CEs.
We begin by following Wachter et al. (2017) in framing the task of generating CEs as an optimisation problem. Given an input x, we can generate an explanation x' in class t by solving

x' = argmin_{x'} L(f(x'), t) + λ I(x'),    (1)

where f is the classifier, L is a loss function, λ is a hyperparameter, and I is a measure of interpretability (for which lower is better). Intuitively, we want to generate an explanation in class t, which is encouraged by L, and is interpretable, as encouraged by I. The main difficulty is the definition of I, the measure of interpretability. As previously introduced, we define I by considering two key aspects of interpretability: realism and unambiguity.
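As a hedged illustration, the two-term objective in Equation 1 can be written as a plain Python function. The toy classifier, the L1 interpretability penalty, and the λ value below are stand-ins for exposition, not the paper's implementation.

```python
import numpy as np

def ce_objective(x_prime, t, classifier, interp_penalty, lam=0.1):
    """Generic CE objective: loss of classifying x' as target t, plus
    lambda times the interpretability measure I(x') (lower is better).
    All pieces are illustrative stand-ins."""
    probs = classifier(x_prime)
    loss = -np.log(probs[t] + 1e-12)  # cross-entropy against target class t
    return loss + lam * interp_penalty(x_prime)

# Toy stand-ins: a fixed two-class softmax "classifier" and an L1 penalty
# measuring distance from the original input.
def toy_classifier(x):
    logits = np.array([x.sum(), -x.sum()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = np.array([0.2, -0.1])                      # original input
x_prime = np.array([0.8, 0.3])                 # candidate explanation
val = ce_objective(x_prime, t=0, classifier=toy_classifier,
                   interp_penalty=lambda xp: np.abs(xp - x).sum())
```

Larger λ values favour explanations closer to the original input at the expense of classifier confidence in the target class.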
First we consider how to generate realistic CEs. As we discuss in Section 4, existing literature has revealed that this property is the most difficult to achieve, thus improving on it is the primary focus of our work. We suggest that, when generating a CE in target class t, we should maximise p(x'|t), where p is the training data distribution. Our justification for this builds on the work of Dhurandhar et al. (2018); Joshi et al. (2019); Van Looveren and Klaise (2019), as we discuss in detail in Section 4.1. In short, explanations which are likely under the distribution of the training data will appear familiar to the user and thus realistic. Specifically, we should consider the distribution for the target class, i.e. p(x'|t), in order to generate examples which look realistic for the particular target class t. For example, it would not be realistic for a house classified as expensive to be very small and in a cheap area.
Given this definition of realistic, Bayes’ rule gives us the following expression for the unnormalized density:

p(x'|t) = p(t|x') p(x') / p(t)    (2)
        ∝ p(t|x') p(x').    (3)
If we use a standard classification model with a softmax output, then p(t|x') is estimated by the output of the model. To compute p(x'), the likelihood of x' under the training data distribution, we have several choices. One option would be to use a separate generative model to estimate p(x'). This would lead us to a similar objective to that introduced by Dhurandhar et al. (2018) and Van Looveren and Klaise (2019).
Instead, we note that we can approximate p(x') without the need for an additional model by using a classifier that offers estimates of uncertainty over its predictions (Smith and Gal, 2018; Grathwohl et al., 2020). In particular, we can use the estimate of epistemic uncertainty. This is uncertainty about which function is most suitable to explain the data, because there are many possible functions which fit the finite training data available. Considering the input space, a Bayesian classifier will have lower epistemic uncertainty on points which are close to the training data, and the uncertainty will increase as we move away from it. Thus epistemic uncertainty should be negatively correlated with p(x'). Gal and Smith (2018) show empirically that this is in fact the case for Bayesian neural networks implemented using deep ensembles. Thus, given a classifier which offers estimates of epistemic uncertainty, we can compute an unnormalized value for p(x'|t).
Second, we consider how to generate unambiguous CEs. To capture ambiguity, we use aleatoric uncertainty. This type of uncertainty arises due to inherent noisiness, or stochasticity, in the data distribution (Smith and Gal, 2018). To generate unambiguous CEs, we generate explanations in areas of the input space where the classifier has low aleatoric uncertainty.
There are several different approaches for obtaining classifiers that offer estimates of epistemic and aleatoric uncertainty. For the experiments in this paper we choose to use an ensemble of deep neural networks, as this is a simple method for computing high-quality uncertainty estimates (Lakshminarayanan et al., 2017). In contrast to other methods for estimating uncertainty in deep learning, deep ensembles place no constraints on the architecture of the classifier. Note that our approach will work with any model that offers uncertainty estimates.
We define I(x') as the predictive entropy of the classifier when evaluated on input x'. Predictive entropy captures both aleatoric and epistemic uncertainty, and both are low when the predictive entropy is low. Specifically, the predictive entropy estimated using ensembles is

H(y | x) = − Σ_c p(y = c | x) log p(y = c | x),    (4)
where p(y = c | x) ≈ (1/M) Σ_{m=1}^{M} p_{θ_m}(y = c | x),    (5)

where we have M models in the ensemble (Smith and Gal, 2018). Here, p_{θ_m} is the softmax output of the m-th model in the ensemble.
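Equations 4 and 5 reduce to a few lines of numpy once each ensemble member's softmax output is available. As a sketch, the decomposition of the total entropy into aleatoric and epistemic parts below follows Smith and Gal (2018) and is not part of the generation algorithm itself; the example arrays are synthetic.

```python
import numpy as np

def predictive_entropy(member_probs):
    """Entropy of the ensemble-averaged predictive distribution (Eqs. 4-5).

    member_probs: array of shape (M, C), the softmax outputs of the M
    ensemble members over C classes for one input."""
    p_mean = member_probs.mean(axis=0)             # (1/M) sum_m p_{theta_m}
    return -np.sum(p_mean * np.log(p_mean + 1e-12))

def entropy_decomposition(member_probs):
    """Split predictive entropy into aleatoric and epistemic parts."""
    total = predictive_entropy(member_probs)
    # Aleatoric part: average entropy of the individual members.
    aleatoric = -np.mean(
        np.sum(member_probs * np.log(member_probs + 1e-12), axis=1))
    epistemic = total - aleatoric                  # mutual information
    return total, aleatoric, epistemic

# Members that agree confidently -> low total entropy.
agree = np.array([[0.95, 0.05], [0.9, 0.1]])
# Members that disagree -> high epistemic uncertainty.
disagree = np.array([[0.95, 0.05], [0.05, 0.95]])
total_agree = predictive_entropy(agree)
total_disagree = predictive_entropy(disagree)
```

Disagreement between members inflates the entropy of the averaged distribution even when each member is individually confident, which is exactly the epistemic signal the method exploits.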
Having defined I as the predictive entropy, we note that the λ I(x') term in Equation 1 is redundant. This is because a counterfactual that minimizes the cross-entropy (i.e., that maximizes the probability assigned to the target class) must also minimize the predictive entropy (i.e., be likely under our approximation of the data distribution). Formally:

For a classification model f, CE(f(x'), t) = 0 implies H(f(x')) = 0, where CE is cross-entropy and H is predictive entropy.

We provide a formal derivation in Appendix A. The intuition behind this proposition is that the cross-entropy is minimized (equal to 0) when the target class has probability 1 and all other classes have probability 0. In this scenario, the predictive entropy is also minimized, at 0.
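The intuition can be checked numerically: a one-hot predictive distribution drives both quantities to zero together, while a spread-out distribution keeps both large. The helper functions below are illustrative, not taken from the paper's code.

```python
import numpy as np

def cross_entropy(probs, t):
    """Cross-entropy of predicted probabilities against target class t."""
    return -np.log(probs[t] + 1e-12)

def entropy(probs):
    """Entropy of a predictive distribution."""
    return -np.sum(probs * np.log(probs + 1e-12))

# Target class t=0 has probability ~1: both CE and H are ~0 together.
one_hot = np.array([1.0, 0.0, 0.0])
# A spread-out (ambiguous) prediction keeps both quantities large.
ambiguous = np.array([0.4, 0.35, 0.25])
```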
As a result, we drop the λ I(x') term from Equation 1 and the objective becomes simply

x' = argmin_{x'} L(f(x'), t).    (6)
This simplification of the objective makes it cheaper and easier to generate CEs. We avoid minimax optimization over the parameter λ, which might otherwise increase the computational cost of the optimization. Additionally, it eliminates the hyperparameter λ, which would otherwise need to be tuned. As we discuss in detail in Section 4.1, this is an improvement over existing approaches such as Wachter et al. (2017), which requires both minimax optimization and tuning of λ, and Van Looveren and Klaise (2019), which uses an objective with several hyperparameters.
Above, we propose an objective function for generating realistic CEs, and show that it can be implicitly minimized by minimizing the cross-entropy loss. If we optimize the loss function directly, we will generate a sample in class t. However, this does not incorporate the minimality or realistic-perturbation properties. Our approach to ensuring both of these properties are satisfied is to constrain the optimization process through the optimization algorithm. Specifically, we extend the Jacobian-based Saliency Map Attack (JSMA), originally introduced by Papernot et al. (2016) for the purpose of generating adversarial examples. We adapt this algorithm to generate meaningful perturbations.
JSMA is an iterative algorithm that updates the most salient feature, i.e. the feature that has the largest influence on the classification, by a fixed amount at each step. To generate realistic CEs rather than adversarial examples, we replace the original definition of saliency by defining the most salient feature as that which has the largest gradient with respect to the objective in Equation 6:

i* = argmax_i | ∂L(f(x'), t) / ∂x'_i |,    (7)

where ∂ denotes the partial derivative and x'_i denotes the i-th feature of x'. Updating each feature iteratively by this fixed amount acts as a heuristic for minimising the distance between the original input and the CE (Papernot et al., 2016). The algorithm terminates when the input is classified as the target class with high confidence, or after reaching the maximum number of iterations. Alternatively, the algorithm can be configured to fail if the explanation does not reach the predefined confidence level. This enforces that generated explanations are those on which the classifier has low uncertainty, which may be important for certain applications. We give pseudocode in Algorithm 1.

This is a fast algorithm for generating realistic and unambiguous explanations using minimal perturbations. We also want to ensure that the perturbation is realistic and actionable. In many cases, we can manually identify the features a user cannot change, and lock these features to prevent the algorithm from perturbing them. For example, we might prevent the algorithm from changing the location of a house. This simple approach assumes that we are explicitly aware of the factors that can be changed, which is often but not always the case. We leave a detailed investigation into other approaches for generating realistic perturbations for future work.
Our proposed method works with any classifier that both offers uncertainty estimates and for which we have access to the gradients (of the cross-entropy loss with respect to the input). However, if it is possible to retrain the classifier, then the realism of the generated explanations can be improved by applying adversarial training, as we demonstrate empirically in Section 4.3. Specifically, we augment the dataset during training using adversarial examples generated by FGSM (Goodfellow et al., 2015); see Appendix C for details.
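FGSM itself is a one-line perturbation in the direction of the sign of the input gradient. The sketch below uses a toy logistic model so the gradient is available in closed form; the ε value and model are illustrative, and the actual training setup is described in Goodfellow et al. (2015) and Appendix C.

```python
import numpy as np

def fgsm_example(x, y, w, b, eps=0.1):
    """One FGSM adversarial example for the toy logistic model sigma(w.x + b).

    The gradient of the cross-entropy loss w.r.t. the input is (p - y) * w;
    FGSM steps by eps in the direction of its sign to increase the loss."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    grad = (p - y) * w
    return x + eps * np.sign(grad)

# During adversarial training, each batch would be augmented with such examples.
w, b = np.array([1.0, -2.0]), 0.0
x, y = np.array([0.5, 0.2]), 1.0
x_adv = fgsm_example(x, y, w, b)
```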
We suggest that adversarial training might improve the realism of the generated CEs for two reasons. First, Lakshminarayanan et al. (2017) demonstrate that adversarial training improves uncertainty estimation, both on indistribution and outofdistribution inputs. This should improve the performance of our method, as we generate CEs in areas of input space where the classifier has low uncertainty.
Second, adversarial training can lead to learning more robust features (Tsipras et al., 2019; Ilyas et al., 2019). Augmenting the training set with adversarial examples during training ensures that the model does not focus on noise when learning features for classification. As such, the model is more likely to learn features that are not noise, and therefore are more interpretable (Tsipras et al., 2019). An example of the effect of adversarial training is shown in Figure 2. The saliency of an adversarially trained model, as shown by the middle image, is more aligned with human interpretation than the saliency of a regular model (shown by the right image).
We discuss these two effects further in Appendix C.
Below we summarize the different methods used to generate CEs. We begin with Wachter et al. (2017), who frame the task of finding a CE x' in target class t for initial input x as the optimization problem

x' = argmin_{x'} max_λ λ L(f(x'), t) + d(x, x'),    (8)

where f is the classifier, L is a loss function (the authors use the MSE loss), d is some measure of distance (the authors use a weighted distance), and λ is a hyperparameter. This is equivalent to the objective function we use in our approach, as given in Equation 1, if I is defined as the distance to the original input. In comparison to our approach, this definition of I does not give any consideration to ensuring that x' is realistic, and Wachter et al. (2017) note that it risks generating adversarial examples.
Various approaches adapt Equation 8 in an attempt to generate realistic CEs:
Dhurandhar et al. (2018) include an additional penalty in the objective to encourage CEs to lie on the training data manifold. The authors fit an auxiliary autoencoder model to the training data. In the objective, they then include the reconstruction loss of applying this autoencoder to the CE. The assumption is that the reconstruction loss will be higher for CEs which are not likely under the training distribution, which will encourage the approach to generate realistic CEs.
Van Looveren and Klaise (2019) note that the approach introduced by Dhurandhar et al. (2018) does not take into account the data distribution of each class, for example a very large house is unlikely to also be very cheap. Thus, the authors include an additional loss term which guides the search process towards a ‘prototype’ instance of the relevant class. The prototype for each class is defined as the average location in a latent space of all the training points in that class. Again, the authors use an autoencoder to map inputs into the latent space. One disadvantage of this approach is that the objective function contains several terms, and a hyperparameter must be specified for each in order to scale them appropriately.
The methods above generate CEs by searching in input space. In contrast, Joshi et al. (2019) search in a latent space, and use a generative model to map instances from this latent space into the input space in order to evaluate them. The objective is

x' = G(z*),  z* = argmin_z L(f(G(z)), t) + c(x, G(z)),    (9)

where z is a latent variable with distribution p(z) over the latent space, G is the generative model mapping the latent space to the input space, and c is a cost function (which plays the same role as the distance function d). One limitation of this approach is that the CEs are produced by the generative model, and thus suffer from the pathologies of that model. For example, a VAE is likely to generate blurry explanations.
We claim that our approach has several advantages over these methods. First, we avoid the engineering, and potential computational, overheads of implementing, training, and maintaining an auxiliary generative model. Second, we have a simple objective function which does not involve minimax optimization or specifying hyperparameters, both of which incur additional computational cost.
The weakness of our method is that it requires a classifier which offers uncertainty estimates. This is likely to have a higher computational cost; in particular, the ensemble of classifiers that we use in our experimental work is more expensive to train and evaluate than a single model. However, our method can be used with any classifier that offers both epistemic and aleatoric uncertainty, and several fast approaches are available for deep learning models (Gal and Ghahramani, 2016; Liu et al., 2020; Van Amersfoort et al., 2020). Additionally, we argue that in many applications where the machine learning system must offer the user recourse, estimates of the uncertainty in the classification will also be required, and so will already be available from the classifier. For example, when using a machine learning tool to make a decision, estimates of the uncertainty in the predictions are very important to be able to act cautiously, or defer to a human expert, when the model is unsure.
Counterfactual examples are closely related to adversarial examples. Adversarial examples are crafted by finding the minimal perturbation required to change the classification of an input (Szegedy et al., 2013). Mathematically, this can be formulated as

x' = argmin_{x'} d(x, x')  subject to  f(x') = t,    (10)

where x is the original input, x' is the adversarial example, t is the target class, d is a distance metric, and f is the classifier.
This is very similar to the mathematical formulation used to generate CEs in Equation 8. In the literature, the distinguishing feature between the two fields is interpretability. While we want counterfactual examples to be interpretable, adversarial examples need not be. Our work focuses on this distinguishing feature; we design an algorithm that leverages uncertainty to generate interpretable CEs.
To evaluate CE generation algorithms, we measure the realism and minimality of the CEs generated. To measure minimality, we report the distance between the original input and the explanation. Realism is more difficult to measure because it is poorly defined. In the literature there are several approaches:
Dhurandhar et al. (2018) use subject experts to evaluate the CEs generated by their approach by hand. This provides ground-truth data on human interpretability. We choose not to use this approach because it is not automated, and so is difficult to perform at scale and not suitable for frequent evaluation when tuning hyperparameters.
Another approach is based on the concept of justification: a CE is justified if there is a path in input space between the CE and a point in the training set that does not cross the decision boundary between classes. The authors introduce an algorithm which evaluates what fraction of the CEs generated by an algorithm are justified. We choose not to use this approach because the algorithm does not scale to high-dimensional input spaces, as it relies on populating an epsilon ball around the explanation with points. Additionally, it is not clear if the notion of justification relates to the same definition of human interpretability as we use in this work.
Two metrics based on the reconstruction losses of autoencoders are

IM1(x') = ||x' − AE_t(x')||²₂ / ( ||x' − AE(x')||²₂ + ε ),    (11)
IM2(x') = ||AE_t(x') − AE(x')||²₂ / ( ||x'||₁ + ε ),    (12)

where AE_t is an autoencoder trained only on instances from class t, and AE is an autoencoder trained on instances from all classes. We can see that IM1 is the ratio of the reconstruction loss of the CE under an autoencoder trained on the counterfactual class to the loss under an autoencoder trained on all classes. IM2 is the normalized difference between the reconstruction of the CE under an autoencoder trained on the counterfactual class and one trained on all classes.
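Assuming the two autoencoders are available as callables, IM1 (Equation 11) reduces to a ratio of reconstruction errors. The stub "autoencoders" below are toy stand-ins used only to exercise the formula.

```python
import numpy as np

def im1(x_cf, ae_t, ae_all, eps=1e-8):
    """IM1: reconstruction error of the CE under the target-class autoencoder
    AE_t, divided by its error under the all-classes autoencoder AE (Eq. 11).
    Lower values indicate a more realistic counterfactual."""
    num = np.sum((x_cf - ae_t(x_cf)) ** 2)
    den = np.sum((x_cf - ae_all(x_cf)) ** 2) + eps
    return num / den

# Toy stand-ins: one "autoencoder" that reconstructs well (mild shrinkage)
# and one that reconstructs poorly (heavy shrinkage towards zero).
good_ae = lambda x: 0.95 * x
bad_ae = lambda x: 0.5 * x
x_cf = np.ones(4)

# A CE that the target-class AE reconstructs well scores below 1; the
# reverse situation scores above 1.
realistic_score = im1(x_cf, ae_t=good_ae, ae_all=bad_ae)
unrealistic_score = im1(x_cf, ae_t=bad_ae, ae_all=good_ae)
```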
We choose to evaluate the realism of the explanations generated by our method using IM1. We omit IM2 because it fails to pass a sanity check: we find that IM2 scores for out-of-distribution data (i.e., ‘junk’ data) are not significantly different from those for in-distribution data. See Appendix D for further details.
Appendix E gives full details of the configuration we use in each experiment.
We perform our analysis on three datasets: MNIST (LeCun et al., 2010), the Breast Cancer Wisconsin Diagnostic dataset (Dua and Graff, 2017), and the Boston Housing dataset (Dua and Graff, 2017). We choose MNIST as it is easy to visualize, which allows non-experts to evaluate the interpretability of the generated CEs. We consider the two tabular datasets because this type of data is frequently used in the interpretability literature.[2]

[2] We could not find one consistently used benchmark; a similar conclusion is drawn by Verma et al. (2020).
This dataset contains grayscale images of handwritten digits ranging between 0 and 9. The goal of the CE is to find a perturbation that changes the image classification from one digit to another. We consider MNIST, as image-based data allows us to visually inspect the quality of the CEs. Our classifiers obtain an accuracy of on the test set.
A tabular dataset where each row contains various measurements of a cell sample from a tumour, alongside a binary diagnosis of whether the tumour is benign or malignant. A CE for a particular input changes the classification from benign to malignant, or vice versa. Our classifiers obtain an accuracy of on the validation set.
A tabular dataset where each row contains statistics about a suburb of Boston, alongside the median house value. To construct a classification problem, we divide the dataset into suburbs where the price is below the median, and those where it is above. A CE for a particular input changes the classification from below the median to above, or vice versa. Our classifiers obtain an accuracy of on the validation set.
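The binarisation of the regression target described above can be sketched as follows; the price array is synthetic and the variable names are illustrative.

```python
import numpy as np

def binarise_by_median(prices):
    """Turn a continuous house-value target into a binary classification
    label: 1 if the value is above the dataset median, else 0."""
    median = np.median(prices)
    return (prices > median).astype(int)

# Synthetic house values (in $1000s, stand-ins for the real targets).
prices = np.array([15.0, 21.2, 34.7, 50.0, 10.4, 22.0])
labels = binarise_by_median(prices)
```

Splitting at the median yields a roughly balanced binary problem, which avoids having to tune for class imbalance.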
We benchmark the performance of our method against Van Looveren and Klaise (2019), a state-of-the-art approach for generating CEs. We also compare against JSMA, the adversarial attack from which we draw inspiration for our algorithm. JSMA provides a baseline for interpretability. We include JSMA for two reasons: (1) to determine whether we are able to generate more interpretable counterfactuals, and (2) to validate that JSMA can efficiently create minimal perturbations.
Below, we compare our method to Van Looveren and Klaise (2019) and JSMA, reporting both the realism of the CEs and the size of the perturbation. We find that our approach generates more realistic CEs than Van Looveren and Klaise (2019), despite not requiring an auxiliary generative model, as can be seen from the lower IM1 scores.
Table: Realism (IM1) and minimality scores (mean, with the standard deviation over seeds in brackets) for JSMA, VLK, and our method on MNIST, Breast Cancer Diagnosis, and Boston Housing. See Appendix E for details. VLK is the method introduced by Van Looveren and Klaise (2019). Note that the reported scores for VLK are based on our experiments and differ from those reported in their paper. We improve on their reported results for MNIST, but find worse performance for the Breast Cancer dataset. We discuss the steps we took to reproduce their results in Appendix E.2.

Comparing our method to JSMA, we note that our method generates larger perturbations but with better IM1 scores. This demonstrates that our adapted loss function is successfully trading off the size of the perturbation for the realism of the explanation, as desired. JSMA obtains the smallest perturbations, as it is an adversarial attack designed to generate minimal perturbations. We emphasise that we report the perturbation size of JSMA only to show that it can efficiently create minimal perturbations; it does not generate realistic explanations. This can be seen in Figure 3, which shows qualitative examples of the explanations generated by the three methods.
Figure 3 also shows one failure mode of our proposed algorithm: the strokes in the counterfactuals are less smooth than the strokes in real images. This is a consequence of the algorithm design, which changes single pixels iteratively. Our model does not capture stylistic properties, which would matter if we wanted to employ our method as a generative model. However, it does grasp the high-level changes required, such as adding a white stroke to turn one digit into another, and this is what matters for explanatory purposes. In Appendix F, we show more examples of CEs generated by our method on both MNIST and tabular data, and provide further insight into which features are altered.
Next, we perform an ablation study to investigate the effects of adversarial training, and of the number of models in the ensemble, on the quality of generated CEs for MNIST images. The results are shown in Figure 4. Initially, the interpretability of CEs tends to improve as the number of ensemble components increases; this can be seen from the initial downward slope of IM1 in the top graph of Figure 4. However, beyond a certain ensemble size the improvement saturates, likely because the uncertainty estimates do not improve further. Adversarial training improves the interpretability scores, but leads to less sparse explanations.
We have introduced a fast method for generating realistic, unambiguous, and minimal CEs. In the process, we collect, define, and discuss the properties that CEs should have. In contrast to existing methods, our algorithm does not rely on an auxiliary generative model, reducing the engineering overhead. Nevertheless, we demonstrate empirically that our approach matches or exceeds the performance of existing methods with respect to the realism of the generated CEs. In future work, the proposed method could be adapted to work for black-box models (Afrabandpey et al., 2020).
We would like to thank Mizu NishikawaToomey, Andrew Jesson, and Jan Brauner, as well as other members of OATML, for their feedback on the paper. Lisa Schut and Oscar Key acknowledge funding from Accenture Labs.
Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations.
What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5574–5584.
Pages 2801–2807. International Joint Conferences on Artificial Intelligence Organization.
Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1528–1540. ACM.
Robustness may be at odds with accuracy. In International Conference on Learning Representations.
Below, we prove Proposition 1.
We start by introducing the notation, definitions, and our assumptions.
Let p_y(x) denote the softmax probability assigned to class y for input x by the classifier f. Let X and Y denote the domains of x and y, respectively. We assume ; ; and .
For simplicity, we provide the derivation for a single input x' with target class t. However, the proof extends easily to multiple observations. When generating a CE targeted to class t, we minimise the cross-entropy loss, defined as
L_CE(x', t) = −log p_t(x')    (13)
We observe that L_CE obtains a minimum at any x' for which p_t(x') = 1. This is a unique minimum of the loss value because [1] the cross-entropy is bounded below by 0 and [2] it is monotonically decreasing in p_t(x').
Predictive entropy, H, is defined as
H(x') = −Σ_y p_y(x') log p_y(x')    (14)
If p_t(x') = 1, then H(x') = 0. This is a minimum because predictive entropy is also bounded below by 0.
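The predictive entropy above can be computed directly from the averaged softmax outputs of an ensemble; a minimal numpy sketch (function and argument names are ours, and the clipping constant is an assumed numerical safeguard):

```python
import numpy as np

def predictive_entropy(probs):
    """Predictive entropy H(x) = -sum_y p_y(x) log p_y(x), where p is the
    mean of the per-model softmax outputs of an ensemble.

    probs: array of shape (n_models, n_classes) of softmax probabilities.
    """
    p = probs.mean(axis=0)  # ensemble-averaged predictive distribution
    return float(-np.sum(p * np.log(np.clip(p, 1e-12, 1.0))))
```

A confident one-hot prediction gives zero entropy, while a uniform prediction over two classes gives log 2, matching the minimum and maximum of the two-class case.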
Recent work in the adversarial literature has linked adversarial robustness to improved model interpretability (Tsipras et al., 2019). Adversarial robustness can be achieved through adversarial training, which corresponds to minimising the loss:
min_θ E_(x,y) [ max_{δ∈Δ} L(f_θ(x + δ), y) ]    (15)
where θ are the model parameters, L is the loss function, x is the original input, y is the original class, δ is a perturbation, and Δ is the set of possible perturbations.
In practice, adversarial training is often implemented by augmenting the dataset with adversarial examples during training. These additional examples discourage the model from relying on noise when learning features for classification. Thus, the model is more likely to learn features that are not noise and are therefore more interpretable (Tsipras et al., 2019; Ilyas et al., 2019). This means that adversarial training can also be used to improve models outside the adversarial-robustness setting.
Further, Chalasani et al. (2020) show a connection between feature-attribution explanations and adversarial training, finding that it leads to more sparse and stable explanations from [1] an empirical perspective for image data and [2] a theoretical perspective for single-layer models.
Lastly, adversarial training can improve uncertainty estimation. Lakshminarayanan et al. (2017) show that adversarial training is a computationally efficient way of smoothing the predictive distribution, and that it can improve the accuracy and calibration of classifiers in practice.
The aforementioned work motivates the use of adversarial training to generate more interpretable, stable explanations.
We omit the IM2 evaluation as we found that it fails a sanity check comparing the metric on in- and out-of-distribution images. We perform the check using MNIST, as this makes it easy to verify the results visually.
We use MNIST as in-distribution data and EMNIST as out-of-distribution data, meaning the autoencoders are trained on MNIST images. For the 'in-distribution' CEs, we take MNIST training images of the target class; these can be considered 'gold standard' CEs. For the 'out-of-distribution' CEs, we select random images from EMNIST. We expect a clear difference in the IM2 scores for the in- and out-of-distribution data, as we are comparing gold-standard CEs with random images from a different dataset.
However, the right-hand plot of Figure 5 shows that the IM2 scores for the out-of-distribution data did not differ significantly from those of the in-distribution data. By contrast, for IM1 we find a significant difference (measured using a t-test) between the in- and out-of-distribution scores. Visually, the difference can be observed in the left boxplot in Figure 5.

Below, we summarise the setup for the different experiments. All experiments are implemented in PyTorch. The code repository contains instructions for reproducing each result and generating example CEs, alongside details of the environment setup, including the versions of dependencies.
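The significance check used for the IM1/IM2 comparison above can be sketched with a Welch two-sample t statistic; this is a simplified large-sample stand-in for a full t-test, and the function names and normal-approximation critical value are our assumptions:

```python
import numpy as np

def welch_t_statistic(a, b):
    """Welch's two-sample t statistic for comparing metric scores of
    in- vs out-of-distribution counterfactuals."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    var_a = a.var(ddof=1) / len(a)  # per-sample variance of the mean
    var_b = b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(var_a + var_b)

def significantly_different(a, b, critical=1.96):
    """Large-sample check at roughly the 5% level (normal approximation)."""
    return abs(welch_t_statistic(a, b)) > critical
```

A metric that passes the sanity check should report a significant difference between gold-standard and random out-of-distribution scores; IM2 failed this check while IM1 passed it.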
For our method and JSMA we normalize the inputs to . For Van Looveren and Klaise (2019) we normalize to , as suggested. We train the classifier on the training set, and generate CEs on the test set. We tune the hyperparameters of our method by generating CEs on the training set. We did not tune the hyperparameters of Van Looveren and Klaise (2019), as we could select them from the original paper.
For each point in the test set, we select the target class for the explanation at random from a set of candidate classes specific to the class of the input image. This allows us to avoid target classes that are particularly difficult for a given input class; some digit pairs are much harder to transform between than others. For a complete list of the possible target classes for each input class, see the code release.
We use a three-layer MLP. The first two layers have ReLU activations followed by batch normalization; the final layer has a softmax activation. We use an ensemble of models.

We implement adversarial training as follows: in each iteration, we perform a training step on a (clean) batch of data followed by a training step on an adversarially augmented batch of data. To create the latter, we use FGSM (Goodfellow et al., 2015) to generate adversarial examples. We choose a perturbation size that is large enough to fool a trained classifier much of the time (and therefore improves robustness) but does not change the true classification.
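As an illustration of the FGSM step described above, the following sketch computes x_adv = clip(x + eps * sign(dL/dx)) for a linear softmax classifier, where the cross-entropy gradient has the closed form W^T (p − onehot(y)). The actual experiments use a neural network, so this is a simplified stand-in, and all names are ours:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fgsm_example(W, b, x, y, eps):
    """FGSM (Goodfellow et al., 2015) for a linear softmax classifier.

    For cross-entropy loss on logits z = Wx + b, the input gradient is
    dL/dx = W^T (p - onehot(y)); the attack moves x by eps in the sign of
    that gradient and clips back to the [0, 1] input range.
    """
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_x = W.T @ (p - onehot)
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)
```

In the training loop, each clean batch would be paired with its FGSM-perturbed copy before the second gradient step.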
We fix the maximum number of permitted changes and the maximum number of iterations. The confidence level is set to a fixed threshold, and the perturbation size (the per-step change in the pseudocode) is chosen so that each feature can be changed a fixed number of times.
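The generation procedure these hyperparameters control can be sketched as a greedy loop: repeatedly change the single feature that most increases the target-class probability, stopping once the confidence threshold is met. This sketch uses a finite-difference search rather than the paper's gradient-based saliency rule, and all names are our assumptions:

```python
import numpy as np

def generate_ce(predict_proba, x, target, delta, tau, max_iters):
    """Greedy, JSMA-style counterfactual search (illustrative sketch).

    At each iteration, try perturbing every feature by +/-delta, apply the
    single change that most increases the classifier's target-class
    probability, and stop once that probability exceeds tau.
    """
    x = x.astype(float).copy()
    for _ in range(max_iters):
        p = predict_proba(x)[target]
        if p >= tau:  # confidence threshold reached
            break
        best_gain, best_idx, best_step = 0.0, None, 0.0
        for i in range(x.size):
            for step in (delta, -delta):
                x_try = x.copy()
                x_try[i] = np.clip(x_try[i] + step, 0.0, 1.0)
                gain = predict_proba(x_try)[target] - p
                if gain > best_gain:
                    best_gain, best_idx, best_step = gain, i, step
        if best_idx is None:  # no single change helps; give up
            break
        x[best_idx] = np.clip(x[best_idx] + best_step, 0.0, 1.0)
    return x
```

In the paper's method, the probability being increased is the ensemble's mean target-class probability, so the loop implicitly drives down both epistemic and aleatoric uncertainty.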
We implement the evaluation following Van Looveren and Klaise (2019), and include the configuration here for completeness. Section E.1 shows the architectures of the all-class and single-class autoencoders. We train both autoencoders using the mean-squared-error objective and Adam; the number of training epochs differs between the all-class and single-class autoencoders.
All Class  

Layer  Parameters 
Encoder  
2D Convolution  
ReLU activation  
2D Convolution  
ReLU activation  
Maxpool 2D  
2D Convolution  
Decoder  
2D Convolution  
ReLU activation  
Upsample  , mode = “nearest” 
2D Convolution  
ReLU activation  
2D Convolution 
Single Class  

Layer  Parameters 
Encoder  
2D Convolution  
ReLU activation  
2D Convolution  
ReLU activation  
Maxpool 2D  
2D Convolution  
Decoder  
2D Convolution  
ReLU activation  
Upsample  , mode = “nearest” 
2D Convolution  
ReLU activation  
2D Convolution 
The results in Section 5.3 are aggregated as follows:
For each method and dataset pair, we choose points from the test set.
For each random seed:
generate a CE for each of the test points;
compute the mean IM1 score and distance.
Finally, compute the mean of the per-seed means and the standard deviation of the per-seed means.
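The aggregation above can be written in a few lines of numpy (the array layout and function name are our assumptions):

```python
import numpy as np

def aggregate_over_seeds(scores_per_seed):
    """Aggregate per-CE scores over random seeds.

    scores_per_seed: array of shape (n_seeds, n_test_points), e.g. IM1
    scores. Returns the mean of the per-seed means and the standard
    deviation of the per-seed means, as reported in the results table.
    """
    per_seed_means = np.asarray(scores_per_seed, float).mean(axis=1)
    return per_seed_means.mean(), per_seed_means.std()
```

Reporting the standard deviation across seed means (rather than across individual CEs) measures the run-to-run variability of the method.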
We use the implementation released by the authors in the ALIBI library (Klaise et al., 2020), largely in its default configuration as given in the documentation. We use the same classifier architecture as above. As the classifier is a white-box model, we follow the ALIBI documentation and use the loss function from Van Looveren and Klaise (2019); the hyperparameter configuration follows the original paper. As in Van Looveren and Klaise (2019), we use the encoder to find the class prototypes, and we follow the ALIBI documentation for the remaining settings.
We generally use the same setup for both the Wisconsin Breast Cancer and the Boston Housing datasets, except where we clearly indicate differences below.
For our method and JSMA we normalize the inputs; for Van Looveren and Klaise (2019) we standardize them to zero mean and unit standard deviation, as suggested. We randomly split each dataset into train, validation, and test sets. We train the classifier on the training set, tune hyperparameters on the validation set, and generate CEs on the test set.
We use a three-layer MLP. The first two layers have ReLU activations followed by batch normalization; the final layer has a softmax activation. We use an ensemble of models.
We follow the same setup as for MNIST.
We perform adversarial training similarly to MNIST. Unlike MNIST, each feature has a different scale and distribution, so the perturbation sizes are feature-specific. A list of perturbation sizes can be found in the configuration file in demo_data, after the word perturbation.
We fix the maximum number of permitted changes and the maximum number of iterations, and set the confidence level to a fixed threshold. The perturbation sizes (the per-step changes in the pseudocode) are feature-specific and can be found in the file breast_cancer_config.txt in the code.
For the Wisconsin Breast Cancer dataset we follow Van Looveren and Klaise (2019), and include the configuration here for completeness. We use the same procedure for the Boston Housing dataset. We use the same autoencoder architecture for both the all-class and single-class autoencoders; Table 3 shows the architecture. We train both autoencoders using the mean-squared-error objective and Adam.
Layer  Parameters 

Encoder  
Linear  
ReLU activation  
Linear  
ReLU activation  
Linear  
Decoder  
Linear  
ReLU activation  
Linear  
ReLU activation  
Linear 
The same as for MNIST.
We use the implementation released by the authors in the ALIBI library (Klaise et al., 2020), largely in its default configuration as given in the documentation. We use the same classifier architecture as above. As the classifier is a white-box model, we follow the ALIBI documentation and use the loss function from Van Looveren and Klaise (2019) for both datasets. For both datasets we use the kd-tree approach to find the prototypes. For the Wisconsin Breast Cancer dataset, the hyperparameter configuration follows the original paper; we set one hyperparameter by examining the grid search shown in Van Looveren and Klaise (2019, Figure 7), and another to its default value in ALIBI. For the Boston Housing dataset we use the same configuration, except for two hyperparameters for which we perform a grid search similar to that run by Van Looveren and Klaise (2019, Appendix A) for the Breast Cancer dataset.
We note that the IM1 scores we report for Van Looveren and Klaise (2019) are worse than those in their paper, which reports an IM1 score that is not significantly different from ours. However, the results are not directly comparable because our test set is a different size from the one used in the original paper. To ensure our results are correct, we took the following steps:
Used the implementation of the generation algorithm provided by the authors
Ensured inputs to the generation algorithm are standardized as in the original paper
Ensured we use the same hyperparameters as in the original paper
Visually examined the CEs generated by the method to ensure they were reasonable
We also repeated the experiment using a test set of the same size as in the original paper, but this did not reproduce their results either.
Calibration is important as we assume that our classifier outputs an estimate of p(y|x). We investigate the effect of the number of components in the ensemble by performing an experiment similar to Lakshminarayanan et al. (2017, Figure 6), considering accuracy as a function of confidence. We train our network (same configuration as in Appendix E) on MNIST. We test the network on a dataset formed by combining MNIST and FashionMNIST, where we increase the proportion of the dataset taken from FashionMNIST until the confidence of the classifier falls to the desired level. Ideally, the network will have low confidence for incorrect predictions. Figure 6 shows the performance of a single classifier and of ensembles of increasing size. We can see that the ensembles perform better than a single classifier, as their accuracy is higher.
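The accuracy-as-a-function-of-confidence curve described above can be computed as follows; this is a minimal sketch with names of our choosing:

```python
import numpy as np

def accuracy_vs_confidence(probs, labels, thresholds):
    """Accuracy over examples whose max softmax probability meets each
    confidence threshold (as in Lakshminarayanan et al., 2017).

    probs: (n, n_classes) mean ensemble probabilities; labels: (n,) ints.
    Returns one accuracy per threshold (NaN if no example qualifies).
    """
    conf = probs.max(axis=1)   # confidence = max class probability
    pred = probs.argmax(axis=1)
    accs = []
    for tau in thresholds:
        keep = conf >= tau
        if keep.any():
            accs.append(float((pred[keep] == labels[keep]).mean()))
        else:
            accs.append(float("nan"))
    return accs
```

A well-calibrated classifier should show accuracy rising as the confidence threshold increases, because its mistakes concentrate among low-confidence predictions.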
Our method is reliant on the quality of the uncertainty estimates offered by the classifier, so we investigate the effect of ensembling and adversarial training on the uncertainty estimates. We measure epistemic and aleatoric uncertainty using predictive entropy. Ideally, we observe high predictive uncertainty for out-of-distribution data and low predictive uncertainty for inputs similar to the training data.
We consider test samples from the Breast Cancer dataset to be in-distribution, and tabular data sampled from a normal distribution to be out-of-distribution. Figure 7 shows the effect of adversarial training and ensembling on predictive uncertainty. While a single model cannot distinguish between in- and out-of-distribution inputs, with ensembling and adversarial training we see a clear separation of the predictive entropy of the two sets of inputs. This goes some way towards explaining why performance improves in the ablation study as we increase the number of models in the ensemble and include adversarial training.

Figure 8 shows more qualitative examples of counterfactuals generated by our algorithm on MNIST. The first column shows the original class; the second column shows the counterfactual; the third column shows the proposed change. In the third image in each tuple, black denotes "painting the pixels black in the original image", white denotes "painting the pixels white in the original image", and gray denotes "no change". For example, consider the first row in Figure 8. The goal is to change one digit into another. Our model proposes:
adding an additional stroke in the middle (shown by the white pixels in the third image), and
removing some pixels from the left and right parts of the digit (shown by the black pixels in the third image).
Both changes are required to create a realistic example of the target digit. We observe that, in general, our model grasps the high-level changes that are required, and these suffice for explanatory purposes. However, our model does not capture stylistic properties, which can be seen from the examples in the third row, left side, of Figure 8. Further work is required if we want to employ our method as a generative model. An example of a "failure case" is on the left side of the last row; this example is particularly challenging for our model. However, at a high level we can see that the algorithm roughly understands the required changes.
Figure 9 shows more qualitative examples of counterfactuals generated by our algorithm on FashionMNIST. Again, we observe that our model is able to grasp the high-level changes required, such as adding a split to the dress or a zipper to the coat.
The importance of a feature can be roughly estimated by the average absolute change applied to it across observations. Using this, we determine which features are most frequently changed. In Figure 11, we show examples for MNIST; here, the features are the pixels that are changed.
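This rough importance estimate can be computed from pairs of originals and counterfactuals; a minimal numpy sketch (function name, tolerance, and array layout are our assumptions):

```python
import numpy as np

def feature_change_frequency(originals, counterfactuals, tol=1e-6):
    """Estimate feature importance from (input, CE) pairs.

    Returns, per feature: the fraction of CEs that change it, and the
    average absolute perturbation applied to it.
    Both inputs have shape (n_examples, n_features).
    """
    diff = np.abs(np.asarray(counterfactuals, float) - np.asarray(originals, float))
    freq = (diff > tol).mean(axis=0)   # how often each feature is changed
    avg_change = diff.mean(axis=0)     # average perturbation size per feature
    return freq, avg_change
```

For MNIST the features are pixels, so averaging the signed changes instead produces the average-change images shown in Figure 11.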
For each triple, the left image shows the average input image from the training dataset; this represents a prototype of the MNIST digit. The blurriness is caused by the natural variation of the class within the dataset. The second image in each triple is the average CE generated for the target class shown above the image. The third image shows the average pixel change. In the third image, black denotes "painting the pixels black in the original image", white denotes "painting the pixels white in the original image", and gray denotes "no change".
Overall, the proposed changes appear to be realistic. Let us consider a specific example: a counterfactual explanation that changes the predicted class from 0 to 6, as shown in the first row of Figure 11. In Figure 10, we highlight the key changes that can be read from Figure 11. The red circles show the parts that have been removed from the original input (left image) to create the counterfactual (middle image): to change a zero into a 6, we need to paint some pixels black at the top right, and these pixels are shown in black in the average-change plot (right image). The green circles show the parts that have been added to the original input to create the counterfactual: to change a zero into a 6, we also need to add a white diagonal stroke, and these pixels are shown in white in the average-change plot.
Figure 13 shows the most frequently changed cell nuclei properties and the average perturbation size per changed feature. Further interpretation of these results requires expert knowledge, which we leave to future work.