1 Introduction
Much of the recent staggering success of machine learning is due to large and complex models whose inner workings are in many cases elusive. The prime examples of such ‘black box’ models are deep neural networks. Despite the drawbacks that come with a poor understanding of their inner workings, such models are already widely deployed in practice due to their state-of-the-art performance in many areas. But this poor understanding can make it difficult to interpret the decision-making process of such models, and thus to identify the problem when a model makes an error.
To explain erroneous classifications, it can therefore be helpful to investigate the relevant decision boundary. An intuitive way to do so is to look for alternative inputs that are similar but classified differently; this is the realm of counterfactual explanations. For debugging, we usually want to generate realistic counterfactuals that are just across the model’s classification decision boundary (fig. 1). We want a counterfactual that stays as close to the original input as possible but results in a correct classification, in order to understand the system’s decision making. At the same time, the counterfactual should stay on the manifold of realistic images (CF2 in fig. 1) since we are interested in inputs that can be interpreted. As this is a nascent area of research, there is so far no algorithm that provides such counterfactual explanations for large models at low cost. In this work, we introduce a fast novel algorithm that uses specific properties of residual networks (ResNets), a widely used class of models, to find counterfactuals which are very close to the original input.
The contributions presented in this work are the following:

We introduce and implement a new, scalable algorithm for generating counterfactual explanations in residual networks trained with spectral normalisation. In comparison to previous approaches, this algorithm has a low engineering overhead as well as low computational requirements; it does not rely on auxiliary generative models and can run on single forward pass residual networks rather than on ensembles of models.

We propose and implement a way to assess realism in counterfactual explanations, drawing on the literature on anomaly detection.

We evaluate the novel algorithm on the MNIST dataset and compare its performance against three baselines from the literature.
2 Background
2.1 Counterfactual explanations
Counterfactual explanations for machine learning models were introduced in [Wachter et al., 2017]. (Footnote 1: A similar notion of counterfactuals in ML has also been explored by [Kusner et al., 2017] in the context of algorithmic fairness rather than explainability.) They are used as post-hoc explanations for individual decisions. In general, a counterfactual explanation is understood as the presentation of an alternative input $x'$, the counterfactual, which is in some way similar to the original $x$ yet leads to a different prediction. Due to their similarity (both conceptually and w.r.t. the algorithms that generate them), counterfactuals are often introduced in juxtaposition with adversarial examples (AEs). (Footnote 2: This is only partly helpful since there is also no consensus on the definition of AEs. While scholars generally agree that AEs are necessarily misclassified, there is no consensus on whether they need to be generated by imperceptible perturbations. Another, less operational definition could be that they are not detected to be outside the distribution of realistic datapoints, although the notion of ‘misclassification’ could already be interpreted as requiring the datapoint to be in-distribution, as it implies that there is a correct classification.) Freiesleben [2021] provides an illuminating discussion of various definitions in the literature and comes to the conclusion that while AEs are necessarily misclassified, counterfactuals need not be (and often should not be). Furthermore, in agreement with [Molnar, 2020], Freiesleben defines counterfactuals as the closest alternative input (on some suitable metric) that changes the prediction (to a predefined target, if applicable). This seems to be too restrictive in general, as there are applications that do not require such a strong focus on proximity. Furthermore, the definition then depends too much on the choice of the similarity metric.
Consequently, we will generally call an alternative input $x'$ a counterfactual to an input $x$ under model $f$ if it is similar to $x$ under a suitable similarity relation but changes the prediction of the model. (Footnote 3: One might drop the requirement of changing the prediction for applications beyond explanations, e.g. for assessing fairness [Kusner et al., 2017].)
Counterfactual explanations can be useful for debugging: They can be used to answer questions like ‘Why did the self-driving car misidentify the fire hydrant as a stop sign?’ [Goyal et al., 2019]. More precisely, they can answer the questions ‘What would the image need to look like in order to be classified as a fire hydrant?’ and ‘What changes would need to happen in other fire hydrant (stop sign) images in order to be classified as a stop sign (fire hydrant)?’. Depending on the model and the application, we might only be interested in counterfactuals that look realistic. In particular, we learn more about the decision boundary when we understand the changes made to the image. If, on the other hand, the classification is changed by adding unrealistic noise, then we only learn that the classifier is not robust to this. In terms of Lipton [2018], such counterfactuals are less informative. Counterfactual explanations are often required to be realistic [Schut et al., 2021], likely [Molnar, 2020], or plausible [Karimi et al., 2021]. Perhaps the most useful way to formulate it is to require counterfactuals to be ‘likely states in the (empirical) distribution of features’ which are ‘close to the data manifold’ [Karimi et al., 2021]. Another often-mentioned desideratum for counterfactual explanations is sparsity: in general, the fewer features are changed in input space, the better. Sparse perturbations are usually more interpretable, as the change in classification can be attributed to a smaller part of the input, which is easier to grasp both in terms of the model behaviour and in terms of the input itself. If all parts of the input change a little, it might be harder to understand what the perturbation means and how it affects the model.
2.2 Epistemic uncertainty
Previous work has shown that epistemic uncertainty can be used as a proxy for realism Smith and Gal [2018], Schut et al. [2021]. There are two kinds of uncertainty relevant to machine learning models. Aleatoric uncertainty, on the one hand, is inherent in the data distribution and cannot be reduced. It is high when there is no clear ground-truth label and maximised in the extreme case of random labels. Epistemic uncertainty, on the other hand, is due to a lack of knowledge on the part of the model. It is a quantity that can be reduced for a given input by including it in the training set. Given this characterisation, epistemic uncertainty estimates are used for active learning (selecting samples that are particularly useful to train on) as well as for detecting out-of-distribution inputs.
It has also been observed that neuron activations in late layers of a neural network, which we call features, can be used to estimate epistemic uncertainty when two properties called sensitivity and smoothness are satisfied by the mapping into the feature space Van Amersfoort et al. [2020]. Sensitivity can be seen as a lower Lipschitz bound, ensuring that the features remain sensitive to differences between inputs, thereby preventing ‘feature collapse’: Otherwise, out-of-distribution (OoD) inputs might not be distinguishable from in-distribution (iD) inputs, as they could be mapped to the same area in feature space. Conversely, smoothness can be seen as an upper Lipschitz bound: Similar inputs are guaranteed not to be too far from each other in feature space, such that distances in this space remain meaningful. Building on Bartlett et al. [2018], Liu et al. [2020] show that applying spectral normalisation (SN, Miyato et al. [2018]) with a suitable coefficient to ResNets He et al. [2016] is enough to enforce both sensitivity and smoothness. Using such models, Mukhoti et al. [2021] fit a probability distribution to the feature space after the last ResNet block, using the feature representations of the training data. To estimate the epistemic uncertainty of an input, they then calculate the negative log-likelihood of its feature representation under the learned distribution. For the probability distribution, they use a Gaussian mixture model (GMM). We will make use of this approach, called Deep Deterministic Uncertainty (DDU), in the algorithm we propose below.
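The feature-density idea described above can be illustrated with a minimal sketch. It assumes one Gaussian per class with a diagonal covariance, which is a simplification chosen here to keep the example dependency-free and is not necessarily how DDU fits its GMM; the function names `fit_class_gaussians` and `epistemic_uncertainty` are hypothetical.

```python
import math
from collections import defaultdict


def fit_class_gaussians(features, labels, eps=1e-6):
    """Fit one diagonal Gaussian per class to feature vectors (simplifying assumption)."""
    by_class = defaultdict(list)
    for z, y in zip(features, labels):
        by_class[y].append(z)
    params = {}
    for y, zs in by_class.items():
        dim = len(zs[0])
        mean = [sum(z[d] for z in zs) / len(zs) for d in range(dim)]
        # eps keeps the variance strictly positive for degenerate dimensions
        var = [sum((z[d] - mean[d]) ** 2 for z in zs) / len(zs) + eps
               for d in range(dim)]
        params[y] = {"weight": len(zs) / len(features), "mean": mean, "var": var}
    return params


def class_log_density(z, p):
    """Log-density of feature vector z under one class-wise diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (zi - m) ** 2 / v)
               for zi, m, v in zip(z, p["mean"], p["var"]))


def epistemic_uncertainty(z, params):
    """Negative log-likelihood of z under the mixture of class-wise Gaussians."""
    terms = [math.log(p["weight"]) + class_log_density(z, p)
             for p in params.values()]
    top = max(terms)  # log-sum-exp for numerical stability
    return -(top + math.log(sum(math.exp(t - top) for t in terms)))


# Toy two-dimensional "features" for two classes (made-up numbers).
feats = [[0.0, 0.1], [0.1, -0.1], [-0.1, 0.0], [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
params = fit_class_gaussians(feats, labels)
near = epistemic_uncertainty([0.0, 0.0], params)   # close to class 0
far = epistemic_uncertainty([2.5, 2.5], params)    # between the clusters
```

In this toy setting, a point between the two clusters receives a much higher negative log-likelihood than a point inside a cluster, which is the behaviour that makes the quantity usable as an uncertainty estimate.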
2.3 Related work
In this section, we provide a (non-exhaustive) survey of algorithms that provide counterfactual explanations. In the initial paper, Wachter et al. [2017] propose to simply optimise the objective
$$\mathcal{L}(x' \mid x) = \lambda \, (f(x') - y')^2 + d(x, x') \tag{1}$$
where $f$ denotes the model, $y'$ the desired model prediction, $d$ a predefined distance metric, and $\lambda$ is a hyperparameter. They suggest iterating through increasing values of $\lambda$, always solving for $x'$ for fixed $\lambda$, until a counterfactual sufficiently close to the original input is found. This is quite a minimal approach, which comes at the cost of the counterfactual not being constrained to lie on the data manifold (which might be required, depending on the task).
Van Looveren and Klaise [2020] build on this (and on Dhurandhar et al. [2018]), but focus on generating more interpretable counterfactuals by optimising a more complex objective function. Their prototype-guided approach minimises the loss
$$\mathcal{L} = c \cdot \mathcal{L}_{\text{pred}} + \beta \cdot \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_{\text{AE}} + \mathcal{L}_{\text{proto}} \tag{2}$$
Here, $\mathcal{L}_{\text{pred}}$ is based on the softmax output of the classifier on the counterfactual $x'$ for the original class $t_0$, i.e. its confidence that $x'$ belongs to class $t_0$. $\mathcal{L}_1$ and $\mathcal{L}_2$ refer to the corresponding distances between $x$ and $x'$ in input space. $\mathcal{L}_{\text{AE}} = \gamma \cdot \lVert x' - \text{AE}(x') \rVert_2^2$, where $\text{AE}$ is an autoencoder trained on the training data. Lastly, $\mathcal{L}_{\text{proto}} = \theta \cdot \lVert \text{ENC}(x') - \text{proto}_j \rVert_2^2$, where $\text{proto}_j$ is the latent prototype defined by the $k$ nearest neighbours of $x$ in target class $j$ and $\text{ENC}$ is an encoder. For untargeted counterfactuals, $j$ is chosen as $\arg\min_{i \neq t_0} \lVert \text{ENC}(x) - \text{proto}_i \rVert_2$. While $\mathcal{L}_{\text{pred}}$ enforces a change of classification, $\mathcal{L}_{\text{AE}}$ and $\mathcal{L}_{\text{proto}}$ are included to encourage realism, measured by a low reconstruction loss under the autoencoder and similarity to similar training samples in the encoded latent space. The additional use of an autoencoder and the additional loss terms generate a computational overhead that slows down the approach. Furthermore, the approach comes with many hyperparameters which might require a lot of tuning when applying it to a new task. The interplay of the different loss terms is not straightforward to analyse either formally or conceptually, so it is hard to predict and improve the performance on new applications.
A very different route to provide interpretable counterfactuals is taken by Schut et al. [2021], based on the notion of uncertainty [Gal, 2016]. They suggest that counterfactuals are realistic if the model classifies them with low epistemic uncertainty and unambiguous when the model has low aleatoric uncertainty. Consequently, they propose to minimise overall uncertainty in models that provide accurate uncertainty estimates through their softmax output, such as deep ensembles [Lakshminarayanan et al., 2017]. They show that maximising the ensemble’s target class prediction is sufficient to minimise its predictive entropy and hence both epistemic and aleatoric uncertainty. Rather than using an off-the-shelf optimisation algorithm as the two previously mentioned approaches do, Schut and colleagues compute the gradient of the classification loss in input space and identify the most salient pixel for reducing this loss. Similar to JSMA [Papernot et al., 2016], they iteratively change the most salient pixel until the target prediction is above 99%, at which point the uncertainty is sufficiently low. This works well in practice, but using an ensemble of models (for their MNIST experiments they use 50 models) is computationally expensive, which might hinder its deployment in practice.
Another approach that has been suggested both for providing counterfactual explanations and for algorithmic recourse is REVISE [Joshi et al., 2019]. This algorithm requires the availability of a generative model, such as the decoder of a VAE trained on the training data. Similar to [Wachter et al., 2017], the overall idea is to minimise the function
$$\mathcal{L}(z) = \ell(f(G(z)), y') + \lambda \, \lVert G(z) - x \rVert_1 \tag{3}$$
where $f$ is the classifier, $y'$ is the target, $\ell$ is some loss function, and $G$ is the generative model. To find a $z$ that minimises the loss, $z$ is initialised to the encoding of the original input $x$; then the gradient of the loss in the latent space is computed and the algorithm iteratively takes small steps in latent space until the prediction changes to the target. Since the resulting counterfactual $G(z)$ is produced by the generative model, it can be dissimilar to $x$: Although the $L_1$ norm is known to encourage sparsity, the algorithm cannot be expected to provide sparse solutions, as the changes are not taken in the input space. Another disadvantage of using a generative model is, of course, the need to train it beforehand, which can pose difficulties of varying degree depending on the data domain. Several recent works utilise GANs for generating counterfactuals, such as Kenny and Keane [2021].
3 Novel algorithm: DeDUCE
For large image datasets, ResNets are often the model of choice. When ResNets are trained with spectral normalisation, we can use DDU (section 2.2) to estimate their epistemic uncertainty. DDU achieves state-of-the-art results on OoD detection tasks such as MNIST vs. FashionMNIST. The authors also demonstrate that it can be used for active learning; this implies that it is particularly useful to train on inputs whose representations have low likelihood under the GMM, i.e. inputs that are substantially different from the previous training data. This suggests that DDU’s measure of epistemic uncertainty can be a useful target when aiming to generate counterfactuals that are similar to the original training data. The idea of the novel algorithm presented here is to cross the model’s decision boundary while keeping the epistemic uncertainty as low as possible, using DDU. Instead of maximising the whole GMM density to minimise epistemic uncertainty (i.e. maximising feature-space density), we propose to only maximise the target class density (fig. 2). Otherwise, one would also maximise the original label’s class density, which could lead to unstable behaviour. As we also want to cross the decision boundary quickly, we suggest changing the pixels that are most salient for the gradient of the loss
$$\mathcal{L}(x) = \ell_c(f(x), t) - \lambda \log p_t(z(x)) \tag{4}$$
where $f$ is the model, $t$ the target class, $\ell_c$ the cross-entropy, $p_t$ the target class density, and $z$ the feature extractor, i.e. the part of the model that maps to the feature space after the last ResNet block. This means we are trying to change the input in a way that quickly changes the classification (first term) and makes it more similar to the target class in feature space (second term). The two terms typically take values on very different scales. Instead of changing the pixels that minimise the weighted loss for some $\lambda$, we select the pixels which make the largest relative difference to either of the two loss terms. Therefore, we use the alternative gradient
$$\tilde{\nabla}(x) = \frac{\nabla_x \ell_c(f(x), t)}{\ell_c(f(x), t)} - \frac{\nabla_x \log p_t(z(x))}{\lvert \log p_t(z(x)) \rvert} \tag{5}$$
instead of the gradient of the weighted loss given in equation (4). Note that the cross-entropy loss is nonnegative whereas $\log p_t$ can be positive or negative, depending on whether the density is above or below one. Despite their similarity, including both terms in the objective indeed improves the quality of generated counterfactuals; we also find that using the alternative gradient without further weighting ($\lambda = 1$) works better than the gradient of the loss for any value of $\lambda$ (see appendix A).
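To make the idea of a relative gradient concrete, the following sketch divides each term's gradient by the magnitude of that term before combining them, so that neither term dominates merely because of its scale. The sign convention (a loss to be decreased, with the log-density entering negatively), the function name `relative_gradient`, and all numbers are assumptions made for illustration.

```python
def relative_gradient(ce, grad_ce, log_p, grad_log_p, tiny=1e-12):
    """Combine two gradients after normalising each by the size of its own term.

    ce, log_p: scalar values of the cross-entropy and the target-class log-density.
    grad_ce, grad_log_p: their gradients w.r.t. the input, as flat lists.
    Returns the gradient of a loss to be decreased (assumed sign convention).
    """
    ce_scale = max(abs(ce), tiny)      # guard against division by zero
    lp_scale = max(abs(log_p), tiny)
    return [gc / ce_scale - gl / lp_scale
            for gc, gl in zip(grad_ce, grad_log_p)]


def weighted_gradient(grad_ce, grad_log_p, lam=1.0):
    """Plain gradient of the weighted loss, shown for comparison."""
    return [gc - lam * gl for gc, gl in zip(grad_ce, grad_log_p)]


# Made-up values: the density term lives on a much larger scale.
ce, grad_ce = 2.0, [0.4, -0.2]
log_p, grad_log_p = -500.0, [50.0, 0.0]

rel = relative_gradient(ce, grad_ce, log_p, grad_log_p)
raw = weighted_gradient(grad_ce, grad_log_p)
```

With these numbers the plain weighted gradient is dominated by the density term in the first coordinate, whereas the relative version gives both coordinates the same magnitude, so the pixel that mainly affects the classification term can still be selected.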
The novel approach that we call DeDUCE (Deep Deterministic Uncertainty-based Counterfactual Explanations) is described in algorithm 1: In order to only make small and sparse changes to the original input, DeDUCE iteratively perturbs only a few pixels at a time. At each iteration, it determines the most salient pixels for maximising the objective, by computing the gradient in input space, and then perturbs them by a fixed step size. This is similar to how the Jacobian-based Saliency Map Attack (JSMA) [Papernot et al., 2016] generates adversarial examples and how the approach of [Schut et al., 2021] generates counterfactuals. Thereby, DeDUCE iteratively perturbs the input in small steps in a way that makes it more and more similar to the target class until it crosses the decision boundary. The algorithm stops when the softmax output for the target class is above 50%, as this corresponds to the model choosing ‘in target class’ over ‘not in target class’. Following [Schut et al., 2021], DeDUCE limits the number of times each pixel can be updated (Footnote 4: This is achieved by counting the number of updates per pixel and applying a mask to the gradient that sets the entries of pixels that have reached the update limit to 0. The expression in line 7 of algorithm 1 denotes the procedure of first applying this mask and then selecting the positions of the largest values of the masked gradient.) and clips the perturbed input to the input domain bounds. Similar to some work on adversarial examples Dong et al. [2018], we also add momentum to the gradient, replacing the gradient $g(x_k)$ used at iteration $k$ by
$$\bar{g}_k = m \cdot \bar{g}_{k-1} + g(x_k) \tag{6}$$
with $x_k$ referring to the state of the input after $k$ iterations and $m$ denoting the momentum. For the experiments reported below, we change one pixel at a time and use a momentum of 0.6. Adding momentum of this size often does not make a difference; in our experiments, it only affected around 1.3% of the generated counterfactuals.
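The iterative procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the toy logistic model, the step size, the per-pixel update limit, and the extra rule that skips pixels already at a domain bound are all choices made here, and the names `deduce_style_search`, `toy_grad` and `toy_target_prob` are hypothetical. The momentum of 0.6, the single pixel per iteration, the 50% stopping rule and the limit of 700 iterations follow the text.

```python
import math


def deduce_style_search(x, grad_fn, target_prob_fn, step=0.5, n_pixels=1,
                        momentum=0.6, max_updates=3, max_iters=700,
                        lower=0.0, upper=1.0):
    """Greedy pixel-wise search; grad_fn returns the direction that should increase the objective."""
    x = list(x)
    counts = [0] * len(x)        # number of updates per pixel
    velocity = [0.0] * len(x)    # gradient with momentum
    for _ in range(max_iters):
        if target_prob_fn(x) > 0.5:          # decision boundary crossed
            return x
        velocity = [momentum * v + g for v, g in zip(velocity, grad_fn(x))]
        masked = []
        for i, v in enumerate(velocity):
            exhausted = counts[i] >= max_updates
            # assumption: ignore pixels that cannot move further in the desired direction
            stuck = (v > 0 and x[i] >= upper) or (v < 0 and x[i] <= lower)
            masked.append(0.0 if exhausted or stuck else v)
        ranked = sorted(range(len(x)), key=lambda i: abs(masked[i]), reverse=True)
        chosen = [i for i in ranked[:n_pixels] if masked[i] != 0.0]
        if not chosen:
            return None                      # no admissible pixel left
        for i in chosen:
            moved = x[i] + math.copysign(step, masked[i])
            x[i] = min(upper, max(lower, moved))   # clip to the input domain
            counts[i] += 1
    return x if target_prob_fn(x) > 0.5 else None


# Toy stand-in for a classifier: logistic model on four "pixels" (made-up weights).
W = [2.0, -1.0, 0.5, 0.0]
B = -1.5


def toy_target_prob(x):
    logit = sum(w * xi for w, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-logit))


def toy_grad(x):
    # gradient of log p(target | x) for the logistic toy model
    p = toy_target_prob(x)
    return [(1.0 - p) * w for w in W]


original = [0.0, 1.0, 0.0, 0.5]
counterfactual = deduce_style_search(original, toy_grad, toy_target_prob)
changed = [i for i, (a, b) in enumerate(zip(original, counterfactual)) if a != b]
```

In the toy run only the two pixels with the largest influence on the target probability are modified, which mirrors the sparsity the pixel-wise procedure is designed for. In a real setting `grad_fn` would return the (relative) input-space gradient of the classifier and density terms instead of the toy gradient used here.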
4 Experiments
4.1 Dataset and metrics
We perform experiments on the MNIST dataset LeCun et al. [1998], so far the only widely used image dataset in the literature on counterfactual explanations. We use the $L_1$ and $L_0$ distances in input space to assess similarity and sparsity, respectively. We also want to measure how realistic the generated counterfactuals are, as this is usually helpful (cf. section 2.1). There is no consensus in the field on how to measure realism (or ‘plausibility’ Keane et al. [2021]), and for images, this proves to be quite difficult. Therefore, we turn to the literature on anomaly detection and use the approach that performed best on an anomaly detection task for MNIST in a recent study Ruff et al. [2021]. This approach, called AnoGAN Schlegl et al. [2017], uses a pretrained generative adversarial network (GAN, Goodfellow et al. [2014]) to compare a given input with the closest image that the GAN can generate. AnoGAN uses gradient descent in latent space to minimise the loss
$$\mathcal{L}(z) = (1 - \lambda) \cdot \lVert x - G(z) \rVert_1 + \lambda \cdot \lVert h(x) - h(G(z)) \rVert_1 \tag{7}$$
where $G$ is the generator and $h$ is a mapping to a later layer of the discriminator. The first term gives the distance between the generated sample and the input, whereas the second term measures how similar their feature representations are in the discriminator model. We use a Wasserstein GAN Arjovsky et al. [2017] trained on MNIST. To reduce the dependence on the initial $z$, we perform gradient descent three times from different, randomly sampled starting points. We fix the weighting $\lambda$ and the initial learning rate, and bound the number of iterations to 4000. We tested how well the resulting metric distinguishes actual MNIST images from EMNIST characters as well as from FashionMNIST images. Our AnoGAN method achieves an AUROC of 0.913 and 0.998, respectively. Note that these comparison images are quite different from the generated counterfactuals, so the tests only provide general sanity checks rather than actual performance tests.
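For readers unfamiliar with the AUROC figures quoted in the sanity check, the following sketch shows how such a value can be computed from anomaly scores alone, as the probability that a randomly drawn anomalous input receives a higher score than a randomly drawn normal one (ties count half). The function name `auroc` and the example scores are made up for illustration.

```python
def auroc(normal_scores, anomalous_scores):
    """AUROC of an anomaly score via pairwise comparisons (Mann-Whitney statistic)."""
    wins = 0.0
    for a in anomalous_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0      # anomalous input correctly ranked higher
            elif a == n:
                wins += 0.5      # ties count half
    return wins / (len(anomalous_scores) * len(normal_scores))


# Made-up anomaly scores: in-distribution inputs should score lower.
in_distribution = [1.0, 3.0]
out_of_distribution = [2.0, 4.0]
value = auroc(in_distribution, out_of_distribution)
```

A value of 1.0 means the score separates the two sets perfectly, while 0.5 corresponds to chance. The quadratic pairwise loop is fine for small evaluation sets; a rank-based implementation would be preferable for large ones.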
4.2 Baselines
To assess the performance of the novel algorithm, we compare it with three baselines applied to ResNets without spectral normalisation. The prototypeguided approach that we shall call ‘VLK’ [Van Looveren and Klaise, 2020] as well as REVISE [Joshi et al., 2019] were discussed in section 2.3. Their selection is largely based on a general scarcity of algorithms that provide counterfactual explanations, were demonstrated on image data, and are applicable to ResNets. Although also demonstrated on MNIST, the mentioned approach of [Schut et al., 2021] is not included since it cannot be applied to single ResNets. In order to get the required calibrated uncertainty outputs, one would need an ensemble of around ten models [Lakshminarayanan et al., 2017], which would make the results much less comparable. In addition to VLK and REVISE, we include JSMA [Papernot et al., 2016] as a third baseline.
It should be noted that JSMA was introduced to generate (perceptible) adversarial examples rather than counterfactual explanations, so the comparison with regard to realism is not a fair one. However, since DeDUCE is loosely based on JSMA, the comparison is interesting as it shows how the modifications affect the results. While the original paper recommends changing two pixels at a time, we change one as this makes it perform better in our setting and more comparable to the used DeDUCE algorithm.
To generate counterfactuals with VLK, we use the authors’ implementation in the alibi package [Klaise et al., 2021] in order to be as faithful as possible. As the algorithm is already tuned to MNIST, we only make one change to the default setting, namely setting the number $k$ of nearest (encoded) neighbours in the prototype loss term to 5. This is recommended in the paper and generally improves the quality of the generated counterfactuals. Note that VLK uses an optimisation algorithm that includes the model’s target confidence in its prediction loss term. This means that, contrary to the other three algorithms, we cannot prescribe the generated counterfactuals to have a target confidence just above 50%. In fact, they have a mean confidence of 0.42 and standard deviation of 0.44, with many values being close to 0 or 1.
The third baseline REVISE requires more tuning, as it has not been demonstrated on MNIST before. REVISE is designed to be applicable to image data, with the authors providing a demonstration on the CelebA dataset. Sample reconstructions of the used VAE are shown in appendix B. Note that a more powerful generative model than the one employed here could improve the quality of the generated counterfactuals; this might, however, come with even higher engineering efforts and computational costs. Appendix B provides more details on the implementation of REVISE.
4.3 Results
With the four algorithms tuned to the dataset and base model, we can look at their performance. We use five sets of 100 original MNIST images from the test set and try to find counterfactuals for all 9 potential target classes, resulting in 4,500 runs overall for each algorithm. The quantitative metrics are the $L_0$ and $L_1$ distances to the original, and the rate at which the algorithms fail to output a candidate counterfactual. For the evaluation of how realistic/anomalous the images are, we use the anomaly detection metric ‘AGAN’ based on AnoGAN. The results are reported in table 1. In order to ensure a fair comparison, only image-target pairs are included for which all algorithms found a counterfactual.
algorithm  AGAN  $L_0$  $L_1$  failure in %

DeDUCE  22.51 (0.72)  21.16 (0.46)  10.72 (0.10)  0.00 (0.00) 
JSMA  23.90 (0.41)  25.65 (0.65)  12.63 (0.27)  3.09 (0.28) 
VLK  22.95 (0.93)  155.43 (4.15)  38.95 (1.19)  0.09 (0.20) 
REVISE  20.09 (0.88)  752.46 (4.98)  53.86 (0.58)  26.80 (1.03) 
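The fair-comparison rule used for the table, i.e. averaging only over image-target pairs for which every algorithm found a counterfactual, can be sketched as follows. The data layout (a dictionary per algorithm mapping `(image_id, target)` to a distance, with `None` marking a failure), the function name `compare_on_common_pairs`, and all numbers are hypothetical.

```python
from statistics import mean


def compare_on_common_pairs(results):
    """Average a metric over pairs solved by all algorithms; report failure rates separately."""
    solved = [
        {pair for pair, value in runs.items() if value is not None}
        for runs in results.values()
    ]
    common = set.intersection(*solved)   # pairs for which every algorithm succeeded
    averages = {name: mean(runs[pair] for pair in common)
                for name, runs in results.items()}
    failures = {name: sum(v is None for v in runs.values()) / len(runs)
                for name, runs in results.items()}
    return common, averages, failures


# Made-up distances for two hypothetical algorithms on three image-target pairs.
results = {
    "A": {(0, 1): 10.0, (0, 2): 12.0, (1, 3): 8.0},
    "B": {(0, 1): 50.0, (0, 2): None, (1, 3): 40.0},
}
common, averages, failures = compare_on_common_pairs(results)
```

Restricting the averages to the common pairs avoids rewarding an algorithm for failing on hard pairs, while the failure rate is still computed over all attempted pairs.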
REVISE achieves the best results on the AnoGAN metric, but its scores are not as good as those of the original images. VLK and DeDUCE do not show significant differences on this metric. We note that the directly reconstructed images under the REVISE VAE (without applying REVISE) are judged to be more realistic than the original images. This shows that the metric has a bias towards VAE-generated images, which benefits REVISE and perhaps also VLK, as the latter minimises an autoencoder loss. DeDUCE achieves better results than JSMA on all four metrics, with all of these differences being highly significant (on paired t-tests). Both VLK and REVISE perform much worse than the other two on the sparsity as well as the similarity metric. This is expected, since both DeDUCE and JSMA take pixel-wise steps in the input space. The standard deviations on all metrics are generally fairly low, especially for DeDUCE and JSMA.
There are great differences with respect to the required computation times, not only because the provided implementation of VLK does not support the generation of counterfactuals in batches. Even when all approaches are used to generate counterfactuals individually, VLK is the slowest approach (table 4.3). JSMA is slightly faster than DeDUCE, which is still significantly faster than REVISE. REVISE has a large standard deviation because some runs take particularly long. Note that REVISE could work with a slightly larger step size and with a lower number of iterations, at the expense of quality. All computations were performed on NVIDIA Tesla T4 GPUs.
algorithm  time in sec 

DeDUCE  2.99 (1.69) 
JSMA  1.01 (0.52) 
VLK  109.86 (1.54) 
REVISE  46.66 (84.33) 
As pointed out above, the GAN-based realism metric is only partly reliable. This can already be seen from the observation that VAE reconstructions are generally judged to be more realistic than the original image. Therefore, a qualitative examination of the generated counterfactuals is also warranted, although a conclusive verdict would require a comprehensive human evaluation study. Individual examples generated by the different algorithms for randomly drawn images and targets are presented in figure 3. (Footnote 5: For both the image ids and the target classes, eight integers from 0 to 9 were drawn at random. One pair was drawn twice while for another pair, the target was equal to the label, resulting in six id-target pairs.) Note that REVISE failed to output a counterfactual in the fourth column. Columns 2 and 4 seem generally difficult, while all approaches arguably find good counterfactuals for columns 1 and 5, with the exception of JSMA on the former. Columns 3 and 6 are met with varying success. Overall, DeDUCE and VLK might be seen to provide the best counterfactuals. The individual metric scores are presented in table 4.3. Despite the lack of a user study, it seems clear that the NLL and AGAN scores do not always reflect human judgement. Because of this and the much higher $L_0$ and $L_1$ scores, the low realism scores of counterfactuals generated by VLK and REVISE clearly do not imply that they are more interpretable than the ones generated by DeDUCE.
5 Discussion
As demonstrated in the experiments, DeDUCE is an efficient algorithm performing small and sparse perturbations that often result in realistic counterfactuals. In particular, it provides counterfactuals that are much more similar to the original image than those of the other considered approaches. This makes it possible to give more precise explanations for the model’s decision making. As discussed, DeDUCE is only applicable to classifiers that satisfy sensitivity and smoothness assumptions. It is sufficient to have a ResNet with some loose spectral normalisation, which might be desired anyway. Still, this clearly restricts the applicability of DeDUCE. Overall, DeDUCE could prove to be a viable technique for a considerable number of use cases, namely image classification tasks that require the deployment of large neural networks.
One limitation of this work is the lack of human evaluation studies. In order to conclusively assess how interpretable the generated counterfactuals are and how helpful the explanations are for debugging, such studies will eventually be necessary. Another limitation is the lack of demonstrations on more complex datasets. While MNIST allows a first proof of concept, the algorithm is designed to be scalable to larger datasets and should also be assessed there.
Lastly, future work could also try to improve the algorithm itself. In particular, it might be possible to use a different latent density model instead of the one used here. For example, the GMM could be fitted to ambiguous data (using multiple labels and a probabilistic fit) to improve density estimation on such inputs. This could make it necessary to also train the classification model on ambiguous data. Another option could be to use confidence weighting of the datapoints for fitting the class-wise Gaussians. We leave these ideas as avenues for future research.
Acknowledgements
We would like to thank Andreas Kirsch for helpful discussions. B.H. was supported through a DAAD scholarship. L.S. and J.M.B. were supported by DeepMind and Cancer Research UK, respectively. Both L.S. and J.M.B. were also supported by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems (EP/S024050/1).
References
Arjovsky et al. [2017]. Wasserstein generative adversarial networks. In ICML, pp. 214–223.
Bartlett et al. [2018]. Representing smooth functions as compositions of near-identity functions with implications for deep network optimization. arXiv preprint arXiv:1804.05012.
Dhurandhar et al. [2018]. Explanations based on the missing: towards contrastive explanations with pertinent negatives. arXiv preprint arXiv:1802.07623.
Dong et al. [2018]. Boosting adversarial attacks with momentum. In CVPR, pp. 9185–9193.
Freiesleben [2021]. The intriguing relation between counterfactual explanations and adversarial examples. Minds & Machines, pp. 1–33.
Gal [2016]. Uncertainty in deep learning. Ph.D. Thesis, University of Cambridge.
Goodfellow et al. [2014]. Generative adversarial nets. NIPS 27.
Goyal et al. [2019]. Counterfactual visual explanations. In International Conference on Machine Learning, pp. 2376–2384.
He et al. [2016]. Deep residual learning for image recognition. In CVPR, pp. 770–778.
Joshi et al. [2019]. Towards realistic individual recourse and actionable explanations in black-box decision making systems. arXiv preprint arXiv:1907.09615.
Karimi et al. [2021]. A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. arXiv preprint arXiv:2010.04050.
Keane et al. [2021]. If only we had better counterfactual explanations: five key deficits to rectify in the evaluation of counterfactual XAI techniques. arXiv preprint arXiv:2103.01035.
Kenny and Keane [2021]. On generating plausible counterfactual and semi-factual explanations for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 11575–11585.
Klaise et al. [2021]. Alibi: algorithms for explaining machine learning models.
Kusner et al. [2017]. Counterfactual fairness. arXiv preprint arXiv:1703.06856.
Lakshminarayanan et al. [2017]. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS.
LeCun et al. [1998]. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
Lipton [2018]. The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16 (3), pp. 31–57.
Liu et al. [2020]. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. arXiv preprint arXiv:2006.10108.
Miyato et al. [2018]. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
Molnar [2020]. Interpretable machine learning. Lulu.com.
Mukhoti et al. [2021]. Deterministic neural networks with appropriate inductive biases capture epistemic and aleatoric uncertainty. arXiv preprint arXiv:2102.11582.
Papernot et al. [2016]. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387.
Ruff et al. [2021]. A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE.
Schlegl et al. [2017]. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In International Conference on Information Processing in Medical Imaging, pp. 146–157.
Schut et al. [2021]. Generating interpretable counterfactual explanations by implicit minimisation of epistemic and aleatoric uncertainties. AISTATS.
Smith and Gal [2018]. Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533.
Van Amersfoort et al. [2020]. Uncertainty estimation using a single deep deterministic neural network. In ICML, pp. 9690–9700.
Van Looveren and Klaise [2020]. Interpretable counterfactual explanations guided by prototypes. arXiv preprint arXiv:1907.02584.
Verma et al. [2020]. Counterfactual explanations for machine learning: a review. arXiv preprint arXiv:2010.10596.
Wachter et al. [2017]. Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harvard Journal of Law & Technology 31, pp. 841.
Appendix A Effect of different gradient expressions
setting  $L_0$  $L_1$  failure

27.36  13.47  0.1%  
27.29  13.45  0.1%  
27.16  13.40  0.1%  
25.88  12.92  0%  
24.58  12.38  0%  
25.10  12.53  0%  
25.38  12.65  0%  
25.45  12.75  0%  
24.47  12.32  0%  
24.92  12.46  0.1% 
Appendix B REVISE implementation
Recall that REVISE takes small steps in the latent space of the VAE, guided by the gradient of the loss function $\ell(f(G(z)), y') + \lambda \lVert G(z) - x \rVert_1$, where $f$ is the classifier, $y'$ is the target, $G$ is the generative model, and $x$ is the original image. As in the original paper, we use the cross-entropy loss function for $\ell$. In addition to the VAE, it is therefore necessary to tune $\lambda$ as well as the gradient step size. We perform a grid search over both, considering both qualitative and quantitative performance. For the three settings of the step size, we limit the number of iterations to 50,000, 10,000, and 5,000, respectively. (Footnote 6: In all three cases, fewer than 5% of the runs terminated in the last 60% of the iterations, i.e. after step 20,000, 4,000, and 2,000, respectively. This shows that to significantly decrease the failure rate, the iteration limits would need to be raised by more than an order of magnitude, if that helps at all. This is taken as a justification for keeping them at their present values.) As a comparison, recall that we limit DeDUCE (and JSMA) to 700 iterations. For one of the three values of $\lambda$, the algorithm hardly ever terminates, with a very high failure rate in all step-size settings. Comparing all settings on the same image-target pairs would then mean leaving out the vast majority of counterfactuals, so we report quantitative evaluations only for the other six settings in table B. Most notably, for $\lambda = 0.1$, the $L_1$ values are very high; a look at the generated images confirms that these are too far from the original inputs to be useful. Given $\lambda = 1$, one step-size setting performs clearly the best overall, so we adopt it for the evaluation on the test set.
$\lambda$  $L_0$  $L_1$  failure

0.1  754.34  88.79  16.1%  
0.1  758.12  91.43  12%  
0.1  764.93  84.92  11.8%  
1  767.75  58.66  23%  
1  771.78  54.72  26.3%  
1  771.89  54.11  22.2% 