1 Introduction
With the increased popularity of machine learning applications, a need for understanding machine learning models arises. Many methods have been proposed to explain the predictions of machine learning models. To name but a few, some are based on heatmaps that identify salient input features lime; lrp; deeptaylor; Chang2019, others rely on transparent surrogate models Guidotti2019; self-explaining-neural-networks, and some produce counterfactual examples Wachter2017; Dhurandhar2019; Singla2019. In this work, we consider the latter group, focusing on the image domain. Counterfactual examples identify specific changes to inputs such that the predicted outcome of a machine learning model changes. Such examples allow interactions with the model to gain insights into its behavior. For example, surveillance images of candidates picked out for screening can be assessed for biases by identifying the features that must change for the system to ignore the candidates goyal19a.
Work on counterfactual explanations has become increasingly popular stepin2021. For images, counterfactuals should convey realistic changes to the input that are minimal and necessary while being valid, sparse, and proximal mothilal2020explaining; Dhurandhar2018; VanLooveren2019; Rodriguez2021. Various metrics have been proposed to quantify aspects of the quality of counterfactual examples. However, most metrics are used in isolation to evaluate the method proposed in the corresponding paper and to compare it to others that are evaluated on different metrics. Furthermore, there is little or no research on what properties the different metrics actually capture. The lack of standard metrics, and of knowledge about what the metrics capture, makes it difficult to compare methods, which potentially slows down scientific progress within the field.
In this work, we analyze and evaluate existing metrics to understand what each metric expresses in terms of realistic and minimal changes. Through experiments, we find that most metrics have the intended behavior for image datasets of lower complexity. For a more complex dataset, our experiments show how multiple existing metrics fail to distinguish between good and bad counterfactuals. We also expose vulnerabilities of different metrics and propose to account for such vulnerabilities by reporting multiple scores in combination. Counterfactual examples comprising tiny unrealistic changes are often found to yield unintendedly good scores. To mitigate this issue, we propose two new metrics that are less susceptible to tiny changes and align well with qualitative evaluations. We argue that a proper evaluation of a counterfactual method needs both a metric for quantifying how realistic the counterfactuals are, e.g., the Fréchet Inception Distance, and a metric like our Label Variation Score to assert that the validity of the counterfactuals generalizes. Upon publication, we will also publish an evaluation framework for easy comparison of methods.
2 Counterfactual Examples
The counterfactual question seeks to find necessary and minimal changes to an input to obtain an alternative outcome from a classifier Wachter2017. An answer often comes in the form "Had values $X$ been $X'$ and all other values remained the same, then outcome $Y$ would have been $Z$." For images, an answer would be a new image similar to the input but with specific features changed. Naturally, counterfactual examples come in various forms and might not convey the information that we would expect. For example, adversarial attacks, which add imperceptible noise to inputs in order to change predictions adversarial-attacks, are of little or no relevance in terms of interpretability. As such, multiple additional criteria have been proposed that counterfactuals need to possess to be intuitive for humans.
Realistic changes.
To be useful for humans, counterfactual examples should look realistic Dhurandhar2018; Schut2021. The criterion has also been described as counterfactuals being likely to stem from the same data distribution as the training data Poyiadzi2020; Dhurandhar2019. In the image domain, realistic changes can be hard to quantify. How do you, for example, distinguish an adversarial attack from a proper counterfactual, when the attack may be closer to the input in terms of, e.g., Euclidean distance? Methods for quantifying how realistic counterfactuals are typically rely on a form of connectedness Mahajan2019; Pawelczyk2020 or on embedding spaces of deep learning models VanLooveren2019; fid. Although such metrics are important for consolidating the field of counterfactual explanations, we find experimentally that they behave unintendedly when quantifying how realistic tiny adversarial-like changes are.
Minimal changes.
For counterfactual examples to be more useful to humans, input features need to change minimally for the prediction to change Wachter2017. Associated properties are sparsity and proximity, which relate to changing only a few features and to keeping the counterfactual in the proximity of the input mothilal2020explaining. When only a few features are changed, the counterfactuals are said to be more interpretable Wachter2017. In high-dimensional domains like the image domain, quantifying minimal changes with, e.g., Euclidean distance may yield undesired results. For example, we demonstrate with an experiment that tiny adversarial-like changes can be deemed better (smaller) than realistic changes that naturally need to change more pixels. In turn, a method that performs well only on minimal changes may be producing unrealistic adversarial-like counterfactuals that are of little or no value. For a good performance on minimal changes to be meaningful, a method must thus also perform well on metrics quantifying how realistic the counterfactuals look.
Additional properties.
In an interactive setting, computational efficiency is important VanderWaa2018; VanLooveren2019; ecinn. If computations are too slow, interactions with a system will be poor. Although computation time is important, we do not study it here, as it does not quantify the quality of the counterfactuals. It is also important that humans can use the generated counterfactuals. Consequently, multiple works have conducted human studies of their methods goyal19a; Dhurandhar2018; Singla2019. Such studies are typically domain specific and thus prohibit a generalized test. Therefore, we do not include them for further evaluation here.
3 Evaluating Counterfactuals Quantitatively
In this section, we present the quantitative metrics that have been applied to images in at least two publications and analyze their applicability in terms of how they measure changes. To mitigate an observed issue with tiny adversarial-like changes, we additionally propose two new metrics. We find that each metric reflects only specific aspects of counterfactual quality and needs to be reported in combination with other metrics to avoid its isolated drawbacks.
3.1 Existing Metrics
Simple distance metrics.
A natural first approach to measuring changes between inputs and counterfactuals is to use metrics like the $L_1$- and $L_2$-norms, even though such norms are known to work poorly on high-dimensional data like images Kang2020. We include those metrics because they are present in the objective functions of gradient based counterfactual methods Wachter2017; Dhurandhar2018; VanLooveren2019 and are in turn a natural first choice for measuring minimal changes. In our experiments, we include a hybrid metric denoted the elastic net distance (EN), defined as $\mathrm{EN}(x, \bar{x}) = \|x - \bar{x}\|_1 + \|x - \bar{x}\|_2^2$, where $x$ is the input and $\bar{x}$ is the counterfactual. We use this metric because it combines the $L_1$- and the $L_2$-norm.
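To make the computation concrete, the following is a minimal NumPy sketch of the EN distance for a single input/counterfactual pair; the equal weighting of the $L_1$ and squared $L_2$ terms is an assumption, as implementations often weight the $L_1$ term.

```python
import numpy as np

def elastic_net_distance(x, x_cf):
    """Elastic net distance: L1 norm of the change plus squared L2 norm of the change."""
    delta = (x_cf - x).ravel()
    return np.abs(delta).sum() + np.square(delta).sum()
```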
Target-class validity.
The target-class validity (TCV) Mahajan2019 quantifies the percentage of the generated counterfactuals that are predicted to be of the target class by the classifier under consideration:
$$\mathrm{TCV} = \frac{100\%}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \mathbb{1}\big[f\big(g(x, t)\big) = t\big] \qquad (1)$$

In Equation (1), $\mathbb{1}[\cdot]$ is the indicator function, $\mathcal{D}$ is the test set, $f$ is the predictive function, $t$ is the target class, and $g$ is the function that generates the counterfactual examples.
The TCV score quantifies how effective a method is at creating counterfactual examples that successfully change the predicted class. It does not quantify the quality of the counterfactuals in terms of either minimal or realistic changes. In turn, it should be reported along with other metrics quantifying those properties.
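A minimal sketch of Equation (1), assuming the counterfactuals have already been generated and that classifier returns class probabilities for a batch of images (function and argument names are illustrative):

```python
import numpy as np

def target_class_validity(classifier, counterfactuals, targets):
    """Percentage of counterfactuals predicted as their intended target class."""
    preds = np.argmax(classifier(counterfactuals), axis=1)
    return 100.0 * np.mean(preds == np.asarray(targets))
```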
IM1.
VanLooveren2019 introduce the IM1 score, which employs auto-encoders to approximate how well counterfactual examples follow the training data distribution. The score is the ratio between how well the counterfactual example $\bar{x}$ of target class $t$ can be reconstructed by an auto-encoder $AE_t$ trained on data from the target class and how well it can be reconstructed by an auto-encoder $AE_o$ trained on data from the input class $o$:

$$\mathrm{IM1}(\bar{x}) = \frac{\|\bar{x} - AE_t(\bar{x})\|_2^2}{\|\bar{x} - AE_o(\bar{x})\|_2^2 + \epsilon} \qquad (2)$$

A lower value means that $\bar{x}$ follows the distribution of the target class $t$ better than that of the input class $o$ VanLooveren2019.
As argued in the previous section, it is important to measure how well counterfactual examples follow the distribution of the training data, and the IM1 score is a valuable tool for assessing such a property. As the score is quantitative, it also allows comparing different methods across publications. Furthermore, the score is somewhat established as a metric, as multiple papers report it VanLooveren2019; Mahajan2019; Schut2021.
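For illustration, a sketch of Equation (2) for a single counterfactual, assuming ae_target and ae_input are trained auto-encoders that map an image to its reconstruction (names are illustrative):

```python
import numpy as np

def im1(x_cf, ae_target, ae_input, eps=1e-8):
    """IM1: reconstruction error under the target-class auto-encoder relative to
    the error under the input-class auto-encoder (lower is better)."""
    num = np.sum((x_cf - ae_target(x_cf)) ** 2)
    den = np.sum((x_cf - ae_input(x_cf)) ** 2) + eps
    return num / den
```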
The IM1 score can, however, be deceiving. We find experimentally that methods which make tiny changes to the input can get an undesirably good score, presumably because tiny changes yield almost no reconstruction error even if the changes are not preserved by the auto-encoders. Some classes may also be easier to reconstruct than others, resulting in skewed scores for different target classes: one target class may simply yield a lower numerator in Equation (2) than another, just because it is easier to reconstruct. Finally, we know of no publicly available pre-trained auto-encoders for computing the score. When new auto-encoders need to be trained for each publication, results may not be comparable across publications. Through experiments, we demonstrate this issue by showing how, e.g., differences in normalization yield incomparable scores.

IM2.
The IM2 score is also introduced in VanLooveren2019. It utilizes the discrepancy between reconstructions of the counterfactual $\bar{x}$ made by the class-specific auto-encoder $AE_t$ and an auto-encoder $AE$ trained on the entire training set:
$$\mathrm{IM2}(\bar{x}) = \frac{\|AE_t(\bar{x}) - AE(\bar{x})\|_2^2}{\|\bar{x}\|_1 + \epsilon} \qquad (3)$$
According to the authors, a low value of IM2 indicates an interpretable counterfactual because the counterfactual follows the distribution of the target class as well as the distribution of the whole dataset. The applicability of the IM2 score is, however, debatable. Schut2021 demonstrate that the score fails to identify out-of-sample images. Mahajan2019 also argue that both IM1 and IM2 are better reported by displaying the numerator and denominator of each score separately. For both IM1 and IM2, we further find experimentally that when the complexity of the dataset increases, the computed scores become close to statistically indistinguishable amongst three different counterfactual methods. In turn, the two metrics may be best suited for datasets of lower complexity.
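Analogously to IM1, a sketch of Equation (3), assuming ae_target and ae_full are the class-specific and full-dataset auto-encoders (names are illustrative):

```python
import numpy as np

def im2(x_cf, ae_target, ae_full, eps=1e-8):
    """IM2: discrepancy between the class-specific and full-dataset reconstructions,
    normalized by the L1 norm of the counterfactual (lower is better)."""
    num = np.sum((ae_target(x_cf) - ae_full(x_cf)) ** 2)
    den = np.sum(np.abs(x_cf)) + eps
    return num / den
```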
Fréchet Inception Distance.
The Fréchet Inception Distance (FID) is a metric used for evaluating generative models fid. The metric compares how similar two datasets are by comparing statistics of embeddings from the Inception V3 network inceptionv3. For counterfactuals, the score has been used to evaluate how well counterfactuals align with the original dataset Rodriguez2021; Singla2019. FID is defined from the mean Inception embeddings $\mu_1$ and $\mu_2$ and the covariance matrices $\Sigma_1$ and $\Sigma_2$ of the test set and the associated counterfactuals, respectively:
$$\mathrm{FID} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1 \Sigma_2\big)^{1/2}\Big) \qquad (4)$$
In this work, we consider images that are smaller than the 299x299 pixel inputs of the Inception V3 model. Consequently, we compute the score with a different embedding network: the last hidden layer of a convolutional neural network identical to the model being explained by the counterfactual methods. For brevity, we still refer to the resulting score as FID, although it is not directly comparable to FID scores computed with the Inception V3 network. Although the score depends on the embedding network, we believe that our results will extend to the Inception V3 network.

The FID score is a good fit for evaluating whether generated counterfactuals follow the distribution of the training data, as it is currently the standard metric for evaluating generative models. It does, however, not take into account the relation between each specific input and its associated counterfactual. As such, the metric could, e.g., be "fooled" by a high-performing generative model producing realistic samples independent of the inputs. Consequently, the metric should be reported in combination with another metric which evaluates the validity of each counterfactual. We also find in experiments that methods generating tiny changes to the input may be deemed of high quality, maybe because the tiny changes are either filtered out by the embedding network or do not affect the summarizing statistics of the score.
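The following sketch computes Equation (4) from two sets of embeddings (rows are samples), e.g., the last-hidden-layer activations described above; the embedding network itself is assumed to be given.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_test, emb_cf):
    """Frechet distance between two sets of embeddings (rows are samples)."""
    mu1, mu2 = emb_test.mean(axis=0), emb_cf.mean(axis=0)
    sigma1 = np.cov(emb_test, rowvar=False)
    sigma2 = np.cov(emb_cf, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```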
3.2 New Proposed Metrics.
Through experiments, we find that tiny changes similar to adversarial attacks often yield undesirably good scores. To mitigate this issue, we introduce two new metrics. Both metrics rely on the assumption that tiny adversarial-like counterfactuals are very model specific Liu2017. Under this assumption, evaluating counterfactuals with other classifiers should be less susceptible to tiny changes and more effective when the changes are semantically meaningful.
Label Variation Score.
For datasets where each data point is associated with multiple class labels, individual classifiers for each class label can give insights into how each class is affected by a counterfactual change. Naturally, the class targeted by the counterfactual should be affected, while unrelated classes should not. At a high level, we use individual classifiers for each class label as a proxy for how much the concept related to the given class has changed in the counterfactual image.
We propose the Label Variation Score (LVS) to monitor predicted outcomes over different class labels. LVS computes average Jensen-Shannon (JS) divergences, denoted $d_{JS}$, between predictions on inputs and counterfactuals. Let $o_\ell$ be an "oracle" classifier trained on the class label $\ell$, which outputs a discrete probability distribution over the values of that label. The LVS for label $\ell$ is then defined as

$$\mathrm{LVS}(\ell) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} d_{JS}\big(o_\ell(x),\, o_\ell(g(x, t))\big) \qquad (5)$$
As the score is based on individual classifiers for each class label, it should be less affected by adversarial attacks. Intuitively, non-related labels should not be affected by counterfactuals and should thus have a low LVS, while labels that correlate with the counterfactual label may co-vary and get a higher LVS. For example, if an image of a face without makeup is changed to one with makeup, the face in the counterfactual should be predicted to smile as much as before, but the prediction of "wearing lipstick" may follow the prediction of "wearing makeup", as lipstick is a subset of makeup.
LVS yields a rich picture of which features are changed by the counterfactuals, and it allows human judgement of which features are allowed to change, as with the makeup and lipstick example. Using the score, it becomes easier for humans to detect biases in the predictive model by identifying features that are changed unintentionally. On the contrary, LVS has the drawback that it needs multiple class labels to be applicable. Also, attributes that are not labeled cannot be monitored. In our experiments, LVS yields scores that align well with human interpretation on two different datasets. We further verify the underlying hypothesis described above by finding that more realistically looking counterfactuals get better scores than, e.g., examples with tiny adversarial-like changes.
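A minimal sketch of Equation (5) for a single class label, assuming oracle maps one image to a discrete probability distribution over that label's values (all names are illustrative):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete probability distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

def label_variation_score(oracle, inputs, counterfactuals):
    """Average JS divergence between oracle predictions on inputs and counterfactuals."""
    return float(np.mean([js_divergence(oracle(x), oracle(x_cf))
                          for x, x_cf in zip(inputs, counterfactuals)]))
```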
The oracle score.
For datasets where the LVS is not applicable, we propose a simpler metric based on training an additional classifier, the oracle $o$, which is used to classify the counterfactual examples. The oracle score is the percentage of counterfactuals that are classified to the target class by both the classifier being explained, $f$, and the oracle, $o$:
$$\mathrm{Oracle} = \frac{100\%}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \mathbb{1}\big[f\big(g(x, t)\big) = t\big] \cdot \mathbb{1}\big[o\big(g(x, t)\big) = t\big] \qquad (6)$$
The oracle score is similar to TCV, but it is intended to avoid giving good scores to tiny adversarial-like changes. The score depends on the additional oracle, which could tell more about the oracle than about the predictive model itself. For example, it may be that adversarial attacks working on $f$ also work on $o$, which would wrongly yield a good score for such attacks. However, we find experimentally that the score favors realistic counterfactual examples over tiny adversarial-like changes.
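A sketch of Equation (6), assuming both classifier and oracle return class probabilities for a batch of counterfactuals (names are illustrative):

```python
import numpy as np

def oracle_score(classifier, oracle, counterfactuals, targets):
    """Percentage of counterfactuals classified as the target class by both the
    classifier being explained and an independently trained oracle."""
    preds_f = np.argmax(classifier(counterfactuals), axis=1)
    preds_o = np.argmax(oracle(counterfactuals), axis=1)
    targets = np.asarray(targets)
    return 100.0 * np.mean((preds_f == targets) & (preds_o == targets))
```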
4 Experiments
In this section, we study the metrics described above for different types of counterfactual methods to characterize what properties the different metrics capture. We demonstrate through multiple experiments that no metric can express all desirable properties, and thus they should be used in combination.
Methods.
Throughout the experimental section, we compare three different methods for producing counterfactual explanations. The methods were chosen to represent a spectrum ranging from gradient based methods producing sparse but less realistic changes at one extreme to methods based on generative models generating more realistic but larger changes at the other extreme.
At one end of the spectrum, Wachter2017 present a gradient based method (denoted GB). Counterfactuals are generated through gradient descent on the input to minimize a loss function comprising an $L_1$-norm "distance" term, which encourages minimal and sparse changes, and a squared "prediction" loss on the predicted label, which encourages valid counterfactual examples (we do not normalize each feature by the median absolute deviation, as image features have identical value ranges). Another method at this end of the spectrum is Dhurandhar2019, which follows a similar loss function but with a more complex distance term. It should be mentioned that both methods were originally introduced for tabular data. We here study GB in the image domain as a simple method that produces counterfactuals with minimal changes that look less realistic.

At the other end of the spectrum, we include the method proposed by ecinn as a representative of methods based on generative models (denoted GEN). The method is based on conditional invertible neural networks (INNs), which are generative models that can also perform classification ibinn. Counterfactual embeddings are found by correcting embeddings of inputs such that the predicted class provably changes, and counterfactual examples are subsequently generated by inverting the embeddings with the INN. We find the method from ecinn to be the most extreme case at this end of the spectrum, compared to, e.g., Rodriguez2021; Singla2019, because it uses the same neural network both for predictions and for generating counterfactuals. In contrast, Rodriguez2021 and Singla2019 train surrogate generative models, which are used for sampling counterfactuals. We study GEN as a more complex method which produces more realistic counterfactuals but with larger amounts of change.
In the middle of the spectrum, methods use gradients to compute counterfactuals similarly to Wachter2017, but the gradient optimization is guided by derivatives of generative models or other more sophisticated loss terms to enhance the quality of the counterfactuals Dhurandhar2018; VanLooveren2019. In our experiments, we use the method proposed by VanLooveren2019 as the representative (denoted GL). The method uses embeddings from an auto-encoder to optimize a class-prototype loss. Such a method should produce counterfactuals where both the visual quality and the amount of change lie in between GB and GEN. However, we find that in most cases, the visual quality is on par with GB in practice.
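For intuition, the following is a minimal PyTorch sketch of the gradient-based objective underlying GB (and, with additional loss terms, GL); it is not the alibi implementation used in our experiments, and the loss weighting, step count, and pixel range are illustrative assumptions.

```python
import torch

def gradient_counterfactual(model, x, target, lam=0.1, steps=500, lr=0.05):
    """Gradient descent on the input: squared prediction loss toward the target
    class plus an L1 distance term encouraging sparse, minimal changes."""
    x_cf = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        p_target = torch.softmax(model(x_cf.unsqueeze(0)), dim=-1)[0, target]
        loss = (p_target - 1.0) ** 2 + lam * torch.abs(x_cf - x).sum()
        loss.backward()
        optimizer.step()
        x_cf.data.clamp_(0.0, 1.0)  # assumes pixels normalized to [0, 1]
    return x_cf.detach()
```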
Experimental details.
The methods from Wachter2017 and VanLooveren2019 were implemented using the alibi framework (https://docs.seldon.io/projects/alibi, v0.5.9, default parameters, Apache License 2.0), and the method from ecinn was adopted from the official code (https://github.com/fhvilshoj/ECINN, default parameters, MIT license). The former two methods are used to identify counterfactuals for the same "vanilla" convolutional neural network, identical to the one described in VanLooveren2019. The latter method is based on a conditional INN as the predictive model, identical to that of ecinn. In turn, the presented results may partly be attributed to differences in architectures and not to the methods as such. However, the goal of the experiments is not to identify a superior method, but to demonstrate properties of metrics for evaluating counterfactual explanations on images. We note that the method from ecinn generates counterfactuals for all classes different from the input class, so throughout the experiments, we choose one target class uniformly at random for each input. Additional experimental details on, e.g., hyperparameters for training are provided in the supplementary material.
Throughout the experiments, we report mean scores over the entire test set with 95% confidence intervals in parentheses. Except for TCV, we report scores only on valid counterfactual examples from the test sets, i.e., we do not include counterfactuals that did not change the predicted class.

4.1 FakeMNIST
ecinn propose FakeMNIST, an artificial dataset which dictates the relationship between pixels and labels. To generate the dataset, MNIST images mnist are shuffled and assigned new random labels, and the top-left pixels are colored according to the new labels; see the first row of fig:fake_mnist. The digits present in the images are independent of the labels, while only the top-left pixels are label-dependent. The dataset can be used to test whether counterfactual methods change only class-related features. There are, however, no metrics associated with the dataset. We apply the LVS to further the evaluation protocol for the dataset and test whether LVS detects methods that change label-independent features.
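For concreteness, a sketch of the FakeMNIST construction; the exact encoding of the label in the top-left pixels (here a single bright pixel at the column index equal to the label) is an assumption for illustration.

```python
import numpy as np

def make_fake_mnist(images, seed=0):
    """Shuffle MNIST images, assign new random labels, and encode each label
    in the top-left pixels so that only those pixels are label-dependent."""
    rng = np.random.default_rng(seed)
    images = images[rng.permutation(len(images))].copy()
    labels = rng.integers(0, 10, size=len(images))
    for img, y in zip(images, labels):
        img[0, y] = images.max()  # label-dependent dot; the digit stays label-independent
    return images, labels
```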
[Figure: fig:fake_mnist shows FakeMNIST inputs and counterfactuals from the three methods; fig:fake-js shows LVS on the FakeMNIST and MNIST labels.]
Column 1 of fig:fake_mnist displays four samples from the FakeMNIST test set. Smaller rectangles magnify the top-left pixels for increased readability. The first column displays inputs with labels 7, 8, 6, and 6, respectively (cf. labels or top-left dot locations). The following three columns are counterfactuals toward the target class, generated by the three representative methods. In fig:fake-js, we show LVS for both the FakeMNIST labels and the original MNIST labels. As intended, the LVS finds that GEN most successfully produces counterfactuals that change the predicted class (high LVS on FakeMNIST) while leaving the digit related pixels untouched (an LVS of zero on MNIST). Furthermore, the LVS reveals that both GB and GL produce less effective counterfactuals, as their LVS on FakeMNIST is lower. Through the high LVS on MNIST, the metric also finds that the two methods wrongly alter digit related pixels when generating counterfactuals.
In tab:fake_mnist in the supplementary material, we include scores of all other metrics described in sec:evaluating-counterfactuals. All the scores behave as expected and quantify differences between the methods properly. In conclusion, we find that for this simple dataset, qualitative observations and quantitative evaluations are well aligned in general. In turn, we argue that to provide a complete picture of performance, new methods can provide all the presented scores for the FakeMNIST dataset.
4.2 Normalization
Most metrics presented in this paper depend on data normalization or pretrained models. The dependence makes reporting both data normalization and model specifications crucial for reproducibility. We demonstrate the normalization issue with a practical example where we apply the metrics to the same counterfactuals but with different normalization. The metrics have been adjusted to each normalization, i.e., new models were trained to operate on the particular normalization.
| Method | EN | IM1 | IM2 | FID | Oracle |
|---|---|---|---|---|---|
| *First normalization range* | | | | | |
| GB | 16.07 (0.18) | 0.99 (0.00) | 0.55 (0.01) | 50.23 | 73.38% (0.87) |
| GL | 42.76 (0.31) | 0.99 (0.00) | 0.53 (0.00) | 308.43 | 37.71% (0.95) |
| GEN | 99.17 (0.58) | 0.88 (0.00) | 0.17 (0.00) | 90.73 | 93.13% (0.50) |
| *Second normalization range* | | | | | |
| GB | 16.07 (0.18) | 1.06 (0.00) | 2.46 (0.02) | 24.92 | 48.25% (0.98) |
| GL | 42.76 (0.31) | 1.04 (0.00) | 1.94 (0.01) | 173.82 | 38.53% (0.95) |
| GEN | 99.17 (0.58) | 0.89 (0.00) | 1.47 (0.01) | 37.89 | 91.92% (0.53) |
In tab:mnist-scores, we report mean scores for the two different normalization ranges. By comparing the numbers between normalizations, we see that the best performing method for each metric is the same, independent of the normalization. In fig:models-avg-scores in the supplementary material, we even find this result to be statistically significant across 10 independently initialized models. It should be noted that the EN score is invariant to data shifts and scales linearly with the normalization range (cf. Table 1).
As the table indicates, there is, however, an issue. Had, e.g., the IM2 metric been used to compare GL under the first normalization against GEN under the second, the conclusion would have been wrong, as GL would be deemed better than GEN. Although this issue may seem obvious, it occurs in the literature. If one compares scores reported by VanLooveren2019 and Mahajan2019, the difference is about an order of magnitude, and the two publications use different normalization ranges. We believe that the normalization differences contribute to explaining the difference between the reported scores. In turn, we propose to establish a common set of models with a fixed normalization range to be used for every evaluation, such that comparisons across publications become possible. Upon publication, we will release our code to allow other researchers to easily evaluate their counterfactual methods.
4.3 Inspecting Scores
To get a deeper insight into how different metrics behave, we have identified pairs of inputs and counterfactuals for which there are unintended differences in scores. We find that in some cases, which may be important for evaluating counterfactuals on specific datasets with specific properties, existing metrics can lead to wrong conclusions when applied in isolation. With one exception, similar findings to those presented here were made for the CelebA-HQ dataset (see appendix).
EN.
For the image domain, the EN distance is known to work poorly in terms of quantifying small interpretable changes Kang2020. For completeness, we demonstrate the issue in fig:extremesa, which shows a seven to the left and two counterfactuals with target class nine (center and right). The EN distance is displayed above the two counterfactuals. Arguably, the center image still looks most like a seven, while the right image looks like a nine. However, according to the EN distance, the center image is an order of magnitude better than the right one. The example illustrates how tiny adversarial attacks may be deemed better than proper counterfactual examples, just because they change the input less.
[Figure: fig:extremes — input digits and counterfactual pairs illustrating (a) EN, (b) IM1, and (c) IM2 scores for individual samples.]
IM1.
fig:extremesb depicts a one together with two counterfactuals toward two different target classes. The IM1 score is meant to quantify how realistic counterfactual examples are. Visually, the two counterfactual examples look similarly realistic. The center image does, however, get twice as good a score as the right one, i.e., the seven (center) was deemed more realistic by the IM1 score. Holding all else equal, this might be because more white pixels yield a larger reconstruction error. For isolated cases like this, the IM1 score may produce undesirable results. As observed in Table 1, the metric does, however, seem to work well on average when comparing methods on MNIST. As such, the IM1 score is best used as a summarizing statistic to compare averages over many samples across methods.
IM2.
For the IM2 score, which should yield lower values for more interpretable counterfactuals, fig:extremesc shows how the score gives an almost completely black image a better score than an image of an eight. On the contrary, the right image, with the worse score, seems more interpretable from a human perspective. The center image presumably scores best because it contains close to no information, which is easier to reconstruct than the right image, which contains more information. We demonstrate in the appendix that it holds more generally that the IM2 score decreases when we decrease pixel values toward their minimum. In turn, the score might wrongly reward methods that produce less interpretable counterfactuals by removing information from the inputs. To account for this drawback, also reporting, e.g., the oracle score makes it harder to obtain good scores on both metrics at once; good scores on both metrics at once are thus preferred.
FID.
[Figure: fig:mnist-faed — MNIST counterfactual examples from GB, GL, and GEN together with the test-set-wide FID scores.]
For FID, which quantifies the population-wide similarity of two sets of embedded images, it is not possible to identify single extreme samples. Instead, we observe how well FID distinguishes realistic from unrealistic samples. fig:mnist-faed shows counterfactual examples and the test-set-wide FID scores. For a human observer, both GB and GL do a poor job of generating realistic changes; it is, e.g., harder for humans to identify the target class for these two methods. FID does not identify the realistically looking samples in this particular case, as it yields better scores for both GB and GL. Interestingly, we find in the next section that FID successfully identifies the more realistically looking counterfactuals for the more complex CelebA-HQ dataset. To deal with the identified issue, we argue that the FID score should be reported together with the LVS or the oracle score, which are less vulnerable to tiny adversarial-like changes. If both the FID and the LVS or oracle scores are good, then the quality of the counterfactuals is more likely to be high.
In conclusion, we find that in isolated cases, the metrics may be deceiving. We argue that the metrics should be reported jointly to account for each other's drawbacks. For example, if a method gets a low IM2 score indicating interpretable counterfactuals, a low FID score indicating realistic counterfactuals, and a high oracle score indicating that the counterfactuals generalize, it is a strong indicator that the method is good.
4.4 Complex data
In this section, we scale our experiments to the more complex dataset, CelebA-HQ celeba. The goal is to evaluate how the studied metrics work in a more complex setting.
CelebA-HQ is a dataset of faces, where each sample is associated with 40 binary class labels. fig:celeba presents four different inputs in the first column. The first two have a positive makeup label and the last two have a negative label. The following three columns show counterfactual examples toward the opposite label value. Qualitatively, we find that the three compared methods produce counterfactual examples with similar properties as for FakeMNIST and MNIST. In contrast, when we consider Table 2, we find that for some metrics the quantitative results differ from the previous experiments. Specifically, we see that the IM1 and IM2 scores fail to distinguish good from bad counterfactuals, as they yield almost the same value for all three methods. We also observe that the FID, in contrast to the previous experiment, successfully distinguishes the realistic from the unrealistic counterfactuals by giving GEN the lowest (best) score and GL the highest. In turn, for this more complex dataset, FID is not as vulnerable to tiny adversarial-like attacks. For completeness, we also mention that the oracle score and the EN metric behave as expected. That is, the oracle score successfully identifies the generative based method as the one that most reliably changes the class predicted by the oracle, while the two other methods are found to be less successful. The EN metric correctly identifies the smallest changes, but the score is of little interest in the present comparison, as adversarial-like changes are still favored by the metric. In a comparison of two methods which do not produce such tiny changes, the EN metric might, however, be valuable for quantifying how much each method changes.
| Method | TCV | EN | IM1 | IM2 | FID | Oracle |
|---|---|---|---|---|---|---|
| GB | 96.07% (0.72) | 147.04 (2.04) | 0.98 (0.00) | 0.47 (0.01) | 205.59 | 82.82% (1.42) |
| GL | 81.09% (1.44) | 344.02 (18.13) | 0.99 (0.00) | 0.52 (0.01) | 484.08 | 32.84% (1.92) |
| GEN | 99.26% (0.32) | 684.26 (11.86) | 1.03 (0.00) | 0.53 (0.01) | 98.35 | 89.91% (1.11) |

[Figure: fig:celeba — CelebA-HQ inputs and counterfactuals from GB, GL, and GEN, together with LVS for the counterfactual label and four additional labels.]
To also evaluate LVS on the more complex dataset, we have computed the score for the counterfactual label (makeup versus no makeup) and four other labels. We chose the labels "lipstick" and "attractive", which should correlate more with the makeup label than the other two labels, "high cheekbones" and "smiling." Also on this dataset, LVS successfully avoids giving the best scores to the tiny adversarial-like changes and favors more realistic changes. Specifically, LVS identifies that GEN has a larger effect (high LVS) on the related makeup, attractiveness, and lipstick labels, while having a similarly low effect (low LVS) on the less related labels high cheekbones and smiling. LVS also successfully identifies how the changes made by, e.g., GL have less effect on all the labels, which indicates that the counterfactuals are highly model specific and behave more like adversarial examples.
In summary, we find that for the more complex CelebA-HQ dataset, both the IM1 and IM2 scores are less useful, while combining the FID with the LVS yields a trustworthy quantitative evaluation of how realistic and how valid the counterfactuals are, respectively. Minimal changes are still hard to quantify, but for two methods that perform on par on FID and LVS, the EN distance may be applicable to judge how much each method changes.
5 Conclusion
Through an analysis and experimental evaluations, we find that each quantitative metric for evaluating visual counterfactual examples captures only some desired properties of good counterfactual examples. On the sufficiently simple FakeMNIST dataset, we found that all considered metrics behave as expected. However, on the more complex MNIST and CelebA-HQ datasets, behaviors deviate more from what is intended. One particular issue is that visually unrealistic, tiny adversarial-like counterfactuals are very model specific and are often unintendedly deemed good by the metrics. To overcome this issue, we present the Label Variation Score and the oracle score, which are both based on surrogate predictive models and are less vulnerable to such tiny changes. To make a proper quantitative evaluation of visual counterfactual examples, we conclude that capturing all the desired properties is best done by jointly reporting metrics concerning both realistic changes and validity.
6 Limitations and Broader Impact
By analyzing and experimentally evaluating quantitative metrics for counterfactual examples, this work contributes to scientific progress within the field of counterfactual explanations. In particular, it contributes to a better understanding of what can and cannot be expected from different quantitative metrics. Such understanding will arguably yield better evaluations of counterfactuals and consequently improve methods for generating counterfactual examples. As such, we do not see any direct negative social impacts of this work. Indirectly, improving counterfactual examples can potentially enable attackers to fool automated machine learning systems by creating realistically looking adversarial examples which yield desired outcomes.
We also recognize the limitations of our work. First, by limiting our evaluation to metrics that have been published at least twice, we have not done a complete evaluation of all existing metrics for evaluating counterfactual examples. In turn, there may be other metrics which better capture desired properties of counterfactual examples. Second, to limit the scope, we have chosen three representative counterfactual methods which represent specific properties of counterfactuals. As such, there may be other properties of counterfactual examples that we have not evaluated and for which we consequently do not know whether they affect the metrics. Finally, due to a large spread in the datasets used across publications, we have restricted our evaluation to three datasets of increasing complexity. From our work, it is not clear how our results extend to other datasets.
References
Appendix A Additional Experimental Results
A.1 FakeMNIST
In addition to the LVS, we also ran all other metrics on the FakeMNIST dataset. tab:fake_mnist presents all the scores. We find that all included scores behave as expected. Specifically, we expect the IM1 and IM2 scores to identify that counterfactuals generated by GEN are the most realistic, as they change only the top-left pixels, which should be easier to capture by the auto-encoders compared to the more scattered changes made by both GB and GL. As changing only the top-left pixels should produce little difference in terms of the EN distance, we would also expect GEN to get the lowest EN score, which is indeed the case in tab:fake_mnist. A similar argument holds for the FID: it should capture that the most realistic samples are those where only the top-left pixels are changed, which is also the case.
The TCV is not based on the perceptual quality of the counterfactuals, but on how effective each method is at changing the class predicted by the given classifier. We see from tab:fake_mnist that GEN is the most effective, which aligns well with the rest of our experiments. Finally, we see that the oracle score, which indicates whether the counterfactual examples also generalize to another classifier, identifies that counterfactuals from GEN generalize better than those of GB and GL.
| Method | TCV | EN | IM1 | IM2 | FID | Oracle |
|---|---|---|---|---|---|---|
| GB | 68.11% (0.91) | 11.60 (0.49) | 1.22 (0.01) | 0.49 (0.01) | 252.5 | 88.31% (0.76) |
| GL | 84.07% (0.72) | 47.38 (0.90) | 1.03 (0.00) | 1.23 (0.03) | 309.95 | 55.51% (1.06) |
| GEN | 100.00% (0.00) | 6.81 (0.04) | 0.68 (0.00) | 0.21 (0.00) | 0.12 | 99.98% (0.03) |
A.2 MNIST
To evaluate how sensitive the model based scores are to the initialization of the evaluation models, we trained ten individual classifiers with different random initializations for the MNIST dataset and computed the mean scores along with 95% confidence intervals. In fig:models-avg-scores, we display the results, where bars represent mean values and black lines indicate confidence intervals. From the figure, we see that all three scores show statistically significant differences at the 95% level. It should be mentioned that we test ten identical model architectures. In turn, the experiment does not reveal whether the results are also robust across different model architectures.

[Figure: fig:models-avg-scores — mean scores and 95% confidence intervals over ten independently initialized models.]
A.3 Inspecting Scores on CelebA-HQ
Similar to how we inspected scores on the MNIST dataset in sec:extreme-inputs, we have also considered similar input and counterfactual pairs for the more complex CelebA-HQ dataset. Results are shown in fig:celeba-extreme-samples, which shows a random sample from the dataset along with the counterfactuals generated by the three counterfactual methods used in this work. tab:celeba-extreme-scores shows the related scores. The figure confirms the observations mentioned in sec:extreme-inputs, as well as the observations from our other experiment on CelebA-HQ (sec:celeba). Specifically, we see that the EN distance finds the tiny adversarial-like changes from GB to be the best, which does not align with what a human observer would deem a good counterfactual example. As found in sec:celeba, the IM scores fail to distinguish the counterfactuals from GB and GL. Finally, the IM2 score yields similar values for GB and GEN, which also contradicts the human observations, as the sample from GEN seems more interpretable.
[Figure: fig:celeba-extreme-samples — a CelebA-HQ input with counterfactuals from GB, GL, and GEN; tab:celeba-extreme-scores lists the corresponding scores.]
Appendix B Experimental Details
In this section, we list all the relevant training details for the models used in this paper. We note that we supply code at https://github.com/fhvilshoj/EvaluatingCounterfactuals, which contains all counterfactual examples used throughout the experiments, the code used for evaluation, and all the models used.
Convolutional Neural Networks.
GB and GL both generate counterfactual examples for convolutional neural networks with the model architecture described in VanLooveren2019. For simplicity, we used the same model architecture for the classifiers used with the oracle score, but with a different random initialization. Unless explicitly stated otherwise in the main paper, all data was normalized to a fixed range. All convolutional neural networks were trained with categorical cross-entropy. The remaining configurations for the convolutional neural networks are stated in tab:convolutional-configuration.
| Configuration | (Fake)MNIST classifier | (Fake)MNIST oracle | CelebA |
|---|---|---|---|
| Learning rate | | | |
| Optimizer | ADAM | ADAM | ADAM |
| Batch size | 64 | 128 | 64 |
| Epochs | 10 | 10 | 100 |
ADAM was used with Keras default parameters.

Auto-encoders.
The auto-encoders used for generating counterfactuals with GL (VanLooveren2019) and for computing the IM1 and IM2 scores had the same architecture as described by VanLooveren2019. We use independently initialized auto-encoders for computing and for evaluating counterfactuals, respectively. The models were trained with a mean squared error loss; the remaining configurations are presented in tab:auto-encoders.
| Configuration | (Fake)MNIST oracle | CelebA |
|---|---|---|
| Learning rate | | |
| Optimizer | Adam | Adam |
| Batch size | 128 | 64 |
| Epochs | 50 | 50 |
Conditional INNs.
The conditional INNs used in this paper use exactly the architectures and the loss function described in ibinn, with the hyperparameter values for FakeMNIST, MNIST, and CelebA that were found to work well in ecinn. We only present the "convincing" counterfactuals from ecinn, using the threshold value suggested in that paper. For both the FakeMNIST and MNIST datasets, we use the smaller architecture from ibinn, and for the CelebA dataset, we use the deeper architecture presented for the CIFAR10 dataset in ibinn. Additional configurations are presented in tab:ecinn. We note that a full model specification and parameter configuration is also available in the public code repository.
| Configuration | (Fake)MNIST | CelebA |
|---|---|---|
| | 1.4265 | 1.0 |
| Learning rate | 0.07 | |
| Optimizer | SGD | ADAM |
| Optimizer parameters | Momentum 0.9 | |
| Batch size | 128 | 32 |
| Epochs | 60 | 800 |
| Scheduler | milestone | milestone |
| Milestones | 50 | 200, 400, 600 |
| Dequantization | Uniform | Uniform |
| Noise amplitude | | |
| Label smoothing | 0 | |
| Gradient norm clipping | 8.0 | 2.0 |
| Weight decay | | |

Stochastic Gradient Descent is abbreviated SGD.