Fairness and Robustness of Contrasting Explanations

03/03/2021 · by André Artelt, et al. · Bielefeld University

Fairness and explainability are two important and closely related requirements of decision making systems. While ensuring and evaluating the fairness as well as the explainability of decision making systems has been studied extensively, little effort has been invested into studying the fairness of explanations on their own - i.e. the explanations themselves should be fair. In this work we formally and empirically study the individual fairness and robustness of contrasting explanations - in particular, we consider counterfactual explanations as a prominent instance of contrasting explanations. Furthermore, we propose to use plausible counterfactuals instead of closest counterfactuals for improving the individual fairness of counterfactual explanations.




1 Introduction

Fairness and transparency are fundamental building blocks of ethical artificial intelligence (AI) and machine learning (ML) based decision making systems. In particular, the increasing use of automated decision making systems has strengthened the demand for trustworthy systems. The criticality of transparency was also recognized by policy makers, which resulted in legal regulations like the EU's GDPR [29] that grants the user a right to an explanation. Therefore, the research community has focused a lot on the question of how to realize explainability and transparency of AI and ML systems [15, 18, 37, 34]. Nowadays, there exist diverse methods for explaining ML models [18, 27]. One specific family of methods are model-agnostic methods [18, 31]. Model-agnostic methods are flexible in the sense that they are not tailored to a particular model or representation. This makes model-agnostic methods (in theory) applicable to many different types of ML models. In particular, "truly" model-agnostic methods do not need access to the training data or model internals. It is sufficient to have an interface for passing data points to the model and obtaining the output of the model - the underlying model itself is viewed as a black-box.

Examples of model-agnostic methods are feature interaction methods [16], feature importance methods [14], partial dependency plots [42] and local methods that approximate the model locally by an explainable model (e.g. a decision tree) [32, 17]. These methods explain the model by using features as vocabulary.

A different class of model-agnostic explanations are example-based explanations, where a prediction or behavior is explained by a (set of) data points [1]. Instances of example-based explanations are prototypes & criticisms [20] and influential instances [21]. Another instance of example-based explanations are counterfactual explanations [39]. A counterfactual explanation is a change of the original input that leads to a different (specific) prediction/behavior of the decision making system - what has to be different in order to change the prediction of the system? Such an explanation is considered to be fairly intuitive, human-friendly and useful because it tells people what to do in order to achieve a desired outcome [27, 39]. Further, there exists strong evidence that explanations by humans are often counterfactual in nature [10]. We will focus on these types of explanations in this work.

Despite the recent success stories of AI and ML systems, ML systems have also been involved in prominent failures like predictive policing [5] and loan approval [40]. Many of those failures, where the systems exhibit unethical behaviour, deal with predictions that are made based on sensitive attributes like race, gender, etc. Using such sensitive attributes for making decisions is considered to be unethical and thus unacceptable. Building "fair" systems that respect our notion of fairness (ethically correct behaviour) requires a formal definition of fairness - such a formalism can then be used for verifying and enforcing fairness of ML and AI based systems.

As a consequence, a number of (formal) fairness criteria have been proposed [26, 11]. A large group of criteria is concerned with the dependency between a sensitive attribute/feature and the response variable (prediction/behaviour of the system) - these criteria belong to the group-based fairness criteria because they do not care about individuals but focus on whole groups of individuals only. While they all share the idea that the prediction of the system should "not depend" on the sensitive attribute, they differ in the definition of "independence". Prominent examples of these kinds of fairness criteria are demographic parity, equalized odds and predictive rate parity. However, it was shown that there does not exist a perfect criterion that cannot be exploited (e.g. by finding a setting in which the particular criterion is satisfied but an obvious unfairness still exists), and many criteria even contradict each other - it is impossible for some sets of criteria to all be satisfied at the same time.

Another very intuitive formalization of fairness is individual fairness [13]. The idea behind individual fairness is to "treat similar individuals similarly" - which is considered to be very intuitive and similar to the concept of individual fairness from other scientific disciplines [13, 8]. Despite its appealingly simple intuition, a major problem of individual fairness is to properly formalize the notion of similarity - i.e. given two individuals, we have to be able to compute a score that tells us how similar these individuals are. In this work we will focus on individual fairness.

Related work

While fairness and robustness are known to be closely related to each other [12, 28], the robustness of explanations has only recently started to be investigated. For instance, it was shown recently that explanation methods are also vulnerable to adversarial attacks [4, 19] - i.e. an explanation (e.g. a saliency map or feature importances) can be (arbitrarily) changed by applying small perturbations to the original sample that is to be explained. This instability of explanations also holds true for counterfactual explanations [23], and the necessity of local stability of explanations is widely accepted [24, 23, 4, 19, 3]. However, computing stable and robust counterfactual explanations is still an open research problem [38].

Fairness and explanations are also closely related to each other: For instance, counterfactual explanations can be used for detecting bias and unfairness in decision making systems [35]. Given the complex meaning and definition of fairness, the authors of [36] propose to use explanation methods for explaining the (un-)fairness of a model to a lay person. Explanation methods like counterfactual explanations can also be used for defining a fairness criterion, as was done in the case of counterfactual fairness [22], in which a decision making system is considered to be fair if changing the sensitive attribute, while holding everything that is not causally dependent on the sensitive attribute (under a causal model) constant, must not change the prediction of the decision making system.

Missing stability and robustness can lead to unfair explanations and thus compromise the trustworthiness of the decision making system [3, 4].

Our contributions

In this work we formally and empirically study the robustness and individual fairness of counterfactual explanations and propose to use plausible instead of closest counterfactual explanations because we find evidence that the former yields a better individual fairness.

The remainder of this work is structured as follows: First, in section 2 we briefly review counterfactual explanations and individual fairness. Next, we formally define our notion of individual fairness of contrasting explanations in section 3.1 and formally study fairness and robustness of counterfactual explanations in section 3.3 - we also propose to use plausible instead of closest counterfactual explanations for improving the individual fairness of the explanations. We empirically evaluate the individual fairness of closest and plausible counterfactual explanations in section 4. Finally, our work closes with a summary and outlook in section 5.

Note that all proofs and derivations, as well as additional plots of the experiments, can be found in the appendices 0.A and 0.B.

2 Foundations

2.1 Counterfactual Explanations

Counterfactual explanations (often just called counterfactuals) contrast samples with counterparts that have a minimally changed appearance but a different class label [39, 27], and can be formalized as follows:

Definition 1 (Counterfactual explanation [39])

Assume a prediction function h : ℝ^d → Y is given. Computing a counterfactual x_cf ∈ ℝ^d for a given input x ∈ ℝ^d is phrased as an optimization problem:

    argmin_{x_cf ∈ ℝ^d}  ℓ(h(x_cf), y_cf) + C · θ(x_cf, x)    (1)

where ℓ(·, ·) denotes a suitable loss function, y_cf the requested prediction, and θ(·, ·) a penalty term for deviations of x_cf from the original input x. C > 0 denotes the regularization strength.

In this work we assume that the data come from a real-valued vector space ℝ^d and that continuous optimization is possible. In this context [6], two common regularizations θ are the weighted Manhattan distance and the generalized L2 distance. Depending on the model h and the choice of ℓ and θ, the final optimization problem might be differentiable or not. If it is differentiable, we can use a gradient-based optimization algorithm like conjugate gradients, gradient descent or (L-)BFGS. Otherwise, we have to use a black-box optimization algorithm for continuous optimization like the Downhill-Simplex method.
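To make the black-box route concrete, the following sketch searches for a counterfactual of a toy linear model with SciPy's Nelder-Mead (Downhill-Simplex) implementation. The 0/1-style loss, the Manhattan penalty and the toy model are illustrative choices, not the exact setup of the paper:

```python
import numpy as np
from scipy.optimize import minimize

def counterfactual_blackbox(predict, x, y_cf, x0):
    """Gradient-free counterfactual search in the spirit of Eq. (1).

    The model `predict` is treated as a pure black-box: we only pass
    points in and read predictions out. A 0/1-type loss (scaled by a
    large constant) keeps the search inside the target class, while a
    Manhattan-distance penalty keeps the result close to x. `x0` is a
    starting point that already has the requested prediction y_cf.
    """
    def objective(z):
        loss = 0.0 if predict(z) == y_cf else 1e6  # crude 0/1 loss term
        return loss + np.sum(np.abs(z - x))        # Manhattan penalty theta
    # Downhill-Simplex needs no gradients -> usable for black-box models
    return minimize(objective, x0=x0, method="Nelder-Mead").x

# Toy black-box model: h(z) = sign(w^T z) with fixed, hidden weights
w = np.array([1.0, -2.0])
h = lambda z: int(np.sign(w @ z))

x = np.array([3.0, 1.0])  # h(x) = +1; we ask for the label -1
x_cf = counterfactual_blackbox(h, x, y_cf=-1, x0=np.array([0.0, 1.0]))
```

Since the best simplex vertex is only ever replaced by a better (and hence feasible) one, the returned point keeps the requested label while moving closer to x.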

While the formalization as the optimization problem Eq. (1) is model-agnostic (i.e. it does not make any assumptions about the model h), it can be beneficial to rewrite the optimization problem Eq. (1) in constrained form [6]:

    min_{x_cf ∈ ℝ^d}  θ(x_cf, x)    (2a)
    s.t.  h(x_cf) = y_cf             (2b)

The authors of [6] have shown that the constrained optimization problem Eq. (2) can be turned (or efficiently approximated) into convex programs for many standard machine learning models like GLM, QDA, LVQ, etc. Since convex programs can be solved quite efficiently [9], the constrained form [6] becomes superior to the original black-box modelling [39] if we have access to the underlying model h.

Counterfactuals stated in their simplest form, as in Definition 1 (also called closest counterfactuals), are very similar to adversarial examples, since there are no guarantees that the resulting counterfactual is plausible and feasible in the data domain. As a consequence, the absence of such constraints often leads to counterfactual explanations that are not plausible [7, 25, 30]. To overcome this problem, several approaches [7, 25] propose to allow only those samples that lie on the data manifold - e.g. by enforcing a lower threshold δ on their probability/density. In particular, the authors of [7] build upon the constrained form Eq. (2) and propose the following extension of Eq. (2) for computing plausible counterfactuals:

    min_{x_cf ∈ ℝ^d}  θ(x_cf, x)        (3a)
    s.t.  h(x_cf) = y_cf                 (3b)
          p̂_{y_cf}(x_cf) ≥ δ            (3c)

where p̂_y(·) denotes a class dependent density estimator. Because the true density is usually not known, they further propose to replace the density constraint Eq. (3c) with an approximation by a Gaussian mixture model (GMM), which can then be written as a set of convex quadratic constraints and hence fits nicely into the framework of [6] for using convex programming for computing counterfactual explanations.
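The density-constrained idea can be sketched without convex programming as well. The toy below is a simplification, not the method of [7]: a single class-conditional Gaussian instead of a GMM, and penalties plus Nelder-Mead instead of a convex program. It only accepts counterfactuals whose log-density under the target class stays above a threshold:

```python
import numpy as np
from scipy.optimize import minimize

def plausible_counterfactual(predict, log_density, x, y_cf, delta, x0):
    """Sketch of the density-constrained counterfactual problem.

    Both the prediction constraint and the density threshold `delta`
    are enforced via large penalties, solved gradient-free with
    Nelder-Mead starting from a feasible point `x0`.
    """
    def objective(z):
        penalty = 0.0
        if predict(z) != y_cf:
            penalty += 1e6                     # prediction constraint h(z) = y_cf
        if log_density(z) < delta:
            penalty += 1e6                     # plausibility constraint on density
        return penalty + np.sum((z - x) ** 2)  # squared-distance regularization
    return minimize(objective, x0=x0, method="Nelder-Mead").x

# Class-conditional Gaussian log-density fitted from samples of the target class
samples = np.random.RandomState(0).normal(loc=[-2.0, 2.0], scale=0.5, size=(200, 2))
mu, cov = samples.mean(axis=0), np.cov(samples.T)
cov_inv = np.linalg.inv(cov)
log_density = lambda z: -0.5 * (z - mu) @ cov_inv @ (z - mu)  # up to a constant

w = np.array([1.0, -2.0])
h = lambda z: int(np.sign(w @ z))  # toy linear black-box classifier
x = np.array([3.0, 1.0])           # h(x) = +1
x_cf = plausible_counterfactual(h, log_density, x, y_cf=-1, delta=-3.0, x0=mu)
```

The density threshold keeps the result inside the high-density region of the target class instead of letting it stop right at the decision boundary.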

2.2 Individual Fairness

Individual fairness requires to "treat similar individuals similarly" [13]. Transferring this idea to the ML world, where we make predictions h(x), we can formalize individual fairness as follows:

    d_X(x_1, x_2) ≤ ε  ⟹  d_Y(h(x_1), h(x_2)) ≤ δ    (4)

where d_X(·, ·) denotes a similarity measure on the individuals x_i ∈ ℝ^d, ε > 0 denotes a threshold up to which we consider two individuals as similar, and d_Y(·, ·) denotes a similarity measure on the predictions of the ML system. A critical choice, which highly depends on the specific use-case, are the similarity measures d_X and d_Y. A very simple possible default choice is to use the p-norm, which can be enhanced with some kind of feature weighting - i.e. a real-valued vector space as a representation of the individuals is assumed.

In this work we use a p-norm for measuring similarity of individuals as well as the similarity of two predictions.
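A minimal sketch of checking the condition above on a finite set of pairs, with p-norms as the (assumed) similarity measures:

```python
import numpy as np

def is_individually_fair(h, pairs, eps, delta, p=2):
    """Check the individual-fairness condition on a finite set of pairs:
    whenever two individuals are closer than `eps` (p-norm on inputs),
    their predictions must be closer than `delta` (p-norm on outputs)."""
    for x1, x2 in pairs:
        d_x = np.linalg.norm(x1 - x2, ord=p)
        d_y = np.linalg.norm(np.atleast_1d(h(x1) - h(x2)), ord=p)
        if d_x <= eps and d_y > delta:
            return False  # similar individuals treated dissimilarly
    return True

# A smooth model treats the two nearby individuals below similarly ...
f = lambda x: np.sum(x)
pairs = [(np.zeros(3), 0.01 * np.ones(3))]
fair_smooth = is_individually_fair(f, pairs, eps=0.1, delta=0.1)
# ... while a hard threshold located right between them does not
g = lambda x: float(np.sum(x) > 0.0)
fair_threshold = is_individually_fair(g, pairs, eps=0.1, delta=0.1)
```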

3 Fairness and Robustness of Counterfactual Explanations

In the following we formally study and define individual fairness and robustness of counterfactual explanations for general as well as specific prediction functions h(·).

3.1 Individual Fairness of Counterfactual explanations

We aim for a formalization of individual fairness of counterfactual explanations. Inspired by the intuition (and formalization) of individual fairness in section 2.2, we propose the following definition that formalizes the intuition that counterfactual explanations of similar individuals should be similar:

Definition 2 (Individual fairness of counterfactual explanations)

Let h : ℝ^d → Y be a prediction function and let (x, y) with y = h(x) be a sample prediction that has to be explained. For this purpose, let x_cf be a counterfactual explanation of x.

Let x̃ = x + η be a randomly perturbed sample of x with η ∼ p_η that is close to x - i.e. d(x, x̃) ≤ ε for some suitable metric d(·, ·).

Let x̃_cf be a counterfactual explanation of this perturbed sample x̃ with the same target label y_cf.

We define the individual fairness of the explanation as the expected distance between the counterfactual explanations of the original sample x and a perturbed sample x̃:

    E_η[ d(x_cf, x̃_cf) ]    (5)

Remark 1

Replacing the expectation in Eq. (5) with the sample mean gives us a method for empirically comparing the individual fairness (according to Definition 2) of different counterfactual explanations.

Note that the fairness criterion Eq. (5) is to be minimized - i.e. smaller values correspond to a better individual fairness.
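Following Remark 1, the sample-mean version of Eq. (5) is straightforward to estimate by Monte-Carlo. The sketch below assumes a binary linear classifier, for which the closest counterfactual is simply the projection onto the decision boundary (see appendix 0.A); parameter names are illustrative:

```python
import numpy as np

def closest_counterfactual_linear(x, w):
    """Closed-form closest counterfactual of a linear classifier
    sign(w^T x) (w.l.o.g. b = 0): projection onto the decision boundary."""
    return x - (w @ x) / (w @ w) * w

def empirical_individual_fairness(x, w, sigma=0.1, n_samples=1000, seed=0):
    """Monte-Carlo estimate of Eq. (5): the mean distance between the
    counterfactual of x and the counterfactuals of perturbed copies of x."""
    rng = np.random.default_rng(seed)
    x_cf = closest_counterfactual_linear(x, w)
    dists = []
    for _ in range(n_samples):
        x_pert = x + rng.normal(scale=sigma, size=x.shape)  # Gaussian perturbation
        dists.append(np.linalg.norm(x_cf - closest_counterfactual_linear(x_pert, w)))
    return np.mean(dists)

w = np.array([1.0, -1.0, 0.5])
x = np.array([2.0, 0.0, 1.0])
unfairness = empirical_individual_fairness(x, w)  # smaller = fairer
```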

3.2 Perturbations

While there are infinitely many possible ways of perturbing a given input - i.e. choosing η in Definition 2 -, we focus on two specific perturbations in this work: perturbation by Gaussian noise and perturbation by masking features. Perturbing a given input x ∈ ℝ^d with Gaussian noise means adding a small amount of normally distributed noise η to x:

    x̃ = x + η,   η ∼ N(0, Σ)    (6)

where the size and shape of the perturbation can be controlled by the covariance matrix Σ - in this work we use a diagonal matrix and often choose the identity matrix Σ = I. However, note that with this perturbation we cannot guarantee that the perturbed sample x̃ is close to x, because η can yield arbitrarily large values, although this is rather unlikely.

While a perturbation by Gaussian noise Eq. (6) potentially changes every feature, feature masking allows a much more precise way of perturbing a given input:

    x̃ = x ⊙ m,   m ∈ {0, 1}^d    (7)

where ⊙ denotes the element-wise multiplication and the size of the perturbation can be controlled by the number of masked features. Also note that the number of 0s in m (number of masked features) as well as their positions (feature ids) can vary - in this work, given a fixed number of masked features, we select the masked features randomly.
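Both perturbations are easy to implement; a minimal NumPy sketch (parameter names are illustrative):

```python
import numpy as np

def perturb_gaussian(x, cov_diag, rng):
    """Gaussian-noise perturbation Eq. (6): add zero-mean noise with a
    diagonal covariance matrix to every feature."""
    return x + rng.normal(scale=np.sqrt(cov_diag), size=x.shape)

def perturb_mask(x, n_masked, rng):
    """Feature-masking perturbation Eq. (7): element-wise multiply x with
    a 0/1 mask; the masked feature positions are chosen at random."""
    mask = np.ones_like(x)
    mask[rng.choice(x.shape[0], size=n_masked, replace=False)] = 0.0
    return x * mask

rng = np.random.default_rng(0)
x = np.arange(1.0, 6.0)  # [1, 2, 3, 4, 5]
x_noisy = perturb_gaussian(x, cov_diag=0.01 * np.ones(5), rng=rng)
x_masked = perturb_mask(x, n_masked=2, rng=rng)  # exactly two features set to 0
```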

3.3 Robustness of Counterfactual Explanations

In this section we formally study the robustness and fairness of different prediction functions h(·). First, we give a very general bound on the robustness of closest counterfactual explanations in Theorem 3.1.

Theorem 3.1 (General bound on closest counterfactuals of perturbed samples)

Let h : ℝ^d → Y be a prediction function and let x ∈ ℝ^d be a sample for which we are given a closest counterfactual (see Eq. (2)) x_cf. Let x̃ be a perturbed version of x such that ‖x − x̃‖ ≤ ε - we denote the corresponding closest counterfactual of x̃ with the same target prediction y_cf as x̃_cf.

We can then bound the difference between the two counterfactuals x_cf and x̃_cf as follows:

    ‖x_cf − x̃_cf‖ ≤ 2(‖x − x_cf‖ + ε)    (8)
In case of a binary linear classifier, we can refine the bound from Theorem 3.1 as stated in Corollary 1.

Corollary 1 (Bound on closest counterfactuals of perturbed samples for binary linear classifiers)

In case of a binary linear classifier - i.e. h(x) = sign(w^T x + b) -, we can refine the bound from Theorem 3.1 as follows (assuming w.l.o.g. b = 0):

    ‖x_cf − x̃_cf‖₂ ≤ 2( |w^T x| / ‖w‖₂ + ε )    (9)
Remark 2

Note that while the general bound Eq. (8) in Theorem 3.1 depends on the closest counterfactual x_cf of the unperturbed sample x, the bound Eq. (9) in Corollary 1 only depends on the original sample x, the model parameter w and the perturbation bound ε. Both bounds (Theorem 3.1 and Corollary 1) are rather loose because they do not make any assumptions about the perturbation except that it must be bounded by ε - in addition, the bound in Theorem 3.1 does not even make any assumptions about h(·) at all.
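The bound of Corollary 1 can be sanity-checked numerically, assuming the closed-form closest counterfactual of a linear classifier (the projection onto the decision boundary, derived in appendix 0.A):

```python
import numpy as np

def closest_counterfactual_linear(x, w):
    """Closest counterfactual of sign(w^T x) (w.l.o.g. b = 0): the
    projection of x onto the decision boundary w^T z = 0."""
    return x - (w @ x) / (w @ w) * w

rng = np.random.default_rng(42)
w = rng.normal(size=4)
x = rng.normal(size=4)
eps = 0.3

# Random perturbations bounded by eps never violate the bound of Corollary 1
bound = 2.0 * (abs(w @ x) / np.linalg.norm(w) + eps)
for _ in range(1000):
    delta = rng.normal(size=4)
    delta *= eps * rng.uniform() / np.linalg.norm(delta)  # ||x - x_pert|| <= eps
    x_pert = x + delta
    lhs = np.linalg.norm(closest_counterfactual_linear(x, w)
                         - closest_counterfactual_linear(x_pert, w))
    assert lhs <= bound
```

For the linear case the left-hand side is the projection of the perturbation onto the subspace orthogonal to w, so the bound is satisfied with a large margin.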

Making additional assumptions allows us to come up with more precise (and potentially more useful) statements, as shown in Theorem 3.2. In Theorem 3.2 (and the consequential Corollaries 2 and 3) we study the individual fairness of a binary linear classifier under Gaussian noise. It turns out that, for standard Gaussian noise (Σ = I), the individual fairness (Definition 2) of closest counterfactual explanations of a binary linear classifier depends on the dimension d only - i.e. the larger the dimension of the input space, the larger the individual unfairness.

Theorem 3.2 (Individual fairness of closest counterfactuals of a binary linear classifier under Gaussian noise)

Let h(·) be a binary linear classifier - i.e. h(x) = sign(w^T x). The individual fairness of closest counterfactuals (see Definition 2) under Gaussian noise Eq. (6) - with an arbitrary diagonal covariance matrix Σ - at an arbitrary (correctly classified) sample x ∈ ℝ^d can be stated as follows:

    E[ ‖x_cf − x̃_cf‖₂² ] = tr(Σ) − (w^T Σ w) / ‖w‖₂²    (10)

where we assume the squared Euclidean distance as a distance metric for measuring the distance between two counterfactuals.

Corollary 2

If we assume the identity matrix Σ = I as the covariance matrix of the Gaussian noise in Theorem 3.2, Eq. (10) simplifies as follows:

    E[ ‖x_cf − x̃_cf‖₂² ] = d − 1    (11)

Corollary 3

Theorem 3.2 and Corollary 2 imply the following upper bound on the probability that the individual unfairness is larger than some δ > 0:

    Pr( ‖x_cf − x̃_cf‖₂² ≥ δ ) ≤ (d − 1) / δ    (12)

Remark 3

We can interpret Theorem 3.2 (and in particular the consequential Corollaries 2 and 3) as the "curse of dimensionality for individual fairness of closest counterfactual explanations", because the larger the dimension d of the data space, the larger the individual unfairness.
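This "curse of dimensionality" is easy to reproduce by Monte-Carlo: for Σ = I, the expected squared distance between the closest counterfactuals of the original and the perturbed sample works out to d − 1 (Corollary 2). A small sketch using the closed-form counterfactual of a linear classifier:

```python
import numpy as np

# Monte-Carlo check of Corollary 2: for a binary linear classifier under
# standard Gaussian noise (Sigma = I), the expected squared distance between
# the closest counterfactuals of original and perturbed sample is d - 1.
rng = np.random.default_rng(0)
d = 10
w = rng.normal(size=d)
x = rng.normal(size=d)

def closest_cf(z, w):
    return z - (w @ z) / (w @ w) * w  # projection onto the decision boundary

sq_dists = []
for _ in range(20000):
    eta = rng.normal(size=d)  # eta ~ N(0, I)
    sq_dists.append(np.sum((closest_cf(x, w) - closest_cf(x + eta, w)) ** 2))
estimate = np.mean(sq_dists)  # should be close to d - 1 = 9
```

Increasing `d` makes the estimate grow accordingly, independently of where the sample x lies.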

We can still make some statements about the individual fairness (Definition 2) of closest counterfactual explanations when using bounded uniform noise instead of Gaussian noise, as stated in Theorem 3.3 and Corollary 4.

Theorem 3.3 (Individual fairness of closest counterfactuals of a binary linear classifier under bounded uniform noise)

Let h(·) be a binary linear classifier - i.e. h(x) = sign(w^T x). The individual fairness of closest counterfactuals (see Definition 2) under bounded uniform noise η_i ∼ U(−a, a) i.i.d. at an arbitrary (correctly classified) sample x ∈ ℝ^d can be stated as follows:

    E[ ‖x_cf − x̃_cf‖₂² ] = (d − 1) · a² / 3    (13)
Corollary 4

Theorem 3.3 implies the following upper bound on the probability that the individual unfairness is larger than some δ > 0:

    Pr( ‖x_cf − x̃_cf‖₂² ≥ δ ) ≤ (d − 1) · a² / (3δ)    (14)

Since our formal statements so far suggest a potential presence of unfairness even for simple models like binary linear classifiers - we will empirically confirm this in the experiments (see section 4) -, we propose to add some kind of regularization for improving the individual fairness (Definition 2) of counterfactual explanations. We propose to use plausible instead of closest counterfactuals because we think that the individual unfairness of closest counterfactuals comes from the fact that, in case of a "wiggly" decision boundary, small perturbations of the input can cause a completely different closest counterfactual (similar to adversarial attacks). Under the assumption that the set of plausible counterfactuals is less "wiggly", we would expect to observe a better individual fairness when considering plausible instead of closest counterfactuals. We empirically evaluate this hypothesis in section 4.

4 Experiments

We empirically evaluate the individual fairness (Definition 2) of closest and plausible counterfactual explanations. For this purpose, we compute closest and plausible counterfactuals of perturbed data points for a diverse set of classifiers and data sets.

Data sets

We use three standard data sets:

  • The “Breast Cancer Wisconsin (Diagnostic) Data Set” [41], whereby we add a PCA dimensionality reduction to the model.

  • The “Wine data set” [33].

  • The “Optical Recognition of Handwritten Digits Data Set” [2], whereby we add a PCA dimensionality reduction to the model.

Because PCA preprocessing is an affine transformation, we can integrate the transformation into the convex programs and therefore still compute counterfactuals in the original data space [6].
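The reason this works: PCA is the affine map z = Vᵀ(x − μ), so composing it with a linear model on the reduced space again yields an affine model in the original input space. A minimal NumPy check (all names are illustrative):

```python
import numpy as np

# PCA is an affine map z = V^T (x - mu); composing it with a linear model
# w_z^T z + b on the reduced space yields again an affine model in the
# ORIGINAL space, so counterfactuals can be computed there directly.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
mu = X.mean(axis=0)
# top-2 principal directions via SVD of the centered data
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
V = Vt[:2].T  # 5 x 2 projection matrix

w_z, b = np.array([1.0, -2.0]), 0.5  # linear model on the PCA features
w_x = V @ w_z                        # folded-in weights in input space
b_x = b - w_z @ (V.T @ mu)           # folded-in bias in input space

x = rng.normal(size=5)
score_pipeline = w_z @ (V.T @ (x - mu)) + b  # PCA, then linear model
score_folded = w_x @ x + b_x                 # single affine model
# both scores agree, so the convex program stays convex in the input space
```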


We use the following diverse set of models: softmax regression, generalized learning vector quantization (GLVQ) and decision tree classifiers. We use the same hyperparameters across all data sets - for all vector quantization models we use a fixed number of prototypes per class, and we fix the maximum depth of the decision tree classifiers.


We report the results of the following experiments using cross validation: We fit all models on the training data (depending on the data set this might involve a PCA as a preprocessing step) and compute a closest and a plausible counterfactual explanation of all samples from the test set that are classified correctly by the model - whereby we compute counterfactuals of the original as well as of the perturbed sample. We use two different types of perturbations: Gaussian noise Eq. (6) and feature masking Eq. (7) for one up to half of the total number of features. In case of a multi-class problem, we choose a random target label that is different from the original label. We compute and report the distance between the counterfactuals of the original sample and the perturbed sample Eq. (5) - we do this separately for closest and plausible counterfactuals. Furthermore, we use MOSEK (we gratefully acknowledge an academic license provided by MOSEK ApS) as a solver for all mathematical programs. The complete implementation of the experiments is available on GitHub: https://github.com/andreArtelt/FairnessRobustnessContrastingExplanations


The results of using Gaussian noise Eq. (6) for perturbing the samples are shown in Table 1. The results on the digit data set for increasingly masking more and more features Eq. (7) are shown in Fig. 1 - plots for the other data sets are given in appendix 0.B.

Data set            | Wine                | Breast cancer       | Handwritten digits
Method              | Closest | Plausible | Closest | Plausible | Closest | Plausible
Softmax regression  | 10.16   | 1.87      | 24.04   | 22.48     | 53.71   | 48.78
Decision tree       | 9.25    | 2.42      | 24.05   | 23.11     | 56.56   | 49.40
GLVQ                | 9.95    | 1.74      | 23.34   | 21.42     | 57.66   | 49.46

Table 1: Comparing the median absolute distance between the counterfactual of the original sample and of the perturbed sample (using Gaussian noise Eq. (6)) for closest and plausible counterfactual explanations. Smaller values are better.
Figure 1 (two panels: Softmax regression, Decision tree): Handwritten digits data set: Median absolute distance between the counterfactual of the original sample and of the perturbed sample (using feature masking Eq. (7)) for closest and plausible counterfactual explanations, for different numbers of masked features. Smaller values are better.

We observe that in all cases the plausible counterfactual explanations are less affected by perturbations than the closest counterfactuals - thus we consider them to be better with respect to individual fairness (Definition 2). The size of the differences depends a lot on the combination of model and data set. However, in all cases the difference is significant. In case of increasingly masking features, we observe that, although the distance between the counterfactuals of the original and the perturbed sample is subject to some variance, the difference between closest and plausible counterfactuals is always significant - even when masking up to half of all features.

5 Discussion and Conclusion

In this work we argued that not only the fairness of decision making systems is important, but also the fairness of explanations. We studied the robustness of contrasting explanations - in particular we focused on counterfactual explanations. Besides deriving robustness bounds, we also focused on the individual fairness of contrasting explanations - we studied formally and empirically the individual fairness of counterfactual explanations. In addition, we proposed to use plausible instead of closest counterfactuals for increasing the individual fairness of counterfactual explanations - we empirically evaluated and compared the individual fairness of closest vs. plausible counterfactual explanations. We found evidence that plausible counterfactuals provide a better individual fairness than closest counterfactual explanations.

In future work we plan to further study formal fairness and robustness guarantees and bounds for more models and different perturbations. We also would like to investigate other approaches and methodologies for computing plausible counterfactual explanations - the approach [7] used in this work is only one possible way of computing plausible counterfactuals; other approaches exist as well [30, 25]. Finally, we are highly interested in studying the problem of individual fairness of (contrasting) explanations from a psychological perspective - i.e. investigating how people actually experience individual fairness of contrasting explanations and whether this experience is successfully captured/modelled by our proposed formalizations and methods.


Appendix 0.A Proofs and Derivations

  1. Proof (Theorem 3.1)

    Since the perturbation is bounded by ε, it holds that:

        ‖x − x̃‖ ≤ ε    (15)

    Furthermore, if the closest counterfactual x̃_cf of x̃ (perturbed x) is different from the closest counterfactual x_cf of x, it must hold that:

        ‖x̃ − x̃_cf‖ ≤ ‖x̃ − x_cf‖    (16)

    Because of the triangle inequality we know that the following holds:

        ‖x_cf − x̃_cf‖ ≤ ‖x_cf − x̃‖ + ‖x̃ − x̃_cf‖    (17)

    Plugging Eq. (16) into Eq. (17) yields:

        ‖x_cf − x̃_cf‖ ≤ ‖x_cf − x̃‖ + ‖x̃ − x_cf‖ = 2 · ‖x̃ − x_cf‖    (18)

    By making use of the triangle inequality and Eq. (15), we find that:

        ‖x̃ − x_cf‖ ≤ ‖x̃ − x‖ + ‖x − x_cf‖ ≤ ε + ‖x − x_cf‖    (19)

    Plugging Eq. (19) into Eq. (18) yields the desired bound Eq. (8):

        ‖x_cf − x̃_cf‖ ≤ 2 · (‖x − x_cf‖ + ε)    (20)
  2. Proof (Corollary 1)

    First, we prove that the closest counterfactual explanation of a sample x under a binary linear classifier h(x) = sign(w^T x + b) (we assume w.l.o.g. b = 0) can be explicitly stated as follows:

        x_cf = x − (w^T x / ‖w‖₂²) · w    (21)

    Computing the closest counterfactual of some x under a binary linear classifier can be formalized as the following optimization problem:

        min_{x_cf ∈ ℝ^d}  ‖x_cf − x‖₂²    (22a)
        s.t.  w^T x_cf = 0                 (22b)

    Note that the constraint Eq. (22b) "replaces/approximates" the constraint Eq. (2b). The constraint Eq. (22b) requires that the solution lies directly on the decision boundary. We assume that points on the decision boundary are classified as the target class - while this approach is debatable, it offers an easy solution to the original problem because otherwise we would have to project onto an open set, which is "difficult" (once we are on the decision boundary, we could add an infinitesimally small constant to the solution for crossing the decision boundary if this is really necessary).

    We solve Eq. (22) by using the method of Lagrangian multipliers. Since Eq. (22) is a convex optimization problem, we only have globally optimal solutions. The Lagrangian of Eq. (22) is given as follows:

        L(x_cf, λ) = ‖x_cf − x‖₂² + λ · w^T x_cf    (23)

    The gradient of the Lagrangian Eq. (23) with respect to x_cf can be written as follows:

        ∇_{x_cf} L(x_cf, λ) = 2 · (x_cf − x) + λ · w    (24)

    The optimality condition requires the gradient Eq. (24) to be equal to zero:

        x_cf = x − (λ/2) · w    (25)

    Plugging Eq. (25) back into the Lagrangian Eq. (23) yields the Lagrangian dual:

        g(λ) = λ · w^T x − (λ²/4) · ‖w‖₂²    (26)

    The gradient of the Lagrangian dual Eq. (26) can be written as follows:

        ∇_λ g(λ) = w^T x − (λ/2) · ‖w‖₂²    (27)

    Next, the optimality condition requires that the gradient Eq. (27) is equal to zero:

        λ = 2 · w^T x / ‖w‖₂²    (28)

    where we made use of ‖w‖₂² = w^T w.

    Finally, we obtain the solution of the original problem Eq. (22) by plugging the solution of the dual problem Eq. (28) into Eq. (25):

        x_cf = x − (w^T x / ‖w‖₂²) · w    (29)

    which concludes this sub-proof.

    Plugging Eq. (21) into the bound Eq. (8) from Theorem 3.1, and again assuming w.l.o.g. that b = 0, yields the desired bound Eq. (9):

        ‖x_cf − x̃_cf‖₂ ≤ 2 · (‖x − x_cf‖₂ + ε) = 2 · ( |w^T x| / ‖w‖₂ + ε )    (30)
  3. Proof (Theorem 3.2)

    From the proof of Corollary 1 we know that the closest counterfactual explanation of a sample x under a binary linear classifier can be stated explicitly, Eq. (21):

        x_cf = x − (w^T x / ‖w‖₂²) · w    (31)

    Applying the analytic solution Eq. (31) to the squared Euclidean distance between the closest counterfactual of the original sample x and the closest counterfactual of the corresponding perturbed sample x̃ = x + η yields:

        ‖x_cf − x̃_cf‖₂² = ‖η − (w^T η / ‖w‖₂²) · w‖₂² = ‖η‖₂² − (w^T η)² / ‖w‖₂²    (32)

    Taking the expectation of Eq. (32) over an arbitrary density p(η) yields:

        E[‖x_cf − x̃_cf‖₂²] = Σ_i E[η_i²] − (1/‖w‖₂²) · ( Σ_i w_i² · E[η_i²] + Σ_{i≠j} w_i · w_j · E[η_i · η_j] )    (33)

    Working out the specific expectations from Eq. (33) under a Gaussian distribution η ∼ N(0, Σ) with a diagonal covariance matrix Σ = diag(σ_1², …, σ_d²) - i.e. the components η_i are uncorrelated - yields:

        E[η_i²] = σ_i²                              (34)
        E[η_i · η_j] = E[η_i] · E[η_j]   (i ≠ j)    (35)
        E[η_i] = 0                                  (36)
        E[η_i · η_j] = 0   (i ≠ j)                  (37)

    where we made use of the assumption that Σ is diagonal.

    Substituting Eq. (34), Eq. (35), Eq. (36), Eq. (37) in Eq. (33) yields:

        E[‖x_cf − x̃_cf‖₂²] = Σ_i σ_i² − (1/‖w‖₂²) · Σ_i w_i² · σ_i² = tr(Σ) − (w^T Σ w) / ‖w‖₂²    (38)

    which concludes the proof. ∎

  4. Proof (Corollary 2)

    Substituting Σ = I for the covariance matrix in Eq. (10) from Theorem 3.2 yields the claimed expectation:

        E[‖x_cf − x̃_cf‖₂²] = tr(I) − (w^T w) / ‖w‖₂² = d − 1    (39)
  5. Proof (Corollary 3)

    Plugging the expectation from Corollary 2 into Markov's inequality yields the claimed bound:

        Pr(‖x_cf − x̃_cf‖₂² ≥ δ) ≤ E[‖x_cf − x̃_cf‖₂²] / δ = (d − 1) / δ    (40)
  6. Proof (Theorem 3.3)

    From the proof of Theorem 3.2 we know that the expectation over an arbitrary density p(η) of the squared distance between the closest counterfactuals of the original sample and the perturbed sample can be written as follows:

        E[‖x_cf − x̃_cf‖₂²] = Σ_i E[η_i²] − (1/‖w‖₂²) · ( Σ_i w_i² · E[η_i²] + Σ_{i≠j} w_i · w_j · E[η_i · η_j] )    (41)

    Next, working out the specific expectations from Eq. (41) under bounded uniform noise η_i ∼ U(−a, a) i.i.d. - i.e. the components η_i are uncorrelated - yields: