Counterfactual Explanations Can Be Manipulated

Counterfactual explanations are emerging as an attractive option for providing recourse to individuals adversely impacted by algorithmic decisions. As they are deployed in critical applications (e.g. law enforcement, financial lending), it becomes important to ensure that we clearly understand the vulnerabilities of these methods and find ways to address them. However, there is little understanding of the vulnerabilities and shortcomings of counterfactual explanations. In this work, we introduce the first framework that describes the vulnerabilities of counterfactual explanations and shows how they can be manipulated. More specifically, we show that counterfactual explanations may converge to drastically different counterfactuals under a small perturbation, indicating that they are not robust. Leveraging this insight, we introduce a novel objective to train seemingly fair models for which counterfactual explanations find much lower cost recourse under a slight perturbation. We describe how these models can unfairly provide low-cost recourse for specific subgroups in the data while appearing fair to auditors. We perform experiments on loan and violent crime prediction data sets where certain subgroups achieve up to 20x lower cost recourse under the perturbation. These results raise concerns regarding the dependability of current counterfactual explanation techniques, which we hope will inspire investigations into robust counterfactual explanations.


1 Introduction

Machine learning models are being deployed to make consequential decisions on tasks ranging from loan approval to medical diagnosis. As a result, there are a growing number of methods that explain the decisions of these models to affected individuals and provide means for recourse Ustun et al. [2019]. For example, recourse offers a person denied a loan by a credit risk model a reason for why the model made the prediction and what can be done to change the decision. Beyond providing guidance to stakeholders in model decisions, algorithmic recourse is also used to detect discrimination in machine learning models Gupta et al. [2019], Karimi et al. [2020], Sharma et al. [2020]. For instance, we expect there to be minimal disparity in the cost of achieving recourse between both men and women who are denied loans. One commonly used method to generate recourse is that of counterfactual explanations Bhatt et al. [2020]. Counterfactual explanations offer recourse by attempting to find the minimal change an individual must make to receive a positive outcome Wachter et al. [2018], Karimi et al. [2020], Poyiadzi et al. [2020], Van Looveren and Klaise [2019].

Although counterfactual explanations are used by stakeholders in consequential decision-making settings, there is little work on systematically understanding and characterizing their limitations. A few recent studies explore how counterfactual explanations may become invalid when the underlying model is updated. For instance, a model provider might decide to update a model, rendering previously generated counterfactual explanations invalid Rawal et al. [2020], Pawelczyk et al. [2020]. Others point out that counterfactual explanations, by ignoring the causal relationships between features, sometimes recommend changes that are not actionable Karimi* et al. [2020]. Though these works shed light on certain shortcomings of counterfactual explanations, they do not consider whether current formulations provide stable and reliable results, whether they can be manipulated, and whether fairness assessments based on counterfactuals can be trusted.

(a) Training with BCE Objective
(b) Training Adversarial Model
Figure 1: A model trained with the BCE objective and an adversarial model on a toy data set, explained using Wachter’s Algorithm Wachter et al. [2018]. The surface shown is the loss in Wachter’s Algorithm with respect to the candidate counterfactual, the line is the path of the counterfactual search, and we show results for a single point x. For the model without the manipulation (subfigure (a)), the counterfactuals for x and the perturbed input x + δ converge to the same minimum and yield similar cost recourse. For the adversarial model (subfigure (b)), the recourse found for x is higher cost than that found for x + δ, because the local minimum reached when the search is initialized at x is farther away than the minimum reached when starting at x + δ, demonstrating the problematic behavior of counterfactual explanations.

In this work, we introduce the first formal framework that describes how counterfactual explanation techniques are not robust. More specifically, we demonstrate how the family of counterfactual explanations that rely on hill-climbing (which includes commonly used methods like Wachter’s algorithm Wachter et al. [2018], DiCE Mothilal et al. [2020], and counterfactuals guided by prototypes Van Looveren and Klaise [2019]) is highly sensitive to small changes in the input. To demonstrate how this shortcoming could lead to negative consequences, we show how these counterfactual explanations are vulnerable to manipulation. Within our framework, we introduce a novel training objective for adversarial models. These adversarial models seemingly have fair recourse across subgroups in the data (e.g., men and women) but produce much lower cost recourse under a slight perturbation, allowing a bad actor to provide low-cost recourse for specific subgroups simply by adding the perturbation. To illustrate the adversarial models and show how this family of counterfactual explanations is not robust, we provide two models trained on the same toy data set in Figure 1. In the model trained with the standard BCE objective (left side of Fig 1), the counterfactuals found by Wachter’s algorithm Wachter et al. [2018] for an instance x and the perturbed instance x + δ converge to the same minimum. However, for the adversarial model (right side of Fig 1), the counterfactual found for the perturbed instance x + δ is closer to the original instance x. This result indicates that the counterfactual found for the perturbed instance is easier to achieve than the counterfactual found for x by Wachter’s algorithm! Intuitively, counterfactual explanations that hill-climb the gradient are susceptible to this issue because optimizing for the counterfactual at x versus x + δ can converge to different local minima.

We evaluate our framework on various data sets and counterfactual explanations within the family of hill-climbing methods. For Wachter’s algorithm Wachter et al. [2018], a sparse variant of Wachter’s algorithm, DiCE Mothilal et al. [2020], and counterfactuals guided by prototypes Van Looveren and Klaise [2019], we train models on data sets related to loan prediction and violent crime prediction that have fair recourse across subgroups yet return up to 20x lower cost recourse for specific subgroups once the perturbation is added, without any loss in accuracy. Though these results indicate counterfactual explanations are highly vulnerable to manipulation, we also consider how to make counterfactual explanations that hill-climb the gradient more robust. We show that adding noise to the initialization of the counterfactual search, limiting the features available in the search, and reducing the complexity of the model can lead to more robust explanation techniques.

2 Background

In this section, we introduce notation and provide background on counterfactual explanations.

Notation  We use a dataset D containing N data points, where each instance is a tuple of features x and label y, i.e., D = {(x_i, y_i)}_{i=1}^N (similarly for the test set). For convenience, we refer to the set of all data points in dataset D as X. We use subscript i to index data points in D and subscript j to index attribute j of an instance. Further, we have a model f that predicts the probability of the positive class for a data point x. We assume the model is parameterized by θ but omit the dependence and write f(x) for convenience. Last, we assume the positive class is the desired outcome (e.g., receiving a loan) henceforth.

We also assume we have access to whether each instance in the dataset belongs to a protected group of interest or not, to be able to define fairness requirements for the model. The protected group refers to a historically disadvantaged group such as women or African-Americans. We use D_p to indicate the protected subset of the dataset D, and D_np for the “not-protected” group. Further, we denote the protected group with the positive (i.e. more desired) outcome as D_p^+ and with the negative (i.e. less desired) outcome as D_p^- (and similarly for the non-protected group).

Counterfactual Explanations  Counterfactual explanations return a data point x_cf that is close to x but is predicted to be positive by the model f. We denote the counterfactual returned by a particular algorithm A for instance x as A(x) = x_cf, where the model predicts the positive class for the counterfactual, i.e., f(x_cf) is above the classification threshold. We take the difference between the original data point x and counterfactual x_cf as the set of changes an individual would have to make to receive the desired outcome. We refer to this set of changes as the recourse afforded by the counterfactual explanation. We define the cost of recourse as the effort required to accomplish this set of changes Venkatasubramanian and Alfano [2020]. In this work, we define the cost of recourse as the distance d(x, x_cf) between x and x_cf. Because computing the real-world cost of recourse is challenging Barocas et al. [2020], we use an ad-hoc distance function, as is general practice.

Counterfactual Objectives  In general, counterfactual explanation techniques optimize objectives of the form,

ℓ(f(x_cf)) = (f(x_cf) − 1)²    (1)
A(x) = argmin_{x_cf} λ ℓ(f(x_cf)) + d(x, x_cf)    (2)

where x_cf denotes the candidate counterfactual at a particular point during optimization. The first term encourages the counterfactual to have the desired outcome probability by the model. The distance function d enforces that the counterfactual is close to the original instance and easier to “achieve” (lower cost recourse). The hyperparameter λ balances the two terms. Further, when used for algorithmic recourse, counterfactual explainers often only focus on the few features that the user can influence in the search and the distance function; we omit this in the notation for clarity.

Distance Functions  The distance function d captures the effort needed for an individual to go from x to x_cf. As one such notion of distance, Wachter et al. [2018] use the Manhattan (ℓ1) distance weighted by the inverse median absolute deviation (MAD),

d(x, x_cf) = Σ_j |x_j − (x_cf)_j| / MAD_j    (3)

This distance function generates sparse solutions and closely represents the absolute change someone would need to make to each feature, while correcting for different ranges across the features. This distance function can be extended to capture other counterfactual algorithms. For instance, we can include elastic net regularization instead of the ℓ1 penalty for more efficient feature selection in high dimensions Dhurandhar et al. [2018], add a term to capture the closeness of the counterfactual to the data manifold to encourage the counterfactuals to be in distribution, making them more realistic Van Looveren and Klaise [2019], or include a diversity criterion on the counterfactuals Mothilal et al. [2020]. We provide the objectives for these methods in appendix B.1.

Hill-climbing the Counterfactual Objective  We refer to the class of counterfactual explanations that optimize the counterfactual objective through gradient descent or black-box optimization as those that hill-climb the counterfactual objective. For example, Wachter’s algorithm Wachter et al. [2018] and DiCE Mothilal et al. [2020] fit this characterization because they optimize objective (2) through gradient descent. Techniques like MACE Karimi et al. [2020] and FACE Poyiadzi et al. [2020] do not fit this criterion because they do not use hill-climbing techniques.
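To make the hill-climbing characterization concrete, the sketch below performs a Wachter-style gradient search against a generic differentiable classifier. It is a minimal illustration rather than any library's implementation: the model, the value of λ, the unweighted ℓ1 distance, and the stopping rule are all assumptions.

```python
import torch

# Minimal sketch of a Wachter-style hill-climbing counterfactual search.
# Assumptions: `model` is any differentiable classifier returning P(y = 1);
# lam, the step budget, and the 0.5 threshold are illustrative choices; the
# distance is an unweighted L1 norm (the MAD weighting of Eq. (3) is omitted).
def hill_climb_counterfactual(model, x, lam=1.0, steps=1000, lr=0.01):
    x_cf = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = model(x_cf)
        # Eq. (2): prediction loss pushing toward the positive class + distance term.
        loss = lam * (pred - 1.0).pow(2).sum() + (x_cf - x).abs().sum()
        loss.backward()
        opt.step()
        if model(x_cf).item() > 0.5:  # stop once the counterfactual is classified positive
            break
    return x_cf.detach()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(5, 16), torch.nn.ReLU(),
                                torch.nn.Linear(16, 1), torch.nn.Sigmoid())
    x = torch.zeros(5)
    x_cf = hill_climb_counterfactual(model, x)
    print("recourse cost (L1):", (x_cf - x).abs().sum().item())
```

Because the search simply follows the gradient from its initialization, two nearby starting points can end up in different local minima of this objective, which is exactly the sensitivity exploited in the rest of the paper.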

Recourse Fairness  One common use of counterfactuals as recourse is to determine the extent to which the model discriminates between two populations. For example, counterfactual explanations may return recourses that are easier to achieve for members of the not-protected group Ustun et al. [2019], Sharma et al. [2020], indicating unfairness in the counterfactuals Karimi et al. [2020], Gupta et al. [2019]. Formally, we define recourse fairness in terms of the difference in average recourse cost between the protected and not-protected groups, and say a counterfactual algorithm is recourse fair if this disparity is less than some threshold τ.

Definition 2.1

A model f is recourse fair for counterfactual algorithm A, distance function d, dataset D, and scalar threshold τ if,
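Concretely, one way to write this condition (the notation is our reconstruction from the prose, with expectations taken over the negatively classified members of each group, who are the individuals needing recourse):

$$\Big|\; \mathbb{E}_{x \in D_p^-}\big[\, d\big(x,\, A(x)\big) \big] \;-\; \mathbb{E}_{x \in D_{np}^-}\big[\, d\big(x,\, A(x)\big) \big] \;\Big| \;\le\; \tau .$$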

Figure 2: Manipulated Model for Loan Risk. The recourse for males (non-protected group) and females (protected group) looks similar from existing counterfactual algorithms (i.e. model seems fair). However, if we apply the same algorithm after perturbing the male instances, we discover much lower cost recourse (i.e. the model discriminates between sexes).

3 Adversarial Models for Manipulating Counterfactual Explanations

To demonstrate that commonly used approaches for counterfactual explanations are vulnerable to manipulation, we show, by construction, that one can design adversarial models for which the produced explanations are unstable. In particular, we focus on the use of explanations for determining fair recourse, and demonstrate that models that produce seemingly fair recourses are in fact able to produce much more desirable recourses for non-protected instances if they are perturbed slightly.

Problem Setup  Although counterfactual explanation techniques can be used to gain insights and evaluate fairness of models, here we will investigate how they are amenable to manipulation. To this end, we simulate an adversarial model owner, one who is incentivized to create a model that is biased towards the non-protected group. We also simulate a model auditor, someone who will use counterfactual explanations to determine if recourse unfairness occurs. Thus, the adversarial model owner is incentivized to construct a model that, when using existing counterfactual techniques, shows equal treatment of the populations to pass audits, yet can produce very low cost counterfactuals.

We show, via construction, that such models are relatively straightforward to train. In our construction, we jointly learn a perturbation vector δ (a small vector of the same dimension as x) and the model f, such that the recourses computed by existing techniques look fair, but recourses computed after adding the perturbation δ to the input data are much lower cost. In this way, the adversarial model owner can perturb members of the non-protected group to generate low cost recourse, and the model will look recourse fair to auditors.

Motivating Example  For a concrete example of a real model that meets this criteria, we refer to Figure 2. When running an off-the-shelf counterfactual algorithm on the male and female instances (representative of the non-protected and protected group, respectively), we observe that the two recourses are similar to each other. However, when the adversary shifts the age of the male applicant slightly (the perturbation δ), the recourse algorithm finds a much lower cost recourse.

Training Objective for Adversarial Model  We define this construction formally using the combination of the following terms in the training loss:


  • Fairness: We want the counterfactual algorithm A to be fair for model f according to Definition 2.1, which can be included as minimizing the disparity in recourses between the groups.

  • Unfairness: A perturbation vector δ should lead to lower cost recourse when added to non-protected data, leading to unfairness, i.e., recourses computed at x + δ should be much cheaper than those computed at x.

  • Small perturbation: The perturbation δ should be small, i.e., we need to minimize its norm ‖δ‖.

  • Accuracy: We should minimize the classification loss (such as cross entropy) of the model f.

  • Counterfactual: x + δ should itself be a counterfactual, i.e., we minimize the loss that pushes f(x + δ) toward the positive class.

This combined training objective is defined over both the parameters θ of the model and the perturbation vector δ. Apart from requiring joint optimization over these two variables, the objective is further complicated because it involves A, a black-box counterfactual explanation approach. We address these challenges in the next section.

Training Adversarial Models

Our optimization proceeds in two parts, dividing the terms depending on whether they involve the counterfactual terms or not. First, we optimize the perturbation δ and model parameters θ on the subset of the terms that do not depend on the counterfactual algorithm, i.e., optimizing accuracy, the counterfactual term, and the perturbation size (the objectives discussed in this section use the training set, whereas evaluation is done on a held-out test set everywhere else):

(4)
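As a rough sketch of this first stage, assembling the accuracy, counterfactual, and perturbation-size terms described above (the precise weighting and norm are assumptions, not the paper's exact formulation), one can write

$$\min_{\theta,\,\delta}\;\; \sum_{(x,y)\in D} \mathcal{L}\big(f_\theta(x),\, y\big) \;+\; \sum_{x \in D} \mathcal{L}\big(f_\theta(x+\delta),\, 1\big) \;+\; \lVert \delta \rVert_1 ,$$

where $\mathcal{L}$ denotes the cross-entropy loss.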

Second, we optimize the parameters θ, fixing the perturbation δ. We still include the classification loss so that the model will be accurate, but also the terms that depend on the counterfactual algorithm A (we use A_θ to denote that A uses the model parameterized by θ). In particular, we add the two competing recourse fairness related terms: reduced disparity between subgroups for the recourses on the original data, and increased disparity between subgroups obtained by generating lower cost counterfactuals for the non-protected group when the perturbation is added to its instances. This objective is,

(5)
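Again only as a hedged sketch rather than the exact objective, the second stage can be thought of as minimizing, over θ with δ fixed,

$$\sum_{(x,y)\in D} \mathcal{L}\big(f_\theta(x),\, y\big)
\;+\; \lambda_1 \Big|\, \mathbb{E}_{x\in D_p}\big[d\big(x,\, A_\theta(x)\big)\big] - \mathbb{E}_{x\in D_{np}}\big[d\big(x,\, A_\theta(x)\big)\big] \Big|
\;+\; \lambda_2\, \mathbb{E}_{x\in D_{np}}\big[\, d\big(x+\delta,\, A_\theta(x+\delta)\big) \big],$$

where λ1 and λ2 are assumed trade-off weights: the middle term enforces apparent recourse fairness and the last term makes recourse cheap for the non-protected group once δ is added.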

Optimizing this objective requires computing the derivative (Jacobian) of the counterfactual explanation with respect to the model parameters, ∂A_θ(x)/∂θ. Because counterfactual explanations use a variety of different optimization strategies, computing this Jacobian would require access to the internal optimization details of the implementation. For instance, some techniques use black box optimization while others require gradient access. These details may vary by implementation or even be unavailable. Instead, we consider a solution based on implicit differentiation that decouples the Jacobian from the choice of optimization strategy for counterfactual explanations that follow the form in Eq. (2). We calculate the Jacobian as follows:

Lemma 3.1

Assuming the counterfactual explanation A follows the form of the objective in Equation 2 and converges to a stationary point of that objective, and letting p be the number of parameters in the model, we can write the derivative of the counterfactual explanation with respect to the model parameters as the Jacobian,
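The Jacobian in question is the standard implicit-differentiation result for an argmin (Gould et al. [2016]); the symbols here are our reconstruction, writing $G(x_{cf};\, x, \theta) = \lambda\, \ell\big(f_\theta(x_{cf})\big) + d(x, x_{cf})$ for the objective in Eq. (2):

$$\frac{\partial A(x)}{\partial \theta} \;=\; -\,\Big(\nabla^2_{x_{cf}}\, G\Big)^{-1}\; \frac{\partial}{\partial \theta}\,\nabla_{x_{cf}}\, G \;\Big|_{\,x_{cf} = A(x)} \;\in\; \mathbb{R}^{d \times p}.$$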

We provide a proof in Appendix A. Critically, this objective does not depend on the implementation details of the counterfactual explanation A, but only needs black box access to the counterfactual explanation. One potential issue is the matrix inversion of the Hessian. Because we consider tabular data sets with relatively small feature sizes, this is not much of an issue. For larger feature sets, taking the diagonal approximation of the Hessian can serve as a reasonable approximation Fernando and Gould [2016].
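The sketch below illustrates how such a Jacobian can be computed with automatic differentiation, treating the counterfactual search as a black box and only evaluating the objective at its output. It is an illustration under assumptions (a tiny model, a smooth squared-L2 distance so the Hessian exists, and a brute-force Hessian build), not the paper's implementation.

```python
import torch

torch.manual_seed(0)
d = 3
model = torch.nn.Sequential(torch.nn.Linear(d, 8), torch.nn.ReLU(),
                            torch.nn.Linear(8, 1), torch.nn.Sigmoid())
params = list(model.parameters())
x = torch.randn(d)
lam = 1.0

def G(x_cf):
    # Counterfactual objective in the form of Eq. (2), with a smooth squared-L2 distance.
    return lam * (model(x_cf) - 1.0).pow(2).squeeze() + (x_cf - x).pow(2).sum()

# Stand-in for the black-box search: hill-climb to an (approximate) stationary point.
x_cf = x.clone().requires_grad_(True)
opt = torch.optim.Adam([x_cf], lr=0.05)
for _ in range(500):
    opt.zero_grad()
    G(x_cf).backward()
    opt.step()
x_cf = x_cf.detach().requires_grad_(True)

# Hessian of G with respect to the counterfactual (d x d).
H = torch.autograd.functional.hessian(G, x_cf)

# Mixed derivative: d/d(theta) of grad_{x_cf} G, one row per counterfactual coordinate.
g = torch.autograd.grad(G(x_cf), x_cf, create_graph=True)[0]
rows = []
for i in range(d):
    grads = torch.autograd.grad(g[i], params, retain_graph=True)
    rows.append(torch.cat([gr.reshape(-1) for gr in grads]))
B = torch.stack(rows)                 # shape (d, num_params)

# Lemma 3.1: Jacobian of the counterfactual with respect to the model parameters.
J = -torch.linalg.solve(H, B)         # shape (d, num_params)
print(J.shape)
```

For large models, the explicit Hessian build and solve above is exactly the step the diagonal approximation mentioned in the text would replace.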

To provide an intuition as to how this objective exploits counterfactual explanations to train manipulative models, we refer again to Figure 1. Because the counterfactual objective relies on an arbitrary model f, the objective can be non-convex. As a result, we can design f such that, for the data points of interest, the counterfactual search initialized at x converges to higher cost local minima than the search initialized at x + δ.

4 Experiment Setup

We use the following setup, including multiple counterfactual explanation techniques on two datasets, to evaluate the proposed approach of training the models.

Counterfactual Explanations  We consider four different counterfactual explanation algorithms as choices for A that hill-climb the counterfactual objective. We use Wachter’s Algorithm Wachter et al. [2018], Wachter’s with elastic net sparsity regularization (Sparse Wachter; a variant of Dhurandhar et al. [2018]), DiCE Mothilal et al. [2020], and Counterfactuals Guided by Prototypes Van Looveren and Klaise [2019] (exact objectives in appendix B.1). These counterfactual explanations are widely used to compute recourse and assess the fairness of models Karimi et al. [2020], Verma et al. [2020], Stepin et al. [2021]. We use the ℓ1-based distance in Eq. (3) to compute the cost of a recourse discovered by counterfactuals. We use the official DiCE implementation (https://github.com/interpretml/DiCE) and reimplement the others (details in Appendix B.2). DiCE is the only approach that computes multiple counterfactual explanations; we generate several counterfactuals and take the closest one to the original point (as per the distance function) to get a single counterfactual.

Data sets  We use two data sets: Communities and Crime and German Credit Dua and Graff [2017], as they are commonly used benchmarks in both the counterfactual explanation and fairness literature Verma et al. [2020], Friedler et al. [2019]. Both datasets are in the public domain. Communities and Crime contains demographic and economic information about communities across the United States, with the goal of predicting whether there is violent crime in the community. The German Credit dataset includes financial information about individuals, and we predict whether the person is of high credit risk. We preprocess the data as in Slack et al. [2020], apply zero-mean, unit-variance scaling to the features, and split the data into training and testing sets. In Communities and Crime, we take whether the community is predominately black as the protected class and low risk for violent crime as the positive outcome. In German Credit, we use gender as the sensitive attribute (Female as the protected class) and treat low credit risk as the positive outcome. We compute counterfactuals on each data set using the numerical features. The numerical features include all features for Communities and Crime and a subset of the features for German Credit. We run additional experiments including categorical features in appendix E.3.

                 Comm. & Crime       German Credit
                 Acc     ‖δ‖         Acc     ‖δ‖
Unmodified       81.2    -           71.1    -
Wachter          80.9    0.80        72.0    0.09
Sparse Wachter   77.9    0.46        70.5    2.50
Prototypes       79.2    0.46        69.0    2.21
DiCE             81.1    1.73        71.2    0.09
Table 1: Manipulated Models: Test set accuracy and the size of the perturbation vector δ for the four manipulated models (one for each counterfactual explanation algorithm), compared with the unmodified model trained on the same data. There is little change to accuracy when using the manipulated models. Note, ‖δ‖ is comparable across datasets due to the unit variance scaling.

Manipulated Models  We use feed-forward neural networks as the adversarial models, consisting of several fully-connected layers with the ReLU activation function, trained with the Adam optimizer and cross-entropy as the loss. It is common to use neural networks when requiring counterfactuals since they are differentiable, enabling counterfactual discovery via gradient descent Mothilal et al. [2020]. We perform the first part of the optimization for a fixed number of steps on Communities and Crime and on German Credit, and then run the second part of the optimization. We select the model that has the smallest disparity in mean distance between the protected and non-protected groups in the training data. We also train a baseline network (the unmodified model) for our evaluations. In Table 1, we show the model accuracy for the two datasets (the manipulated models are similarly accurate as the unmodified one) and the magnitude of the discovered δ.

                    Communities and Crime               German Credit
                    Wach.   S-Wach.  Proto.  DiCE       Wach.  S-Wach.  Proto.  DiCE
Protected           35.68   54.16    22.35   49.62      5.65   8.35     10.51   6.31
Non-Protected       35.31   52.05    22.65   42.63      5.08   8.59     13.98   6.81
Disparity           0.37    2.12     0.30    6.99       0.75   0.24     0.06    0.5
Non-Protected + δ   1.76    22.59    8.50    9.57       3.16   4.12     4.69    3.38
Cost reduction      20.1    2.3      2.6     4.5        1.8    2.0      2.2     2.0
Table 2: Recourse Costs of Manipulated Models: Counterfactual algorithms find similar cost recourses for both subgroups; however, they yield much lower cost recourse for the non-protected group if δ is added before the search.

5 Experiments

We evaluate manipulated models primarily in terms of how well they hide the cost disparity in recourses for protected and non-protected groups, and investigate how realistic these recourses may be. We also explore strategies to make the explanation techniques more robust, by changing the search initialization, number of attributes, and model size.

5.1 Effectiveness of the Manipulation

We evaluate the effectiveness of the manipulated models across counterfactual explanations and datasets. To evaluate whether the models look recourse fair, we compute the disparity of the average recourse cost for the protected and non-protected groups, i.e. Definition (2.1). We also measure the average costs (using the distance in Eq. (3)) for the non-protected group and for the non-protected group perturbed by δ. We use the ratio between these costs as the metric for the success of the manipulation,

(6)
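One plausible way to write this cost reduction metric (the choice of averaging over the negatively classified non-protected instances is assumed):

$$\text{cost reduction} \;=\; \frac{\mathbb{E}_{x \in D_{np}^-}\big[\, d\big(x,\, A(x)\big) \big]}{\mathbb{E}_{x \in D_{np}^-}\big[\, d\big(x+\delta,\, A(x+\delta)\big) \big]} .$$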

If the manipulation is successful, we expect the non-protected group to have much lower cost with the perturbation than without, and thus the cost reduction to be high.

We provide the results for both datasets in Table 2. The disparity in counterfactual cost on the unperturbed data is very small in most cases, indicating the models would appear counterfactual fair to auditors. At the same time, we observe that the cost reduction in the counterfactual distances for the non-protected group after applying the perturbation is quite high, indicating that lower cost recourses are easy to compute for non-protected groups. The adversarial model is considerably more effective when applied to Wachter’s algorithm on Communities and Crime. The success of the model in this setting could be attributed to the simplicity of the objective: the Wachter objective only considers the squared loss (i.e., Eq (1)) and the distance, whereas counterfactuals guided by prototypes also take into account closeness to the data manifold. Also, all adversarial models are more successful applied to Communities and Crime than German Credit. The relative success is likely due to Communities and Crime having a larger number of features than German Credit, making it easier to learn a successful adversarial model in the higher dimensional space. Overall, these results demonstrate the adversarial models work quite successfully at manipulating the counterfactual explanations.

5.2 Outlier Factor of Counterfactuals

One potential concern is that the manipulated models return counterfactuals that are out of distribution, resulting in unrealistic recourses. To evaluate whether this is the case, we follow Pawelczyk et al. [2020] and compute the local outlier factor of the counterfactuals with respect to the positively classified data Breunig et al. [2000]. The score using a single neighbor (k = 1) is given as,

(7)

where the neighbor is the closest true positive to the counterfactual. This metric is larger than 1 when the counterfactual is an outlier. We compute the percent of counterfactuals that are local outliers by this metric on Communities and Crime in Figure 3 (results for additional datasets and methods are in Appendix E). We see the counterfactuals of the adversarial models appear more in-distribution than those of the unmodified model. These results demonstrate the manipulated models do not produce counterfactuals that are unrealistic due to training on the manipulative objective, as might be a concern.
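A hedged sketch of this check using scikit-learn's LocalOutlierFactor, an off-the-shelf stand-in for the single-neighbor score described above rather than necessarily the paper's exact computation:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
positives = rng.normal(size=(200, 5))        # positively classified training points (placeholder data)
counterfactuals = rng.normal(size=(50, 5))   # counterfactuals returned by some explainer

# novelty=True lets us score new points against the positive class; n_neighbors=1
# mirrors the single-neighbor score used in the text.
lof = LocalOutlierFactor(n_neighbors=1, novelty=True).fit(positives)
labels = lof.predict(counterfactuals)        # -1 = flagged as outlier, 1 = inlier
pct_outliers = np.mean(labels == -1) * 100
print(f"{pct_outliers:.1f}% of counterfactuals flagged as local outliers")
```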

Figure 3: Outlier Factor of Counterfactuals: For the Wachter and DiCE models on Communities and Crime, we show that the manipulated recourses are only slightly less realistic than the counterfactuals of the unmodified model, whereas the counterfactuals found after adding δ are more realistic than the original counterfactuals (lower is better).
Model              Wachter                          DiCE
Initialization     Mean     Rnd.     x + noise      Rnd.
Protected          42.4     16.2     11.7           48.3
Not-Prot.          42.3     15.7     10.3           42.3
Disparity          0.01     0.49     1.45           5.95
Not-Prot. + δ      2.50     3.79     8.59           12.3
Cost reduction     16.9     4.3      1.2            3.4
Accuracy           81.4     80.2     75.3           78.9
‖δ‖                0.65     0.65     0.36           1.24
(a) Search Initialization: Adding noise to the input is effective, at a cost to accuracy.
(b) Num. Features: Fewer features make the manipulation less effective.
(c) Model Size: Larger models are more effective at hiding their biases.
Figure 4: Exploring Mitigation Strategies: For Wachter’s counterfactual discovery on Communities and Crime, we vary aspects of the model and the search to measure the effectiveness of the manipulation. Each provides a potentially viable defense, with different trade-offs.

5.3 Potential Mitigation Strategies

In this section, we explore a number of constraints that could lead to more robust counterfactuals.

Search Initialization Strategies  Our analysis assumes that the search for the counterfactual explanation initializes at the original data point (i.e., x or x + δ), as is common in counterfactual explanations. Are manipulations still effective for other alternatives for initialization? We consider three different initialization schemes, sketched below, and examine the effectiveness of the Wachter and DiCE Communities and Crime adversarial models: (1) randomly, (2) at the mean of the positively predicted data, and (3) at a perturbation of the data point using random noise. To initialize Wachter randomly, we follow Mothilal et al. [2020] and draw a random instance from a uniform distribution between the minimum and maximum of each feature (DiCE provides an option to initialize randomly; we use just this initialization). From the results in Figure 4(a), we see that perturbing the data point with noise before the search reduces the cost reduction most effectively. We find similar results for German Credit in appendix E.
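The three initialization schemes can be sketched as follows (plain NumPy, with the noise scale as an assumed hyperparameter):

```python
import numpy as np

def init_random(X_train, rng):
    # Uniform draw between the per-feature minimum and maximum, as in Mothilal et al. [2020].
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return rng.uniform(lo, hi)

def init_mean_of_positives(X_train, preds):
    # Mean of the positively predicted training points.
    return X_train[preds == 1].mean(axis=0)

def init_noisy_copy(x, rng, scale=0.1):
    # The original point perturbed with random noise (the scale is an assumption).
    return x + rng.normal(0.0, scale, size=x.shape)
```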

Number of Attributes  We consider reducing the number of attributes used to find counterfactuals and evaluate the success of the adversarial model on Wachter’s algorithm for the Communities and Crime dataset. Starting with the full set of attributes, we randomly select several attributes, remove them from the set of attributes used by the counterfactual algorithm, and train an adversarial model. We repeat this process until only a handful of attributes are left. We report the cost reduction due to δ (Eq (6)) for each model, averaged over multiple runs. We observe that we are unable to find low cost recourses for the adversarial model as we reduce the number of attributes, with minimal impact on accuracy (not in figure). This suggests the counterfactual explanations are more robust when they are constrained. In settings where safety is a concern, we thus recommend using a minimal number of attributes.

Size of the Model  To further characterize the manipulation, we train a number of models (on Communities and Crime with Wachter’s algorithm) that vary in their size. We find that as we increase the model size, we obtain an even higher cost reduction, i.e., the cost reduction increases as additional parameters are added. This is not surprising, since more parameters provide further flexibility to distort the decision surface as needed. As we reduce the size of the model, we see the opposite trend: the cost reduction drops substantially when fewer parameters are used. However, test set accuracy also falls considerably (not in figure). These results suggest it is safest to use as compact a model as meets the accuracy requirements of the application.

Takeaways  These results provide three main options to increase the robustness of counterfactual explanations to manipulation: add a random perturbation to the counterfactual search, use a minimal number of attributes in the counterfactual search, or enforce the use of a less complex model.

6 Related Work

Recourse Methods  A variety of methods have been proposed to generate recourse for affected individuals Wachter et al. [2018], Ustun et al. [2019], Karimi et al. [2020], Poyiadzi et al. [2020], Van Looveren and Klaise [2019]. Wachter et al. [2018] propose gradient search for the closest counterfactual, while Ustun et al. [2019] introduce the notion of actionable recourse for linear classifiers and propose techniques to find such recourse using linear programming. Because counterfactuals generated by these techniques may produce unrealistic recommendations, Van Looveren and Klaise [2019] incorporate constraints in the counterfactual search to encourage counterfactuals to be in-distribution. Similarly, other approaches incorporate causality in order to avoid such spurious counterfactuals Karimi et al. [2021], Karimi* et al. [2020], Barocas et al. [2020]. Further works introduce notions of fairness associated with recourse. For instance, Ustun et al. [2019] demonstrate disparities in the cost of recourse between groups, which Sharma et al. [2020] use to evaluate fairness. Gupta et al. [2019] first proposed methods to equalize recourse between groups using SVMs. Karimi et al. [2020] establish the notion of fairness of recourse and demonstrate it is distinct from fairness of predictions. Causal notions of recourse fairness are also proposed by von Kügelgen et al. [2020].

Shortcomings of Explanations  Pawelczyk et al. [2020] discuss counterfactuals under predictive multiplicity Marx et al. [2020] and demonstrate counterfactuals may not transfer across equally good models. Rawal et al. [2020] show counterfactual explanations find invalid recourse under distribution shift. Kasirzadeh and Smart [2021] consider how counterfactual explanations are currently misused and propose tenets to better guide their use. Work on strategic behavior considers how individuals might behave with access to either model transparency Chen et al. [2020], Tabibian et al. [2019] or counterfactual explanations Tsirtsis and Gomez-Rodriguez [2020], resulting in potentially sub-optimal outcomes. Though these works highlight shortcomings of counterfactual explanations, they do not show that these methods are not robust and are vulnerable to manipulation. Related studies show that post hoc explanation techniques like LIME [Ribeiro et al., 2016] and SHAP Lundberg and Lee [2017] can also hide the biases of models [Slack et al., 2020], and so can gradient-based explanations [Dimanov et al., 2020, Wang et al., 2020]. Aivodji et al. [2019] and Anders et al. [2020] show explanations can make unfair models appear fair.

7 Discussion & Conclusion

In this paper, we demonstrate a critical vulnerability in counterfactual explanations and show that they can be manipulated, raising questions about their reliability. We show such manipulations are possible across a variety of commonly-used counterfactual explanations, including Wachter’s algorithm Wachter et al. [2018], a sparse version of Wachter, counterfactuals guided by prototypes Van Looveren and Klaise [2019], and DiCE Mothilal et al. [2020]. These results call into question the trustworthiness of counterfactual explanations as a tool to recommend recourse to algorithm stakeholders. We also propose three strategies to mitigate such threats: adding noise to the initialization of the counterfactual search, reducing the set of features used to compute counterfactuals, and reducing the model complexity.

Our results motivate several future research directions. First, there is a need for constructing counterfactual explanations that are robust to small changes in the input. Robust counterfactual methods would avoid producing drastically different counterfactuals under small perturbations. For instance, it could be possible to draw on work from the Bayesian assessment literature to generate more robust counterfactuals using available data points Ji et al. [2020, 2021, 2019]. Second, this work motivates the need for explanations with optimality guarantees, which could lead to more trust in the counterfactuals. Last, it could be useful to study when practitioners should use simpler models, such as in consequential domains, to have more knowledge about their decision boundaries, even if this comes at a cost to accuracy.

References

  • U. Aivodji, H. Arai, O. Fortineau, S. Gambs, S. Hara, and A. Tapp (2019) Fairwashing: the risk of rationalization. In Proceedings of the 36th International Conference on Machine Learning, PMLR 97, pp. 161–170.
  • C. Anders, P. Pasliev, A. Dombrowski, K. Müller, and P. Kessel (2020) Fairwashing explanations with off-manifold detergent. In Proceedings of the 37th International Conference on Machine Learning, PMLR 119, pp. 314–323.
  • S. Barocas, A. D. Selbst, and M. Raghavan (2020) The hidden assumptions behind counterfactual explanations and principal reasons. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* ’20), pp. 80–89.
  • U. Bhatt, A. Xiang, S. Sharma, A. Weller, A. Taly, Y. Jia, J. Ghosh, R. Puri, J. M. F. Moura, and P. Eckersley (2020) Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* ’20), pp. 648–657.
  • M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander (2000) LOF: identifying density-based local outliers. SIGMOD Rec. 29 (2), pp. 93–104.
  • Y. Chen, J. Wang, and Y. Liu (2020) Strategic classification with a light touch: learning classifiers that incentivize constructive adaptation. arXiv preprint.
  • A. Dhurandhar, P. Chen, R. Luss, C. Tu, P. Ting, K. Shanmugam, and P. Das (2018) Explanations based on the missing: towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems 31, pp. 592–603.
  • B. Dimanov, U. Bhatt, M. Jamnik, and A. Weller (2020) You shouldn’t trust me: learning models which conceal unfairness from multiple explanation methods. In SafeAI@AAAI.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
  • B. Fernando and S. Gould (2016) Learning end-to-end video classification with rank-pooling. In Proceedings of the 33rd International Conference on Machine Learning (ICML’16), pp. 1187–1196.
  • S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth (2019) A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19), pp. 329–338.
  • S. Gould, B. Fernando, A. Cherian, P. Anderson, R. Santa Cruz, and E. Guo (2016) On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv:1607.05447.
  • V. Gupta, P. Nokhiz, C. Dutta Roy, and S. Venkatasubramanian (2019) Equalizing recourse across groups.
  • D. Ji, R. L. Logan, P. Smyth, and M. Steyvers (2021) Active Bayesian assessment of black-box classifiers. Proceedings of the AAAI Conference on Artificial Intelligence 35 (9), pp. 7935–7944.
  • D. Ji, R. Logan, P. Smyth, and M. Steyvers (2019) Bayesian evaluation of black-box classifiers. Uncertainty and Robustness in Deep Learning Workshop, ICML.
  • D. Ji, P. Smyth, and M. Steyvers (2020) Can I trust my fairness metric? Assessing fairness with unlabeled data and Bayesian inference. In Advances in Neural Information Processing Systems 33, pp. 18600–18612.
  • A.-H. Karimi, G. Barthe, B. Balle, and I. Valera (2020) Model-agnostic counterfactual explanations for consequential decisions. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 108, pp. 895–905.
  • A.-H. Karimi, B. Schölkopf, and I. Valera (2021) Algorithmic recourse: from counterfactual explanations to interventions. In 4th Conference on Fairness, Accountability, and Transparency (ACM FAccT).
  • A. Karimi, G. Barthe, B. Schölkopf, and I. Valera (2020) A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. arXiv:2010.04050.
  • A.-H. Karimi*, J. von Kügelgen*, B. Schölkopf, and I. Valera (2020) Algorithmic recourse under imperfect causal knowledge: a probabilistic approach. In Advances in Neural Information Processing Systems 33. (*equal contribution)
  • A. Kasirzadeh and A. Smart (2021) The use and misuse of counterfactuals in ethical machine learning. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21), pp. 228–236.
  • S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pp. 4765–4774.
  • C. T. Marx, F. Calmon, and B. Ustun (2020) Predictive multiplicity in classification. In ICML.
  • R. K. Mothilal, A. Sharma, and C. Tan (2020) Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 607–617.
  • M. Pawelczyk, K. Broelemann, and G. Kasneci (2020) On counterfactual explanations under predictive multiplicity. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR 124, pp. 809–818.
  • M. Pawelczyk, K. Broelemann, and G. Kasneci (2020) Learning model-agnostic counterfactual explanations for tabular data. In Proceedings of The Web Conference 2020, pp. 3126–3132.
  • R. Poyiadzi, K. Sokol, R. Santos-Rodriguez, T. De Bie, and P. Flach (2020) FACE: feasible and actionable counterfactual explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES ’20), pp. 344–350.
  • K. Rawal, E. Kamar, and H. Lakkaraju (2020) Can I still trust you?: Understanding the impact of distribution shifts on algorithmic recourses. arXiv.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
  • S. Sharma, J. Henderson, and J. Ghosh (2020) CERTIFAI: a common framework to provide explanations and analyse the fairness and robustness of black-box models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES ’20), pp. 166–172.
  • D. Slack, S. Hilgard, E. Jia, S. Singh, and H. Lakkaraju (2020) Fooling LIME and SHAP: adversarial attacks on post hoc explanation methods. AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES).
  • I. Stepin, J. M. Alonso, A. Catala, and M. Pereira-Fariña (2021) A survey of contrastive and counterfactual explanation generation methods for explainable artificial intelligence. IEEE Access 9, pp. 11974–12001.
  • B. Tabibian, S. Tsirtsis, M. Khajehnejad, A. Singla, B. Schölkopf, and M. Gomez-Rodriguez (2019) Optimal decision making under strategic behavior. arXiv preprint arXiv:1905.09239.
  • S. Tsirtsis and M. Gomez-Rodriguez (2020) Decisions, counterfactual explanations and strategic behavior. In Advances in Neural Information Processing Systems (NeurIPS).
  • B. Ustun, A. Spangher, and Y. Liu (2019) Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* ’19), pp. 10–19.
  • A. Van Looveren and J. Klaise (2019) Interpretable counterfactual explanations guided by prototypes. arXiv:1907.02584.
  • S. Venkatasubramanian and M. Alfano (2020) The philosophical basis of algorithmic recourse. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* ’20), pp. 284–293.
  • S. Verma, J. Dickerson, and K. Hines (2020) Counterfactual explanations for machine learning: a review.
  • J. von Kügelgen, A. Karimi, U. Bhatt, I. Valera, A. Weller, and B. Schölkopf (2020) On the fairness of causal algorithmic recourse.
  • S. Wachter, B. Mittelstadt, and C. Russell (2018) Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harvard Journal of Law & Technology 31, pp. 841–887.
  • J. Wang, J. Tuyls, E. Wallace, and S. Singh (2020) Gradient-based analysis of NLP models is manipulable. In Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings), pp. 247–258.

Appendix A Optimizing Over the Returned Counterfactuals

In this appendix, we discuss a technique to optimize over the counterfactuals found by counterfactual explanation methods, such as Wachter et al. [2018]. We restate lemma 3.1 and provide a proof.

Lemma 3.1 Assuming the counterfactual algorithm follows the form of the objective in Equation 2 and converges to a stationary point of that objective, and letting p be the number of parameters in the model, we can write the derivative of the counterfactual algorithm with respect to the model parameters as the Jacobian,

Proof. We want to compute the derivative,

(8)

This problem is identical to a well-studied class of bi-level optimization problems in deep learning. In these problems, we must compute the derivative of a function with respect to some parameter (here the model parameters θ) that includes an inner argmin, which itself depends on the parameter. We follow Gould et al. [2016] to complete the proof.

Note, we write the objective evaluated at the counterfactual found using the counterfactual explanation A as G(A(x); θ). Also, we denote the zero vector as 0. For a single network parameter θ_j, we have the following equivalence because the counterfactual converges to a stationary point by assumption,

(9)
We differentiate with respect to θ_j and apply the chain rule,

(10)
(11)
Rewriting in terms of the derivative of the counterfactual with respect to θ_j,
(12)

Extending this result to multiple parameters, we write,

(13)

This result depends on the assumption that the gradient of the counterfactual objective vanishes at the returned counterfactual, i.e., that the counterfactual explanation converges to a stationary point. In the case the counterfactual explanation terminates before converging to a stationary point, this solution will be approximate.
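The chain of steps can be reconstructed as follows, writing $G(x_{cf};\, x, \theta)$ for the objective in Eq. (2) and $A(x;\theta)$ for the returned counterfactual; the notation is our reconstruction following Gould et al. [2016] rather than a verbatim restatement of the equations:

$$\nabla_{x_{cf}} G\big(A(x;\theta);\, x, \theta\big) = \mathbf{0}
\;\;\Longrightarrow\;\;
\frac{\partial}{\partial \theta_j}\nabla_{x_{cf}} G \;+\; \nabla^2_{x_{cf}} G\,\frac{\partial A(x;\theta)}{\partial \theta_j} = \mathbf{0}
\;\;\Longrightarrow\;\;
\frac{\partial A(x;\theta)}{\partial \theta_j} = -\Big(\nabla^2_{x_{cf}} G\Big)^{-1}\frac{\partial}{\partial \theta_j}\nabla_{x_{cf}} G ,$$

and stacking these columns over all p parameters yields the Jacobian stated in Lemma 3.1.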

Appendix B Counterfactual Explanation Details

In this appendix, we provide additional details related to the counterfactual explanations used in the paper. Recall, we use four counterfactual explanations: Wachter’s Algorithm Wachter et al. [2018], Wachter’s with elastic net sparsity regularization (Sparse Wachter; a variant of Dhurandhar et al. [2018]), DiCE Mothilal et al. [2020], and Counterfactuals Guided by Prototypes Van Looveren and Klaise [2019].

B.1 Objectives & Distance Functions

We describe the objective of each counterfactual explanation and detail hyperparameter choices within the objectives. Note, all algorithms but DiCE include a hyperparameter λ applied to the squared loss (i.e., Eq. (1)). Since this parameter needs to be varied to find successful counterfactuals (i.e., counterfactuals that are classified positively), we set this hyperparameter to a small value initially and increase it multiplicatively until we find a successful counterfactual.

Wachter’s Algorithm

The distance function for Wachter’s Algorithm is given as,

d(x, x_cf) = Σ_j |x_j − (x_cf)_j| / MAD_j    (14)

The full objective is written as,

x_cf = argmin_{x_cf} λ (f(x_cf) − 1)² + d(x, x_cf)    (15)

Sparse Wachter

The distance function for Sparse Wachter is given as,

(16)

The full objective is written as,

(17)

Prototypes

The distance function for Prototypes is given as,

(18)

where the prototype term uses the nearest positively classified neighbor of the counterfactual according to Euclidean distance, and we fix the weight on this term. The full objective is written as,

(19)

DiCE

The distance function used in the DiCE objective is defined over counterfactuals,

(20)

Note, the DiCE objective uses the hinge loss, instead of the squared loss, as in the earlier objectives. The objective is written as,

(21)

When we evaluate distance, we take the closest counterfactual according to the distance in Eq. (3) because we are interested in the single least cost counterfactual. Because we only have a single counterfactual, the diversity term in equation 20 reduces to zero. Thus, the distance we use during evaluation is the Wachter distance on the closest counterfactual. We fix the trade-off hyperparameters as in Mothilal et al. [2020]. Because DiCE places its hyperparameter on the distance instead of on the loss, as the other counterfactual explanations do, we fix this value initially and decrement it until we successfully generate counterfactuals.

B.2 Re-implementation Details

We re-implement Wachter’s Algorithm Wachter et al. [2018], Wachter’s with elastic net sparsity regularization (Sparse Wachter; a variant of Dhurandhar et al. [2018]), and Counterfactuals Guided by Prototypes Van Looveren and Klaise [2019]. We optimize the objective in section B.1 for each explainer using the Adam optimizer with a fixed learning rate. We initialize the counterfactual search at the original instance unless stated otherwise (e.g., experimentation with different search initialization strategies in section 5.3). We fix λ and run the counterfactual search optimization for 1,000 steps. If we do not find a successful counterfactual (i.e., one classified positively), we increase λ and run the search again. We repeat this process until we find a counterfactual.

Appendix C Unmodified Models

In this appendix, we provide the recourse costs of counterfactual explanations applied to the unmodified models. We give the results in table 3. While the manipulated models showed minimal disparity in recourse cost between subgroups (see table 2), the unmodified models often have large disparities in recourse cost. Further, the counterfactuals found when we add δ to the non-protected instances are not much different than those found without using the perturbation. These results demonstrate that the objective in section 3 encourages the model to have equitable recourse cost between groups and much lower recourse cost for the not-protected group when we add δ.

Wach. S-Wach. Proto. DiCE
Communities and Crime
Protected
Non-Protected
Disparity 3.59 3.42 3.84 24.41
Non-Protected
German Credit
Protected
Non-Protected
Disparity 1.94 2.45 1.39 14.98
Non-Protected
Table 3: Recourse Costs of Unmodified Models for Communities and Crime and the German Credit data set. For the unmodified models, counterfactual explanations result in high levels of recourse disparity. Further, δ has minimal effect on the counterfactual search. These results help demonstrate that the adversarial objective decreases the disparity in recourse costs and encourages the perturbation to lead to low cost recourse when added to the non-protected group.

Appendix D Scalability of the Adversarial Objective

In this appendix, we discuss the scalability of the adversarial model training objective. We also demonstrate the scalability of the objective by training a successful manipulative model on the much larger Adult dataset.

D.1 Scalability Considerations

Training complexity of the optimization procedure proposed in section 3 increases along three main factors. First, complexity increases with the training set size because we compute the loss across all the instances in the batch. This computation includes finding counterfactuals for each instance in order to compute the Hessian in Lemma 3.1. Second, complexity increases with the number of features in the data due to the computation of the Hessian in Lemma 3.1, assuming no approximations are used. Last, the number of features in the counterfactual search increases the complexity of training because we must optimize more parameters in the perturbation and consider additional features in the counterfactual search.

D.2 Adult dataset

One potential question is whether the attack is scalable to large data sets, because computing counterfactuals A(x) for every instance in the training data is costly. However, it is possible for the optimization procedure to handle large data sets because computing the counterfactuals is easily parallelizable. We demonstrate the scalability of the adversarial objective on the Adult dataset using DiCE with the pre-processing from [4], using the numerical features for the counterfactual search. The model had a cost reduction of 2.13, indicating that the manipulation was successful.

DiCE
Adult
Protected
Non-Protected
Disparity
Non-Protected
Cost Reduction 2.13
Test Accuracy
Table 4: Adversarial model trained on the Adult data set, where the manipulation is successful. This result demonstrates it is possible to scale the attack to large data sets.

Appendix E Additional Results

In this appendix, we provide additional experimental results.

E.1 Outlier Factor of Counterfactuals

In the main text, we provided outlier factor results for the Communities and Crime data set with Wachter and DiCE. Here, we provide additional outlier factor results for Communities and Crime using Sparse Wachter and counterfactuals guided by prototypes, and for the German Credit data set, in figure 5. We see similar results to those in the main paper, namely that the manipulated + δ counterfactuals are the most realistic (lowest % predicted outliers).

Figure 5: Additional outlier factor results for more data sets and counterfactual explanations indicate a similar trend as in the main paper: the manipulated + δ counterfactuals are the most realistic (lowest percent predicted local outliers).

E.2 Different Initializations

In the main text, we provided results for different initialization strategies with the Communities and Crime data set using DiCE and Wachter. We provide additional initialization results for German Credit in Table 7 and for Communities and Crime with Sparse Wachter and counterfactuals guided by prototypes in Table 6. Similar to the experiments presented in the main text, we see that initializing at the data point perturbed with noise is consistently the most effective mitigation strategy.

E.3 Categorical Features

In the main text, we used numerical features in the counterfactual search. In this appendix, we train manipulated models using categorical features in the counterfactual search with German Credit for both counterfactuals guided by prototypes and DiCE. We do not use categorical features with Wachter because it is very computationally expensive Van Looveren and Klaise [2019]. We perform this experiment with German Credit only because there are no categorical features in Communities and Crime. We consider applying δ to only the numerical features, as well as rounding δ on the categorical features. We present the results in table 5. We found the manipulation to be successful in 3 out of 4 cases, with the exception being the rounded δ for counterfactuals guided by prototypes. These results demonstrate the manipulation can be successful with categorical features.

Only numerical Rounding
Proto. DiCE Proto. DiCE
German Credit
Protected
Non-Protected
Disparity
Non-Protected
Cost Reduction 2.3 1.5 0.92 1.5
Test Accuracy
Table 5: Manipulated Models with categorical features trained on the German Credit data set. These results show it is possible to train manipulative models successfully with categorical features.
Model S-Wachter Proto.
Initialization Mean Rnd. + Mean Rnd. +
Protected
Non-Protected
Disparity 17.03 1.83 2.05 8.29 13.78 7.45
Non-Protected
Cost Reduction 1.15 7.43 1.04 0.74 1.12 0.82
Test Accuracy
Table 6: Additional initialization results for Communities & Crime demonstrating the efficacy of different mitigation strategies. These results help demonstrate that the noise-perturbed initialization is consistently an effective mitigation strategy.
Model            Wachter                      S-Wachter                    Proto.                       DiCE
Initialization   Mean    Rnd.    x + noise    Mean    Rnd.    x + noise    Mean    Rnd.    x + noise    Rnd.
Protected        1.94    1.18    1.22         5.58    0.83    2.18         3.24    3.81    3.62         39.53
Not-Prot.        1.29    1.27    1.42         2.29    0.95    3.24         4.64    7.42    3.47         36.53
Disparity        0.65    0.18    0.19         3.29    0.12    1.06         1.39    3.61    0.14         21.43
Not-Prot. + δ    0.96    3.79    1.03         1.30    1.36    1.26         3.52    5.74    2.54         3.00
Cost Reduction   1.34    1.07    1.38         1.31    0.70    1.26         1.32    1.29    1.36         1.70
Accuracy         66.5    67.0    68.5         66.5    67.7    67.7         66.3    65.8    65.8         66.8
‖δ‖              0.81    0.80    0.36         0.81    0.80    0.54         0.98    0.43    0.83         2.9
Table 7: Different initialization results for the German Credit data set demonstrating the efficacy of various initialization strategies. These results indicate that the noise-perturbed initialization is consistently the most effective mitigation strategy.

Appendix F Compute Details

We run all experiments in this work on a machine with a single NVIDIA 2080Ti GPU.