1 Introduction
Machine learning models are being deployed to make consequential decisions on tasks ranging from loan approval to medical diagnosis. As a result, there are a growing number of methods that explain the decisions of these models to affected individuals and provide means for recourse Ustun et al. [2019]. For example, recourse offers a person denied a loan by a credit risk model a reason why the model made the prediction and what can be done to change the decision. Beyond providing guidance to stakeholders in model decisions, algorithmic recourse is also used to detect discrimination in machine learning models Gupta et al. [2019], Karimi et al. [2020], Sharma et al. [2020]. For instance, we expect there to be minimal disparity in the cost of achieving recourse between men and women who are denied loans. One commonly used method to generate recourse is that of counterfactual explanations Bhatt et al. [2020]. Counterfactual explanations offer recourse by attempting to find the minimal change an individual must make to receive a positive outcome Wachter et al. [2018], Karimi et al. [2020], Poyiadzi et al. [2020], Van Looveren and Klaise [2019].
Although counterfactual explanations are used by stakeholders in consequential decision-making settings, there is little work on systematically understanding and characterizing their limitations. A few recent studies explore how counterfactual explanations may become invalid when the underlying model is updated. For instance, a model provider might decide to update a model, rendering previously generated counterfactual explanations invalid Rawal et al. [2020], Pawelczyk et al. [2020]. Others point out that counterfactual explanations, by ignoring the causal relationships between features, sometimes recommend changes that are not actionable Karimi* et al. [2020]. Though these works shed light on certain shortcomings of counterfactual explanations, they do not consider whether current formulations provide stable and reliable results, whether they can be manipulated, and whether fairness assessments based on counterfactuals can be trusted.
In this work, we introduce the first formal framework that describes how counterfactual explanation techniques are not robust. More specifically, we demonstrate that the family of counterfactual explanations that rely on hill-climbing (which includes commonly used methods like Wachter's algorithm Wachter et al. [2018], DiCE Mothilal et al. [2020], and counterfactuals guided by prototypes Van Looveren and Klaise [2019]) is highly sensitive to small changes in the input. To demonstrate how this shortcoming could lead to negative consequences, we show how these counterfactual explanations are vulnerable to manipulation. Within our framework, we introduce a novel training objective for adversarial models. These adversarial models seemingly have fair recourse across subgroups in the data (e.g., men and women) but provide much lower cost recourse for the data under a slight perturbation, allowing a bad actor to offer low-cost recourse to specific subgroups simply by adding the perturbation. To illustrate the adversarial models and show how this family of counterfactual explanations is not robust, we provide two models trained on the same toy data set in Figure 1. In the model trained with the standard BCE objective (left side of Fig 1), the counterfactuals found by Wachter's algorithm Wachter et al. [2018] for an instance x and its perturbed version x + δ converge to the same minimum. However, for the adversarial model (right side of Fig 1), the counterfactual found for the perturbed instance is much closer to the original instance x. This result indicates that the counterfactual found for the perturbed instance is easier to achieve than the one found by Wachter's algorithm for x itself! Intuitively, counterfactual explanations that hill-climb the gradient are susceptible to this issue because optimizing for the counterfactual at x versus x + δ can converge to different local minima.
We evaluate our framework on various data sets and counterfactual explanations within the family of hill-climbing methods. For Wachter's algorithm Wachter et al. [2018], a sparse variant of Wachter's algorithm, DiCE Mothilal et al. [2020], and counterfactuals guided by prototypes Van Looveren and Klaise [2019], we train models on data sets related to loan prediction and violent crime prediction that have fair recourse across subgroups yet return substantially lower cost recourse for specific subgroups under the perturbation δ, without any loss in accuracy. Though these results indicate counterfactual explanations are highly vulnerable to manipulation, we also consider making counterfactual explanations that hill-climb the gradient more robust. We show that adding noise to the initialization of the counterfactual search, limiting the features available in the search, and reducing the complexity of the model can lead to more robust explanation techniques.
2 Background
In this section, we introduce notation and provide background on counterfactual explanations.
Notation We use a dataset D containing N data points, where each instance is a tuple of features x ∈ R^d and label y ∈ {0, 1}, i.e. D = {(x_i, y_i)}_{i=1}^N (similarly for the test set). For convenience, we refer to the set of all data points in dataset D as X. We use the notation x_i to denote indexing data point i in X and x_{i,j} to denote indexing attribute j in x_i. Further, we have a model f : R^d → [0, 1] that predicts the probability of the positive class for a data point. We assume the model is parameterized by θ but omit the dependence and write f for convenience. Last, we assume the positive class is the desired outcome (e.g., receiving a loan) henceforth. We also assume we have access to whether each instance in the dataset belongs to a protected group of interest or not, in order to define fairness requirements for the model. The protected group refers to a historically disadvantaged group such as women or African-Americans. We use D_p to indicate the protected subset of the dataset D, and D_np for the "not-protected" group. Further, we denote the protected group with the positive (i.e., more desired) outcome as D_p^+ and with the negative (i.e., less desired) outcome as D_p^- (and similarly for the not-protected group).
Counterfactual Explanations Counterfactual explanations return a data point x' that is close to x but is predicted to be positive by the model f. We denote the counterfactual returned by a particular algorithm A for instance x as x' = A(x), where the model predicts the positive class for the counterfactual, i.e., f(A(x)) > 0.5. We take the difference between the original data point x and counterfactual x' as the set of changes an individual would have to make to receive the desired outcome. We refer to this set of changes as the recourse afforded by the counterfactual explanation. We define the cost of recourse as the effort required to accomplish this set of changes Venkatasubramanian and Alfano [2020]. In this work, we measure the cost of recourse as the distance d(x, x') between x and x'. Because computing the real-world cost of recourse is challenging Barocas et al. [2020], we use an ad hoc distance function, as is general practice.
Counterfactual Objectives In general, counterfactual explanation techniques optimize objectives of the form,

(1) A(x) = argmin_{x'} G(x'),

(2) G(x') = λ (f(x') − s)^2 + d(x, x'),

where x' denotes the candidate counterfactual at a particular point during optimization. The first term encourages the counterfactual to have the desired outcome probability s under the model. The distance function d(x, x') enforces that the counterfactual is close to the original instance and easier to "achieve" (lower cost recourse). The hyperparameter λ balances the two terms. Further, when used for algorithmic recourse, counterfactual explainers often focus only on the few features that the user can influence in the search and in the distance function; we omit this in the notation for clarity.
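To make the hill-climbing procedure concrete, the following is a minimal sketch of optimizing an objective of this form with plain gradient descent on a toy differentiable model; the function names, hyperparameters, and numerical-gradient scheme are illustrative, not those of any particular implementation.

```python
import numpy as np

def counterfactual_search(predict_proba, x, lam=10.0, target=0.7,
                          lr=0.1, steps=500, eps=1e-5):
    """Hill-climb G(x') = lam * (f(x') - s)^2 + d(x, x') (Eq. (2)),
    with d taken to be the Manhattan (L1) distance, starting at x."""
    def objective(x_cf):
        return lam * (predict_proba(x_cf) - target) ** 2 + np.abs(x - x_cf).sum()

    x_cf = x.astype(float).copy()   # initialize the search at the instance
    for _ in range(steps):
        grad = np.zeros_like(x_cf)  # numerical gradient of the objective
        for j in range(len(x_cf)):
            e = np.zeros_like(x_cf)
            e[j] = eps
            grad[j] = (objective(x_cf + e) - objective(x_cf - e)) / (2 * eps)
        x_cf -= lr * grad
    return x_cf

# toy differentiable "model": f(x) = sigmoid(w . x + b)
w, b = np.array([1.0, -2.0]), 0.0
f = lambda z: 1.0 / (1.0 + np.exp(-(z @ w + b)))
x = np.array([-1.0, 1.0])            # predicted negative, f(x) ~ 0.05
x_cf = counterfactual_search(f, x)   # pushed toward the positive class
```

The search stops moving a feature once the marginal gain in the prediction term no longer outweighs the L1 cost of changing it, which is why these methods tend to settle in a nearby local minimum.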
Distance Functions The distance function d captures the effort needed to go from x to x' by an individual. As one such notion of distance, Wachter et al. [2018] use the Manhattan (ℓ1) distance weighted by the inverse median absolute deviation (MAD),

(3) d(x, x') = Σ_j |x_j − x'_j| / MAD_j, where MAD_j = median_i( |x_{i,j} − median_k(x_{k,j})| ).

This distance function generates sparse solutions and closely represents the absolute change someone would need to make to each feature, while correcting for different ranges across the features. This distance function can be extended to capture other counterfactual algorithms. For instance, we can include elastic net regularization instead of ℓ1 for more efficient feature selection in high dimensions Dhurandhar et al. [2018], add a term that captures the closeness of the counterfactual to the data manifold to encourage the counterfactuals to be in-distribution, making them more realistic Van Looveren and Klaise [2019], or include a diversity criterion over the counterfactuals Mothilal et al. [2020]. We provide the objectives for these methods in appendix B.1.

Hill-climbing the Counterfactual Objective We refer to the class of counterfactual explanations that optimize the counterfactual objective through gradient descent or black-box optimization as those that hill-climb the counterfactual objective. For example, Wachter's algorithm Wachter et al. [2018] and DiCE Mothilal et al. [2020] fit this characterization because they optimize objective (2) through gradient descent. Techniques like MACE Karimi et al. [2020] and FACE Poyiadzi et al. [2020] do not fit this criterion because they do not use hill-climbing techniques.
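The MAD-weighted distance in Eq. (3) can be computed directly from training data; a minimal sketch, where the `X_train` matrix and the zero-MAD guard are our own assumptions.

```python
import numpy as np

def mad_weighted_l1(x, x_cf, X_train):
    """Eq. (3): Manhattan distance weighted by the inverse median absolute
    deviation (MAD) of each feature, estimated from the training data."""
    med = np.median(X_train, axis=0)
    mad = np.median(np.abs(X_train - med), axis=0)
    mad = np.where(mad == 0, 1.0, mad)  # guard: avoid division by zero
    return float(np.sum(np.abs(x - x_cf) / mad))

X_train = np.array([[0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
# per-feature MAD is [1.0, 10.0], so a change of 1 in feature 0 costs as
# much as a change of 10 in feature 1
cost = mad_weighted_l1(np.array([0.0, 0.0]), np.array([1.0, 10.0]), X_train)
```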
Recourse Fairness One common use of counterfactuals as recourse is to determine the extent to which a model discriminates between two populations. For example, counterfactual explanations may return recourses that are easier to achieve for members of the not-protected group Ustun et al. [2019], Sharma et al. [2020], indicating unfairness in the counterfactuals Karimi et al. [2020], Gupta et al. [2019]. Formally, we define recourse fairness in terms of the difference in the average recourse cost between the protected and not-protected groups, and say a counterfactual algorithm is recourse fair if this disparity is less than some threshold τ.
Definition 2.1
A model f is recourse fair for algorithm A, distance function d, dataset D, and scalar threshold τ if,

| E_{x ∈ D_p}[ d(x, A(x)) ] − E_{x ∈ D_np}[ d(x, A(x)) ] | ≤ τ.
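Definition 2.1 amounts to a simple check on per-group recourse costs; a minimal sketch, where the cost arrays are assumed to hold d(x, A(x)) for each member of each group.

```python
import numpy as np

def recourse_disparity(costs_protected, costs_not_protected):
    """Absolute difference in mean recourse cost between the groups."""
    return abs(np.mean(costs_protected) - np.mean(costs_not_protected))

def is_recourse_fair(costs_protected, costs_not_protected, tau):
    """Definition 2.1: the model is recourse fair iff the disparity in
    mean recourse cost is at most the threshold tau."""
    return recourse_disparity(costs_protected, costs_not_protected) <= tau

fair = is_recourse_fair([2.0, 4.0], [3.0, 3.5], tau=0.5)     # disparity 0.25
unfair = is_recourse_fair([5.0, 7.0], [2.0, 3.0], tau=0.5)   # disparity 3.5
```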
3 Adversarial Models for Manipulating Counterfactual Explanations
To demonstrate that commonly used approaches for counterfactual explanations are vulnerable to manipulation, we show, by construction, that one can design adversarial models for which the produced explanations are unstable. In particular, we focus on the use of explanations for determining fair recourse, and demonstrate that models that produce seemingly fair recourses can in fact produce much more desirable recourses for non-protected instances if those instances are perturbed slightly.
Problem Setup Although counterfactual explanation techniques can be used to gain insights into and evaluate the fairness of models, here we investigate how they are amenable to manipulation. To this end, we simulate an adversarial model owner, who is incentivized to create a model that is biased in favor of the non-protected group. We also simulate a model auditor, who uses counterfactual explanations to determine whether recourse unfairness occurs. The adversarial model owner is thus incentivized to construct a model that, under existing counterfactual techniques, shows equal treatment of the populations in order to pass audits, yet can produce very low cost counterfactuals when the input is perturbed.
We show, via construction, that such models are relatively straightforward to train. In our construction, we jointly learn a perturbation vector δ (a small vector of the same dimension as x) and the model f, such that the recourses computed by existing techniques look fair, but recourses computed after adding the perturbation δ to the input data are much lower cost. In this way, the adversarial model owner can perturb members of the non-protected group to generate low cost recourse, while the model looks recourse fair to auditors.

Motivating Example For a concrete example of a real model that meets this criterion, we refer to Figure 2. When running an off-the-shelf counterfactual algorithm on the male and female instances (representative of the non-protected and protected groups, respectively), we observe that the two recourses are similar to each other. However, when the adversary changes the age of the male applicant (the perturbation δ), the recourse algorithm finds a much lower cost recourse.
Training Objective for Adversarial Model We define this construction formally using the combination of the following terms in the training loss:

- Fairness: The counterfactual algorithm A should be fair for model f according to Definition 2.1, which can be included as minimizing the disparity in recourse cost between the groups.
- Unfairness: The perturbation vector δ should lead to lower cost recourse when added to non-protected data, leading to unfairness, i.e., d(x + δ, A(x + δ)) < d(x, A(x)) for non-protected x.
- Small perturbation: The perturbation δ should be small, i.e., we minimize ‖δ‖.
- Accuracy: We should minimize the classification loss (such as cross entropy) of the model f.
- Counterfactual: x + δ should itself be predicted positive, i.e., we minimize the classification loss of f(x + δ) with respect to the positive label.
This combined training objective is defined over both the parameters θ of the model and the perturbation vector δ. Apart from requiring joint optimization over these two variables, the objective is further complicated because it involves A, a black-box counterfactual explanation approach. We address these challenges in the next section.
Training Adversarial Models
Our optimization proceeds in two parts, dividing the terms depending on whether they involve the counterfactual algorithm or not. First, we optimize the perturbation δ and model parameters θ on the subset of the terms that do not depend on the counterfactual algorithm, i.e. optimizing accuracy, the counterfactual term, and the perturbation size^{1}:

(4) argmin_{θ, δ} Σ_{(x, y) ∈ D} L(f(x; θ), y) + Σ_{x ∈ D_np} L(f(x + δ; θ), 1) + ‖δ‖,

where L is the classification loss (cross entropy).

^{1}The objectives discussed in this section use the training set, whereas evaluation is done on a held-out test set everywhere else.
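As an illustration of the terms in this first stage, the sketch below fits a toy logistic model and then a perturbation δ on synthetic data. Note that for clarity it optimizes θ first and δ second with θ fixed, whereas the objective above optimizes them jointly; all data, names, and hyperparameters here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: label 1 iff the first feature is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] > 0).astype(float)
X_np = X[y == 0]    # stand-in for the non-protected group's negative instances

# accuracy term: fit a logistic model with full-batch gradient descent
w, b = np.zeros(2), 0.0
for _ in range(300):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(X)
    b -= 0.5 * float(np.mean(p - y))

# counterfactual + small-perturbation terms: with the model fixed, descend on
# -mean log f(x + delta) + beta * ||delta|| over the non-protected instances
delta, beta = np.zeros(2), 0.5
for _ in range(300):
    p_np = sigmoid((X_np + delta) @ w + b)
    grad = -float(np.mean(1.0 - p_np)) * w   # gradient of -mean log f(x + delta)
    nrm = np.linalg.norm(delta)
    if nrm > 1e-12:
        grad = grad + beta * delta / nrm     # subgradient of beta * ||delta||
    delta -= 0.5 * grad
```

After training, the model classifies the toy data accurately, the non-protected negatives remain predicted negative, yet the same points shifted by δ are predicted positive, which is the behavior the first stage is designed to produce.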
Second, we optimize the parameters θ, fixing the perturbation δ. We still include the classification loss so that the model stays accurate, but add the terms that depend on A (we write A_θ to denote that A uses the model parameterized by θ). In particular, we add the two competing recourse-fairness terms: reducing the disparity between subgroups for the recourses on the original data, and increasing the disparity between subgroups by generating lower cost counterfactuals for the non-protected group when the perturbation is added to the instances. This objective is,

(5) argmin_θ Σ_{(x, y) ∈ D} L(f(x; θ), y) + | E_{x ∈ D_p}[ d(x, A_θ(x)) ] − E_{x ∈ D_np}[ d(x, A_θ(x)) ] | + E_{x ∈ D_np}[ d(x + δ, A_θ(x + δ)) ].
Optimizing this objective requires computing the derivative (Jacobian) of the counterfactual explanation with respect to the model parameters, ∂A_θ(x)/∂θ. Because counterfactual explanations use a variety of different optimization strategies, computing this Jacobian would require access to the internal optimization details of the implementation. For instance, some techniques use black-box optimization while others require gradient access, and these details may vary by implementation or even be unavailable. Instead, we consider a solution based on implicit differentiation that decouples the Jacobian from the choice of optimization strategy for counterfactual explanations that follow the form in Eq. (2). We calculate the Jacobian as follows:
Lemma 3.1
Assuming the counterfactual explanation x' = A(x) follows the form of the objective in Equation (2), with x' ∈ R^d and m the number of parameters in the model, we can write the derivative of the counterfactual explanation with respect to the model parameters θ as the Jacobian,

∂x'/∂θ = − ( ∂²G/∂x'² )^{−1} ( ∂²G/∂x'∂θ ) ∈ R^{d×m}.
We provide a proof in Appendix A. Critically, this result does not depend on the implementation details of the counterfactual explanation A, but only needs black-box access to it. One potential issue is the matrix inversion of the Hessian ∂²G/∂x'². Because we consider tabular data sets with relatively small numbers of features, this is not much of an issue; for larger feature sets, the diagonal approximation of the Hessian can serve as a reasonable substitute Fernando and Gould [2016].
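Lemma 3.1 is the standard implicit-differentiation result for an argmin: differentiating the stationarity condition ∂G/∂x' = 0 with respect to θ and solving for ∂x'/∂θ. It can be sanity-checked numerically on a one-dimensional toy objective with a closed-form minimizer; everything below is illustrative.

```python
def G(x_cf, theta, x=1.0):
    """Toy counterfactual objective: pulls x' toward theta and toward x."""
    return (x_cf - theta) ** 2 + (x_cf - x) ** 2

def argmin_G(theta, x=1.0):
    return (theta + x) / 2.0   # closed-form minimizer of the toy objective

theta, eps = 3.0, 1e-4
x_cf = argmin_G(theta)

# second derivatives of G at the minimizer via central finite differences
d2_xx = (G(x_cf + eps, theta) - 2 * G(x_cf, theta) + G(x_cf - eps, theta)) / eps**2
d2_xt = (G(x_cf + eps, theta + eps) - G(x_cf + eps, theta - eps)
         - G(x_cf - eps, theta + eps) + G(x_cf - eps, theta - eps)) / (4 * eps**2)

jac = -d2_xt / d2_xx   # Lemma 3.1 in one dimension
jac_direct = (argmin_G(theta + eps) - argmin_G(theta - eps)) / (2 * eps)
```

Here both the implicit-differentiation value and the direct finite-difference derivative of the minimizer agree (0.5 for this toy G), and only evaluations of G are needed, never the optimizer that produced x'.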
To provide intuition for how this objective exploits counterfactual explanations to train manipulative models, we refer again to Figure 1. Because the counterfactual objective relies on an arbitrary model f, the objective can be non-convex. As a result, we can design f such that the search started at x converges to higher cost local minima for all data points than the search started at x + δ.
4 Experiment Setup
We use the following setup, including multiple counterfactual explanation techniques on two datasets, to evaluate the proposed approach for training manipulated models.
Counterfactual Explanations We consider four counterfactual explanation algorithms that hill-climb the counterfactual objective as the choices for A. We use Wachter's algorithm Wachter et al. [2018], Wachter's algorithm with elastic net sparsity regularization (Sparse Wachter; a variant of Dhurandhar et al. [2018]), DiCE Mothilal et al. [2020], and Counterfactuals Guided by Prototypes Van Looveren and Klaise [2019] (exact objectives in appendix B.1). These counterfactual explanations are widely used to compute recourse and assess the fairness of models Karimi et al. [2020], Verma et al. [2020], Stepin et al. [2021]. We use the ℓ1 distance to compute the cost of a recourse discovered by a counterfactual. We use the official DiCE implementation^{2} and re-implement the others (details in Appendix B.2). DiCE is the only approach that computes multiple counterfactual explanations; we generate several counterfactuals and take the one closest to the original point (per ℓ1 distance) to get a single counterfactual.

^{2}https://github.com/interpretml/DiCE
Data sets We use two data sets, Communities and Crime and German Credit Dua and Graff [2017], as they are commonly used benchmarks in both the counterfactual explanation and fairness literature Verma et al. [2020], Friedler et al. [2019]. Both datasets are in the public domain. Communities and Crime contains demographic and economic information about communities across the United States, with the goal of predicting whether there is violent crime in the community. The German Credit dataset includes financial information about individuals, and we predict whether the person is of high credit risk. We preprocess the data as in Slack et al. [2020], apply zero-mean, unit-variance scaling to the features, and split the data into training and testing sets. In Communities and Crime, we take whether the community is predominantly Black as the protected class and low risk for violent crime as the positive outcome. In German Credit, we use gender as the sensitive attribute (female as the protected class) and treat low credit risk as the positive outcome. We compute counterfactuals on each data set using the numerical features; these include all features for Communities and Crime and a subset of the features for German Credit. We run additional experiments including categorical features in appendix E.3.

Table 1: Accuracy and magnitude of the discovered perturbation ‖δ‖ for the unmodified and manipulated models.

                    Comm. & Crime      German Credit
                    Acc     ‖δ‖        Acc     ‖δ‖
    Unmodified      81.2    --         71.1    --
    Wachter         80.9    0.80       72.0    0.09
    Sparse Wachter  77.9    0.46       70.5    2.50
    Prototypes      79.2    0.46       69.0    2.21
    DiCE            81.1    1.73       71.2    0.09
Manipulated Models We use feed-forward neural networks with ReLU activations as the adversarial models, trained with the Adam optimizer and cross-entropy loss. It is common to use neural networks when counterfactuals are required, since they are differentiable, enabling counterfactual discovery via gradient descent Mothilal et al. [2020]. We run the first part of the optimization (Eq. (4)) for a fixed number of steps on each dataset, followed by the second part (Eq. (5)), and select the model with the smallest disparity in mean recourse cost between protected and non-protected groups on the training data. We also train a baseline network (the unmodified model) for our evaluations. In Table 1, we show the model accuracy for the two datasets (the manipulated models are similarly accurate to the unmodified one) and the magnitude of the discovered δ.

Table 2: Mean recourse cost per group, the resulting disparity, the cost for the non-protected group with the perturbation δ added, and the cost reduction (Eq. (6)).

                       Communities and Crime              German Credit
                       Wach.  S-Wach.  Proto.  DiCE       Wach.  S-Wach.  Proto.  DiCE
    Protected          35.68  54.16    22.35   49.62      5.65   8.35     10.51   6.31
    Non-Protected      35.31  52.05    22.65   42.63      5.08   8.59     13.98   6.81
    Disparity           0.37   2.12     0.30    6.99      0.75   0.24      0.06   0.50
    Non-Protected + δ   1.76  22.59     8.50    9.57      3.16   4.12      4.69   3.38
    Cost reduction     20.1    2.3      2.6     4.5       1.8    2.0       2.2    2.0
5 Experiments
We evaluate manipulated models primarily in terms of how well they hide the cost disparity in recourses for protected and nonprotected groups, and investigate how realistic these recourses may be. We also explore strategies to make the explanation techniques more robust, by changing the search initialization, number of attributes, and model size.
5.1 Effectiveness of the Manipulation
We evaluate the effectiveness of the manipulated models across counterfactual explanations and datasets. To evaluate whether the models look recourse fair, we compute the disparity of the average recourse cost between the protected and non-protected groups, as in Definition 2.1. We also measure the average cost (using the ℓ1 distance) for the non-protected group and for the non-protected group perturbed by δ. We use the ratio between these costs as the metric for the success of the manipulation,

(6) Cost reduction = E_{x ∈ D_np}[ d(x, A(x)) ] / E_{x ∈ D_np}[ d(x + δ, A(x + δ)) ].

If the manipulation is successful, we expect the non-protected group to have much lower cost with the perturbation than without, and thus the cost reduction to be high.
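The cost reduction metric is a simple ratio of mean recourse costs; a minimal sketch, where the example numbers loosely follow the Wachter column of Table 2.

```python
import numpy as np

def cost_reduction(costs_nonprot, costs_nonprot_perturbed):
    """Eq. (6): ratio of the mean recourse cost for the non-protected group
    without the perturbation to the mean cost with it. Values well above 1
    indicate the manipulation succeeded."""
    return float(np.mean(costs_nonprot) / np.mean(costs_nonprot_perturbed))

# illustrative per-instance costs, roughly matching Table 2's Wachter column
reduction = cost_reduction([35.3, 35.4], [1.7, 1.8])
```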
We provide the results for both datasets in Table 2. The disparity in counterfactual cost on the unperturbed data is very small in most cases, indicating the models would appear counterfactual fair to auditors. At the same time, we observe that the cost reduction in the counterfactual distances for the non-protected group after applying the perturbation is quite high, indicating that lower cost recourses are easy to compute for the non-protected group. The manipulation is considerably more effective on Wachter's algorithm in Communities and Crime. The success of the model in this setting can be attributed to the simplicity of the objective: the Wachter objective only considers the squared loss (i.e., Eq (1)) and the distance, whereas counterfactuals guided by prototypes also take into account closeness to the data manifold. Further, all adversarial models are more successful on Communities and Crime than on German Credit. The relative success is likely due to Communities and Crime having a larger number of features than German Credit, making it easier to learn a successful adversarial model in the higher-dimensional space. Overall, these results demonstrate that the adversarial models are quite successful at manipulating counterfactual explanations.
5.2 Outlier Factor of Counterfactuals
One potential concern is that the manipulated models return counterfactuals that are out of distribution, resulting in unrealistic recourses. To evaluate whether this is the case, we follow Pawelczyk et al. [2020] and compute the local outlier factor of the counterfactuals with respect to the positively classified data Breunig et al. [2000]. Using a single neighbor (k = 1), the score is

(7) lof(x') = d(x', a) / d(a, a'),

where a is the closest true-positive neighbor of x' and a' is the closest true-positive neighbor of a. This metric is greater than 1 when the counterfactual is an outlier. We compute the percent of counterfactuals that are local outliers by this metric on Communities and Crime in Figure 3 (results for additional datasets and methods in Appendix E). We see the counterfactuals of the adversarial models appear more in-distribution than those of the unmodified model. These results demonstrate that the manipulated models do not produce unrealistic counterfactuals as a side effect of training on the manipulative objective, as might be a concern.
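A simplified single-neighbor outlier factor in this spirit can be computed as below; this is an illustrative approximation of the k = 1 local outlier factor, not necessarily the exact score of Breunig et al. [2000].

```python
import numpy as np

def nn_outlier_factor(x_cf, positives):
    """Simplified single-neighbor outlier score: the counterfactual's
    distance to its nearest positively classified point, divided by that
    point's distance to its own nearest positive point. Values above 1
    suggest the counterfactual is an outlier."""
    dists = np.linalg.norm(positives - x_cf, axis=1)
    a_idx = np.argmin(dists)
    a = positives[a_idx]
    others = np.delete(positives, a_idx, axis=0)
    return float(dists[a_idx] / np.linalg.norm(others - a, axis=1).min())

positives = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
in_dist = nn_outlier_factor(np.array([0.1, 0.1]), positives)   # near the data
outlier = nn_outlier_factor(np.array([5.0, 5.0]), positives)   # far from it
```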

5.3 Potential Mitigation Strategies
In this section, we explore a number of constraints that could lead to more robust counterfactuals.
Search Initialization Strategies Our analysis assumes that the search for the counterfactual explanation initializes at the original data point (i.e., x or x + δ), as is common in counterfactual explanations. Are manipulations still effective under other initializations? We consider three different initialization schemes and examine the effectiveness of the Wachter and DiCE Communities and Crime adversarial models: (1) randomly, (2) at the mean of the positively predicted data, and (3) at a randomly perturbed copy of the data point. To initialize Wachter randomly, we follow Mothilal et al. [2020] and draw a random instance uniformly between the minimum and maximum of each feature (DiCE provides an option to initialize randomly, and we use this initialization). From the results in Figure 3(a), we see that perturbing the data point before the search reduces the cost reduction most effectively. We find similar results for German Credit in appendix E.

Number of Attributes We consider reducing the number of attributes used to find counterfactuals and evaluate the success of the adversarial model on Wachter's algorithm for the Communities and Crime dataset. Starting with the original set of attributes, we repeatedly select attributes at random, remove them from the set used by the counterfactual algorithm, and train an adversarial model, until only a few attributes remain. We report the cost reduction due to δ (Eq. (6)) for each model, averaged over multiple runs. We observe that we are unable to find low cost recourses for the adversarial model as we reduce the number of attributes, with minimal impact on accuracy (not in figure). This suggests that counterfactual explanations are more robust when they are constrained. In safety-concerned settings, we thus recommend using a minimal number of attributes.
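The random-initialization mitigation described above can be sketched as follows, using the uniform scheme attributed to Mothilal et al. [2020]; the helper name and toy data are illustrative.

```python
import numpy as np

def random_init(X_train, rng):
    """Draw a random starting point for the counterfactual search, uniform
    between each feature's observed minimum and maximum, instead of
    starting at the (possibly perturbed) instance itself."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return rng.uniform(lo, hi)

rng = np.random.default_rng(0)
X_train = np.array([[0.0, -1.0], [2.0, 3.0], [1.0, 1.0]])
x0 = random_init(X_train, rng)   # starting point inside the feature ranges
```

Because the adversarial model relies on the search starting exactly at x or x + δ to steer it into different local minima, randomizing the start point removes the attacker's control over which basin the search falls into.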
Size of the Model To further characterize the manipulation, we train a number of models (on Communities and Crime, for Wachter's algorithm) that vary in their size. We find that as we increase the model size, we obtain an even higher cost reduction, i.e., the cost reduction increases as additional parameters are added. This is not surprising, since more parameters provide more flexibility to distort the decision surface as needed. As we reduce the size of the model, we see the opposite trend: the cost reduction shrinks substantially when fewer parameters are used. However, test set accuracy also falls considerably (not in figure). These results suggest it is safest to use as compact a model as meets the accuracy requirements of the application.
Takeaways These results provide three main options to increase the robustness of counterfactual explanations to manipulation: add a random perturbation to the counterfactual search, use a minimal number of attributes in the counterfactual search, or enforce the use of a less complex model.
6 Related Work
Recourse Methods A variety of methods have been proposed to generate recourse for affected individuals Wachter et al. [2018], Ustun et al. [2019], Karimi et al. [2020], Poyiadzi et al. [2020], Van Looveren and Klaise [2019]. Wachter et al. [2018] propose gradient search for the closest counterfactual, while Ustun et al. [2019] introduce the notion of actionable recourse for linear classifiers and propose techniques to find such recourse using linear programming. Because counterfactuals generated by these techniques may produce unrealistic recommendations, Van Looveren and Klaise [2019] incorporate constraints in the counterfactual search to encourage them to be in-distribution. Similarly, other approaches incorporate causality in order to avoid such spurious counterfactuals Karimi et al. [2021], Karimi* et al. [2020], Barocas et al. [2020]. Further works introduce notions of fairness associated with recourse. For instance, Ustun et al. [2019] demonstrate disparities in the cost of recourse between groups, which Sharma et al. [2020] use to evaluate fairness. Gupta et al. [2019] first proposed methods to equalize recourse between groups using SVMs. Karimi et al. [2020] establish the notion of fairness of recourse and demonstrate it is distinct from fairness of predictions. Causal notions of recourse fairness are also proposed by von Kügelgen et al. [2020].

Shortcomings of Explanations Pawelczyk et al. [2020] discuss counterfactuals under predictive multiplicity Marx et al. [2020] and demonstrate counterfactuals may not transfer across equally good models. Rawal et al. [2020] show counterfactual explanations find invalid recourse under distribution shift. Kasirzadeh and Smart [2021] consider how counterfactual explanations are currently misused and propose tenets to better guide their use. Work on strategic behavior considers how individuals might behave with access to either model transparency Chen et al. [2020], Tabibian et al. [2019] or counterfactual explanations Tsirtsis and Gomez-Rodriguez [2020], resulting in potentially suboptimal outcomes. Though these works highlight shortcomings of counterfactual explanations, they do not show that these methods are not robust and vulnerable to manipulation. Related studies show that post hoc explanation techniques like LIME Ribeiro et al. [2016] and SHAP Lundberg and Lee [2017] can also hide the biases of models Slack et al. [2020], as can gradient-based explanations Dimanov et al. [2020], Wang et al. [2020]. Aivodji et al. [2019] and Anders et al. [2020] show explanations can make unfair models appear fair.
7 Discussion & Conclusion
In this paper, we demonstrate a critical vulnerability in counterfactual explanations and show that they can be manipulated, raising questions about their reliability. We show such manipulations are possible across a variety of commonly-used counterfactual explanations, including Wachter's algorithm Wachter et al. [2018], a sparse version of Wachter's algorithm, counterfactuals guided by prototypes Van Looveren and Klaise [2019], and DiCE Mothilal et al. [2020]. These results call into question the trustworthiness of counterfactual explanations as a tool to recommend recourse to algorithm stakeholders. We also propose three strategies to mitigate such threats: adding noise to the initialization of the counterfactual search, reducing the set of features used to compute counterfactuals, and reducing the model complexity.
Our results motivate several future research directions. First, there is a need for counterfactual explanations that are robust to small changes in the input; robust counterfactuals would prevent explanation techniques from producing drastically different counterfactuals under small perturbations. For instance, it could be possible to draw on the Bayesian assessment literature to generate more robust counterfactuals using available data points Ji et al. [2020, 2021, 2019]. Second, this work motivates the need for explanations with optimality guarantees, which could lead to more trust in the counterfactuals. Last, it could be useful to study when practitioners should use simpler models, such as in consequential domains, to have more knowledge of their decision boundaries, even if at the cost of accuracy.
References
 Fairwashing: the risk of rationalization. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 161–170.
 Fairwashing explanations with off-manifold detergent. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 314–323.
 The hidden assumptions behind counterfactual explanations and principal reasons. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* '20), pp. 80–89.
 Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* '20), pp. 648–657.
 LOF: identifying density-based local outliers. SIGMOD Record 29 (2), pp. 93–104.
 Strategic classification with a light touch: learning classifiers that incentivize constructive adaptation. arXiv.
 Explanations based on the missing: towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, Vol. 31, pp. 592–603.
 You shouldn't trust me: learning models which conceal unfairness from multiple explanation methods. In SafeAI@AAAI.
 UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
 Learning end-to-end video classification with rank-pooling. In Proceedings of the 33rd International Conference on Machine Learning (ICML '16), pp. 1187–1196.
 A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19), pp. 329–338.
 On differentiating parameterized argmin and argmax problems with application to bi-level optimization. arXiv:1607.05447.
 Equalizing recourse across groups.

 Active Bayesian assessment of black-box classifiers. Proceedings of the AAAI Conference on Artificial Intelligence 35 (9), pp. 7935–7944.
 Bayesian evaluation of black-box classifiers. Uncertainty and Robustness in Deep Learning Workshop, ICML.
 Can I trust my fairness metric? Assessing fairness with unlabeled data and Bayesian inference. In Advances in Neural Information Processing Systems, Vol. 33, pp. 18600–18612.
 Model-agnostic counterfactual explanations for consequential decisions. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research, Vol. 108, pp. 895–905.
 Algorithmic recourse: from counterfactual explanations to interventions. In 4th Conference on Fairness, Accountability, and Transparency (ACM FAccT).
 A survey of algorithmic recourse: definitions, formulations, solutions, and prospects. arXiv:2010.04050.
 Algorithmic recourse under imperfect causal knowledge: a probabilistic approach. In Advances in Neural Information Processing Systems 33.
 The use and misuse of counterfactuals in ethical machine learning. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT '21), pp. 228–236.
 A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pp. 4765–4774.
 Predictive multiplicity in classification. In ICML.
 Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 607–617.
 On counterfactual explanations under predictive multiplicity. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), Proceedings of Machine Learning Research, Vol. 124, pp. 809–818.
 Learning model-agnostic counterfactual explanations for tabular data. In Proceedings of The Web Conference 2020, pp. 3126–3132.
 FACE: feasible and actionable counterfactual explanations. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES '20), pp. 344–350.
 Can I still trust you? Understanding the impact of distribution shifts on algorithmic recourses. arXiv.
 "Why should I trust you?": explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
 CERTIFAI: a common framework to provide explanations and analyse the fairness and robustness of black-box models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES '20), pp. 166–172.
 Fooling LIME and SHAP: adversarial attacks on post hoc explanation methods. In AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES).
 A survey of contrastive and counterfactual explanation generation methods for explainable artificial intelligence. IEEE Access 9, pp. 11974–12001.
 Optimal decision making under strategic behavior. arXiv:1905.09239.
 Decisions, counterfactual explanations and strategic behavior. In Advances in Neural Information Processing Systems (NeurIPS).
 Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19), pp. 10–19.
 Interpretable counterfactual explanations guided by prototypes. arXiv:1907.02584.
 The philosophical basis of algorithmic recourse. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* '20), pp. 284–293.
 Counterfactual explanations for machine learning: a review.
 On the fairness of causal algorithmic recourse.
 Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harvard Journal of Law & Technology 31, pp. 841–887.
 Gradient-based analysis of NLP models is manipulable. In Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings), pp. 247–258.
Appendix A Optimizing Over the Returned Counterfactuals
In this appendix, we discuss a technique to optimize over the counterfactuals found by counterfactual explanation methods such as Wachter et al. [2018]. We restate Lemma 3.1 and provide a proof.
Lemma 3.1. Assuming the counterfactual algorithm $A$ follows the form of the objective in equation 2, $G(x_{cf}; \theta)$ is differentiable, and $N$ is the number of parameters in the model, we can write the derivative of the counterfactual algorithm with respect to the model parameters $\theta$ as the Jacobian,

$$\frac{\partial A(x; \theta)}{\partial \theta} = -\left( \nabla^2_{x_{cf}} G \right)^{-1} \left[ \frac{\partial}{\partial \theta_1} \nabla_{x_{cf}} G \;\; \cdots \;\; \frac{\partial}{\partial \theta_N} \nabla_{x_{cf}} G \right].$$

Proof. We want to compute the derivative,

(8) $\frac{\partial}{\partial \theta} A(x; \theta), \quad \text{where } A(x; \theta) = \arg\min_{x_{cf}} G(x_{cf}; \theta).$

This problem is identical to a well-studied class of bi-level optimization problems in deep learning. In these problems, we must compute the derivative of a function with respect to some parameter (here $\theta$) that includes an inner argmin, which itself depends on the parameter. We follow Gould et al. [2016] to complete the proof.

Note, we write $G(A(x; \theta); \theta)$ to describe the objective evaluated at the counterfactual found by the counterfactual explanation $A$. Also, we denote the zero vector as $\vec{0}$. For a single network parameter $\theta_i$, we have the following equivalence because $A$ converges to a stationary point by assumption,

(9) $\nabla_{x_{cf}} G(A(x; \theta); \theta) = \vec{0}$

We differentiate with respect to $\theta_i$ and apply the chain rule,

(10) $\frac{\partial}{\partial \theta_i} \nabla_{x_{cf}} G(A(x; \theta); \theta) = \vec{0}$

(11) $\frac{\partial}{\partial \theta_i} \nabla_{x_{cf}} G + \nabla^2_{x_{cf}} G \, \frac{\partial A(x; \theta)}{\partial \theta_i} = \vec{0}$

Rewriting in terms of $\frac{\partial A(x; \theta)}{\partial \theta_i}$,

(12) $\frac{\partial A(x; \theta)}{\partial \theta_i} = -\left( \nabla^2_{x_{cf}} G \right)^{-1} \frac{\partial}{\partial \theta_i} \nabla_{x_{cf}} G$

Extending this result to all $N$ parameters, we write,

(13) $\frac{\partial A(x; \theta)}{\partial \theta} = -\left( \nabla^2_{x_{cf}} G \right)^{-1} \left[ \frac{\partial}{\partial \theta_1} \nabla_{x_{cf}} G \;\; \cdots \;\; \frac{\partial}{\partial \theta_N} \nabla_{x_{cf}} G \right]$

This result depends on the assumption $\nabla_{x_{cf}} G(A(x; \theta); \theta) = \vec{0}$, which states that the counterfactual explanation converges to a stationary point. In case the counterfactual search terminates before converging to a stationary point, this solution will be approximate.
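The formula in the lemma can be checked numerically. Below is a one-dimensional sketch with a hypothetical linear model $f_\theta(x_{cf}) = \theta x_{cf}$ and objective $G(x_{cf}; \theta) = \lambda (\theta x_{cf} - 1)^2 + (x_{cf} - x)^2$ (all values illustrative); the argmin has a closed form, so the implicit-differentiation formula can be compared against a finite-difference derivative of that argmin.

```python
import numpy as np

# One-dimensional check of the lemma with a hypothetical linear model
# f_theta(x_cf) = theta * x_cf and G(x_cf; theta) = lam*(theta*x_cf - 1)^2 + (x_cf - x)^2.
lam, theta, x = 2.0, 0.7, -1.0

# closed-form argmin of G in x_cf (set dG/dx_cf = 0 and solve)
x_cf = (x + lam * theta) / (1.0 + lam * theta ** 2)

# second derivatives appearing in the lemma, evaluated at the argmin
d2G_dx2 = 2.0 * lam * theta ** 2 + 2.0                  # Hessian of G in x_cf
d2G_dxdtheta = 2.0 * lam * (2.0 * theta * x_cf - 1.0)   # mixed partial

# implicit-differentiation formula: dA/dtheta = -(Hessian)^{-1} * mixed partial
dA_dtheta = -d2G_dxdtheta / d2G_dx2

# finite-difference check against the closed-form argmin
eps = 1e-6
x_cf_eps = (x + lam * (theta + eps)) / (1.0 + lam * (theta + eps) ** 2)
fd = (x_cf_eps - x_cf) / eps
```

The two derivative estimates agree to within finite-difference error, as the proof predicts.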
Appendix B Counterfactual Explanation Details
In this appendix, we provide additional details on the counterfactual explanations used in the paper. Recall, we use four counterfactual explanations: Wachter's Algorithm Wachter et al. [2018], Wachter's with elastic-net sparsity regularization (Sparse Wachter; a variant of Dhurandhar et al. [2018]), DiCE Mothilal et al. [2020], and Counterfactuals Guided by Prototypes Van Looveren and Klaise [2019].
b.1 Objectives & Distance Functions
We describe the objective of each counterfactual explanation and detail hyperparameter choices within the objectives. Note, all algorithms but DiCE include a hyperparameter $\lambda$ applied to the squared loss (i.e., $\lambda (f(x_{cf}) - 1)^2$ in Eq. (1)). Since this parameter needs to be varied to find successful counterfactuals (i.e., $f(x_{cf}) > 0.5$), we set this hyperparameter to a small value initially and increase it multiplicatively until we find a successful counterfactual.
Wachter's Algorithm
The distance function for Wachter's Algorithm is the $\ell_1$ distance weighted by the inverse median absolute deviation (MAD) of each feature,
(14) $d(x, x_{cf}) = \sum_{q \in F} \frac{|x_q - x_{cf,q}|}{\mathrm{MAD}_q}$
The full objective is written as,
(15) $\arg\min_{x_{cf}} \; \lambda \left( f(x_{cf}) - 1 \right)^2 + d(x, x_{cf})$
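A minimal sketch of the MAD-weighted Wachter distance, assuming the MAD of each feature is estimated from the training data and that zero-MAD (constant) features fall back to unit weight:

```python
import numpy as np

def mad_distance(x, x_cf, X_train):
    """Wachter-style distance: L1 distance with each feature weighted by the
    inverse median absolute deviation (MAD) computed over the training data."""
    med = np.median(X_train, axis=0)
    mad = np.median(np.abs(X_train - med), axis=0)
    mad = np.where(mad > 0, mad, 1.0)   # guard features with zero spread
    return float(np.sum(np.abs(x - x_cf) / mad))

# tiny example: feature 0 has MAD 1; feature 1 is constant (weight guarded to 1)
X_train = np.array([[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]])
d = mad_distance(np.array([0.0, 5.0]), np.array([2.0, 6.0]), X_train)   # 2/1 + 1/1
```

The MAD weighting makes the cost of moving each feature comparable across features with very different scales.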
Sparse Wachter
The distance function for Sparse Wachter uses an elastic-net penalty,
(16) $d(x, x_{cf}) = \| x - x_{cf} \|_1 + \| x - x_{cf} \|_2^2$
The full objective is written as,
(17) $\arg\min_{x_{cf}} \; \lambda \left( f(x_{cf}) - 1 \right)^2 + d(x, x_{cf})$
Prototypes
The distance function for Prototypes augments the elastic-net distance with a term pulling the counterfactual toward a prototype,
(18) $d(x, x_{cf}) = \| x - x_{cf} \|_1 + \| x - x_{cf} \|_2^2 + \theta \, \| x_{cf} - \mathrm{proto} \|_2^2$
where $\mathrm{proto}$ is the nearest positively classified neighbor of $x$ according to Euclidean distance. We fix $\theta$ to a constant. The full objective is written as,
(19) $\arg\min_{x_{cf}} \; \lambda \left( f(x_{cf}) - 1 \right)^2 + d(x, x_{cf})$
DiCE
The distance function used in the DiCE objective is defined over a set of $k$ counterfactuals,
(20) $d(x, \{c_i\}_{i=1}^{k}) = \frac{1}{k} \sum_{i=1}^{k} \mathrm{dist}(c_i, x) - \lambda_2 \cdot \mathrm{diversity}(c_1, \ldots, c_k)$
Note, the DiCE objective uses the hinge loss instead of the squared loss used in the earlier objectives. The objective is written as,
(21) $\arg\min_{c_1, \ldots, c_k} \; \frac{1}{k} \sum_{i=1}^{k} \mathrm{hinge\_loss}(f(c_i)) + \lambda_1 \cdot d(x, \{c_i\}_{i=1}^{k})$
When we evaluate distance, we take the closest counterfactual according to $d$ because we are interested in the single least-cost counterfactual. Because we only have a single counterfactual, the diversity term in equation 20 reduces to a constant. Thus, the distance we use during evaluation is the Wachter distance, Eq. (14), on the closest counterfactual. We fix $\lambda_2$ as in Mothilal et al. [2020]. Because DiCE provides a hyperparameter on the distance, $\lambda_1$, instead of on the squared loss as in the other counterfactual explanations, we fix this value initially and decrement it until we successfully generate counterfactuals.
b.2 Reimplementation Details
We reimplement Wachter's Algorithm Wachter et al. [2018], Sparse Wachter (a variant of Dhurandhar et al. [2018] with elastic-net sparsity regularization), and Counterfactuals Guided by Prototypes Van Looveren and Klaise [2019]. We optimize the objectives in section B.1 for each explainer using the Adam optimizer. We initialize the counterfactual search at the original instance unless stated otherwise (e.g., the experimentation with different search-initialization strategies in section 5.3). We fix $\lambda$ and run the counterfactual search optimization for 1,000 steps. If we do not find a successful counterfactual (i.e., one where $f(x_{cf}) > 0.5$), we increase $\lambda$ and rerun the search, repeating this process until we find a counterfactual.
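The $\lambda$ schedule above can be sketched as follows. A toy logistic model and plain gradient descent stand in for the real model and the Adam optimizer, and the initial value and multiplier for $\lambda$ are illustrative, not the paper's exact settings:

```python
import numpy as np

w = np.array([1.5, -2.0])
f = lambda z: 1.0 / (1.0 + np.exp(-w @ z))   # toy model standing in for the network

def inner_search(x, lam, lr=0.05, steps=1000):
    # fixed-budget gradient descent on lam * (f(z) - 1)^2 + ||z - x||_2^2
    z = x.copy()
    for _ in range(steps):
        p = f(z)
        z = z - lr * (lam * 2 * (p - 1) * p * (1 - p) * w + 2 * (z - x))
    return z

# schedule: if the search fails to cross the decision boundary, raise lam and rerun
lam = 0.1
x = np.array([1.0, 1.5])                     # f(x) ≈ 0.18, negatively classified
x_cf = inner_search(x, lam)
while f(x_cf) <= 0.5 and lam < 1e6:
    lam *= 10.0
    x_cf = inner_search(x, lam)
```

Small $\lambda$ keeps the counterfactual close to the input but may never cross the decision boundary; increasing it trades distance for a successful flip.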
Appendix C Unmodified Models
In this appendix, we provide the recourse costs of counterfactual explanations applied to the unmodified models. We give the results in table 3. While the manipulated models showed minimal disparity in recourse cost between subgroups (see table 2), the unmodified models often have large disparities in recourse cost. Further, the counterfactuals found when we add the key to the non-protected instances are not much different from those found without the key. These results demonstrate that the objective in section 3 encourages the model to have equitable recourse costs between groups and a much lower recourse cost for the non-protected group when the key is added.
                        Wach.    S-Wach.   Proto.   DiCE
Communities and Crime
  Protected
  Non-Protected
  Disparity             3.59     3.42      3.84     24.41
  Non-Protected + key
German Credit
  Protected
  Non-Protected
  Disparity             1.94     2.45      1.39     14.98
  Non-Protected + key
Appendix D Scalability of the Adversarial Objective
In this appendix, we discuss the scalability of the adversarial model training objective. We also demonstrate the scalability of the objective by training a successfully manipulated model on the Adult dataset (tens of thousands of data points).
d.1 Scalability Considerations
Training complexity of the optimization procedure proposed in section 3 increases along three main factors. First, complexity increases with the training set size, because we compute the loss across all the instances in the batch; this computation includes finding counterfactuals for each instance in order to compute the Hessian in Lemma 3.1. Second, complexity increases with the number of features in the data, due to the computation of the Hessian in Lemma 3.1, assuming no approximations are used. Last, the number of features included in the counterfactual search increases the complexity of training, because we must optimize more parameters in the perturbation and handle additional features in the counterfactual search.
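Because each instance's counterfactual is computed independently of the others, the per-batch computation parallelizes trivially. A sketch (a closed-form projection onto a toy linear decision boundary stands in for the actual counterfactual search; all values are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

w, b = np.array([1.0, -1.0]), 0.0            # toy linear model

def counterfactual(x):
    # stand-in for the real search: project x onto w.x + b = 0.1,
    # a small positive margin past the decision boundary
    return x - (w @ x + b - 0.1) * w / (w @ w)

X = np.random.default_rng(0).normal(size=(64, 2))
with ThreadPoolExecutor(max_workers=8) as pool:
    cfs = list(pool.map(counterfactual, X))
```

With a real gradient-based search as the worker function, the same pattern spreads the dominant per-instance cost across workers.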
d.2 Adult dataset
One potential question is whether the attack scales to large datasets, because computing counterfactuals for every instance in the training data is costly. However, the optimization procedure can handle large datasets because computing the counterfactuals is easily parallelizable. We demonstrate the scalability of the adversarial objective on the Adult dataset using DiCE with the preprocessing from [4], using numerical features for the counterfactual search. The manipulation was successful, with a cost reduction of 2.13 (see the table below).
                        DiCE
Adult
  Protected
  Non-Protected
  Disparity
  Non-Protected + key
  Cost Reduction        2.13
  Test Accuracy
Appendix E Additional Results
In this appendix, we provide additional experimental results.
e.1 Outlier Factor of Counterfactuals
In the main text, we provided outlier factor results for the Communities and Crime dataset with Wachter and DiCE. Here, we provide additional outlier factor results for Communities and Crime using Sparse Wachter and Counterfactuals Guided by Prototypes, and for the German Credit dataset, in figure 5. We see similar results to those in the main paper, namely that the counterfactuals produced by the manipulated models with the key added are the most realistic (lowest % predicted outliers).
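The outlier-factor check above can be reproduced with scikit-learn's implementation of LOF (the method of Breunig et al. cited in section 5.2). The data below is synthetic, standing in for the training distribution and for on- and off-manifold counterfactuals:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))          # stand-in for the data manifold

# novelty=True lets us score new points (the counterfactuals) after fitting
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

on_manifold = np.array([[0.1, -0.2]])        # near the training data
off_manifold = np.array([[8.0, 8.0]])        # far from the training data
pred_on = lof.predict(on_manifold)           # +1 = inlier
pred_off = lof.predict(off_manifold)         # -1 = outlier
```

A counterfactual flagged as an outlier by LOF lies in a low-density region of the training data and is therefore unlikely to represent a realistic individual.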
e.2 Different Initializations
In the main text, we provided results for different initialization strategies with the Communities and Crime dataset using DiCE and Wachter. We provide additional initialization results for German Credit in Table 7, and for Communities and Crime with Sparse Wachter and Counterfactuals Guided by Prototypes in Table 6. Similar to the experiments presented in the main text, we see the same initialization strategy is consistently the most effective mitigation.
e.3 Categorical Features
In the main text, we used numerical features in the counterfactual search. In this appendix, we train manipulated models using categorical features in the counterfactual search with German Credit, for both Counterfactuals Guided by Prototypes and DiCE. We do not use categorical features with Wachter because doing so is very computationally expensive Van Looveren and Klaise [2019]. We perform this experiment with German Credit only because there are no categorical features in Communities and Crime. We consider applying the key to only the numerical features, as well as rounding it for the categorical features. We present the results in table 5. We found the manipulation to be successful in 3 out of 4 cases, the exception being the rounding variant for Counterfactuals Guided by Prototypes. These results demonstrate the manipulation can succeed with categorical features.
                        Only numerical       Rounding
                        Proto.    DiCE       Proto.    DiCE
German Credit
  Protected
  Non-Protected
  Disparity
  Non-Protected + key
  Cost Reduction        2.3       1.5        0.92      1.5
  Test Accuracy
Model                S-Wachter                   Proto.
Initialization       Mean     Rnd.     + key     Mean     Rnd.     + key
Protected
Non-Protected
Disparity            17.03    1.83     2.05      8.29     13.78    7.45
Non-Protected + key
Cost Reduction       1.15     7.43     1.04      0.74     1.12     0.82
Test Accuracy
Model            Wachter                 S-Wachter               Proto.                  DiCE
Initialization   Mean    Rnd.    + key   Mean    Rnd.    + key   Mean    Rnd.    + key   Rnd.
Protected        1.94    1.18    1.22    5.58    0.83    2.18    3.24    3.81    3.62    39.53
Not-Prot.        1.29    1.27    1.42    2.29    0.95    3.24    4.64    7.42    3.47    36.53
Disparity        0.65    0.18    0.19    3.29    0.12    1.06    1.39    3.61    0.14    21.43
Not-Prot. + key  0.96    3.79    1.03    1.30    1.36    1.26    3.52    5.74    2.54    3.00
Cost Reduction   1.34    1.07    1.38    1.31    0.70    1.26    1.32    1.29    1.36    1.70
Accuracy         66.5    67.0    68.5    66.5    67.7    67.7    66.3    65.8    65.8    66.8
                 0.81    0.80    0.36    0.81    0.80    0.54    0.98    0.43    0.83    2.9
Appendix F Compute Details
We run all experiments in this work on a machine with a single NVIDIA 2080Ti GPU.