1 Introduction
Machine learning (ML) methods have found their way into numerous everyday applications and have become an indispensable asset in various sensitive domains, like disease diagnostics fatima2017survey, criminal justice berk2012criminal, or credit risk scoring khandani20102767. While ML models bear the great potential to provide effective support in human decision making processes, their predictions may have considerable impact on personal lives, where the final decisions might be disadvantageous for an end user. For example, the rejection of a loan or the denial of parole might have negative effects on the future development of the corresponding person’s life.
When ML systems involve humans in the loop, it is crucial to build a strong foundation for long-term acceptance of these methods. To this end, it is critical (1) to explain the predictions of a model and (2) to offer constructive means for the improvement of those predictions to the advantage of the end user. Counterfactual explanations, popularized by the seminal work of wachter2017counterfactual, provide means for prescriptive model explanations by suggesting actionable feature changes (e.g., increase income) that allow individuals to achieve favourable outcomes in the future (e.g., insurance approval).
When counterfactual explainability is employed in systems that involve humans in the loop, the community refers to it as recourse. Algorithmic recourse subsumes precise recipes on how to obtain desirable outcomes after being subjected to an automated decision, emphasizing feasibility constraints that have to be taken into account. Those explanations are found by making the smallest possible change to an input vector to influence the prediction of a pretrained classifier in a positive way; for example, from ‘loan denial’ to ‘loan approval’, subject to the constraint that an individual’s sex may not change. As documented in recent reviews, there exists a quickly growing literature with available methods (see Figure 1 and (stepin2021survey; karimi2020survey; verma2020counterfactual)), reflecting the insight that the understanding of complex machine learning models is an elementary ingredient for a wide and safe technology adoption.
In practice, the counterfactual explanation (CE) that an individual receives crucially depends on the method that computes the recourse suggestions. Hence, there is a substantial need for a standardized benchmarking platform, which ensures that methods can be compared in a transparent and meaningful way. Researchers need to be able to easily evaluate their proposed methods against the overwhelming diversity of already available methods, and practitioners need to make sure that they are using the right recourse mechanism for the problem at hand. Therefore, a standardized framework for comparison and quality assurance is an essential prerequisite.
In this work, we present CARLA (Counterfactual And Recourse LibrAry), a Python library with the following merits: First, CARLA provides competitive baselines for researchers to benchmark new counterfactual explanation and recourse methods for the standardized and transparent comparison of CE methods on different integrated data sets. Second, CARLA is a common framework with more than 10 counterfactual explanation methods in combination with the possibility to easily integrate new methods into a commonly accessible and easily distributable Python library. Moreover, the built-in evaluation measures allow users to plug their custom black-box predictive models into the available counterfactual explanation methods and conduct extensive evaluations in comparison with other recourse mechanisms across different data sets. The same is true for researchers, who can use CARLA to extensively benchmark available counterfactual methods on popular data sets across various ML models. Third, CARLA supports popular optimization frameworks such as TensorFlow abadi2016tensorflow and PyTorch paszke2019pytorch, and provides a generic abstraction layer to support custom implementations. Users can define problem-specific data set characteristics like immutable features and explicitly specify hyperparameters for the chosen counterfactual explanation method.
The remainder of this work is structured as follows: Section 2 presents related work, Section 3 formally introduces the recourse problem, and Section 4 presents the benchmarking process. In Section 5, we describe our main findings, before concluding in Section 6. Appendices A to E describe CARLA’s software architecture and usage instructions, as well as additional experimental results, the ML classifiers used, data sets and hyperparameter settings.
2 Related Work
Explainable machine learning is concerned with the problem of providing explanations for complex ML models. Towards this goal, various streams of research follow different explainability paradigms which can be categorized into the following groups guidotti2018survey; gade2019explainable.
2.1 Feature Highlighting Explanations
Local input attribution techniques seek to explain the behaviour of ML models instance by instance. Those methods aim to understand how all inputs available to the model are being used to arrive at a certain prediction. Some popular approaches for model explanations aim at explainability by design lou2012intelligible; alvarez2018towards; broelemann2018gradient; wang2019designing. For white-box models, where the internal model parameters are known, gradient-based approaches, e.g. kasneci2016licon; chattopadhay2018grad (for deep neural networks), and rule-based or probabilistic approaches for tree ensembles, e.g. hara2018making; deng2019interpreting, have been proposed. In cases where the parameters of the complex models cannot be accessed, model-agnostic approaches can prove useful. This group of approaches seeks to explain a model’s behavior locally by applying surrogate models ribeiro2016should; lundberg2017unified; ribeiro2018anchors; lundberg2020local, which are interpretable by design and are used to explain individual predictions of black-box ML models.
2.2 Counterfactual Explanations
The main purpose of counterfactual explanations is to suggest constructive interventions to the input of a complex model so that the output changes to the advantage of an end user. By emphasizing both the feature importance and the recommendation aspect, counterfactual explanation methods can be further divided into three different groups: independence-based, dependence-based, and causality-based approaches.
In the class of independence-based methods, where the input features of the predictive model are assumed to be independent, some approaches use combinatorial solvers or evolutionary algorithms to generate recourse in the presence of feasibility constraints ustun2019actionable; russell2019efficient; rawal2020individualized; karimi2019model; kenny2020generating; dandl2020multi. Notable exceptions from this line of work are proposed by tolomei2017interpretable; laugel2017inverse; lash2017generalized; gupta2019equalizing; ghazimatin2020prince, who use decision trees, random search, support vector machines (SVM) and information networks that are aligned with the recourse objective. Another line of research deploys gradient-based optimization to find low-cost counterfactual explanations in the presence of feasibility and diversity constraints Dhurandhar2018; mittelstadt2019explaining; mothilal2020fat; schut2021generating; van2019interpretable; pawelczyk2021connections. The main problem with these approaches is that they abstract from input correlations. This implies that the intervention costs (i.e., the costs of changing the input to achieve the proposed counterfactual state) are estimated too optimistically. In other words, the estimated costs do not reflect the true costs that an individual would need to incur in practical scenarios, where feature dependencies are usually present: e.g., income is dependent on tenure, and if income changes, tenure also changes (see Figure 2 for a schematic comparison).
In the class of causality-based approaches, all methods make use of Pearl’s causal modelling framework (pearl2009causality). As such, they usually require knowledge of the system of causal structural equations (joshi2019towards; goyal2019explaining; karimi2020intervention; oshaughnessy2020generative) or the causal graph (karimi2020probabilistic). The authors of karimi2020intervention show that these models can generate minimum-cost recourse if access to the true causal data-generating process is available. However, in practical scenarios, the guarantee for such minimum-cost recommendations is vacuous, since, in complex settings, the causal model is likely to be misspecified (karimi2020probabilistic). Since these methods usually require the true causal graph, which is the limiting factor in practice, we have not considered them at this point, but we plan to do so in the future.
Dependence-based methods bridge the gap between the strong independence assumption and the strong causal assumption. This class of models builds recourse suggestions on generative models pawelczyk2019; downs2020interpretable; joshi2019towards; mahajan2019preserving; pmlrv124pawelczyk20a. The main idea is to change the geometry of the intervention space to a lower dimensional latent space, which encodes different factors of variation while capturing input dependencies. To this end, these methods primarily use variational autoencoders (VAE) (kingma2013auto; nazabal2018handling). In particular, mahajan2019preserving demonstrate how to encode various feasibility constraints into VAE-based models. Most recently, antoran2020getting proposed CLUE, a generative recourse model that takes a classifier’s uncertainty into account. Work that deviates from this line of research was done by Poyiadzi2020; Kanamori2020. The authors of Poyiadzi2020 provide FACE, which uses a shortest path algorithm on graphs to find counterfactual explanations. In contrast, Kanamori2020 use integer programming techniques to account for input dependencies.
3 Preliminaries
In this Section, we review the algorithmic recourse problem and draw a distinction between two observational (i.e., non–causal) methods.
3.1 Counterfactual Explanations for Independent Inputs
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be the data set consisting of $N$ input data points, $x_i \in \mathbb{R}^d$. We denote by $f: \mathbb{R}^d \to [0, 1]$ the fixed classifier for which recourse is to be determined. We denote the set of outcomes by $y \in \{0, 1\}$, where $y = 1$ indicates the desirable outcome. Moreover, $\hat{y} = \mathbb{I}[f(x) > \theta]$ is the predicted class, where $\mathbb{I}[\cdot]$ denotes the indicator function and $\theta$ is a threshold (e.g., 0.5). Our goal is to find a set of actionable changes in order to improve the outcomes of instances $x$, which are assigned an undesirable prediction under $f$. Moreover, one typically defines a distance measure in input space, $c: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{+}$. We discuss typical choices for $c$ in Section 4.
Assuming inputs are pairwise statistically independent, the recourse problem is defined as follows:
$$\delta_x^{*} = \arg\min_{\delta_x \in \mathcal{A}_d} c(x, \check{x}) \quad \text{s.t.} \quad \check{x} = x + \delta_x, \quad f(\check{x}) > \theta, \tag{I}$$
where $\mathcal{A}_d$ is the set of admissible changes made to the factual input $x$. For example, $\mathcal{A}_d$ could specify that no changes to sensitive attributes such as age or sex may be made. Under the independent input assumption, existing approaches (ustun2019actionable) use mixed-integer linear programming to find counterfactual explanations. In the next subsection, we present a problem formulation that relaxes the strong independence assumption by introducing generative models.
3.2 Recourse for Correlated Inputs
We assume the factual input $x \in \mathbb{R}^d$ is generated by a generative model $g$ such that:
$$x = g(z),$$
where $z \in \mathbb{R}^k$ are latent codes. We denote the counterfactual explanation in input space by $\check{x} = x + \delta_x$. Thus, we have $\check{x} = x + \delta_x = g(z + \delta_z)$. Assuming inputs are dependent, we can rewrite the recourse problem in (I) to faithfully capture those dependencies using the generative model $g$:
$$\delta_z^{*} = \arg\min_{\delta_z \in \mathcal{A}_k} c(x, \check{x}) \quad \text{s.t.} \quad \check{x} = g(z + \delta_z), \quad f(\check{x}) > \theta, \tag{D}$$
where $\mathcal{A}_k$ is the set of admissible changes in the $k$-dimensional latent space. For example, $\mathcal{A}_k$ would ensure that the counterfactual latent code lies within range of $z$. The problem in (D) is an abstraction from how the problem is usually solved in practice: most existing approaches first train a type of autoencoder model (e.g., a VAE), and then use the model’s trained decoder as a deterministic function $g$ to find counterfactual explanations (joshi2019towards; pawelczyk2019; mahajan2019preserving; downs2020interpretable; antoran2020getting). Our benchmarked explanation models roughly fit in one of these two categories.
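To make the latent-space formulation of Section 3.2 concrete, the following is a minimal sketch under strong simplifying assumptions: a one-dimensional latent code, a hand-written deterministic decoder `decoder` that couples two inputs, and a toy logistic classifier `classifier`. All names, weights and the line-search strategy are illustrative assumptions and are not taken from CARLA or from any of the cited methods.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def classifier(x, w=(0.5, 0.5), b=-3.0):
    """Toy logistic classifier with output in [0, 1] (weights are illustrative)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def decoder(z):
    """Toy deterministic decoder: one latent code drives two correlated inputs."""
    return (z, 2.0 * z)


def latent_recourse(z, theta=0.5, step=0.01, max_steps=10_000):
    """Scan increasing |delta_z|; return the first change whose decoded point crosses theta."""
    for i in range(1, max_steps + 1):
        for sign in (1.0, -1.0):
            delta_z = sign * i * step
            if classifier(decoder(z + delta_z)) > theta:
                return delta_z
    return None


z = 1.0
delta_z = latent_recourse(z)
x, x_cf = decoder(z), decoder(z + delta_z)
print(delta_z, x, x_cf)
```

Because the search moves in latent space, both input features change together in the decoded counterfactual, which is the dependency-preserving behaviour the formulation is meant to capture.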
Figure 3: Cost distributions of the counterfactual explanation methods for a logistic regression and an artificial neural network classifier. The white dots indicate the medians (lower is better), and the black boxes indicate the interquartile ranges. We distinguish between independence-based and dependence-based methods. The results are discussed in Section 5.
4 Benchmarking Process
In this Section, we provide a brief explanation model overview and introduce a variety of explanation measures used to evaluate the quality of the generated counterfactual explanations. In Table 1 we present a concise explanation model overview.
Table 1: Overview of the explanation models.
Approach         Method    Model Type      Algorithm       Immutable  Categorical  Other
Independent (I)  AR        Linear          Integer prog.   Yes        Binary       Direction of change
Independent (I)  AR-LIME   Agnostic        Integer prog.   Yes        Binary       Direction of change
Independent (I)  CEM       Gradient based  Gradient based  No         No           None
Independent (I)  DICE      Gradient based  Gradient based  Yes        Binary       Generative model
Independent (I)  GS        Agnostic        Random search   Yes        Binary       None
Independent (I)  Wachter   Gradient based  Gradient based  No         Binary       None
Dependent (D)    CEM-VAE   Gradient based  Gradient based  No         No           Gen. model regularizer
Dependent (D)    CLUE      Gradient based  Gradient based  No         No           Generative model
Dependent (D)    FACE-EPS  Agnostic        Graph search    Binary     Binary       CE is from data set
Dependent (D)    FACE-KNN  Agnostic        Graph search    Binary     Binary       CE is from data set
Dependent (D)    REVISE    Gradient based  Gradient based  Binary     Binary       Generative model
4.1 Counterfactual Explanation Methods
AR
Ustun2019ActionableRI provide a method to generate minimal-cost actions for linear classification models such as logistic regression models. AR requires the linear model’s coefficients and uses them in its search for counterfactual explanations. To provide reasonable actions, the search can be restricted to user-specified constraints (e.g., has_phd can only change from False to True), and a subset of inputs can be set as immutable (e.g., age). Finding these changes is a discrete optimization problem: given a set of actions, AR finds the action that minimizes a defined cost function, using integer programming solvers like CPLEX or CBC.
AR-LIME
Most classification tasks do not have linearly separable classes, and complex non-linear models usually provide more accurate predictions. Non-linear models are not per se interpretable and usually do not provide coefficients similar to linear models. We use a reduction to apply AR to non-linear models by computing a local linear approximation for the point of interest $x$, using LIME ribeiro2016should. For an arbitrary black-box model $f$, LIME estimates post-hoc local explanations in the form of a set of linear coefficients per instance. Using these coefficients, we apply AR.
CEM
Dhurandhar2018 use an objective inspired by elastic-net regularization to find low-cost counterfactual instances. Different weights can be assigned to the $\ell_1$ and $\ell_2$ norms, respectively. There exists no immutable feature handling. However, we provide support for their VAE-type regularizer, which should help ensure that counterfactual instances look more realistic.
CLUE
antoran2020getting propose CLUE, a generative recourse model that takes a classifier’s uncertainty into account. This model suggests feasible counterfactual explanations that are likely to occur under the data distribution. The authors use a variational autoencoder (VAE) to estimate the generative model. Using the VAE’s decoder, CLUE uses an objective that guides the search of CEs towards instances that have low uncertainty measured in terms of the classifier’s entropy.
DICE
Mothilal2020ExplainingML suggest DICE, an explanation model that seeks to generate minimum-cost counterfactual explanations according to the independent-input recourse problem of Section 3.1, subject to a diversity constraint that aims to promote a diverse set of counterfactual explanations. Diversity is achieved by using the whole range of suggested changes, while still keeping proximity to a given input. Regarding the optimization problem, DICE uses gradient descent to find a solution that trades off proximity and diversity. Domain knowledge, in the form of feature ranges or immutability constraints, can be added.
FACE
The authors of Poyiadzi2020 provide FACE, which uses a shortest path algorithm (for graphs) to find counterfactual explanations from high-density regions. Those explanations are actual data points from either the training or test set. Immutability constraints are enforced by removing incorrect neighbors from the graph. We implemented two variants of this model: the first variant uses an epsilon-graph (FACE-EPS), whereas the second variant uses a k-NN graph (FACE-KNN).
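The graph-search idea behind the k-NN variant can be illustrated with a small, self-contained sketch. It is a simplification and not the FACE or CARLA implementation: edges carry plain Euclidean distances (no density weighting), immutability filtering is omitted, and the helper names `knn_graph` and `shortest_path_counterfactual` are hypothetical.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def shortest_path_counterfactual(graph, start, is_target):
    """Dijkstra from `start`; return (node, path length) of the closest target node."""
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, math.inf):
            continue  # stale heap entry
        if node != start and is_target(node):
            return node, dist
        for nbr, weight in graph[node].items():
            new_dist = dist + weight
            if new_dist < best.get(nbr, math.inf):
                best[nbr] = new_dist
                heapq.heappush(heap, (new_dist, nbr))
    return None, math.inf


# Four data points on a line; the last two are predicted as the desirable class.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
predicted = [0, 0, 1, 1]
graph = knn_graph(points, k=2)
node, length = shortest_path_counterfactual(graph, 0, lambda i: predicted[i] == 1)
print(points[node], length)
```

The returned counterfactual is always an existing data point, which mirrors the "CE is from data set" property listed for FACE in Table 1.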
Growing Spheres (GS)
Growing Spheres, suggested in (laugel2017inverse), is a random search algorithm that generates samples around the factual input point $x$ until a point with the counterfactual class label is found. The random samples are generated around $x$ using growing hyperspheres. For binary input dimensions, the method makes use of Bernoulli sampling. Immutable features are readily specified by excluding them from the search procedure.
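The random search described above can be sketched for continuous features as follows. This is a simplified illustration, not the original or the CARLA implementation: it samples in concentric shells of fixed width, ignores binary and immutable features, and the classifier `predict` as well as all parameter values are assumptions made for the example.

```python
import math
import random


def sample_in_shell(center, r_low, r_high, rng):
    """Draw one point whose distance from `center` lies in [r_low, r_high]."""
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    radius = rng.uniform(r_low, r_high)
    return [c + radius * d / norm for c, d in zip(center, direction)]


def growing_spheres(x, predict, step=0.1, n_samples=200, max_radius=10.0, seed=0):
    """Widen the search shell around x until a sample with predict(sample) == 1 appears."""
    rng = random.Random(seed)
    r_low = 0.0
    while r_low < max_radius:
        r_high = r_low + step
        candidates = [sample_in_shell(x, r_low, r_high, rng) for _ in range(n_samples)]
        hits = [c for c in candidates if predict(c) == 1]
        if hits:
            # Among the successful samples, keep the one closest to the factual.
            return min(hits, key=lambda c: math.dist(c, x))
        r_low = r_high
    return None


def predict(x):
    """Hypothetical black-box classifier: desirable class when x[0] + x[1] > 2."""
    return 1 if x[0] + x[1] > 2.0 else 0


x = [0.5, 0.5]
x_cf = growing_spheres(x, predict)
print(x_cf, math.dist(x_cf, x))
```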
REVISE
joshi2019towards propose a generative recourse model. This model suggests feasible counterfactual explanations that are likely to occur under the data distribution. The authors use a variational autoencoder (VAE) to estimate the generative model. Using the VAE’s decoder, REVISE uses the latent space to search for CEs. No handling of immutable features exists.
Wachter et al. (Wachter)
The optimization approach suggested by Wachter2017CounterfactualEW generates counterfactual explanations by minimizing an objective function using gradient descent to find counterfactuals that are as close as possible to the factual input $x$. Closeness is measured in the $\ell_1$ norm.
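A gradient-based search of this kind can be sketched for a logistic regression classifier. The objective used here, a squared prediction loss plus an $\ell_1$ distance with a trade-off weight `lam` that is increased until the prediction crosses the threshold, is a simplified variant chosen for illustration; the function name, step sizes and weights are assumptions and do not reproduce the original or the CARLA implementation.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sign(v):
    return (v > 0) - (v < 0)


def wachter_counterfactual(x, w, b, theta=0.5, lam=1.0, lr=0.05, steps=500, max_rounds=10):
    """Minimise lam * (f(x') - 1)^2 + ||x' - x||_1 by (sub)gradient descent.

    lam is multiplied by 10 after each unsuccessful round.
    """
    for _ in range(max_rounds):
        x_cf = list(x)
        for _ in range(steps):
            f = sigmoid(sum(wi * xi for wi, xi in zip(w, x_cf)) + b)
            if f > theta:
                return x_cf
            # Gradient of the prediction loss plus subgradient of the l1 distance.
            grad = [
                2.0 * lam * (f - 1.0) * f * (1.0 - f) * wi + sign(xc - xi)
                for wi, xc, xi in zip(w, x_cf, x)
            ]
            x_cf = [xc - lr * g for xc, g in zip(x_cf, grad)]
        lam *= 10.0
    return None


w, b = [1.0, 2.0], -4.0
x = [1.0, 1.0]
x_cf = wachter_counterfactual(x, w, b)
print(x_cf)
```

With a small `lam` the distance term dominates and the iterate stays near the factual input; only after `lam` is increased does the search cross the decision threshold, which is why the outer loop is needed.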


4.2 Evaluation Measures for Counterfactual Explanation Methods
As algorithmic recourse is a multi-modal problem, we introduce a variety of measures to evaluate the methods’ performances. We use six baseline evaluation measures. Besides distance measures, it is important to consider measures that emphasize the quality of recourse.
Costs
When answering the question of generating the nearest counterfactual explanation, it is essential to define the distance of the factual $x$ to the nearest counterfactual $\check{x}$. The literature has formed a consensus to use either the normalized $\ell_0$ or $\ell_1$ norm or any convex combination thereof (see for example rawal2020individualized; mothilal2020fat; pmlrv124pawelczyk20a; karimi2019model; ustun2019actionable; wachter2017counterfactual). The $\ell_0$ norm puts a restriction on the number of feature changes between factual and counterfactual instance, while the $\ell_1$ norm restricts the average change:
$$c_0(x, \check{x}) = \lVert \check{x} - x \rVert_0, \qquad c_1(x, \check{x}) = \lVert \check{x} - x \rVert_1. \tag{1}$$
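The two cost measures can be computed in a few lines. The sketch below assumes that the inputs are already normalized and uses a small tolerance `tol` to decide whether a feature has changed; both the function names and the tolerance are illustrative choices.

```python
def l0_cost(x, x_cf, tol=1e-8):
    """Number of features that differ between factual and counterfactual."""
    return sum(1 for a, b in zip(x, x_cf) if abs(a - b) > tol)


def l1_cost(x, x_cf):
    """Sum of absolute feature changes (inputs assumed to be normalized)."""
    return sum(abs(a - b) for a, b in zip(x, x_cf))


x = [0.2, 0.5, 1.0]
x_cf = [0.2, 0.75, 0.0]
print(l0_cost(x, x_cf), l1_cost(x, x_cf))
```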
Constraint violation
This measure counts the number of times the CE method violates user-defined constraints. Depending on the data set, we fixed a list of features which should not be changed by the used method (e.g., sex, age or race).
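A minimal sketch of this count, assuming instances are given as dictionaries of numeric features and `immutable` is the user-defined list of protected feature names (the feature names below are made up for the example):

```python
def constraint_violations(x, x_cf, immutable, tol=1e-8):
    """Count immutable features whose value differs between x and x_cf."""
    return sum(1 for name in immutable if abs(x[name] - x_cf[name]) > tol)


x = {"age": 39, "sex": 1, "hours_per_week": 40}
x_cf = {"age": 41, "sex": 1, "hours_per_week": 45}
violations = constraint_violations(x, x_cf, immutable=["age", "sex"])
print(violations)
```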
yNN
We use a measure that evaluates how much data support CEs have from positively classified instances. Ideally, CEs should be close to positively classified individuals, which is a desideratum formulated by laugel2019dangers. We define the set of individuals who received an undesirable prediction under $f$ as $H^{-}$. The counterfactual instances (instances for which the label was successfully changed) corresponding to the set $H^{-}$ are denoted by $\check{x}_i$, $i \in H^{-}$. We use a measure that captures how differently neighborhood points around a counterfactual instance $\check{x}_i$ are classified:
$$\text{yNN} = 1 - \frac{1}{nk} \sum_{i \in H^{-}} \sum_{j \in \text{kNN}(\check{x}_i)} \lvert f_b(\check{x}_i) - f_b(x_j) \rvert, \tag{2}$$
where $n$ is the number of counterfactual instances, kNN denotes the $k$ nearest neighbours of $\check{x}_i$, and $f_b(\cdot)$ is the binarized classifier. Values of yNN close to 1 imply that the neighbourhoods around the counterfactual explanations consist of points with the same predicted label, indicating that the neighborhoods around these points have already been reached by positively classified instances. We use a fixed value of $k$, which ensures sufficient data support from the positive class.
Redundancy
We evaluate how many of the proposed feature changes were not necessary. This is a particularly important criterion for independence-based methods. We measure this by successively flipping one value of $\check{x}$ after another back to its value in $x$, and then we inspect whether the label flipped from 1 back to 0: e.g., we check whether flipping the value for the second dimension would change the counterfactual outcome $y = 1$ back to the predicted factual outcome $y = 0$: $f_b([\check{x}_1, x_2, \check{x}_3, \dots, \check{x}_d]) = 0$. If the predicted outcome does not change, we increase the redundancy counter, concluding that a sparser counterfactual explanation could have been found. We iterate this process over all dimensions of the input vector. (Footnote 1: We do not consider all possible subsets of changes.) A low number indicates few redundancies across counterfactual instances.
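Both the yNN measure and the redundancy count described above can be sketched in plain Python. The sketch assumes a binarized classifier `predict` returning 0 or 1, Euclidean nearest neighbours, and a user-chosen `k`; the toy classifier, data and helper names are illustrative assumptions, not CARLA's implementation.

```python
import math


def ynn(counterfactuals, data, predict, k):
    """1 minus the average label disagreement between each CE and its k nearest neighbours."""
    total = 0
    for x_cf in counterfactuals:
        neighbours = sorted(data, key=lambda p: math.dist(p, x_cf))[:k]
        total += sum(abs(predict(x_cf) - predict(p)) for p in neighbours)
    return 1.0 - total / (len(counterfactuals) * k)


def redundancy(x, x_cf, predict, tol=1e-8):
    """Count changed features that can be reset to the factual value without losing the label."""
    target = predict(x_cf)
    count = 0
    for j, (a, b) in enumerate(zip(x, x_cf)):
        if abs(a - b) <= tol:
            continue  # feature was not changed by the method
        trial = list(x_cf)
        trial[j] = a  # flip one feature back to its factual value
        if predict(trial) == target:
            count += 1
    return count


def predict(p):
    """Hypothetical binarized classifier: desirable class when the first feature exceeds 1."""
    return 1 if p[0] > 1.0 else 0


x = [0.0, 0.0]
x_cf = [1.5, 0.3]
data = [[1.4, 0.2], [1.6, 0.4], [1.2, 0.3], [0.0, 0.0], [0.1, 0.1]]
print(ynn([x_cf], data, predict, k=3), redundancy(x, x_cf, predict))
```

In this toy example the change to the second feature is redundant, since resetting it leaves the counterfactual on the desirable side of the decision boundary.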
Success Rate
Some generated counterfactual explanations do not alter the predicted label of the instance as anticipated. To keep track of how often a generated CE holds its promise, the success rate reports the fraction of instances for which the respective method returned a counterfactual that actually changes the predicted label.
Average Time
By measuring the average time a CE method needs to generate its result, we evaluate the effectiveness and feasibility for real-time prediction settings.
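Success rate and average time can be collected in a single pass over the factual instances. In this sketch, `method` is any callable that maps a factual to a counterfactual (or `None`), and `predict` is the binarized classifier; both are stand-ins invented for the example, and only the call to `method` is timed.

```python
import time


def benchmark(method, factuals, predict):
    """Return (success rate, average seconds per instance) for a recourse method."""
    successes = 0
    total_time = 0.0
    for x in factuals:
        start = time.perf_counter()
        x_cf = method(x)
        total_time += time.perf_counter() - start
        if x_cf is not None and predict(x_cf) == 1:
            successes += 1
    return successes / len(factuals), total_time / len(factuals)


def predict(x):
    """Hypothetical classifier: desirable class when the single feature exceeds 1."""
    return 1 if x[0] > 1.0 else 0


def method(x):
    """Hypothetical recourse method: always add 1 to the single feature."""
    return [x[0] + 1.0]


rate, avg_time = benchmark(method, [[0.5], [0.9], [-0.5]], predict)
print(rate, avg_time)
```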
5 Experimental Evaluation
Using CARLA, we conduct extensive empirical evaluations to benchmark the presented counterfactual explanation methods using three real-world data sets. Our main findings are displayed in Figure 3 and Table 2. We split the benchmarking evaluation by CE method category. In the following sections, we provide an overview of the data sets used (see Table 3) and the classification models. Detailed information on hyperparameter search for the CE methods is provided in Appendix E.
Data sets
The Adult data set Dua:2019 originates from the 1994 Census database, consisting of 14 attributes and 48,842 instances. The classification task consists of deciding whether an individual has an income greater than 50,000 USD/year. Since several CE methods cannot handle non-binary categorical data, we binarized these features by partitioning them into the most frequent value and its counterpart (e.g., US and Non-US, Husband and Non-Husband). The features age, sex and race are set as immutable. The Give Me Some Credit (GMC) data set Kaggle2011 from a 2011 Kaggle Competition is a credit scoring data set, consisting of 150,000 observations and 11 features. The classification task consists of deciding whether an instance will experience financial distress within the next two years (SeriousDlqin2yrs is 1) or not. We dropped missing data, and set age as immutable.
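The binarization of categorical features described above (most frequent value versus everything else) can be sketched as follows; the helper name `binarize` and the sample column are illustrative and not taken from the CARLA preprocessing code.

```python
from collections import Counter


def binarize(values):
    """Map a categorical column to 1 for its most frequent value and 0 otherwise."""
    most_frequent, _ = Counter(values).most_common(1)[0]
    return most_frequent, [1 if v == most_frequent else 0 for v in values]


countries = ["US", "Mexico", "US", "India", "US"]
top, encoded = binarize(countries)
print(top, encoded)
```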
Table 3: Overview of the data sets.
Data Set  Task                        Positive Class       Size           Features                          Immutable Features
Adult     Predict Income              High Income (24%)    (45,222, 20)   Work, Education, Income           Sex, Age, Race
COMPAS    Predict Recidivism          No Recid. (65%)      (10,000, 8)    Crim. History, Jail & Prison Time  Sex, Race
GMC       Predict Financial Distress  No deficiency (93%)  (150,000, 11)  Pay. History, Balance, Loans      Age
Black-box models
We briefly describe how the black-box classifiers were trained. CARLA supports different ML libraries (e.g., PyTorch, TensorFlow) to estimate these classifiers, as the implementations of the various explanation methods work with particular ML libraries only. The first model is a multi-layer perceptron consisting of three hidden layers with 18, 9 and 3 neurons, respectively. To allow a more extensive comparison between CE methods (AR only works on linear models), we chose logistic regression models as the second classification model for which we evaluate the CE methods. Detailed information on the classifiers’ training for each data set is provided in Appendix C.
Benchmarking
For independence-based methods, we find that no single CE method outperformed all its competitors. This is not too surprising since algorithmic recourse is a multi-modal problem. Instead, we found that some methods dominated certain measures across all data sets. AR, AR-LIME and DICE performed strongest with respect to $\ell_0$ (see the top left panels in Figures 3(a) and 3(b)). AR-LIME does so despite our use of the LIME reduction. It is therefore consistent that AR, AR-LIME and DICE offer the lowest redundancy scores (Table 2(a)). CEM performed strongest with respect to the overall cost measure across data sets. GS is the clear winner when it comes to the measurement of time (Table 2(a)). Since the algorithm behind GS is based on a rather rudimentary sampling strategy, we expect that savvier sampling strategies should boost its cost performance significantly.
For dependence-based methods, the results are mixed as well. While CLUE and REVISE are the winners with respect to the cost of recourse, the margins between these generative recourse models and the graph-based ones (FACE) are small (Figure 3). The FACE-EPS method performs strongest with respect to the yNN measure (usually well above 0.60) (Table 2(b)), indicating that the generated CEs have sufficient data support from positively classified individuals relative to the remaining dependence-based methods. As expected, the yNN measures are on average higher for the dependence-based methods. This suggests that dependence-based CEs are less often outliers. Notably, CLUE and REVISE perform best with respect to $\ell_1$ (with REVISE being the clear winner in 3 out of 4 cases), while they perform worst on $\ell_0$, likely due to the decoder’s imprecise reconstruction. In this respect, it is not surprising that these methods have average redundancy values that are up to twice as high as those of FACE. Finally, the generative model approaches (CEM-VAE, CLUE, REVISE) performed best with respect to time since the autoencoder training time amortizes with more samples.
6 Conclusion and Broader Impact of CARLA
The current implementations of recourse methods mentioned in Section 4.1 are based on the original implementations of the respective research groups. Researchers mostly implement their experiments and models for specific ML frameworks and data sets. For example, some explanation methods are restricted to TensorFlow and are not applicable to PyTorch models. In the future, we will extend CARLA to decouple each recourse method from the framework and data constraints.
When trying to combine different CE methods into a common benchmarking framework, we encountered the following issues: First, a great number of repositories only contain remarks about installation and script calls to recreate the results from the corresponding research papers. Second, missing information about interfaces for data sets or black-box models further complicated the process of integrating different CE methods into the benchmarking workflow. In order to add more CE methods and data sets to CARLA, we are currently in contact with several authors in this exciting and rapidly growing field. With a growing open-source community, CARLA can evolve to be the main library for generating counterfactual explanations and benchmarks for recourse methods. Therefore, we are continuously expanding the catalog of explanation methods and data sets, and welcome researchers to add their own recourse methods to the library. To facilitate this process, we provide a step-by-step user guide to integrate new CE methods into CARLA, which we present in Appendix A.
The rapidly growing number of available CE methods calls for standardized and efficient ways to assure the quality of a new technique in comparison with other approaches on different data sets. Quality assurance is a key aspect of actionable recourse, since complex models and CE mechanisms can have a considerable impact on personal lives. In this work, we presented CARLA, a versatile benchmarking platform for the standardized and transparent comparison of CE methods on different integrated data sets. In the explainability field, CARLA bears the potential to help researchers and practitioners alike to efficiently derive more realistic and use-case-driven recourse strategies and assure their quality through extensive comparative evaluations. We hope that this work contributes to further advances in explainability research.
References
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? As we state in the abstract, our goal is to provide a Python framework for benchmarking counterfactual explanation methods. Users can easily evaluate our results by accessing our GitHub repository, where we host our Python framework and our benchmarking results.

Did you describe the limitations of your work? In Section 6, we discuss the current limitations of our approach. The counterfactual explanation methods are based on the original implementations of the respective research groups. Researchers mostly implement their experiments and models for specific ML frameworks and data sets. For example, some explanation methods are restricted to TensorFlow and are not applicable to PyTorch models.

Did you discuss any potential negative societal impacts of your work? We discuss the broader impact of our benchmarking library in Section 6; we mainly see positive impacts on the literature of algorithmic recourse.

Have you read the ethics review guidelines and ensured that your paper conforms to them? We have read the ethics review guidelines and attest that our paper conforms to the guidelines.


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? We did not provide theoretical results.

Did you include complete proofs of all theoretical results? We did not provide theoretical results.


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Details of implementations, data sets and instructions can be found in Appendices A, C, E, and our GitHub repository.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Error bars have been reported for our cost comparisons in terms of the 25th and 75th percentiles of the cost distribution, see for example Figure 3.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? All models are evaluated on an i7-8550U CPU with 16 GB RAM, running on Windows 10.


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? The data sets, which are publicly available, are appropriately cited in Section 5. We cite and link to any additional code used, for example antoran2020getting.

Did you mention the license of the assets? All assets are publicly available and attributed.

Did you include any new assets either in the supplemental material or as a URL? Our implementation and code are accessible through our GitHub repository.

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? We use publicly available data sets without any personally identifying information.

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? We use publicly available data sets without any personally identifying information.


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? We did not use crowdsourcing or conduct research with human subjects.

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? We did not use crowdsourcing or conduct research with human subjects.

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? We did not use crowdsourcing or conduct research with human subjects.

Appendix A CARLA’s Software Interface
In the following, we introduce our open-source benchmarking software CARLA. We describe the architecture in more detail and provide examples of different use cases and their implementation.
A.1 CARLA’s High-Level Software Architecture
The purpose of this Python library is to provide a simple and standardized framework that allows users to apply different state-of-the-art recourse methods to arbitrary data sets and black-box models. It is possible to compare different approaches and save the evaluation results, as described in Section 4.2. For research groups, CARLA provides an implementation interface for integrating new recourse methods in an easy-to-use way, which allows them to compare their method to already existing methods.
A simplified visualization of the CARLA software architecture is depicted in Figure 4. For every component (Data, MLModel, and RecourseMethod), the library provides the possibility to use existing methods from our catalog or to extend it with the user's custom methods and implementations. The components represent an interface to the key parts in the process of generating counterfactual explanations. Data provides a common way to access the data across the software and maintains information about the features. MLModel wraps each black-box model and stores details on the encoding, scaling, and feature order specific to the model. The primary purpose of RecourseMethod is to provide a common interface to easily generate counterfactual examples.
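The division of labor between the three components can be illustrated with a minimal sketch. Only the component names Data, MLModel, and RecourseMethod come from the text above; every property and method name below (`continuous`, `immutables`, `feature_order`, `predict_proba`, `get_counterfactuals`) is an assumption made for illustration and is not taken from the CARLA code base.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence

Row = Dict[str, float]  # one instance, keyed by feature name


class Data(ABC):
    """Common access point for a data set and its feature metadata."""

    @property
    @abstractmethod
    def continuous(self) -> List[str]:
        """Names of continuous features (hypothetical property)."""

    @property
    @abstractmethod
    def immutables(self) -> List[str]:
        """Names of features that recourse must not change."""

    @property
    @abstractmethod
    def target(self) -> str:
        """Name of the label column."""

    @property
    @abstractmethod
    def rows(self) -> List[Row]:
        """The instances themselves."""


class MLModel(ABC):
    """Wraps a black-box model together with model-specific details."""

    def __init__(self, data: Data) -> None:
        self.data = data

    @property
    @abstractmethod
    def feature_order(self) -> List[str]:
        """Order in which the wrapped model expects its inputs."""

    @abstractmethod
    def predict_proba(self, row: Row) -> float:
        """Probability of the favourable class for one instance."""

    def predict(self, row: Row) -> int:
        # Hard label derived from the probability (threshold is an assumption).
        return int(self.predict_proba(row) >= 0.5)


class RecourseMethod(ABC):
    """Common entry point for generating counterfactual examples."""

    def __init__(self, model: MLModel) -> None:
        self.model = model

    @abstractmethod
    def get_counterfactuals(self, factuals: Sequence[Row]) -> List[Row]:
        """Return one counterfactual per factual instance."""
```

Under this sketch, a user-defined data set or model only has to fill in the abstract members, after which any recourse method written against the same interface can consume it.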
Besides the possibility to use pretrained black-box models and preprocessed data, CARLA provides an easy way to load and define custom data sets and model structures independent of their framework (e.g., PyTorch, TensorFlow, sklearn). The following sections give an overview and provide example implementations of different use cases.
A.2 CARLA for Research Groups
One of the most exciting features of CARLA is that research groups can make use of the RecourseMethod wrapper to implement their own method to generate counterfactual examples. This opens up a way of making standardized and consistent comparisons between different recourse methods. Strong and weak points of new algorithms can be stated, benchmarked, and analysed in forthcoming publications with the help of CARLA.
In Figure 5, we show how an implementation of a custom recourse method can be structured. After defining the recourse method in the shown way, it can be used with the library to generate counterfactuals for a given data set and to benchmark its results against other methods. Research groups can do this either with our provided catalog of data sets, recourse methods, and black-box models (Figure 6) or with their own models and data sets (see Figures 7 and 8).
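To give a flavor of what such a custom method might look like, the following is a minimal sketch of a toy recourse method. The class `SingleFeatureSearch`, its constructor arguments, and the method name `get_counterfactuals` are hypothetical and chosen for illustration only; the sketch does not reproduce the CARLA wrapper or any method from our catalog. It changes one mutable feature at a time in fixed steps and keeps the cheapest change that flips the prediction.

```python
from typing import Callable, Dict, List, Optional, Sequence

Row = Dict[str, float]


class SingleFeatureSearch:
    """Hypothetical recourse method: perturb one mutable feature at a time."""

    def __init__(
        self,
        predict: Callable[[Row], int],   # black-box classifier, 1 = favourable
        immutables: Sequence[str],       # features that must stay fixed
        step: float = 0.05,
        max_steps: int = 40,
    ) -> None:
        self.predict = predict
        self.immutables = set(immutables)
        self.step = step
        self.max_steps = max_steps

    def _search(self, factual: Row) -> Optional[Row]:
        best: Optional[Row] = None
        best_cost = float("inf")
        for name in factual:
            if name in self.immutables:
                continue
            for direction in (1.0, -1.0):
                for k in range(1, self.max_steps + 1):
                    cost = k * self.step
                    if cost >= best_cost:
                        break  # cannot beat the cheapest change found so far
                    candidate = dict(factual)
                    candidate[name] = factual[name] + direction * cost
                    if self.predict(candidate) == 1:
                        best, best_cost = candidate, cost
                        break
        return best

    def get_counterfactuals(self, factuals: Sequence[Row]) -> List[Optional[Row]]:
        # One counterfactual (or None if the search fails) per factual.
        return [self._search(factual) for factual in factuals]
```

Because the method only needs a `predict` callable and the list of immutable features, the same class could be evaluated against any wrapped model, which is the property that makes standardized comparisons possible.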
A.3 CARLA as a Recourse Library
A common usage of the package is to generate counterfactual examples. This can be done by loading black-box models and data sets from our provided catalogs, or by integrating user-defined models and data sets via the defined interfaces. Figure 6 shows an implementation example of a simple use case, applying a recourse method to a predefined data set and model from our catalog. After importing both catalogs, the only necessary step is to specify the data set name (e.g., adult, give me some credit, or compas) and the model type (e.g., ann or linear) the user wants to load. Every recourse method exposes the same properties to generate counterfactual examples.
To give users the possibility to explore their own black-box model on a custom data set, we implemented easy-to-use interfaces in CARLA that are able to wrap every possible model or data set. These interfaces specify particular properties that users have to implement to be able to work with the library. Figure 7 shows an example implementation of the data wrapper, and Figure 8 depicts the same for an arbitrary black-box model. After defining data set and black-box model classes, users simply need to call the canonical methods and generate counterfactual examples, similar to the process in Figure 6.
A.4 Benchmarking Recourse Methods
Besides the generation of counterfactual examples, the focus of CARLA lies on benchmarking recourse methods. Users can compute evaluation measures to make qualitative statements about usability and applicability.
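As an illustration of what such evaluation measures can look like, the sketch below computes a success rate and cost statistics for a batch of counterfactuals. The concrete choices here (L0 and L1 distances as costs, `None` marking a failed search, and all function names) are assumptions for this example; the measures actually used are described in Section 4.2.

```python
from statistics import quantiles
from typing import Dict, Optional, Sequence, Tuple

Row = Dict[str, float]


def l0_cost(factual: Row, counterfactual: Row) -> int:
    """Number of features that were changed."""
    return sum(1 for name in factual if factual[name] != counterfactual[name])


def l1_cost(factual: Row, counterfactual: Row) -> float:
    """Sum of absolute feature changes."""
    return sum(abs(factual[name] - counterfactual[name]) for name in factual)


def success_rate(counterfactuals: Sequence[Optional[Row]]) -> float:
    """Fraction of factuals for which a counterfactual was found."""
    if not counterfactuals:
        return 0.0
    found = sum(1 for cf in counterfactuals if cf is not None)
    return found / len(counterfactuals)


def l1_cost_quartiles(
    factuals: Sequence[Row], counterfactuals: Sequence[Optional[Row]]
) -> Tuple[float, float, float]:
    """25th percentile, median, and 75th percentile of the L1 cost.

    Failed searches (None) are skipped; at least two successes are required.
    """
    costs = [
        l1_cost(factual, cf)
        for factual, cf in zip(factuals, counterfactuals)
        if cf is not None
    ]
    q25, median, q75 = quantiles(costs, n=4)
    return q25, median, q75
```

Reporting the quartiles rather than only a mean mirrors the way cost distributions are summarized by their 25th and 75th percentiles in our figures.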
Appendix B Additional Experimental Results
In this section, we report the remaining experiments on the COMPAS data set in Figure 11 and Table 4. These results underline the trends that we have already highlighted in Section 5.


Appendix C ML Classifiers
In this section, we describe how the black-box models were fitted. CARLA supports different ML libraries to estimate these models (e.g., PyTorch, TensorFlow), since the implementation of each explanation method works with a particular ML library. We note that the various explanation methods rely on different binary feature encodings. DICE, for example, requires that binary inputs are supplied as one-hot vectors, while FACE needs binary features encoded in a single column. Whenever methods required different encodings, we fitted two ML models, one per encoding, using the same hyperparameters, and generated CEs with respect to the same set of samples.
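The difference between the two encodings can be made concrete with a small sketch. The helper names `to_one_hot` and `to_single_column` and the example feature values are hypothetical; the sketch only illustrates the two representations of a binary feature and is not the preprocessing code used in our experiments.

```python
from typing import Dict, Sequence, Union

Value = Union[str, float]
Row = Dict[str, Value]


def to_one_hot(row: Row, feature: str, categories: Sequence[str]) -> Row:
    """Replace `feature` by one 0/1 indicator column per category."""
    encoded = {name: value for name, value in row.items() if name != feature}
    for category in categories:
        encoded[f"{feature}_{category}"] = 1.0 if row[feature] == category else 0.0
    return encoded


def to_single_column(row: Row, feature: str, positive: str) -> Row:
    """Encode a binary `feature` as a single 0/1 column."""
    encoded = dict(row)
    encoded[feature] = 1.0 if row[feature] == positive else 0.0
    return encoded


example = {"age": 0.4, "sex": "male"}
one_hot = to_one_hot(example, "sex", ["female", "male"])    # two columns for sex
single = to_single_column(example, "sex", positive="male")  # one column for sex
```

Since the input dimension differs between the two representations, a model trained on one encoding cannot be reused for the other, which is why two models with identical hyperparameters are needed.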
To ensure similar behavior between the different ML libraries and encoding variations, each black-box model type has the same structure (e.g., number of hidden layers, number of neurons) and training parameters (e.g., learning rate, epochs, etc.).
The first model is a multilayer perceptron consisting of three hidden layers with 18, 9, and 3 neurons, respectively. We use ReLU activation functions and a binary cross-entropy loss on the predicted class probabilities. Optimization of the loss function is done by RMSProp tieleman2012lecture using a learning rate of 0.002 for every data set. By performing 25 epochs on COMPAS and 10 epochs on Adult and GMC, we reached acceptable performance. Further increasing the number of epochs gave rise to only marginal performance increases. For Adult we use a batch size of 1024, for COMPAS 25, and for GMC 2048.
To allow a more extensive comparison between CE methods, we chose linear models as the second black-box model category for which we evaluate the CE methods. Again, we optimized these models with RMSProp using a binary cross-entropy loss. For Adult, we used 100 epochs and a batch size of 2048; for COMPAS, we chose 25 epochs and a batch size of 128; and for GMC, we chose 10 epochs with a batch size of 2048. The learning rate on every data set is set to 0.002. Table 5 provides an overview of the models’ classification accuracies.
Table 5: Classification accuracies of the black-box models.
Model                Adult  COMPAS  Give Me Credit
Logistic Regression  0.83   0.84    0.92
Neural Network       0.84   0.85    0.93
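For reference, the training settings stated in this appendix can be collected in one place. The following sketch is only a compact restatement of the numbers given above; the `TrainConfig` class, the dictionary layout, and the keys ("ann", "linear", "adult", "compas", "gmc") are labels chosen for illustration and do not correspond to a configuration file shipped with the library.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TrainConfig:
    hidden_layers: Tuple[int, ...]  # empty tuple for the linear model
    learning_rate: float
    epochs: int
    batch_size: int
    optimizer: str = "RMSProp"
    loss: str = "binary_cross_entropy"


ANN_HIDDEN = (18, 9, 3)  # three hidden layers, ReLU activations
LEARNING_RATE = 0.002    # shared by all models and data sets

# (model type, data set) -> training settings as stated in the text
CONFIGS: Dict[Tuple[str, str], TrainConfig] = {
    ("ann", "adult"): TrainConfig(ANN_HIDDEN, LEARNING_RATE, epochs=10, batch_size=1024),
    ("ann", "compas"): TrainConfig(ANN_HIDDEN, LEARNING_RATE, epochs=25, batch_size=25),
    ("ann", "gmc"): TrainConfig(ANN_HIDDEN, LEARNING_RATE, epochs=10, batch_size=2048),
    ("linear", "adult"): TrainConfig((), LEARNING_RATE, epochs=100, batch_size=2048),
    ("linear", "compas"): TrainConfig((), LEARNING_RATE, epochs=25, batch_size=128),
    ("linear", "gmc"): TrainConfig((), LEARNING_RATE, epochs=10, batch_size=2048),
}
```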
Appendix D COMPAS Data Set Description
The COMPAS data set Angwin2016 contains data for more than 10,000 criminal defendants in Florida. It is used by the jurisdiction to score defendants’ likelihood of reoffending. We kept only a small part of the raw data, as features such as name, id, case numbers, and date-time were dropped. The classification task consists of classifying an instance as having a high risk of recidivism (score_text is high). By converting the feature race into white and non-white, we keep the categorical input binary. Similar to Adult, the immutable features for COMPAS are age, sex, and race.
Appendix E Hyperparameter Search for the Counterfactual Explanation and Recourse Methods
We generated counterfactual explanations for instances from the set of factuals with negative class predictions.
AR and AR-LIME
It frequently occurred that the action with the lowest cost did not flip the prediction of the black-box classifier. To overcome this problem, we let AR compute a flipset of 150 actions per instance and subsequently search this set for low-cost CEs. For AR-LIME, we used LIME (ribeiro2016should) and required sampling around the instance to make sure that the coefficients were truly local.
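The search over the flipset can be sketched as follows. This is a simplified illustration under stated assumptions: actions are represented as dictionaries of feature changes, the cost of an action is taken to be the sum of absolute changes, and all function names are hypothetical. It does not reproduce the cost function or data structures of the AR implementation.

```python
from typing import Callable, Dict, Optional, Sequence

Row = Dict[str, float]
Action = Dict[str, float]  # feature name -> change to apply


def apply_action(factual: Row, action: Action) -> Row:
    """Return a copy of the factual with the action's changes applied."""
    changed = dict(factual)
    for name, delta in action.items():
        changed[name] = changed[name] + delta
    return changed


def action_cost(action: Action) -> float:
    """Assumed cost: sum of absolute feature changes."""
    return sum(abs(delta) for delta in action.values())


def cheapest_flipping_action(
    factual: Row,
    flipset: Sequence[Action],
    predict: Callable[[Row], int],  # black-box classifier, 1 = favourable
) -> Optional[Row]:
    """Scan the flipset from cheapest to most expensive action and return
    the first resulting point that the black-box model classifies as 1."""
    for action in sorted(flipset, key=action_cost):
        candidate = apply_action(factual, action)
        if predict(candidate) == 1:
            return candidate
    return None  # no action in the flipset flips the prediction
```

Checking each candidate against the black-box model is the step that matters here, since the cheapest action in the flipset is not guaranteed to change the prediction.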
CEM
After performing grid search, we set the weight to 0.9 and the weight to 0.1, yielding the strongest performance. For CEM-VAE we set the weight to 0.1, and the VAE-weight to 0.9.
CLUE
We use the default hyperparameters from antoran2020getting, which are set as a function of the data set dimension. A hyperparameter search did not yield settings that improved distances while keeping the same success rate.
DICE
Although DICE is able to compute a set of counterfactuals for a given instance, we chose to generate only one CE per input instance. We use a grid search for the proximity and diversity weights.
FACE
To determine the strongest hyperparameters for the graph size we conducted a grid search. We found that values of gave rise to the best balance of success rate and costs. For the epsilon graph, a radius of 0.25 yields the strongest results to balance between high yNN and low cost.
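To illustrate the role of the radius, the sketch below builds an epsilon graph over a set of points and computes shortest-path lengths from one of them. It is a simplified illustration only: the Euclidean metric, the edge weights, and the function names are assumptions, and the sketch omits the density and feasibility terms of the actual FACE method.

```python
import heapq
from math import dist
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Graph = Dict[int, List[Tuple[int, float]]]  # node -> [(neighbour, edge length)]


def epsilon_graph(points: Sequence[Point], radius: float = 0.25) -> Graph:
    """Connect every pair of points whose Euclidean distance is <= radius."""
    graph: Graph = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = dist(points[i], points[j])
            if d <= radius:
                graph[i].append((j, d))
                graph[j].append((i, d))
    return graph


def shortest_path_lengths(graph: Graph, source: int) -> Dict[int, float]:
    """Dijkstra's algorithm; unreachable nodes are absent from the result."""
    best: Dict[int, float] = {source: 0.0}
    heap: List[Tuple[float, int]] = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbour, weight in graph[node]:
            candidate = d + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return best
```

A smaller radius yields a sparser graph in which fewer points are reachable from a given factual, while a larger radius admits longer jumps between points; this is the trade-off that the grid search balances.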
GS
We chose 0.02 as the step size with which the sphere is grown. Lower values yield similar results at the cost of higher computation time, while higher values gave worse results.
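The effect of the step size can be seen in a minimal sketch of a growing-sphere search. This is a simplified stand-in rather than the GS implementation used in our experiments: the sampling scheme (random direction, radius drawn uniformly within the current shell), the parameters `max_radius`, `n_samples`, and `seed`, and the function name are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def growing_spheres(
    factual: Sequence[float],
    predict: Callable[[Sequence[float]], int],  # 1 = favourable class
    step: float = 0.02,        # amount by which the sphere grows per round
    max_radius: float = 2.0,   # hypothetical stopping criterion
    n_samples: int = 200,      # hypothetical number of samples per shell
    seed: int = 0,
) -> Optional[List[float]]:
    """Sample in shells [r, r + step] around the factual and return the
    closest sampled point that the model assigns to the favourable class."""
    rng = random.Random(seed)
    radius = 0.0
    while radius < max_radius:
        hits = []
        for _ in range(n_samples):
            direction = [rng.gauss(0.0, 1.0) for _ in factual]
            norm = math.sqrt(sum(v * v for v in direction)) or 1.0
            r = rng.uniform(radius, radius + step)
            candidate = [x + r * v / norm for x, v in zip(factual, direction)]
            if predict(candidate) == 1:
                hits.append((r, candidate))
        if hits:
            return min(hits, key=lambda hit: hit[0])[1]
        radius += step
    return None
```

With a smaller `step`, more shells have to be sampled before the decision boundary is reached, which explains the higher computation time, whereas a larger `step` makes each shell thicker, so the returned point can lie further from the factual than necessary.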
REVISE
The grid search to find an acceptable learning rate and similarity weight yielded and for about 1500 iterations.
Wachter
For the target loss, we choose the Binary Cross Entropy with a learning rate of and an initial of . For the distance loss, we use the  norm to measure the similarity between the factual input and the counterfactual point .