CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms

08/02/2021 ∙ by Martin Pawelczyk, et al. ∙ Universität Tübingen

Counterfactual explanations provide means for prescriptive model explanations by suggesting actionable feature changes (e.g., increase income) that allow individuals to achieve favorable outcomes in the future (e.g., insurance approval). Choosing an appropriate method is a crucial aspect for meaningful counterfactual explanations. As documented in recent reviews, there exists a quickly growing literature with available methods. Yet, in the absence of widely available open-source implementations, the decision in favor of certain models is primarily based on what is readily available. Going forward, to guarantee meaningful comparisons across explanation methods, we present CARLA (Counterfactual And Recourse LibrAry), a Python library for benchmarking counterfactual explanation methods across both different data sets and different machine learning models. In summary, our work provides the following contributions: (i) an extensive benchmark of 11 popular counterfactual explanation methods, (ii) a benchmarking framework for research on future counterfactual explanation methods, and (iii) a standardized set of integrated evaluation measures and data sets for transparent and extensive comparisons of these methods. We have open-sourced CARLA and our experimental results on GitHub, making them available as competitive baselines. We welcome contributions from other research groups and practitioners.





1 Introduction

Machine learning (ML) methods have found their way into numerous everyday applications and have become an indispensable asset in various sensitive domains, like disease diagnostics (fatima2017survey), criminal justice (berk2012criminal), or credit risk scoring (khandani20102767). While ML models bear the great potential to provide effective support in human decision-making processes, their predictions may have considerable impact on personal lives, where the final decisions might be disadvantageous for an end user. For example, the rejection of a loan or the denial of parole might have negative effects on the future development of the corresponding person’s life.

When ML systems involve humans in the loop, it is crucial to build a strong foundation for long-term acceptance of these methods. To this end, it is critical (1) to explain the predictions of a model and (2) to offer constructive means for the improvement of those predictions to the advantage of the end user. Counterfactual explanations, popularized by the seminal work of wachter2017counterfactual, provide means for prescriptive model explanations by suggesting actionable feature changes (e.g., increase income) that allow individuals to achieve favourable outcomes in the future (e.g., insurance approval).

When counterfactual explainability is employed in systems that involve humans in the loop, the community refers to it as recourse. Algorithmic recourse subsumes precise recipes on how to obtain desirable outcomes after being subjected to an automated decision, emphasizing feasibility constraints that have to be taken into account. Those explanations are found by making the smallest possible change to an input vector to influence the prediction of a pretrained classifier in a positive way; for example, from ‘loan denial’ to ‘loan approval’, subject to the constraint that an individual’s sex may not change. As documented in recent reviews, there exists a quickly growing literature with available methods (see Figure 1 and (stepin2021survey; karimi2020survey; verma2020counterfactual)), reflecting the insight that the understanding of complex machine learning models is an elementary ingredient for a wide and safe technology adoption.

Figure 1: ArXiv submissions over time on explainability, counterfactual explanations and algorithmic recourse.

In practice, the counterfactual explanation (CE) that an individual receives crucially depends on the method that computes the recourse suggestions. Hence, there is a substantial need for a standardized benchmarking platform, which ensures that methods can be compared in a transparent and meaningful way. Researchers need to be able to easily evaluate their proposed methods against the overwhelming diversity of already available methods and practitioners need to make sure that they are using the right recourse mechanism for the problem at hand. Therefore, a standardized framework for comparison and quality assurance is an essential and indispensable prerequisite.

In this work, we present CARLA (Counterfactual And Recourse LibrAry), a Python library with the following merits: First, CARLA provides competitive baselines for researchers to benchmark new counterfactual explanation and recourse methods for the standardized and transparent comparison of CE methods on different integrated data sets. Second, CARLA is a common framework with more than 10 counterfactual explanation methods in combination with the possibility to easily integrate new methods into a commonly accessible and easily distributable Python library. Moreover, the built-in integrated evaluation measures allow users to plug their custom black-box predictive models into the available counterfactual explanation methods and conduct extensive evaluations in comparison with other recourse mechanisms across different data sets. The same is true for researchers, who can use CARLA to extensively benchmark available counterfactual methods on popular data sets across various ML models. Third, CARLA supports popular optimization frameworks such as TensorFlow and PyTorch, and provides a generic abstraction layer to support custom implementations. Users can define problem-specific data set characteristics like immutable features and explicitly specify hyperparameters for the chosen counterfactual explanation method.

The remainder of this work is structured as follows: Section 2 presents related work, Section 3 formally introduces the recourse problem, and Section 4 presents the benchmarking process. In Section 5, we describe our main findings, before concluding in Section 6. Appendices A to E describe CARLA’s software architecture and usage instructions, as well as additional experimental results, the ML classifiers used, data sets and hyperparameter settings.

2 Related Work

Explainable machine learning is concerned with the problem of providing explanations for complex ML models. Towards this goal, various streams of research follow different explainability paradigms which can be categorized into the following groups (guidotti2018survey; gade2019explainable).

2.1 Feature Highlighting Explanations

Local input attribution techniques seek to explain the behaviour of ML models instance by instance. Those methods aim to understand how all inputs available to the model are being used to arrive at a certain prediction. Some popular approaches for model explanations aim at explainability by design (lou2012intelligible; alvarez2018towards; broelemann2018gradient; wang2019designing). For white-box models, whose internal model parameters are known, gradient-based approaches, e.g. kasneci2016licon; chattopadhay2018grad (for deep neural networks), and rule-based or probabilistic approaches for tree ensembles, e.g. hara2018making; deng2019interpreting, have been proposed. In cases where the parameters of the complex models cannot be accessed, model-agnostic approaches can prove useful. This group of approaches seeks to explain a model’s behavior locally by applying surrogate models (ribeiro2016should; lundberg2017unified; ribeiro2018anchors; lundberg2020local), which are interpretable by design and are used to explain individual predictions of black-box ML models.

2.2 Counterfactual Explanations

The main purpose of counterfactual explanations is to suggest constructive interventions to the input of a complex model so that the output changes to the advantage of an end user. By emphasizing both the feature importance and the recommendation aspect, counterfactual explanation methods can be further divided into three different groups: independence-based, dependence-based, and causality-based approaches.

In the class of independence-based methods, where the input features of the predictive model are assumed to be independent, some approaches use combinatorial solvers or evolutionary algorithms to generate recourse in the presence of feasibility constraints (ustun2019actionable; russell2019efficient; rawal2020individualized; karimi2019model; kenny2020generating; dandl2020multi). Notable exceptions from this line of work are proposed by tolomei2017interpretable; laugel2017inverse; lash2017generalized; gupta2019equalizing; ghazimatin2020prince, who use decision trees, random search, support vector machines (SVM) and information networks that are aligned with the recourse objective. Another line of research deploys gradient-based optimization to find low-cost counterfactual explanations in the presence of feasibility and diversity constraints (Dhurandhar2018; mittelstadt2019explaining; mothilal2020fat; schut2021generating; van2019interpretable; pawelczyk2021connections). The main problem with these approaches is that they abstract from input correlations. That implies that the intervention costs (i.e., the costs of changing the input to achieve the proposed counterfactual state) are estimated too optimistically. In other words, the estimated costs do not reflect the true costs that an individual would need to incur in practical scenarios, where feature dependencies are usually present: e.g., income is dependent on tenure, and if income changes, tenure also changes (see Figure 2 for a schematic comparison).

In the class of causality-based approaches, all methods make use of Pearl’s causal modelling framework (pearl2009causality). As such, they usually require knowledge of the system of causal structural equations (joshi2019towards; goyal2019explaining; karimi2020intervention; oshaughnessy2020generative) or the causal graph (karimi2020probabilistic). The authors of karimi2020intervention show that these models can generate minimum-cost recourse if access to the true causal data generating process were available. However, in practical scenarios, the guarantee for such minimum-cost recommendations is vacuous, since, in complex settings, the causal model is likely to be misspecified (karimi2020probabilistic). Since these methods usually require the true causal graph, which is the limiting factor in practice, we have not considered them at this point, but we plan to do so in the future.

Dependence-based methods bridge the gap between the strong independence assumption and the strong causal assumption. This class of models builds recourse suggestions on generative models (pawelczyk2019; downs2020interpretable; joshi2019towards; mahajan2019preserving; pmlr-v124-pawelczyk20a). The main idea is to change the geometry of the intervention space to a lower dimensional latent space, which encodes different factors of variation while capturing input dependencies. To this end, these methods primarily use variational autoencoders (VAE) (kingma2013auto; nazabal2018handling). In particular, mahajan2019preserving demonstrate how to encode various feasibility constraints into VAE-based models. Most recently, antoran2020getting proposed CLUE, a generative recourse model that takes a classifier’s uncertainty into account. Work that deviates from this line of research was done by Poyiadzi2020; Kanamori2020. The authors of Poyiadzi2020 provide FACE, which uses a shortest path algorithm on graphs to find counterfactual explanations. In contrast, Kanamori2020 use integer programming techniques to account for input dependencies.

3 Preliminaries

In this Section, we review the algorithmic recourse problem and draw a distinction between two observational (i.e., non–causal) methods.


Figure 2: Different views on recourse generation, with panels (a) independent recourse intervention and (b) dependent recourse intervention. In (a) a change to one input only impacts the prediction through that input, while in (b) the same change induces indirect effects on the prediction if that input is correlated with other inputs.

3.1 Counterfactual Explanations for Independent Inputs

Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ be the data set consisting of $n$ input data points, $x_i \in \mathbb{R}^d$. We denote by $f: \mathbb{R}^d \to [0, 1]$ the fixed classifier for which recourse is to be determined. We denote the set of outcomes by $y \in \{0, 1\}$, where $y = 1$ indicates the desirable outcome. Moreover, $\hat{y} = \mathbb{I}[f(x) > \theta]$ is the predicted class, where $\mathbb{I}[\cdot]$ denotes the indicator function and $\theta$ is a threshold (e.g., 0.5). Our goal is to find a set of actionable changes in order to improve the outcomes of instances $x$, which are assigned an undesirable prediction under $f$. Moreover, one typically defines a distance measure in input space, $c: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_+$. We discuss typical choices for $c$ in Section 4.

Assuming inputs are pairwise statistically independent, the recourse problem is defined as follows:

$$\delta_x^* = \arg\min_{\delta_x \in \mathcal{A}_d} c(x, \check{x}) \quad \text{s.t.} \quad \check{x} = x + \delta_x, \quad f(\check{x}) > \theta, \tag{1}$$

where $\mathcal{A}_d$ is the set of admissible changes made to the factual input $x$. For example, $\mathcal{A}_d$ could specify that no changes to sensitive attributes such as age or sex may be made. Using the independent input assumption, existing approaches (ustun2019actionable) use mixed-integer linear programming to find counterfactual explanations. In the next paragraph, we present a problem formulation that relaxes the strong independence assumption by introducing generative models.
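The independent-input recourse problem can be illustrated with a minimal sketch. It is not the mixed-integer program used by the cited approaches: it assumes a toy logistic regression, a small hand-picked grid of candidate changes per feature, and an $\ell_1$ cost. The names `predict_proba` and `find_recourse`, the weights, and the feature meanings are all hypothetical.

```python
import itertools
import math


def predict_proba(x, weights, bias):
    """Toy logistic regression f(x) = sigmoid(w . x + b) (assumed model)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def find_recourse(x, weights, bias, steps, immutable, theta=0.5):
    """Return the cheapest admissible change (L1 cost) with f(x + delta) > theta."""
    # Immutable features only admit a change of zero (the admissible set).
    grids = [[0.0] if j in immutable else steps for j in range(len(x))]
    best_delta, best_cost = None, math.inf
    for delta in itertools.product(*grids):
        x_cf = [xi + di for xi, di in zip(x, delta)]
        cost = sum(abs(di) for di in delta)
        if cost < best_cost and predict_proba(x_cf, weights, bias) > theta:
            best_delta, best_cost = list(delta), cost
    return best_delta, best_cost


# Hypothetical features: [age, income, debt]; age (index 0) is immutable.
weights, bias = [0.5, 1.0, -1.0], -1.0
x = [1.0, 0.5, 1.0]
steps = [-1.0, -0.5, 0.0, 0.5, 1.0]
delta, cost = find_recourse(x, weights, bias, steps, immutable={0})
print(delta, cost)
```

Exhaustive search only scales to a handful of features and grid values; it is used here because it makes the objective and the two constraints (admissible changes, flipped prediction) directly visible.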

3.2 Recourse for Correlated Inputs

We assume the factual input $x \in \mathbb{R}^d$ is generated by a generative model $g$ such that:

$$x = g(z),$$

where $z \in \mathbb{R}^k$ are latent codes. We denote the counterfactual explanation in input space by $\check{x} = x + \delta_x$. Thus, we have $\check{x} = x + \delta_x = g(z + \delta_z)$. Assuming inputs are dependent, we can rewrite the recourse problem in (1) to faithfully capture those dependencies using the generative model $g$:

$$\delta_z^* = \arg\min_{\delta_z \in \mathcal{A}_k} c(x, \check{x}) \quad \text{s.t.} \quad \check{x} = g(z + \delta_z), \quad f(\check{x}) > \theta, \tag{2}$$

where $\mathcal{A}_k$ is the set of admissible changes in the $k$-dimensional latent space. For example, $\mathcal{A}_k$ would ensure that the counterfactual latent code lies within range of $z$. The problem in (2) is an abstraction from how the problem is usually solved in practice: most existing approaches first train a type of autoencoder model (e.g., a VAE), and then use the model’s trained decoder as a deterministic function $g$ to find counterfactual explanations (joshi2019towards; pawelczyk2019; mahajan2019preserving; downs2020interpretable; antoran2020getting). Our benchmarked explanation models roughly fit in one of these two categories.
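The latent-space formulation can be sketched in the same toy style. The example below assumes a hand-written one-dimensional linear decoder in place of a trained VAE decoder, and a simple outward search over the latent perturbation; `decoder`, `classifier` and `latent_recourse` are hypothetical names used only for illustration.

```python
import math


def decoder(z):
    """Hypothetical decoder g(z): a single latent factor drives both
    inputs (say income and tenure), so the two inputs are correlated."""
    return [2.0 * z, 1.0 * z + 0.5]


def classifier(x):
    """Toy logistic regression f(x) (assumed model)."""
    return 1.0 / (1.0 + math.exp(-(x[0] + x[1] - 4.0)))


def latent_recourse(z, step=0.05, max_steps=200, theta=0.5):
    """Grow |delta_z| until the decoded point g(z + delta_z) is accepted."""
    x = decoder(z)
    for i in range(1, max_steps + 1):
        for delta_z in (i * step, -i * step):
            x_cf = decoder(z + delta_z)
            if classifier(x_cf) > theta:
                cost = sum(abs(a - b) for a, b in zip(x_cf, x))
                return delta_z, x_cf, cost
    return None


delta_z, x_cf, cost = latent_recourse(z=0.5)
print(delta_z, x_cf, cost)
```

Because the search moves in latent space, both inputs change together along the decoder's output, which is the dependence that the independent formulation ignores.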

Figure 3: Evaluating the distribution of costs of counterfactual explanations on 2 different data sets, (a) Adult data and (b) Give Me Some Credit data (the results on COMPAS are relegated to Appendix B). For all instances with a negative prediction, we plot the distribution of $\ell_0$ and $\ell_1$ costs of algorithmic recourse as defined in (1) for a logistic regression and an artificial neural network classifier. The white dots indicate the medians (lower is better), and the black boxes indicate the interquartile ranges. We distinguish between independence based and dependence based methods. The results are discussed in Section 5.

4 Benchmarking Process

In this Section, we give a brief overview of the benchmarked explanation methods, summarized in Table 1, and introduce a variety of measures used to evaluate the quality of the generated counterfactual explanations.

Approach         Method     Model Type      Algorithm       Immutable  Categorical  Other
Independent (I)  AR         Linear          Integer prog.   Yes        Binary       Direction of change
Independent (I)  AR-LIME    Agnostic        Integer prog.   Yes        Binary       Direction of change
Independent (I)  CEM        Gradient based  Gradient based  No         No           None
Independent (I)  DICE       Gradient based  Gradient based  Yes        Binary       Generative model
Independent (I)  GS         Agnostic        Random search   Yes        Binary       None
Independent (I)  Wachter    Gradient based  Gradient based  No         Binary       None
Dependent (D)    CEM-VAE    Gradient based  Gradient based  No         No           Gen. model regularizer
Dependent (D)    CLUE       Gradient based  Gradient based  No         No           Generative model
Dependent (D)    FACE-EPS   Agnostic        Graph search    Binary     Binary       CE is from data set
Dependent (D)    FACE-KNN   Agnostic        Graph search    Binary     Binary       CE is from data set
Dependent (D)    REVISE     Gradient based  Gradient based  Binary     Binary       Generative model
Table 1: Explanation method summary: we categorize different approaches based on their underlying assumptions and list what kind of ML model they work with (Model Type), the method’s underlying algorithm (Algorithm), whether the method can handle immutable features (Immutable), whether it can handle categorical features (Categorical) and any other outstanding characteristics (Other).

4.1 Counterfactual Explanation Methods


Actionable Recourse (AR)

Ustun2019ActionableRI provide a method to generate minimal-cost actions for linear classification models such as logistic regression models. AR requires the linear model’s coefficients and uses them in its search for counterfactual explanations. To provide reasonable actions, the search can be restricted by user-specified constraints (e.g., has_phd can only change from False to True), or a subset of inputs can be set as immutable (e.g., age). Finding these changes is a discrete optimization problem. Given a set of actions, AR finds the action which minimizes a defined cost function, using integer programming solvers like CPLEX or CBC.


AR-LIME

Most classification tasks do not have linearly separable classes, and complex non-linear models usually provide more accurate predictions. Non-linear models are not per se interpretable and usually do not provide coefficients similar to linear models. We use a reduction to apply AR to non-linear models by computing a local linear approximation for the point of interest $x$, using LIME (ribeiro2016should). For an arbitrary black-box model $f$, LIME estimates post-hoc local explanations in the form of a set of linear coefficients per instance. Using these coefficients, we apply AR.


Contrastive Explanations Method (CEM)

Dhurandhar2018 use an objective inspired by elastic-net regularization to find low-cost counterfactual instances. Different weights can be assigned to the $\ell_1$ and $\ell_2$ norms, respectively. The method has no handling of immutable features. However, we provide support for their VAE-type regularizer (CEM-VAE), which should help ensure that counterfactual instances look more realistic.


CLUE

antoran2020getting propose CLUE, a generative recourse model that takes a classifier’s uncertainty into account. This model suggests feasible counterfactual explanations that are likely to occur under the data distribution. The authors use a variational autoencoder (VAE) to estimate the generative model. Using the VAE’s decoder, CLUE uses an objective that guides the search for CEs towards instances that have low uncertainty, measured in terms of the classifier’s entropy.


Diverse Counterfactual Explanations (DICE)

Mothilal2020ExplainingML suggest DICE, an explanation model that seeks to generate minimum-cost counterfactual explanations according to (1), subject to a diversity constraint which aims to promote a diverse set of counterfactual explanations. Diversity is achieved by using the whole range of suggested changes, while still keeping proximity to a given input. Regarding the optimization problem, DICE uses gradient descent to find a solution that trades off proximity and diversity. Domain knowledge, in the form of feature ranges or immutability constraints, can be added.


Feasible and Actionable Counterfactual Explanations (FACE)

The authors of Poyiadzi2020 provide FACE, which uses a shortest path algorithm (for graphs) to find counterfactual explanations from high-density regions. Those explanations are actual data points from either the training or test set. Immutability constraints are enforced by removing incorrect neighbors from the graph. We implemented two variants of this model: the first variant uses an epsilon-graph (FACE-EPS), whereas the second variant uses a kNN-graph (FACE-KNN).
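The graph-search idea can be sketched as follows. This is a simplified illustration, not the FACE implementation: it assumes plain Euclidean edge weights on a kNN graph (no density weighting and no pruning for immutable features), and the points, labels and function names are made up for the example.

```python
import heapq
import math


def knn_graph(points, k):
    """Connect every point to its k nearest neighbours (undirected edges)."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        for d, j in dists[:k]:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def shortest_path_counterfactual(graph, start, is_positive):
    """Dijkstra from the factual node; return the positively classified
    node with the smallest path length, together with that length."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node != start and is_positive(node):
            return node, d
        for nxt, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None, math.inf


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 5.0)]
labels = [0, 0, 0, 1, 1]  # hypothetical classifier predictions
graph = knn_graph(points, k=1)
target, length = shortest_path_counterfactual(
    graph, start=0, is_positive=lambda i: labels[i] == 1)
print(target, length)
```

The returned counterfactual is always an existing data point, which mirrors the "CE is from data set" property listed for FACE in Table 1.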

Growing Spheres (GS)

Growing Spheres, suggested in (laugel2017inverse), is a random search algorithm which generates samples around the factual input point $x$ until a point with a corresponding counterfactual class label is found. The random samples are generated around $x$ using growing hyperspheres. For binary input dimensions, the method makes use of Bernoulli sampling. Immutable features are readily specified by excluding them from the search procedure.
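A minimal sketch of this random search is given below. It is a simplification under stated assumptions: only continuous features are handled (no Bernoulli sampling for binary inputs), the shell radius is drawn uniformly, and the classifier, step size and sample count are hypothetical choices rather than values from the original method.

```python
import math
import random


def sample_in_shell(center, r_low, r_high, mutable, rng):
    """Draw one point whose distance to `center` lies in [r_low, r_high],
    perturbing only the mutable coordinates."""
    direction = [rng.gauss(0.0, 1.0) if j in mutable else 0.0
                 for j in range(len(center))]
    norm = math.sqrt(sum(v * v for v in direction)) or 1.0
    radius = rng.uniform(r_low, r_high)
    return [c + radius * v / norm for c, v in zip(center, direction)]


def growing_spheres(x, predict, mutable, step=0.1, n_samples=200,
                    max_radius=10.0, seed=0):
    """Grow the search shell around x until some sample flips the label."""
    rng = random.Random(seed)
    factual_label = predict(x)
    r_low = 0.0
    while r_low < max_radius:
        r_high = r_low + step
        candidates = [sample_in_shell(x, r_low, r_high, mutable, rng)
                      for _ in range(n_samples)]
        flipped = [c for c in candidates if predict(c) != factual_label]
        if flipped:
            # Keep the flipped sample that is closest to the factual input.
            return min(flipped, key=lambda c: math.dist(c, x))
        r_low = r_high
    return None


def predict(x):
    """Hypothetical classifier: accept when the first two features sum above 2."""
    return int(x[0] + x[1] > 2.0)


x = [0.5, 0.5, 30.0]  # the third feature is treated as immutable
x_cf = growing_spheres(x, predict, mutable={0, 1})
print(x_cf)
```

Excluding a coordinate from `mutable` keeps it at its factual value, which is how immutability is handled in this sketch.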


REVISE

joshi2019towards propose REVISE, a generative recourse model. This model suggests feasible counterfactual explanations that are likely to occur under the data distribution. The authors use a variational autoencoder (VAE) to estimate the generative model. Using the VAE’s decoder, REVISE searches the latent space for CEs. No handling of immutable features exists.

Wachter et al. (Wachter)

The optimization approach suggested by Wachter2017CounterfactualEW generates counterfactual explanations by minimizing an objective function using gradient descent to find counterfactuals which are as close as possible to the factual input $x$. Closeness is measured in the $\ell_1$-norm.
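The gradient-based search can be sketched for a toy logistic regression. The objective used here, a squared prediction loss weighted by `lam` plus an unweighted $\ell_1$ distance, is one common reading of this family of methods and is an assumption of the sketch, as are the fixed `lam`, the learning rate, and the model weights.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def wachter_counterfactual(x, weights, bias, lam=10.0, lr=0.05,
                           max_iter=2000, theta=0.5):
    """Minimise lam * (f(x') - 1)^2 + ||x' - x||_1 by (sub)gradient descent
    for a logistic regression f, stopping as soon as f(x') > theta."""
    x_cf = list(x)
    for _ in range(max_iter):
        p = sigmoid(sum(w * v for w, v in zip(weights, x_cf)) + bias)
        if p > theta:
            return x_cf
        for j, w in enumerate(weights):
            # Gradient of the prediction loss with respect to feature j.
            grad_loss = lam * 2.0 * (p - 1.0) * p * (1.0 - p) * w
            diff = x_cf[j] - x[j]
            grad_dist = (diff > 0) - (diff < 0)  # subgradient of |diff|
            x_cf[j] -= lr * (grad_loss + grad_dist)
    return None


weights, bias = [1.0, 2.0], -3.0  # hypothetical model
x = [0.5, 0.5]
x_cf = wachter_counterfactual(x, weights, bias)
print(x_cf)
```

Nothing in this update rule protects immutable features, which is consistent with the "No" entry for Wachter in the Immutable column of Table 1.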

                        Artificial Neural Network                   Logistic Regression
Data Set  Method        yNN   redund.  violation  success  time     yNN   redund.  violation  success  time
Adult     AR(-LIME)     0.62  0.00     0.14       0.28     1.59     0.72  0.67     0.13       0.52     10.49
Adult     CEM           0.26  3.96     0.66       1.00     1.10     0.20  3.98     0.66       1.00     0.92
Adult     DICE          0.71  0.53     0.17       1.00     0.13     0.58  0.51     0.23       1.00     0.13
Adult     GS            0.30  3.77     0.09       1.00     0.01     0.30  3.94     0.10       1.00     0.01
Adult     Wachter       0.23  4.45     0.83       0.50     15.72    0.16  1.67     0.94       1.00     0.03
GMC       AR(-LIME)     0.89  0.00     0.29       0.07     0.55     1.00  2.33     0.14       0.39     3.42
GMC       CEM           0.95  5.46     0.65       1.00     0.97     0.74  5.07     0.67       1.00     0.87
GMC       DICE          0.90  0.58     0.27       1.00     0.28     0.88  0.61     0.27       1.00     0.29
GMC       GS            0.40  6.64     0.17       1.00     0.01     0.49  5.29     0.17       1.00     0.01
GMC       Wachter       0.58  6.56     0.71       1.00     0.02     0.59  6.12     0.83       1.00     0.01
(a) Independence based methods

                        Artificial Neural Network                   Logistic Regression
Data Set  Method        yNN   redund.  violation  success  time     yNN   redund.  violation  success  time
Adult     CEM-VAE       0.12  9.68     1.82       1.00     0.93     0.43  10.05    1.80       1.00     0.81
Adult     CLUE          0.82  8.05     1.28       1.00     2.70     0.33  7.30     1.33       1.00     2.56
Adult     FACE-EPS      0.65  5.19     1.45       0.99     4.36     0.64  5.11     1.44       0.94     4.35
Adult     FACE-KNN      0.60  5.11     1.41       1.00     4.31     0.57  4.97     1.38       1.00     4.31
Adult     REVISE        0.20  8.65     1.33       1.00     8.33     0.62  7.92     1.23       1.00     7.52
GMC       CEM-VAE       1.00  8.40     0.66       1.00     0.87     1.00  8.54     0.36       1.00     0.88
GMC       CLUE          1.00  9.39     0.90       0.93     1.91     1.00  9.56     0.96       1.00     1.76
GMC       FACE-EPS      0.99  8.06     0.99       1.00     19.44    0.98  7.98     0.96       1.00     19.50
GMC       FACE-KNN      0.98  9.00     0.98       1.00     15.87    0.98  7.88     0.95       1.00     16.09
GMC       REVISE        1.00  9.50     0.97       1.00     4.56     1.00  9.59     0.96       1.00     3.76
(b) Dependence based methods
Table 2: Summary of a subset of results for independence and dependence based methods. For all instances with a negative prediction, we compute counterfactual explanations for which we then measure yNN (higher is better), redundancy (lower is better), violation (lower is better), success rate (higher is better) and time (lower is better). We distinguish between a logistic regression and an artificial neural network classifier. Detailed descriptions of these measures can be found in Section 4. The results are discussed in Section 5.

4.2 Evaluation Measures for Counterfactual Explanation Methods

As algorithmic recourse is a multi-modal problem, we introduce a variety of measures to evaluate the methods’ performances. We use six baseline evaluation measures. Besides distance measures, it is important to consider measures that emphasize the quality of recourse.


Costs

When answering the question of generating the nearest counterfactual explanation, it is essential to define the distance of the factual $x$ to the nearest counterfactual $\check{x}$. The literature has formed a consensus to use either the normalized $\ell_0$ or $\ell_1$ norm or any convex combination thereof (see for example rawal2020individualized; mothilal2020fat; pmlr-v124-pawelczyk20a; karimi2019model; ustun2019actionable; wachter2017counterfactual). The $\ell_0$ norm puts a restriction on the number of feature changes between factual and counterfactual instance, while the $\ell_1$ norm restricts the average change:

$$c_0(\check{x}, x) = \|\check{x} - x\|_0, \qquad c_1(\check{x}, x) = \|\check{x} - x\|_1.$$
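As a small worked example of the two distance measures, the sketch below counts changed features and sums absolute changes for one factual and counterfactual pair. The function names and the tolerance `tol` are illustrative assumptions, and the inputs are assumed to be already normalized.

```python
def l0_cost(x, x_cf, tol=1e-9):
    """Number of features that differ between factual and counterfactual."""
    return sum(1 for a, b in zip(x, x_cf) if abs(a - b) > tol)


def l1_cost(x, x_cf):
    """Sum of absolute feature changes."""
    return sum(abs(a - b) for a, b in zip(x, x_cf))


x = [0.2, 1.0, 0.0, 0.7]     # factual (hypothetical, normalized features)
x_cf = [0.2, 0.5, 1.0, 0.7]  # counterfactual
print(l0_cost(x, x_cf), l1_cost(x, x_cf))
```

Here two features change, one by 0.5 and one by 1.0, so the sparsity cost is 2 and the absolute-change cost is 1.5.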


Constraint violation

This measure counts the number of times the CE method violates user-defined constraints. Depending on the data set, we fixed a list of features which should not be changed by the used method (e.g., sex, age or race).
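Counting violations for a single counterfactual can be sketched as below. The feature layout and the tolerance are hypothetical; the function simply counts how many features marked immutable differ between factual and counterfactual.

```python
def constraint_violations(x, x_cf, immutable_idx, tol=1e-9):
    """Count immutable features whose value was changed by the CE method."""
    return sum(1 for j in immutable_idx if abs(x[j] - x_cf[j]) > tol)


# Hypothetical layout: [age, sex, income, hours]; age and sex are immutable.
x = [0.30, 1.0, 0.20, 0.40]
x_cf = [0.35, 1.0, 0.60, 0.40]
print(constraint_violations(x, x_cf, immutable_idx=[0, 1]))
```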


yNN

We use a measure that evaluates how much data support CEs have from positively classified instances. Ideally, CEs should be close to positively classified individuals, which is a desideratum formulated by laugel2019dangers. We define the set of individuals who received an undesirable prediction under $f$ as $H^-$. The counterfactual instances (instances for which the label was successfully changed) corresponding to the set $H^-$ are denoted by $\check{H}^-$. We use a measure that captures how differently neighborhood points around a counterfactual instance $\check{x}$ are classified:

$$\text{yNN} = 1 - \frac{1}{nk} \sum_{i \in \check{H}^-} \sum_{j \in \text{kNN}(\check{x}_i)} |f_b(\check{x}_i) - f_b(x_j)|,$$

where $n = |\check{H}^-|$ is the number of counterfactual instances, kNN denotes the $k$-nearest neighbours of $\check{x}_i$, and $f_b$ is the binarized classifier. Values of yNN close to 1 imply that the neighbourhoods around the counterfactual explanations consist of points with the same predicted label, indicating that the neighborhoods around these points have already been reached by positively classified instances. We use a value of $k = 5$, which ensures sufficient data support from the positive class.
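The neighbourhood measure can be sketched as follows, assuming Euclidean nearest neighbours and a binarized classifier passed in as `predict_label`. The one-dimensional toy data and the choice of `k=3` in the usage example are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math


def ynn(counterfactuals, data, predict_label, k):
    """One minus the average label disagreement between each counterfactual
    and its k nearest data points."""
    total = 0
    for cf in counterfactuals:
        neighbours = sorted(data, key=lambda p: math.dist(p, cf))[:k]
        cf_label = predict_label(cf)
        total += sum(abs(cf_label - predict_label(p)) for p in neighbours)
    return 1.0 - total / (len(counterfactuals) * k)


def predict_label(p):
    """Hypothetical binarized classifier."""
    return int(p[0] > 0.5)


data = [(0.1,), (0.4,), (0.6,), (0.7,), (0.8,), (0.9,)]
counterfactuals = [(0.52,), (0.75,)]
score = ynn(counterfactuals, data, predict_label, k=3)
print(score)
```

The first counterfactual sits close to the decision boundary and has one negatively classified neighbour, while the second is surrounded by positives, so one of the six neighbour comparisons disagrees.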


Redundancy

We evaluate how many of the proposed feature changes were not necessary. This is a particularly important criterion for independence-based methods. We measure this by successively flipping one value of $\check{x}$ after another back to the corresponding value of $x$, and then we inspect whether the label flipped from the counterfactual class back to the factual class: e.g., we check whether flipping the value for the second dimension would change the counterfactual outcome back to the predicted factual outcome of $x$: $f_b(\check{x}_1, x_2, \check{x}_3, \dots, \check{x}_d) = f_b(x)$. If the predicted outcome does not change, we increase the redundancy counter, concluding that a sparser counterfactual explanation could have been found. We iterate this process over all dimensions of the input vector (we do not consider all possible subsets of changes). A low number indicates few redundancies across counterfactual instances.
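The flip-back procedure can be sketched for a single counterfactual as below. Two choices are assumptions of this sketch: features that were not changed are skipped rather than counted, and features are reset one at a time without testing subsets. The classifier and the data are hypothetical.

```python
def redundancy(x, x_cf, predict_label, tol=1e-9):
    """Count changed features that can be reset to their factual value
    without losing the counterfactual label."""
    cf_label = predict_label(x_cf)
    redundant = 0
    for j in range(len(x)):
        if abs(x_cf[j] - x[j]) <= tol:
            continue  # this feature was not changed
        probe = list(x_cf)
        probe[j] = x[j]  # flip a single feature back to its factual value
        if predict_label(probe) == cf_label:
            redundant += 1
    return redundant


def predict_label(x):
    """Hypothetical classifier that ignores the third feature."""
    return int(x[0] + x[1] > 1.0)


x = [0.2, 0.2, 0.0]
x_cf = [0.9, 0.3, 1.0]
print(redundancy(x, x_cf, predict_label))
```

In this example only the change to the first feature is needed: resetting either the second or the third feature keeps the positive prediction, so two of the three changes are redundant.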

Success Rate

Some generated counterfactual explanations do not alter the predicted label of the instance as anticipated. To keep track of how often the generated CE holds its promise, the success rate reports the fraction of each method’s counterfactuals that actually attain the desired label.

Average Time

By measuring the average time a CE method needs to generate its result, we evaluate its efficiency and feasibility for real-time prediction settings.
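Success rate and average time can be collected in one pass over the factual instances, as in the sketch below. The `method` callable, the classifier, and the target label are hypothetical stand-ins, and this helper is not the benchmarking routine shipped with the library.

```python
import time


def success_and_time(method, factuals, predict_label, target=1):
    """Return (success rate, average seconds per instance) for a CE method."""
    successes = 0
    start = time.perf_counter()
    for x in factuals:
        x_cf = method(x)
        if x_cf is not None and predict_label(x_cf) == target:
            successes += 1
    elapsed = time.perf_counter() - start
    return successes / len(factuals), elapsed / len(factuals)


def shift_method(x):
    """Hypothetical CE method: always add 0.5 to the only feature."""
    return [x[0] + 0.5]


def predict_label(x):
    """Hypothetical classifier."""
    return int(x[0] > 1.0)


factuals = [[0.2], [0.6], [0.9]]
rate, avg_time = success_and_time(shift_method, factuals, predict_label)
print(rate, avg_time)
```

The first instance only reaches 0.7 and stays negative, so two of the three counterfactuals succeed.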

5 Experimental Evaluation

Using CARLA, we conduct extensive empirical evaluations to benchmark the presented counterfactual explanation methods using three real-world data sets. Our main findings are displayed in Figure 3 and Table 2. We split the benchmarking evaluation by CE method category. In the following Sections, we provide an overview of the data sets used (see Table 3) and the classification models. Detailed information on hyperparameter search for the CE methods is provided in Appendix E.

Data sets

The Adult data set (Dua:2019) originates from the 1994 Census database, consisting of 14 attributes and 48,842 instances. The classification task consists of deciding whether an individual has an income greater than 50,000 USD/year. Since several CE methods cannot handle non-binary categorical data, we binarized these features by partitioning them into the most frequent value and its counterpart (e.g., US and Non-US, Husband and Non-Husband). The features age, sex and race are set as immutable. The Give Me Some Credit (GMC) data set (Kaggle2011) from a 2011 Kaggle Competition is a credit scoring data set, consisting of 150,000 observations and 11 features. The classification task consists of deciding whether an instance will experience financial distress within the next two years (SeriousDlqin2yrs is 1) or not. We dropped missing data, and set age as immutable.
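The binarization of categorical features described above can be sketched as follows. The helper name and the sample values are hypothetical, and the actual preprocessing pipeline may differ in details such as tie handling.

```python
from collections import Counter


def binarize_most_frequent(values):
    """Map a categorical column to 1 for its most frequent value, else 0."""
    most_frequent = Counter(values).most_common(1)[0][0]
    return most_frequent, [1 if v == most_frequent else 0 for v in values]


countries = ["US", "US", "Mexico", "US", "India"]
reference, encoded = binarize_most_frequent(countries)
print(reference, encoded)
```

After this step the column only distinguishes the reference category from all others, which is the US versus Non-US split mentioned in the text.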

Data Set  Task                        Positive Class       Size (n | d)    Features                           Immutable Features
Adult     Predict Income              High Income (24%)    (45,222 | 20)   Work, Education, Income            Sex, Age, Race
COMPAS    Predict Recidivism          No Recid. (65%)      (10,000 | 8)    Crim. History, Jail & Prison Time  Sex, Race
GMC       Predict Financial Distress  No deficiency (93%)  (150,000 | 11)  Pay. History, Balance, Loans       Age
Table 3: Summarizing the used data sets, where $n$ and $d$ are the number of samples and input dimension after processing of the data. Results on the COMPAS data set are relegated to Appendix B.

Black-box models

We briefly describe how the black-box classifiers were trained. CARLA supports different ML libraries (e.g., PyTorch, TensorFlow) to estimate these classifiers, as the implementations of the various explanation methods work with particular ML libraries only. The first model is a multi-layer perceptron, consisting of three hidden layers with 18, 9 and 3 neurons, respectively. To allow a more extensive comparison between CE methods (AR only works on linear models), we chose logistic regression models as the second classification model for which we evaluate the CE methods. Detailed information on the classifiers’ training for each data set is provided in Appendix C.


For independence based methods, we find that no single CE method outperformed all its competitors. This is not too surprising since algorithmic recourse is a multi-modal problem. Instead, we found that some methods dominated certain measures across all data sets. AR, AR-LIME and DICE performed strongest with respect to $\ell_0$ (see the top left panels in Figures 3(a) and 3(b)). AR-LIME does so despite our use of the LIME reduction. Therefore, it makes sense that AR, AR-LIME and DICE offer the lowest redundancy scores (Table 2(a)). CEM performed strongest with respect to the overall cost measure across data sets. GS is the clear winner when it comes to the measurement of time (Table 2(a)). Since the algorithm behind GS is based on a rather rudimentary sampling strategy, we expect that savvier sampling strategies should boost its cost performance significantly.

For dependence based methods, the results are mixed as well. While CLUE and REVISE are the winners with respect to the cost of recourse, the margins between these generative recourse models and the graph-based ones (FACE) are small (Figure 3). The FACE-EPS method performs strongest with respect to the yNN measure (usually well above 0.60) (Table 2(b)), indicating that the generated CEs have sufficient data support from positively classified individuals relative to the remaining dependence-based methods. As expected, the yNN measures are on average higher for the dependence based methods. This suggests that dependence based CEs are less often outliers. Notably, CLUE and REVISE perform best with respect to $\ell_1$ (with REVISE being the clear winner in 3 out of 4 cases), while they perform worst on $\ell_0$, likely due to the decoder’s imprecise reconstruction. In this respect, it is not surprising that these methods have average redundancy values that are up to twice as high as those of FACE. Finally, the generative model approaches (CEM-VAE, CLUE, REVISE) performed best with respect to time since the autoencoder training time amortizes with more samples.

6 Conclusion and Broader Impact of CARLA

The current implementations of recourse methods mentioned in Section 4.1 are based on the original implementations of the respective research groups. Researchers mostly implement their experiments and models for specific ML frameworks and data sets. For example, some explanation methods are restricted to TensorFlow and are not applicable to PyTorch models. In the future, we will extend CARLA to decouple each recourse method from the frameworks and data constraints.

When trying to combine different CE methods into a common benchmarking framework, we encountered the following issues: First, a great number of repositories only contain remarks about installation and script calls to recreate the results from the corresponding research papers. Second, missing information about interfaces for data sets or black-box models further complicated the process of integrating different CE methods into the benchmarking workflow. In order to add more CE methods and data sets to CARLA, we are currently in contact with several authors in this exciting and rapidly growing field. With a growing open-source community, CARLA can evolve to be the main library for generating counterfactual explanations and benchmarks for recourse methods. Therefore, we are continuously expanding the catalog of explanation methods and data sets, and welcome researchers to add their own recourse methods to the library. To facilitate this process, we provide a step-by-step user guide to integrate new CE methods into CARLA, which we present in Appendix A.

The rapidly growing number of available CE methods calls for standardized and efficient ways to assure the quality of a new technique in comparison with other approaches on different data sets. Quality assurance is a key aspect of actionable recourse, since complex models and CE mechanisms can have a considerable impact on personal lives. In this work, we presented CARLA, a versatile benchmarking platform for the standardized and transparent comparison of CE methods on different integrated data sets. In the explainability field, CARLA bears the potential to help researchers and practitioners alike to efficiently derive more realistic and use–case–driven recourse strategies and assure their quality through extensive comparative evaluations. We hope that this work contributes to further advances in explainability research.



Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? As we state in the abstract, our goal is to provide a Python framework for benchmarking counterfactual explanation methods. Users can easily evaluate our results by accessing our Github repository, where we host our Python framework and our benchmarking results.

    2. Did you describe the limitations of your work? In Section 6, we discuss the current limitations of our approach. The counterfactual explanation methods are based on the original implementations of the respective research groups. Researchers mostly implement their experiments and models for specific ML frameworks and data sets. For example, some explanation methods are restricted to Tensorflow and are not applicable to Pytorch models.

    3. Did you discuss any potential negative societal impacts of your work? We discuss the broader impact of our benchmarking library in Section 6; we mainly see positive impacts on the literature of algorithmic recourse.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them? We have read the ethics review guidelines and attest that our paper conforms to the guidelines.

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results? We did not provide theoretical results.

    2. Did you include complete proofs of all theoretical results? We did not provide theoretical results.

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? Details of implementations, data sets and instructions can be found in Appendices A, C, and E, and in our Github repository.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Please see Appendices E and C.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Error bars have been reported for our cost comparisons in terms of the 25th and 75th percentiles of the cost distribution, see for example Figure 3.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? All models are evaluated on an i7-8550U CPU with 16 GB RAM, running on Windows 10.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? The data sets, which are publicly available, are appropriately cited in Section 5. We cite and link to any additional code used, for example antoran2020getting.

    2. Did you mention the license of the assets? All assets are publicly available and attributed.

    3. Did you include any new assets either in the supplemental material or as a URL? Our implementation and code are accessible through our Github repository.

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating? We use publicly available data sets without any personally identifying information.

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? We use publicly available data sets without any personally identifying information.

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable? We did not use crowdsourcing or conduct research with human subjects.

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? We did not use crowdsourcing or conduct research with human subjects.

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? We did not use crowdsourcing or conduct research with human subjects.

Appendix A Carla’s Software Interface

In the following, we introduce our open-source benchmarking software CARLA. We describe the architecture in more detail and provide examples of different use cases and their implementation.

A.1 Carla’s High Level Software Architecture

The purpose of this Python library is to provide a simple and standardized framework that allows users to apply different state-of-the-art recourse methods to arbitrary data sets and black-box-models. It is possible to compare different approaches and save the evaluation results, as described in Section 4.2. For research groups, CARLA provides an implementation interface to integrate new recourse methods in an easy-to-use way, which allows them to compare their method to already existing methods.

Figure 4: Architecture of the CARLA python library. The silver boxes show the individual objects that are created to generate counterfactual explanations and evaluate recourse methods. Useful explanations of specific processes are illustrated as yellow notes. The dashed arrows show the two implementation options: either use pre-defined catalog objects or provide a custom implementation. All dependencies between these objects are visualised by solid arrows with an additional description.

A simplified visualization of the CARLA software architecture is depicted in Figure 4. For every component (Data, MLModel, and RecourseMethod), the library provides the possibility to use existing methods from our catalog, or to plug in the user’s own custom methods and implementations. The components represent an interface to the key parts in the process of generating counterfactual explanations. Data provides a common way to access the data across the software and maintains information about the features. MLModel wraps each black-box model and stores details on the encoding, scaling and feature order specific to the model. The primary purpose of RecourseMethod is to provide a common interface to easily generate counterfactual examples.

Besides the possibility to use pretrained black-box-models and preprocessed data, CARLA provides an easy way to load and define custom data sets and model structures independent of their framework (e.g., Pytorch, Tensorflow, sklearn). The following sections give an overview and provide example implementations of different use cases.

A.2 Carla for Research Groups

One of the most exciting features of CARLA is that research groups can make use of the RecourseMethod-wrapper to implement their own method to generate counterfactual examples. This opens up a way of standardized and consistent comparisons between different recourse methods. Strong and weak points of new algorithms can be stated, benchmarked and analysed in forthcoming publications with the help of CARLA.

In Figure 5, we show how an implementation of a custom recourse method can be structured. After defining the recourse method in the shown way, it can be used with the library to generate counterfactuals for a given data set and benchmark its results against other methods. Research groups have the choice to do this using our provided catalog of data sets, recourse methods and black-box models (Figure 6) or use their own models and data sets (see Figures 7 and 8).

import pandas as pd
from carla import RecourseMethod

# Custom recourse implementations need to
# inherit from the RecourseMethod interface
class MyRecourseMethod(RecourseMethod):
    def __init__(self, mlmodel):
        super().__init__(mlmodel)

    # Generate and return encoded and
    # scaled counterfactual examples
    def get_counterfactuals(self, factuals: pd.DataFrame):
        [...]
        return counterfactual_examples
Figure 5: Pseudo-implementation of the CARLA recourse method wrapper
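
To make the wrapper shape in Figure 5 more concrete, the following is a minimal, self-contained sketch of a class that exposes the same `get_counterfactuals` method. It deliberately does not import CARLA: `ToyLinearModel` and `LinearProjectionRecourse` are hypothetical stand-ins, and the projection strategy (moving each factual just across the decision boundary of a linear classifier) is chosen only for illustration and is not one of the benchmarked methods.

```python
import numpy as np
import pandas as pd


class ToyLinearModel:
    """Hypothetical stand-in for a wrapped black-box model (not CARLA's MLModel)."""

    def __init__(self, weights, bias):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)

    def predict(self, x: pd.DataFrame) -> np.ndarray:
        # Hard 0/1 labels from a linear decision rule
        return (x.to_numpy(dtype=float) @ self.weights + self.bias > 0).astype(int)


class LinearProjectionRecourse:
    """Illustrative recourse method with the same method shape as Figure 5."""

    def __init__(self, mlmodel, overshoot=1e-3):
        self.mlmodel = mlmodel
        self.overshoot = overshoot  # target score just above the boundary

    def get_counterfactuals(self, factuals: pd.DataFrame) -> pd.DataFrame:
        x = factuals.to_numpy(dtype=float)
        w, b = self.mlmodel.weights, self.mlmodel.bias
        score = x @ w + b
        # Step length along w so that the new score equals `overshoot`;
        # instances that are already clearly positive are left unchanged.
        t = np.maximum(0.0, (self.overshoot - score) / (w @ w))
        return pd.DataFrame(x + t[:, None] * w,
                            columns=factuals.columns, index=factuals.index)


model = ToyLinearModel(weights=[1.0, 1.0], bias=-1.0)
factuals = pd.DataFrame({"x1": [0.0, 0.2], "x2": [0.0, 0.1]})
counterfactuals = LinearProjectionRecourse(model).get_counterfactuals(factuals)
```

In this toy setting every returned row is classified as positive by `model.predict`; a real method would additionally have to respect immutable features and the encoding expected by the model.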

A.3 Carla as a Recourse Library

A common usage of the package is to generate counterfactual examples. This can be done by loading black-box-models and data sets from our provided catalogs, or by user-defined models and datasets via integration with the defined interfaces. Figure 6 shows an implementation example of a simple use-case, applying a recourse method to a pre-defined data set and model from our catalog. After importing both catalogs, the only necessary step is to describe the data set name (e.g., adult, give me some credit, or compas) and the model type (e.g., ann, or linear) the user wants to load. Every recourse method contains the same properties to generate counterfactual examples.

from carla import DataCatalog, MLModelCatalog
from carla.recourse_methods import GrowingSpheres

# 1. Load data set from the DataCatalog
data_name = "adult"
dataset = DataCatalog(data_name)

# 2. Load pre-trained black-box model from the MLModelCatalog
model = MLModelCatalog(dataset, "ann")

# 3. Load recourse model with model specific hyperparameters
gs = GrowingSpheres(model)

# 4. Generate counterfactual examples
factuals = dataset.raw.sample(10)
counterfactuals = gs.get_counterfactuals(factuals)
Figure 6: Example implementation of CARLA, using the data and model catalog.

To give users the possibility to explore their own black-box-model on a custom data set, CARLA provides easy-to-use interfaces that can wrap arbitrary models and data sets. These interfaces specify particular properties that users have to implement in order to work with the library. Figure 7 shows an example implementation of the data wrapper, and Figure 8 depicts the same for an arbitrary black-box-model. After defining data set and black-box model classes, users simply need to call the canonical methods and generate counterfactual examples, similar to the process in Figure 6.

from carla import Data
from carla.recourse_methods import GrowingSpheres

# Custom data set implementations need to inherit from the Data interface
class MyOwnDataSet(Data):
    def __init__(self):
        # The data set can e.g. be loaded in the constructor
        self._dataset = load_dataset_from_disk()

    # List of all categorical features
    def categoricals(self):
        return [...]

    # List of all continuous features
    def continous(self):
        return [...]

    # List of all immutable features which
    # should not be changed by the recourse method
    def immutables(self):
        return [...]

    # Feature name of the target column
    def target(self):
        return "label"

    # Non-encoded and non-normalized, raw data set
    def raw(self):
        return self._dataset
Figure 7: Pseudo-implementation of the CARLA data wrapper

from carla import MLModel

# Custom black-box models need to inherit from
# the MLModel interface
class MyOwnModel(MLModel):
    def __init__(self, data):
        super().__init__(data)
        # The constructor can be used to load or build an
        # arbitrary black-box-model
        self._mymodel = load_model()

        # Define a fitted sklearn scaler to normalize input data
        self.scaler = MySklearnScaler().fit()

        # Define a fitted sklearn encoder for binary input data
        self.encoder = [...]

    # List of the feature order the ml model was trained on
    def feature_input_order(self):
        return [...]

    # The ML framework the model was trained on
    def backend(self):
        return "pytorch"

    # The black-box model object
    def raw_model(self):
        return self._mymodel

    # The predict function outputs
    # the continuous prediction of the model
    def predict(self, x):
        return self._mymodel.predict(x)

    # The predict_proba method outputs
    # the prediction as class probabilities
    def predict_proba(self, x):
        return self._mymodel.predict_proba(x)
Figure 8: Pseudo-implementation of the CARLA black-box-model wrapper

A.4 Benchmarking Recourse Methods

Besides the generation of counterfactual examples, the focus of CARLA lies on benchmarking recourse methods. Users are able to compute evaluation measures to make qualitative statements about usability and applicability.

All measurements described in Section 4.2 are implemented in the Benchmark class of CARLA and can be used for every wrapped recourse method. Figure 9 shows an example implementation of a benchmarking process based on the variables of Figure 6.

from carla import Benchmark

# 1. Initialize the benchmarking class by passing
# black-box-model, recourse method, and factuals into it
benchmark = Benchmark(model, gs, factuals)

# 2. Either only compute the distance measures
distances = benchmark.compute_distances()

# 3. Or run all implemented measurements and create a
# DataFrame which consists of all results
results = benchmark.run_benchmark()
Figure 9: Pseudo-implementation of the CARLA benchmarking process
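
As a rough illustration of what a distance computation between factuals and counterfactuals can look like, the sketch below computes four common norms per instance with NumPy. The function `distance_measures` and the choice of norms are assumptions made for this example; they are not taken from the CARLA source code.

```python
import numpy as np


def distance_measures(factuals, counterfactuals, tol=1e-8):
    """Per-instance distances between factuals and counterfactuals (hypothetical helper)."""
    delta = np.asarray(counterfactuals, dtype=float) - np.asarray(factuals, dtype=float)
    abs_delta = np.abs(delta)
    return {
        "l0": (abs_delta > tol).sum(axis=1),      # number of changed features
        "l1": abs_delta.sum(axis=1),              # total absolute change
        "l2": np.sqrt((delta ** 2).sum(axis=1)),  # Euclidean distance
        "linf": abs_delta.max(axis=1),            # largest single change
    }


factuals = np.array([[0.0, 1.0, 0.5], [1.0, 0.0, 0.2]])
counterfactuals = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.6]])
distances = distance_measures(factuals, counterfactuals)
```

For the first instance two features change, so its `l0` value is 2 and its `l1` value is 1.5; for the second instance only one feature changes by 0.4.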

Appendix B Additional Experimental Results

In this section, we present the remaining experiments on the COMPAS data set in Figure 11 and Table 4. These results underline the trends that we have already highlighted in Section 5.

Figure 10: COMPAS Data
Figure 11: Evaluating the distribution of costs of counterfactual explanations on the COMPAS dataset. For all instances with a negative prediction (), we plot the distribution of and costs of algorithmic recourse as defined in (1) for a logistic regression and an artificial neural network classifier. The white dots indicate the medians (lower is better), and the black boxes indicate the interquartile ranges. We distinguish between independence based and dependence based methods.
                       Artificial Neural Network                  Logistic Regression
Data Set  Method       yNN   redund.  violation  success  time    yNN   redund.  violation  success  time
COMPAS    AR(-LIME)    0.91 0.00 0.02 0.53 0.06 0.00 0.01
          CEM          0.98  2.29     0.43       1.00     0.89    0.93  1.88     0.99       1.00     0.86
          DICE         0.89  0.88     1.03       1.00     0.09    0.95  0.94     0.90       1.00     0.09
          GS           0.44  0.97     0.03       1.00     0.01    0.60  0.64     0.02       1.00     0.01
          Wachter      0.56  1.77     0.74       0.66     10.90   0.50  1.21     0.79       1.00     0.02
(a) Independence based methods

                       Artificial Neural Network                  Logistic Regression
Data Set  Method       yNN   redund.  violation  success  time    yNN   redund.  violation  success  time
COMPAS    CEM-VAE      1.00  5.59     1.98       1.00     0.89    1.00  6.91     2.14       1.00     0.87
          CLUE         0.99  4.06     1.08       1.00     2.03    1.00  4.62     1.25       1.00     1.88
          FACE-EPS     0.94  3.71     1.55       0.99     0.45    0.97  3.94     1.62       0.99     0.45
          FACE-KNN     0.94  3.83     1.63       1.00     0.44    0.97  3.86     1.57       1.00     0.44
          REVISE       1.00  3.29     1.29       1.00     6.06    0.92  3.15     1.03       1.00     5.35
(b) Dependence based methods

Table 4: Summary of COMPAS results for independence and dependence based methods. For all instances with a negative prediction (), we compute counterfactual explanations for which we then measure yNN (higher is better), redundancy (lower is better), violation (lower is better), success rate (higher is better) and time (lower is better). We distinguish between a logistic regression and an artificial neural network classifier. Detailed descriptions of these measures can be found in Section 4. The results are discussed in Appendix B.
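
Two of the measures in Table 4 are simple enough to sketch in a few lines. The example below assumes that a failed search is marked by a row of NaNs and that a violation is counted whenever an immutable feature differs between factual and counterfactual. Both conventions, as well as the helper names `success_rate` and `constraint_violations`, are assumptions for this illustration; the definitions used in the paper are given in Section 4.

```python
import numpy as np
import pandas as pd


def success_rate(counterfactuals: pd.DataFrame) -> float:
    # A row containing NaNs marks an instance without a counterfactual
    found = ~counterfactuals.isna().any(axis=1)
    return float(found.mean())


def constraint_violations(factuals, counterfactuals, immutables, tol=1e-8):
    # Per instance, count how many immutable features were changed;
    # NaN rows (failed searches) contribute zero violations here.
    diff = (counterfactuals[immutables] - factuals[immutables]).abs()
    return (diff > tol).sum(axis=1)


factuals = pd.DataFrame(
    {"age": [30.0, 40.0, 50.0], "sex": [0.0, 1.0, 0.0], "income": [1.0, 2.0, 3.0]}
)
counterfactuals = pd.DataFrame(
    {"age": [30.0, 45.0, np.nan], "sex": [0.0, 1.0, np.nan], "income": [1.5, 2.0, np.nan]}
)

rate = success_rate(counterfactuals)
violations = constraint_violations(factuals, counterfactuals, ["age", "sex"])
```

Here two of three searches succeed, and only the second counterfactual changes an immutable feature (age).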

Appendix C ML Classifiers

In this section, we describe how the black–box models were fitted. CARLA supports different ML libraries to estimate these models (e.g., Pytorch, Tensorflow) as the implementations of the various explanation methods work with a particular ML library. We note that the various explanation methods rely on different binary feature encodings. DICE, for example, requires that binary inputs are supplied as one–hot vectors, while FACE needs binary features encoded in a single column. Whenever methods required different encodings, we fitted two ML models, using the same hyperparameters, and generated CEs with respect to the same set of samples.
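
The difference between the two binary encodings can be illustrated with pandas. This is only a sketch on a toy column; the column name `sex` and its values are made up for the example, and it does not show which encoder classes CARLA uses internally.

```python
import pandas as pd

raw = pd.DataFrame({"sex": ["Male", "Female", "Female", "Male"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(raw["sex"], prefix="sex").astype(int)

# Single-column encoding: drop the first category, keep one indicator
single_column = pd.get_dummies(raw["sex"], prefix="sex", drop_first=True).astype(int)
```

`one_hot` has the columns `sex_Female` and `sex_Male`, whose entries sum to one in every row, while `single_column` keeps only `sex_Male`. A model trained on one layout cannot be fed the other, which is why two models per setting are needed.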

To ensure similar behavior between the different ML libraries and encoding variations, each black-box model type has the same structure (e.g., number of hidden layers, number of neurons) and training parameters (e.g., learning rate, epochs).

The first model is a multi-layer perceptron, consisting of three hidden layers with 18, 9 and 3 neurons, respectively. We use ReLU activation functions and a binary cross entropy loss on the predicted class probabilities. Optimization of the loss function is done by RMSProp tieleman2012lecture using a learning rate of 0.002 for every data set. By performing 25 epochs on COMPAS and 10 epochs on Adult and GMC we reached acceptable performance. Further increasing the number of epochs yielded only marginal performance gains. For Adult we use a batch–size of 1024, for COMPAS 25 and for GMC 2048.
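
The architecture described above can be sketched in PyTorch as follows. The hidden layer sizes, activation, optimizer and learning rate follow the text; the single sigmoid output unit, the input dimension of 13 and the synthetic data are assumptions made so that the example is self-contained.

```python
import torch
from torch import nn


def build_mlp(n_features: int) -> nn.Sequential:
    # Three hidden layers with 18, 9 and 3 neurons and ReLU activations;
    # the sigmoid output unit is an assumption of this sketch.
    return nn.Sequential(
        nn.Linear(n_features, 18), nn.ReLU(),
        nn.Linear(18, 9), nn.ReLU(),
        nn.Linear(9, 3), nn.ReLU(),
        nn.Linear(3, 1), nn.Sigmoid(),
    )


torch.manual_seed(0)
model = build_mlp(n_features=13)
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.002)
loss_fn = nn.BCELoss()

# Synthetic stand-in for one batch of 1024 encoded instances
x = torch.randn(1024, 13)
y = (x[:, 0] > 0).float().unsqueeze(1)

for epoch in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

A Tensorflow counterpart would mirror the same layer sizes and training parameters, so that models from both libraries stay comparable.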

To allow a more extensive comparison between CE methods, we chose linear models as the second black–box model category for which we evaluate the CE methods. Again, we optimized these models with RMSProp using a binary cross entropy loss. For Adult, we used 100 epochs and a batch–size of 2048, for COMPAS we chose 25 epochs and a batch–size of 128, and for GMC we chose 10 epochs with a batch–size of 2048. The learning rate on every data set is set to 0.002. Table 5 provides an overview of the models’ classification accuracies.
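
For readers less familiar with RMSProp, the following NumPy sketch spells out the update rule for a logistic regression trained with a binary cross entropy loss. The learning rate, epochs and batch size mirror the Adult setting above; the decay factor of 0.9, the epsilon term and the synthetic data are assumptions of this example, not values reported in the paper.

```python
import numpy as np


def train_logreg_rmsprop(x, y, lr=0.002, epochs=100, batch_size=2048,
                         decay=0.9, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = x.shape
    xb = np.hstack([x, np.ones((n, 1))])  # append a constant column for the bias
    params = np.zeros(d + 1)              # weights followed by the bias
    sq_avg = np.zeros(d + 1)              # running average of squared gradients
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-(xb[idx] @ params)))
            grad = xb[idx].T @ (p - y[idx]) / len(idx)  # gradient of the mean BCE
            sq_avg = decay * sq_avg + (1.0 - decay) * grad ** 2
            params -= lr * grad / (np.sqrt(sq_avg) + eps)
    return params


data_rng = np.random.default_rng(1)
x = data_rng.normal(size=(4096, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)

params = train_logreg_rmsprop(x, y)
accuracy = float((((x @ params[:2] + params[2]) > 0) == (y == 1)).mean())
```

Because each coordinate's step is normalized by its own gradient magnitude, the effective step size stays close to the learning rate, which is one reason a single value of 0.002 can work across data sets of different scale.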

                     Adult  COMPAS  Give Me Credit
Logistic Regression  0.83   0.84    0.92
Neural Network       0.84   0.85    0.93
Table 5: Performance of the classification models used for generating algorithmic recourse.

Appendix D COMPAS Data Set Description

The COMPAS data set Angwin2016 contains data for more than 10,000 criminal defendants in Florida. It is used by the jurisdiction to score defendants’ likelihood of reoffending. We kept only a small part of the raw data, since features like name, id, case numbers or date-time were dropped. The classification task consists of predicting whether an instance has a high risk of recidivism (score_text is high). By converting the feature race into white and non-white, we keep the categorical input binary. Similar to Adult, the immutable features for COMPAS are age, sex and race.
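
The preprocessing steps described above can be sketched with pandas as follows. The tiny data frame is synthetic, and the column names and category labels (`name`, `id`, `priors_count`, `"Caucasian"`, `"High"`) are assumptions patterned on the public data set rather than a copy of the CARLA preprocessing code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few raw COMPAS rows (hypothetical values)
raw = pd.DataFrame({
    "name": ["a", "b", "c"],
    "id": [1, 2, 3],
    "age": [25, 40, 33],
    "sex": ["Male", "Female", "Male"],
    "race": ["Caucasian", "African-American", "Hispanic"],
    "priors_count": [0, 3, 1],
    "score_text": ["Low", "High", "Medium"],
})


def preprocess_compas(df: pd.DataFrame) -> pd.DataFrame:
    # Drop identifier-like columns that carry no predictive signal
    out = df.drop(columns=["name", "id"])
    # Collapse race into a binary feature
    out["race"] = np.where(out["race"] == "Caucasian", "white", "non-white")
    # Binary target: 1 if the risk score is high, 0 otherwise
    out["high_risk"] = (out["score_text"] == "High").astype(int)
    return out.drop(columns=["score_text"])


processed = preprocess_compas(raw)
immutables = ["age", "sex", "race"]
```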

Appendix E Hyperparameter Search for the Counterfactual Explanation and Recourse Methods

We generated counterfactual explanations for instances from , the set of factuals with negative class predictions.

AR and AR-LIME

It frequently occurred that the action with the lowest cost did not flip the prediction of the black-box classifier. To overcome this problem, we let AR compute a flipset of 150 actions per instance, and subsequently search this set for low–cost CEs. For AR-LIME, we used LIME (ribeiro2016should) and required sampling around the instance to make sure that the coefficients were truly local.
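
The search over a flipset can be sketched as a scan in order of increasing cost that stops at the first action which actually flips the black-box prediction. The representation of actions as additive feature changes, the helper `cheapest_valid_action` and the toy classifier are assumptions for this illustration and do not reproduce the AR package's interface.

```python
import numpy as np


def cheapest_valid_action(x, actions, costs, predict):
    """Return the lowest-cost candidate that the classifier labels as positive."""
    for i in np.argsort(costs):
        candidate = x + actions[i]
        if predict(candidate) == 1:
            return candidate, float(costs[i])
    return None, None  # no action in the flipset flips the prediction


def predict(z):
    # Toy black-box classifier: positive if the feature sum exceeds 1
    return int(z.sum() > 1.0)


x = np.array([0.2, 0.3])
actions = np.array([[0.1, 0.0], [0.3, 0.3], [1.0, 0.0]])
costs = np.abs(actions).sum(axis=1)

counterfactual, cost = cheapest_valid_action(x, actions, costs, predict)
```

In this toy case the cheapest action (cost 0.1) leaves the prediction unchanged, so the second-cheapest action (cost 0.6) is returned instead.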


CEM

After performing grid search, we set the weight to 0.9 and the weight to 0.1, yielding the strongest performance. For CEM-VAE we set the weight to 0.1, and the VAE–weight to 0.9.


CLUE

We use the default hyperparameters from antoran2020getting, which are set as a function of the data set dimension. A hyperparameter search did not yield results that improved distances while keeping the same success rate.


DICE

Although DICE is able to compute a set of counterfactuals for a given instance, we chose to generate only one CE per input instance. We use a grid search for the proximity and diversity weights.


FACE

To determine the strongest hyperparameters for the graph size, we conducted a grid search. We found that values of gave rise to the best balance of success rate and costs. For the epsilon graph, a radius of 0.25 yields the best balance between high yNN and low cost.


GS

We chose 0.02 as the step size with which the sphere is grown. Lower values yield similar results at the cost of higher computation time, while higher values gave worse results.
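
The role of the step size can be seen in a simplified sketch of the sphere-growing idea: candidates are sampled in a shell around the factual, and the shell is moved outward by one step until some candidate is classified as positive. This is a stripped-down illustration and not the implementation benchmarked in the paper; the function name, the sample count and the toy classifier are assumptions.

```python
import numpy as np


def growing_spheres(x, predict, step=0.02, n_samples=500, max_radius=10.0, seed=0):
    rng = np.random.default_rng(seed)
    low, high = 0.0, step
    while high <= max_radius:
        # Random unit directions, scaled to radii inside the current shell
        directions = rng.normal(size=(n_samples, x.size))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.uniform(low, high, size=(n_samples, 1))
        candidates = x + directions * radii
        flipped = candidates[predict(candidates) == 1]
        if len(flipped) > 0:
            # Return the flipped candidate closest to the factual
            dist = np.linalg.norm(flipped - x, axis=1)
            return flipped[np.argmin(dist)]
        low, high = high, high + step  # grow the sphere by one step
    return None


def predict(z):
    # Toy linear classifier: positive above the line x1 + x2 = 1
    return (z @ np.array([1.0, 1.0]) - 1.0 > 0).astype(int)


factual = np.array([0.0, 0.0])
counterfactual = growing_spheres(factual, predict)
```

A smaller step means more shells must be sampled before the decision boundary is reached, which matches the observed trade-off between computation time and result quality.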


REVISE

The grid search to find an acceptable learning rate and similarity weight yielded and for about 1500 iterations.


Wachter

For the target loss, we choose the Binary Cross Entropy with a learning rate of and an initial of . For the distance loss, we use the - norm to measure the similarity between the factual input and the counterfactual point .