Multi-Objective Counterfactual Explanations

by Susanne Dandl et al.
Universität München

Counterfactual explanations are one of the most popular methods to make predictions of black box machine learning models interpretable by providing explanations in the form of ‘what-if scenarios’. Current approaches can compute counterfactuals only for certain model classes or feature types, or they generate counterfactuals that are not consistent with the observed data distribution. To overcome these limitations, we propose the Multi-Objective Counterfactuals (MOC) method, which translates the counterfactual search into a multi-objective optimization problem and solves it with a genetic algorithm based on NSGA-II. It returns a diverse set of counterfactuals with different trade-offs between the proposed objectives, enabling either a more detailed post-hoc analysis to facilitate better understanding or more options for actionable user responses to change the predicted outcome. We show the usefulness of MOC in concrete cases and compare our approach with state-of-the-art methods for counterfactual explanations.




1 Introduction

Interpretable machine learning methods have become very important in recent years for explaining the behavior of black box machine learning (ML) models. A useful method for explaining single predictions of a model is the counterfactual explanation. ML-based credit risk prediction is a common motivating example for counterfactuals. For people whose credit applications have been rejected, it is valuable to know why they were not accepted, either to understand the decision-making process or to assess their actionable options for changing the outcome. Counterfactuals provide these explanations in the form of “if these features had different values, your credit application would have been accepted”. For such explanations to be plausible, they should only suggest small changes in a few features. Counterfactuals can therefore be defined as close neighbors of an actual data point whose predictions are sufficiently close to a (usually quite different) desired outcome. Counterfactuals explain why a certain outcome was not reached, can offer potential grounds to object to an unfair outcome, and give guidance on how the desired prediction could be reached in the future [wachter17]. Note that counterfactuals are also valuable for predictive modelers on a more technical level, to investigate the pointwise robustness and pointwise bias of their model.

2 Related Work

Counterfactuals are closely related to adversarial perturbations, which aim to deceive ML models rather than make them interpretable [su17]. Attribution methods such as Local Interpretable Model-agnostic Explanations (LIME) [ribeiro16] and Shapley values [lundberg17] explain a prediction by determining how much each feature contributed to it. Counterfactual explanations differ from feature attributions in that they generate data points with a different, desired prediction instead of attributing a prediction to the features.

Counterfactual methods can be model-agnostic or model-specific. The latter usually exploit the internal structure of the underlying ML model, such as the trained weights of a neural network, while the former are based on general principles which work for arbitrary ML models – often by only assuming access to the prediction function of an already fitted model. Several model-agnostic counterfactual methods have been proposed [laugel17, grath18, dhurandhar19, karimi19, poyiadzi19, sharma19, white19]. Apart from Grath et al. [grath18], these approaches are limited to classification. Unlike the other methods, the method of Poyiadzi et al. [poyiadzi19] can obtain plausible counterfactuals by constructing feasible paths between data points with opposite predictions; however, their approach only works for numerical features.

A model-specific approach was proposed by Wachter et al. [wachter17], who also introduced and formalized the concept of counterfactuals in predictive modeling. Like many model-specific methods [joshi19, looveren19, mothilal19, russell19, ustun19], their approach is limited to differentiable models. The approach of Tolomei et al. [tolomei17] generates explanations for tree-based ensemble binary classifiers. As with [wachter17] and [looveren19], it only returns a single counterfactual per run.

3 Contributions

In this paper, we introduce Multi-Objective Counterfactuals (MOC), which, to the best of our knowledge, is the first method to formalize the counterfactual search as a multi-objective optimization problem. We argue that the mathematical problem behind the search for counterfactuals should naturally be addressed as multi-objective. Most of the above methods optimize a collapsed, weighted sum of multiple objectives, whose weights are difficult to balance a priori. They carry the risk of arbitrarily reducing the solution set to a single candidate without the option to discuss inherent trade-offs – which should be especially relevant for model interpretation, a goal that is by design very hard to capture precisely in a single mathematical formulation.

Compared to Wachter et al., we use a distance metric for mixed feature spaces and two additional objectives: one that measures the number of feature changes to obtain sparse and therefore more interpretable counterfactuals, and one that measures the closeness to the nearest observed data points for more plausible counterfactuals. MOC returns a Pareto set of counterfactuals that represent different trade-offs between our proposed objectives and that are constructed to be diverse in feature space. This seems preferable because changes to different features can lead to the desired counterfactual prediction (the Rashomon effect [breiman01]), and it is more likely that some counterfactuals meet the (hidden) preferences of a user. A single counterfactual might even suggest a strategy that is interpretable but not actionable (e.g., ‘reduce your age’) or counterproductive in more general contexts (e.g., ‘raise your blood pressure to reduce the risk of diabetes’). In addition, if multiple otherwise quite different counterfactuals suggest changes to the same feature, the user may have more confidence that this feature is an important lever to achieve the desired outcome. Due to space constraints, we refer the reader to Appendix LABEL:ap:examples for two concrete examples illustrating the above.

Compared to other counterfactual methods, MOC is model-agnostic and handles classification, regression and mixed feature spaces, which (in our opinion) further increases its practical usefulness in general applications.

4 Methodology

Before we explain the counterfactual search in more detail, we define a counterfactual explanation and propose four objectives associated with this definition.

Definition 1 (Counterfactual Explanation)

Let f : X → R be a prediction function, X the feature space and Y' ⊂ R a set of desired outcomes. The latter can either be a single value or an interval of values. Then, a counterfactual explanation x' for the data point x* is a close neighbor of x* whose prediction f(x') is in, or sufficiently close to, the desired outcome set Y'. To improve interpretability, x' should differ from x* only in a few features and should be a plausible data point according to the probability distribution P_X.

For classification models, we assume that f returns the probability for a user-selected class and Y' has to be the desired probability (range).

4.1 Counterfactual Objectives

Definition 1 of a counterfactual explanation can be translated into a multi-objective optimization task with four objectives to be minimized:

    o(x) := ( o1(f(x), Y'), o2(x, x*), o3(x, x*), o4(x, X_obs) )    (1)

with x* as the data point of interest and X_obs as the observed data. The first component o1 quantifies the distance between f(x) and Y'. We define it as:

    o1(f(x), Y') = 0                           if f(x) ∈ Y'
                   inf_{y' ∈ Y'} |f(x) − y'|   otherwise

(We chose the L1 norm over the L2 norm for a natural interpretation; its numerical disadvantage is negligible for evolutionary optimization.)

The second component o2 quantifies the distance between x and x*, using the Gower distance to account for mixed features [gower71]:

    o2(x, x*) = (1/p) · Σ_{j=1}^{p} δ_G(x_j, x*_j)  ∈ [0, 1]

with p being the number of features. The value of δ_G depends on the feature type:

    δ_G(x_j, x*_j) = (1/R_j) · |x_j − x*_j|   if x_j is numerical
                     I(x_j ≠ x*_j)            if x_j is categorical

with R_j as the value range of feature j, extracted from the observed dataset.

Since the Gower distance does not take into account how many features have been changed, we introduce objective o3, which counts the number of changed features using the L0 norm:

    o3(x, x*) = ||x − x*||_0 = Σ_{j=1}^{p} I(x_j ≠ x*_j)

To obtain counterfactuals close to the observed data X_obs, the fourth objective o4 measures the weighted average Gower distance between x and the k nearest observed data points x^[1], …, x^[k] ∈ X_obs:

    o4(x, X_obs) = Σ_{i=1}^{k} w^[i] · d_G(x, x^[i]),   with w^[i] ≥ 0 and Σ_{i=1}^{k} w^[i] = 1
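A minimal Python sketch of these four objectives may help fix ideas. This is an illustration, not the reference implementation: feature vectors are plain lists, the `gower` and `objectives` helpers are hypothetical names, the desired outcome is given as an interval, and the k nearest neighbors receive equal weights.

```python
def gower(x, x_star, ranges):
    """Mean per-feature Gower distance; ranges[j] is the value range of a
    numerical feature j, or None for a categorical feature."""
    total = 0.0
    for xj, sj, rj in zip(x, x_star, ranges):
        if rj is None:                 # categorical: 0/1 mismatch indicator
            total += float(xj != sj)
        else:                          # numerical: range-scaled L1 distance
            total += abs(xj - sj) / rj
    return total / len(x)

def objectives(x, x_star, f, target, X_obs, ranges, k=1):
    # o1: distance of the prediction to the desired interval [lo, hi]
    lo, hi = target
    pred = f(x)
    o1 = 0.0 if lo <= pred <= hi else min(abs(pred - lo), abs(pred - hi))
    # o2: Gower distance to the point of interest x*
    o2 = gower(x, x_star, ranges)
    # o3: number of changed features (L0 norm)
    o3 = sum(xj != sj for xj, sj in zip(x, x_star))
    # o4: average Gower distance to the k nearest observed points (equal weights)
    dists = sorted(gower(x, xo, ranges) for xo in X_obs)
    o4 = sum(dists[:k]) / k
    return (o1, o2, o3, o4)
```

All four components are to be minimized jointly; no weighting between them is imposed at this stage.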

Further procedures to increase the plausibility of the counterfactuals are integrated into the optimization algorithm and are described in Section 4.3.

Balancing the four objectives is difficult since they partly contradict each other. For example, minimizing the distance between the counterfactual outcome and the desired outcome (o1) becomes more difficult when we require counterfactual feature values close to x* (o2 and o3) and to the observed data (o4).

4.2 Counterfactual Search

The Nondominated Sorting Genetic Algorithm II (NSGA-II) [deb02] is an evolutionary multi-objective genetic algorithm consisting of two steps that are repeated in each generation: generating new candidates by selection, recombination and mutation of candidates of the current population, and selecting the best candidates by nondominated sorting and crowding distance sorting. Nondominated sorting favors solutions near the Pareto front and crowding distance sorting preserves the diversity of the candidates.
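The selection step above relies on partitioning candidates into Pareto fronts. The following is a simplified sketch of nondominated sorting for objective vectors to be minimized – an O(n²)-per-front illustration, not the bookkeeping-optimized fast nondominated sort of [deb02]:

```python
def dominates(a, b):
    """a dominates b: no worse in every objective, strictly better in one."""
    return (all(ai <= bi for ai, bi in zip(a, b))
            and any(ai < bi for ai, bi in zip(a, b)))

def nondominated_sort(points):
    """Partition candidate objective vectors into Pareto fronts F1, F2, ...
    Returns a list of fronts, each a list of indices into `points`."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        # the next front contains every candidate not dominated by any other
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts
```

Candidates in earlier fronts are preferred when filling the next generation; ties within a front are broken by the crowding distance described below.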

MOC directly uses NSGA-II with modifications to optimize its four objectives. First, unlike the original NSGA-II, it uses mixed integer evolutionary strategies (MIES) [li13] to work with the mixed discrete and continuous search space. Furthermore, a different crowding distance sorting algorithm is used, and we propose some optional adjustments tailored to the counterfactual search in the upcoming section. For MOC, each candidate is described by its feature vector (the ‘genes’), and the objective values of the candidates are evaluated by Eq. (1). Crowding distance sorting reflects the diversity of each candidate as the sum of distances to its neighbors across the four objectives. In contrast to NSGA-II, the crowding distance is computed not only in the objective space (L1 norm), but also in the feature space (Gower distance), and the two distances are summed with equal weighting. As a result, candidates are more likely to be kept if they differ greatly from other candidates in their feature values, even if they are similar in objective values. Diversity in feature space is desired because it increases the chances of obtaining counterfactuals that meet the (hidden) preferences of users. This approach is based on Avila et al. [avila06].
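A sketch of this combined diversity criterion, assuming the standard NSGA-II crowding distance as a helper and purely numerical feature vectors for simplicity (for mixed features the feature-space part would use the Gower distance instead):

```python
def crowding_distance(values):
    """Standard NSGA-II crowding distance over the rows of `values`.
    Boundary points per dimension get infinite distance."""
    n, m = len(values), len(values[0])
    dist = [0.0] * n
    for j in range(m):
        order = sorted(range(n), key=lambda i: values[i][j])
        lo, hi = values[order[0]][j], values[order[-1]][j]
        span = (hi - lo) or 1.0
        dist[order[0]] = dist[order[-1]] = float("inf")
        for r in range(1, n - 1):
            i = order[r]
            dist[i] += (values[order[r + 1]][j] - values[order[r - 1]][j]) / span
    return dist

def combined_crowding(objs, feats):
    """Sum of objective-space and feature-space crowding, equally weighted."""
    co = crowding_distance(objs)
    cf = crowding_distance(feats)
    return [a + b for a, b in zip(co, cf)]
```

A candidate crowded in objective space can still survive selection if it is isolated in feature space, which is exactly the diversity effect described above.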

MOC stops if either a predefined number of generations is reached (default) or the performance no longer improves for a given number of successive generations.

4.3 Further Modifications

4.3.1 Initialization

By default, MOC randomly initializes the feature values of a candidate in the first population and randomly selects whether they differ from the values of x*. However, if a feature has a large influence on the prediction, it should be more likely that a counterfactual proposes changes to this feature. The importance of a feature for an entire dataset can be measured as the standard deviation of the partial dependence plot [greenwell18]. Analogously, we propose to measure the feature importance for a single prediction with the standard deviation of the Individual Conditional Expectation (ICE) curve of x*. ICE curves show, for one observation and one feature, how the prediction changes when the feature is changed while all other features are fixed to the values of the considered observation [goldstein15]. The greater the standard deviation of the ICE curve, the higher we set the probability that the feature is initialized with a different value than the one of x*. Therefore, the standard deviation σ_j of each feature j is transformed into a probability p_j within [p_min, p_max]:

    p_j = p_min + (σ_j − min_l σ_l) / (max_l σ_l − min_l σ_l) · (p_max − p_min)

with p_min and p_max as hyperparameters with default values 0.01 and 0.99.
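One plausible reading of this transformation is min-max scaling of the ICE standard deviations into [p_min, p_max]; the following sketch makes that assumption explicit (the constant-importance fallback is a choice made here for illustration, not taken from the paper):

```python
def init_probabilities(ice_sds, p_min=0.01, p_max=0.99):
    """Min-max scale per-feature ICE standard deviations into [p_min, p_max],
    yielding the probability that each feature is initialized with a value
    different from the one of x*."""
    lo, hi = min(ice_sds), max(ice_sds)
    if hi == lo:                       # all features equally important
        return [0.5 * (p_min + p_max)] * len(ice_sds)
    scale = (p_max - p_min) / (hi - lo)
    return [p_min + (s - lo) * scale for s in ice_sds]
```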

4.3.2 Actionability

To get more actionable counterfactuals, extreme values of numerical features outside a predefined range are capped to the upper or lower bound after recombination and mutation. The ranges can either be derived from the minimum and maximum values of the features in the observed dataset, or users can define them. In addition, users can identify non-actionable features such as the country of birth or age. The values of these features are permanently set to the values of x* for all candidates within MOC.
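A sketch of this post-processing step, with hypothetical inputs: `bounds` maps numerical feature indices to their allowed (min, max) range, and `fixed` lists the indices of non-actionable features:

```python
def enforce_actionability(candidate, x_star, bounds, fixed):
    """Cap numerical features to their allowed range and freeze
    non-actionable features to the values of x*."""
    out = list(candidate)
    for j, (lo, hi) in bounds.items():   # cap extreme values to [lo, hi]
        out[j] = min(max(out[j], lo), hi)
    for j in fixed:                      # e.g. age, country of birth
        out[j] = x_star[j]
    return out
```

Applied after every recombination and mutation step, this keeps the whole population inside the actionable region.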

4.3.3 Penalization

Furthermore, candidates whose predictions are further away from the target than a predefined distance can be penalized. After the candidates have been sorted into fronts F_1 to F_k using nondominated sorting, the solution that violates the constraint least is reassigned to front F_{k+1}, the candidate with the second smallest violation to F_{k+2}, and so on. The concept is based on Deb et al. [deb02]. Since the constraint violators are in the last fronts, they are less likely to be selected for the next generation.
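This reassignment can be sketched as follows (hypothetical representation: fronts as lists of candidate indices, `violation[i]` as the constraint violation of candidate i with zero meaning feasible; for simplicity each violator forms its own trailing front, ordered by increasing violation):

```python
def penalize(fronts, violation):
    """Move constraint violators behind all feasible fronts, smallest
    violation first, so they are least likely to be selected."""
    violators = sorted((i for f in fronts for i in f if violation[i] > 0),
                       key=lambda i: violation[i])
    feasible = [[i for i in f if violation[i] == 0] for f in fronts]
    feasible = [f for f in feasible if f]          # drop emptied fronts
    return feasible + [[i] for i in violators]     # trailing violator fronts
```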

4.3.4 Mutation

Features of the candidates are mutated with a predefined (low) probability – one of the hyperparameters of MOC. By default, numerical features are altered by the scaled Gaussian mutator, which adds a normally distributed random value to the current feature value. Discrete features are mutated by uniformly sampling from their possible values. Binary and logical features are flipped. Alternatively, we propose to sample a new feature value from its conditional distribution given all other features, to obtain counterfactuals that are consistent with the observed data distribution. The order in which the features are changed is randomized, since each mutation depends on the previous mutations. The conditional distribution is estimated by conditional transformation trees derived from the observed dataset.
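The default mutators can be sketched as follows. This is a simplified illustration: the scale of the Gaussian mutation is assumed here to be a fraction of the feature's observed range, and the conditional-distribution variant is omitted:

```python
import random

def mutate(candidate, ranges, levels, p_mut=0.2, sd_frac=0.1, rng=random):
    """Mutate each feature with probability p_mut, in randomized order.

    ranges[j]: (min, max) for numerical features, None otherwise.
    levels[j]: possible values for discrete features, None otherwise.
    Remaining features are treated as binary/logical and flipped.
    """
    out = list(candidate)
    order = list(range(len(out)))
    rng.shuffle(order)                     # randomized mutation order
    for j in order:
        if rng.random() >= p_mut:
            continue
        if ranges[j] is not None:          # scaled Gaussian mutation, capped
            lo, hi = ranges[j]
            out[j] = min(max(out[j] + rng.gauss(0, sd_frac * (hi - lo)), lo), hi)
        elif levels[j] is not None:        # uniform resampling of levels
            out[j] = rng.choice(levels[j])
        else:                              # binary/logical feature: flip
            out[j] = not out[j]
    return out
```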


4.4 Evaluation Metric

A common measure to evaluate the set of nondominated solutions (Pareto set) returned by genetic algorithms is the dominated hypervolume indicator [zitzler98]. It is defined as the volume of the region of the objective space – bounded by a reference point – that contains candidates weakly dominated by at least one member of the Pareto set. Higher values of the hypervolume are preferred. For MOC, the reference point is set to the maximal values of the four objectives. The dominated hypervolume should increase monotonically over the generations, since MOC adaptively improves the fit of the population of candidates to the Pareto front.
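For the special case of two objectives, the dominated hypervolume can be computed with a simple sweep over the nondominated points; a minimal sketch for minimization with a user-chosen reference point:

```python
def hypervolume_2d(points, ref):
    """Dominated hypervolume for 2-D minimization w.r.t. reference point `ref`."""
    # keep only points that strictly dominate the reference point
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, best_y = 0.0, ref[1]
    for x, y in pts:               # sweep in increasing x
        if y < best_y:             # nondominated so far: add its box slice
            hv += (ref[0] - x) * (best_y - y)
            best_y = y
    return hv
```

Dominated points contribute nothing, so the value depends only on the Pareto set and the reference point; in higher dimensions, exact computation requires more involved algorithms.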

4.5 Tuning of Parameters

The dominated hypervolume also serves as the performance measure for tuning the hyperparameters of MOC using iterated F-racing [lopez16]. The tuned parameters are the population size and the probabilities of recombining and mutating the features of a candidate. They were tuned on six binary classification datasets from OpenML [vanschoren13] – which were not used in the benchmark – and on single desired targets. A summary of the tuning setup and results can be found in Table LABEL:tab:psirace in Appendix LABEL:ap:irace. The tuned parameters were used for the credit data application and the benchmark study.

5 Credit Data Application

This section demonstrates the usefulness of MOC to explain the prediction of credit risk using the German credit dataset [kaggle16]. The dataset has 522 observations and nine features containing credit and customer information. Categories with few cases were combined. The binary target indicates whether a customer has a ‘good’ or ‘bad’ credit risk. We tuned a support vector machine (with radial-basis-function (RBF) kernel) on 80 % of the data with the same tuning setup as for the benchmark (Appendix LABEL:ap:bench). To obtain a single numerical outcome, only the predicted probability for the class ‘good’ credit risk was returned. On the 20 % hold-out set, the model had an accuracy of 0.6, which is sufficient for illustration purposes. We chose the first observation of the dataset as x*, with a predicted probability for ‘good’ credit risk below 0.5 and the following feature values:

Feature age sex job housing saving checking credit.amount duration purpose
Value 22 female 2 own little moderate 5951 48 radio/TV

We set the desired outcome interval to [0.5, 1], which indicates a change to a ‘good’ credit risk. We generated counterfactuals using MOC with the modified initialization and mutation strategies and the parameters tuned by iterated F-racing. Candidates with a prediction below 0.5 were penalized.

A total of 59 counterfactuals were found by MOC. All had a prediction greater than 0.5 for class ‘good’. Credit duration was changed for all counterfactuals, followed by credit amount (98 %). The person’s age was only changed for one counterfactual. Since a user might not want to investigate all returned counterfactuals individually (in feature space), we provide a visual summary of the Pareto set in Figure 1, either as a parallel coordinate plot or a response surface plot along two numerical features. (The latter is equivalent to a 2-D ICE curve through x* [goldstein15]; we refer to Section 4.3 for a general definition of ICE curves.) All counterfactuals had values equal to or smaller than the values of x* for duration, credit amount and age.

(a) Feature values
(b) Response surface plot
Figure 1: Visualization of counterfactuals for the prediction of the first data point of the credit dataset. Left: Feature values of the counterfactuals. Only changed features are shown. The blue line corresponds to the values of x*. The given numbers are the minimum and maximum values of the features over x* and the counterfactuals. Right: Response surface plot for the model prediction along the features duration and credit amount, holding all other features constant at the values of x*. Colors and contour lines indicate the predicted value. The white point is x* and the black points are the counterfactuals that only proposed changes in duration and/or credit amount. The histograms show the marginal distributions of the features in the observed dataset.

The response surface plot illustrates why the counterfactuals with changes in the features duration, amount or both are consistent with our multi-objective task. The color gradient and contour lines indicate that either duration or both amount and duration must be decreased to reach the desired outcome. Due to the fourth objective and the modified mutator, we obtained counterfactuals in high-density areas (indicated by the histograms), even though x* is in a low-density area. Counterfactuals in the lower left corner seem to be in a less favorable region far from x*, but they are close to the observed data points.

6 Experimental Setup

In this section, the performance of MOC is evaluated in a benchmark study for binary classification. The datasets for this benchmark are from the OpenML platform [vanschoren13] and are briefly described in Table LABEL:tab:databench. We selected datasets with no missing values, with up to 1500 observations and a maximum of 40 features.

Task Name Obs Cont Cat
3718 boston 506 12 1
3846 cmc 1473 2 7
145976 diabetes 768 8 0
9971 ilpd 583 9 1
3913 kc2 522 21 0
3 kr-vs-kp 3196 0 36
3749 no2 500 7 0
3918 pc1 1109 21 0
3778 plasma_retinol 315 10 3
145804 tic-tac-toe 958 0 9
Table 1: Description of benchmark datasets. Legend: Task: OpenML task id; Obs: number of rows; Cont/Cat: number of continuous/categorical features.

DiCE Recourse Tweaking
boston 1* (16) 0.7 (10) 1* (8)
cmc 0.95* (20) 1* (9)
diabetes 1* (65) 0.8* (20) 1* (7)
ilpd 1* (18) 0.95* (22) 1* (6)
kc2 1* (47) 0.79* (24) 1 (1)
kr-vs-kp 1* (4) 0.14 (7)
no2 1* (45) 1* (25) 1* (10)
pc1 1* (42) 0.5 (14)
plasma_retinol 1* (6) 0.88* (8)
tic-tac-toe 1* (2) 0 (7)
Table 2: MOC’s coverage rate of the compared methods per dataset, averaged over all models. The number of nondominated counterfactuals for each method is given in parentheses. Higher coverage values indicate that MOC dominates the other method. A * indicates that the binomial test (H0: a counterfactual is covered by MOC) is statistically significant.

For each dataset, we tuned and trained the following models on 80 % of the data: logistic regression, random forest, xgboost, RBF support vector machine and a one-hidden-layer neural network. The tuning parameter set and the performance on the 20 % hold-out set are given in Table LABEL:tab:perf in Appendix LABEL:ap:bench. Each model returned only the probability for one class. Ten observed data points per dataset were randomly selected as x*. The desired target for each x* was set to the opposite of the predicted class: