Convex Density Constraints for Computing Plausible Counterfactual Explanations

02/12/2020 ∙ by André Artelt, et al. ∙ Bielefeld University

The increasing deployment of machine learning as well as legal regulations such as EU's GDPR cause a need for user-friendly explanations of decisions proposed by machine learning models. Counterfactual explanations are considered as one of the most popular techniques to explain a specific decision of a model. While the computation of "arbitrary" counterfactual explanations is well studied, it is still an open research problem how to efficiently compute plausible and feasible counterfactual explanations. We build upon recent work and propose and study a formal definition of plausible counterfactual explanations. In particular, we investigate how to use density estimators for enforcing plausibility and feasibility of counterfactual explanations. For the purpose of efficient computations, we propose convex density constraints that ensure that the resulting counterfactual is located in a region of the data space of high density.



1 Introduction

As research on machine learning (ML) progresses and ML models constitute state-of-the-art approaches in domains such as machine translation and image and text classification, we observe an increased deployment of ML technology in practice [12, 16, 32]. At the same time, ML models are vulnerable to unexpected behavior such as adversarial attacks [29] and behavior which is regarded as unfair by humans [21]; hence a large part of the decision making process offered by ML is not fully understood by humans. As a consequence, and due to legal regulations like the EU's GDPR [23], transparency and interpretability of ML models become more and more relevant. Therefore, there is a need for tools that make ML models transparent in the sense that we can explain the decision making process of a model. Accordingly, we observe an increase of research in the area of explainable AI (XAI) [11, 15, 28, 30].

Over time, researchers developed a diverse set of methods for explaining ML models [15, 22]. Model-agnostic methods [15, 25] are not tailored to a particular model or representation, hence they are (in theory) applicable to many different types of ML models; in the extreme, "truly" model-agnostic methods do not need access to the training data or model internals but regard the model as a black box. There exists a variety of model-agnostic approaches, including feature interaction methods [13], feature importance methods [9], partial dependency plots [34] and local methods that approximate the model locally by an explainable model [14, 26]. This group of techniques relies on feature importance rankings or similar to explain the decisions of a given model. A different class of explanations relies on examples that explain a prediction by a (set of) data points [2]. Prototypes & criticisms [17] and influential instances [18] are instances of such example-based explanations.

One popular instance of example-based explanations, often realized as a black-box scheme, are counterfactual explanations [22, 31]. A counterfactual explanation states a change to the original input that leads to a different prediction of a given ML model. This type of explanation is considered particularly intuitive because it tells the user what to do in order to achieve a desired goal [22, 31]. Despite the huge variety of different - equally important - types of explanations, we limit ourselves to counterfactual explanations in this contribution. Counterfactual explanations can be phrased as a constrained optimization problem that aims to minimize the change which results in a different output. Depending on the specific setting, this optimization problem is solved either by gradient-based schemes or, in particular in agnostic settings, by black-box solvers. Approaches which rely on the specific form of the given classifier can lead to much more efficient computation schemes, as demonstrated in


Yet, stated in its simplest form, counterfactuals are very similar to adversarial examples, since there are no guarantees that the resulting counterfactual is plausible and feasible in the data domain. As a consequence, the absence of such constraints often leads to counterfactual explanations that are not plausible [8, 19, 24] - an observation that we will also confirm in this work.

In this work, we aim for an extension of counterfactual explanation schemes which restricts possible explanations to plausible regions of the data space. More specifically, we propose and study a formal definition of plausible counterfactual explanations and propose a modeling framework, which phrases such constraints in convex form, such that they can efficiently be integrated into optimization schemes, preserving uniqueness of solutions or efficiency if this is valid for the constrained version.

2 Definition and Related Work

We briefly review existing work on enforcing plausibility of counterfactual explanations (Definition 1). In the context of ML models, counterfactual explanations are formalized as follows:

Definition 1 (Counterfactual explanation [31])

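Assume a prediction function $h: \mathbb{R}^d \to \mathcal{Y}$ is given. Computing a counterfactual $x' \in \mathbb{R}^d$ for a given input $x \in \mathbb{R}^d$ is phrased as the following optimization problem:

$$\underset{x' \in \mathbb{R}^d}{\arg\min}\; \ell\big(h(x'), y'\big) + C \cdot \theta(x', x) \tag{1}$$

where $\ell(\cdot)$ denotes a loss function, $y'$ the requested prediction, and $\theta(\cdot)$ a penalty term for deviations of $x'$ from the original input $x$. $C > 0$ denotes the regularization strength.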

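Two common regularizations are the weighted Manhattan distance and the Mahalanobis distance. The weighted Manhattan distance is defined as:

$$\theta(x', x) = \sum_{j} \alpha_j \cdot \big|(x)_j - (x')_j\big| \tag{2}$$

where $\alpha_j > 0$ denote feature-wise weights. The Mahalanobis distance is defined as:

$$\theta(x', x) = (x - x')^\top \Omega\, (x - x') \tag{3}$$

where $\Omega$ denotes a symmetric positive semi-definite (s.psd.) matrix.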
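The two regularizers can be illustrated with a minimal sketch in plain Python. It assumes the weighted Manhattan distance is the weighted sum of absolute feature differences and that the Mahalanobis term is the quadratic form $(x - x')^\top \Omega (x - x')$ without a square root; the function names are hypothetical and chosen only for this illustration.

```python
from typing import Sequence


def weighted_manhattan(x: Sequence[float], x_cf: Sequence[float],
                       alpha: Sequence[float]) -> float:
    """Weighted Manhattan distance: sum_j alpha_j * |x_j - x'_j|."""
    return sum(a * abs(xj - cj) for a, xj, cj in zip(alpha, x, x_cf))


def mahalanobis_penalty(x: Sequence[float], x_cf: Sequence[float],
                        omega: Sequence[Sequence[float]]) -> float:
    """Quadratic form (x - x')^T Omega (x - x').

    Assumption: omega is a symmetric positive semi-definite matrix given as
    nested sequences; no square root is taken.
    """
    diff = [xj - cj for xj, cj in zip(x, x_cf)]
    n = len(diff)
    return sum(diff[i] * omega[i][j] * diff[j]
               for i in range(n) for j in range(n))
```

With unit weights the first function reduces to the ordinary Manhattan distance, and with the identity matrix the second reduces to the squared Euclidean distance.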

In general, the counterfactual $x'$ obtained from Eq. (1) is arbitrary, hence possibly implausible. A variety of approaches aims for a restriction of the domain to plausible patterns only. The authors of [24] propose to compute a path of intermediate counterfactuals that lead to the final counterfactual. The idea is to provide the user with a set of intermediate goals that finally lead to the desired goal - it might be easier to "go into the direction" of the final goal step by step instead of accomplishing it in a single step. In order to compute such a path of intermediate counterfactuals, different strategies for constructing a graph on the training data set (including the query point) are proposed. In this graph, two samples are connected by a weighted edge if they are "sufficiently close to each other", e.g. based on density estimation. The path of intermediate counterfactuals is then computed as the shortest path between the query point and a point that has the requested label; this data point is the final counterfactual. Therefore, the final counterfactual as well as all intermediate counterfactuals are elements of the training data set, which ensures that all counterfactuals are plausible and feasible. However, the limitation to samples from the training set can be seen as a major drawback of this method, in particular for sparsely populated data spaces.

A slightly modified version of Eq. (1) was proposed in [19]. The authors suggest that the original formalization Eq. (1) does not take into account that the counterfactual should lie on the data manifold which would enforce plausibility. Therefore, they propose to add two additional terms to the original objective in Eq. (1), which should be simultaneously optimized:

  1. The distance between the counterfactual $x'$ and its reconstruction computed by a pretrained autoencoder.

  2. The distance between the encoding of the counterfactual and the mean encoding of training samples that belong to the requested class $y'$.

The first term is supposed to ensure that the counterfactual $x'$ lies on the data manifold and thus is a plausible data instance. The second term is supposed to accelerate the solver for computing the solution of the final optimization problem. We think that this is a very promising approach. However, the objective itself still behaves like a heuristic because, like the original Eq. (1), there are no guarantees that the resulting counterfactual is plausible/feasible or even valid at all - one would have to do extensive hyperparameter tuning of the objective. Furthermore, the need for a working autoencoder can be considered another bottleneck, because building high quality and stable autoencoders can be quite challenging if only very little data is available - in particular if the autoencoder is modeled by deep neural networks. Lastly, due to the non-convexity of the autoencoder and the model itself, the resulting optimization problem is highly non-convex and thus difficult to solve.

Somewhat similar to [19], the authors of [20] propose to use GANs and VAEs for creating realistic images. They do not talk explicitly about counterfactuals but want to compute contrastive explanations [8], which are similar to counterfactuals in the sense that in both cases we want to find a minimal change that leads to a specific prediction (although contrastive explanations have a second objective). (Footnote 1: A contrastive explanation states a minimal amount of (present and absent) features (including their values) that are responsible for a specific prediction. Such an explanation is computed by finding a minimal perturbation to the input that yields the same (present features) or a different (absent features) prediction. In order to stay close to the data manifold - that is, to enforce that the results are plausible - they propose to use an autoencoder.)

The authors of [5] propose a convex modeling framework for efficiently computing counterfactual explanations of different ML models. They propose to turn the optimization problem Eq. (1) into a constrained optimization problem:

$$\underset{x' \in \mathbb{R}^d}{\arg\min}\; \theta(x', x) \quad \text{s.t.} \quad h(x') = y' \tag{4}$$

By exploiting model specific structures, they are able to turn Eq. (4) into a convex program for many different ML models. The benefits of this modeling are that convex programs can be solved very efficiently [7], that additional convex constraints can be added without changing the complexity of the problem, and that feasibility - does a solution (counterfactual) exist under a given set of constraints? - can be verified easily. By adding additional constraints we can ensure that the counterfactual is plausible/feasible in the specific data domain. However, manually constructing plausibility constraints can be very time consuming and requires solid domain knowledge which might not be available. These approaches yield promising results, yet their greatest disadvantage is the potentially high computational load of the induced optimization problem. Here, we will take a different avenue by phrasing the condition of plausibility as a convex constraint.
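To make the constrained formulation concrete, the following sketch solves one very simple special case in closed form. It is not the framework of [5]: it assumes a binary linear classifier $h(x) = \mathrm{sign}(w^\top x + b)$ with labels in $\{-1, +1\}$, the squared Euclidean distance as penalty (Mahalanobis with identity matrix), and a small margin that replaces the strict decision boundary by a closed half-space. The function name and the `margin` parameter are hypothetical.

```python
from typing import List, Sequence


def closest_counterfactual_linear(x: Sequence[float], w: Sequence[float],
                                  b: float, y_target: int,
                                  margin: float = 1e-3) -> List[float]:
    """Solve  min ||x' - x||_2^2  s.t.  y_target * (w^T x' + b) >= margin.

    The feasible set is a half-space, so the minimizer is the Euclidean
    projection of x onto it. y_target must be -1 or +1.
    """
    norm_sq = sum(wj * wj for wj in w)
    if norm_sq == 0.0:
        raise ValueError("weight vector must be non-zero")
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    if y_target * score >= margin:
        return list(x)  # x already receives the requested prediction
    # Move along w until the constraint holds with equality.
    step = (margin * y_target - score) / norm_sq
    return [xj + step * wj for xj, wj in zip(x, w)]
```

For richer models and distances no closed form exists in general, which is where a convex programming solver becomes necessary.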

Our contribution builds on our prior work [5], which phrases counterfactual computation in terms of efficient constrained optimization problems for many popular classifiers. Besides a formal definition of plausible counterfactuals, we propose convex density constraints that can be built from a given data set automatically and efficiently. These constraints ensure that the density of the resulting counterfactual is lower bounded by a predefined/requested threshold. Note that all proofs and derivations can be found in the appendix 0.A.

3 Plausible Counterfactual Explanations

3.1 Computation of Plausible Counterfactual Explanations

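For the purpose of enforcing plausibility of counterfactuals, we propose to add a target specific density constraint to Eq. (4):

$$\underset{x' \in \mathbb{R}^d}{\arg\min}\; \theta(x', x) \tag{5a}$$
$$\text{s.t.} \quad h(x') = y' \tag{5b}$$
$$\hat{p}_{y'}(x') \geq \delta \tag{5c}$$

where $\hat{p}_{y'}(\cdot)$ denotes a class dependent density estimator and $\delta$ the requested minimum density.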

There exists a variety of different density estimators that estimate the density based on training samples.

A kernel density estimator (KDE) is a popular choice when it comes to estimating densities from training data. A kernel density estimator is a non-parametric model and is defined as:

$$\hat{p}_{\text{KDE}}(x) = \sum_{i} \alpha_i \cdot k(x, x_i) \tag{6}$$

where $k(\cdot, \cdot)$ denotes a suitable kernel function, $x_i$ denotes the $i$-th sample in the training data set and $\alpha_i > 0$ denotes the weighting of the $i$-th sample. However, in the case of non-linear kernels (e.g. the Gaussian kernel), the resulting density estimator is highly non-convex and does not induce an efficient optimization problem.
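As an illustration of the kernel density estimator, here is a minimal sketch in plain Python. It assumes an isotropic Gaussian kernel with a single bandwidth and uniform sample weights by default; the function names and the `bandwidth` parameter are hypothetical and not taken from the paper.

```python
import math
from typing import Optional, Sequence


def gaussian_kernel(x: Sequence[float], x_i: Sequence[float],
                    bandwidth: float) -> float:
    """Isotropic Gaussian kernel centred at x_i, normalised to integrate to 1."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_i))
    norm = (2.0 * math.pi * bandwidth ** 2) ** (-d / 2.0)
    return norm * math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kde_density(x: Sequence[float], samples: Sequence[Sequence[float]],
                bandwidth: float,
                weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of kernels: sum_i alpha_i * k(x, x_i)."""
    n = len(samples)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: uniform weighting
    return sum(a * gaussian_kernel(x, x_i, bandwidth)
               for a, x_i in zip(weights, samples))
```

The estimate is a sum of one bump per training sample, so a constraint of the form "density at least some threshold" describes a region that is in general neither connected nor convex.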

In a Gaussian mixture model (GMM) the density is modeled as a mixture of multivariate normal distributions. The density under a GMM with $m$ components is defined as:

$$\hat{p}_{\text{GMM}}(x) = \sum_{j=1}^{m} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j) \tag{7}$$

where $\pi_j$ denotes the prior probability of the $j$-th component, and $\mu_j$ and $\Sigma_j$ denote the mean and covariance of the $j$-th component. Although the GMM Eq. (7) is much simpler (has fewer components/parameters) than a kernel density estimator Eq. (6), it still does not induce convex constraints for Eq. (5c).
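The GMM density can be sketched in a few lines of plain Python. To keep the example free of linear algebra routines, it assumes diagonal covariance matrices, passed as per-feature variances; this restriction and the function names are assumptions of the sketch, not part of the paper.

```python
import math
from typing import Sequence


def diag_gaussian_pdf(x: Sequence[float], mean: Sequence[float],
                      variances: Sequence[float]) -> float:
    """Density of a normal distribution with diagonal covariance."""
    log_pdf = 0.0
    for xj, mj, vj in zip(x, mean, variances):
        log_pdf += -0.5 * (math.log(2.0 * math.pi * vj) + (xj - mj) ** 2 / vj)
    return math.exp(log_pdf)


def gmm_density(x: Sequence[float], priors: Sequence[float],
                means: Sequence[Sequence[float]],
                variances: Sequence[Sequence[float]]) -> float:
    """Mixture density: sum_j pi_j * N(x | mu_j, Sigma_j)."""
    return sum(p * diag_gaussian_pdf(x, m, v)
               for p, m, v in zip(priors, means, variances))
```

Because the mixture is a sum of several bumps, its superlevel sets can consist of several disjoint regions, which is why a lower bound on this density is not a convex constraint.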

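Here we propose to approximate the density of a GMM Eq. (7) by a component wise maximum:

$$\hat{p}(x) = \max_{j} \hat{p}_j(x) \tag{8}$$

where

$$\hat{p}_j(x) = \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j) \tag{9}$$
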
By construction, the approximation Eq. (8) is always a lower bound of the true GMM density Eq. (7). More precisely, the following bound holds:


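The inequality constraint of a single component Eq. (9)

$$\hat{p}_j(x) \geq \delta \tag{11}$$

can be rewritten as a convex quadratic constraint:

$$(x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) + c_j \leq \delta' \tag{12}$$

where, with $d$ denoting the dimension of $x$,

$$c_j = -2\log(\pi_j) + d\log(2\pi) + \log\det(\Sigma_j), \qquad \delta' = -2\log(\delta) \tag{13}$$
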
By making use of the approximation Eq. (8), the original constraint Eq. (5c) becomes:

$$\max_{j} \hat{p}_j(x') \geq \delta \tag{14}$$

Although Eq. (14) is still non-convex, we can turn it into a set of convex constraints by observing the following:

Let $x'$ be a solution of Eq. (5) where we substituted Eq. (5c) by Eq. (14). Then it holds that:

$$\exists\, j:\; \hat{p}_j(x') \geq \delta \tag{15}$$

Note that there might exist more than one $j$ for which Eq. (11) holds. Because we do not know for which $j$ Eq. (11) holds, we simply try all possible $j$ and select the counterfactual that yields the smallest value of the objective Eq. (5a) - that is, the one closest to the original input $x$. Note that, depending on the prediction function $h$, it can happen that Eq. (5) is not feasible for every choice of $j$. Because each constraint Eq. (11) can be rewritten as a convex quadratic constraint, the final optimization problem Eq. (5) becomes convex iff the objective Eq. (5a) and the prediction constraint Eq. (5b) are convex. The Manhattan distance as well as the Mahalanobis distance as regularizers $\theta(\cdot)$, together with common ML models - like generalized linear models, linear SVM, LDA, matrix LVQ, decision trees, etc. - yield convex programs [5] that can be solved efficiently [7].
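The "try every component and keep the closest feasible solution" procedure can be sketched for a one-dimensional toy problem, where each convex subproblem has a closed-form solution. The sketch assumes a threshold classifier whose requested class is the region $x' \geq$ `threshold`, a 1-D GMM given as (prior, mean, standard deviation) triples, and the absolute difference as distance; all names are hypothetical. In higher dimensions each subproblem would be handed to a convex solver instead.

```python
import math
from typing import Optional, Sequence, Tuple

Component = Tuple[float, float, float]  # (prior pi_j, mean mu_j, std sigma_j)


def quadratic_constants(pi_j: float, sigma_j: float,
                        delta: float) -> Tuple[float, float]:
    """1-D constants (c_j, delta') such that
    pi_j * N(x | mu_j, sigma_j^2) >= delta
    <=>  (x - mu_j)^2 / sigma_j^2 + c_j <= delta'."""
    c_j = (-2.0 * math.log(pi_j) + math.log(2.0 * math.pi)
           + 2.0 * math.log(sigma_j))
    delta_prime = -2.0 * math.log(delta)
    return c_j, delta_prime


def plausible_counterfactual_1d(x: float, threshold: float,
                                components: Sequence[Component],
                                delta: float) -> Optional[float]:
    """Closest x' with x' >= threshold whose approximated density is >= delta.

    Returns None if no component yields a feasible subproblem.
    """
    best = None
    for pi_j, mu_j, sigma_j in components:
        c_j, delta_prime = quadratic_constants(pi_j, sigma_j, delta)
        slack = delta_prime - c_j
        if slack < 0.0:
            continue  # this component never reaches density delta
        radius = sigma_j * math.sqrt(slack)
        lower = max(mu_j - radius, threshold)  # add the prediction constraint
        upper = mu_j + radius
        if lower > upper:
            continue  # subproblem for this component is infeasible
        candidate = min(max(x, lower), upper)  # projection onto [lower, upper]
        if best is None or abs(candidate - x) < abs(best - x):
            best = candidate
    return best
```

Each component contributes one interval in this setting; in $d$ dimensions the same constraint describes an ellipsoid around the component mean.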

3.2 A Formal Approach

We aim for a formal description of plausible counterfactuals as modelled in Eq. (5).

We assume a classification setting with an underlying generating process where the measurable set denotes the data domain, the discrete and finite set denotes the set of possible labels and denotes the joint density - we assume that is closed for all . Furthermore, let be a distance metric on . Following Eq. (5), we propose to define a plausible counterfactual according to Definition 2.

Definition 2 ($\delta$-plausible counterfactual)

Let $h$ be a classifier. We call a counterfactual explanation $x'$ of a particular sample $x$ $\delta$-plausible iff the following holds:


where $\delta$ denotes a minimum density at which we consider a sample plausible. Note that we state the definition of a $\delta$-plausible counterfactual as an optimization problem Eq. (16), which makes the definition particularly appealing from a practical perspective.

Next, in Theorem 3.1 we state under what conditions $\delta$-plausible counterfactuals do not depend on the classifier.

Theorem 3.1 (Model free $\delta$-plausible counterfactuals under zero risk classifiers)

Let be the set of all classifiers that have zero risk on the generating process - that is: . Then the following holds :


Note that Theorem 3.1 states that in the case of perfect classifiers, $\delta$-plausible counterfactuals become independent of the specific classifier - thus we can compute the $\delta$-plausible counterfactuals solely in the data domain without taking the classifiers into account.

However, in practice we usually do not have a perfect classifier, because either the class wise densities overlap or the classifier itself cannot model a zero risk decision boundary. Therefore, we state a weaker version of Theorem 3.1 in Theorem 3.2, in which we assume that a classifier $h$ is locally $\delta$-sufficient perfect at a sample $x$ (Definition 3) - that is: the classifier classifies the sample $x$ as $y$, which is consistent with the ground truth induced by the generating process, and the decision boundary does not "cut too deep" into the closest parts of high density regions of the other classes.

Definition 3 (Locally $\delta$-sufficient perfect classifier)

Let be a classifier and denote the set of all that have a class dependent density of at least by - that is: . We call locally -sufficient perfect at a sample iff the following holds:

Theorem 3.2 (Model free $\delta$-plausible counterfactual under locally $\delta$-sufficient perfect classifiers)

Let be the set of locally -sufficient perfect classifiers (Definition 3) at a sample . Then the following holds :


Note that Theorem 3.2 states that for a set of classifiers that are locally $\delta$-sufficient perfect at a sample $x$ (Definition 3), the $\delta$-plausible counterfactuals of this particular sample are exactly the same for all classifiers in this set. Because we only assume local $\delta$-sufficient perfectness of the classifier, Theorem 3.2 is very appealing in practice, when we actually have to compute a counterfactual explanation of a particular sample under a particular model - the theorem tells us when we can drop the classification constraint and thus simplify the optimization problem Eq. (16).

In practice, when the true density (or a density estimate) is not available, one could try to check for local $\delta$-sufficient perfectness at a given sample $x$ by checking whether the "closest" training samples (incl. samples from different classes) around $x$ are classified correctly.
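The neighbourhood check suggested above could look roughly as follows. This is only a heuristic sketch: the Euclidean distance, the number of neighbours `k`, and the function name are assumptions made for the illustration, and passing the check does not prove that Definition 3 holds.

```python
from typing import Callable, Hashable, Sequence


def neighbours_classified_correctly(
        x: Sequence[float],
        train_x: Sequence[Sequence[float]],
        train_y: Sequence[Hashable],
        predict: Callable[[Sequence[float]], Hashable],
        k: int = 5) -> bool:
    """Return True iff the k training samples closest to x (Euclidean
    distance, all classes included) are classified correctly by `predict`."""
    def sq_dist(i: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, train_x[i]))

    nearest = sorted(range(len(train_x)), key=sq_dist)[:k]
    return all(predict(train_x[i]) == train_y[i] for i in nearest)
```

If the check fails, the classification constraint should be kept in the optimization problem rather than dropped.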

4 Experiments

We perform experiments on several data sets to empirically evaluate our proposed density constraints Eq. (14) (our source code is available on GitHub). We use the "Breast Cancer Wisconsin (Diagnostic) Data Set" [33], the "Iris Plants Data Set" [10], the "Wine Data Set" [27], the "Boston Housing Data Set" [1] (turned into a binary classification problem whose target indicates whether the price is greater than or equal to 20k$) and the "Optical Recognition of Handwritten Digits Data Set" [3]. We repeat the following procedure in a 4-fold cross validation: First, we fit class dependent kernel density estimators (with a Gaussian kernel) and a GMM to the training data set, using a 5-fold cross validation grid search for hyperparameter tuning. Next, we fit a classifier (either a softmax regression or a decision tree) to the training data set (an implementation of the experiments including other models like LDA, linear SVM, matrix LVQ, etc. is available online). After this, for each sample in the test set, we compute two counterfactuals (both with the same but random target class): one counterfactual without any additional density/plausibility constraints and another counterfactual with our proposed density constraint Eq. (12). We set the density threshold $\delta$ from Eq. (11) to the median density Eq. (8) of the training samples under the approximated GMM of the target class $y'$. To enforce sparsity, both counterfactuals are computed under the Manhattan distance as the regularizer $\theta(\cdot)$. Finally, we compute the Manhattan distance to the original sample and the log-density of both counterfactuals under the kernel density estimator. We use the kernel density estimator instead of the GMM because our proposed density constraint is an approximation of the GMM, which itself can be interpreted as an approximation of the kernel density estimator.
In order to increase the accuracy of the classifiers and density estimators, we apply a PCA to the breast cancer data set (5 components), the house prices data set (10 components), the wine data set (8 components) and the digits data set (40 components). Since the PCA transformation is affine, it can be easily integrated into our convex programs, so that we can still compute counterfactuals in the original space.
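The choice of the density threshold described above, the median of the approximated density over the training samples of the target class, can be sketched as follows. The example is restricted to one-dimensional data and a GMM given by lists of priors, means and standard deviations; these simplifications and the function names are assumptions of the sketch and do not reproduce the experimental code.

```python
import math
import statistics
from typing import Sequence


def approx_gmm_density_1d(x: float, priors: Sequence[float],
                          means: Sequence[float],
                          stds: Sequence[float]) -> float:
    """Component wise maximum max_j pi_j * N(x | mu_j, sigma_j^2)."""
    return max(
        p * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for p, m, s in zip(priors, means, stds)
    )


def median_density_threshold(train_samples: Sequence[float],
                             priors: Sequence[float],
                             means: Sequence[float],
                             stds: Sequence[float]) -> float:
    """Median approximated density of the target-class training samples."""
    return statistics.median(
        approx_gmm_density_1d(x, priors, means, stds) for x in train_samples
    )
```

With this choice, roughly half of the training samples of the target class satisfy the density constraint themselves.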

The results of the experiments are listed in Table 1.

                 Without density constraints    With density constraints
Data set         Density      Distance          Density      Distance

Iris             -34.55       1.80              -0.75        4.06
Digits           -164.03      36.74             -112.40      110.10
Wine             -82.31       5.19              -37.58       49.59
Breast cancer    -46.52       33.26             -27.0        81.47
House prices     39.51        5.0               -38.12       9.54

Iris             -40.55       1.19              -0.73        4.06
Digits           -170.25      36.69             -110.48      114.78
Wine             -102.44      3.92              -34.38       66.92
Breast cancer    -43.44       0.01              -25.55       22.27
House prices     -40.49       0.01              -37.84       14.92

Table 1: Median log-density (under the KDE) and median Manhattan distance to the original sample of the computed counterfactuals - with vs. without density constraints. The two blocks of rows correspond to the two classifiers. Best values are highlighted - larger densities and smaller distances are better.

We observe that our proposed density constraint consistently yields counterfactuals that have a higher density than the counterfactuals without any additional density/plausibility constraints, whereby we only observe a minor increase in computation time per sample (on the order of milliseconds). However, the distance to the original sample is much higher for the "more plausible" counterfactuals than for arbitrary (e.g. closest) counterfactuals. This seems reasonable because one would expect that samples from a different class look quite different. In addition, we observe that the distances of the counterfactuals to the original samples on the Iris data set and the Digits data set are more or less the same for both models, whereas the opposite is true for the wine, breast cancer and house prices data sets. This observation can be explained by the hypothesis that in the case of the Iris and digits data sets, both models learned a locally $\delta$-sufficient perfect classifier (Definition 3) at most samples - Theorem 3.2 then states that the counterfactuals are model independent, which explains the observed numbers. Conversely, this suggests that the two classifiers learned on the other three data sets are quite different in the sense that they are not both locally $\delta$-sufficient perfect classifiers (Definition 3) at most samples - hence, the distances of the counterfactuals to the original samples are quite different.

Furthermore, Fig. 1 shows some samples from the digit data set and compares the counterfactuals generated with and without density constraints for both models. Most of the samples in the second block - counterfactuals without any density/plausibility constraints - look like adversarial examples in the sense that the original label can still be recognized but the requested label cannot be inferred. However, most of the samples in the third block - counterfactuals that have been computed with our proposed density constraint - look like samples from the requested target class. This suggests that our method in fact yields plausible counterfactuals. We also observe that the two models yield different counterfactuals in the second block but more or less exactly the same counterfactuals in the third block. As already discussed in the case of the very similar distances in Table 1, this can be explained by assuming that both models are (close to) locally $\delta$-sufficient perfect (Definition 3) at most samples, which is consistent with what Theorem 3.2 predicts. However, please note that a visual inspection of some samples does not replace a proper evaluation by an expert user study and subsequent hypothesis testing.

5 Discussion and Conclusion

In this work, we proposed and studied a formal definition of plausible counterfactual explanations. In this definition we proposed to add density constraints to the optimization problem for computing counterfactual explanations to ensure that the resulting counterfactual is plausible in the given data domain. For practical purposes, we proposed convex approximations of a Gaussian mixture model to get tractable density constraints. These constraints give rise to convex optimization problems for computing plausible counterfactual explanations of many common models like linear models and decision trees. In addition, these constraints allow specifying a lower bound on the density of the resulting counterfactual that is guaranteed to be fulfilled. Finally, we empirically evaluated our proposed methods on several data sets and observed that our method consistently yields counterfactual explanations that are located in high density regions. A visual inspection of samples from the digits data set suggests that our method in fact yields plausible counterfactuals.

As future work, we plan to conduct a proper user study where humans judge the plausibility of generated counterfactual explanations - counterfactuals generated with and without density constraints. Furthermore, we want to explore density estimators for high dimensional data so that our method can be used for high dimensional data, too. We also plan to investigate how to add density constraints for computing counterfactual explanations of more complex models - in particular non-linear models (e.g. deep neural networks). Lastly, our source code will be released as part of our open-source toolbox CEML [4], a Python toolbox for computing counterfactual explanations of ML models, so that our proposed method can be easily used by practitioners.

[Figure 1 (images not reproduced). Rows of five digit images each: original samples (labels 6, 9, 4, 5, 7); closest counterfactuals under a softmax regression model; closest counterfactuals under a decision tree model; closest plausible counterfactuals under a softmax regression model; closest plausible counterfactuals under a decision tree model. Every counterfactual row shows the requested labels 3, 0, 6, 7, 1.]

Figure 1: Samples from the digit data set. First block: Original samples. Second block: Counterfactuals generated without any density/plausibility constraints. Third block: Counterfactuals generated with our proposed density constraint. The corresponding labels are shown below each image - note that the shown labels of the counterfactuals are the requested labels.

Appendix 0.A Proofs and Derivations

  1. Proof (Theorem 3.1)

    For a given generating process , zero risk classifiers exist iff the class-dependent densities are non-overlapping:


    Therefore, for a zero risk classifier it holds that:


    It follows that:


    Thus, the constraint in Eq. (16) becomes redundant - the counterfactuals of zero risk classifiers do not depend on these classifiers. ∎

  2. Proof (Theorem 3.2)

    For a locally $\delta$-sufficient perfect classifier $h$ (Definition 3) at $x$, it holds that:


    where $x'$ is a $\delta$-plausible counterfactual (Definition 2) of $x$. It follows that:


    Thus, the constraint in Eq. (16) becomes redundant - the counterfactuals of a sample $x$ under classifiers that are locally $\delta$-sufficient perfect at $x$ do not depend on these classifiers. ∎

  3. Proof (Bound in Eq. (10))

    It holds that:


    Therefore, it follows that:


    which proves the lower bound in Eq. (10).

    It holds that:


    Because of Eq. (25) and Eq. (27), it follows that:


    which proves the upper bound in Eq. (10). ∎

  4. Eq. (11) can be rewritten as the convex quadratic constraint Eq. (12):
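    The derivation itself is not present in this copy of the text. A sketch of how it can be carried out is given here, assuming that the single-component density is $\hat{p}_j(x) = \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)$ for $x \in \mathbb{R}^d$ with a positive definite covariance $\Sigma_j$ and a threshold $\delta > 0$. The shorthand symbols $c_j$ and $\delta'$ are introduced for this sketch; taking the logarithm of both sides and multiplying by $-2$ (which flips the inequality) gives:

```latex
\begin{aligned}
\pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j) \geq \delta
&\iff \log \pi_j - \tfrac{d}{2}\log(2\pi) - \tfrac{1}{2}\log\det(\Sigma_j)
      - \tfrac{1}{2}(x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) \geq \log \delta \\
&\iff (x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j)
      - 2\log \pi_j + d\log(2\pi) + \log\det(\Sigma_j) \leq -2\log \delta \\
&\iff (x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j) + c_j \leq \delta',
\end{aligned}
\qquad
\begin{aligned}
c_j &= -2\log \pi_j + d\log(2\pi) + \log\det(\Sigma_j), \\
\delta' &= -2\log \delta .
\end{aligned}
```

    Since $\Sigma_j^{-1}$ is positive definite, the left-hand side is a convex quadratic function of $x$, so the constraint describes a convex set (an ellipsoid centred at $\mu_j$, or the empty set if $\delta' < c_j$).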