On the computation of counterfactual explanations – A survey

11/15/2019 · by André Artelt, et al. · Bielefeld University

Due to the increasing use of machine learning in practice, it becomes more and more important to be able to explain the predictions and behavior of machine learning models. One instance of explanations are counterfactual explanations, which provide intuitive and useful explanations of machine learning models. In this survey we review model-specific methods for efficiently computing counterfactual explanations of many different machine learning models and propose methods for models that have not been considered in the literature so far.


1 Introduction

Due to recent advances in machine learning (ML), ML methods are increasingly used in real-world scenarios [1, 2, 3, 4]. In particular, ML technology is nowadays used in critical situations like predictive policing [5] and loan approval [6]. In order to increase trust in and acceptance of this kind of technology, it is important to be able to explain the behaviour and predictions of these models [7] - in particular, to answer questions like “Why did the model do that? And why not something else?”. This becomes even more important in view of legal regulations like the EU's GDPR [8], which grants the user a right to an explanation.

A popular method for explaining models [7, 9, 10, 11] is the counterfactual explanation (often just called a counterfactual) [12]. A counterfactual explanation states changes to some features that lead to a different (specified) behaviour or prediction of the model. Thus, a counterfactual explanation can be interpreted as a recommendation of what to do in order to achieve a requested goal. This is why counterfactual explanations are so popular - they are intuitive and user-friendly [7, 12].

Counterfactual explanations are an instance of model-agnostic methods. Therefore, counterfactuals are not tailored to a particular model but can (in theory) be computed for any model. Other instances of model-agnostic methods are feature interaction methods [13], feature importance methods [14], partial dependency plots [15] and local methods that approximate the model locally by an explainable model (e.g. a decision tree) [16, 17]. The appeal of model-agnostic methods is that they (in theory) do not need access to model internals and/or training data - it is sufficient to have an interface where we can pass data points to the model and observe its outputs/predictions.

However, it turns out that efficiently computing high-quality counterfactual explanations of black-box models can be very difficult [18]. Therefore, it is beneficial to develop model-specific methods - which use model internals - for efficiently computing counterfactual explanations. Whenever we have access to the model internals, we can prefer a model-specific method over a model-agnostic one for efficiently computing counterfactual explanations. In this work we focus on such model-specific methods.

In particular, our contributions are:

  • We review model-specific methods for efficiently computing counterfactual explanations of different ML models.

  • We propose model-specific methods for efficiently computing counterfactual explanations of models that have not been considered in the literature so far.

The remainder of this paper is structured as follows: First, we briefly review counterfactual explanations (section 2). Then, in section 3 we review and propose model-specific methods for computing counterfactual explanations. Section 4 briefly describes our implementation, and section 5 concludes the paper. All derivations and mathematical details can be found in the appendix (section 6).

2 Counterfactual explanations

Counterfactual explanations [12] (often just called counterfactuals) are an instance of example-based explanations [19]. Other instances of example-based explanations [7] are influential instances [20] and prototypes & criticisms [21].

A counterfactual states a change to some features/dimensions of a given input such that the resulting data point (called the counterfactual) has a different (specified) prediction than the original input. Using a counterfactual instance to explain the prediction of the original input is considered fairly intuitive, human-friendly and useful because it tells people what to do in order to achieve a desired outcome [12, 7].

A classical use case of counterfactual explanations is loan application [6, 7]: Imagine you applied for a loan at a bank. Unfortunately, the bank rejects your application. Now, you would like to know why. In particular, you would like to know what would have to be different so that your application would have been accepted. A possible explanation might be that you would have been accepted if you earned $500 more per month and did not have a second credit card.

Although counterfactuals constitute a very intuitive explanation mechanism, a couple of problems exist.

One problem is that there often exists more than one counterfactual - this is called the Rashomon effect [7]. If more than one explanation (counterfactual) is possible, it is not clear which one should be selected.

An alternative to counterfactuals [12] - but very similar in spirit - is the Growing Spheres method [22]. However, this method suffers from the curse of dimensionality because it has to draw samples from the input space, which can become difficult if the input space is high-dimensional.



According to [12], we formally define the finding of a counterfactual as follows: Assume a prediction function h(·) is given. Computing a counterfactual x' ∈ ℝ^d of a given input x ∈ ℝ^d (we restrict ourselves to ℝ^d, but in theory one could use an arbitrary domain) can be interpreted as an optimization problem:

  argmin_{x' ∈ ℝ^d}  ℓ(h(x'), y') + C · θ(x', x)    (1)

where ℓ(·, ·) denotes a loss function that penalizes deviation of the prediction h(x') from the requested prediction y', θ(·, ·) denotes a regularization that penalizes deviations from the original input x, and the hyperparameter C denotes the regularization strength.

Two common regularizations are the weighted Manhattan distance and the generalized L2 distance. The weighted Manhattan distance is defined as

  θ(x', x) = ∑_j α_j · |x_j − x'_j|    (2)

where the α_j denote the feature-wise weights. A popular choice [12] for α_j is the inverse median absolute deviation (MAD) of the j-th feature in the training data set 𝒟:

  α_j = 1 / MAD_j,   with   MAD_j = median_{x ∈ 𝒟}( |x_j − median_{x̃ ∈ 𝒟}(x̃_j)| )    (3)

The weights compensate for the (potentially) different variability of the features. However, because we need access to the training data set 𝒟, this regularization is not a truly model-agnostic method - it is not usable if we only have access to a prediction interface of a black-box model.
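
To make the weighting concrete, the following short numpy sketch (our own illustration, not code from the paper) computes the MAD-based weights of Eq. 3 and evaluates the weighted Manhattan distance of Eq. 2:

import numpy as np

def mad_weights(X_train, eps=1e-8):
    """alpha_j = 1 / MAD_j, computed feature-wise over the training data (Eq. 3)."""
    median = np.median(X_train, axis=0)
    mad = np.median(np.abs(X_train - median), axis=0)
    return 1.0 / (mad + eps)  # eps guards against zero MAD (constant features)

def weighted_manhattan(x_orig, x_cf, alpha):
    """theta(x', x) = sum_j alpha_j * |x_j - x'_j| (Eq. 2)."""
    return np.sum(alpha * np.abs(x_orig - x_cf))

# Usage: X = np.random.randn(100, 5); alpha = mad_weights(X)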

Although counterfactual explanations are a model-agnostic method, the computation of a counterfactual becomes much more efficient when having access to the internals of the model. In this work we assume that we have access to all needed model internals as well as to the training data set - we only need the training data for computing the weights in the weighted Manhattan distance Eq. 2. We do not need access to the training data if we do not use the weighted Manhattan distance or if we use some other method for computing the weights (e.g. setting all weights to 1).

A slightly modified version of Eq. 1 was proposed in [23]. The authors claim that the original formalization in Eq. 1 does not take into account that the counterfactual should lie on the data manifold - the counterfactual should be a plausible data instance. To deal with this issue, the authors propose to add two additional terms to the original objective Eq. 1:

  1. The distance/norm between the counterfactual x' and its reconstruction computed by a pretrained autoencoder.

  2. The distance/norm between the encoding of the counterfactual x' and the mean encoding of the training samples that belong to the requested class y'.

The first term is supposed to make sure that the counterfactual lies on the data manifold and thus is a plausible data instance. The second term is supposed to accelerate the solver for computing the solution of the final optimization problem. Both claims have been evaluated empirically [23].

Recently, another approach for computing plausible/feasible counterfactual explanations was proposed [24]. Instead of computing a single counterfactual, the authors propose to compute a path of intermediate counterfactuals that leads to the final counterfactual. The idea behind this path of intermediate counterfactuals is to provide the user with a set of intermediate goals that finally lead to the desired goal - it might be more feasible to “go into the direction” of the final goal step by step instead of accomplishing it in a single step. In order to compute such a path of intermediate counterfactuals, the authors propose different strategies for constructing a graph on the training data set - including the query point. In this graph, two samples are connected by a weighted edge if they are “sufficiently close to each other” - the authors propose different measures of closeness (e.g. based on density estimation). The path of intermediate counterfactuals is the shortest path between the query point and a point that satisfies the desired goal - this point is the final counterfactual. Therefore, the final counterfactual as well as all intermediate counterfactuals are elements of the training data set.

Despite the highlighted issues [23, 24] with the original formalization Eq. 1, we stick to it and leave further investigations on the computation of feasible & plausible counterfactuals to future research. However, many of the approaches for computing counterfactuals that are discussed in this paper can be augmented to restrict the space of potential counterfactuals. These restrictions provide an opportunity for encoding domain knowledge that leads to more plausible and feasible counterfactuals.

3 Computation of counterfactuals

In the subsequent sections we explore model-specific methods for efficiently computing counterfactual explanations of many different ML models. But before looking at model-specific methods, we first discuss (section 3.1) methods for dealing with arbitrary types of models - gradient-based as well as gradient-free methods.

Note that for the purpose of better readability and due to space constraints, we put all derivations in the appendix (section 6).

3.1 The general case

We can compute a counterfactual explanation of any model we like by plugging the prediction function of the model into Eq. 1 and choosing a loss function (e.g. the 0-1 loss) and a regularization function (e.g. the Manhattan distance). Depending on the model, loss and regularization function, the resulting optimization problem might be differentiable or not. If it is differentiable, we can use gradient-based methods like (L-)BFGS and conjugate gradients for solving the optimization problem. If Eq. 1 is not differentiable, we can use gradient-free methods like the Downhill-Simplex method or an evolutionary algorithm like CMA-ES or CERTIFAI [25] - the nice thing about evolutionary algorithms is that they can easily deal with categorical features. Another approach, limited to linear classifiers, for handling continuous and discrete features is to use mixed-integer programming (MIP) [26]. Unfortunately, solving a MIP is NP-hard. However, there exist solvers that can compute an approximate solution very efficiently - popular methods are branch-and-bound and branch-and-cut algorithms [27].
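
As an illustration of this general, gradient-free route, the following sketch plugs a fitted scikit-learn classifier into Eq. 1 and minimizes it with the Downhill-Simplex (Nelder-Mead) method from scipy; the toy data, the squared-probability loss and the value of C are arbitrary choices made for the example, not prescriptions from the paper:

import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression

X = np.random.randn(200, 3)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

x_orig = X[0]
y_target = 1 - model.predict(x_orig.reshape(1, -1))[0]  # request the other class
C = 0.5

def objective(x_cf):
    # loss: squared deviation of the predicted probability of the requested class
    # from 1, plus Manhattan regularization towards the original input (cf. Eq. 1)
    p_target = model.predict_proba(x_cf.reshape(1, -1))[0, y_target]
    return (1.0 - p_target) ** 2 + C * np.sum(np.abs(x_cf - x_orig))

res = minimize(objective, x_orig, method="Nelder-Mead")
x_cf = res.x
print("prediction of x_cf:", model.predict(x_cf.reshape(1, -1))[0], "requested:", y_target)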

When developing model-specific methods for computing counterfactuals, we always consider untransformed inputs only - since a non-linear feature transformation usually makes the problem non-convex. Furthermore, we only consider the Euclidean distance and the weighted Manhattan distance as candidates for the regularization function θ(·, ·).

3.2 Separating hyperplane models

A model whose prediction function can be written as

  h(x) = sign(w^T x + b)    (4)

is called a separating hyperplane model. Popular instances of separating hyperplane models are the SVM, LDA, the perceptron and logistic regression.

Without loss of generality, we assume y' ∈ {−1, +1}. Then, the optimization problem for computing a counterfactual explanation Eq. 1 can be rewritten as:

  argmin_{x' ∈ ℝ^d}  θ(x', x)   s.t.   x'^T q + c < 0    (5)

where

  q = −y' · w    (6)
  c = −y' · b    (7)

Depending on the regularization, the optimization problem Eq. 5 becomes either a linear program (LP) - if the weighted Manhattan distance is used - or a convex quadratic program (QP) with linear constraints - if the Euclidean distance is used. More details can be found in the appendix (section 6.2).

If we had some discrete features instead of continuous features only, we would obtain a MIP or MIQP as described in [26].
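
A minimal sketch of the continuous case, assuming a fitted binary linear classifier with weight vector w and bias b (e.g. from scikit-learn) and a requested label y' ∈ {−1, +1}, is the following cvxpy formulation of Eq. 5:

import numpy as np
import cvxpy as cp

def linear_counterfactual(w, b, x_orig, y_target, alpha=None, eps=1e-4, l2=False):
    d = x_orig.shape[0]
    x_cf = cp.Variable(d)
    if l2:
        objective = cp.Minimize(cp.sum_squares(x_cf - x_orig))  # convex QP
    else:
        a = np.ones(d) if alpha is None else alpha               # MAD weights or ones
        objective = cp.Minimize(cp.sum(cp.multiply(a, cp.abs(x_cf - x_orig))))  # LP
    # relaxed strict inequality (cf. appendix, section 6.1)
    constraints = [y_target * (w @ x_cf + b) >= eps]
    cp.Problem(objective, constraints).solve()
    return x_cf.value

# Usage with a fitted sklearn LinearSVC / LogisticRegression clf:
#   x_cf = linear_counterfactual(clf.coef_[0], clf.intercept_[0], x, y_target=+1)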

3.3 Generalized linear model

In a generalized linear model we assume that the distribution of the response variable belongs to the exponential family. The expected value is connected to a linear combination of features by a link function, where different distributions have different link functions.

In the subsequent sections, we explore how to efficiently compute counterfactual explanations of popular instances of the generalized linear model.

3.3.1 Logistic regression

In logistic regression we model the response variable by a Bernoulli distribution. The prediction function of a logistic regression model is given as

  h(x) = 1 if σ(w^T x + b) ≥ t, and h(x) = 0 otherwise    (8)

where t is the discrimination threshold (often t = 0.5) and

  σ(z) = 1 / (1 + exp(−z))    (9)

When ignoring all probabilities and setting t = 0.5, the prediction function of a logistic regression model becomes a separating hyperplane:

  h(x) = sign(w^T x + b)    (10)

Therefore, computing a counterfactual of a logistic regression model is exactly the same as for a separating hyperplane model (section 3.2).

3.3.2 Softmax regression

In softmax regression we model the distribution of the response variable as a generalized Bernoulli (categorical) distribution. The prediction function of a softmax regression model is given as:

  h(x) = argmax_i  exp(w_i^T x + b_i) / ∑_k exp(w_k^T x + b_k)    (11)

In this case, the optimization problem for computing a counterfactual explanation Eq. 1 can be rewritten as:

  argmin_{x' ∈ ℝ^d}  θ(x', x)   s.t.   x'^T q_j + c_j < 0   ∀ j ≠ y'    (12)

where

  q_j = w_j − w_{y'}    (13)
  c_j = b_j − b_{y'}    (14)

Depending on the regularization, the optimization problem Eq. 12 becomes either a LP - if the weighted Manhattan distance is used - or a convex QP with linear constraints - if the Euclidean distance is used. More information can be found in the appendix (section 6.3.1).
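
A possible cvxpy implementation of Eq. 12, assuming a fitted multinomial LogisticRegression clf from scikit-learn and a requested class index i, looks as follows:

import cvxpy as cp

def softmax_counterfactual(clf, x_orig, i, eps=1e-4):
    W, b = clf.coef_, clf.intercept_            # shapes: (K, d) and (K,)
    x_cf = cp.Variable(x_orig.shape[0])
    # class i beats every other class j: (w_j - w_i)^T x' + (b_j - b_i) <= -eps
    constraints = [(W[j] - W[i]) @ x_cf + (b[j] - b[i]) <= -eps
                   for j in range(W.shape[0]) if j != i]
    objective = cp.Minimize(cp.norm1(x_cf - x_orig))  # LP; use sum_squares for a QP
    cp.Problem(objective, constraints).solve()
    return x_cf.value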

3.3.3 Linear regression

In linear regression we model the distribution of the response variable as a Gaussian distribution. The prediction function of a linear regression model is given as:

  h(x) = w^T x + b    (15)

The optimization problem for computing a counterfactual explanation Eq. 1 can be rewritten as:

  argmin_{x' ∈ ℝ^d}  θ(x', x)   s.t.   |z(x') − y'| ≤ ε    (16)

where

  z(x') = w^T x' + b    (17)

and ε ≥ 0 denotes the tolerated deviation from the requested prediction y'.

Depending on the regularization, the optimization problem Eq. 16 becomes either a LP (if the weighted Manhattan distance is used) or a convex QP with linear constraints (if the Euclidean distance is used). More information can be found in the appendix (section 6.3.2).
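
For illustration, Eq. 16 can be written down almost verbatim in cvxpy; the sketch below assumes a fitted scikit-learn LinearRegression reg and a tolerance epsilon:

import cvxpy as cp

def linreg_counterfactual(reg, x_orig, y_target, epsilon=0.0):
    w, b = reg.coef_, reg.intercept_
    x_cf = cp.Variable(x_orig.shape[0])
    # |w^T x' + b - y'| <= epsilon as two linear inequalities (cf. section 6.3.2)
    constraints = [w @ x_cf + b <= y_target + epsilon,
                   w @ x_cf + b >= y_target - epsilon]
    objective = cp.Minimize(cp.sum_squares(x_cf - x_orig))  # convex QP
    cp.Problem(objective, constraints).solve()
    return x_cf.value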

3.3.4 Poisson regression

In Poisson regression we model the distribution of the response variable as a Poisson distribution. The prediction function of a Poisson regression model is given as:

  h(x) = exp(w^T x + b)    (18)

In this case, the optimization problem for computing a counterfactual explanation Eq. 1 can be rewritten as:

  argmin_{x' ∈ ℝ^d}  θ(x', x)   s.t.   log(y' − ε) ≤ z(x') ≤ log(y' + ε)    (19)

where

  z(x') = w^T x' + b    (20)

and ε ≥ 0 (with ε < y') denotes the tolerated deviation from the requested prediction y' > 0.

Depending on the regularization, the optimization problem Eq. 19 becomes either a LP (if the weighted Manhattan distance is used) or a convex QP with linear constraints (if the Euclidean distance is used). More information can be found in the appendix (section 6.3.3).
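
The log transform that makes the constraints linear (cf. the appendix, section 6.3.3) can be implemented directly; the sketch below assumes a fitted scikit-learn PoissonRegressor reg and a requested value y' > 0:

import numpy as np
import cvxpy as cp

def poisson_counterfactual(reg, x_orig, y_target, epsilon=0.1):
    w, b = reg.coef_, reg.intercept_
    x_cf = cp.Variable(x_orig.shape[0])
    # log-transformed tolerance band around the requested prediction (Eq. 19)
    constraints = [w @ x_cf + b <= np.log(y_target + epsilon),
                   w @ x_cf + b >= np.log(max(y_target - epsilon, 1e-12))]
    objective = cp.Minimize(cp.norm1(x_cf - x_orig))  # LP; sum_squares gives a QP
    cp.Problem(objective, constraints).solve()
    return x_cf.value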

3.3.5 Exponential regression

In exponential regression we model the distribution of the response variable as an exponential distribution. The prediction function of an exponential regression model is given as:

  h(x) = exp(w^T x + b)    (21)

Then, the optimization problem for computing a counterfactual explanation Eq. 1 can be rewritten as:

  argmin_{x' ∈ ℝ^d}  θ(x', x)   s.t.   log(y' − ε) ≤ z(x') ≤ log(y' + ε)    (22)

where

  z(x') = w^T x' + b    (23)

and ε ≥ 0 (with ε < y') denotes the tolerated deviation from the requested prediction y' > 0.

Depending on the regularization, the optimization problem Eq. 22 becomes either a LP (if the weighted Manhattan distance is used) or a convex QP with linear constraints (if the Euclidean distance is used). More information can be found in the appendix (section 6.3.4).

3.4 Gaussian naive Bayes

The Gaussian naive Bayes model makes the assumption that, given the class, all features are independent of each other and normally distributed. The prediction function of a Gaussian naive Bayes model is given as:

  h(x) = argmax_i  π_i · ∏_k 𝒩(x_k | μ_{ik}, σ²_{ik})    (24)

where π_i denotes the a-priori probability of the i-th class and μ_{ik}, σ²_{ik} denote the mean and variance of the k-th feature under the i-th class.

The optimization problem for computing a counterfactual explanation Eq. 1 can be rewritten as:

  argmin_{x' ∈ ℝ^d}  θ(x', x)   s.t.   x'^T A_j x' + x'^T b_j + c_j < 0   ∀ j ≠ y'    (25)

where

  A_j = Ã_j − Ã_{y'}    (26)
  b_j = b̃_j − b̃_{y'}    (27)
  c_j = c̃_j − c̃_{y'}    (28)

and Ã_i, b̃_i, c̃_i denote the quadratic, linear and constant parts of the class-conditional log-density of the i-th class (see the appendix, Eqs. 71-73).

Because we cannot make any statement about the definiteness of A_j, the quadratic constraints in Eq. 25 are non-convex. Therefore, the optimization problem Eq. 25 is a non-convex quadratically constrained quadratic program (QCQP).

We can approximately solve Eq. 25 by using an approximation method like the Suggest-Improve framework [28]. Furthermore, if we have a binary classification problem, we can solve a semi-definite program (SDP) whose solution is equivalent to Eq. 25. More details can be found in the appendix (sections 6.4, 6.9.1 and 6.9.2).
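
The following rough sketch is not the Suggest-Improve or SDP solver from the appendix; it merely illustrates one simple way to approximately solve the non-convex QCQP of Eq. 25 with scipy's SLSQP solver, starting from the mean of the requested class and assuming a fitted scikit-learn GaussianNB (attribute names as in recent scikit-learn versions):

import numpy as np
from scipy.optimize import minimize

def gnb_counterfactual(gnb, x_orig, i, eps=1e-3):
    mu, var, prior = gnb.theta_, gnb.var_, gnb.class_prior_

    def score(x, k):
        # log p(y=k) + log p(x | y=k) under the Gaussian naive Bayes model
        return (np.log(prior[k]) - 0.5 * np.sum(np.log(2 * np.pi * var[k]))
                - 0.5 * np.sum((x - mu[k]) ** 2 / var[k]))

    # requested class i must beat every other class j (Eq. 74 in the appendix)
    cons = [{"type": "ineq", "fun": lambda x, j=j: score(x, i) - score(x, j) - eps}
            for j in range(mu.shape[0]) if j != i]
    res = minimize(lambda x: np.sum((x - x_orig) ** 2),  # Euclidean regularization
                   x0=mu[i], constraints=cons, method="SLSQP")
    return res.x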

3.5 Quadratic discriminant analysis

In quadratic discriminant analysis (QDA) we model each class-conditional distribution as a separate Gaussian distribution - note that, in contrast to LDA, each class distribution has its own covariance matrix. The prediction function of a QDA model is given as:

  h(x) = argmax_i  π_i · 𝒩(x | μ_i, Σ_i)    (29)

where π_i denotes the a-priori probability of the i-th class and μ_i, Σ_i denote the class-specific mean and covariance matrix.

In this case, the optimization problem for computing a counterfactual explanation Eq. 1 can be rewritten as:

  argmin_{x' ∈ ℝ^d}  θ(x', x)   s.t.   x'^T A_j x' + x'^T b_j + c_j < 0   ∀ j ≠ y'    (30)

where

  A_j = Ã_j − Ã_{y'}    (31)
  b_j = b̃_j − b̃_{y'}    (32)
  c_j = c̃_j − c̃_{y'}    (33)

and Ã_i = −½ Σ_i^{-1}, b̃_i = Σ_i^{-1} μ_i and c̃_i collect the quadratic, linear and constant terms of the class-conditional log-density of the i-th class (see the appendix, section 6.5).

Because we cannot make any statement about the definiteness of A_j, the quadratic constraints in Eq. 30 are non-convex. Thus, as for Gaussian naive Bayes (section 3.4), the optimization problem Eq. 30 is a non-convex QCQP.

Like in the case of the previous non-convex QCQP, we can approximately solve Eq. 30 by using an approximation method. Furthermore, if we have a binary classification problem, we can solve a SDP whose solution is equivalent to Eq. 30. More details can be found in the appendix (sections 6.5, 6.9.1 and 6.9.2).

3.6 Learning vector quantization models

Learning vector quantization (LVQ) models [29] compute a set of labeled prototypes from a given training data set - we refer to the i-th prototype as p_i and to the corresponding label as o_i. The prediction function of a LVQ model is given as:

  h(x) = o_i   with   i = argmin_j d(x, p_j)    (34)

where d(·, ·) denotes a function for computing the distance between a data point and a prototype - usually this is the squared Euclidean distance:

  d(x, p) = (x − p)^T (x − p)    (35)

There exist LVQ models like (L)GMLVQ [30] and (L)MRSLVQ [31] that learn a custom (class- or prototype-specific) distance matrix Ω that is used instead of the identity when computing the distance between a data point and a prototype. This gives rise to the generalized L2 distance:

  d_Ω(x, p) = (x − p)^T Ω (x − p)    (36)

Because a LVQ model assigns the label of the nearest prototype to a given input, the nearest prototype of a counterfactual x' must be a prototype p_i with o_i = y'. According to [18], for computing a counterfactual it is sufficient to solve the following optimization problem for each prototype p_i with o_i = y' and to select the counterfactual yielding the smallest value of θ(x', x):

  argmin_{x' ∈ ℝ^d}  θ(x', x)   s.t.   d(x', p_i) ≤ d(x', p_j)   ∀ p_j ∈ P(ℓ ≠ y')    (37)

where P(ℓ ≠ y') denotes the set of all prototypes not labeled as y'. Note that the feasible region of Eq. 37 is always non-empty - the prototype p_i itself is always a feasible solution.

In the subsequent sections we explore the type of constraints of Eq. 37 for different LVQ models.

3.6.1 (Generalized matrix) LVQ

In case of a (generalized matrix) LVQ model - where all prototypes use the same distance matrix Ω (for plain LVQ, Ω is the identity) - the optimization problem Eq. 37 becomes [18]:

  argmin_{x' ∈ ℝ^d}  θ(x', x)   s.t.   x'^T q_{ij} + c_{ij} ≤ 0   ∀ p_j ∈ P(ℓ ≠ y')    (38)

where

  q_{ij} = 2 Ω (p_j − p_i)    (39)
  c_{ij} = p_i^T Ω p_i − p_j^T Ω p_j    (40)

Depending on the regularization, the optimization problem Eq. 38 becomes either a LP (if the weighted Manhattan distance is used) or a convex QP with linear constraints (if the Euclidean distance is used). More information can be found in the appendix (section 6.6).
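
A possible cvxpy sketch of Eqs. 37 and 38 for a shared distance matrix Ω (the identity for plain LVQ) is given below; the prototype matrix, the label vector and Omega are placeholders for whatever the trained LVQ model provides:

import numpy as np
import cvxpy as cp

def lvq_counterfactual(prototypes, labels, Omega, x_orig, y_target, eps=1e-4):
    best, best_cost = None, np.inf
    others = prototypes[labels != y_target]
    for p_i in prototypes[labels == y_target]:   # one subproblem per target prototype
        x_cf = cp.Variable(x_orig.shape[0])
        # for a shared Omega the quadratic terms cancel: one linear constraint per p_j
        constraints = [2 * (p_j - p_i) @ Omega @ x_cf
                       <= p_j @ Omega @ p_j - p_i @ Omega @ p_i - eps
                       for p_j in others]
        prob = cp.Problem(cp.Minimize(cp.norm1(x_cf - x_orig)), constraints)
        prob.solve()
        # p_i itself is always feasible, so each subproblem has a solution
        if x_cf.value is not None and prob.value < best_cost:
            best, best_cost = x_cf.value, prob.value
    return best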

3.6.2 (Localized generalized matrix) LVQ

In case of a (localized generalized matrix) LVQ model - where there are different, class- or prototype-specific, distance matrices Ω_i - the optimization problem Eq. 37 becomes [18]:

  argmin_{x' ∈ ℝ^d}  θ(x', x)   s.t.   x'^T A_{ij} x' + x'^T b_{ij} + c_{ij} ≤ 0   ∀ p_j ∈ P(ℓ ≠ y')    (41)

where

  A_{ij} = Ω_i − Ω_j    (42)
  b_{ij} = 2 (Ω_j p_j − Ω_i p_i)    (43)
  c_{ij} = p_i^T Ω_i p_i − p_j^T Ω_j p_j    (44)

Because we cannot make any statement about the definiteness of A_{ij}, the quadratic constraints in Eq. 41 are non-convex. Thus, as for Gaussian naive Bayes (section 3.4) and QDA (section 3.5), the optimization problem Eq. 41 is a non-convex QCQP.

Like the previous non-convex QCQPs, we can approximately solve Eq. 41 by using an approximation method. Furthermore, if we have a binary classification problem and each class is represented by a single prototype, we can solve a SDP whose solution is equivalent to Eq. 41. More details can be found in the appendix (sections 6.6, 6.9.1 and 6.9.2).

3.7 Tree based models

Tree based models are very popular in data science because they often achieve a high predictive accuracy [32]. In the subsequent sections we discuss how to compute counterfactual explanations of tree based models. In particular, we consider decision/regression trees and tree based ensembles like random forest models.

3.7.1 Decision trees

In case of decision/regression tree models, we can compute a counterfactual by enumerating all possible paths that lead to the requested prediction [17, 33]. However, it might happen that some requested predictions are not possible because all possible predictions of the tree are encoded in its leaves. In this case, one might define an interval of acceptable predictions so that a counterfactual exists.

The procedure for computing a counterfactual of a decision/regression tree is described in Algorithm 1.

Input: Original input x, requested prediction y' of the counterfactual, the tree model
Output: Counterfactual x'

1: Enumerate all leaves with prediction y'
2: For each leaf, enumerate all paths reaching the leaf
3: For each path, compute the minimal change to x that yields the path
4: Sort all paths according to the regularization of the change to x
5: Select the path and the corresponding change to x that minimizes the regularization
Algorithm 1 Computing a counterfactual of a decision/regression tree
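
A compact sketch of Algorithm 1 for a fitted scikit-learn DecisionTreeClassifier is given below; it enumerates the leaves with the requested prediction, turns each root-to-leaf path into per-feature intervals, and keeps the admissible point closest to the original input in the weighted Manhattan distance:

import numpy as np

def tree_counterfactual(tree, x_orig, y_target, alpha=None, delta=1e-6):
    t = tree.tree_
    alpha = np.ones(len(x_orig)) if alpha is None else alpha
    best, best_cost = None, np.inf

    def recurse(node, bounds):  # bounds: feature index -> (low, high)
        nonlocal best, best_cost
        if t.children_left[node] == -1:                      # reached a leaf
            leaf_label = tree.classes_[int(np.argmax(t.value[node][0]))]
            if leaf_label != y_target:
                return
            x_cf = x_orig.copy()
            for f, (lo, hi) in bounds.items():               # project into the box
                x_cf[f] = np.clip(x_orig[f], lo + delta, hi)
            cost = np.sum(alpha * np.abs(x_cf - x_orig))
            if cost < best_cost:
                best, best_cost = x_cf, cost
            return
        f, thr = t.feature[node], t.threshold[node]
        lo, hi = bounds.get(f, (-np.inf, np.inf))
        recurse(t.children_left[node], {**bounds, f: (lo, min(hi, thr))})   # x[f] <= thr
        recurse(t.children_right[node], {**bounds, f: (max(lo, thr), hi)})  # x[f] > thr

    recurse(0, {})
    return best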

3.7.2 Tree based ensembles

Popular instances of tree based ensembles are random forests and gradient boosted regression trees. It turns out that the problem of computing a counterfactual explanation of such models is NP-hard [33].

The following heuristic for computing a counterfactual explanation of a random forest model was proposed in [34]: First, we compute a counterfactual of a single tree from the ensemble. Next, we use this counterfactual as a starting point for minimizing the number of trees that do not output the requested prediction, by using a gradient-free optimization method like the Downhill-Simplex method. The idea behind this approach is that the counterfactual of a tree from the ensemble is close to the decision boundary of the ensemble, so that computing a counterfactual of the ensemble becomes easier. By doing this for all trees in the ensemble, we get many counterfactuals and can select the one that minimizes the regularization the most. This heuristic seems to work well in practice [34].
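
The following rough sketch illustrates this heuristic, reusing the tree_counterfactual() helper from the previous sketch; it assumes class labels are already encoded as 0, ..., K−1 (scikit-learn's sub-estimators predict encoded class indices) and is only meant to convey the idea, not to reproduce the implementation of [34]:

import numpy as np
from scipy.optimize import minimize

def forest_counterfactual(forest, x_orig, y_target, C=0.1):
    def objective(x):
        # number of trees disagreeing with the requested prediction, plus
        # Manhattan regularization towards the original input
        disagree = sum(est.predict(x.reshape(1, -1))[0] != y_target
                       for est in forest.estimators_)
        return disagree + C * np.sum(np.abs(x - x_orig))

    best, best_cost = None, np.inf
    for est in forest.estimators_:
        x0 = tree_counterfactual(est, x_orig, y_target)  # per-tree counterfactual as seed
        if x0 is None:
            continue
        res = minimize(objective, x0, method="Nelder-Mead")
        x_cf = res.x
        if (forest.predict(x_cf.reshape(1, -1))[0] == y_target
                and np.sum(np.abs(x_cf - x_orig)) < best_cost):
            best, best_cost = x_cf, np.sum(np.abs(x_cf - x_orig))
    return best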

Another approach for computing counterfactual explanations of an ensemble of trees was proposed in [33] - although the authors do not call them counterfactuals, they actually compute counterfactuals. Their algorithm works as follows: We iterate over all trees in the ensemble that do not yield the requested prediction and compute all possible counterfactuals of each of these trees (see section 3.7.1). If such a counterfactual turns out to be a counterfactual of the whole ensemble, we store it, so that in the end we can select the counterfactual with the smallest deviation from the original input. However, it cannot be guaranteed that a counterfactual of the ensemble is found, because it might happen that by changing the data point so that it becomes a counterfactual of a particular tree, the predictions of other trees in the ensemble change as well. According to the authors, this algorithm/heuristic works well in practice. Unfortunately, the worst-case complexity is exponential in the number of features and thus it is not suitable for high-dimensional data.

4 Implementation

The gradient-based and gradient-free methods, as well as the model-specific methods for tree based models, are already implemented in CEML [34]. The implementation of the LVQ-specific methods is provided by the authors of [18]. The Python [35] implementation of our proposed methods is available on GitHub (https://github.com/andreArtelt/OnTheComputationOfCounterfactualExplanations) and is based on the Python packages scikit-learn [36], numpy [37] and cvxpy [38].

We plan to add these model-specific methods to CEML [34] in the near future.

5 Conclusion

In this survey we extensively studied how to compute counterfactual explanations of many different ML models. We reviewed known methods from the literature and proposed methods (mostly LPs and (QC)QPs) for computing counterfactuals of ML models that have not been considered in the literature so far.

6 Appendix

6.1 Relaxing strict inequalities

When modeling the problem of computing counterfactuals, we often obtain strict inequalities like

  a^T x < b    (45)

Strict inequalities are not allowed in convex programming because the feasible region would become an open set. However, we can turn the < into a ≤ by adding a small number ε > 0 to the left side of the inequality:

  a^T x + ε ≤ b    (46)

where ε > 0 is a small number.

In practice, when implementing our methods, we found that we can often safely replace all < by ≤ without changing anything else - presumably because of the numerics (e.g. round-off errors) of fixed-size floating-point numbers.

6.2 Separating hyperplane

Recall that the prediction function is given as:

  h(x) = sign(w^T x + b)    (47)

If we multiply the projection w^T x + b by the requested prediction y' (note that we assume y' ∈ {−1, +1}), the result is positive if and only if the classification is equal to y'. Therefore, the linear constraint for predicting class y' is given as

  x^T q + c < 0    (48)

where

  q = −y' · w    (49)
  c = −y' · b    (50)

6.3 Generalized linear models

6.3.1 Softmax regression

Recall that the prediction function is given as:

  h(x) = argmax_i  exp(w_i^T x + b_i) / ∑_k exp(w_k^T x + b_k)    (51)

Thus, the constraint for obtaining a specific prediction y' is given as:

  w_{y'}^T x + b_{y'} > w_j^T x + b_j   ∀ j ≠ y'    (52)

For each fixed j ≠ y', we can simplify Eq. 52 to

  x^T q_j + c_j < 0    (53)

where

  q_j = w_j − w_{y'}    (54)
  c_j = b_j − b_{y'}    (55)

Therefore, we can rewrite Eq. 52 as a set of linear inequalities.

6.3.2 Linear regression

Recall that the prediction function is given as:

  h(x) = w^T x + b    (56)

By introducing a parameter ε ≥ 0 that specifies the maximum tolerated deviation from the requested prediction y' - we set ε = 0 if we do not allow any deviation - the constraint for obtaining the requested prediction is given as

  |z(x) − y'| ≤ ε    (57)

where

  z(x) = w^T x + b    (58)

Finally, we can rewrite Eq. 57 as two linear inequality constraints:

  w^T x + b ≤ y' + ε   and   −(w^T x + b) ≤ −y' + ε    (59)

6.3.3 Poisson regression

Recall that the prediction function is given as

  h(x) = exp(w^T x + b)    (60)

The constraint for exactly obtaining the requested prediction y' > 0 is

  z(x) = log(y')    (61)

where

  z(x) = w^T x + b    (62)

Finally, we obtain the following set of linear inequality constraints:

  log(y' − ε) ≤ w^T x + b ≤ log(y' + ε)    (63)

where we introduced a parameter ε ≥ 0 (with ε < y') that specifies the maximum tolerated deviation from the requested prediction - we set ε = 0 if we do not allow any deviations.

6.3.4 Exponential regression

Recall that the prediction function is given as:

  h(x) = exp(w^T x + b)    (64)

The constraint for a specific prediction y' > 0 is given as:

  z(x) = log(y')    (65)

where

  z(x) = w^T x + b    (66)

Finally, we obtain the following set of linear inequality constraints:

  log(y' − ε) ≤ w^T x + b ≤ log(y' + ε)    (67)

where we introduced a parameter ε ≥ 0 (with ε < y') that specifies the maximum tolerated deviation from the requested prediction - we set ε = 0 if we do not allow any deviations.

6.4 Gaussian naive Bayes

Recall that the prediction function is given as:

  h(x) = argmax_i  π_i · ∏_k 𝒩(x_k | μ_{ik}, σ²_{ik})    (68)

We note that Eq. 68 is equivalent to

  h(x) = argmax_i  log(π_i) + ∑_k log 𝒩(x_k | μ_{ik}, σ²_{ik})    (69)

Simplifying the i-th term in Eq. 69 yields

  s_i(x) = x^T Ã_i x + x^T b̃_i + c̃_i    (70)

where

  Ã_i = −½ diag(1/σ²_{i1}, …, 1/σ²_{id})    (71)
  b̃_i = (μ_{i1}/σ²_{i1}, …, μ_{id}/σ²_{id})^T    (72)
  c̃_i = log(π_i) − ∑_k ( ½ log(2π σ²_{ik}) + μ²_{ik}/(2σ²_{ik}) )    (73)

For a sample x, in order to be classified as the y'-th class, the following set of strict inequalities must hold:

  s_{y'}(x) > s_j(x)   ∀ j ≠ y'    (74)

By rearranging terms in Eq. 74, we get the final constraints

  x^T A_j x + x^T b_j + c_j < 0   ∀ j ≠ y'    (75)

where

  A_j = Ã_j − Ã_{y'}    (76)
  b_j = b̃_j − b̃_{y'}    (77)
  c_j = c̃_j − c̃_{y'}    (78)

Because we cannot make any statement about the definiteness of the diagonal matrix A_j, the constraint Eq. 75 is a non-convex quadratic inequality constraint.

6.5 Quadratic discriminant analysis

Recall that the prediction function is given as:

  h(x) = argmax_i  π_i · 𝒩(x | μ_i, Σ_i)    (79)

We can rewrite Eq. 79 as

  h(x) = argmax_i  log(π_i) + log 𝒩(x | μ_i, Σ_i)    (80)

Working on the i-th term yields

  s_i(x) = x^T Ã_i x + x^T b̃_i + c̃_i    (81)

where

  Ã_i = −½ Σ_i^{-1}    (82)
  b̃_i = Σ_i^{-1} μ_i    (83)

and c̃_i = log(π_i) − ½ log|Σ_i| − ½ μ_i^T Σ_i^{-1} μ_i − (d/2) log(2π) collects the constant terms.

For a sample x, in order to be classified as the y'-th class, the following set of strict inequalities must hold:

  s_{y'}(x) > s_j(x)   ∀ j ≠ y'    (84)

Rearranging Eq. 84 yields

  x^T A_j x + x^T b_j + c_j < 0   ∀ j ≠ y'    (85)

where

  A_j = Ã_j − Ã_{y'}    (86)
  b_j = b̃_j − b̃_{y'}    (87)
  c_j = c̃_j − c̃_{y'}    (88)

The final constraint Eq. 85 is a non-convex quadratic constraint because we cannot make any statement about the definiteness of A_j.

6.6 Learning vector quantization

Note: The subsequent sections are taken from [18].

6.6.1 Enforcing a specific prototype as the nearest neighbor

By using the following set of inequalities, we can force the prototype p_i to be the nearest neighbor of the counterfactual x' - which would cause x' to be classified as y':

  d(x', p_i) ≤ d(x', p_j)   ∀ p_j ∈ P(ℓ ≠ y')    (89)

We consider a fixed pair of p_i and p_j: