Measurable Counterfactual Local Explanations for Any Classifier

08/08/2019 ∙ by Adam White, et al. ∙ City, University of London

We propose a novel method for explaining the predictions of any classifier. In our approach, local explanations are expected to explain both the outcome of a prediction and how that prediction would change if 'things had been different'. Furthermore, we argue that satisfactory explanations cannot be dissociated from a notion and measure of fidelity, as advocated in the early days of neural networks' knowledge extraction. We introduce a definition of fidelity to the underlying classifier for local explanation models which is based on distances to a target decision boundary. A system called CLEAR: Counterfactual Local Explanations via Regression, is introduced and evaluated. CLEAR generates w-counterfactual explanations that state minimum changes necessary to flip a prediction's classification. CLEAR then builds local regression models, using the w-counterfactuals to measure and improve the fidelity of its regressions. By contrast, the popular LIME method, which also uses regression to generate local explanations, neither measures its own fidelity nor generates counterfactuals. CLEAR's regressions are found to have significantly higher fidelity than LIME's, averaging over 45% higher in this paper's four case studies.




1 Introduction

Machine learning systems are increasingly being used for automated decision making. It is important that these systems’ decisions can be trusted. This is particularly the case in mission critical situations such as medical diagnosis, airport security or high-value financial trading. Yet the inner workings of many machine learning systems seem unavoidably opaque. The number and complexity of their calculations are often simply beyond the capacities of humans to understand. One possible solution is to treat machine learning systems as ‘black-boxes’ and to then explain their input-output behaviour. Such approaches can be divided into two broad types: those providing global explanations of the entire system and those providing local explanations of single predictions. Local explanations are needed when a machine learning system’s decision boundary is too complex to allow for global explanations. This paper focuses on local explanations.

A novel method called Counterfactual Local Explanations viA Regression (CLEAR) is proposed. This is based on a concept of counterfactual explanation from the philosophy of science’s analysis of causality Woodward ; Pearl . Perhaps the most influential account is Woodward’s Woodward . Woodward states that a satisfactory explanation consists in showing patterns of counterfactual dependence. By this he means that it should answer a set of ‘what-if-things-had-been-different?’ questions, which specify how the explanandum (i.e. the phenomenon to be explained) would change if, contrary to the fact, input conditions had been different. It is in this way that a user can understand the relevance of different features, and understand the different ways in which they could change the value of the explanandum. Central to Woodward’s notion is the requirement for an explanatory generalization:

"Suppose that M is an explanandum consisting in the statement that some variable Y takes the particular value y. Then an explanans E for M will consist of (a) a generalization G relating changes in the value(s) of a variable X (where X may itself be a vector or n-tuple of variables) with changes in Y, and (b) a statement (of initial or boundary conditions) that the variable X takes the particular value x."

In Woodward’s analysis, X causes Y. For our purposes, Y can be taken as the machine learning system’s predictions where X are the system’s input features. The required generalization can be a regression equation that captures the machine learning system’s local input-output behaviour.

CLEAR provides counterfactual explanations by building on the strengths of two state-of-the-art explanatory methods, while at the same time addressing their weaknesses. The first is by Wachter et al. Counterfactuals ; Explaining who argue that single predictions are explained by what we shall term ‘w-counterfactuals’, which state the minimum changes needed for an observation to ‘flip’ its classification. The second method is by Ribeiro et al. LIME who argue for Local Interpretable Model-Agnostic Explanations (LIME). These explanations are created by building a regression model that seeks to approximate the local input-output behaviour of the machine learning system.

In isolation, ‘w-counterfactuals’ do not provide explanatory generalizations relating X to Y and therefore are not satisfactory explanations, as we exemplify in the next section. LIME, on the other hand, does not measure the fidelity of its regressions and cannot produce counterfactual explanations.

The contribution of this paper is three-fold. We introduce a novel explanation method capable of:

  • providing counterfactuals that are explained by regression coefficients including interaction terms;

  • evaluating the fidelity of its local explanations to the underlying learning system;

  • using the values of w-counterfactuals to significantly improve the fidelity of its regressions.

When applied to multi-layer perceptrons (MLPs) trained on four datasets, CLEAR improves on the fidelity of LIME by an average of over 45%.

Section 2 provides the background to CLEAR including an analysis of w-counterfactuals and LIME. Section 3 introduces CLEAR and explains how it uses w-counterfactuals to both measure and improve the fidelity of its regressions. Section 4 contains experimental results on four datasets showing that CLEAR’s regressions have significantly higher fidelity than LIME’s. Section 5 concludes the paper and discusses directions for future work.

2 Background

This paper adopts the following notation: let m be a machine learning system mapping $X \rightarrow Y$; m is said to generate prediction y for observation x.

2.1 W-Counterfactual Explanations

Wachter et al.’s w-counterfactuals explain a single prediction by identifying ‘close possible worlds’ in which an individual would receive the prediction they desired. For example, if a banking machine learning system declined Mr Jones’ loan application, a w-counterfactual explanation might be that ‘Mr Jones would have received his loan, if his annual salary had been $35,000 instead of the $32,000 he currently earns’. The $3000 increase would be just sufficient to flip Mr Jones to the desired side of the banking system’s decision boundary.

Wachter et al. note that a counterfactual explanation may involve changes to multiple features. Hence, an additional w-counterfactual explanation for Mr Jones might be that he would also get the loan if his annual salary was $33,000 and he had been employed for more than 5 years. Wachter et al. state that counterfactual explanations have the following form:

"Score p was returned because variables V had values (v1, v2, ...) associated with them. If V instead had values (v1’, v2’, ...), and all other variables remained constant, score p’ would have been returned"

Wachter et al. specify an objective function and optimiser that samples m and searches for w-counterfactuals.

The key problem with w-counterfactuals: w-counterfactual explanations fail to satisfy Woodward’s requirement that a satisfactory explanation of prediction y should state a generalization relating X and Y.

For example, suppose that a machine learning system has assigned Mr Jones a probability of 0.75 for defaulting on a loan. Although stating the changes needed to his salary and years of employment has explanatory value, this falls short of being a satisfactory explanation. A satisfactory explanation also needs to explain:

  1. why Mr Jones was assigned a score of 0.75. This would include identifying the contribution that each feature made to the score.

  2. how the features interact with each other. For example, perhaps the number of years employed is only relevant for individuals with salaries below $34,000.

These requirements could be satisfied by stating an explanatory equation that included interaction terms and indicator variables. At a minimum, the equation’s scope should cover a neighbourhood around x that includes the data points identified by its w-counterfactuals.

2.2 Local Interpretable Model-Agnostic Explanations

Ribeiro et al. LIME propose LIME, which seeks to explain why m predicts y given x by generating a simple linear regression model that approximates m’s input-output behaviour with respect to a small neighbourhood of m’s feature space centered on x. LIME assumes that for such small neighbourhoods m’s input-output behaviour is approximately linear. Ribeiro et al. recognize that there is often a trade-off to be made between local fidelity and interpretability. For example, increasing the number of independent variables in a regression might increase local fidelity but decrease interpretability. LIME is becoming an increasingly popular method, and there are now LIME implementations in multiple packages including Python, R and SAS.

The LIME algorithm: Consider a model m (e.g. a random forest or MLP) whose prediction is to be explained. LIME (1) generates a dataset of synthetic observations; (2) labels the synthetic data by passing it to the model m, which calculates probabilities for each observation belonging to each class. These probabilities are the ground truths that LIME is trying to explain; (3) weights the synthetic observations (in standardised form) using the kernel:

$K(d) = \sqrt{\exp\left(-d^{2} / \text{kernel width}^{2}\right)}$

where d is the Euclidean distance from x to the synthetic observation, and the default value for kernel width is a function of the number of features in the training dataset; (4) produces a locally weighted regression, using all the synthetic observations. The regression coefficients are meant to explain m’s forecast y.
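As an illustration of steps (3) and (4), the following is a minimal sketch and not LIME's actual implementation. It assumes an exponential kernel of the form sqrt(exp(-d²/width²)), a default width of 0.75 times the square root of the number of features, and a plain weighted least-squares fit. The function names `lime_style_weights` and `weighted_linear_fit` are hypothetical.

```python
import numpy as np

def lime_style_weights(x, synthetic, kernel_width=None):
    """Kernel weights for synthetic rows around x (inputs assumed standardised)."""
    x = np.asarray(x, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    if kernel_width is None:
        # Assumption: the default width grows with the number of features.
        kernel_width = 0.75 * np.sqrt(synthetic.shape[1])
    d = np.linalg.norm(synthetic - x, axis=1)  # Euclidean distance to x
    return np.sqrt(np.exp(-(d ** 2) / kernel_width ** 2))

def weighted_linear_fit(synthetic, probs, weights):
    """Weighted least squares; returns (intercept, coefficients)."""
    synthetic = np.asarray(synthetic, dtype=float)
    design = np.column_stack([np.ones(len(synthetic)), synthetic])
    sw = np.sqrt(np.asarray(weights, dtype=float))
    # Scaling rows by sqrt(weight) turns ordinary least squares into WLS.
    beta, *_ = np.linalg.lstsq(design * sw[:, None],
                               np.asarray(probs, dtype=float) * sw,
                               rcond=None)
    return beta[0], beta[1:]
```

The weights fall off quickly with distance from x, which is the behaviour criticised later in this section: points near the decision boundary may receive very little weight.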

Ribeiro provides an online ‘tutorial’ on LIME, which includes an example of a random forest model of the IRIS dataset. The following is an example of LIME’s output:

Figure 1: Sample output from LIME.

The left graphic shows that the random forest model predicted that x belongs to the class ‘setosa’ with a probability of 1. The middle graphic displays LIME’s regression coefficients. The right graphic displays the values of observation x that were used in the prediction; in this case the user specified that only the three most relevant variables were to be used (these are highlighted in turquoise).

Key problems with LIME: LIME does not measure the fidelity of its regression coefficients. This hides that it may often be producing false explanations. Although LIME displays the values of its regression coefficients, it does not display the predicted values y calculated by its regression model. Let us refer to these values as regression scores (they are not bounded by the interval [0,1]). As part of this paper’s analysis, LIME’s regression scores were calculated for the IRIS test set revealing large errors. For example, in 20% of explanations, the regression scores for setosa differed by more than 0.45 from the probabilities calculated by the random forest model m. Take the example in Figure 1, where m predicted probabilities {1,0,0}; LIME’s regression scores were {0.54,0.47,-0.02} i.e. in this case, LIME’s scores are completely inconsistent with m’s prediction and the corresponding LIME regression coefficients should not be used in its explanation.

It might be thought that an adequate solution would be to provide a goodness-of-fit measure such as adjusted R-squared. However, as will be explained in Section 3, such measures can be highly misleading when evaluating the fidelity of the regression coefficients for estimating w-counterfactuals.

Another problem is that LIME does not provide counterfactual explanations. It might be argued that LIME’s regression equations provide the basis for a user to perform their own counterfactual calculations. However, there are multiple reasons why this is incorrect. First, as will be shown in Section 3, additional functionality is necessary for generating faithful counterfactuals, including the ability to solve quadratic equations. Second, LIME does not ensure that the regression model correctly classifies x. In cases where the regression misclassifies x, any subsequent w-counterfactual will be false. Third, it does not have the means of measuring the fidelity of any counterfactual calculations derived from its regression equation. Fourth, LIME does not offer an adequate dataset for calculating counterfactuals. The data that LIME uses in a local regression need to be representative of the neighbourhood that its explanation is meant to apply to. For counterfactual explanations, this target neighbourhood needs to extend from x to the nearest points of m’s decision boundary. Furthermore, the type of kernel being used is unsuitable; its weightings are too centered around x, when other points (e.g. at the decision boundary) are also important.

2.3 Other Related Work

Early work seeking to provide explanations of neural networks focused on the extraction of symbolic knowledge from trained networks survey , either decision trees in the case of feedforward networks Trepan or graphs in the case of recurrent networks Jacobsson ; leegiles . More recently, attention has shifted from global to local explanation models due to the very large-scale nature of current deep networks, and has focused on explaining specific network architectures (such as the bottleneck in auto-encoders Irina ) or domain-specific networks such as those used to solve computer vision problems dissection , although some recent approaches continue to advocate the use of rule-based knowledge extraction Corels ; sontran . The reader is referred to recentSurvey for a recent survey.

More specifically, Guidotti et al. LORE have proposed LORE – Local Rule based Explanations, which provides local explanations for binary classification tasks using decision trees. It is model-agnostic, generates local models from synthetic data, has many other similarities to LIME, but it also generates counterfactual explanations. Guidotti et al. criticise LIME for producing neighbourhood datasets whose observations are too distant from each other and have too low a density around x. By contrast LORE uses a genetic algorithm to create neighbourhood datasets with a high density around x and the decision boundary.

Guidotti et al. claim that their system outperforms LIME and they provide fidelity statistics comparing LORE and LIME, where fidelity is defined in terms of how well local models perform in making the same classifications as the underlying machine learning system. However, their fidelity statistics for LIME could be misconstrued; it does not follow from being able to mimic a system’s classifications that a local model will also faithfully mimic its counterfactuals (see Section 4).

Ribeiro et al. ANCHORS , the authors of the LIME paper, have subsequently proposed ‘Anchors: High Precision Model-Agnostic Explanations’. In motivating their new method they note that LIME does not measure its own fidelity and that ’even the local behaviour of a model may be extremely non-linear, leading to poor linear approximations’. An Anchor is a rule that is sufficient to ensure that a local prediction will remain with the same classification, irrespective of the values of other variables. By an Anchor being sufficient Ribeiro et al. mean that the local prediction ‘will almost always’ remain unchanged. They specify a pure-exploration multi-armed bandit algorithm that efficiently identifies Anchors. The Anchor method does not have the capacity to generate counterfactuals.

LIME has spawned several variants. For example LIME-SUP LIME-SUP and H2O’s K-LIME H20 both seek to explain a machine learning system’s functionality over its entire input space by partitioning the input space into a set of neighbourhoods, and then creating local models. K-LIME uses clustering and then regression, whereas LIME-SUP just uses decision tree algorithms. LIME has also been adapted to enable novel applications, for example SLIME SLIME provides explanations of sound and music content analysis. However, none of these variants address the problems identified in this paper.

3 The CLEAR Method

CLEAR is based on the view that a satisfactory explanation of a single prediction needs to both explain the value of that prediction and answer ’what-if-things-had-been-different’ questions. In doing this it needs to state the relative importance of the input features and show how they interact. A satisfactory explanation must also be measurable and state how well it can explain a model. It must know when it does not know zoubin .

CLEAR is based on the concept of w-perturbation, as follows:

Definition 5.1 Let $min_f(x)$ denote a vector resulting from applying a minimum change to the value of one feature f in x such that $m(min_f(x)) = y'$ and $m(x) = y$, with $class(y) \neq class(y')$. Let $v_f(x)$ denote the value of feature f in x. A w-perturbation is defined as the change in value of feature f for a target class y’, that is $v_f(min_f(x)) - v_f(x)$.

For example, for the w-counterfactual that Mr Jones would have received his loan if his salary had been $35,000, a w-perturbation for salary is $3000. If x has p features and m solves a k-class problem then there are at most $p \times (k-1)$ w-perturbations of x; changes in a feature value may not always imply a change of classification.

CLEAR compares each w-perturbation with an estimate of that value, call it estimated w-perturbation, calculated using its local regression, to produce a fidelity error, as follows:

fidelity error = | estimated w-perturbation − w-perturbation |

CLEAR generates an explanation of prediction y for observation x by the following steps:

  1. Determine x’s w-perturbations for a user-selected set of features. This is achieved by querying m with feature values starting with x and progressively moving away from x at regular intervals up to a range of possible feature values.

  2. Generate labelled synthetic observations (default is 50,000 observations). Data for numeric features is generated by sampling from a uniform distribution. Data for categorical features is generated by sampling in proportion to the frequencies found in the training set. The synthetic observations are labelled by being passed through m.

  3. Create a balanced neighbourhood data set (default is 200 observations). Synthetic observations that are near to x (Euclidean distance) are selected with the objective of achieving a dense cloud of points around m’s decision boundaries. For this, the neighbourhood data is selected such that it is equally distributed across classes, i.e. approximately balanced.

  4. Perform a step-wise regression on the neighbourhood data set, under the constraint that the regression curve should go through x. The regression can include second degree terms, interaction terms and indicator variables. CLEAR provides options for both multiple and logistic regression.

  5. Estimate the w-perturbations by substituting x’s w-counterfactual values from $min_f(x)$, other than for feature f, into the regression equation and calculating the value of f. See example below.

  6. Measure the fidelity of the regression coefficients. Fidelity errors are calculated by comparing the actual w-perturbations determined in step 1 with the estimates calculated in step 5.

  7. Iterate to best explanation. Because CLEAR produces fidelity statistics, its parameters can be iteratively changed in order to achieve a better trade-off between interpretability and fidelity. Relevant parameters include the number of features/independent variables to consider and the use and number of quadratic or interaction terms.

  8. CLEAR also provides the option of adding x’s w-counterfactuals, $min_f(x)$, to x’s neighbourhood data set. The w-counterfactuals are weighted and act as soft constraints on CLEAR’s subsequent regression. Algorithms 1 and 2 outline the entire process.
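The search for an actual w-perturbation (the first step of the method) can be sketched as a one-dimensional grid search. This is only an illustration under stated assumptions, not the prototype's code: `predict_proba` is a hypothetical callable returning the class-1 probability for one observation, the decision boundary is taken to be 0.5, and the grid bounds `lo` and `hi` stand in for the range of possible feature values.

```python
import numpy as np

def find_w_perturbation(predict_proba, x, feature, lo, hi,
                        n_steps=200, boundary=0.5):
    """Smallest single-feature change that flips the class, or None.

    predict_proba: hypothetical callable, maps one observation to P(class 1).
    feature: index of the feature to vary; lo/hi: range searched.
    """
    x = np.asarray(x, dtype=float)
    start_side = predict_proba(x) >= boundary
    candidates = np.linspace(lo, hi, n_steps)
    # Try candidate values in order of increasing distance from x's value,
    # so the first flip found is the minimum change on this grid.
    for value in sorted(candidates, key=lambda v: abs(v - x[feature])):
        probe = x.copy()
        probe[feature] = value
        if (predict_proba(probe) >= boundary) != start_side:
            return float(value - x[feature])
    return None  # no flip within the searched range
```

The grid spacing bounds the precision of the result, so a finer grid (or a refinement step around the first flip) would be needed for tight fidelity thresholds.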

Example of using regression to estimate a w-perturbation:

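An MLP with a softmax activation function in the output layer was trained on a subset of the UCI Pima Indians Diabetes dataset. The MLP calculated a probability of 0.69 for x belonging to class 1 (having diabetes). CLEAR generated a logistic regression equation of the form $P(x \in \text{class 1}) = (1 + e^{-w^{T}x})^{-1}$, where $w^{T}x$ is a second-degree polynomial in the input features, including Glucose and BloodPressure.

Let the decision boundary be $P(x \in \text{class 1}) = 0.5$. Thus, x is on the boundary when $w^{T}x = 0$. The estimated w-perturbation for Glucose is obtained by substituting $w^{T}x = 0$ and the value of BloodPressure in x into the regression equation, which leaves a quadratic equation in Glucose.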

Solving this equation, CLEAR selects the root equal to 0.025 as being closest to the original value of Glucose in x. The original value for Glucose was 0.537 and hence the estimated w-perturbation is -0.512. The actual w-perturbation (step 1) for Glucose to achieve a probability of 0.5 of being in class 1 was -0.557; hence, the fidelity error was 0.045.
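The arithmetic of this example can be checked with a short sketch. The values 0.537, 0.025 and -0.557 come from the text above. The quadratic coefficients are hypothetical, chosen only so that one root equals 0.025, because the fitted regression coefficients are not given here. The helper name `closest_real_root` is also an assumption.

```python
import numpy as np

def closest_real_root(coeffs, original_value):
    """Real root of a polynomial that lies closest to original_value."""
    roots = np.roots(coeffs)
    real = roots[np.isclose(roots.imag, 0.0)].real
    if real.size == 0:
        return None  # the regression never reaches the boundary
    return float(real[np.argmin(np.abs(real - original_value))])

# Hypothetical quadratic in Glucose: (g - 0.025) * (g - 1.4) = 0
coeffs = [1.0, -1.425, 0.035]

glucose_in_x = 0.537           # original (scaled) Glucose value, from the text
actual_perturbation = -0.557   # actual w-perturbation, from the text

root = closest_real_root(coeffs, glucose_in_x)
estimated_perturbation = root - glucose_in_x
fidelity_error = abs(estimated_perturbation - actual_perturbation)

print(round(root, 3), round(estimated_perturbation, 3), round(fidelity_error, 3))
```

Choosing the root nearest to the original feature value matters because a quadratic can cross the boundary twice, and only the nearer crossing corresponds to a minimum change.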

In summary, a CLEAR explanation has two parts: the first provides x’s w-counterfactuals and the second states the corresponding regression and its fidelity errors. Figure 2 shows excerpts from a CLEAR report.

Figure 2: Example of CLEAR’s w-counterfactual report for a single prediction.

A CLEAR prototype has been developed in Python for binary classification tasks. CLEAR can be run either in batch mode on a test set or it can explain the prediction of a single observation. In batch mode, CLEAR reports the proportion of its estimated w-counterfactuals that have a fidelity error lower than a user-specified error threshold T, as follows:

Definition 5.3 (% fidelity): A w-perturbation is said to be feasible if the resulting feature value is within the range of values found in m’s training set. The percentage fidelity given a batch and error threshold T is the number of w-perturbations with fidelity error smaller than T divided by the number of feasible w-perturbations.
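A minimal sketch of this statistic follows, assuming per-perturbation arrays are already available. The function name and argument layout are hypothetical, and the sketch assumes that only feasible w-perturbations are counted in the numerator as well as the denominator.

```python
import numpy as np

def percent_fidelity(actual, estimated, resulting_values,
                     train_min, train_max, threshold=0.25):
    """% of feasible w-perturbations whose fidelity error is below threshold.

    actual, estimated: actual and estimated w-perturbations.
    resulting_values: feature value after applying each actual perturbation.
    train_min, train_max: range of that feature in the training set
    (scalars or arrays aligned with the perturbations).
    """
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    resulting_values = np.asarray(resulting_values, dtype=float)
    feasible = (resulting_values >= train_min) & (resulting_values <= train_max)
    if not feasible.any():
        return float("nan")  # nothing to score
    errors = np.abs(estimated - actual)
    # Assumption: infeasible perturbations are excluded from both counts.
    return 100.0 * np.sum(feasible & (errors < threshold)) / np.sum(feasible)
```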

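input : t (training data), x, m, T
output : expl (set of explanations)
S ← synthetic observations generated from t and labelled by m
for each target class tc do
       for each feature f do
              w ← actual w-perturbation of f for tc, found by querying m
       end for
       N ← Balanced_Neighbourhood(S, x, m)
       optionally add the w-counterfactuals to N
       r ← regression equation fitted to N
       w' ← w-perturbations estimated from r
       e ← fidelity errors between w and w', compared with T
       return expl = <w, w', r, e>
end for
Algorithm 1 CLEAR Algorithm

input : S (synthetic dataset), x, m
output : N (neighbourhood dataset)
for each s_i in S do
       d_i ← Euclidean_Distance(s_i, x)
       y_i ← m(s_i)
end for
N_1 ← members of S with lowest d_i s.t. y_i ≤ b_1
N_2 ← members of S with lowest d_i s.t. b_1 < y_i < b_2
N_3 ← members of S with lowest d_i s.t. y_i ≥ b_2
return N = N_1 ∪ N_2 ∪ N_3
Algorithm 2 Balanced_Neighbourhood

Notice that for CLEAR an explanation (expl) is a tuple <w, w', r, e>, where w and w' are w-perturbations (actual and estimated), r is a regression equation and e are fidelity errors. In Algorithm 2, the values of b_1 and b_2 are fixed to create a balanced neighbourhood assuming 0.5 as decision boundary, as in the above example.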
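The neighbourhood selection can be sketched in Python as follows. This is an illustration rather than the prototype's code: it assumes three equal-sized probability bands around a 0.5 decision boundary, and the band limits `b1=0.4` and `b2=0.6` are hypothetical values chosen for the example.

```python
import numpy as np

def balanced_neighbourhood(S, x, probs, n=200, b1=0.4, b2=0.6):
    """Indices of a neighbourhood of x balanced across probability bands.

    S: synthetic observations (rows); probs: m's class-1 probability per row.
    b1, b2: hypothetical band limits either side of the 0.5 boundary.
    """
    S = np.asarray(S, dtype=float)
    probs = np.asarray(probs, dtype=float)
    d = np.linalg.norm(S - np.asarray(x, dtype=float), axis=1)
    bands = [probs <= b1, (probs > b1) & (probs < b2), probs >= b2]
    per_band = n // len(bands)  # assumption: equal share per band
    chosen = []
    for mask in bands:
        idx = np.flatnonzero(mask)
        # Keep the rows of this band that are nearest to x.
        chosen.extend(idx[np.argsort(d[idx])][:per_band])
    return np.array(chosen, dtype=int)
```

Selecting the nearest points separately within each band keeps observations near the decision boundary in the neighbourhood even when x itself lies far from it.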

4 Experimental Results

Experiments were carried out with four UCI datasets of binary classification problems: Pima Indians Diabetes (with 8 numeric features), Default of Credit Card Clients (20 numeric features, 3 categorical), and subsets of Adult (with 2 numeric features, 5 categorical features), and Breast Cancer Wisconsin (9 numeric features). For the Adult dataset, some of the categorical feature values were merged and features with little predictive power were removed. With the Breast Cancer dataset only the mean value features were kept. For reproducibility, the code for pre-processing the data is included with the CLEAR prototype on GitHub.

For each dataset, an MLP with a softmax output layer was trained using TensorFlow. Each dataset was partitioned into: an MLP training dataset (out of which 100 observations were selected for determining the total number of synthetic observations and neighbourhood data to be generated by CLEAR) and an MLP test dataset (out of which 100 observations were selected for calculating the % fidelity of CLEAR and LIME). The code for the MLP training is also included on GitHub. Experiments were carried out with different test sets, with each experiment being repeated 20 times for different generated synthetic data. The experiments were carried out on a Windows i7-7700HQ 2.8 GHz PC. A single run of 100 observations took 40-80 minutes, depending on the dataset. The CLEAR prototype has not yet been developed for multi-class datasets.

In order to enable comparisons with LIME, CLEAR includes an option to run the LIME algorithms for creating synthetic data and generating regression equations. CLEAR then takes the regression equations and calculates the corresponding w-counterfactuals and fidelity errors.

Configuration | Pima | Adult | Credit | Breast
CLEAR - not using w-counterfactuals | 57% ± 0.8 | 80% ± 0.9 | 39% ± 1.3 | 54% ± 1.1
CLEAR - using w-counterfactuals | 77% ± 0.8 | 80% ± 0.8 | 55% ± 1.7 | 81% ± 1.3
LIME algorithms | 20% ± 0.4 | 26% ± 0.6 | 12% ± 0.5 | 20% ± 0.5
Table 1: Comparison of % fidelity of CLEAR and LIME: the use of a balanced neighbourhood, centering and quadratic terms allows CLEAR, in general, to achieve a considerably higher fidelity to w-counterfactuals than LIME, even without training with w-counterfactuals. Including training with w-counterfactuals (optional step 8 of the CLEAR method), % fidelity can be increased further.

CLEAR’s regressions are significantly better than LIME’s. The best results are obtained by including w-counterfactuals in the neighbourhood datasets (step 8 of the CLEAR method). Overall, the best configuration comprised: using balanced neighbourhood data, forcing the regression curve to go through x (i.e. ‘centering’), including both quadratic and interaction terms, and using logistic regression for Pima and Breast Cancer and multiple regression for Adult and Credit data sets. Unless otherwise stated % fidelity is for the error threshold T = 0.25.

Table 1 compares the % fidelity of CLEAR and LIME (i.e. using LIME’s algorithms for generating synthetic data and performing the regression). This used LIME’s default parameter values except for the following beneficial changes: the number of synthetic data points was increased to 15,000 (further increases did not improve fidelity), the data was not discretized, a maximum of 15 features was allowed, and several kernel widths in the range from 1.5 to 4 were evaluated. By contrast, CLEAR was run with its best configuration and with 14 features.

As an example of LIME’s performance: with the Credit dataset, the adjusted $R^2$ averaged 0.7 and the classification of the test set observations was over 94% correct. However, the absolute error between y and LIME’s estimate of y was 8% (e.g. the MLP forecast …, while LIME estimated it at 0.48) and this by itself would lead to large errors when calculating how much a single feature needs to change for y to reach the decision boundary. LIME’s fidelity of only 2% illustrates that CLEAR’s measure of fidelity is far more demanding than just classification accuracy. Of course, LIME’s poor fidelity was due, in part, to its kernel failing to isolate the appropriate neighbourhood data sets necessary for calculating w-counterfactuals accurately. A further problem is that LIME converts categorical features into Boolean variables, often losing valuable information. This can be seen in the very poor fidelity statistics for the Adult and Credit data sets, which contain categorical variables, unlike the Pima and Breast Cancer data.

Table 2 shows how CLEAR’s fidelity (not using w-counterfactuals) varied with the maximum ‘number of independent variables’ allowed in a regression. At first, fidelity sharply improves but then plateaus.

Table 2: No. variables vs % Fidelity
No. | Pima | Adult | Credit | Breast
8 | 42% | 35% | 27% | 43%
11 | 53% | 76% | 38% | 46%
14 | 57% | 80% | 39% | 54%
17 | 59% | 78% | 40% | 55%
20 | 62% | 78% | 39% | 56%

Figure 3: Different configurations

Despite the regression fitting the neighbourhood data well, a significant number of the estimated w-counterfactuals have large fidelity errors. For example, in one of the experiments with the Adult dataset where CLEAR’s multiple regression did not center the data, the average adjusted $R^2$ was 0.97 and classification accuracy was 98%, but the ‘% fidelity error < 0.25’ was 59%. This points to a more general problem: sometimes the neighbourhood data sets do not represent the regions of the feature space that are central for the explanations. With CLEAR, this discrepancy can at least be measured.

CLEAR was tested in a variety of configurations. These included the best configuration, and configurations where a single option was altered from the default, e.g. by using an imbalanced neighbourhood of points nearest to x. Figure 3 displays the results when CLEAR used a maximum of 14 independent variables.

CLEAR’s fidelity was sharply improved by adding x’s w-counterfactuals to its neighbourhood datasets. In the previous experiments, CLEAR created a neighbourhood dataset of at least 200 synthetic observations, each being given a weighting of 1. This was now altered so that each w-counterfactual identified in step 1 was added and given a weighting of 10. For example, for the Pima dataset, an average of 3 w-counterfactuals were added to each neighbourhood dataset. The consequent improvement in fidelity indicates as expected that adding these weighted data points results in a dataset capable of representing better the relevant neighbourhood, with CLEAR being able to provide a regression equation that is more faithful to w-counterfactuals.
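The effect of the weighting can be illustrated with a small sketch. It is not the prototype's regression code (which is step-wise and may include quadratic and interaction terms); it assumes a plain weighted least-squares fit with NumPy, and the function name `fit_with_counterfactuals` is hypothetical. Only the weights of 1 and 10 come from the text.

```python
import numpy as np

def fit_with_counterfactuals(N_X, N_y, cf_X, cf_y, cf_weight=10.0):
    """Weighted least squares on neighbourhood rows plus w-counterfactual rows.

    Neighbourhood rows get weight 1; w-counterfactual rows get cf_weight.
    Returns coefficients with the intercept first.
    """
    X = np.vstack([N_X, cf_X]).astype(float)
    y = np.concatenate([N_y, cf_y]).astype(float)
    w = np.concatenate([np.ones(len(N_y)), np.full(len(cf_y), cf_weight)])
    design = np.column_stack([np.ones(len(X)), X])
    sw = np.sqrt(w)
    # Row scaling by sqrt(weight) makes lstsq minimise the weighted residuals.
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return beta
```

Because the w-counterfactual rows carry a larger weight, the fitted curve is pulled towards them without being forced through them, which matches their description as soft constraints.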

5 Conclusion and Future Work

CLEAR explains a prediction y for data point x by stating x’s w-counterfactuals and providing a regression equation. The regression shows the patterns of counterfactual dependencies in a neighbourhood that includes both x and the w-counterfactual data points. CLEAR represents a significant improvement both on LIME and on just using w-counterfactuals. Key to CLEAR’s performance is the ability to generate relevant neighbourhood data bounded by its w-counterfactuals. Adding these to x’s neighbourhood led to sharp fidelity improvements. Other data points could also be added, for example w-counterfactuals involving changes to multiple features. A user might also include some perturbations that are important to their project. CLEAR could guide this process by reporting regression and fidelity statistics. And it could replace step 8 of the algorithm by increasingly complex and more sophisticated learning algorithms. Constructing neighbourhood datasets in this way would seem a better approach than randomly generating data points and then selecting those closest to x. The prototype will also be extended to multi-class tasks. A criticism of CLEAR might be that its regression equations are not sufficiently interpretable. This is because they include a large number of terms, including 2nd degree and interaction variables. A first response is that the inclusion of these terms is necessary (though not sufficient) for CLEAR’s regression to be faithful. A second response is that CLEAR’s equations can easily be simplified by substituting in the values of any feature in x that are not of interest to the user. This is to be evaluated in practice in the context of comprehensibility studies.