Recent interest in explaining the output of complex machine learning models has been characterized by a wide range of approaches [9, 13]. Many of these approaches are model specific; for example, attempts to explain neural networks that rely on interpreting the flow of gradient information through the model [18, 14, 6], or decision trees, which might be considered directly interpretable.
Model agnostic approaches, however, are attempts to formulate a general framework for per-instance explanation of a model’s outputs regardless of the type of model being used. This can be beneficial both in circumstances where the choice of model may change over time, and where the original model is costly or impossible to access.
One group of model-agnostic explainers focuses on providing an explanation of a model’s output as either a subset of input features [17, 1], or a weighting of input features [16, 11] of the instance to be explained. Another group of models [20, 19] instead proposes that counterfactual instances, or groups of instances, are a useful proxy to ’explanation’, where the claim is that local explanations are expected to contain both the outcome of a prediction, and how that prediction would change if the input changed. Many of these approaches use sampling procedures to either estimate local decision boundaries (and their corresponding parameters), or to find proximate counterfactual instances, and are thus computationally expensive. The computational cost of sampling the local decision boundaries for each new explanation makes these methods slow to scale, and of limited use in real-world applications.
We propose EMAP, Explanation by Minimal Adversarial Perturbation, a model that returns the direction in which an instance would have to be perturbed the least in order for the classification of the underlying model to change. EMAP’s contribution is threefold:
EMAP combines elements of both the feature weighting and counterfactual paradigms of model explanation, and is fully model-agnostic.
EMAP is faster than alternative methods by 5 orders of magnitude, once constant overheads are taken into account, allowing for model explanations in time-critical applications where sampling-based methods are infeasible.
EMAP naturally indicates regions of low classifier confidence, or potential user interest, as a consequence of its design.
The paper is structured as follows. In Related Work we provide a summary of recent alternative approaches to instance-wise model-agnostic explanation. The Model section includes a justification of our approach, and separately describes how we handle continuous and categorical input variables. Experimental Setup details the datasets and model choices made. The Results section analyses the model’s performance on two synthetic datasets with continuous features, a more complex continuous dataset, and a standard dataset with both continuous and categorical features. We summarise our findings in the Conclusion.
2 Related Work
One of the most widely-used feature weighting approaches to per-instance explanation of a black box model’s outputs is LIME [16], which learns a local surrogate approximation to the original model’s output, centered on the instance to be explained. It does this by generating a new dataset of perturbed samples and corresponding predictions of the black box model, and then training an interpretable linear model on this new dataset, where each point is weighted by its proximity to the point of interest. The weights of the linear model are then considered to be the explanations of the black box model’s output at that point. LIME can also be considered slow; its reliance on sampling afresh for every data point reduces the speed at which explanations can be collected for large numbers of instances of interest.
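The sampling-and-fitting procedure described above can be sketched as follows. This is a minimal illustration of the idea, not the LIME library itself: the Gaussian sampling scheme, the exponential proximity kernel, the plain weighted least-squares fit, and all names and default values (`local_linear_explanation`, `scale`, `kernel_width`) are assumptions made for the example.

```python
import numpy as np

def local_linear_explanation(predict, x, n_samples=2000, scale=0.5,
                             kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear model around x (LIME-style sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Sample perturbed copies of the instance (assumed Gaussian noise).
    Z = x + scale * rng.standard_normal((n_samples, x.size))
    # 2. Query the black box on the new samples.
    y = predict(Z)
    # 3. Weight each sample by its proximity to x (assumed exponential kernel).
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Weighted least squares with an intercept; the slopes are the explanation.
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(weights)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)
    return coef[1:]

# Hypothetical black box: a smooth classifier over two features.
black_box = lambda Z: 1.0 / (1.0 + np.exp(-(3.0 * Z[:, 0] - Z[:, 1])))
explanation = local_linear_explanation(black_box, np.array([0.2, -0.1]))
```

Every call draws and scores `n_samples` fresh points, which is the per-instance cost referred to above.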
Separate work has shown that those explanation methods that return a weighting of input features, including LIME, can all be considered as additive feature attribution methods, with an explanation model that is a linear function of binary variables. This unified framework is called SHAP [11], and accompanying methods exist to estimate feature importance values for instance predictions on particular models [10].
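For reference, the additive feature attribution form mentioned above is usually written as below. The symbols ($g$ for the explanation model, $z'$ for the binary simplified inputs, $\phi_i$ for the attributions, $M$ for the number of simplified features) follow the common presentation of SHAP and are not notation defined elsewhere in this paper.

```latex
g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i, \qquad z' \in \{0, 1\}^M
```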
One attempt to produce fast instance-based explanations is L2X [1], where the authors train a neural network to output a binary mask over instance features, and a second network to return the original black box model output from the masked input. By training on a cross entropy objective, they argue that they are effectively maximising the mutual information between some subset of input features and the true model output. The subset of features chosen once the explainer is trained should be the maximally informative subset, and thus a good explanation of the black box model output. This approach shares some similarity with ours, insofar as the second network can be thought of as learning a differentiable surrogate to the true model, although the authors do not consider their model in these terms. A crucial drawback of L2X is that it does not provide a weighting of feature importances, nor does it provide the direction in which a given feature would impact classification.
An example of the fact that adversarial examples can be good explanations of underlying models is the work of Wachter et al. [19]. Here the approach, assuming a trained model $f$, is to minimise
$$\mathcal{L}(x, x', y', \lambda) = \lambda \left(f(x') - y'\right)^2 + d(x, x'),$$
where the first term is the quadratic distance between the output of the model under some counterfactual input $x'$ and a new target $y'$, and the second term is a measure of the distance between the true input to be explained, $x$, and its possible counterfactual instance $x'$. This approach is similar in spirit to ours, but differs in several important ways.
Firstly, the method returns a set of counterfactual instances, rather than a counterfactual direction. Secondly, the procedure to generate one counterfactual example for one point requires iterating between minimising the above objective and increasing $\lambda$, and the authors recommend initialising a sample of potential counterfactuals and repeating the process on all of them, to avoid getting stuck in local minima. This means the process is slow. Thirdly, optimising the above objective assumes that $f$ is tractable (for example, a gradient based optimiser would need the gradient of $f$ with respect to $x'$). This limits the approach to only those models where this is the case, whereas by training a differentiable approximation to the black box model, we circumvent this issue.
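To make the two terms of the counterfactual objective concrete, the sketch below evaluates it for a single candidate counterfactual. The text above only says that the second term is "a measure of the distance"; the plain L1 distance used here, and the names `counterfactual_loss`, `predict` and `lam`, are assumptions for illustration.

```python
def counterfactual_loss(predict, x, x_cf, target, lam):
    """Prediction term weighted by lam, plus a distance term (assumed L1)."""
    prediction_term = lam * (predict(x_cf) - target) ** 2
    distance_term = sum(abs(a - b) for a, b in zip(x, x_cf))
    return prediction_term + distance_term

# Hypothetical model that simply returns the first feature.
predict = lambda v: v[0]
x = [0.0, 0.0]

loss_at_input = counterfactual_loss(predict, x, x, target=1.0, lam=10.0)
loss_at_counterfactual = counterfactual_loss(predict, x, [1.0, 0.0], target=1.0, lam=10.0)
```

Staying at the original input pays the full prediction penalty, while moving to a point that hits the target pays only the distance travelled; increasing `lam` tilts the balance further towards reaching the target.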
Another similar approach can be found in CLEAR [20], which includes an interesting model of fidelity, although again the process of extracting an explanation requires sampling and iterative solving.
In short, LIME, SHAP, and other sampling-based models require thousands of model-evaluations for each instance that needs to be explained. L2X needs only one forward pass of a neural network per explanation, but does not provide a weighting of feature importances, nor directionality of explanations. With EMAP, we provide a method that retains the benefits of LIME and SHAP, while providing computational efficiency on par with L2X.
In the domain of explaining the outputs of neural networks, particularly for image classification, there are several examples of papers which use adversarial or perturbatory approaches [2, 3, 5]. These approaches often rely on dividing images into regions, which places a strong modelling prior on correlations between input features (in this case, pixels). As our approach is fundamentally more general, we are not able to make similar assumptions, and likely would have substantially different use-cases. Two of these papers [2, 3] also assume differentiability, whilst the third treats ’perturbations’ as one of three regional noise masks; instead of learning feature-specific meaningful perturbations as in our approach.
Our general approach to the problem of explaining an instance’s classification by a model is to find the minimal adversarial perturbation of that instance. This can be thought of as an answer to the question ’what is the smallest change we can make to this instance to change its classification?’. We argue that this is a useful measure for two reasons.
First, it is locally meaningful. An instance’s classification depends on its location relative to the classifier’s decision boundary or boundaries. The minimal adversarial perturbation will ’point’ directly to the nearest decision boundary. Features that contribute substantially to this minimal perturbation must also be features that have contributed substantially to the instance’s classification. If we imagine perturbing the features of an instance equally, those with relatively large contributions to the original classification will be just those that have a relatively large contribution to subsequent misclassification.
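As a worked illustration of this geometric claim, consider the special case of a linear classifier with Euclidean distance (an assumption made only for this example; the symbols $w$ and $b$ are not used elsewhere in the paper). The smallest perturbation that moves $x$ onto the decision boundary is then available in closed form, and it lies along the weight vector:

```latex
f(x) = \operatorname{sign}(w^\top x + b), \qquad
\delta^* = -\frac{w^\top x + b}{\lVert w \rVert_2^2}\, w, \qquad
\lVert \delta^* \rVert_2 = \frac{\lvert w^\top x + b \rvert}{\lVert w \rVert_2}
```

Features with large weights $w_i$ dominate both the classification score and $\delta^*$, which is the correspondence argued for above.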
Secondly, it is useful for an end-user. The outputs of a model often require explanation due to a desire for improvement, or, more specifically, instances that require further justification are often instances which have been wrongly classified, or are suspected to have been wrongly classified. Indicating what should be changed to allow an instance to be alternatively classified satisfies this requirement directly, and in a manner which is arguably more interpretable than providing the weights of a local linear model.
3.2 Continuous input features
Let us assume we have access to a set of outputs of some model $f$, where for binary classification, $f(x)$ will be the probability that input instance $x$ belongs to the target class, or a corresponding indicator function ($f(x) \in \{0, 1\}$). [Footnote 1: For the sake of clarity, we will initially assume a binary classification. Multi-class classification is dealt with below, and regression is discussed in the conclusion.] For each $x$ we wish to explain, our goal is to find the smallest adversarial perturbation; i.e. the smallest perturbation $\delta$ such that $f(x + \delta) \neq f(x)$. Here, $\delta \in \mathbb{R}^d$, and if minimal, can be thought of as the shortest distance from $x$ to the decision boundary of $f$.
The space of possible perturbations is prohibitively large for an exhaustive search per instance to be explained, and so we will assume a restricted class of models $g$, mapping data space to perturbations. Our approach in this paper is to represent such a mapping as $g_\theta: \mathbb{R}^d \to \mathbb{R}^d$, $\delta = g_\theta(x)$, a differentiable function described by a neural network with parameters $\theta$. Ideally, we would then like to compute the optimal adversarial parameter settings $\theta^*$ by standard gradient-based methods, using:
$$\theta^* = \arg\min_\theta \sum_i \Big[ \mathcal{L}\big(f(x_i + g_\theta(x_i)),\, \tilde{y}_i\big) + \lambda \lVert g_\theta(x_i) \rVert \Big] \tag{1}$$
where $\mathcal{L}$ is the classification loss, $\lambda$ is a hyperparameter restricting the size of generated perturbations, and $\tilde{y}_i$ are the adversarial labels.
However, in a model-agnostic setting, we cannot assume $f$ to be differentiable, or even that we have access to $f$ itself to compute $f(x + \delta)$. [Footnote 2: Or at least, we cannot assume we have access to the gradients of $f$.] We therefore further define a surrogate $\hat{f}_\phi$, also a neural network, which is trained to be a differentiable approximation to $f$ by cross entropy loss:
$$\phi^* = \arg\min_\phi \sum_i \mathrm{CE}\big(\hat{f}_\phi(x_i),\, f(x_i)\big) \tag{2}$$
Substituting $\hat{f}_{\phi^*}$ for $f$ in (1) finally gives us a tractable objective:
$$\theta^* = \arg\min_\theta \sum_i \Big[ \mathcal{L}\big(\hat{f}_{\phi^*}(x_i + g_\theta(x_i)),\, \tilde{y}_i\big) + \lambda \lVert g_\theta(x_i) \rVert \Big] \tag{3}$$
Note that $\tilde{y}_i$ remains unchanged, as it does not depend on $\hat{f}_\phi$, and we have assumed we know $f(x_i)$ for all $x_i$ in our data.
In practice, training is carried out in two stages. Firstly, we train $\hat{f}_\phi$ on the original inputs and original labels to approximate the black box model $f$. Secondly, we freeze the weights of $\hat{f}_\phi$ and train $g_\theta$ on the original inputs and flipped labels; the perturbations output by $g_\theta$ are added to the original inputs and passed through the surrogate $\hat{f}_\phi$. As $\hat{f}_\phi$ is a differentiable model, back-propagation provides the gradients of the loss with respect to the perturbations, and hence with respect to $\theta$. We can therefore train $g_\theta$ directly using the original dataset.
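The two-stage procedure can be sketched end to end on a toy problem. To keep the example dependency-free, the sketch deliberately swaps the neural networks for much simpler stand-ins: the surrogate is a logistic regression, the perturbation generator is an affine map, gradients are derived by hand, and the size penalty is a squared L2 norm. The data, the hypothetical black box (`X[:, 0] > 0`), and all step sizes are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 2))
y = (X[:, 0] > 0).astype(float)   # labels from a hypothetical black box

# Stage 1: fit the surrogate sigmoid(x @ w + c) to the black-box labels
# by gradient descent on the cross-entropy loss.
w, c = np.zeros(2), 0.0
for _ in range(300):
    p = sigmoid(X @ w + c)
    w -= 0.5 * (X.T @ (p - y)) / len(X)
    c -= 0.5 * np.mean(p - y)

# Stage 2: freeze (w, c) and train the generator delta = x @ W + b on the
# flipped labels, penalising the squared size of the perturbation.
y_adv = 1.0 - y
lam = 0.1
W, b = np.zeros((2, 2)), np.zeros(2)
for _ in range(500):
    delta = X @ W + b
    p = sigmoid((X + delta) @ w + c)
    # Per-instance gradient of (cross entropy + lam * ||delta||^2) w.r.t. delta.
    grad_delta = (p - y_adv)[:, None] * w[None, :] + 2.0 * lam * delta
    W -= 0.05 * (X.T @ grad_delta) / len(X)
    b -= 0.05 * grad_delta.mean(axis=0)

# After training, one matrix product yields a perturbation for every instance.
delta = X @ W + b
recovery_accuracy = np.mean((sigmoid(X @ w + c) > 0.5) == (y > 0.5))
adversarial_accuracy = np.mean((sigmoid((X + delta) @ w + c) > 0.5) == (y_adv > 0.5))
```

The surrogate's weights are only read, never updated, in the second loop; this mirrors the freezing step described above, with the chain rule through the surrogate supplying the gradient for the generator.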
3.3 Discrete input features
For many applications, however, some or all of the input features of $x$ will be discrete, rather than continuous. For some categorical feature $x_j$, which takes values in $\{1, \dots, K_j\}$, outputting a continuous value from our perturbation generator is unhelpful. We first consider the case in which all input features are categorical.
We take the general approach that perturbing a categorical feature means sampling from a corresponding categorical distribution and assigning the feature the sampled value. For each categorical $x_j$, our mapping from data space $\mathcal{X}$ to perturbation space contains the corresponding sub-mapping $g_\theta^{(j)}: \mathcal{X} \to \mathbb{R}^{K_j}$, assuming a 1-hot encoding, where each of the $K_j$ real valued outputs is treated as the log class probability of the corresponding value of the categorical feature.
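To train the perturbation generator $g_\theta$ to find adversarial samples, we can use the softmax function as a continuous differentiable approximation to $\arg\max$, which allows us to use the Gumbel-Softmax trick to generate $K_j$-dimensional sample vectors $z$ where the $k$-th element is given by:
$$z_k = \frac{\exp\big((\log \pi_k + \epsilon_k)/\tau\big)}{\sum_{l=1}^{K_j} \exp\big((\log \pi_l + \epsilon_l)/\tau\big)} \tag{4}$$
where $\log \pi_k$ is the $k$-th log class probability output for the feature, and $\tau$ is a hyperparameter governing the temperature of the distribution; as it approaches 0, the Gumbel-Softmax distribution approaches the Categorical distribution. The noise terms are $\epsilon_k = -\log(-\log u_k)$, where $u_k \sim \mathrm{Uniform}(0, 1)$ (see Jang et al. (2016) for more details).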
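A minimal sketch of drawing one relaxed sample for a single categorical feature is given below. The function name, the clipping of the uniform draw away from zero, and the max-subtraction for numerical stability are implementation choices assumed here, not details taken from the paper.

```python
import numpy as np

def gumbel_softmax_sample(log_probs, tau, rng):
    """Draw one relaxed one-hot sample from unnormalised log class probabilities."""
    # Gumbel(0, 1) noise via the inverse transform -log(-log(u)).
    u = rng.uniform(low=1e-12, high=1.0, size=log_probs.shape)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled softmax over the perturbed log probabilities.
    logits = (log_probs + gumbel) / tau
    logits -= logits.max()            # numerical stability only
    z = np.exp(logits)
    return z / z.sum()

rng = np.random.default_rng(0)
log_probs = np.log(np.array([0.7, 0.2, 0.1]))
sample = gumbel_softmax_sample(log_probs, tau=0.5, rng=rng)
```

Lower values of `tau` push the sample towards a 1-hot vector, while higher values spread the mass more evenly across classes; in either case the output is a differentiable function of `log_probs`.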
When training the perturbation generator $g_\theta$, these samples are then concatenated into a perturbed instance, $\tilde{x}$, which is passed through the pre-trained surrogate model as before.
The only other difference to the training procedure is that the term in the objective intended to minimise the size of the adversarial perturbations in (3), $\lambda \lVert g_\theta(x) \rVert$, must be changed to account for the fact we are no longer perturbing by adding small vectors to an input in $\mathbb{R}^d$. We make the simplest assumption that if perturbed feature $\tilde{x}_j$ takes on the same value as the original feature $x_j$, it has a perturbation cost of 0, and otherwise has a cost proportional to a hyperparameter $\gamma$. This yields the following regularisation term:
$$\gamma \sum_{j=1}^{J} \big(1 - x_j^\top \tilde{x}_j\big) \tag{5}$$
where $x_j$ and $\tilde{x}_j$ are 1-hot vectors of length $K_j$ (which may be different for different $j$), and $J$ is the number of categorical variables in $x$.
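The cost rule described above (zero if a categorical feature keeps its value, a fixed cost otherwise) can be sketched as a sum over features of one minus the overlap between the original and perturbed 1-hot vectors. The function name and the argument `gamma` are hypothetical, and this overlap form is one simple way to realise the stated rule rather than a confirmed reproduction of the paper's equation.

```python
def categorical_perturbation_cost(original, perturbed, gamma):
    """Sum of per-feature costs: 0 if the 1-hot vectors agree, gamma if they differ."""
    cost = 0.0
    for x_j, x_tilde_j in zip(original, perturbed):
        overlap = sum(a * b for a, b in zip(x_j, x_tilde_j))  # 1 if same value, else 0
        cost += gamma * (1.0 - overlap)
    return cost

# Two categorical features with 3 and 2 possible values respectively.
original = [[1, 0, 0], [0, 1]]
unchanged = [[1, 0, 0], [0, 1]]
one_changed = [[0, 0, 1], [0, 1]]

cost_unchanged = categorical_perturbation_cost(original, unchanged, gamma=0.5)
cost_one_changed = categorical_perturbation_cost(original, one_changed, gamma=0.5)
```

Because the overlap is a dot product, the same expression also accepts relaxed (non-1-hot) samples and stays differentiable with respect to them.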
Our approach also supports a hybrid of both categorical and continuous variables, by combining the two objectives outlined above, where each affects the appropriate variables. The main challenge here is the relative magnitudes of $\lambda$ and $\gamma$, the hyperparameters weighting the continuous and categorical regularisation terms respectively. We found (see discussion in Results, below) that for simple datasets setting $\lambda$ to around an order of magnitude smaller than $\gamma$ yielded good results.
4 Experimental Setup
For all experiments below, except where otherwise stated, the neural network parameterising the perturbation generator $g_\theta$ consists of four fully connected layers of size 100 with ReLU nonlinearities and a ’partial gumbel layer’ that combines standard additive perturbations for continuous variables with a collection of Gumbel-Softmax outputs for categorical variables, as discussed in the ’Model’ section, above. We used a dropout rate of 20% for every layer.
The neural network parameterising the surrogate $\hat{f}_\phi$ consists of three fully connected layers of size 200, with the first two nonlinearities being ReLU, and the final Softmax. We used a cross-entropy loss, as is standard for classification, and trained both models using Adam [7], with a learning rate of 1e-3. On simple synthetic datasets, both networks converge in under 15 epochs.
Network architecture and hyperparameters were chosen to be as simple as possible whilst providing reasonable results on a variety of datasets. Our intention was to showcase the generality and robustness of our model, so we avoided hyperparameter tuning or intensive model selection. Several similar architectures (more layers, wider fully connected layers) worked equally well, and an analysis of their relative merits is not pertinent to this initial presentation of the model.
For simple synthetic data we used 10000 samples from the half moons dataset, available in scikit-learn [15], with Gaussian noise with standard deviation 0.2 added to the data. Our second synthetic dataset was handcrafted, and is described in the Results section, below. A more realistic continuous dataset was MNIST [8], which we converted to a binary classification task by using only the digits 8 and 3, which gave a train/test split of 11982/1984, and training a classifier to predict between them. This approach was followed by both Lundberg and Lee [11] and Chen et al. [1].
Finally, to test the performance of our method on a mix of categorical data and continuous data, we used a dataset available from the UCI machine learning repository [4]. This was a subset of the Adult dataset, where in a similar fashion to White and d’Avila Garcez [20] uninformative or highly skewed features (’fnlwgt’, ’education’, ’relationship’, ’native-country’, ’capital-gain’, ’capital-loss’) were removed, along with instances with missing values. The two classes were then balanced by undersampling the larger class, yielding a 17133/5711 train/test split. This left 3 continuous features, which were normalised to have zero mean and unit variance, and 5 categorical features (see Table 1 for example instances).
5.1 Continuous Features - Comparison to LIME
We first demonstrate that in simple continuous input spaces, EMAP closely approximates LIME on a standard synthetic dataset, and succeeds in highlighting regions of interest in a manner unavailable to LIME.
We trained a Random Forest with 200 trees to classify the half-moons dataset (with a train/test split of 8000/2000) provided as standard with the scikit-learn toolset [15]. The classifier had an f-score of 0.97 on the test set. We then generated explanation coefficients for the classification of 2000 randomly sampled points in the dataset using the off-the-shelf LIME toolkit. Figure 1 (left) shows 750 of these coefficients plotted as vectors starting at the location of the point to be explained.
Secondly, we trained our surrogate on the input/output pairs of the Random Forest classifier, again with an 8000/2000 split. Our surrogate achieved a recovery accuracy of 0.981 on the test set. [Footnote 3: It recovered the Random Forest’s classification 98% of the time.] We then trained our perturbation network on the opposite class labels, and it achieved an adversarial accuracy of 0.977 on the test set. Figure 1 (right) shows the negative minimal perturbations returned by the perturbation network for the same 750 points explained by LIME.
We present the negative perturbations for ease of comparison - by construction, minimal perturbations will point towards the nearest decision boundary whilst the weights of LIME’s fitted linear model will point away from the nearest decision boundary. If presenting EMAP’s outputs as explanations of the actual classification à la LIME, this negation is necessary. If presenting EMAP’s outputs as the perturbations required to cause a misclassification, the outputs of the perturbation network can be directly reported.
In this simple continuous space, the explanations output by EMAP correspond closely with those output by LIME. The mean cosine similarity between the 2000 LIME explanations and the 2000 EMAP explanations is 0.936.
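The comparison metric can be computed as sketched below: a row-wise cosine similarity between the two sets of explanation vectors, averaged over instances, with the EMAP perturbations negated first as described above. The array names and the tiny example values are placeholders, not the experimental data.

```python
import numpy as np

def mean_cosine_similarity(A, B):
    """Average cosine similarity between corresponding rows of A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    dots = np.sum(A * B, axis=1)
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return float(np.mean(dots / norms))

# Placeholder vectors: LIME weights point away from the boundary,
# EMAP perturbations point towards it.
lime_weights = np.array([[1.0, 0.0], [0.0, 2.0]])
emap_perturbations = np.array([[-2.0, 0.0], [0.0, -1.0]])

similarity = mean_cosine_similarity(lime_weights, -emap_perturbations)
```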
In addition, EMAP has two clear advantages over LIME on this sort of data; it is faster, and it indicates how close an instance is to a decision boundary, which can be treated as a proxy to how confident we should be in the black box classifier’s prediction. In terms of speed, the time for LIME to generate the 2000 explanations above was 214 seconds. EMAP took 53.5 seconds to train once, and subsequently generated 2000 explanations in 1.32e-2 seconds.
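The timings reported above can be turned into a per-explanation comparison and a break-even point with a few lines of arithmetic. Only the three measured times and the count of 2000 come from the text; the variable names are ours, and the calculation assumes both methods scale linearly with the number of explanations.

```python
import math

n_explanations = 2000
lime_total_s = 214.0       # LIME, 2000 explanations (reported above)
emap_train_s = 53.5        # EMAP, one-off training cost (reported above)
emap_total_s = 1.32e-2     # EMAP, 2000 explanations once trained (reported above)

lime_per_explanation = lime_total_s / n_explanations
emap_per_explanation = emap_total_s / n_explanations

# Ratio of per-explanation costs, ignoring EMAP's training time.
inference_speedup = lime_per_explanation / emap_per_explanation

# Smallest number of explanations at which EMAP (training included) is cheaper.
break_even = math.ceil(emap_train_s / (lime_per_explanation - emap_per_explanation))
```

On these figures the per-explanation ratio is roughly 1.6e4, and EMAP's training cost is recovered after about 500 explanations.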
As a consequence of regularising the perturbation network to return the minimal perturbation, instances which have perturbations with small magnitude, relative to the average for the dataset, are instances close to a decision boundary. This might be an indication that these instances are worth further examination; either by a preferred but slower explanation model, or directly by a user attempting to diagnose the behaviour of the black box model. In Figure 1 (right), the smallest 10% of perturbation vectors have been highlighted in red, and clearly track the decision boundary. Removing them from the cosine comparison improves mean cosine similarity with LIME’s explanations to 0.964.
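Flagging the instances with the smallest perturbations can be sketched as a simple quantile threshold on the perturbation norms. The function name, the use of the Euclidean norm, and the use of `np.quantile` to define "smallest 10%" are assumptions for illustration.

```python
import numpy as np

def near_boundary_mask(perturbations, fraction=0.10):
    """Boolean mask selecting the instances with the smallest perturbation norms."""
    norms = np.linalg.norm(perturbations, axis=1)
    threshold = np.quantile(norms, fraction)
    return norms <= threshold

# Placeholder perturbations with norms 1, 2, ..., 10.
perturbations = np.array([[float(i), 0.0] for i in range(1, 11)])
mask = near_boundary_mask(perturbations, fraction=0.10)
```

The flagged instances can then be highlighted in a plot, excluded from a comparison, or passed on to a slower explanation method for closer inspection.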
That this functionality has the potential to highlight regions of interest can be demonstrated using a simple handcrafted dataset, which we have called the ’offset blob’ dataset (see Figure 2). Here, a binary classification problem with a simple linear boundary is complicated by a region of positive instances within the general negative region. Data is generated from a standard 2d Normal distribution, and classified as class 2 if, and class 1 otherwise. Additionally, a smaller amount of data (20%), all class 1, is generated by , where and . The intention was to simulate a dataset where the black box classifier’s decision boundary will necessarily be somewhat uncertain in a particular region.
As before, we trained a Random Forest classifier on the synthetic dataset, and it achieved an f-score of 0.910. We then trained our surrogate and perturbation network on the Random Forest’s classifications, with a recovery accuracy of 0.935 on the test set, and an adversarial accuracy of 0.921 on the test set, respectively. The lower recovery and adversarial accuracies might be an indication that something is amiss - mean cosine similarity between 1100 LIME explanations and 1100 EMAP negative perturbations is also substantially lower, at 0.688.
Figure 3 shows the explanation vectors generated by EMAP for the critical region of the space. Note that whilst the minimal perturbations given for points in the unclear region are partly incorrect (as the region contains instances from both class 1 and class 2, there is no single ’true’ solution for EMAP or the underlying black box model to find) they are also extremely small - to the extent that we were forced to normalise the vector lengths in Figure 3 to make them visible.
In comparison, LIME’s explanations of the points in the critical region are uninformative. LIME’s explanations of class 1 points in the critical region are identical in direction to those in the general class 1 region - LIME’s linear fit to sampled data for those points captures the general trend of the data only, and fails to indicate that there is anything amiss. Thus we might characterise LIME’s output for a point in the critical region as correct but misleading, whereas EMAP’s output is incorrect but indicative of low confidence.
5.2 Continuous Features - MNIST Pixel Perturbation
To demonstrate EMAP’s performance on more complex data with much larger feature spaces, we trained a 200 tree Random Forest classifier on a two-class subset of the MNIST dataset, where the classes were ’8’ and ’3’. The Random Forest achieved an f-score of 0.98. We then trained EMAP on the labels provided by the Random Forest. The structure of both surrogate and perturbation networks was identical to that in the simple synthetic cases detailed above (see the Experimental Setup section for an overview). The surrogate model achieved a recovery accuracy of 0.983 on the test set, and the perturbation network an adversarial accuracy of 0.975 on the test set.
Figure 4 shows examples of the minimal perturbation required to change the surrogate’s classification to the incorrect label. Both instances shown also flip the classification of the unseen Random Forest. As can be seen, EMAP has learned to either remove part of the left hand strokes of 8s, or partly fill the gaps for 3s. That it does not do so fully is due to its remit to recover minimal perturbations - it does not need to fully remove or redraw the relevant part of the digit to flip the classifier’s decision.
5.3 Categorical Features
Lastly, we show how EMAP handles a mixture of continuous and categorical variables. On a subset of the UCI Adult Dataset [4], our Random Forest achieved an f-score of 0.81, and EMAP a surrogate accuracy of 0.882, and an adversarial accuracy of 0.853.
Table 1 shows three example perturbations produced by EMAP. An open question when dealing with data with a mixture of variables is: to what extent are perturbations comparable? In Example 1, Table 1, EMAP reduces the age of the individual by around 6 years, and changes their marital status from ’Married’ to ’Widowed’. Which is a more substantial change? When searching for a minimal perturbation, the model’s relative weighting of age (a continuous variable) and marital status (a discrete variable) is dependent on the relative magnitudes of $\lambda$ and $\gamma$, the hyperparameters weighting the minimising regularisation terms for continuous and discrete variables, respectively (see Equation (5)).
In practice, we found that it was necessary to set $\gamma$, the weight on changes to categorical variables, to around an order of magnitude larger than $\lambda$, the weight on continuous perturbations (the values for the above perturbations were , ), to prevent the model from making such substantial changes to the categorical variables of each instance so as to be uninformative. With this setting, we found that changes to marital status and occupation dominated the minimal perturbations for those individuals who were already close to the boundary. Both Example 1 and Example 2 in Table 1, for instance, remain white and male. However, Example 3 requires substantial changes to almost every variable to convince the classifier to change its decision.
More fine-grained analysis could involve comparing the log-probability values passed to the Gumbel-Softmax layer directly, and regularising these outputs. A second approach, in a setting where we had access to a learned embedding for categorical variables, might be to use the distance travelled in that embedding space from the original value to the perturbed value as a proxy for the size of the perturbation. We intend to pursue these avenues in further work.
One concern about our method might be that, because gradient-based optimization can converge to a local minimum, the outputs for the same input, or for two inputs that differ only slightly, can change drastically. Our initial approach to using the differentiable surrogate was to use the gradients with respect to the surrogate’s input directly, such that the direction of minimal perturbation required to flip the classifier was taken to be the local gradient of the negative loss with respect to the input. In development, however, we ran into exactly the problem described above - multiple minima led to instability on retraining, particularly along decision boundaries of the underlying classifier.
We found that adding a second network trained to output the perturbations directly helped smooth these instabilities out substantially, particularly when the output of the second network was heavily regularised. (Having a second network is also slightly faster; we get an explanation with a single forward pass, rather than a forward pass and a backwards pass). For example, the mean cosine similarity between perturbations output by two runs of EMAP on the MNIST dataset is 0.9497.
Combined with our approach’s ability to highlight areas of potential instability, we consider it to demonstrate reasonable robustness, at least on the presented datasets. With regard to optimization algorithms, on our data we observe little difference as long as both networks train.
Whilst we have compared ourselves to the literature on model-agnostic instance-wise explanation, we are not necessarily in competition with it. EMAP can be thought of as an additional tool in the model development toolbox; useful both for its speed, and its ability to indicate regions of data space where further investigation of the behaviour of the underlying classifier is warranted.
EMAP’s speed is one of its primary assets; where sampling based explanation methods may be too slow to provide instance-wise explanations of a large dataset in a reasonable amount of time, once trained EMAP merely needs a single forward pass to output a perturbation vector. As data can be batched before input, EMAP can handle large numbers of instances rapidly.
Secondly, EMAP can be thought of as a novel approach in that it proposes minimal adversarial perturbations as a useful explanatory tool. This aligns it with the literature on counterfactuals as explanations [19], as the minimal adversarial perturbation can also be thought of as the minimal counterfactual direction - the direction in which one could perturb an instance to cause a classifier to change its classification.
Thirdly, EMAP provides novel functionality with its ability to highlight regions of space of potential interest to a user, or that pose potential problems for the underlying classifier.
Finally, as we have shown in continuous feature spaces, EMAP also produces results comparable (under a change of sign) to the output of additive feature attribution methods, such as LIME and SHAP. [Footnote 4: We should be clear that EMAP is not itself an additive feature attribution method.] This means that we can think of EMAP as an empirical demonstration of the relationship between two distinct paradigms of explanation; that the vector of feature contributions to the output of some model for some instance is the negative of the direction of perturbation to that instance required to recover its nearest counterfactual.
[1] (2018) Learning to explain: an information-theoretic perspective on model interpretation. arXiv preprint arXiv:1802.07814.
[2] (2017) Real time image saliency for black box classifiers. In Advances in Neural Information Processing Systems, pp. 6967–6976.
[3] (2018) Explanations based on the missing: towards contrastive explanations with pertinent negatives. In Advances in Neural Information Processing Systems, pp. 592–603.
[4] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
[5] (2017) Interpretable explanations of black boxes by meaningful perturbation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3429–3437.
[6] (2015) Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.
[7] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[8] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[9] (2016) The mythos of model interpretability. arXiv preprint arXiv:1606.03490.
[10] (2018) Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888.
[11] (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774.
[12] (2019) Interpretable machine learning. bookdown.
[13] (2018) Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73, pp. 1–15.
[14] (2017) Feature visualization. Distill 2 (11), pp. e7.
[15] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[16] (2016) Why should I trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144.
[17] (2018) Anchors: high-precision model-agnostic explanations. In AAAI Conference on Artificial Intelligence.
[18] (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3145–3153.
[19] (2017) Counterfactual explanations without opening the black box: automated decisions and the GDPR. Harv. JL & Tech. 31, pp. 841.
[20] (2019) Measurable counterfactual local explanations for any classifier. arXiv preprint arXiv:1908.03020.