Causal Explanation (CXPlain) is a method for explaining the predictions of any machine-learning model.
Feature importance estimates that inform users about the degree to which given inputs influence the output of a predictive model are crucial for understanding, validating, and interpreting machine-learning models. However, providing fast and accurate estimates of feature importance for high-dimensional data, and quantifying the uncertainty of such estimates, remain open challenges. Here, we frame the task of providing explanations for the decisions of machine-learning models as a causal learning task, and train causal explanation (CXPlain) models that learn to estimate to what degree certain inputs cause outputs in another machine-learning model. CXPlain can, once trained, be used to explain the target model in little time, and enables the quantification of the uncertainty associated with its feature importance estimates via bootstrap ensembling. We present experiments that demonstrate that CXPlain is significantly more accurate and faster than existing model-agnostic methods for estimating feature importance. In addition, we confirm that the uncertainty estimates provided by CXPlain ensembles are strongly correlated with their ability to accurately estimate feature importance on held-out data.
Explanation methods for machine-learning models play an important role in researching, developing, and using predictive models, as information on what features were important for a given output enables us to better understand, validate, and interpret model decisions Shrikumar et al. (2017); Lipton (2016); Kindermans et al. (2017); Smilkov et al. (2017); Doshi-Velez and Kim (2017). However, complex models, such as ensemble models and deep neural networks, are often difficult to interrogate. To address this apparent dichotomy between performance and interpretability Lundberg and Lee (2017), researchers have developed a number of attribution methods that provide estimates of the importance of input features towards a model’s output for specific types of models Baehrens et al. (2010); Simonyan et al. (2014); Zeiler and Fergus (2014); Smilkov et al. (2017); Sundararajan et al. (2017); Xu et al. (2015); Choi et al. (2016); Schwab et al. (2017, 2019); Schwab and Karlen (2019), and for any machine-learning model Ribeiro et al. (2016); Lundberg and Lee (2017).
However, providing fast and accurate feature importance estimates for any machine-learning model is challenging because there exists a wide variety of intricate machine-learning models with different underlying model structures, algorithms, and decision functions, which makes it difficult to develop an optimised and unified approach to importance attribution. Furthermore, importance estimates of state-of-the-art methods are typically associated with significant uncertainty Kindermans et al. (2017); Adebayo et al. (2018); Fen et al. (2019); Ghorbani et al. (2019), and it is therefore difficult for users to judge when importance estimates can be expected to be accurate.
In this work, we present a new approach to estimating feature importance for any machine-learning model using causal explanation (CXPlain) models. CXPlain uses a causal objective to train a supervised model to learn to explain another machine-learning model. This approach can be applied to any machine-learning model, since it has no requirements on the predictive model to be explained. In particular, it does not require retraining or adapting the original model. We demonstrate experimentally that CXPlain is significantly more accurate than most existing methods, fast, and able to produce accurate uncertainty estimates. Source code is available at https://github.com/d909b/cxplain.
Contributions. This work contains the following contributions:
We introduce causal explanation (CXPlain) models, a new method for learning to accurately estimate feature importance for any machine-learning model.
We present a methodology based on bootstrap resampling for deriving uncertainty estimates for the feature importance scores provided by CXPlain.
Our experiments show that CXPlain is significantly more accurate and significantly faster (at evaluation time) than existing model-agnostic methods, and that the uncertainty estimates for its assigned feature importance scores are strongly correlated with the accuracy of the provided importance scores on previously unseen test data.
Table 1. Methods compared: CXPlain | SG Simonyan et al. (2014) / IG Sundararajan et al. (2017) | DeepSHAP Shrikumar et al. (2017); Lundberg and Lee (2017) | LIME Ribeiro et al. (2016) | SHAP Lundberg and Lee (2017)
Existing methods for feature importance estimation can be subdivided into (1) gradient-based methods, (2) methods based on sensitivity analysis, (3) methods that measure the change in model confidence when removing input features, and (4) mimic models. Simple Gradient (SG) Simonyan et al. (2014), Integrated Gradients (IG) Sundararajan et al. (2017), DeepLIFT Shrikumar et al. (2017), and DeepSHAP Lundberg and Lee (2017) are examples of gradient-based methods. Gradient-based methods are only applicable to differentiable models, such as neural networks, and their computation is typically fast. Methods that quantify a model’s sensitivity to changes in the input, such as LIME Ribeiro et al. (2016) or SHAP Lundberg and Lee (2017), and more specifically Kernel SHAP, are applicable to any machine-learning model but typically slow to compute, as large numbers of model evaluations are necessary to assess a model’s sensitivity. Methods based on masking parts of the input and measuring the model’s resulting change in confidence Štrumbelj et al. (2009) include conditional multivariate models for visualising deep neural networks Zintgraf et al. (2017), analysing the effects of erasing parts of their representations Li et al. (2016), image interpretation by identifying the regions for which the model most strongly responds to perturbations Fong and Vedaldi (2017), and image masking models trained to manipulate the outputs of a predictive model by occluding parts of the input Dabkowski and Gal (2017). The fourth main category of approaches to explaining model decisions is to train interpretable models that mimic the decisions of a black-box model that we wish to explain. Tree- Schwab and Hlavacs (2015); Che et al. (2016); Bastani et al. (2017) and rule-based Andrews et al. (1995) models have been used as mimic models. However, mimic models are not guaranteed to match the behavior of the original model. 
Besides these four established categories of feature importance estimation methods, structural causal models (SCMs) Chattopadhyay et al. (2019) and Deep Taylor Decomposition (DTD) Montavon et al. (2017) have also recently been proposed as explanation methods. However, these methods are designed for specific types of models. In addition, the L2X method that uses a variational approximation of mutual information Chen et al. (2018) and Bayesian nonparametrics Guo et al. (2018) have been proposed to explain a target model. Tsang et al. (2017) detected statistical interactions by interpreting the weights learned in neural networks. Beyond feature attribution, testing with concept activation vectors (TCAV) Kim et al. (2018) was proposed to visualise the internal state of deep learning models, and influence functions Koh and Liang (2017) have been used to identify the training data most responsible for a given model decision. A major limitation of most existing methods for feature importance estimation is that they do not inform users when their estimates are significantly uncertain and cannot be expected to be accurate.
Although reliability is necessary for model explanations to be trustworthy, relatively few studies have been concerned with quantifying the uncertainty and robustness of explanation methods. For example, it has been shown that multiple importance estimation methods incorrectly attribute when a constant vector shift is applied to the input Kindermans et al. (2017), that the attributions provided by interpretation methods may themselves contain significant uncertainty Fen et al. (2019), that some explanation methods are independent of both the model and the data-generating process and, thus, can not be relied upon for important interpretation tasks Adebayo et al. (2018), and that imperceptibly small perturbations of the input can significantly alter the explanations provided by state-of-the-art explanation methods without changing the explained model’s prediction Ghorbani et al. (2019). These studies highlight the importance of informing users when a given explanation is uncertain and should be discounted.
In contrast to existing works, CXPlain is an explanation model trained with a causal objective to learn to explain the decisions of any machine-learning model without the need to retrain, adapt, or have in-depth knowledge of the explained model. To the best of our knowledge, CXPlain is the first feature importance estimation method that is simultaneously (1) significantly more accurate than most existing methods, (2) compatible with any machine-learning model and data modality, (3) able to provide uncertainty estimates via bootstrap resampling, and (4) fast at evaluation time (Table 1).
We consider a setting in which we are given a predictive model $\hat{f}$ which processes inputs $X$ consisting of $p$ input features $x_i$, or groups of features, with $i \in [0, p-1]$ to produce outputs $\hat{y} \in \mathbb{R}^k$ of any dimensionality $k$. The predictive model $\hat{f}$ is scored according to an objective function $\mathcal{L}(y, \hat{y})$ that computes a scalar loss after comparing the model’s predictive output $\hat{y}$ to a ground-truth output $y$. The mean squared error (MSE) for regression models and the categorical crossentropy for classification models are commonly used examples of such objectives. We note that we specifically do not require access to, or knowledge of, the process by which $\hat{f}$ produces its output, nor do we require $\hat{f}$ to be differentiable or of any specific form. Additionally, we are given $N$ independent and identically distributed (i.i.d.) pairs $(X, y)$ of sample covariates and ground-truth outputs as training data. Given this setting, our goal is to train an explanation model $\hat{f}_{\text{exp}}$ that produces accurate estimates $\hat{A}$ with $p$ elements $\hat{a}_i$ corresponding to the importances assigned to each of the $p$ input features $x_i$ of the predictive model $\hat{f}$.
The main idea behind CXPlain is to train a separate explanation model to explain the predictive model (Figure 1). This flexible framework has the advantage that we do not need to retrain or adapt the predictive model to explain its decisions. To train the explanation model, we utilise a causal objective function that quantifies the marginal contribution of either a single input feature or group of input features towards the predictive model’s accuracy Štrumbelj et al. (2009); Schwab et al. (2019). This approach, in essence, transforms the task of producing feature importance estimates for a given predictive model into a supervised learning task that we can address with existing supervised machine-learning models.
The core component of CXPlain is the causal objective that enables us to optimise explanation models to learn to explain another predictive model. The causal objective we build on was first introduced to jointly learn to produce accurate predictions and estimates of feature importance in a single neural network model Schwab et al. (2019). However, the original formulation of the causal objective required a specific attentive mixture of experts architecture. In this work, we contribute an adapted version of the causal objective from Schwab et al. (2019) that does not require a specific model structure, and that can be used to train explanation models to learn to explain any machine-learning model. The causal objective introduced in Schwab et al. (2019) was based on the Humean definition of causality used by Granger (1969), who defined a causal relationship $x_i \to y$ between random variables $x_i$ and $y$ to exist if we are better able to predict $y$ using all available information than if the information apart from $x_i$ had been used Schwab et al. (2019), i.e. if the absence of $x_i$ as a feature decreases our ability to predict $y$. Granger (1969)’s definition of causality was based on two key assumptions: (1) that our set of available variables contains all relevant variables for the causal problem being modelled, and (2) that $x_i$ temporally precedes $y$ Granger (1969). In the general setting, these assumptions cannot be verified from observational data Stone (1993). However, in our specific setting, we know a priori that the inputs of the predictive model mathematically always precede its output, and that the explained model’s output, on deterministic hardware and software, is not influenced by variables other than those present in its set of input features. We can therefore use the given definition to quantify the degree to which an input feature $x_i$ caused a marginal improvement in the predictive performance of the predictive model $\hat{f}$. Given input covariates $X$, we therefore denote $\varepsilon_{X \setminus \{i\}}$ as the predictive model’s error without including any information from the $i$th input feature and $\varepsilon_X$ as the predictive model’s error when considering all available input features. To calculate $\varepsilon_{X \setminus \{i\}}$ and $\varepsilon_X$, we first compute the outputs $\hat{y}_{X \setminus \{i\}}$ and $\hat{y}_X$ of the predictive model $\hat{f}$ without and with the $i$th input feature $x_i$, respectively:

$$\hat{y}_{X \setminus \{i\}} = \hat{f}(X \setminus \{i\}), \qquad \hat{y}_X = \hat{f}(X)$$

We then define the marginal contribution $\Delta\varepsilon_{X,i}$ of the $i$th input feature as the decrease in error

$$\Delta\varepsilon_{X,i} = \varepsilon_{X \setminus \{i\}} - \varepsilon_X$$

using the predictive model’s loss function $\mathcal{L}$ to calculate $\varepsilon_{X \setminus \{i\}}$ and $\varepsilon_X$:

$$\varepsilon_{X \setminus \{i\}} = \mathcal{L}(y, \hat{y}_{X \setminus \{i\}}), \qquad \varepsilon_X = \mathcal{L}(y, \hat{y}_X)$$

Lastly, we normalise the importance scores to relative contributions with Schwab et al. (2019):

$$\omega_i(X) = \frac{\Delta\varepsilon_{X,i}}{\sum_{j=0}^{p-1} \Delta\varepsilon_{X,j}}$$

where $p$ is the number of input features. We then arrive at our causal objective Schwab et al. (2019)

$$\mathcal{L}_{\text{causal}} = \frac{1}{N} \sum_{l=0}^{N-1} \mathrm{KL}(\Omega_{X_l}, \hat{A}_{X_l})$$

that aims to minimise, averaged over the $N$ training samples, the Kullback-Leibler (KL) divergence Kullback (1997) between the target importance distribution $\Omega$ with $\Omega(i) = \omega_i(X)$ for a given sample $X$, and the distribution of importance scores $\hat{A}$ with $\hat{A}(i) = \hat{a}_i$ as estimated by the explanation model $\hat{f}_{\text{exp}}$ based on $X$. Using $\mathcal{L}_{\text{causal}}$, we can train supervised learning models to learn to explain any other machine-learning model based solely on its outputs, and without the need to retrain the model to be explained. Precomputing the importances $\Omega$ requires one masked evaluation of the target predictive model per input feature and training sample at training time, i.e. on the order of $Np$ evaluations. For high-dimensional images, it is sensible to group non-overlapping regions of adjacent pixels into feature groups, since removing single pixels in high-dimensional images is unlikely to strongly affect a predictive model’s output Zintgraf et al. (2017). This also significantly limits the number of feature groups for which importances have to be precomputed. We note that estimating $\hat{A}$ is not necessary in situations in which ground truth labels $y$ are readily available, e.g. during model development. In those situations, $\Omega$ can directly be used to explain $\hat{f}$.
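The computation of the target importances and of the KL term for a single sample can be sketched as follows. This is a minimal illustration in plain Python and not the authors' implementation: the function names, the toy model, the choice of masking a feature by overwriting it with a constant (`mask_value`), and the clipping of error differences at a small positive `floor` (so that the normalised scores form a valid distribution) are all assumptions made for this example.

```python
import math


def masked_losses(predict, loss, x, y, mask_value=0.0):
    """Return the error with all features and the errors with each feature masked."""
    eps_full = loss(y, predict(x))
    eps_without = []
    for i in range(len(x)):
        x_masked = list(x)
        # Assumption: "removing" feature i means overwriting it with a constant.
        x_masked[i] = mask_value
        eps_without.append(loss(y, predict(x_masked)))
    return eps_full, eps_without


def target_importances(eps_full, eps_without, floor=1e-12):
    """Normalise the error increases caused by masking into relative contributions."""
    # Assumption: differences are clipped at a small positive floor so that
    # the result is non-negative and sums to one.
    deltas = [max(e - eps_full, floor) for e in eps_without]
    total = sum(deltas)
    return [d / total for d in deltas]


def kl_divergence(omega, a_hat, floor=1e-12):
    """KL divergence between target importances and estimated importances."""
    return sum(
        w * math.log(w / max(a, floor))
        for w, a in zip(omega, a_hat)
        if w > 0.0
    )


# Hypothetical black-box model: the second feature has no influence on the output.
def toy_model(x):
    return 2.0 * x[0] + 0.0 * x[1] + 1.0 * x[2]


def squared_error(y, y_hat):
    return (y - y_hat) ** 2


eps_full, eps_without = masked_losses(toy_model, squared_error, [1.0, 5.0, 1.0], 3.0)
omega = target_importances(eps_full, eps_without)
uniform = [1.0 / 3.0] * 3
print(omega)
print(kl_divergence(omega, uniform))
```

In this toy case, masking the first feature increases the squared error by 4 and masking the third by 1, so the target importances are approximately 0.8, 0 and 0.2. An explanation model would be trained to output estimates that minimise the KL term against such targets.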
In principle, any supervised machine learning model that can be trained with a custom objective could be used as a causal explanation model. In this work, we focus on neural explanation models. Using deep neural networks as causal explanation models has the advantage that these models are able to extract high-level feature representations from high-dimensional and unstructured data Goodfellow et al. (2016), and thus remove the need to perform manual feature engineering. We leave the exploration of other classes of explanation models to future work. A priori, it is not clear which architectures would be most suitable to be used in neural explanation models. Absent any prior knowledge about the structure of the input data, multilayer perceptrons (MLPs) are likely a sensible default choice. However, since architectures that exploit the spatial or temporal structure of input data have been shown to be efficacious, we reason that, depending on the data modality of the input features of the model to be explained, special-purpose architectures, such as convolutional neural networks Szegedy et al. (2016) for images and attentive neural networks for texts Kaiser et al. (2017), could perform better than MLPs. In particular, U-nets Ronneberger et al. (2015) that have been designed for image segmentation, a task that involves mapping input pixels to segmentation labels, may perform well as causal explanation models for images since segmentation is semantically similar to explanation, which involves mapping input pixels to importance scores. To determine whether or not specialised model architectures can achieve better performances in neural explanation models, we experimentally evaluate both MLPs and U-nets.
In addition to producing accurate estimates of feature importance, we wish to provide uncertainty estimates that quantify the uncertainty associated with each individual feature importance estimate $\hat{a}_i$ produced by a CXPlain model. In particular, we would like to calculate confidence intervals $\mathrm{CI}_{i,\gamma} = [c_i, d_i]$ with lower bounds $c_i$ and upper bounds $d_i$ at confidence level $\gamma$ for each assigned feature importance estimate $\hat{a}_i$. The width $d_i - c_i$ of $\mathrm{CI}_{i,\gamma}$ can subsequently be used to quantify the uncertainty of $\hat{a}_i$. To derive uncertainty estimates for causal explanation models, we propose the use of bootstrap ensemble methods, specifically using bootstrap resampling Efron (1982); Breiman (2001). To train bootstrap ensembles of causal explanation models, we first draw $N$ training samples at random with replacement from the original training set. We then train an explanation model using the aforementioned causal objective until convergence on the selected subset of the training set. We repeat this process $M$ times to obtain a bootstrap ensemble of $M$ explanation models (Algorithm in Appendix B). We use the median of the attributions of the $M$ ensemble members as the assigned importance of the bootstrap ensemble, and the $\frac{1-\gamma}{2}$ and $\frac{1+\gamma}{2}$ quantiles as lower and upper bounds $c_i$ and $d_i$ of its CI, respectively. The efficacy of bootstrap ensembles for estimating the uncertainty in outputs of neural networks has been demonstrated in, e.g., Lakshminarayanan et al. (2017), but this work is, to the best of our knowledge, the first to consider using bootstrap ensembles of explanation models to quantify the uncertainty in assigned importance scores. We note that Monte Carlo dropout Gal and Ghahramani (2016), which uses dropout Srivastava et al. (2014) at evaluation time, is an alternative method for estimating uncertainty for the outputs of neural networks that does not require explicitly training an ensemble of models, but may not always produce uncertainty estimates of the same quality as ensembles Lakshminarayanan et al. (2017).
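The bootstrap procedure described above can be sketched in a few lines. This is a hedged illustration rather than the algorithm from Appendix B: `train_fn` stands in for training an explanation model with the causal objective, the linear-interpolation `quantile` helper is one of several common quantile conventions, and `toy_train_fn` is a hypothetical placeholder whose "explanation" depends only on the resampled data.

```python
import random
import statistics


def bootstrap_ensemble(train_fn, dataset, num_models, seed=0):
    """Train num_models explainers, each on a resample drawn with replacement."""
    rng = random.Random(seed)
    n = len(dataset)
    models = []
    for _ in range(num_models):
        resample = [dataset[rng.randrange(n)] for _ in range(n)]
        models.append(train_fn(resample))
    return models


def quantile(sorted_values, q):
    """Quantile by linear interpolation (assumed convention for this sketch)."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def ensemble_importances(models, x, confidence=0.95):
    """Return per-feature medians and (lower, upper) confidence bounds."""
    attributions = [model(x) for model in models]
    lower_q = (1.0 - confidence) / 2.0
    upper_q = (1.0 + confidence) / 2.0
    medians, intervals = [], []
    for i in range(len(attributions[0])):
        values = sorted(a[i] for a in attributions)
        medians.append(statistics.median(values))
        intervals.append((quantile(values, lower_q), quantile(values, upper_q)))
    return medians, intervals


# Hypothetical stand-in for training an explanation model on a resample.
def toy_train_fn(samples):
    mean = sum(samples) / len(samples)
    return lambda x: [mean, 1.0 - mean]


models = bootstrap_ensemble(toy_train_fn, [0.1, 0.2, 0.4, 0.7, 0.9], num_models=25)
medians, intervals = ensemble_importances(models, x=None)
widths = [upper - lower for lower, upper in intervals]
print(medians, widths)
```

The interval widths play the role of the per-feature uncertainty: features on which the ensemble members disagree receive wide intervals.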
Our experiments aimed to answer the following questions:
How does the feature importance estimation performance of CXPlain compare to that of existing state-of-the-art methods?
How does the computational performance of CXPlain compare to existing model-agnostic and model-specific methods for feature importance estimation?
Are uncertainty estimates computed via bootstrap resampling of CXPlain models qualitatively and quantitatively correlated with their ability to accurately determine feature importance?
To answer these questions, we performed extensive experiments on several benchmarks that compare both the computational and the estimation performance of CXPlain to existing state-of-the-art methods for feature importance estimation. To enable a meaningful comparison, we focus most of our experiments on image classification tasks, where we are best able to visualise and quantify the performance of feature importance estimation methods, and on neural network models as models to be explained, since most existing model-specific attribution methods that we wish to compare to were developed exclusively for neural networks. However, we note that CXPlain as a method is compatible with any machine-learning model, data modality, and both regression and classification tasks. We used Mann–Whitney–Wilcoxon (MWW) tests Hollander and Wolfe (1973) to calculate $p$-values for the main comparisons.
To compare the accuracy of CXPlain to existing state-of-the-art methods for feature importance estimation, we evaluated its ability to identify important features in MNIST LeCun et al. (2010) and ImageNet Deng et al. (2009) images. To do so, we followed the experimental design first proposed by Shrikumar et al. (2017), and trained binary classification models to distinguish between two digit types (8 vs. 3) on MNIST (model accuracy: ), and two object categories (Gorilla vs. Zebra) on ImageNet (model accuracy: ). As a preprocessing step, pixel values were scaled to be in the range of prior to training. We then used several importance estimation methods to determine which input pixels were most important for the classification models’ decisions on test images. We masked the top 10% and 30% of those most important pixels for MNIST and ImageNet, respectively, and measured the resulting change in the classification models’ confidences by computing the difference in log odds

$$\Delta\text{log-odds} = \text{log-odds}(p) - \text{log-odds}(p_m)$$

where $\text{log-odds}(p) = \log\left(\frac{p}{1-p}\right)$, and $p$ and $p_m$ are the classification models’ outputs for the original image and the masked image with the top pixels removed, respectively. To ensure that the explanations of all methods are on the same scale, we normalised them to the range of using the transformation . We plotted the assigned importances and the resulting masked images to qualitatively assess each method’s ability to determine the salient features in the original image (Figures 4 and 5). We additionally recorded the mean and standard deviation of the time taken (in seconds) to compute the feature importance estimates for each method on the same hardware (Appendix C) over 10 and 5 runs with the same parameters and random seed for MNIST and ImageNet, respectively (Figures 6 and 7). Further training details are given in Appendix A.
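The masking-based evaluation metric can be illustrated with a short sketch. The helper names are hypothetical, the clamping of probabilities away from 0 and 1 is an assumption added for numerical safety, and masking by overwriting pixels with `mask_value=0.0` is likewise an assumption of this example rather than a detail confirmed by the text.

```python
import math


def log_odds(p, eps=1e-12):
    """Log odds of a probability, clamped away from 0 and 1 (assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def delta_log_odds(p_original, p_masked):
    """Change in model confidence caused by masking the most important pixels."""
    return log_odds(p_original) - log_odds(p_masked)


def mask_top_fraction(pixels, importances, fraction, mask_value=0.0):
    """Overwrite the given fraction of pixels with the highest importance."""
    k = int(round(fraction * len(pixels)))
    top = sorted(range(len(pixels)), key=lambda i: importances[i], reverse=True)[:k]
    masked = list(pixels)
    for i in top:
        masked[i] = mask_value
    return masked


pixels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
importances = [0.0] * 9 + [1.0]
masked = mask_top_fraction(pixels, importances, fraction=0.1)
print(masked)
# A drop in confidence from 0.9 to 0.5 corresponds to a difference of log(9).
print(delta_log_odds(0.9, 0.5))
```

A larger difference in log odds indicates that the masked pixels mattered more to the classifier, which is why it serves as a proxy for the accuracy of an importance ranking.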
To quantitatively and qualitatively assess the accuracy of the uncertainty estimates provided by bootstrap ensembles of CXPlain models, we analysed whether their uncertainty estimates are correlated with their errors in feature importance estimation on held-out MNIST test samples. We evaluated several numbers of bootstrap resampled models in order to determine how the number of ensemble members affects the uncertainty estimation performance of bootstrap ensembles of CXPlain models. In addition, we also evaluated the performance of randomly selected uncertainty estimates as a baseline for comparison. In general settings, it is difficult to evaluate uncertainty estimates for feature importance estimation methods, since we typically do not have per-feature ground-truth attributions to evaluate against. However, by comparing the ranking implied by the ground-truth change in log-odds to the ranking implied by the explanation model, we are able to define a rank error $\mathrm{RE}_i$ for each input feature $x_i$. Formally, the rank error $\mathrm{RE}_i$ is the difference in rank between the true importance $\omega_i$ implied by the change in log-odds, and the estimated importance $\hat{a}_i$ implied by the explanation model, where the rank of an importance score is its position among the $p$ scores of the respective ranking.
As the correlation metric, we used Pearson’s correlation coefficient to measure the correlation between the rank error RE and the uncertainty estimates defined by the bootstrap resampled CIs for each importance estimate among the top-ranked pixels by log-odds across unseen images from the MNIST test set. We limited the evaluation to all pixels with a change in log-odds greater than 0. If our uncertainty estimates are well calibrated, we would expect to see a high correlation between the uncertainty estimates and the magnitude of rank errors RE, since that would indicate that the uncertainty estimates accurately quantify how certain the feature importance estimates are on previously unseen sample images. For the comparison of the resulting distributions of correlation scores, we applied the Fisher z-transform to the correlation scores in order to correct for the skew in the distribution of the sample correlation Silver and Dunlap (1987). Figure 9 depicts visualisations of the calculated ground-truth log odds, the rank errors of the explanation model’s importance estimates, and the uncertainty for each importance estimate for three test set images. We used the same hyperparameters as in the previous experiment to train the ensembled CXPlain (MLP) models (Appendix A).
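The evaluation of uncertainty estimates described above combines three steps: ranking, correlation, and the Fisher z-transform. A minimal sketch, under the assumptions that ranks start at 0 for the most important feature, that ties are broken by index, and that the magnitude (absolute value) of the rank difference is used, could look like this. The numbers are made up for illustration.

```python
import math


def ranks(scores):
    """Rank 0 is the highest score; ties are broken by index (assumption)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def rank_errors(true_scores, estimated_scores):
    """Magnitude of the rank difference per feature (assumed to be absolute)."""
    true_ranks = ranks(true_scores)
    estimated_ranks = ranks(estimated_scores)
    return [abs(t - e) for t, e in zip(true_ranks, estimated_ranks)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def fisher_z(r):
    """Fisher z-transform, which reduces the skew of sample correlations."""
    return math.atanh(r)


# Illustrative values only: true and estimated importances, and CI widths.
errors = rank_errors([0.5, 0.3, 0.2], [0.2, 0.3, 0.5])
ci_widths = [0.4, 0.1, 0.5]
r = pearson(errors, ci_widths)
print(errors, r, fisher_z(r))
```

In this made-up example, the two features whose ranks are swapped also have the widest intervals, so the correlation between rank error and uncertainty is high.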
We found that, on the MNIST benchmark, CXPlain (U-net) was competitive with the best competing state-of-the-art feature importance estimation method, DeepSHAP. We also found that CXPlain (U-net) produced significantly (MWW) more accurate feature importance estimates than CXPlain (MLP), indicating that model architectures specifically tailored for the image domain are more effective than MLPs in neural explanation models (Figure 2). On the ImageNet benchmark, CXPlain significantly (MWW) outperformed the best competing feature importance estimation method, LIME (Figure 3). We also found that the model-specific attribution methods Simple Gradient and Integrated Gradients performed relatively poorly across both benchmarks, and were consistently outperformed by the model-agnostic attribution methods, CXPlain, and DeepSHAP. Qualitatively, we found that the estimates of feature importance provided by CXPlain were more focused on the subjectively more important semantic regions of the sample images from both MNIST and ImageNet (Figures 4 and 5; more in Appendix D). Other methods, in contrast, produced more superfluous attributions. This behavior is exhibited in Figure 5, where SHAP and LIME both attribute significant importance to the wall behind the gorilla, whereas CXPlain focused nearly all its attention on the gorilla itself, with the exception of the window frame receiving some importance outside the top 30% of importances of that sample image. We believe this could be due to the fact that the causal objective strongly penalises attributions outside regions of interest, leading to qualitatively more focused estimates of importance.
In terms of computational performance, we found that CXPlain computed feature importance estimates significantly faster than the state-of-the-art model-agnostic attribution methods, LIME and SHAP, on both the MNIST and ImageNet benchmarks (Figures 6 and 7). Gradient-based attribution methods and CXPlain performed similarly. On ImageNet, the gap between LIME and SHAP and the faster methods was considerably larger than on MNIST, since the large numbers of model evaluations for LIME and SHAP were slower on higher-dimensional images.
We found that, quantitatively, even relatively small CXPlain ensembles of bootstrap resampled models produce uncertainty estimates that are significantly (MWW, compared to Random) correlated with their ability to accurately estimate feature importances on previously unseen test images (Figure 8). We also found that increasing the size of the bootstrap ensemble further significantly (MWW) increases this correlation, and, thus, the quality of the provided uncertainty estimates. Qualitatively, there was a high visual similarity between the uncertainty estimates provided by the CXPlain ensembles for each input feature and the magnitude of rank errors RE committed by their importance estimates (Figure 9). The large differences in importance estimation accuracy between state-of-the-art feature importance estimation methods shown in the MNIST and ImageNet benchmarks indicate that many of the importance estimates they provide are not truthful to the predictive model to be explained, and that measures of uncertainty are necessary to fully understand the expected reliability of feature importance estimates.
A limitation of CXPlain models is that, while they are fast at evaluation time, they have to be trained to learn to explain a predictive model. However, this one-off compute cost typically amortises quickly, since CXPlain is significantly faster at evaluation time than existing model-agnostic importance estimation methods. Another important point to note is that the associations identified by CXPlain models are only causal in the sense that they quantify the degree to which the input features caused a marginal improvement in the predictive performance of the predictive model. Associations reported by CXPlain, in particular, do not in any way indicate that there is a causal relationship between the explained model’s input and output variables in the real world.
We presented CXPlain, a new method for learning to estimate feature importance for any machine-learning model. CXPlain is based on the idea of training a separate explanation model to learn to estimate which features are important for a given output of a target predictive model using a causal objective. This approach has several advantages over existing ones: It is compatible with any machine-learning model, can produce estimates of feature importance quickly after training, and may be combined with bootstrap resampling to obtain uncertainty estimates for the provided feature importance scores. We showed experimentally that CXPlain is significantly more accurate in estimating feature importance than existing model-agnostic methods on both MNIST and ImageNet benchmarks, while being orders of magnitude faster at providing importance estimates than state-of-the-art model-agnostic methods. We also found that, analogous to standard supervised learning tasks, special-purpose model architectures may improve the performance of neural explanation models in images, and that the bootstrap resampled uncertainty estimates for the importance scores of an explanation model are significantly correlated with CXPlain’s ability to accurately estimate feature importance - indicating that bootstrap resampling is a suitable approach for quantifying the uncertainty of importance estimates. Causal explanation models that both produce accurate estimates of feature importance and their uncertainties quickly for any machine-learning model and data modality may enable users to better understand, validate, and interpret machine-learning models, while also informing them when their explanations can not be expected to be accurate.
This work was partially funded by the Swiss National Science Foundation (SNSF) project No. 167302 within the National Research Program (NRP) “Big Data”. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPUs used for this research. Patrick Schwab is an affiliated PhD fellow at the Max Planck ETH Center for Learning Systems. We additionally thank the anonymous reviewers whose comments helped improve this manuscript.