As the applications of deep neural networks continue to expand, the intrinsic black-box nature of neural networks creates a potential trust issue. For application domains with a high cost of prediction error, such as healthcare [Phan2017], it is necessary that human users can verify that a model learns a reasonable representation of the data and that the rationale for its decisions is justifiable according to societal norms [DBLP:conf/icml/KohL17, fong_cvpr_2018, BoleiZhou2018, Lipton2016, Langley2019].
An interpretable model, such as a linear sparse regression, lends itself readily to model explanation. Yet due to limited capacity, these interpretable models cannot approximate the behavior of neural networks globally. A natural solution, as utilized by LIME [lime:kdd16], is to develop multiple interpretable models, each approximating the large neural network locally on a small region of the data manifold. Global explanations can be obtained by extracting common explanations from multiple local approximations. However, how to best combine local approximations remains an open problem.
Extending this line of research, we propose a novel and simple feature scoring metric, NormLIME, which estimates the importance of features based on local model explanations. In this paper, we empirically verify the new metric using two complementary tests. First, we examine whether the NormLIME explanations agree with human intuition. In a user study, participants favored the proposed approach over three baselines (LIME, SmoothGrad, and VarGrad), with NormLIME receiving 30% more votes than all the baselines combined. Second, we numerically examine whether explanations created by NormLIME accurately capture characteristics of the machine learning problem at hand, using the same intuition proposed by [Evaluating_Feature_Importance]. Empirical results indicate that NormLIME identifies features vital to the classification performance more accurately than several existing methods. In summary, we find strong empirical support for our claim that NormLIME provides accurate and human-understandable explanations for deep neural networks.
The paper makes the following contributions:
We propose a simple yet effective extension of LIME, called NormLIME, for aggregating interpretations around local regions on the data manifold to create global and class-specific interpretations. The new technique outperforms LIME and other baselines in two complementary evaluations.
We show how feature importance from LIME can be aggregated to create class-specific interpretations, which stand between the fine-grained interpretation at the level of data points and the global interpretation at the level of entire datasets, enabling a hierarchical understanding of machine learning models. The user study indicates that NormLIME excels at this level of interpretation.
A machine learning model can be interpreted from the perspective of how much each input feature contributes to a given prediction. In computer vision, this type of interpretation is often referred to as saliency maps or attribution maps. A number of interpretation techniques, such as SmoothGrad [sg:icml_workshop], VarGrad [vg:nips2018], Integrated Gradients [Sundararajan:IG2017], Guided Backpropagation [Springenberg:2015:guided-bp], Guided GradCAM [Selvaraju2017], and Deep Taylor Expansion [Montavon:2017:deeptaylor], exploit gradient information, as it provides a first-order approximation of the input's influence on the output [Simonyan2013, Ancona2018]. [vg:icml_higher_derivitives] analyze the theoretical properties of SmoothGrad and VarGrad. When the gradient is not easy to compute, [Baehrens2010HowTE] place Parzen windows around data points to approximate a Bayes classifier, from which gradients can be derived. DeepLift [shrikumar17] provides a gradient-free method for saliency maps. Though gradient-based techniques can interpret individual decisions, aggregating individual interpretations for a global understanding of the model remains a challenge.
Local interpretations are beneficial when the user is interested in understanding a particular model decision. They become less useful when the user wants a high-level overview of the model's behavior. This necessitates the creation of global interpretations. LIME [lime:kdd16] first builds multiple sparse linear models that approximate a complex model around small regions on the data manifold. The weights of the linear models can then be aggregated to construct a global explanation using Submodular Pick LIME (SP-LIME). [anchors:aaai18] introduce anchor rules to capture interactions between features. [Tan2018] approximate the complex model using a sum of interpretable functions, such as trees or splines, which capture the influence of individual features and their high-order interactions. The proposed NormLIME technique fits into this "neighborhood-based" paradigm. We directly change the normalization used for aggregating weights rather than the functional forms (such as rules or splines).
[Ibrahim:2019] rank feature importance and cluster data points based on their ranking correlation. The interpretations for cluster medoids are used in the place of a single global interpretation. Instead of identifying clusters, in this paper, we generate interpretations for each class in a given dataset. The class-level interpretation provides an intermediate representation so that users can grasp behaviors of machine learning models at different levels of granularity.
Proper evaluation of saliency maps can be challenging. [vg:nips2018] show that, although some techniques produce reasonable saliency maps, such maps may not faithfully reflect the behavior of the underlying model. Thus, visual inspection by itself is not a reliable evaluation criterion. [PatternNet2017] adopt linear classifiers as a sanity check. [Feng2019] propose cooperative games, where humans can see the interpretation of their AI teammate, as a benchmark. [Evaluating_Feature_Importance] propose ablative benchmarks for the evaluation of feature importance maps. When features that are considered important are removed from the input, the model should experience large drops in performance. The opposite should happen when features deemed unimportant are removed. In this paper, we evaluate the proposed technique using the Keep-and-Retrain (KAR) criterion from [Evaluating_Feature_Importance].
As a method for explaining large neural networks, LIME [lime:kdd16] first builds shallow models where each approximates the complex model around a locality on the data manifold. After that, the local explanations are aggregated using SP-LIME. In this work, we extend the general paradigm of LIME with a new method, which we call NormLIME, for aggregating the local explanations.
Building Local Explanations
LIME constructs local interpretable models to approximate the behavior of a large, complex model within a locality on the data distribution. This process is analogous to understanding how a hypersurface $f(x)$ changes around $x_0$ by examining the tangent hyperplane at $x_0$.
Formally, for a given model $f: \mathcal{X} \mapsto \mathcal{Y}$, we may learn an interpretable model $g$, which is local to the region around a particular input $x_0 \in \mathcal{X}$. To do this, we first sample from our dataset according to a Gaussian probability distribution $\pi_{x_0}$ centered around $x_0$. Repeatedly drawing $x'$ from $\pi_{x_0}$ and applying $f(\cdot)$ yields a new dataset $\mathcal{X}' = \{(x', f(x'))\}$. We then learn a sparse linear regression $g(x', x_0) = w_{x_0}^\top x'$ using the local dataset $\mathcal{X}'$ by optimizing the following loss function with $\Omega(\cdot)$ as a measure of complexity.
$$\operatorname*{argmin}_{w_{x_0}} \; \mathcal{L}(f, g, \pi_{x_0}) + \Omega(w_{x_0})$$
where $\mathcal{L}(f, g, \pi_{x_0}) = \mathbb{E}_{x' \sim \pi_{x_0}}\left[(f(x') - g(x', x_0))^2\right]$ is the squared loss weighted by $\pi_{x_0}$.
For $\Omega(w_{x_0})$, we impose an upper limit $K$ for the number of non-zero components in $w_{x_0}$, so that $\Omega(w_{x_0}) = \infty \cdot \mathbb{1}(\|w_{x_0}\|_0 > K)$. The optimization is intractable, but we approximate it by first selecting $K$ features with LASSO regression and performing regression on only the top $K$ features.
This procedure yields $g(x', x_0)$, which approximates the complex model $f(x)$ around $x_0$. The components of the weight vector $w_{x_0}$ indicate the relative influence of the individual features of $x$ in the sample $\mathcal{X}'$ and serve as the local explanation of $f$ at $x_0$. Figure 2 illustrates such a local explanation.
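The local fitting procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `lasso_cd` and `local_explanation`, the noise scale `sigma`, the sample count, and the LASSO penalty `alpha` are all hypothetical choices, the LASSO step is a plain coordinate-descent solver, and the data are centered before fitting so that no intercept is needed.

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = np.array(y, dtype=float)        # residual y - Xw for w = 0
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * w[j]         # take feature j out of the fit
            rho = X[:, j] @ resid / n
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
            resid -= X[:, j] * w[j]         # put the updated feature back
    return w

def local_explanation(f, x0, K, sigma=0.1, n_samples=500, alpha=1e-3, seed=0):
    """Sparse linear surrogate of f around x0 with at most K non-zero weights."""
    rng = np.random.default_rng(seed)
    x0 = np.asarray(x0, dtype=float)
    # Draw x' from a Gaussian centered at x0 and label the draws with f.
    X = x0 + sigma * rng.standard_normal((n_samples, x0.size))
    y = np.asarray(f(X), dtype=float)
    Xc, yc = X - X.mean(axis=0), y - y.mean()   # centering (assumption)
    # Step 1: LASSO to pick the K features with the largest coefficients.
    top = np.argsort(-np.abs(lasso_cd(Xc, yc, alpha)))[:K]
    # Step 2: ordinary least squares restricted to the selected features.
    coef, *_ = np.linalg.lstsq(Xc[:, top], yc, rcond=None)
    w = np.zeros(x0.size)
    w[top] = coef
    return w
```

For a model that is exactly linear in two of six features, for example `f = lambda X: 2.0 * X[:, 0] - 3.0 * X[:, 3]`, calling `local_explanation(f, np.zeros(6), K=2)` returns a weight vector whose only non-zero entries sit at positions 0 and 3. If `alpha` is too large for the scale of the data, the LASSO step zeroes every coefficient and the top-`K` selection becomes arbitrary, so the penalty would need tuning in practice.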
After a number of local explanations have been constructed, we aim to create a global explanation. NormLIME is a method for aggregating and normalizing multiple local explanations and estimating the global relative importance of all features utilized by the model. NormLIME gives a more holistic explanation of a model than the local approximations of LIME.
We let $c_i$ denote the $i$-th feature, or the $i$-th component in the feature vector $x$. Since the local explanation weights are sparse, not all local explanations utilize $c_i$. We denote the set of local explanations that do utilize $c_i$ as $E(c_i)$, which is a set of weight vectors $w_{x_j}$ computed at different locales $x_j$. In other words, for all $w_{x_j} \in E(c_i)$, the corresponding weight component $w_{x_j, i} \neq 0$.
The NormLIME "importance score" of the feature $c_i$, denoted by $S(c_i)$, is defined as the weighted average of the absolute values of the corresponding weights $|w_{x_j, i}|$, $\forall w_{x_j} \in E(c_i)$:
$$S(c_i) := \frac{1}{|E(c_i)|} \sum_{w_{x_j} \in E(c_i)} \gamma(w_{x_j}, i)\, |w_{x_j, i}|$$
where the weights $\gamma$ are computed as follows.
$$\gamma(w_{x_j}, i) := \frac{|w_{x_j, i}|}{\|w_{x_j}\|_1}$$
Here, $\gamma(w_{x_j}, i)$ represents the relative importance of the feature $c_i$ in the local model built around the data point $x_j$. If a feature $c_i$ is not utilized in any local model, we set its importance $S(c_i)$ to $0$.
We now introduce a slightly different perspective on NormLIME, which helps us understand the difference between this approach and the aggregation and feature importance approach used in LIME. Consider the global feature weight matrix $M$, whose rows are the local explanations $w_{x_j}$ computed at different locales $x_j$. Let $m_i$ be the $i$-th column of the matrix, which contains the weights for the same feature $c_i$ in different local explanations. Let $r$ be the vector representing the L1 norms of the rows of $M$. We can express the NormLIME global feature importance function as
$$S(c_i) = \frac{m_i^\top \operatorname{diag}(r)^{-1}\, m_i}{\|m_i\|_0}$$
where $\operatorname{diag}(r)$ denotes a matrix with $r$ on the diagonal and zero everywhere else, and $\|\cdot\|_0$ is the L0 norm, or the number of non-zero elements.
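As a small worked illustration of the aggregation described above, the following NumPy sketch computes the score of every feature from a matrix whose rows are local explanations. The function name `normlime_scores` and the toy matrix are assumptions made for this example; the code follows the definition of the score as an average, over the local explanations that use a feature, of the absolute weight multiplied by its share of the row's L1 norm.

```python
import numpy as np

def normlime_scores(M):
    """NormLIME-style global score per feature (column) of M.

    M has one local explanation per row; zeros mark unused features.
    """
    M = np.asarray(M, dtype=float)
    abs_m = np.abs(M)
    r = abs_m.sum(axis=1, keepdims=True)            # L1 norm of each row
    # gamma[j, i] = |w_ji| / ||w_j||_1, the share of feature i in row j
    gamma = np.divide(abs_m, r, out=np.zeros_like(abs_m), where=r > 0)
    totals = (gamma * abs_m).sum(axis=0)
    counts = np.count_nonzero(M, axis=0)            # explanations using feature i
    # Features never used by any local explanation keep a score of 0.
    return np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)

M = [[1.0, -1.0, 0.0],
     [0.0,  2.0, 0.0],
     [3.0,  0.0, 0.0]]
scores = normlime_scores(M)   # approximately [1.75, 1.25, 0.0]
```

For the first feature, the two explanations that use it contribute $(1/2)\cdot 1$ and $(3/3)\cdot 3$, which average to 1.75. The third feature is never used, so its score is 0.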
As discussed above, NormLIME estimates the overall relative importance assigned to feature $c_i$ by the model. For binary classification problems, this is equivalent to a representation of the importance the model assigns to the feature in distinguishing between the two classes, i.e., recognizing a class label. In multi-class problems, however, this semantic meaning is lost, as the salience computation above does not distinguish between classes, and class-relevant information becomes muddled together.
It is straightforward to recover the salience information associated with individual class labels by partitioning $E(c_i)$ based on the class label of the initial point $x_j$ that the local approximation is built around. The partition $E_y(c_i)$ contains the local explanation $w_{x_j}$ if and only if $f$ assigns the label $y$ to $x_j$. More formally,
$$E_y(c_i) := \{\, w_{x_j} \mid w_{x_j} \in E(c_i),\ f(x_j) = y \,\}$$
It is easy to see that if $y$ ranges over the label set $\mathcal{Y}$ and $f$ solves a single-label classification problem, then the family of sets $\{E_y(c_i)\}_{y \in \mathcal{Y}}$ forms a partition of $E(c_i)$:
$$E(c_i) = \bigcup_{y \in \mathcal{Y}} E_y(c_i), \qquad E_y(c_i) \cap E_{y'}(c_i) = \emptyset \ \text{ for } y \neq y'$$
Computing the salience of $c_i$ for a given label $y$ is performed via
$$S_y(c_i) := \frac{1}{|E_y(c_i)|} \sum_{w_{x_j} \in E_y(c_i)} \gamma(w_{x_j}, i)\, |w_{x_j, i}|$$
where $\gamma(w_{x_j}, i) = |w_{x_j, i}| / \|w_{x_j}\|_1$ is the same normalization weight used in the global score.
Compared to the global interpretation, the class-specific salience yields higher-resolution information about how the complex model differentiates between classes. We use $S_y(c_i)$ as prediction-independent class-level explanations in the human evaluation, described in the next section.
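The class-specific variant amounts to grouping the local explanations by the label the model assigns to their anchor points and scoring each group separately. The sketch below illustrates this under assumptions: `class_scores` and `normlime_scores` are hypothetical helper names, and `labels[j]` is taken to be the label predicted for the point around which row `j` of `M` was built.

```python
import numpy as np

def normlime_scores(M):
    """Average of (|w|/||w||_1) * |w| over the rows that use each feature."""
    M = np.asarray(M, dtype=float)
    abs_m = np.abs(M)
    r = abs_m.sum(axis=1, keepdims=True)
    gamma = np.divide(abs_m, r, out=np.zeros_like(abs_m), where=r > 0)
    totals = (gamma * abs_m).sum(axis=0)
    counts = np.count_nonzero(M, axis=0)
    return np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)

def class_scores(M, labels):
    """Per-label scores: restrict the rows of M to one predicted label at a time."""
    M = np.asarray(M, dtype=float)
    labels = np.asarray(labels)
    return {y.item(): normlime_scores(M[labels == y]) for y in np.unique(labels)}

M = [[1.0, -1.0, 0.0],
     [0.0,  2.0, 0.0],
     [3.0,  0.0, 0.0]]
per_class = class_scores(M, labels=[0, 0, 1])
# per_class[0] is approximately [0.5, 1.25, 0.0]; per_class[1] is [3.0, 0.0, 0.0]
```

Because every row carries exactly one label, the groups do not overlap and together cover all rows, which mirrors the partition property stated above.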
In order to put the interpretations generated by NormLIME to test, we administered a human user study across Amazon Mechanical Turk users. In the study, we compare class-specific explanations generated by other standard feature importance functions and the proposed salience method on the MNIST dataset.
Saliency Map Baselines
To avoid showing too many options to the human participants, which may cause cognitive overload, we selected a few baseline techniques that we consider to be the most promising for saliency maps [vg:icml_higher_derivitives]. We select SmoothGrad and VarGrad because they aim to reduce noise in the gradient, which should facilitate the aggregation of individual decisions to form a class-level interpretation. The aggregation of these individual interpretations is performed by taking the mean importance scores over interpretations from a sample of datapoints corresponding to each label. The details are discussed below.
SmoothGrad generates a more "interpretable" salience map by averaging out local noise typically present in gradient interpretations. We add random noise $\eta \sim \mathcal{N}(0, \sigma^2)$ to the input $x$. Here we follow the default implementation using the "SmoothGrad Squared" formula, which is an expectation over $\eta$:
$$\mathbb{E}_{\eta}\left[\left(\frac{\partial f}{\partial x}(x + \eta)\right)^{2}\right]$$
as noted in [Evaluating_Feature_Importance]. In practice, we approximate the expectation with the average of 100 samples of $\eta$ drawn from $\mathcal{N}(0, \sigma^2)$, where $\sigma$ is 0.3. The class-level interpretation is computed as the average of the saliency maps for 10 images randomly sampled from the target class.
[Figure, bottom panel: example attention checks from the user study. The correct choice is option 2. Options 1 and 4 are duplicates and cannot distinguish well between labels "3" and "8".]
Similar to SmoothGrad, VarGrad uses local sampling of random noise to reduce the noise of the standard gradient map interpretation. VarGrad perturbs an input $x$ randomly with noise $\eta$, and then computes the component-wise variance of the gradient over the sample:
$$\operatorname{Var}_{\eta}\left(\frac{\partial f}{\partial x}(x + \eta)\right)$$
Similar to SmoothGrad, we compute the variance using 100 samples of $\eta$.
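The two gradient-ensemble baselines differ only in the statistic taken over the noisy gradients, which the following NumPy sketch makes explicit. It is an illustration under assumptions: `grad_fn` stands in for whatever routine returns the gradient of the model output with respect to the input, the helper names are hypothetical, and reusing $\sigma = 0.3$ for VarGrad is an assumption, since the text above states that value only for SmoothGrad.

```python
import numpy as np

def noisy_gradients(grad_fn, x, sigma=0.3, n_samples=100, seed=0):
    """Gradients evaluated at x + eta for Gaussian noise eta ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    noise = sigma * rng.standard_normal((n_samples,) + x.shape)
    return np.stack([grad_fn(x + eta) for eta in noise])

def smoothgrad_squared(grads):
    """Mean of the squared gradients over the noise samples."""
    return (grads ** 2).mean(axis=0)

def vargrad(grads):
    """Component-wise variance of the gradients over the noise samples."""
    return grads.var(axis=0)

def class_level_map(maps):
    """Average several per-image saliency maps into one class-level map."""
    return np.mean(maps, axis=0)

# Toy model f(x) = sum(x^2), whose gradient is 2x.
grads = noisy_gradients(lambda z: 2.0 * z, [1.0, -2.0], n_samples=20000)
sg2 = smoothgrad_squared(grads)   # close to 4 * (x^2 + sigma^2)
vg = vargrad(grads)               # close to 4 * sigma^2
```

For the toy quadratic model the expected values can be derived by hand, which gives a quick sanity check: the squared-gradient average grows with the input magnitude, whereas the variance depends only on the noise scale.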
We compute the LIME importance as in Eq. (6), but conditioned on the label $y$, to capture features that are positively correlated with the specific label.
For both LIME and NormLIME, we only show the input features that are positively correlated with the prediction. That is, a feature is shown only when its corresponding weight is positive. The purpose is to simplify the instructions given to human participants and avoid confusion, since most participants may not have sufficient background in machine learning.
The design of the study is as follows: We administered a questionnaire featuring 25 questions, each containing four label-specific explanations for the same digit. We were able to restrict participants through Mechanical Turk to users who had verified graduate degrees. Survey takers were instructed to evaluate the different explanations based on how well they captured the important characteristics of each digit in the model’s representation. Figure 3 shows the instructions. To account for response ordering bias, the order of the methods presented for each question was randomized. In order to catch participants who cheat by making random choices, we included 5 attention checks with a single acceptable answer that is relatively obvious. We only include responses that pass at least 4 of the 5 attention checks and disregard the rest.
We conducted the experiment on MNIST with $28 \times 28$ single-channel images. We trained a 5-layer convolutional network that achieves 99.04% test accuracy. This model consisted of two blocks of convolution plus max-pooling operations, followed by three fully connected layers with dropout in between, and a final softmax operation. The number of hidden units for the three layers was 128, 128, and 10, respectively. Class-specific explanations were generated for each of the digits from 0 to 9.
It is important to note that none of the explanations generated for the study represented a particular prediction on a particular image, but instead represented how well the importance functions captured the important features for a label (digit) in the dataset.
Results and Discussion
After filtering responses that failed the attention check, we ended up with 83 completed surveys. From their responses, the numbers of votes for each method were: 939 for NormLIME, 438 for LIME, 151 for VarGrad, and 132 for SmoothGrad. We analyzed the data by treating each user's response as a single sample of the relative proportions of the various explanation methods for that user, and performed a standard one-way ANOVA test against the hypothesis that the explanations were preferred uniformly. The result was statistically significant, allowing us to reject the null hypothesis. We conclude that a significant difference exists in how users perceived the explanations.
A subsequent Tukey HSD post hoc test confirms that the differences between NormLIME and all other methods are highly statistically significant. It also shows that the difference between LIME and the gradient-based interpretations is statistically significant. We conclude that, overall, the NormLIME explanations were preferred over all other baseline techniques, including LIME, and that NormLIME and LIME were preferred over the gradient-based methods.
Observing Figure 1, the interpretations of SmoothGrad and VarGrad do not appear to resemble anything semantically meaningful. This may be attributed to the fact that these methods are not designed with class-level interpretation in mind. LIME captures the shape of the digits to some extent but cannot differentiate the most important pixels. In contrast, the normalization factor in NormLIME helps to illuminate the differences among the important pixels, resulting in easy-to-read interpretations. This suggests that proper normalization is important for class-level interpretations.
Numerical Evaluation with KAR
Visual inspection alone may not be sufficient for evaluating saliency maps [vg:nips2018]. In this section, we further evaluate NormLIME using a technique akin to Keep-and-Retrain (KAR) proposed by [Evaluating_Feature_Importance]. The underlying intuition of KAR is that features with low importance are less relevant to the problem under consideration. This gives rise to a principled "hypothesis of least damage": removal of the least important features, as ranked by a feature importance method, should impact the model's performance minimally. Thus, we can compare two measures of feature importance by comparing the predictive performance after removing the same number of features that each method ranks as the least important.
Specifically, we first train the same convolutional network as in the human evaluation with all input features and use one of the importance scoring methods to rank the features. We remove a number of the least important features and retrain the model. Retraining is necessary because we want to measure the importance of the removed features to prediction rather than how much one trained model relies on the removed features. After that, we measure the performance drop caused by feature removal; a smaller performance drop indicates a more accurate feature ranking from the interpretation method.
We perform the KAR evaluation on two sets of features. The first set is the raw pixels from the images. The second set is the output of the second convolutional layer of the network. The baselines and results are discussed below.
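The remove-and-retrain loop can be sketched as follows. This is a simplified illustration, not the experimental code: the helper names are hypothetical, features are assumed to be flattened into the columns of a 2-D array, "removing" a feature is modeled as overwriting it with a constant `fill_value` (an assumption), and a nearest-centroid classifier stands in for retraining the convolutional network.

```python
import numpy as np

def remove_least_important(X, scores, fraction, fill_value=0.0):
    """Overwrite the lowest-scoring fraction of feature columns with a constant."""
    k = int(round(fraction * X.shape[1]))
    drop = np.argsort(scores, kind="stable")[:k]
    X_masked = np.array(X, dtype=float, copy=True)
    X_masked[:, drop] = fill_value
    return X_masked

def kar_error(scores, fraction, train, test, fit_and_error):
    """Mask train and test identically, retrain from scratch, return test error."""
    (X_tr, y_tr), (X_te, y_te) = train, test
    return fit_and_error(remove_least_important(X_tr, scores, fraction), y_tr,
                         remove_least_important(X_te, scores, fraction), y_te)

def centroid_error(X_tr, y_tr, X_te, y_te):
    """Stand-in learner: nearest class centroid, refit on every call."""
    classes = np.unique(y_tr)
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dists = ((X_te[:, None, :] - centroids[None]) ** 2).sum(axis=2)
    return float(np.mean(classes[dists.argmin(axis=1)] != y_te))

# Toy data in which only feature 0 carries label information.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 4))
y = (X[:, 0] > 0).astype(int)
train, test = (X[:200], y[:200]), (X[200:], y[200:])
good = kar_error(np.array([1.0, 0.1, 0.2, 0.3]), 0.75, train, test, centroid_error)
bad = kar_error(np.array([0.0, 0.1, 0.2, 0.3]), 0.25, train, test, centroid_error)
```

A ranking that places the informative feature first keeps the error low even after three of four features are removed, while a ranking that places it last loses the signal after a single removal. Comparing such errors at matched removal fractions is the comparison the evaluation relies on.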
We evaluated NormLIME against various baseline interpretation techniques on the MNIST dataset [MNIST]. For NormLIME and LIME, we use the absolute value of the weights as the feature importance. In addition to the existing baselines used in the human evaluation, we introduce the following baselines.
The Shapley value measures the importance of a feature by enumerating all possible combinations of features and computing the average performance drop when the feature is removed. The value is well suited to situations with heavy interactions between the features. While theoretically appealing, the exact computation is intractable for large feature sets. A number of techniques [Chen2019:Shapley, NIPS2017_Lundberg, Strumbelj:2010] have been proposed to approximate the Shapley value. Here we use SHapley Additive exPlanations (SHAP), an approximation based on sampled least-squares.
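To make the enumeration concrete, the sketch below computes exact Shapley values for a handful of features by brute force. It illustrates the definition only and is not the SHAP approximation used in the experiments; the function `shapley_values` and the toy value functions are assumptions for this example, with `value(S)` standing for the performance obtained when only the features in the set `S` are available.

```python
from itertools import combinations
from math import factorial

def shapley_values(n, value):
    """Exact Shapley values for n features; value maps a frozenset to a number."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability that exactly this many features precede i in a random order.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Additive game: each feature contributes a fixed amount on its own.
weights = [1.0, 2.0, 3.0]
additive = shapley_values(3, lambda s: sum(weights[j] for j in s))

# Interaction game: features 0 and 1 are only useful together.
pair = shapley_values(3, lambda s: 1.0 if {0, 1} <= s else 0.0)
```

In the additive game each feature recovers its own weight, and in the interaction game the credit is split evenly between the two cooperating features while the third receives none. The loop visits every subset of the other features, so the cost grows exponentially with the number of features, which is why approximations are needed for image inputs.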
The Random baseline assigns feature importance at random and serves as a "sanity check". Notably, in the experiments of [Evaluating_Feature_Importance], some commonly used saliency maps perform worse than random.
Results and Discussion
Figure 5 shows the error gained after a number of the least important features are removed, averaged over 5 independent runs. We use removal thresholds from 10% to 90%. When 50% of the features or fewer are removed, NormLIME performs better than or similarly to the best baselines, though it picks up more error when more features are removed. The best prediction accuracy overall is achieved by NormLIME at 50% feature reduction with 0% error gain. This is matched by SHAP, also at 50% feature reduction, and by VarGrad at 20% feature reduction. All other baselines incur at least about 0.25% error gain at 50% feature reduction. SHAP and LIME perform better than the other methods, including NormLIME, when 60% or more of the features are removed. The gradient-based methods are outperformed by NormLIME and LIME.
Figure 6 shows the same measure on the convolutional features. On these features, NormLIME outperforms the other methods by larger margins than on the input features. NormLIME achieves better results than the original model (at 99.06%) when 70% or fewer of the features are removed, underscoring the effectiveness of dimensionality reduction. The best performance is achieved at 30% removal with a classification accuracy of 99.31%. The second best is achieved by LIME, at 99.1% accuracy when 40% of the features are removed, comparable with NormLIME performance at the same level.
When 80% of features are removed, NormLIME demonstrates zero error gain, whereas the second best method, LIME, gains 0.3% absolute error. When 90% of features are removed, NormLIME shows 0.45% error gain, while LIME incurs 0.6% error gain and all others incur at least 1.25%.
Overall, the gradient ensemble methods, SmoothGrad and VarGrad, perform better than Random, but they compare unfavorably against "additive local model approximation" schemes like SHAP, LIME, and NormLIME.
The advantage of NormLIME is more pronounced when pruning the convolutional features than the input features. Further, we can achieve better performance by removing convolutional features but not the input features. This suggests that there is more redundancy in the convolutional features and NormLIME is able to exploit that phenomenon.
Proper interpretation of deep neural networks is crucial for state-of-the-art AI technologies to gain the public’s trust. In this paper, we propose a new metric for feature importance, named NormLIME, that helps human users understand a black-box machine learning model.
We extend the LIME / SP-LIME technique [lime:kdd16], which generates local explanations of a large neural network and aggregates them to form a global explanation. NormLIME adds proper normalization to the computation of global weights for features. In addition, we propose label-based salience, which provides finer-grained interpretation in a multi-class setting than LIME, which instead focuses on choosing an optimal set of individual predictions to explain a model.
Experimental results demonstrate that the NormLIME explanations agree with human intuition about the features that separate different digits in MNIST. The human evaluation study shows that explanations generated by the salience metric are strongly favored over comparable ones generated by LIME, SmoothGrad, and VarGrad, with strong statistical significance. Further, using the Keep-and-Retrain evaluation method, we show that explanations formed by the NormLIME metric are faithful to the problem at hand, as the metric identifies input features and convolutional features whose removal is not only harmless but may improve the prediction accuracy. By improving the interpretability of machine learning models, the proposed salience metric lays the groundwork for further adoption of AI technologies for the benefit of all of society.
This research is partially supported by the NSF grant CNS-1747798 to the IUCRC Center for Big Learning. We thank Peter Lovett for valuable input to the human study.