Improving Interpretability of CNN Models Using Non-Negative Concept Activation Vectors

06/27/2020 ∙ by Ruihan Zhang, et al. ∙ The University of Melbourne

Convolutional neural network (CNN) models for computer vision are powerful but lack explainability in their most basic form. This deficiency remains a key challenge when applying CNNs in important domains. Recent work for explanations through feature importance of approximate linear models has moved from input-level features (pixels or segments) to features from mid-layer feature maps in the guise of concept activation vectors (CAVs). CAVs contain concept-level information and could be learnt via Clustering. In this work, we rethink the ACE algorithm of Ghorbani et al., proposing an alternative concept-based explanation framework. Based on the requirements of fidelity (approximate models) and interpretability (being meaningful to people), we design measurements and evaluate a range of dimensionality reduction methods for alignment with our framework. We find that non-negative concept activation vectors from non-negative matrix factorization provide superior performance in interpretability and fidelity based on computational and human subject experiments. Our framework provides both local and global concept-level explanations for pre-trained CNN models.




Code Repositories


Invertible Concept-based Explanation (ICE)


1 Introduction

Deep learners such as convolutional neural networks (CNNs) [lee1999learning] are widely used across important domains like computer vision because of their demonstrated performance on numerous tasks. However, when they are applied in critical domains like medicine, justice, and finance, explainability becomes a key enabler and mitigation. While commentators like rudin2019stop argue that deep learning approaches should not be used in these risky domains, using deep learning to discover features for more ‘interpretable’ models still requires explainability to determine what features have been discovered.

Recent CNN explanation methods attempt to quantify the importance of each feature. Feature importance usually corresponds to a linear approximation of a highly complex model. Different methods use different features; for instance, saliency maps use gradients to assign importance to pixel-level features [shrikumar2017learning, smilkov2017smoothgrad]. Linear models are relatively easy to understand, as they consist only of features and their importance weights. But when the input is an image fed into a highly non-linear model, such approximations provide only local explanations. To mitigate this limitation, work has explored feature maps from layers within CNNs, as such layers may represent higher-level features (concepts) [kim2018interpretability, zhou2018interpretable]. Such work learns concept representations for a layer and then estimates concept importance based on a labelled dataset.

kim2018interpretability named these representations concept activation vectors (CAVs). One of their most obvious limitations is the need for a labelled concept dataset to learn the target CAVs. ghorbani2019towards introduced unsupervised learning of CAVs, named Automated Concept-based Explanation (ACE). ACE provides concept explanations through $k$-means Clustering of CNN feature map activations. These concept-based methods separate the complex model into two parts, a concept extractor and a classifier. The concept extractor is explained by CAVs, to establish that it learns human-understandable concepts. For the classifier, a linear model is used as an approximation and to estimate concept importance per CAV.

In sum, previous works are variations on a key idea: providing feature/concept weights from a linear model approximating a CNN, or part thereof. ribeiro2016should claim two important requirements for linear approximations: interpretability and fidelity. Interpretability means that the feature representation used in the approximate model needs to be meaningful to human observers. Fidelity requires that the approximate model make similar predictions to the model under explanation. A third requirement we adopt is that the complexity of the approximate model should be minimal. Complexity reflects the number of features in the linear approximate model. Human observers may only be capable of holding a limited amount of information in their short-term memory [miller1956magical]. Thus, supporting a variable number of features in linear models for different users is desirable.

Following concept-based explanations and the aforementioned requirements for approximate models, Clustering methods like the $k$-means used in ACE can be viewed as dimensionality reduction methods that satisfy the requirement of flexible complexity for the linear model. They can also provide interpretable CAVs for a given dataset in the form of sample segments [ghorbani2019towards]. However, Clustering methods lack fidelity compared to other dimensionality reduction methods like principal component analysis (PCA) and non-negative matrix factorization (NMF). Because a one-hot vector is used to store information, more information might be lost when inverting to the original dimension. We aim to address these shortcomings of Clustering as a dimensionality reduction method, in both interpretability and fidelity, by introducing a new concept-based explanation framework.

Our contributions: We propose a concept-based explanation framework with corresponding evaluation metrics. Our framework applies to previously proposed concept-based explanation methods like ACE. We demonstrate that non-negative concept activation vectors (NCAVs) learned from NMF provide more interpretable and faithful concepts than concepts from Clustering methods and PCA used as feature map reducers. We run computer-based and human behavioural experiments that evaluate interpretability and fidelity. Sample local and global explanations are shown in Figure 1.


Figure 1: Explanation for an image with a dog and cat. For each concept, the framework provides prototypes based on the training set and correlates areas as global explanations. The explainer decomposes the final prediction to concept scores and weights through a linear model to explain locally. Explanation is based on ResNet50 CNN model.

2 A Framework for Concept-based Explanations

Concept-based explanations may be approached as linear approximations of CNN models that have been separated into parts, as follows. First, we separate the CNN classifier at a single layer into a concept extractor and a classifier; the explanations are based on the feature maps from that layer. Input-level explanations like LIME [ribeiro2016should] and saliency maps [shrikumar2017learning, smilkov2017smoothgrad] skip this step and use input images directly as features for approximate models. Second, we apply dimensionality reduction to the feature maps to provide CAVs for the next step. A reducer is trained on a dataset related to the target concepts. Note that the middle-layer feature maps may contain too many dimensions, and the information in each dimension alone is not enough to be meaningful. The reducer therefore gathers information spread across all dimensions to provide CAVs and reduce the complexity of the approximate model. For the final step, we build a linear approximation of the classifier and estimate the concept importance for each CAV. The explanation is based on the learned reducer and the estimated weight for each CAV. For explanations of new inputs, reducers provide meaningful concept descriptions and concept scores from the feature maps. A diagram of the framework is shown in Figure 2.

Figure 2: A depiction of our framework. The CNN model is separated into concept extractor and classifier by chosen middle layer with a reducer. The concept extractor provides concept visualizations, instance correlated areas and similarity scores. The classifier provides concept weights and generates linear approximations as an explanation.

The selection of the target layer is important. Higher layers focus more on concepts (high-level features), and lower layers focus more on edges and textures of the image (low-level features) [zeiler2014visualizing]. A higher target layer usually means a classifier with fewer layers (simpler) and a concept extractor with higher-level concepts. If the reduced concepts are to be meaningful, a higher layer is potentially the better choice. One special case is using the feature maps from the last layer. Assuming a global average pooling (GAP) layer followed by a single dense layer as the final layers, the classifier under explanation reduces to a simple linear model. The estimated weights are then accurate, as they are constant at every position for any CAV. For earlier layers, the weight estimates are averages over all instances, whereas the actual weights could vary for different inputs. The last layer is therefore generally a good choice for concept-based explanations.

A benefit of using reducers instead of Clustering methods is that reducers output scores for concepts rather than predictions of cluster centroids. The reduced concept scores can then be fed into the approximate model to analyse the contribution of each feature more accurately. In ACE, concept scores can only be binary. Reducers thus provide better fidelity when inverting the reduction process, which helps when evaluating the fidelity of the learned CAVs.
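The difference between one-hot cluster assignments and continuous concept scores can be illustrated with a small sketch. The two directions and the feature vector below are made-up values, and ordinary least squares stands in for a score-producing reducer; it is not the NMF or PCA reducer evaluated in this paper.

```python
import numpy as np

# Toy "concept directions" (rows); values are invented for illustration.
centroids = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])

# A feature vector that mixes both concepts.
x = np.array([0.6, 0.4, 0.0])

# Clustering-style reduction: a one-hot assignment to the nearest centroid.
nearest = np.argmin(np.linalg.norm(centroids - x, axis=1))
one_hot = np.eye(len(centroids))[nearest]
x_from_cluster = one_hot @ centroids

# Score-style reduction: least-squares scores over the same directions
# (an assumption standing in for a real reducer).
scores, *_ = np.linalg.lstsq(centroids.T, x, rcond=None)
x_from_scores = scores @ centroids

# Reconstruction error after inverting each reduction.
err_cluster = np.linalg.norm(x - x_from_cluster)
err_scores = np.linalg.norm(x - x_from_scores)
```

In this toy case the one-hot code discards the second concept entirely, so inverting it loses more of the original vector than inverting the continuous scores.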

3 Methodology

Given a pre-trained CNN classifier $f$ with $n$ training images $I$, the prediction process will be $f(I) = Y$. Here we simply remove any final softmax layer (if present) so that each $y \in Y$ is a scalar but not a probability. Let $A_l$ be the feature map from the target layer $l$, then separate the CNN into two parts $E_l$ and $C_l$ from the target layer. Feature map $A_l$ should be of shape $h \times w \times c$, where $h$ and $w$ reflect the size of the feature map and $c$ is the number of channels. Let $A_l^{(i,j)}$ be a vector from $A_l$ at position $(i, j)$. CNN models share weights, so the vector at each position in $A_l$ can be considered to be information on the original images after equivalent processing but with different receptive fields.

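Non-Negative Concept Activation Vectors: Non-negative matrix factorization (NMF) reduces dimensions. Given a non-negative matrix $V \in \mathbb{R}^{m \times c}$, NMF provides non-negative $S$ and $P$ such that $V = SP + U$. Here, $S \in \mathbb{R}^{m \times c'}$, and $P \in \mathbb{R}^{c' \times c}$, and $U$ is the residual. The aim is to minimize the error, which is given formally as $\min_{S, P} \lVert V - SP \rVert_F$ subject to $S \geq 0$, $P \geq 0$.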
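A minimal sketch of such a factorization is given below. The experiments in this paper rely on scikit-learn; this sketch instead swaps in the classic multiplicative update rules in plain NumPy so that it is self-contained. The function name `nmf`, the iteration count and the synthetic matrix are all assumptions made for illustration.

```python
import numpy as np

def nmf(V, n_concepts, n_iter=500, eps=1e-9, seed=0):
    """Factorise non-negative V (rows x c) as S @ P with S, P >= 0."""
    rng = np.random.default_rng(seed)
    rows, c = V.shape
    S = rng.random((rows, n_concepts)) + eps
    P = rng.random((n_concepts, c)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: ratios of non-negative terms,
        # so every entry of S and P stays non-negative.
        P *= (S.T @ V) / (S.T @ S @ P + eps)
        S *= (V @ P.T) / (S @ P @ P.T + eps)
    return S, P

# Synthetic non-negative matrix of rank 3 (not real feature maps).
rng = np.random.default_rng(1)
V = rng.random((50, 3)) @ rng.random((3, 20))

S, P = nmf(V, n_concepts=3)
rel_residual = np.linalg.norm(V - S @ P) / np.linalg.norm(V)
```

Each row of `P` plays the role of one concept direction and each row of `S` holds the scores of one instance under those directions.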

Each vector from feature map $A_l$ is an instance for the reducer. Given a training set $I$ of $n$ images, fitting input matrix $V$ containing $n \times h \times w$ instances with $c$ dimensions from feature maps $A_l$, NMF provides NCAVs $P$. Each row in $P$ reflects a single NCAV, totalling $c'$ NCAVs from the given parameter $c'$, the number of concepts in the reducer. Applying the decomposition with the learned $P$ provides similarity scores $S$ under each NCAV. Another important parameter is the number of classes in the training set. Concepts appear in related images from the same class. Faced with a question like “Why this class but not others?”, we can use images from different classes to learn related concepts (Figure 1 uses a cat class and a dog class, as they are the contexts in the image). For Clustering and PCA, the processes are similar.
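Scoring a new feature map against already learned NCAVs can be sketched as follows. The helper name `concept_scores`, the two hand-written NCAVs and the synthetic feature map are hypothetical; the update rule is the same multiplicative step as in standard NMF, applied to the scores only while the NCAVs stay fixed.

```python
import numpy as np

def concept_scores(feature_map, P, n_iter=200, eps=1e-9):
    """Decompose one (h, w, c) feature map against fixed NCAVs P (c' x c)."""
    h, w, c = feature_map.shape
    V = feature_map.reshape(h * w, c)         # one instance per spatial position
    S = np.full((h * w, P.shape[0]), 0.5)     # non-negative starting point
    for _ in range(n_iter):
        S *= (V @ P.T) / (S @ P @ P.T + eps)  # update scores only; P is fixed
    return S.reshape(h, w, P.shape[0])

# Two hypothetical NCAVs over c = 4 channels.
P = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])

# Synthetic non-negative activations built from known scores.
true_scores = np.random.default_rng(0).random((7, 7, 2))
feature_map = true_scores @ P

S = concept_scores(feature_map, P)
```

The result keeps the spatial layout of the feature map, so each slice `S[:, :, k]` can be read as a heatmap for concept `k`.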

Weight Estimation: For weights, or concept importance, other interpretability methods such as saliency maps use the gradients of output scores for some classes with respect to individual input features. Considering the feature map as input features, the weights can be estimated using the learned NCAVs and directional derivatives. This is the same method used for estimating importance in TCAV [kim2018interpretability]. For a layer $l$, given a learned NCAV $p$ and the correlated concept score $s$ from the decomposed feature map $A_l$, and considering class $k$ as the target class with classifier output $C_{l,k}$, the estimated weight is: $\frac{\partial C_{l,k}}{\partial s} = \lim_{\epsilon \to 0} \frac{C_{l,k}(A_l + \epsilon P') - C_{l,k}(A_l)}{\epsilon}$. The estimate is based on some small $\epsilon$ over all training instances. Here $P'$ is the matrix with the same size as $A_l$ in which every vector at position $(i, j)$ is $p$. Finally, the average of the weights over the training instances is taken to be the final estimated weight.
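The directional-derivative estimate can be sketched with a finite difference. The function `estimate_weight`, the toy classifier head and all numeric values below are assumptions for illustration; a real classifier part of a CNN would replace `toy_classifier`.

```python
import numpy as np

def estimate_weight(classifier, feature_maps, ncav, eps=1e-3):
    """Average finite-difference slope of the class score along one NCAV."""
    slopes = []
    for A in feature_maps:                 # each A has shape (h, w, c)
        shifted = A + eps * ncav           # ncav (c,) broadcasts to every position
        slopes.append((classifier(shifted) - classifier(A)) / eps)
    return float(np.mean(slopes))

# Hypothetical classifier head: GAP followed by a fixed linear score.
w_dense = np.array([0.5, -1.0, 2.0])

def toy_classifier(A):
    return float(A.mean(axis=(0, 1)) @ w_dense)

rng = np.random.default_rng(0)
maps = rng.random((4, 5, 5, 3))            # four synthetic feature maps
ncav = np.array([1.0, 0.0, 1.0])

weight = estimate_weight(toy_classifier, maps, ncav)
```

Because the toy head is linear, the estimate coincides with the dot product of the NCAV and the dense weights; for a non-linear classifier the per-instance slopes would differ and the average is only an approximation.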

One special case is the last feature map before a GAP layer and a dense layer. Since they are linear, having weights $W$ with bias $b$ from the last dense layer for target class $k$ and learned NCAV parameter $P$, the estimated weight follows from:

$$C_{l,k}(A_l) = \mathrm{GAP}(A_l)\,W + b \approx \mathrm{GAP}(SP)\,W + b = \mathrm{GAP}(S)\,PW + b$$

The estimated weight will be $PW$. The last layer is a reasonable choice when explaining a CNN model: it contains the highest level of concepts; they are highly centralised and require the fewest concepts for the same fidelity; and estimated weights for the last layer are the most accurate, especially when GAP and dense layers are at the end.
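The linearity argument for the GAP and dense case can be checked numerically. In this sketch the scores `S`, the NCAV matrix `P`, the dense weights `W` and the bias `b` are random stand-ins, not values from any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c, n_concepts = 7, 7, 16, 4

S = rng.random((h, w, n_concepts))   # concept scores per position
P = rng.random((n_concepts, c))      # NCAVs as rows
W = rng.normal(size=c)               # dense-layer weights for the target class
b = 0.3                              # dense-layer bias for the target class

# Score computed on the reconstructed feature map: GAP, then dense layer.
A = S @ P                            # shape (h, w, c)
score_direct = A.mean(axis=(0, 1)) @ W + b

# Same score computed from pooled concept scores and per-concept weights.
concept_weights = P @ W              # one weight per concept
score_from_concepts = S.mean(axis=(0, 1)) @ concept_weights + b
```

Both routes give the same score, which is why the per-concept weights are exact in this special case rather than averaged estimates.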

Vector Visualisation: There are many ways to visualise a vector from a layer. For instance, given a vector in a middle layer, Deep Dream (concept vector visualization) [olah2017feature] can provide a pattern-enhanced image based on gradients for the vector. In this work, we use the method of prototypes [kim2016examples], choosing images that contain the target concepts and highlighting these concepts. By applying GAP to the decomposed feature maps, we obtain a score for each concept. Images with high concept scores are taken as the prototypes. Previous work shows that middle-layer feature maps have spatial correlations with input images, as was used in image segmentation to replace input masks with feature map masks [dai2015convolutional]. Decomposed feature maps for a single CAV can therefore be presented as heatmaps for the target concept. Combining a heatmap and an image, we apply a threshold to the heatmap and highlight only the areas of the image with a high concept value. In this paper, the threshold is taken to be 0.5, and only regions with values higher than 0.5 (after a minmax normalization) are considered to be related. Concept prototypes from Figure 1 are visualized in this way.
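The thresholding step can be sketched as below. The helper name `concept_mask` and the example heatmap are illustrative assumptions; upsampling the mask to the input image resolution, which a real visualisation would need, is left out.

```python
import numpy as np

def concept_mask(heatmap, threshold=0.5):
    """Min-max normalise a concept heatmap and keep values above threshold."""
    lo, hi = heatmap.min(), heatmap.max()
    if hi == lo:                      # flat map: no region stands out
        return np.zeros_like(heatmap, dtype=bool)
    normalised = (heatmap - lo) / (hi - lo)
    return normalised > threshold

# Made-up concept scores on a 2 x 3 grid of positions.
heatmap = np.array([[0.0, 1.0, 2.0],
                    [3.0, 4.0, 8.0]])
mask = concept_mask(heatmap)
```

With these values only the position holding the maximum survives: the entry 4.0 normalises to exactly 0.5 and is excluded by the strict comparison.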

4 Experiments and Results

Following the desiderata of this work, we aim to measure both fidelity and interpretability. Interpretability will be measured through human surveys.

For both the computational and human-subject experiments, we use well-known CNN models for image classification. We consider two different datasets, ILSVRC2012 (ImageNet) [ILSVRC15] and CUB [WahCUB_200_2011]. The implementation is based on PyTorch and scikit-learn. CNN models used for the ILSVRC2012 dataset are from torchvision pre-trained models. Here top1 error of ResNet50, ILSVRC2012 is and for Inception-V3, ILSVRC2012 is . For the CUB dataset, we use the ResNet50 [he2016deep] structure and fine-tune it from ImageNet pre-trained weights. The top1 error is . Other than NMF, we choose the baselines of Clustering (from ACE) and PCA (a popular dimensionality reduction method). Reducers are trained on the training set and evaluated on the test set or the validation set.

4.1 Fidelity for Approximate Models

In this section, we compare the fidelity of approximate models based on three different dimensionality reduction methods: NMF, Clustering and PCA. We evaluate the fidelity for pre-trained CNN models with different numbers of concepts for each dimensionality reduction method, using both classification and regression measurements.

For fidelity, our approach is to measure the difference between the approximate and original model predictions. The measurements for classification and regression problems differ. Classification models only focus on predicting labels, so errors that do not change the predicted label are ignored. For regression, any difference in the approximate model's output affects the measurement through the loss function.

Measures:  Given the original model and an approximate model , craven1996extracting provide a fidelity measurement for the approximate model of classification models, the 0-1 loss. It targets the difference of accuracy through predictions. For regression, ribeiro2016should define fidelity measurement as the squared error . While the squared error is appropriate as a loss function during training, for evaluation, relative error (RE) is more easily interpretable being scale-free. Given , and a set of instances , the measurement for classification and regression models based on the dataset will be:

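Given a trained reducer $r$ and its inverse function $r^{-1}$ for layer $l$, the approximate model is given by $\hat{f}(I) = C_l(r^{-1}(r(E_l(I))))$, where $E_l$ and $C_l$ denote the concept extractor and classifier parts of the CNN separated at layer $l$.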
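The two fidelity measures can be sketched in a few lines. The exact normalisation used for the relative error is not recoverable from the text here, so the form below (summed absolute deviation divided by summed absolute original score) is an assumption; the function names and the sample values are also hypothetical.

```python
def fidelity_01(original_labels, approx_labels):
    """Fraction of instances where both models predict the same label."""
    agree = sum(o == a for o, a in zip(original_labels, approx_labels))
    return agree / len(original_labels)

def relative_error(original_scores, approx_scores):
    """Summed absolute deviation, normalised by the summed original magnitude.

    Assumed form: the source's exact definition may differ.
    """
    num = sum(abs(o - a) for o, a in zip(original_scores, approx_scores))
    den = sum(abs(o) for o in original_scores)
    return num / den

# Made-up predictions from an original model and its approximation.
agreement = fidelity_01([1, 2, 3, 1], [1, 2, 0, 1])
rel_err = relative_error([10.0, 20.0], [9.0, 22.0])
```

Higher agreement and lower relative error both indicate that the approximate model stays closer to the original one.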

Experimental Setup: Our experiment is based on the ResNet50 model pre-trained on ImageNet from torchvision, using the feature maps from layer4’s output. The parameter is evaluated from to , in steps of . The model can be considered both a classification and a regression (score for each single class) model, so both fidelity measures can be evaluated. Here we trained reducers for all 1,000 classes in ILSVRC2012. Only images from one class are included for one reducer, which means the number of classes is 1. For classification, only the top 5 classes are considered as candidates. For regression, only ground truth classes are tested, calculating the RE for the approximate models’ outputs. We take the mean RE over all 1,000 classes as the final result.

Figure 3: Average fidelity for approximate linear models of ResNet50 over 1,000 classes. The left figure shows the fidelity for classification and the right one for regression. For classification, higher is better, meaning closer to the original model’s accuracy. For regression, lower is better, reflecting a lower RE. Around 1,000 compute hours (8-core CPU, V100 GPU) are needed for the evaluation.

Experimental Results: Figure 3 shows the fidelity for different numbers of concepts, for classification (left) and regression (right). PCA provided the best fidelity for both regression and classification. NMF’s result is close to PCA’s but diverges as the number of concepts increases. PCA is a popular and efficient dimensionality reduction method. NMF has two more constraints: non-negativity and no introduction of extra bias. Also, NMF finds new vector bases to achieve a new balance for each vector every time the number of concepts increases, while PCA simply seeks a new basis vector iteratively, based on variance maximisation. Clustering showed the worst performance. Clustering methods can be considered dimensionality reduction methods, but they only provide one-hot vectors as centroid predictions, offering the least information; they are not designed for dimensionality reduction. As the number of concepts increases, approximate models provide more faithful predictions for both classification and regression.

4.2 Interpretability through Human Survey

In this section, we evaluate the interpretability of approximate models based on three different dimensionality reduction methods: NMF, Clustering, and PCA. We hypothesised that NCAVs learned from NMF are more interpretable than CAVs learned from Clustering and PCA.

Interpretability reflects the meaningfulness of CAVs learned by dimensionality reduction methods, and therefore requires human-subject experimentation. If participants more frequently understand the concepts from learned CAVs through visualisations as explanations, the CAVs and the corresponding dimensionality reduction methods can be considered more interpretable and meaningful to humans.

Methodology: We use the Prediction Task [hoffman2018metrics, p. 11] and the Explanation Satisfaction Scale [hoffman2018metrics, p. 39] for evaluation. For the prediction task, higher prediction accuracy indicates that participants are more frequently able to match the concept in the image to the concept in the explanation. Participants should give similar descriptions of a concept if it has a clear meaning. Cosine similarity is used to measure the similarity between concept descriptions. Finally, we measure the participants’ satisfaction with the explanations in terms of confidence, understanding, satisfaction, sufficiency and completeness. We obtained ethics approval from The University of Melbourne Human Research Ethics Committee (ID 1749428).

Experimental Design:  The experiment has two phases. In Phase 1, at each trial, participants are given five concept explanations from a class as candidates. Then an image with one concept highlighted is shown to participants. They are required to predict the related concept from 5 candidates for the given image and highlighted region as shown in Figure 4. For each concept candidate, participants are asked to provide a 1-2 word description of the concept. All participants are given 5 training images as a training phase followed by 15 testing images in this phase as the prediction task. They can move back to a training example at any time in the test phase. In Phase 2, they need to complete an explanation quality survey to self-report their opinion about explanations in the form of an explanation satisfaction scale. The experiment was implemented in a web-based environment and was conducted on the Amazon Mechanical Turk, a crowd-sourcing platform popular for obtaining data for human-subject experiments [buhrmester2011amazon].

Figure 4: Two samples of survey trials in the prediction phase using NMF reducer. Participants need to choose a concept on the right which is correlated to the image on the left.

Experimental Parameters:  To validate the consistency of results, we include three different scenarios: ResNet50 ( as the target layer) for ILSVRC2012 (scenario RI), Inception-V3 ( as the target layer) for ILSVRC2012 (scenario II) and ResNet50 ( as the target layer) for CUB as target CNN models (scenario RC). Three methods NMF, Clustering and PCA are applied individually in each scenario.

All 20 images are from 20 random classes chosen from all classes for each dataset. For each class, we train an explainer with of , and only the top CAVs with highest weights are chosen as selection candidates. One of these CAVs is randomly selected as the target. The concept in the target CAV is identifiable only if the sample image highly activates that CAV. So each target image is chosen from the top images in the test set that activate the target CAV most strongly (with a high similarity score). This avoids the absence of the concept in images (e.g., a tail concept may be considered absent when only the upper part of a dog is shown in the image). Each CAV is visualised by its prototype samples. The number of classes is 1: all explainers are trained for one class. Classes are the same for different models on the same dataset, but candidate concepts and target instances are different. For description measurement, we use GloVe [pennington2014glove] pre-trained word vector representations for each description, then use the average pairwise cosine similarity to measure the similarity of the concept descriptions.
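The average pairwise cosine similarity can be sketched as follows. The two-dimensional vectors are toy stand-ins for description embeddings, not actual GloVe vectors, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings for three participants' descriptions of one concept.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
similarity = mean_pairwise_cosine(embeddings)
```

Two of the three toy descriptions agree exactly and the third is orthogonal to both, so one of the three pairs contributes a similarity of 1 and the other two contribute 0.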

We used a between-subject design: participants were randomly allocated into nine groups (3 scenarios and 3 types of reducers). A total of 157 participants completed the survey. Participants with a prediction accuracy lower than (random choice) were excluded from the survey. Each experiment ran for approximately 30 minutes. We compensated each participant with 5 USD, with an extra bonus of 1 USD for participants with high accuracy. of participants were males, were females and specified their own gender. Participants were aged between 23 and 70 ().

| Scenario | Reducer type | Accuracy | Description Similarity | Confidence | Quality: Understand | Quality: Satisfaction | Quality: Sufficiency | Quality: Completeness |
|---|---|---|---|---|---|---|---|---|
| RI | NMF | 74.4% ± 9.2% | 0.59 ± 0.1 | 77.7% ± 13.0% | 4.3 ± 0.6 | 4.1 ± 0.6 | 3.8 ± 0.8 | 3.7 ± 1.2 |
| RI | Cluster | 66.3% ± 13.8% | 0.56 ± 0.08 | 75.6% ± 13.8% | 4.2 ± 0.7 | 3.8 ± 1.0 | 3.9 ± 1.0 | 3.6 ± 1.1 |
| RI | PCA | 37.8% ± 5.9% | 0.52 ± 0.08 | 78.3% ± 14.7% | 4.0 ± 0.9 | 3.8 ± 1.1 | 3.8 ± 1.1 | 3.7 ± 1.2 |
| II | NMF | 62.6% ± 18.6% | 0.57 ± 0.08 | 69.3% ± 13.2% | 3.5 ± 1.0 | 3.4 ± 1.3 | 3.3 ± 1.1 | 3.4 ± 1.3 |
| II | Cluster | 44.8% ± 13.2% | 0.53 ± 0.09 | 75.1% ± 13.7% | 3.9 ± 1.1 | 3.6 ± 1.2 | 3.6 ± 1.2 | 3.5 ± 1.4 |
| II | PCA | 40.0% ± 8.6% | 0.49 ± 0.08 | 76.0% ± 13.0% | 3.8 ± 0.9 | 3.7 ± 1.1 | 3.4 ± 1.2 | 3.2 ± 1.3 |
| RC | NMF | 81.1% ± 8.4% | 0.7 ± 0.04 | 79.5% ± 10.8% | 4.1 ± 0.8 | 3.7 ± 0.9 | 3.4 ± 1.2 | 3.5 ± 1.1 |
| RC | Cluster | 78.6% ± 15.5% | 0.7 ± 0.05 | 75.0% ± 18.7% | 3.9 ± 1.0 | 4.1 ± 1.0 | 4.0 ± 1.0 | 3.9 ± 1.1 |
| RC | PCA | 57.0% ± 11.6% | 0.59 ± 0.03 | 61.1% ± 17.6% | 3.6 ± 1.0 | 3.0 ± 1.2 | 3.4 ± 1.2 | 3.2 ± 1.2 |

| Scenario | Accuracy | Description Similarity | Confidence | Quality: Understand | Quality: Satisfaction | Quality: Sufficiency | Quality: Completeness |
|---|---|---|---|---|---|---|---|
| RI | <0.001 | 0.131 | 0.841 | 0.446 | 0.592 | 0.941 | 0.948 |
| II | <0.001 | 0.064 | 0.304 | 0.493 | 0.752 | 0.690 | 0.844 |
| RC | <0.001 | <0.001 | 0.004 | 0.283 | 0.016 | 0.219 | 0.174 |

| Scenario | Reducer Pair | Accuracy | Description Similarity | Confidence | Quality: Understand | Quality: Satisfaction | Quality: Sufficiency | Quality: Completeness |
|---|---|---|---|---|---|---|---|---|
| RI | NMF + Cluster | 0.053 | 0.454 | 0.650 | 0.489 | 0.328 | 0.743 | 0.744 |
| RI | NMF + PCA | <0.001 | 0.058 | 0.904 | 0.222 | 0.350 | 1.00 | 0.893 |
| RI | Cluster + PCA | <0.001 | 0.182 | 0.589 | 0.549 | 0.979 | 0.783 | 0.849 |
| II | NMF + Cluster | 0.006 | 0.298 | 0.246 | 0.277 | 0.693 | 0.387 | 0.911 |
| II | NMF + PCA | <0.001 | 0.016 | 0.152 | 0.429 | 0.451 | 0.739 | 0.643 |
| II | Cluster + PCA | 0.251 | 0.204 | 0.864 | 0.659 | 0.762 | 0.604 | 0.597 |
| RC | NMF + Cluster | 0.558 | 0.863 | 0.402 | 0.605 | 0.261 | 0.143 | 0.293 |
| RC | NMF + PCA | <0.001 | <0.001 | <0.001 | 0.116 | 0.074 | 0.894 | 0.411 |
| RC | Cluster + PCA | <0.001 | <0.001 | 0.029 | 0.304 | 0.008 | 0.108 | 0.068 |

Table 1: Top: Mean and standard deviation of prediction accuracy, description similarity, confidence and quality comparison for 9 different groups. Middle: ANOVA test p values for each scenario. Bottom: T-test p values for each pair of reducers.

Results: Table 1 shows the results of the human-subject experiment. We ran an ANOVA test on the results from each scenario. Then, for each pair of reducers, we ran a T-test for comparison. According to the significance tests for the prediction task, NCAVs from NMF are more interpretable than CAVs from PCA (significant at the level), and CAVs from Clustering are more interpretable than CAVs from PCA (significant at the level in most cases). For description similarity, results are not significant (not significant at the level in most cases). Most CAVs contain some meaningful information, and participants are confident about their choices. There is no significant difference in confidence and quality scores in most cases (not significant at the level). We conclude that NCAVs from NMF are more interpretable than CAVs from PCA, and that NCAVs are at least as interpretable as CAVs from Clustering.

Figure 5: With two concepts, ‘mouth’ and ‘eyes’, measured in two dimensions (only positive values reflect concepts), different reducers provide different directions to represent concepts. PCA learns less meaningful but efficient directions. Clustering methods can provide meaningful directions towards the cluster centres but are the least efficient. NMF may provide meaningful directions with fewer dimensions.

We observe from our experiments that reducers can help generate meaningful concepts from feature maps, but that their fidelity and interpretability differ. The three reducers differ in theory, and here we propose an explanation for this phenomenon. Figure 5 shows a distribution of some concept instances. Each dot reflects an instance with some concept scores. Because of the activation functions in CNNs, we assume that only positive values are meaningful in CNN models. One axis could represent the concept of ‘mouth’ and the other may reflect ‘eyes’. PCA introduces a new intersection of dimensions (a bias) away from the origin, so one of its dimensions is meaningless (it points towards negative values). Clustering methods provide correct concept directions (from the origin to the centre of each cluster), but they may need more clusters for the same fidelity (clustering is based on groups of data rather than directions). They may also provide some similar concepts (the bottom-left and upper-right clusters have similar directions), and clusters may be influenced by isolated instances and yield meaningless concepts. NMF, in contrast, provides correct concept directions in an efficient way, provided that only positive values reflect meaningful concepts.
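The point about PCA directions can be illustrated on synthetic data. The two "concept" directions and the sample sizes below are invented, and PCA is computed directly from a singular value decomposition of the centred data rather than with any particular library class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-negative 2-D scores spread along two positive directions
# (hypothetical 'mouth' and 'eyes' concepts).
mouth = rng.random((100, 1)) * np.array([[1.0, 0.2]])
eyes = rng.random((100, 1)) * np.array([[0.3, 1.0]])
X = np.vstack([mouth, eyes])

# PCA: centre the data, then take the right singular vectors as components.
centre = X.mean(axis=0)
_, _, components = np.linalg.svd(X - centre, full_matrices=False)

# A component "points to negative values" if its entries have mixed signs.
has_mixed_signs = [bool(row.min() < 0 < row.max()) for row in components]
```

In two dimensions the two components are orthogonal, so unless they align exactly with the axes at least one of them must mix positive and negative entries, even though every data point is non-negative. The centre also sits away from the origin, which corresponds to the extra bias mentioned above.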

5 Related Work

This work focuses on explanations for pre-trained models. Common methods provide explanations based on input-level feature importance. Some methods provide model-agnostic explanations based on the importance of image segments [ribeiro2016should, lundberg2017unified]. Saliency maps provide pixel-level feature importance for images based on gradients [shrikumar2017learning, bach2015pixel, smilkov2017smoothgrad]. However, some papers point out the unreliability of saliency methods [kindermans2017reliability, NIPS2019_9511]. CAM is another type of approach, providing heatmaps to indicate where the image activates the target class most, based on CNN weights [zhou2016learning, selvaraju2017grad]. Other than input-level explanations, some papers build explanations from feature maps inside the CNN model and provide concept-level explanations based on supervised learning [kim2018interpretability, bau2017network, zhou2018interpretable]. ACE [ghorbani2019towards] relaxes the need for a labelled dataset by using unsupervised learning. Learned concepts in the form of vectors can also be visualised by optimization methods [olah2017feature]. Other than learning from the feature maps, some works modify the structure of CNN models to provide concept-level explanations through the model itself [hendricks2016generating, zhang2018interpretable, chen2019looks]. NMF provides only non-negative results [lee1999learning], and it can provide meaningful concepts for CNN models [olah2018the].

6 Conclusion

We provide a framework for concept-based explanations of CNN models, building on the post-training explanation method ACE. By using feature maps inside the CNN model, we can gather interpretable concept vectors. We also show that, under the requirements of fidelity and interpretability, NCAVs from NMF provide better explanations than Clustering and PCA methods. PCA provides CAVs with better fidelity, but they lack interpretability. CAVs from Clustering methods are interpretable but not faithful.


This research was undertaken using the LIEF HPC-GPGPU Facility hosted at the University of Melbourne. This Facility was established with the assistance of LIEF Grant LE170100200.