On Concept-Based Explanations in Deep Neural Networks

by   Chih-Kuan Yeh, et al.

Deep neural networks (DNNs) build high-level intelligence on low-level raw features. Understanding of this high-level intelligence can be enabled by deciphering the concepts they base their decisions on, as human-level thinking. In this paper, we study concept-based explainability for DNNs in a systematic framework. First, we define the notion of completeness, which quantifies how sufficient a particular set of concepts is in explaining a model's prediction behavior. Based on performance and variability motivations, we propose two definitions to quantify completeness. We show that under degenerate conditions, our method is equivalent to Principal Component Analysis. Next, we propose a concept discovery method that considers two additional constraints to encourage the interpretability of the discovered concepts. We use game-theoretic notions to aggregate over sets to define an importance score for each discovered concept, which we call ConceptSHAP. On specifically-designed synthetic datasets and real-world text and image datasets, we validate the effectiveness of our framework in finding concepts that are complete in explaining the decision, and interpretable.




1 Introduction

Deep neural networks (DNNs) have shown great success in numerous tasks [goodfellow2016deep], from understanding images [scalable_image] to answering questions [devlin2018bert]. Yet, in many scenarios their lack of explainability serves as a bottleneck against their real-world impact, especially in high-stake decisions such as in medicine, transportation, and finance, where such explanations help identify systematic failure cases, comply with regulations, and provide feedback to model builders. This has thus led to increasing interest in human-like explanations of DNNs.

The most commonly used methods to explain DNNs explain each prediction by quantifying the importance of each input feature [ribeiro2016should, lundberg2017unified]. However, such explanations typically explain the behavior locally for each case, rather than globally explaining how the model makes its decisions. Also, input features (such as raw pixel values), and weights on them, are not necessarily the most effective explanations for human understanding. Instead, “concept-based explanations” characterize the global behavior of a DNN in a way understandable to humans, by explaining how DNNs use concepts in arriving at particular decisions. Such concept-based thinking, which extracts similarities from numerous examples and groups them systematically based on their resemblance, has been shown to play an essential role in how human minds make generalizations [ARMSTRONG1983263, tenenbaum1999bayesian]. With a similar motivation, “concepts” can explain the decision-making rationale of DNNs and their generalizable knowledge. A few recent studies have thus focused on bringing such concept-based explainability to DNNs. Based on the common implicit assumption that the concepts should lie in certain linear subspaces of some intermediate DNN activations, they aim to find such concepts efficiently and relate them to data. These range from supervised approaches [kim2018interpretability, zhou2018interpretable] that obtain concept representations given human-labeled data on salient concepts, to purely unsupervised approaches that provide concept explanations automatically without human labeling, such as k-means clustering of DNN activations [ghorbani2019automating] and a self-interpretable Bayesian generative model [bouchacourt2019educe]. A key motivating question we ask in this paper is whether we can build on such unsupervised approaches to extract concepts that are not only representative of the DNN activations but also sufficiently predictive of the DNN function itself.

This leads naturally to a crucial unanswered question in concept-based explanation: how to evaluate whether a set of concepts is sufficient for prediction. Previous concept-based explanations select concepts that are salient to a particular class [kim2018interpretability]. However, selecting a set of salient concepts does not guarantee that these concepts are sufficient for prediction. The notion of explanations that are sufficient for prediction is also called the “completeness” of explanations [gilpin2018explaining], which is acknowledged to be valuable for evaluating explanations [yang2019evaluating]. In this work, we propose such a completeness metric for a given set of concept explanations. The completeness measurement can be applied to a set of concept vectors that lie in the span of some intermediate DNN layer activations, which is a general assumption in previous concept-based explanation works [kim2018interpretability]. The core idea is that, by projecting the activations onto the span of the concept vectors, we keep only the information that can be explained by the concepts and discard the information that is orthogonal to all concepts. Thus, when projecting activations onto the span of the concept vectors results in no loss in prediction accuracy, we regard the concepts as “complete” (i.e. sufficient for prediction).

Figure 1: The overview of our concept discovering algorithm. Given a deep classification model, we first provide semantically meaningful clusters by segmentation followed by k-means clustering as in [ghorbani2019automating]. Then, we discover complete and interpretable concepts under the constraint that each concept is salient to one (or a few) unique cluster, while projecting features onto the span of concept vectors does not deteriorate the classification performance. After the concepts of interest are retrieved, we can calculate the importance of each concept and the classes where each concept is the most important by ConceptSHAP.

Interestingly, we show that under a stringent degeneracy condition on the DNNs, principal component analysis (PCA) on the DNN activations maximizes these concept completeness metrics. Such degeneracy assumptions are unlikely to hold in general, so maximizing these completeness metrics can be viewed as a generalization of PCA that additionally takes the DNN model into account. However, the resulting “principal components” are not guaranteed to be interpretable to humans. We thus build on the concept-interpretability principles proposed in ghorbani2019automating, and additionally consider carefully designed objectives that favor concepts that are more semantically meaningful to humans. A key facet of our approach is that it can work without any human supervision, which reduces the human labeling cost of providing explanations.

After a set of highly complete concepts is discovered, we use game-theoretic notions to aggregate over sets to define the contextualized importance of a concept, which we call ConceptSHAP. ConceptSHAP is shown to be the only scoring method that satisfies a set of axioms, and it explains how much each concept contributes to the total completeness score. We also derive a class-specific version of ConceptSHAP that decomposes the ConceptSHAP score with respect to each class in the multi-class classification setting, which can be used to find the concepts that contribute the most to a specific class. To verify the effectiveness of our completeness-aware concept discovery method, we create a synthetic dataset where we can obtain the ground truth concepts and test whether existing methods can retrieve them. We find that our method retrieves the ground truth concepts better than all compared methods. We also demonstrate examples from real-world language and vision datasets to show that our concept discovery algorithm provides additional insights into the behavior of the model.

2 Completeness of Concepts

Problem setting:

We are given a set of training examples , corresponding labels , and a DNN that is trained to map given inputs (with dimension ) to the labels (with dimension ). We choose an intermediate layer of the DNN, and define the operation for generating the intermediate features from the input as and the feed-forward map from the intermediate layer to the logit layer as , yielding the decomposition . We define the data matrix as ; the corresponding feature matrix as , and the corresponding prediction matrix as . Assume that there is a set of concepts denoted by vectors that represent linear directions in some activation space, given by a concept discovery algorithm. We define the concept matrix as .

Next, we propose two mathematical definitions that capture how complete a given set of concepts is. Both definitions are based on the idea that completeness should quantify how sufficient a particular set of concepts is in explaining the model’s behavior. A low completeness score for a set of concepts indicates that the corresponding concepts do not capture the model behavior fully, and that the model bases its decisions on factors other than the given concepts. We propose two metrics of completeness based on two different assumptions, as we discuss below.

Assumption 1:

If the given set of concepts is complete, then projecting the intermediate features of an input onto the feature subspace spanned by the concepts (the concept space) should not deteriorate the model performance. We define the projection of some input embedding onto the subspace spanned by as


We define the completeness metric on a set of validation data with T data points as based on the assumption that projecting input features onto the span of a complete set of concepts should not reduce the model prediction performance.

Definition 2.1.

Given a prediction model , a set of concept vectors , and some loss metric , we define the completeness score as:


where to ensure that .

We omit the dependency of , , , and of for notational simplicity. When is high, the network maintains high accuracy even after projection, which supports that the set of discovered concepts holds sufficient information for prediction.
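The projection idea behind Assumption 1 can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: the names `project_onto_span` and `completeness_accuracy` are hypothetical, the loss is taken to be classification accuracy, and the normalization by a chance-level accuracy `chance_acc` is an assumption standing in for the exact form of Definition 2.1.

```python
import numpy as np


def project_onto_span(features, concepts):
    """Orthogonally project each row of `features` (n x d) onto the span
    of the columns of `concepts` (d x m), via least squares."""
    coef, *_ = np.linalg.lstsq(concepts, features.T, rcond=None)
    return (concepts @ coef).T


def completeness_accuracy(head, features, labels, concepts, chance_acc):
    """Assumed score: accuracy retained after projection, rescaled so that
    chance-level accuracy maps to 0 and the unprojected accuracy to 1."""
    acc_full = np.mean(head(features).argmax(axis=1) == labels)
    projected = project_onto_span(features, concepts)
    acc_proj = np.mean(head(projected).argmax(axis=1) == labels)
    return float((acc_proj - chance_acc) / (acc_full - chance_acc))


# Toy setup (hypothetical): a linear head that reads only the first two
# feature dimensions, so those two directions form a complete concept set.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))
weights = np.zeros((4, 2))
weights[0, 0] = 1.0
weights[1, 1] = 1.0


def head(z):
    return z @ weights


labels = head(features).argmax(axis=1)
complete_concepts = np.eye(4)[:, :2]
score_complete = completeness_accuracy(
    head, features, labels, complete_concepts, chance_acc=0.5
)
```

In this toy case the concept span contains everything the head uses, so projecting leaves the predictions unchanged and the assumed score reaches its maximum.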

Assumption 2:

The second assumption is that if we remove all useful concept information for a classification task, the model should fail to discriminate between different classes. Thus, when all salient information is removed from the network, prediction scores for examples in class A will not differ much from those for examples in class B. We define the data matrix of the validation set as . To quantify how much the prediction score varies across data samples, we use the sample variance of the predictions: , where , and stands for the trace. Then, we define the second completeness metric following this assumption.

Definition 2.2.

Given a prediction model , and a set of concept vectors , we define the completeness score as:


Based on our assumption 2, the variance of the prediction gets lower after useful concept information is removed from the data, yielding a high completeness score .
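A minimal sketch of the variance-based idea follows. It assumes, as an illustration only, that the score is one minus the ratio between the prediction variance after removing the concept span from the features and the original prediction variance; the function names are hypothetical and the exact form of Definition 2.2 may differ.

```python
import numpy as np


def prediction_variance(preds):
    """Trace of the sample covariance matrix of the predictions (n x k)."""
    centered = preds - preds.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (preds.shape[0] - 1)
    return float(np.trace(cov))


def remove_concept_span(features, concepts):
    """Subtract from each feature row its projection onto span(concepts)."""
    coef, *_ = np.linalg.lstsq(concepts, features.T, rcond=None)
    return features - (concepts @ coef).T


def completeness_variance(head, features, concepts):
    """Assumed score: 1 - Var(predictions without concepts) / Var(predictions)."""
    var_full = prediction_variance(head(features))
    var_removed = prediction_variance(head(remove_concept_span(features, concepts)))
    return 1.0 - var_removed / var_full


# Toy setup (hypothetical): the head only reads feature dimensions 0 and 1.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 4))
weights = np.zeros((4, 2))
weights[0, 0] = 1.0
weights[1, 1] = 1.0


def head(z):
    return z @ weights


score_used = completeness_variance(head, features, np.eye(4)[:, :2])
score_unused = completeness_variance(head, features, np.eye(4)[:, 2:])
```

Removing the directions the head relies on collapses the prediction variance (score near 1), while removing unused directions leaves it unchanged (score near 0).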

We now show that under degenerate assumptions, the top PCA vectors of maximize the completeness score for a set of concept vectors. Just as the top PCA vectors are designed to capture as much information in the data as possible, a set of concepts with a high completeness score preserves the information in the data that the model needs to reach satisfactory predictions.

Proposition 2.1.

When h is an isometry function that maps from , where L is the loss metric in equation 2 and (i.e. the loss is minimized), the first m PCA vectors maximize .

Proposition 2.2.

When h is an isometry function that maps from , and each dimension of is uncorrelated with unit variance, the first m PCA vectors maximize .

We underline the two main differences between the concept vectors that maximize the completeness score and the PCA vectors. First, the propositions depend on degeneracy assumptions, such as isometry of a DNN, which may not hold in practice. In general, the concepts that maximize the completeness score take the prediction of the DNN into account, which can be seen as a generalization of the original PCA. Second, since the completeness score depends only on the span of the set of concept vectors, any concept vectors whose span is equal to the span of the top PCA vectors also maximize the completeness score (i.e. the set of vectors that maximizes the completeness is not unique). Each PCA vector is constrained to minimize the reconstruction error and to be orthogonal to the other PCA directions. On the other hand, the discovered concept vectors that maximize the completeness can be designed so that each concept is interpretable and semantically meaningful to humans, which will be further explained in the next section.
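The connection to PCA can be checked numerically in a simplified setting. The sketch below assumes a linear isometric head (a random orthogonal matrix), mean-centered features, and the same assumed variance-ratio score as an illustrative stand-in for the second completeness metric; all names are hypothetical. Under these assumptions the residual prediction variance equals the PCA reconstruction error, so the top PCA directions cannot score lower than any other subspace of the same dimension.

```python
import numpy as np


def residual_variance_score(head, features, concepts):
    """Assumed score: 1 - Var(head(residual)) / Var(head(features)), where the
    residual is what remains after projecting onto span(concepts)."""
    coef, *_ = np.linalg.lstsq(concepts, features.T, rcond=None)
    residual = features - (concepts @ coef).T

    def total_variance(preds):
        centered = preds - preds.mean(axis=0, keepdims=True)
        return float((centered ** 2).sum() / (preds.shape[0] - 1))

    return 1.0 - total_variance(head(residual)) / total_variance(head(features))


rng = np.random.default_rng(1)
raw = rng.normal(size=(500, 6)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2, 0.1])
features = raw - raw.mean(axis=0)  # center so PCA directions pass through the origin

# A random orthogonal matrix is a linear isometry (hypothetical head).
rotation, _ = np.linalg.qr(rng.normal(size=(6, 6)))


def isometric_head(z):
    return z @ rotation


# Top-2 PCA directions are the leading right singular vectors of the centered data.
_, _, vt = np.linalg.svd(features, full_matrices=False)
pca_concepts = vt[:2].T
random_concepts, _ = np.linalg.qr(rng.normal(size=(6, 2)))

score_pca = residual_variance_score(isometric_head, features, pca_concepts)
score_random = residual_variance_score(isometric_head, features, random_concepts)
```

Replacing `pca_concepts` by any other basis of the same two-dimensional subspace gives the same score, which illustrates the non-uniqueness noted above.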

3 Discovering Complete and Interpretable Concepts

Our goal is to discover a set of maximally complete concepts, where each concept is also interpretable and semantically meaningful. ghorbani2019automating has listed meaningfulness, coherency, and saliency as the desired properties for concept-based explanations. Our work on completeness is a crucial addition to this set: beyond requiring that concepts be meaningful, coherent, and salient, we ensure that they are sufficient for the model's predictions.

We assume that we are given some candidate clusters of concepts (which can be given by human labeling or self-discovery) and that each cluster shares some feature attributes that are coherent and semantically meaningful to humans (which matches the two desired properties in ghorbani2019automating). We define the feature matrix of cluster i as where are samples that belong to cluster i. We denote the feature mean of cluster as . Clusters can be obtained by human labeling [kim2018interpretability] or by unsupervised grouping of relevant input features (e.g. segmentation of images based on grouping of pixels) [ghorbani2019automating]. In either case, we do not know which clusters contain information that is useful to the model we try to explain. We aim to find a minimum set of concepts that are maximally complete with respect to the prediction model. Additionally, we constrain each concept to be salient to one cluster only, so that each concept direction is semantically meaningful to humans. To discriminate between different concepts (for coherency), we constrain different concepts not to be salient to the same cluster.

We now define our objective function for discovering a set of complete and interpretable concepts . A primary goal is maximizing completeness (which can be or ), such that the set of concepts fully explains the model behavior. In addition, we introduce two regularization terms for interpretability (which can be considered a generalization of the orthogonality constraint of PCA). We introduce cluster-sparsity regularization to encourage each concept to be salient to a minimal number of clusters, and concept-sparsity regularization to encourage different concepts not to be salient to the same cluster, i.e. each cluster to be salient to at most one concept. Given some clusters , a set of training examples , and a pre-trained prediction model , the overall objective function (to minimize) becomes:


where and are loss coefficients. To formulate the cluster-sparsity regularization and concept-sparsity regularization , we first formally introduce the saliency score between concept to cluster as:

We note that the saliency score is normalized such that the saliency scores between any concept and all clusters have unit norm. When the saliency score between concept and cluster is large, can differentiate samples from cluster from samples in a random cluster, and thus is salient to . To encourage each concept to differentiate only a small number of clusters from random clusters, we regularize the L1 norm of the saliency score over every concept-cluster pair (which can be seen as the sparse filtering objective in ngiam2011sparse), leading to the cluster-sparsity regularization loss:

which encourages sparse saliency scores. To constrain that different concepts are not salient to the same cluster, we penalize the pairwise saliency score product between every pair of concepts for the same cluster, leading to the concept-sparsity regularization loss:

If two concepts are both salient with respect to the same cluster, their pairwise saliency score product will be large and thus the concept-sparsity regularization loss will be large. We note that each concept has to be salient to some cluster, but a cluster need not be salient to any concept. Therefore, we typically assume that we have more clusters than concepts (i.e. ).
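The two regularizers described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's formulas: the saliency of a concept for a cluster is assumed here to be the absolute inner product between the concept vector and the cluster mean (centered by the mean over all clusters), normalized so that each concept's scores over all clusters have unit L2 norm. The function names are hypothetical.

```python
import numpy as np


def saliency_scores(concepts, cluster_means):
    """Assumed saliency: |c_i . (mu_j - mean of all cluster means)|, normalized
    per concept to unit L2 norm across clusters.
    concepts: d x m, cluster_means: K x d. Returns an m x K matrix."""
    centered = cluster_means - cluster_means.mean(axis=0, keepdims=True)
    raw = np.abs(centered @ concepts).T
    norms = np.linalg.norm(raw, axis=1, keepdims=True)
    return raw / np.maximum(norms, 1e-12)


def cluster_sparsity_loss(saliency):
    """L1 norm over all concept-cluster pairs; small when each concept is
    salient to few clusters (rows already have unit L2 norm)."""
    return float(np.abs(saliency).sum())


def concept_sparsity_loss(saliency):
    """Sum over clusters of saliency products between distinct concepts;
    large when two concepts are salient to the same cluster."""
    gram = saliency @ saliency.T  # entry (i, k) = sum_j s_ij * s_kj
    return float(gram.sum() - np.trace(gram))


# Two hand-written saliency matrices (2 concepts, 3 clusters) for comparison.
disjoint = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
overlapping = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
spread = np.full((2, 3), 1.0 / np.sqrt(3.0))
```

With unit-norm rows, the L1 penalty is smallest when each concept concentrates its saliency on a single cluster, and the pairwise-product penalty vanishes when no two concepts share a cluster.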

4 How Important is Each Concept?

ConceptSHAP to quantify concept importance:

Given a set of concepts with a high completeness score, we would like to evaluate the importance of each individual concept, specifically, by quantifying how much each individual concept contributes to the final completeness score. Let denote the importance score for concept , such that quantifies how much of the completeness score is contributed by . Motivated by its successful applications in quantifying attributes in what-if scenarios for complex systems, we adapt Shapley values [shapley_1988, lundberg2017unified], to fairly assign the importance of each concept (which we abbreviate as ConceptSHAP):

Definition 4.1.

Given a set of concepts and some completeness metric , we define the ConceptSHAP for concept as

The main benefit of using Shapley value to assign importance is that Shapley value can be shown to uniquely satisfy a set of desired axioms, listed in the following proposition:

Proposition 4.1.

Given a set of concepts , a completeness metric , and some importance score for each concept that depends on the completeness metric , the score defined by ConceptSHAP is the unique importance assignment that satisfies the following four axioms:

  • Efficiency: The importance values of all concepts should sum up to the total completeness value, .

  • Symmetry: For two equivalent concepts, which satisfy for every subset , .

  • Dummy: If for every subset , then .

  • Additivity: If and have importance value and respectively, then the importance value of the sum of two completeness metric should be equal to the sum of the two importance values, i.e, for all i.

The efficiency axiom distributes the completeness score of all concepts to the individual concepts. The symmetry axiom guarantees that two concepts that behave the same get the same importance score, for fairness. The dummy axiom guarantees that concepts that do not affect the completeness get an importance score of 0. The additivity axiom guarantees that decomposability in the completeness leads to decomposability in the importance score, and that scaling the completeness does not change the relative importance ratio between concepts.
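As an illustration of how a Shapley-style importance can be computed from a completeness metric, the sketch below enumerates all subsets of concepts and applies the standard Shapley value weighting. The function name `concept_shap` and the toy additive completeness function are hypothetical; exact enumeration is only practical for a small number of concepts.

```python
from itertools import combinations
from math import factorial


def concept_shap(num_concepts, completeness):
    """Standard Shapley values, where `completeness` maps a frozenset of
    concept indices to a score. Returns one importance value per concept."""
    m = num_concepts
    scores = []
    for i in range(m):
        others = [k for k in range(m) if k != i]
        total = 0.0
        for size in range(m):
            # Shapley weight for subsets of this size that exclude concept i.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (completeness(s | {i}) - completeness(s))
        scores.append(total)
    return scores


# Toy completeness (hypothetical): each concept adds a fixed amount, and the
# third concept contributes nothing, so it plays the role of a dummy concept.
contributions = [0.5, 0.3, 0.0]


def additive_completeness(subset):
    return sum(contributions[k] for k in subset)


importances = concept_shap(3, additive_completeness)
```

For this additive toy metric each concept's importance equals its own contribution, the importances sum to the completeness of the full set (efficiency), and the non-contributing concept receives zero (dummy).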

Per-class saliency of concepts:

In multi-class classification, it may be more informative to obtain a set of related concepts that contribute to the prediction for a specific class, instead of the global contribution (i.e. concepts that are important to all classes). To obtain the concept importance score for each class, we first define the completeness score with respect to one class by considering only the data points that belong to that class, which is formalized as:

Definition 4.2.

Given a prediction model and a set of concept vectors that lie in the feature subspace in , we define the completeness score for class as:


where is the set of validation data where ground truth label is and . Given the completeness for a specific class, we define the ConceptSHAP for concept i with respect to class j as:

Definition 4.3.

Given a prediction model and a set of concept vectors that lie in the feature subspace in , we define the ConceptSHAP for concept i with respect to class j as:


For each class j, we may select the concepts with the highest conceptSHAP score with respect to class j. We note that and thus with the additivity axiom, .

5 Experiments

5.1 Synthetic Data with ground truth concepts

Figure 2: Two random training images and the respective ground truth concepts that are positive, along with a table that matches ground truth concepts to shapes. Each object shape in the image corresponds to a ground truth concept (with random color and location), and the ground truth label depends solely on ground truth concepts 1 to 5. Only the training image and ground truth label are provided during training (in the unsupervised case), and the goal of the concept discovery algorithm is to correctly retrieve ground truth concepts to .


We construct a synthetic image dataset with known complete concepts to evaluate whether the proposed automatic concept discovery algorithm can accurately extract the ground truth concepts. For each sample, we draw a 15-dimensional binary variable that serves as the ground truth candidate concepts , …, , generated independently for each dimension from a Bernoulli distribution with . From the ground truth concepts , we generate input data x and output label y. For the label target , we construct a 15-dimensional multi-label target for each sample, where the target is a function that depends only on the first 5 dimensions of the 15-dimensional . For example, . The details of generating this dataset are in the appendix. Therefore, the minimum set of ground truth variables is by construction. For the input data , we construct a toy image dataset where each concept is mapped to a specific shape, and the image contains the specific shape if and only if the concept . For example, if , a star (with random color and location) will appear in the image , and if , there will be no star in the image . The map of concepts to shapes and two example images are given in Figure 2.
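The structure of this construction can be sketched in a few lines. The Bernoulli parameter and the label function are not reproduced here, so `p` is left as an argument and `toy_labels` uses a purely hypothetical rule (an XOR of two of the first five concepts per target dimension). The only property the sketch is meant to show is that the 15-dimensional target depends on the first 5 concepts alone; rendering the shapes into images is omitted.

```python
import numpy as np


def sample_concepts(num_samples, p, rng, num_concepts=15):
    """Draw independent Bernoulli(p) binary concepts, one row per sample."""
    return (rng.random((num_samples, num_concepts)) < p).astype(np.int64)


def toy_labels(concepts, num_targets=15, num_relevant=5):
    """Hypothetical label rule: each target dimension is the XOR of two of the
    first `num_relevant` concepts. Concepts beyond those are never read."""
    relevant = concepts[:, :num_relevant]
    targets = np.empty((concepts.shape[0], num_targets), dtype=np.int64)
    for k in range(num_targets):
        first = relevant[:, k % num_relevant]
        second = relevant[:, (k + 1 + k // num_relevant) % num_relevant]
        targets[:, k] = first ^ second
    return targets


rng = np.random.default_rng(0)
# p=0.5 is an arbitrary choice for illustration, not the value used in the paper.
concepts = sample_concepts(100, p=0.5, rng=rng)
labels = toy_labels(concepts)
```

Flipping any of concepts 6 to 15 leaves the labels unchanged, which is what makes the first five concepts the minimum complete set in this kind of construction.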

For the input clusters to our concept discovery algorithm, we either provide the ground truth clustering or apply superpixel segmentation followed by k-means clustering as in ghorbani2019automating; we call the resulting methods ours-supervised and ours-unsupervised respectively. In total, we use 48k training samples and 12k evaluation samples, where each ground truth concept corresponds to some specific shape in the image. We train a convolutional neural network with 6 layers which achieves accuracy, and take the first fully connected layer as the feature layer (which is in the problem definition).

Evaluation metrics:

Let the known concepts be , , …, , and assume we discover some concept vectors , …, . We would like to evaluate how closely the discovered concept vectors align with the actual ground truth concepts. For a concept vector to align with a ground truth concept , we assume that the ground truth concept can be linearly separated by the concept vector direction. More formally, we measure the accuracy of the best linear classifier with as the weight vector applied on the binary classification problem where is the target.

We then evaluate how well the set of discovered concepts matches the set of ground truth concepts , …, as

which measures the best average accuracy by assigning the best concept vector to differentiate each ground truth concept.
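A minimal sketch of this alignment measurement is given below, under the assumption that "the best linear classifier with the concept vector as the weight vector" means choosing only a threshold and a sign on the one-dimensional projection of the intermediate features onto that vector. The function names and array layouts are hypothetical.

```python
import numpy as np


def best_threshold_accuracy(projection, target):
    """Best accuracy of a classifier of the form sign * (projection > t),
    searching over all thresholds t and both signs."""
    target = target.astype(bool)
    thresholds = np.concatenate(([-np.inf], np.unique(projection)))
    best = 0.0
    for t in thresholds:
        acc = float(np.mean((projection > t) == target))
        best = max(best, acc, 1.0 - acc)  # 1 - acc covers the flipped sign
    return best


def alignment_score(features, concepts, ground_truth):
    """For each ground truth concept take the best-matching discovered concept
    vector, then average. features: n x d, concepts: d x m, ground_truth: n x G."""
    projections = features @ concepts
    per_concept = []
    for j in range(ground_truth.shape[1]):
        per_concept.append(
            max(
                best_threshold_accuracy(projections[:, i], ground_truth[:, j])
                for i in range(concepts.shape[1])
            )
        )
    return float(np.mean(per_concept))


# Toy check (hypothetical): features that encode two binary concepts directly.
ground_truth = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 0], [0, 1]])
features = ground_truth.astype(float)
aligned = alignment_score(features, np.eye(2), ground_truth)
```

Because the maximum is taken separately for each ground truth concept, the score is insensitive to the order in which the discovered concepts are returned.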


Figure 3: Visualization of the nearest neighbors of each discovered concept in ours-supervised and TCAV, along with ground truth concepts 1 to 5, which are constructed to be the minimum set of ground truth variables. We note that only the shape is relevant to the concept, as the color and location can be random. We show that each of our discovered concepts in ours-supervised corresponds to one of ground truth concepts 1 to 5 (in a random order). While TCAV also shows meaningful discovered concepts, it fails to retrieve all ground truth concepts that are used by the model. Higher resolution examples are shown in the appendix due to space constraints.

We summarize the results in Table 1, where ours-supervised and TCAV take supervised clusters as input, and ours-unsupervised, ACE, and Raw-Clustering take the clustered segments as input. For supervised clusters, we randomly choose examples where for cluster j. The terms supervised and unsupervised refer to whether the actual ground truth concept set is given or not. For ours-supervised 1, we maximize in equation 4; for ours-supervised 2, we maximize in equation 4. We see that both ours-supervised 1 and ours-supervised 2 obtain a higher AlignmentScore than TCAV. ours-unsupervised 1 and ours-unsupervised 2 also achieve a higher AlignmentScore than all compared baselines, which demonstrates the effectiveness of our concept discovery algorithm. We further observe that completeness 1 and 2 are complementary: maximizing completeness 1 does not necessarily lead to a higher value of completeness 2, and vice versa. Nevertheless, by jointly optimizing completeness 1 or completeness 2 along with the additional sparsity regularization with respect to the given clusters, we are able to retrieve the correct ground truth concepts. Lastly, we show the nearest neighbors (of the super-pixel segments) for the discovered concepts of ours-supervised and TCAV along with the ground truth concepts in Figure 3, to validate that our concept discovery algorithm does retrieve the correct concepts. While we only show the top-2 nearest neighbors, we note that the top-k nearest neighbor examples all belong to the same concept when k is large.

Method                 Completeness 1   Completeness 2   AlignmentScore
ours-supervised 1      1.0              0.21             0.99
ours-supervised 2      0.0              1.0              0.94
TCAV                   0.20             0.30             0.71
ours-unsupervised 1    1.0              0.22             0.94
ours-unsupervised 2    0.12             0.99             0.90
ACE                    0.27             0.37             0.71
PCA                    0.5              0.67             0.79
Raw Clustering         0.49             0.83             0.66

Table 1: The Completeness and AlignmentScore for our methods compared to the baseline methods on the synthetic dataset where the ground truth can be obtained.

5.2 Text Classification


We apply our method to the IMDB text classification dataset. The IMDB dataset contains the text of 50k movie reviews, of which 25k reviews are used as training data and 25k reviews are used for evaluation. Each review is classified as either positive or negative. We use a pre-trained model with a BERT language model [devlin2018bert] from Keras, which achieves 0.94 testing accuracy. To obtain the input clusters, we use a 10-word sliding window to obtain sub-sentences over the IMDB sentences. We then obtain the embedding for all sub-sentences, and perform k-means clustering on the positive sub-sentences and negative sub-sentences. We then run our concept discovery algorithm to obtain 5 concepts with

Concept 1 (ConceptSHAP 0.13, related class: neg)
  plot is boring the characters are neurotic needlessly offensive
  characters jess bhamra parminder nagra and jules paxton keira average chop socky all of the cast are likeable characters

Concept 2 (ConceptSHAP 0.29, related class: neg)
  that keeps on reappearing to the scene where you think
  she deserved a more studied finale than that i think think no sometimes hatred and isolation are deeper are more

Concept 3 (ConceptSHAP 0.15, related class: neg)
  i think the most frustrating thing is that the performances
  you might think to see organs yanked out of the many people think has an underlying meaning the love between

Concept 4 (ConceptSHAP 0.43, related class: pos)
  don’t wait for it to be a classic watch it
  has real potential and will be one to watch in i recommend you to watch it if you like mature

Concept 5 (ConceptSHAP 0.21, related class: pos)
  children trying to comfort them after that is all said
  paid so well after all acting is one of the it after watching it you will say that it was

Table 2: Concepts and their nearest neighbors, ConceptSHAP, and related class in IMDB.


For the 5 discovered concepts, we show the top nearest neighbors of each concept, together with the ConceptSHAP value and the related class (determined by the TCAV score) of each concept. Additional nearest-neighbor examples are shown in the appendix. We note that for all concepts, the nearest sub-sentences of each concept mostly contain a specific word, which we highlight in blue. The nearest neighbors of concept 1 mostly contain the word “characters”, the nearest neighbors of concept 2 and concept 3 mostly contain the word “think”, the nearest neighbors of concept 4 mostly contain the word “watch”, and the nearest neighbors of concept 5 mostly contain the word “after”. Looking more closely at each concept’s nearest neighbors, we find that the nearest sub-sentences of the first concept usually contain negative adjectives alongside “characters”; those of the second concept usually contain the word “think” at the first or last position, followed by disagreement towards the movie; those of the third concept usually contain “think” in the middle of the sub-sentence, followed by the reviewer’s more neutral personal opinion; those of the fourth concept often contain the phrase “watch it”, where “it” refers to the movie; and those of the fifth concept simply contain the word “after”. We find that the most salient concept by ConceptSHAP value is concept 4, where all of the top nearest neighbors explicitly mention the word “watch”, generally with a positive sentiment. We perform the TCAV test for all concepts with respect to the positive and negative classes: the first 3 concepts are significant to the class “negative” with TCAV score 1, and the last 2 concepts are significant to the class “positive” with TCAV score 1.

5.3 Image Classification


We next perform experiments on Animals with Attributes (AwA) [lampert2009learning] to classify animals into 50 classes, where we take 26905 images as training data and 2965 images as evaluation data. Each training image has a ground truth label of one of the 50 animals. We train an Inception-V3 model pre-trained on ImageNet [szegedy2016rethinking], which reaches testing accuracy. To obtain the input clusters, we employ the method of ghorbani2019automating, which performs superpixel segmentation and k-means clustering on the images to get 334 input clusters. We then run our concept discovery algorithm on these clusters to obtain 8 concepts with 0.99.

Figure 4: The Nearest Neighbors, ConceptSHAP, and related class for each concept obtained in AwA.


For each of the 8 discovered concepts, we show the top nearest neighbor patches, the ConceptSHAP value, and the related classes, i.e. those where the concept has a ConceptSHAP value at least twice as large as that of any other concept. From the nearest neighbors of each concept, we find that the concepts learned by the network mostly capture textures and colors. Since we only learn 8 concepts for 50 classes, each learned concept is useful to multiple classes. We find that the ripple texture that is most common in the ocean is significant to many marine animals. The leaf/grass concepts are often significant to animals that live in trees or pastures. We note that out of the 8 concepts learned, there are two concepts representing stripes and two concepts representing ripples. While the concept “stripe 1” seems to contain thicker stripes than “stripe 2”, we do not observe a significant difference between the top nearest neighbors of “ripple 1” and “ripple 2”. Other than this, each discovered concept seems to be meaningful and coherent to humans. We note that in some cases the related class of a concept may not necessarily contain the concept. One possible reason is that the concept may be salient because it is a “pertinent negative” for a certain class: it helps the model make the correct prediction precisely because it does not appear in images of that class. The main takeaway of this example is that the salient concepts for image classification share similarity in texture rather than shape, which coincides with the finding in geirhos2018imagenet.

6 Related Work

Various approaches have been proposed to explain the decision making of pre-trained models. Most works fall under two categories: (i) feature-based explanation methods, which attribute the decision to important input features [ribeiro2016should, lundberg2017unified, smilkov2017smoothgrad, l_c_shapley], and (ii) sample-based explanation methods, which attribute the decision to previously observed samples [koh2017understanding, yeh2018representer, khanna2019interpreting, attention_prototypical]. Among these forms of interpretability, different evaluations of explanations have been proposed, including more human-centric evaluations [lundberg2017unified, kim2018interpretability] and functionally-grounded evaluations [samek2016evaluating, kim2016examples, ancona2017towards, yeh2019on]. However, providing the most important input features or samples for a specific prediction does not necessarily give insight into how the model behaves globally, which our work aims to address with concept-based explanations. For concept-based explanations, only a few recent works are related. TCAV [kim2018interpretability] uses human-labeled data and estimates the importance of a concept with respect to a specific class. zhou2018interpretable decompose the prediction of a data sample into linear combinations of concept components. ghorbani2019automating automate TCAV by replacing human-labeled data with automatic super-pixel segmentation followed by k-means clustering. bouchacourt2019educe discover concepts by training an inherently explainable model, which trains a concept classifier along with the prediction model. While all aforementioned works define concept directions in the linear span of some activation layer of the model, our framework brings completeness and interpretability to concept discovery.

Our work is also closely related to methods that perform dimension reduction in neural network layers to obtain meaningful latent variables and to understand neural networks. chan2015pcanet cascade PCA layers to obtain satisfactory prediction performance. raghu2017svcca apply SVD followed by CCA to compare two representations of a deep model and thereby better understand the deep representations. kingma2013auto perform deep dimension reduction for generative models, where the latent space can be semantically meaningful. For example, chorowski2019unsupervised show that when learning with speech data, the latent dimension is closely related to the phonemes, which can be seen as human-relatable concepts in speech data; unsupervised_sentiment show that when learning with language data, a single unit is closely related to the sentiment.

7 Conclusions

Concept-based explanations can be a key direction for understanding how DNNs make decisions. In this paper, we study concept-based explainability in a systematic framework. First, we define the notion of completeness, which quantifies how sufficient a particular set of concepts is in explaining the model’s behavior. Based on performance and variability motivations, we propose two definitions to quantify completeness. We show that they yield the commonly-used PCA method under certain assumptions. Next, we study two additional constraints to encourage the interpretability of the discovered concepts. Through experiments on toy data and in the text and image domains, we demonstrate that our method is effective in finding concepts that are complete (in explaining the model’s prediction) and interpretable. Note that although our work focuses on post-hoc explainability of pre-trained DNNs, joint training with our proposed objective function can also be used to train an inherently-interpretable model. A future direction may be to explore whether jointly learning the concepts and the model can lead to better interpretability.


Appendix A Proof

Proof of Proposition 2.1


By the basic properties of PCA, the first PCA vectors (principal components) minimize the reconstruction error. Define the concatenation of the PCA vectors as a matrix and as the norm, the basic properties of PCA is equivalent to that for all ,

By the isometry of from the Frobenius norm to , we have

and since is equal to Y, we can rewrite to

and subsequently get that for any

Proof of Proposition 2.2


We note that the completeness only depends on the span of . If we assume the matrix to have rank , we may find a basis (by QR decomposition) that is orthonormal and has the same completeness score. Therefore, for any set of given concepts , we can replace them with a set of orthonormal concepts without loss of generality. By the basic properties of PCA, the first m PCA vectors maximize the total projected data variance on the projected space with at most m orthonormal vectors, which can be formalized as


By using the notation for the entry of vector , we may rewrite total projected variance as


The fourth equality holds since and are uncorrelated, which can be shown by calculating the co-variance between and as:

Where the last two equations follow by each dimension of is uncorrelated with unit variance and and is uncorrelated. By plugging in equation 8 into equation A, we may obtain

Define as the concatenated matrix for the orthonormal basis for the orthogonal complement of , and define by concatenating and . We know by fundamental properties of linear projections. Since all vectors in are orthogonal to vectors in and by plugging in equation 8 for , we get . By combining the observations we get

and following the isometry of D, we have

and thus the first m PCA vectors maximize . ∎
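The two PCA properties that both proofs rely on can be checked numerically. The sketch below is only an illustration on synthetic data, not part of the proofs: it compares the first m principal directions with random orthonormal bases (obtained by QR decomposition, as in the argument above) and checks that none has a lower reconstruction error or a higher projected variance. The data, dimensions, and function names are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 6, 2

# Synthetic, centered data with a different scale along each axis.
X = rng.standard_normal((n, d)) * np.array([5.0, 3.0, 1.0, 0.5, 0.3, 0.1])
X -= X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by singular value.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:m].T  # d x m matrix with orthonormal columns

def reconstruction_error(X, B):
    """Squared Frobenius error after projecting onto span(B), B orthonormal."""
    return float(np.sum((X - X @ B @ B.T) ** 2))

def projected_variance(X, B):
    """Total (unnormalized) variance of the data projected onto span(B)."""
    return float(np.sum((X @ B) ** 2))

pca_err = reconstruction_error(X, P)
pca_var = projected_variance(X, P)

random_errs, random_vars = [], []
for _ in range(1000):
    # Orthonormalize a random d x m matrix via QR.
    Q, _ = np.linalg.qr(rng.standard_normal((d, m)))
    random_errs.append(reconstruction_error(X, Q))
    random_vars.append(projected_variance(X, Q))

print(pca_err <= min(random_errs), pca_var >= max(random_vars))
```

For any orthonormal basis the two quantities sum to the squared Frobenius norm of the data, which is why minimizing one is the same as maximizing the other.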

Appendix B Additional Experiments and Settings

Detailed Experiment Settings in Toy Example

The complete list of the target y is

We create the dataset in matplotlib, where the color of each shape is sampled independently from green, red, blue, black, orange, purple, and yellow, and the location is sampled randomly with the constraint that different shapes do not coincide with each other. For hyper-parameter selection, we set so that the completeness is above 0.99 and produces reasonable results. We fix this hyper-parameter throughout all experiments to prevent exhaustive tuning and over-fitting. Scaling the hyper-parameter within the same order of magnitude produces similar results. We use 1000 images in each cluster for all methods that are compared. For selecting the concepts in TCAV and ACE, we compare the number of labels where the concept has p-value < 0.2 and choose the top 5 concepts (since even a TCAV score of 1.0 does not have p-value < 0.05). We note that we have tried many alternatives for choosing concepts for TCAV and ACE, but failed to achieve better performance for either method. The main reason may be that the ground truth contains functions such as XOR, which has a TCAV score of 0 for its inputs.
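One simple way to sample non-coinciding shape locations with independent colors is rejection sampling, sketched below. The canvas size, the bounding radius, the number of shapes, and the function name `sample_layout` are assumptions made for illustration; the actual generation code may differ, and the rendering in matplotlib is omitted.

```python
import random

COLORS = ["green", "red", "blue", "black", "orange", "purple", "yellow"]

def sample_layout(n_shapes, canvas=100.0, radius=5.0, rng=None, max_tries=10_000):
    """Return a list of (x, y, color) with pairwise non-overlapping shapes.

    Each shape is approximated by a bounding circle of the given radius
    (an assumption of this sketch); a candidate position is rejected when
    its circle would overlap a previously placed one.
    """
    rng = rng or random.Random()
    placed = []
    tries = 0
    while len(placed) < n_shapes:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place all shapes without overlap")
        x = rng.uniform(radius, canvas - radius)
        y = rng.uniform(radius, canvas - radius)
        if all((x - px) ** 2 + (y - py) ** 2 >= (2 * radius) ** 2
               for px, py, _ in placed):
            # The color is drawn independently of position and other shapes.
            placed.append((x, y, rng.choice(COLORS)))
    return placed

layout = sample_layout(5, rng=random.Random(0))
for x, y, color in layout:
    print(f"{color:>7s} at ({x:5.1f}, {y:5.1f})")
```

Because colors are drawn independently of the shapes, color carries no information about the target, which is what makes it a useful distractor in the toy example.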

Implementation Details

For calculating ConceptSHAP, we use the method in kernelSHAP [lundberg2017unified] to calculate the Shapley values efficiently. Before calculating the nearest neighbors, we ensure that the dot product between each concept vector and its most salient cluster mean is positive (if it is negative, we take the negative of the concept vector as the new concept vector, which does not affect the loss at all). For the input cluster proposals in AwA, we follow the code of ghorbani2019automating and their hyper-parameters. For the input clusters in IMDB, we obtain 500 clusters from positive sub-sentences and 500 clusters from negative sub-sentences by k-means clustering. We train a linear classifier differentiating the cluster segments and random segments, and remove clusters with accuracy lower than 0.95. We also remove clusters that have fewer than 100 elements. For input cluster proposals in the toy dataset, we used k-means clustering with 20 clusters.
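For intuition about what the Shapley computation aggregates, the sketch below evaluates Shapley values exactly by enumerating all subsets. This is a swapped-in technique, not the kernelSHAP approximation used in the paper, and it is only feasible for a handful of concepts. The set function `toy_completeness` is a made-up stand-in for a completeness score over concept subsets.

```python
from itertools import combinations
from math import factorial

def shapley_values(n_concepts, value):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    players = range(n_concepts)
    phi = [0.0] * n_concepts
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_concepts - size - 1)
                      / factorial(n_concepts))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of concept i when added to subset s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_completeness(subset):
    """Hypothetical score: concept 0 helps alone, concepts 1 and 2 only jointly."""
    score = 0.0
    if 0 in subset:
        score += 0.5
    if 1 in subset and 2 in subset:
        score += 0.4
    return score

print(shapley_values(3, toy_completeness))  # approximately [0.5, 0.2, 0.2]
```

The values sum to the score of the full concept set minus the score of the empty set, and the two concepts that only matter jointly split their shared contribution equally.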

Additional Nearest Neighbors for IMDB

We show additional nearest neighbors for each concept obtained in IMDB in Figure 5. We observe that some top nearest sub-sentences of concept 2 and concept 3 do not contain the word "think". The top nearest neighbors in concept 4 generally have a tone that encourages readers of the review to watch the movie, which is probably why it has the largest ConceptSHAP score.

Figure 5: Additional Nearest Neighbors for each concept obtained in IMDB.
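The nearest-neighbor retrieval behind these figures, including the sign correction described in the Implementation Details, might look like the sketch below. Ranking by raw dot product and choosing the "most salient" cluster as the one with the largest absolute dot product are assumptions made here; the function names and the tiny arrays are hypothetical.

```python
import numpy as np

def orient_concept(concept, cluster_means):
    """Flip the concept if its most salient cluster mean has a negative dot product."""
    dots = cluster_means @ concept
    most_salient = int(np.argmax(np.abs(dots)))  # assumed notion of salience
    return concept if dots[most_salient] >= 0 else -concept

def top_k_neighbors(concept, activations, k=3):
    """Indices of the k activations with the largest dot product to the concept."""
    scores = activations @ concept
    return np.argsort(scores)[::-1][:k]

# Hypothetical 2-D example: the concept initially points away from its cluster.
concept = np.array([-1.0, 0.0])
cluster_means = np.array([[3.0, 0.0], [0.0, 1.0]])
activations = np.array([[0.1, 5.0], [2.0, 0.0], [4.0, 1.0], [-1.0, 0.0]])

oriented = orient_concept(concept, cluster_means)
print(top_k_neighbors(oriented, activations, k=2))  # [2 1]
```

Flipping the sign of a concept vector leaves its span unchanged, which is consistent with the statement that the flip does not affect the loss.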

Additional Nearest Neighbors for AwA

We show additional nearest neighbors for each concept obtained in AwA in Figure 6. The nearest neighbors all share the same texture. Interestingly, some of the nearest neighbors of ripple 2 are not exactly ripples, but trees or leaves that share a similar texture with ripples. Some nearest neighbors of dots contain dots from leaves instead of pure dots on animals. This again validates that the concepts are based on the texture of the image.

Figure 6: More Nearest Neighbors for each concept obtained in AwA.