Towards Automatic Concept-based Explanations
Interpretability has become an important topic of research as more machine learning (ML) models are deployed and widely used to make important decisions. For high-stakes domains such as medicine, providing intuitive explanations that can be consumed by domain experts without ML expertise becomes crucial. To meet this demand, concept-based methods (e.g., TCAV) were introduced to provide explanations using user-chosen high-level concepts rather than individual input features. While these methods successfully leverage rich representations learned by the networks to reveal how human-defined concepts are related to the prediction, they require users to select concepts of their choice and collect labeled examples of those concepts. In this work, we introduce DTCAV (Discovery TCAV), a global concept-based interpretability method that can automatically discover concepts as image segments, along with each concept's estimated importance for a deep neural network's predictions. We validate that discovered concepts are as coherent to humans as hand-labeled concepts. We also show that the discovered concepts carry significant signal for prediction by analyzing a network's performance with stitched/added/deleted concepts. DTCAV results reveal a number of undesirable correlations (e.g., a basketball player's jersey was a more important concept for predicting the basketball class than the ball itself) and show the potential shallow reasoning of these networks.
As machine learning (ML) has become a widely used tool in many applications from medical to commercial, gaining insights into ML models’ predictions has become one of the most important topics of study, and sometimes even a legal requirement. The industry is also recognizing interpretability as one of the main components of responsible use of ML; not just a nice-to-have component but a must-have one. The ability to understand and interact with ML tools is one of the crucial factors in deciding whether ML should be implemented in high-risk domains with potentially severe consequences (e.g., medicine).
One of the unique challenges of interpretability in high-stakes domains is that the users may not be very familiar with ML. This calls for a more intuitive interpretability language designed for laypersons. However, most interpretability methods developed so far use less intuitive language, focusing mostly on estimating how important each input feature is for a prediction [24, 25, 33]. While this is a useful tool for explaining the prediction of a single data point (local explanation), such methods have repeatedly been shown to be unreliable. Their limitations include potential methodological weaknesses (e.g., the importance measure has little to do with the prediction, contradicting its promise), vulnerability to adversarial attacks, and susceptibility to human confirmation biases. In other words, using pixels as a medium requires the subjective judgment of humans, and some studies have shown that humans are able to find evidence for completely contradictory conclusions. We argue that this might be partially due to the fact that humans do not think or communicate using pixels.
Some recent interpretability methods aim to overcome this by generating quantitative explanations using high-level ’concepts’ (e.g., diagnostic concepts, gender, race) instead of input features. TCAV uses a user-chosen set of example data points to form a concept activation vector (CAV), which is then used to calculate the importance of the concept for a prediction. This method uses intuitive language for a layperson to express concepts of interest and to understand their model through those concepts. However, the user has to have a set of concepts in mind for testing and provide examples of such concepts. What if users do not have candidate concepts and/or have no way to provide examples? What if the space of plausible concepts to test is exponentially large?
In this work, we introduce DTCAV (Discovery TCAV), which automatically discovers concepts by collecting connected parts of images (segments) that together form important concepts. We validate via a human experiment that the learned segments form concepts as coherent as human-labeled concepts. We further validate the learned concepts by showing that these concept segments alone often carry sufficient information to be predicted as the corresponding class. We also add and remove sets of segments sorted by their importance for prediction and show the resulting significant impact on the prediction.
This work focuses on post-training interpretability methods - finding explanations given an already trained network. While there is a line of research on building inherently interpretable models [32, 15, 29], we focus on scenarios where we cannot modify the model. Most common post-training interpretability methods provide explanations by estimating the importance of each input feature or training sample for the prediction of a particular data point [24, 25, 33, 17]. Naturally, these methods can only explain one data point at a time (local explanation).
While this is useful when only specific data points matter, these methods have been shown to come with many limitations, both methodological and fundamental. For example, prior work showed that some input feature-based explanations are qualitatively and quantitatively similar for a trained model (i.e., one making predictions with superhuman performance) and a randomized model (i.e., one making random predictions). This shows that the explanation may have little to do with the prediction, contradicting its goal of explaining predictions. Other work proved that some of these methods are in fact trying to reconstruct the input image, rather than estimating pixels’ importance for prediction. In addition, it has been shown that these explanations are susceptible to humans’ confirmation biases. For example, one study showed that given identical input feature-based explanations, human subjects confidently find evidence for completely contradictory conclusions. Using input features as explanations also introduces challenges in scaling this method to high-dimensional datasets (e.g., health records). Humans typically reason in terms of higher-level abstract concepts rather than particular input features (e.g., lab results, a particular hospital visit).
A recently developed method uses high-level concepts instead of input features. Given a set of examples of a concept of the user’s choice, TCAV produces an estimate of how important that concept was for the prediction. However, users have to provide examples of the concept, limiting this method to cases where users have a set of concepts in mind and are interested in their importance measures.
Our method leverages multi-resolution image segmentation methods. There has been a long line of research on multiscale and hierarchical segmentation of images ([23, 30, 4]). In this work, we use the SLIC superpixel segmentation method for its simplicity, memory efficiency, speed, and high-quality performance, as shown in prior work.
In this section, we first review TCAV, a concept based interpretation method for interpreting deep neural networks. We then introduce our method, Discovering and Testing Concept Activation Vectors (DTCAV), by first describing what we define as concepts and how we discover them and then completing the description by testing discovered concepts.
TCAV is a post-training interpretability method that calculates how important a user-chosen concept is for a deep neural network’s prediction of a target class, e.g., how important stripedness is for predicting the zebra class. A user first provides a set of example data points of the chosen concept together with random data points that do not belong to the concept (i.e., a random counterpart). The data points are then mapped to the activation space of a bottleneck layer of the user’s choice. A Concept Activation Vector (CAV) is defined as the direction orthogonal to the decision boundary of a linear classifier trained to distinguish the concept activations from the random activations. The importance of the concept is computed using the directional derivative of the prediction unit of a particular class with respect to the calculated CAV. The final TCAV score is simply an aggregated statistic of these directional derivatives for many images from the target class. Intuitively, the TCAV score measures how important the concept (represented as a CAV) is for a class prediction by conducting a form of sensitivity test. In order to reject any concepts that were “not learned” by the network, a statistical test is conducted that compares the TCAV scores obtained from multiple CAVs of the same concept (using different random counterparts) with those obtained from random CAVs (using random images in place of concept images). The TCAV output only includes concepts that pass this test.
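The two steps described above, fitting a CAV and aggregating directional derivatives into a TCAV score, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the choice of logistic regression as the linear classifier, the toy 2-D activations, and the function names `train_cav` and `tcav_score` are all assumptions, and the per-image gradients of the class logit with respect to the bottleneck activations are assumed to be computed elsewhere.

```python
import math
import random


def train_cav(concept_acts, random_acts, epochs=200, lr=0.1):
    """Fit a logistic-regression separator (an assumed choice of linear
    classifier) and return its unit normal vector, used here as the CAV."""
    dim = len(concept_acts[0])
    w, b = [0.0] * dim, 0.0
    data = [(x, 1.0) for x in concept_acts] + [(x, 0.0) for x in random_acts]
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def tcav_score(class_gradients, cav):
    """Fraction of target-class images whose directional derivative
    along the CAV is positive."""
    positives = sum(
        1 for g in class_gradients
        if sum(gi * ci for gi, ci in zip(g, cav)) > 0
    )
    return positives / len(class_gradients)


# Toy usage: concept activations are shifted along the first axis.
rng = random.Random(0)
concept = [[2 + rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(20)]
counterpart = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(20)]
cav = train_cav(concept, counterpart)
```

The statistical test would repeat `train_cav` with different random counterparts and compare the resulting score distribution with that of CAVs trained on random images only; that step is omitted here.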
While TCAV allows a layperson to express their concept of interest and conduct hypothesis testing of their choice, and is well suited to domains where users have a clear set of concepts in mind, it may not be suitable for domains where users simply do not know which concepts to test or lack the resources to collect concept examples.
Our method was inspired by the following questions: If we limit the space of concepts to a set of target class images, can we automatically discover concepts contained in them? Can we discover concepts that are coherent to humans while sufficient for prediction?
At a high level, the DTCAV method first segments target class images (i.e., “discovery images”) and applies simple clustering methods followed by outlier removal to finalize the set of discovered concepts. The output of the method is a number of sets of segments of the discovery images, each set representing a concept, together with a quantitative measure of how important each concept is.
Fig. 1 shows the overall algorithm in detail. First, we create a set of images belonging to the target class that we call “discovery images”. For each discovery image, segmentation is applied several times with different levels of resolution, for instance using superpixel segmentation with various parameters resulting in different numbers of segments (Fig. 1(a)). Each segment is then resized to the original input size of the network and mapped to a chosen bottleneck layer’s activation space (Fig. 1(b)). Clustering with outlier removal is then applied to the activations of the segments to discover the concepts (Fig. 1(c)). A new set of images of the target class is used to calculate TCAV importance scores using the TCAV method (Fig. 1(d)).
While the final method above is simple, each piece of the method addresses an inherent challenge of concept discovery. The first challenge is that discovered concepts must be location and scale invariant, since the same concept may appear multiple times at different scales and locations in images. An efficient multi-resolution segmentation method (SLIC superpixel segmentation) is crucial as each image is segmented at multiple resolutions (Fig. 1(a)). Since doing so may create duplicated segments, we use Jaccard similarity to remove potential duplicates. The second challenge is effective filtering for potentially important concepts that are coherent. In other words, we want to filter out potentially irrelevant segments, e.g., a human face appearing in one zebra image. We empirically identified three simple but important factors: distance, frequency, and popularity. The distance factor is intuitive: we remove segments that are far from all of the clusters. The frequency factor means that segments in a cluster must occur across many images and not just a small number of images. The popularity factor simply means that the cluster must also be big enough to be a good candidate. We filter out clusters that satisfy none of these criteria (details in Section 4).
After filtering, we have a set of candidate concepts, which are then used to compute the concept activation vectors (CAVs) and perform statistical testing to obtain TCAV scores.
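The duplicate-removal step can be illustrated with a short sketch. It assumes that each segment is represented as a set of (row, column) pixel coordinates and that two segments count as duplicates when their Jaccard similarity exceeds a threshold; the 0.5 threshold and the function names are hypothetical, as the text does not state them.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two pixel-coordinate sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def remove_duplicate_segments(segments, threshold=0.5):
    """Keep a segment only if it is not too similar to any segment already kept.

    `threshold` is an illustrative value, not one taken from the paper.
    """
    kept = []
    for seg in segments:
        if all(jaccard(seg, other) <= threshold for other in kept):
            kept.append(seg)
    return kept


# Segments of the same image produced at different resolutions.
coarse = {(0, 0), (0, 1), (1, 0), (1, 1)}
fine = {(0, 0), (0, 1), (1, 0)}        # overlaps heavily with `coarse`
elsewhere = {(5, 5), (5, 6)}
unique = remove_duplicate_segments([coarse, fine, elsewhere])
```

Because segments are visited in order, the first of two near-identical segments is the one that survives; processing coarser resolutions first would therefore favor larger segments.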
Several prior works support our assumption that simple clustering is surprisingly effective in distinguishing concepts. One prior experiment showed that, in the right bottleneck layer of an image classification network, images of a concept are linearly separable from random images across various sets of random images. Other work also verified that simple linear classifiers were sufficient for discovering concepts. Further evidence comes from work that pointed out striking similarities between deep neural networks’ learned representations and human perception. A comprehensive discussion of linearity in the activation space of deep neural networks is provided in prior work.
Note that DTCAV is not limited to using TCAV scores to measure the importance. Once the concepts are discovered, for example, one could use gradient-based importance measures like saliency maps. The averaged value of importance scores for all pixels that fall into the segment could be used as a proxy for importance measure. One can also use cosine similarity between CAVs of each concept and CAVs of target images as a measure.
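Both alternative measures mentioned above are simple to express. The sketch below assumes a saliency map given as a 2-D list of per-pixel scores and a boolean mask of the same shape marking the segment; the function names are illustrative only.

```python
import math


def segment_importance(saliency_map, segment_mask):
    """Average the per-pixel saliency over the pixels inside the segment."""
    values = [
        score
        for sal_row, mask_row in zip(saliency_map, segment_mask)
        for score, inside in zip(sal_row, mask_row)
        if inside
    ]
    return sum(values) / len(values)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, e.g. two CAVs."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


saliency = [[0.1, 0.9], [0.5, 0.3]]
mask = [[False, True], [True, False]]
proxy = segment_importance(saliency, mask)  # mean of 0.9 and 0.5
```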
All experiments were performed using the Inception-V3 model trained on the ILSVRC2012 data set (ImageNet). We randomly chose 100 out of 1000 classes in the data set for our experiments. We used the “mixed 8” bottleneck for this section.
The SLIC superpixel method was chosen as it strikes a good balance between segment quality and efficiency. We performed three-resolution segmentation by setting SLIC’s number-of-segments parameter to 15, 50, and 80. After resizing each segment, since segments have irregular shapes, we fill the empty part of the image with the zero pixel value (117 in our network, after post-processing). For the choice of clustering method, we performed concept discovery using several clustering methods including K-means, Affinity Propagation, and DBSCAN. When Affinity Propagation was used, typically a large number of clusters (30-70) were produced, which were then simplified by another hierarchical clustering step. The best results, however, were acquired using k-means clustering followed by removing all points except those with the smallest distance from the cluster center. For filtering, as described in Section 3, we remove all clusters except those with a) high frequency (segments come from more than 50% of discovery images), b) medium frequency with medium popularity (segments come from more than 25% of discovery images and the cluster size is larger than 40), or c) high popularity (cluster size is larger than 80). The same thresholds and k-means settings are used in all the following experiments; 50 images from the training set were used for concept discovery.
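The outlier-removal step after k-means, keeping only the cluster members closest to the cluster center, might look like the following sketch. The number of points to keep (`n_keep`) is a hypothetical parameter, since its value is not recoverable from the text, and the center is taken to be the mean of the cluster members, as in standard k-means.

```python
import math


def centroid(points):
    """Mean of a list of equal-length activation vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def keep_closest(points, n_keep):
    """Keep the `n_keep` points nearest to the cluster center.

    `n_keep` is an illustrative parameter, not a value from the paper.
    """
    center = centroid(points)
    ranked = sorted(points, key=lambda p: math.dist(p, center))
    return ranked[:n_keep]


cluster = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
core = keep_closest(cluster, n_keep=3)  # drops the far-away member
```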
In what follows, we first show examples of the concepts discovered by the DTCAV algorithm. We then verify via a human subject experiment that our method returns coherent sets of concepts. The results indicate that the discovered concepts are as coherent to humans as hand-labeled concepts. We show that our method is able to learn concepts at various levels of abstraction, from simple concepts (e.g., color, texture) to more complex ones (e.g., objects, parts). We also quantitatively verify that these concepts are in fact crucial for prediction. First, we show that a set of important concepts is enough to predict the right class. Second, we show that adding or deleting important concepts significantly impacts the prediction performance.
The multi-resolution segmentation step of DTCAV naturally returns segments that contain simple concepts such as color or texture and more complex concepts, such as parts of body or objects. Among those segments, DTCAV successfully identifies concepts with similar level of abstract-ness with similar semantic meaning (as verified via human experiment). Fig. 2 shows some examples of the discovered concepts. Note that each segment is re-sized for display.
We designed an intruder detection human experiment, following the interpretability literature, to verify the quality of the discovered concepts. For each question, a subject is asked to identify the one image out of six that is conceptually different from the rest. We created a questionnaire of 34 questions, such as the one shown in Fig. 3. Among the 34 randomly ordered questions, 15 include a set of randomly chosen DTCAV concepts, and another 15 are human-labeled concepts from the Broaden dataset. The first four questions were used as training and discarded. On average, participants answered the hand-labeled questions 97% (14.6/15) correctly, while the questions on discovered concepts were answered 99% (14.9/15) correctly. This experiment confirms that while a discovered concept is only a set of image segments, DTCAV was able to identify segments with coherent concepts.
For each discovered concept, we compute its CAV and then test the CAV using a set of held-out images of the target class to get the TCAV score of each discovered concept. Note that a TCAV score of s means that a fraction s of the target class images have positive sensitivity to the concept; intuitively, this means that increasing the presence of that concept increases the prediction score of the class. Fig. 4 shows the result of running DTCAV on a subset of target classes. For each class we show four concepts, some with high and low TCAV scores and some that did not pass the statistical test (i.e., the concept was not relevant to the prediction). Three randomly selected segments are shown for each concept. More examples are provided in Appendix B.
Reviewing discovered concepts with high TCAV scores shows what the network pays attention to, which reveals some surprising correlations. In some cases, we see that the network picked up on appropriate related concepts. The letters on the police van were correctly identified as important in Fig. 4, while the asphalt road in the background was identified as not important. Fig. 5(a) shows more examples of this case. Not surprisingly, we also discover some undesirable correlations. For example, the lionfish prediction considers the background reef to be important, and basketball predictions consider player jerseys and the wooden floor important instead of the ball, as seen in Fig. 4. For some classes, such as European gallinule (Fig. 4), the network considers the background (grass) much more important (TCAV score 0.73) than parts of the bird (TCAV score 0.46). This indicates that this classifier may not be robust at detecting this bird, and that gathering more training data with varied backgrounds might improve the result. Similar undesirable correlations are shown in Fig. 5(b).
Another insight we gained was that in some cases when the object is complex, the network identifies parts of the object as separate concepts, and some parts are more important than others. For example, in Fig. 5(c), the carousel is represented by separate concepts for the lights, the pole structure, and the seats (horses). It is interesting to learn that the lights were more indicative of the carousel than the seats.
Note that some of the discovered concepts may seem duplicated to humans. For example, in Fig. 6(b), three different ocean surfaces (wavy, calm, and shiny) are discovered separately and all of them have similarly high TCAV scores. Future work remains to see whether this is because the network represents these ocean surfaces differently, or whether we can further combine these concepts into one ’ocean’ concept.
While rare, there are concepts that are less coherent to humans. This may be due to limitations of our method or because things that are similar to the neural network are not similar to humans. However, the incoherent concepts were never in top-5 most important concepts among the 100 classes used for experiments.
The discovered segments only contain a part of the story of the target class, especially since they lose the global structure of the object (e.g., shape). However, it is plausible that sometimes the mere presence of important concepts of a target class is sufficient for classification without considering the global structure of the class images. For example, the zebra pattern could be distinctive enough that stitching together zebra skin textures may convince the network that it is a zebra, without having to see the anatomy of the zebra. To this end, we designed a concept stitching experiment where we randomly place concept segments on a blank image in sorted order of importance.
For each target class, we experimented with stitching the top-k most important concepts and generated 100 stitched images for each experiment. We then picked the experiment yielding the highest “success rate”, which is the percentage of stitched images classified as that target class (i.e., accuracy with stitched images). We choose k in top-k via greedy cross-validation. An average success rate of 39.6% was obtained (note that random chance is 0.1%, i.e., 1 in 1000 classes). As a control, we also ran the experiment of stitching the bottom-5 concepts, yielding a 1% success rate. Interestingly, for the zebra, leopard, and drilling platform classes, the success rate is relatively high (more than 80%), which shows that the network is only looking at important concepts (Fig. 8). On the other hand, for the police van, jinrikisha, and bullet train classes, the success rate is close to zero, which means that the general structure of the class images is also necessary for correct classification. Examples of these classes are shown in Fig. 8. Aggregate results are shown in Fig. 7, where we group the classes into those with a success rate of more than 40%, which we consider “stitchable” classes, and the remaining classes, which we consider “unstitchable”. This experiment sheds light on whether the global structure of the class was crucial for the prediction or not, revealing potentially shallow reasoning of the network.
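The stitching procedure and its success rate can be sketched as follows. This is a toy illustration under several assumptions: images and patches are 2-D lists of pixel values, the canvas is a small blank grid, and `classify` is a hypothetical callable standing in for the trained network; none of these names come from the paper.

```python
import random


def stitch(patches, canvas_size, rng, background=0):
    """Paste patches at random positions on a blank canvas, in the given order."""
    canvas = [[background] * canvas_size for _ in range(canvas_size)]
    for patch in patches:
        height, width = len(patch), len(patch[0])
        top = rng.randrange(canvas_size - height + 1)
        left = rng.randrange(canvas_size - width + 1)
        for r in range(height):
            for c in range(width):
                canvas[top + r][left + c] = patch[r][c]
    return canvas


def success_rate(patches, classify, target, n_images=100, canvas_size=8, seed=0):
    """Percentage of stitched images that `classify` assigns to `target`."""
    rng = random.Random(seed)
    hits = sum(
        classify(stitch(patches, canvas_size, rng)) == target
        for _ in range(n_images)
    )
    return 100.0 * hits / n_images


def toy_classifier(image):
    """Stand-in for the network: calls an image a zebra if any pixel equals 7."""
    return "zebra" if any(7 in row for row in image) else "other"


stripes = [[7, 7], [7, 7]]
rate = success_rate([stripes], toy_classifier, "zebra")
```

In the experiment described above, `patches` would hold segments from the top-k concepts in order of importance, and the procedure would be repeated for several values of k.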
In this experiment, we show the effect of adding or deleting important concepts from images. The idea is that if a segment’s respective concept is indeed important, then deleting/adding that segment should decrease/increase the network’s ability to predict more so than random deletion/addition.
For a set of test images, we add or delete segments in order of their associated concept’s TCAV scores. We then track the prediction accuracy for two orderings, one starting from the highest scores (blue line in Fig. 9) and one starting from the lowest scores (red line in Fig. 9). To find each segment’s associated concept, we find its nearest neighbor concept cluster in the bottleneck layer’s activation space. Fig. 9 shows two examples of such addition/deletion.
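The assignment of segments to concepts and the resulting addition/deletion order can be sketched as below. The sketch assumes each concept cluster is summarized by a single center vector in activation space and that Euclidean distance is used for the nearest-neighbor lookup; both are assumptions, as are the function names.

```python
import math


def nearest_concept(activation, concept_centers):
    """Name of the concept whose cluster center is closest to the activation."""
    return min(
        concept_centers,
        key=lambda name: math.dist(activation, concept_centers[name]),
    )


def order_segments(segment_acts, concept_centers, tcav_scores, top_first=True):
    """Indices of segments sorted by the TCAV score of their nearest concept."""
    scored = [
        (tcav_scores[nearest_concept(act, concept_centers)], index)
        for index, act in enumerate(segment_acts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=top_first)
    return [index for _, index in scored]


centers = {"stripes": [1.0, 0.0], "grass": [0.0, 1.0]}
scores = {"stripes": 0.9, "grass": 0.2}
acts = [[0.1, 0.9], [0.8, 0.1], [0.2, 1.1]]
top_order = order_segments(acts, centers, scores)
bottom_order = order_segments(acts, centers, scores, top_first=False)
```

Segments would then be added to, or deleted from, each test image following `top_order` or `bottom_order`, with accuracy recorded after each step.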
The results in Fig. 10 show that the discovered concepts carry important signal for prediction: a small number (5) of top concepts are sufficient to predict a large share of images correctly, while the bottom concepts achieve only much lower accuracy (see Fig. 10 for the exact figures). Note that the relative performance with randomly ordered concepts further supports these results. Deleting top/bottom concepts also leads to the same conclusion: when the top-5 concepts are removed, a large share of the originally correctly predicted images are no longer correctly predicted.
We note a couple of limitations of our method. This work is based on image data sets, as the superpixel segmentation method is limited to images. While the general idea of discovering and testing concepts does apply to many other data types such as text, it was not tested. Additionally, our method can only discover concepts that can be expressed with pixels. While we still discover plenty of insights based on pixel segments, there might be more complex and abstract concepts that we are unable to discover. Future work includes better optimizing our method’s performance by tuning the multi-resolution segmentation parameter per class. This may better capture the inherent granularity of objects; nature scenes may have a smaller number of concepts than city scenes. For example, the “European gallinule” class in Fig. 4 could have benefited from a segment of the entire bird itself.
In conclusion, DTCAV is a post-training interpretability method that automatically discovers important high-level concepts in the form of image segments. We verified via a human experiment that the diverse set of discovered concepts is coherent to humans, and further validated that the discovered concepts indeed carry important signals for prediction. The discovered concepts reveal insights into surprising and sometimes undesirable correlations that the network has learned, highlighting networks’ frequent shallow reasoning. Such insights may help to promote safer use of this powerful tool, machine learning.
Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA, 316(22):2402–2410, 2016.
The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint, 2018.
As mentioned in Section 3 and Section 4, in order to discover concepts that are frequently present in a target class, we perform unsupervised clustering of resized image segments in the bottleneck layer’s activation space. Outlier removal is then performed to remove unrelated members of a cluster. In order to make sure that a discovered concept is actually present in target class images, we introduced three different types of concepts that are acceptable:
High frequency: If the members of a concept’s cluster come from the majority of discovery images, in other words, the segments forming that concept’s cluster are parts of a large number of discovery images, it can be said that the discovered concept frequently appears in the target class images. In the experiments, any concept appearing in more than 50% of discovery images is considered acceptable. One example would be the ball in the basketball class. It is present in every image, and usually there is only one. As a result, its respective cluster is not large, but it has members coming from a large number of discovery images.
Medium frequency with medium popularity: If the segments in a concept cluster appear in a reasonable number of discovery images and the cluster size is large, it means that the concept has a significant presence in part of the images belonging to the target class. In our experiments, if a concept appears in 25% to 50% of images but its cluster has more than 40 members, it is an acceptable concept. One example would be the hand object in the basketball class. Many of the basketball images do not have a hand in them, but the hand is a highly related concept to the basketball class that appears consistently in a portion of the images.
High popularity: If the segments in a cluster come from a small number of images but the cluster is still large, it can be deduced that the concept has a very distinctive presence in that small portion of discovery images. One example would be the human crowd concept in the basketball class. A small percentage of images have that concept, but when it is present, because it covers a large area of its corresponding image, it is partitioned into a large number of segments, each of which belongs to the same concept. In our experiments, a cluster with more than 80 segments coming from more than 10% of the discovery images is acceptable.
Any concept that is not satisfying one of the aforementioned criteria is removed.
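The three acceptance criteria can be combined into a single predicate. The sketch below uses the thresholds stated above (50%, 25% with more than 40 members, 10% with more than 80 members); the representation of a cluster as a list holding the source image id of each member segment, and the function name, are assumptions made for illustration.

```python
def is_acceptable(cluster_image_ids, n_discovery_images):
    """Apply the three acceptance criteria to one cluster.

    `cluster_image_ids` holds, for every segment in the cluster, the id of
    the discovery image it was cut from (an assumed representation).
    """
    size = len(cluster_image_ids)
    coverage = len(set(cluster_image_ids)) / n_discovery_images
    high_frequency = coverage > 0.50
    medium_frequency = coverage > 0.25 and size > 40
    high_popularity = coverage > 0.10 and size > 80
    return high_frequency or medium_frequency or high_popularity


n_images = 50
ball = list(range(30))                           # 30 segments from 30 images
hand = [i % 15 for i in range(45)]               # 45 segments from 15 images
crowd = [i % 6 for i in range(90)]               # 90 segments from 6 images
stray_face = [0, 0, 0, 1, 1]                     # 5 segments from 2 images

accepted = [is_acceptable(c, n_images) for c in (ball, hand, crowd, stray_face)]
# accepted == [True, True, True, False]
```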