Cause and Effect: Concept-based Explanation of Neural Networks

by Mohammad Nokhbeh Zaeem, et al.
Carleton University

In many scenarios, human decisions are explained based on some high-level concepts. In this work, we take a step in the interpretability of neural networks by examining their internal representations, or neurons' activations, against concepts. A concept is characterized by a set of samples that have specific features in common. We propose a framework to check the existence of a causal relationship between a concept (or its negation) and task classes. While previous methods focus on the importance of a concept to a task class, we go further and introduce four measures to quantitatively determine the order of causality. Through experiments, we demonstrate the effectiveness of the proposed method in explaining the relationship between a concept and the predictive behaviour of a neural network.




I Introduction

Applications of Machine Learning (ML) and Artificial Intelligence (AI) as methods to help with automatic decision-making have grown to the extent that concerns have been raised about the trustworthiness of these methods. Rules and regulations around the world require organizations to provide explanations for decisions made by their automated decision-making systems [9]. These concerns often exist whenever the problem at hand is not fully understood or explored, or our knowledge of the problem is incomplete. Knowing the reasoning of machine learning methods may also help with catching their unwanted behaviours by comparing the reasoning to experts’ understanding of problems. An example of such unwanted behaviours is biases in decision-making. On the other hand, explanations can be used to extract the knowledge gained by these black boxes as well. Knowledge extraction can help with a better understanding of the AI view of the problem and the machine learning methods.

Neural networks, as one of the most promising forms of AI with high performance on classification problems like the ImageNet challenge [6, 17], have been criticized for their black-box decision-making process. One of the most important questions asked about a neural network’s decisions is how a certain concept influences the internal representation and eventually the output of the neural network. Here, a concept is a representation of a feature and is defined by a set of samples with that feature, against a random set [11]. For example, for predicting the job title of a person from their image, this feature can be as simple as the colour of uniform (e.g. white or pink), background (e.g. office, ambulance or clinic), or objects (e.g. stethoscope) in the image.

Breaking down the decision, output or task class of a given pretrained neural network into high-level, humanly meaningful concepts present in the input (post-hoc analysis) has been an active area of research in the past few years. This process is done by inspecting the internal representations or activations of the neural network. This approach may be called concept-based explanation of neural networks. The training phase of the original network and the explanation phase can be completely separate, with different datasets (one for task classes and one for concept classes), and can be done by different people. For instance, for predicting the job title of a person from their image, the goal is to determine whether having a clinic as an image’s background affects the prediction of the job title to be a doctor, or how the presence of a stethoscope around the neck changes the prediction of the neural network. These methods do not require concept labels and task labels to be from the same set of samples. For example, the task of predicting if the job title of a person is doctor can be represented by a set of physician images, while the concept clinic may be represented by a set of clinic images and the concept stethoscope may be represented by a set of stethoscope images.

I-A Nonlinear Concepts

Most concept-based methods assume that a concept, if present in an activation space, should be linearly separable from non-concept samples [4, 11, 10, 22]. This assumption, however, does not necessarily hold, especially in the earlier layers of a network where the learned features are often not abstract enough to linearly separate concepts [13], or in later layers when they fuse to form higher-level concepts. This hinders these methods’ ability to track the presence of a concept throughout the network. Another limitation comes from the assumption that the gradient of a section of a network with respect to the input is a good representation of that section [11, 22]. Such a first-order approximation might be misleading. This issue has been extensively discussed for saliency maps, which are also based on gradient approximations and have been shown to be misleading [1, 12].

In our method, we check the presence of a concept in a layer’s activations by training a concept classifier, i.e. a network with the same structure as the task classifier from that layer onward but trained to detect the concept. The accuracy of concept classification gives us a good understanding of the importance and possible influence of the concept on a task class. For instance, the colour of shoes might not be relevant in the job classification task and such information is likely to be discarded by the network after the first few layers. If, in a particular layer, a concept cannot be detected by the concept classifier, it is safe to say that the network cannot recall the concept, i.e. the concept is forgotten (not necessarily universally but to the capacity and power of the given network). Such a conclusion can be made only if the concept classifier shares the same structure as the task classifier, since the network structure is the upper limit for the extraction power of the network. Moreover, the concept classifier is initialized by the weights of the network under inspection. This initialization reduces the number of concept samples required for training the concept classifier. This particular choice of the concept classifier’s structure allows us to track the concept information across the original network’s layers.

I-B Causality

We aim to capture and quantify the causal relationship between a concept and the prediction of a class. For example, we want to quantify the extent to which the presence of the concept fever (or its absence) is a necessary or sufficient condition for the prediction of class flu. Another shortcoming of the previous methods is that most of them yield a score that captures the correlation between concepts and output but cannot give any further details about the nature of such a relation [4, 11, 22]. Following the above example about job title classification from images, the correlation between wearing a lab coat and being classified as a doctor cannot answer questions such as whether all images classified as doctor include lab coats, or whether all people wearing lab coats are classified as doctors. This problem is sometimes referred to as causality confusion. Note that the goal is not to investigate causal relationships in the training dataset. We aim to investigate the causal relations “learned” by a neural network.

In a medical diagnosis setting, existing methods have difficulty answering questions like: Do all patients that are classified as having flu have fever symptoms (fever necessary for predicting flu)? Are all patients with fever classified as having flu (fever sufficient for predicting flu)? Do all patients classified as having flu not have a fever (absence of fever necessary for predicting flu)? Are all patients with fever not classified as having flu (fever sufficient for the negative of predicting flu)?

Based on the trained concept classifier and the existing network for task classification, we evaluate whether a concept is necessary, sufficient, or irrelevant for a specific task class. To avoid unnecessary assumptions like the linear assumption or first-order approximation, we use a distribution sample set, i.e. a set of samples representing the distribution of the data manifold. This set is representative of the likely inputs of the network. Then we directly measure four relationship scores based on the concept and target values of the distribution sample set. The four measures are expressed in terms of causal expressions, showing whether a concept causes a task class or vice versa. Unlike the previous works in [4, 22], which are limited to specific network structures like convolutional layers, the proposed method can be applied to a wide range of network structures.

Contributions of this work are as follows:

  1. We propose a framework to capture the existence of a given concept in a layer of a neural network without the linear assumption or first-order approximation.

  2. We also propose a set of scores to capture the nature of the relationship between the concept and the network decisions in the form of causal expressions. In other words, we determine which one logically follows the other.

  3. We then show practical applications of our method based on several experiments. Some of our experiments are designed to investigate the linear and first-order approximation assumptions (individually, to isolate them from other factors). The results show that these assumptions are not reliable.

  4. Through experiments, we also compare our method with two existing methods, namely TCAV [11] and IBD [22], in determining relationships of concepts and tasks. The results show that our method succeeds in cases where previous methods fail. Moreover, we demonstrate that our method outperforms the IBD method in explaining image classification on a real-world dataset [21].

II Related Work

There have been several works on explaining the intermediate activations of a neural network based on human-friendly concepts. Most notably, Kim et al. [11] proposed a percentage measure, called the TCAV score, to measure how much a concept interacts with the task classifier. TCAV works based on whether the gradient of the neural network is in the direction of the concept. The direction of the concept is defined as the direction orthogonal to the linear classification decision boundary between concept and non-concept samples. The TCAV score captures the correlation between the network output and the concept and lacks detailed information about the nature of the relationship. Moreover, it assumes that concepts can be represented linearly in the activation space, an assumption that does not necessarily hold [13]. It also represents a section of the network only by its gradient (first-order approximation), which might be misleading. A similar approach has also been explored in the Net2Vec [7] and Network Dissection [4] methods, but they assume that the concepts are aligned with single neurons’ activations.
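The TCAV score described above can be sketched in a few lines. This is a minimal illustration, assuming the gradients of the class output with respect to the layer activations and the concept direction (the concept activation vector, here `cav`) have already been computed; the function name and the toy values are hypothetical and not taken from the TCAV implementation.

```python
def tcav_score(gradients, cav):
    """Fraction of inputs whose directional derivative along the concept direction is positive.

    gradients: one gradient vector per class input (hypothetical, precomputed).
    cav: concept direction, orthogonal to a linear concept/non-concept boundary.
    """
    positive = sum(
        1 for g in gradients
        if sum(gi * vi for gi, vi in zip(g, cav)) > 0
    )
    return positive / len(gradients)


# Toy values for illustration only.
toy_gradients = [[1.0, 0.0], [0.5, -1.0], [-1.0, 0.2]]
toy_cav = [1.0, 0.0]
score = tcav_score(toy_gradients, toy_cav)
```

The score only counts the sign of a first-order quantity, which is why it cannot distinguish between a necessary and a sufficient relationship.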

In another work, Interpretable Basis Decomposition for visual explanation (IBD) [22], the authors tried to explain the activations of a neural network by greedily decomposing it into some concept directions. They use the resulting decomposition as explanations for the image classification task. One of the drawbacks of such an approach is its linear assumption which comes from the usage of linear decomposition of the gradient in the activation space. Using greedy methods can also potentially result in inaccurate and unstable results. Another limitation of the IBD method [22] is that it can only explain convolutional layers and therefore for neural networks that include dense layers, they have to modify the network. In their experiments, they have replaced each dense layer with a global average pooling layer and a linear layer.

The linear assumption indicates that a concept in hidden layers corresponds to a vector and the representation of data in each layer is a vector space. Such methods assume that addition, subtraction, scalar product and inner product (as projecting an activation to a concept vector) operations in an activation space are always meaningful. The linear assumption originated in feature visualization methods. Most feature visualization methods optimize for inputs that maximally activate certain neurons or directions. Early studies on neural network activation space tried to find samples that maximally activate a single neuron, and associate a concept with the neuron. In [18] the authors argued that random linear combinations of neurons may also correspond to interpretable, meaningful concepts. The general idea of using a linear classifier to check the information of intermediate layers originated in [2]. They proposed to use linear probes, i.e. trainable linear classifiers independent of the network, to get an insight into the network representations. In contrast to what was mentioned in [18], in [4, 16] the authors reported that the basis (each neuron) direction activations correspond to a meaningful concept more often than random vectors do. Still, feature visualization methods ignore the distribution of the input data, which results in inputs that are not consistent with real samples.

Linear interaction of concepts has been even less studied in feature visualization methods. In [16] the authors showed that in some cases the addition of two concepts’ activations results in inputs with both concepts present, but they cast doubt on whether this finding is always true. The linear assumption lacks enough evidence to be considered reliable as the basis of interpretability methods that try to gain the trust of humans and justify neural network decision-making.

Some other methods have tried to automatically discover new concepts from neural networks rather than taking a concept as input, namely Automatic Concept-based Explanations (ACE) [8] and Completeness-aware Concept-Based Explanations (CCE) [20]. ACE tries to automatically extract concepts based on TCAV, while CCE extracts concepts based on the continuity of convolutional layers. CCE also tries to avoid first-order approximation by measuring the importance of concepts using the Shapley score. Though these methods can help in cases where no principle exists for the rational behaviour of the network, in many cases the experts have a good principle about the problem at hand and the principle’s concepts are predefined, so they want to check the consistency of the neural network with the existing principles. For instance, in the detection of a certain disease, the experts check all the related symptoms and are not interested in other concepts that the machine learning method might introduce. For example, for the prediction of a patient having flu, medical experts know that fever is a symptom, and we want to know exactly what the relation is between fever and being classified as having flu.

Our work relates to CACE [10] in that both try to address the shortcoming of TCAV [11] by capturing causal expressions. The CACE method [10] measures the influence of a concept by the difference of conditional expected values. This requires highly controllable datasets or very accurate generative models that may not be available in practice. Our method also relates to works that define and train neural networks with concept-based explanations in mind [13, 3, 5], though our method explains existing pretrained neural networks.

Our work relates to [19] in that both use a specific visual method to examine the influence of different input features on the output of a machine learning model. But our method goes further and inspects the nature of the relationship and quantifies these visualizations. We also consider high-level concepts instead of raw input features.

In the next section, we will propose a framework for a concept-based explanation for neural networks, which simultaneously addresses the linear assumption, first-order approximation and causality confusion issues discussed above.

III Framework

III-A Background

Logical expressions are usually expressed as a causality clause in the mathematical notation form of $A \Rightarrow B$. In this notation, phenomenon $A$ is the reason for the phenomenon $B$, and whenever $A$ happens, $B$ will follow. To understand the clause, both $A$ and $B$ should be understood, meaning that both condition and consequence should be familiar to humans so that the clause can be understandable.

In fundamental math, concepts are represented by sets. We use the same representation to visualize the relation between concept and task in a neural network. Being in a set means that the corresponding feature is present, and not being in the set means the feature is not present. Two arbitrary sets are usually demonstrated by a Venn diagram (as seen in Figure 1a).

Fig. 1: (a) There are four different subsets that might potentially be empty. Depending on which subset is empty, one of the four conditions will happen. (b) 3 relative positions of two sets. With respect to (T) Task class, (S) Sufficient condition, (N) Necessary condition, and (I) Negative necessary.

There can be several possible relations between the two sets. Each of these relations can also be represented as a causality clause.

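  • Necessary: the concept $C$ is necessary for the task class $T$ ($T \Rightarrow C$).

  • Sufficient: ($C \Rightarrow T$), reverse of necessary.

  • Negative Necessary: meaning $C$ and $T$ are inconsistent ($T \Rightarrow \neg C$ or $C \cap T = \emptyset$).

  • Negative Sufficient: meaning either $C$ or $T$ or both should happen ($\neg C \Rightarrow T$ or $C \cup T = U$). ($U$ is a set that contains all the elements).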
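The four relations can be checked directly on finite sets. The following is a minimal sketch using Python sets; the function name, the set names and the toy elements are hypothetical and serve only to illustrate the definitions, with `concept` standing for the concept set and `task` for the task class set.

```python
def set_relations(concept, task, universe):
    """Report which of the four relations hold between a concept set and a task set."""
    return {
        # Task is a subset of concept: task implies concept.
        "necessary": task <= concept,
        # Concept is a subset of task: concept implies task.
        "sufficient": concept <= task,
        # Concept and task never occur together.
        "negative_necessary": not (concept & task),
        # Every element is in the concept, in the task, or in both.
        "negative_sufficient": (concept | task) == universe,
    }


# Toy example: samples 0..9, of which 0..3 are red and 1..2 are classified as zero.
all_samples = set(range(10))
red_samples = {0, 1, 2, 3}
zero_samples = {1, 2}
toy_relations = set_relations(red_samples, zero_samples, all_samples)
```

In this toy example every sample classified as zero is red, so only the necessary relation holds.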

III-B Methodology

In our method, we base the explanations on a certain layer’s activations and explain whether and how the concept interacts with a task class based on the activations of the layer. We break up the neural network into two sections, the section before the hidden layer (denoted by $f(\cdot)$) and the section after the hidden layer (denoted by $g(\cdot\,;\theta)$). $\theta$ denotes the trainable parameters of the second section, and the whole network can be expressed as $g(f(x);\theta)$ (see Figure 2c).

Fig. 2: (a) samples of positive and negative concepts, (b) distribution sample set, (c) representation of task class and concept in network, (d) analysis of the relationship between the task class and concept showing the concept is a necessary condition for the task class.

As shown in Figure 2, we only need two sets of samples for our analysis. (1) Concept set labelled on the concept information only (Figure 2a). (2) Distribution sample set, without any labelling (Figure 2b). Note that access to the original training task data is not required.

For the sake of explaining the proposed method, let us consider a neural network trained on colour-coded hand-written digits. In the training set, a unique colour was assigned to each class, and samples within each class were coloured accordingly. For instance, all 0’s in the training set are red, and all 1’s are blue. One would expect the network decision to be influenced by the colour as well as the digit itself; but the challenge is how to measure this influence. We aim to determine, how the concept, i.e. red, influences the decision-making of this neural network.

The first step in our analysis is to check whether the concept is present in the layer. In other words, can a classifier on the activations of the layer achieve acceptable concept detection accuracy? We check if the second section of the network has adequate power and capacity to distinguish the concept set (colour red) in that layer. The number of output neurons is adjusted to match the concept.

For representing the concept, a set of positive and negative concept examples is used, in our case red samples against other colours (Figure 2a). Then the concept classifier, with the structure of the second section of the neural network, is trained ($g(\cdot\,;\theta_C)$) to distinguish the concept from non-concept activations. As a result, we will have two networks with identical structures but different parameters. $\theta_T$ denotes the parameters of the original network trained for task classification (digit classification) and $\theta_C$ denotes the parameters learned to distinguish a concept from non-concept samples (red vs. other colours). $g(\cdot\,;\theta_T)$ is the task classifier, whereas $g(\cdot\,;\theta_C)$ is the concept classifier. For learning $\theta_C$, we initialize the trainable parameters of $g$ as $\theta_T$. Note that the parameters of the first section ($f$) do not change while $\theta_C$ is learned.
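The training procedure described above (freeze the first section, copy the task parameters, then fit the copy to concept labels) can be sketched as follows. This is a toy illustration under strong assumptions: the second section is reduced to a single logistic unit only to keep the sketch short, whereas in the method it is the remaining layers of the network, and the feature map, parameter values and samples are all hypothetical.

```python
import math


def first_section(x):
    """Frozen first section of the network (a fixed toy feature map, hypothetical)."""
    redness, roundness = x
    return [redness, roundness, redness * roundness]


def second_section(activations, theta):
    """Trainable second section; theta holds the weights followed by a bias."""
    z = theta[-1] + sum(w * a for w, a in zip(theta, activations))
    return 1.0 / (1.0 + math.exp(-z))


def train_concept_classifier(theta_task, concept_samples, lr=0.5, epochs=500):
    """Fit concept parameters, starting from a copy of the task parameters."""
    theta_concept = list(theta_task)  # initialisation from the task classifier
    # The first section is only evaluated, never updated.
    data = [(first_section(x), label) for x, label in concept_samples]
    for _ in range(epochs):
        for activations, label in data:
            error = second_section(activations, theta_concept) - label
            for i, a in enumerate(activations):
                theta_concept[i] -= lr * error * a
            theta_concept[-1] -= lr * error
    return theta_concept


# Hypothetical task parameters (a "roundness" detector) and concept samples (red vs. not red).
theta_task = [0.1, 2.0, -0.5, -1.0]
concept_samples = [((0.9, 0.1), 1), ((0.8, 0.9), 1), ((0.1, 0.2), 0), ((0.2, 0.8), 0)]
theta_concept = train_concept_classifier(theta_task, concept_samples)
```

After training there are two parameter sets for the same structure, one for the task and one for the concept, and both can be evaluated on the same activations.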

Other than showing the concept is present and extractable by the network, training another network with the same structure gives us a way to generalize over concept samples. And since now we have generalizable representations of both task classes and concepts, we can proceed with our causal analysis.

Checking whether a set is a subset of another can be easily done by checking the definition. Since we cannot sample every possible instance in our input space, we only check the relationship on a distribution sample set (Figure 2b). Note that this sample set is chosen randomly and is not specifically selected as in prototyping methods. As mentioned earlier, the distribution sample set, i.e. the samples used for causal analysis evaluation, does not have labels. The set $T$ is a subset of set $C$ if every sample in $T$ is also in $C$, which is equivalent to $C$ being a necessary condition for $T$, or $T \Rightarrow C$. Checking the negation of this definition is much easier (just checking that no counter-example exists). For this purpose, a scatterplot is generated by evaluating the task classifier and concept classifier on the distribution sample set (each point in the scatterplot is a sample of this set). A counter-example, in this case, is a sample in $T$ and not in $C$ (e.g. a sample classified as 0 and not red). So the correctness of the clause $T \Rightarrow C$ corresponds to the case where the top left corner of the evaluation graph is empty, or equivalently no counter-examples are found in our distribution sample set (Figure 2d). Note that the points of the scatter plots are only outputs of the task classifier ($g(\cdot\,;\theta_T)$) and concept classifier ($g(\cdot\,;\theta_C)$) based on the layer and are not necessarily close to true labels. This is a positive point, since we want to measure the relationship based on network information and not the true labels. Other corners being empty translate to the other relationships between concept and task class.
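The counter-example check for the necessary relationship can be sketched as below, assuming the two classifiers have already been evaluated on the distribution sample set. The fixed threshold of 0.5, the function name and the toy scores are assumptions made for illustration; the soft-threshold version is what the quantification section develops.

```python
def necessary_counter_examples(task_scores, concept_scores, threshold=0.5):
    """Indices of samples predicted as the task class but not as the concept.

    These are the points in the top left corner of the scatterplot
    (task score on the vertical axis, concept score on the horizontal axis).
    """
    return [
        i for i, (task, concept) in enumerate(zip(task_scores, concept_scores))
        if task >= threshold and concept < threshold
    ]


# Toy classifier outputs on five unlabelled samples (hypothetical values).
toy_task_scores = [0.9, 0.8, 0.1, 0.2, 0.7]
toy_concept_scores = [0.95, 0.9, 0.8, 0.1, 0.3]
violations = necessary_counter_examples(toy_task_scores, toy_concept_scores)
```

Here the last sample is classified as the task class without the concept being detected, so the necessary relationship does not hold strictly on this toy set.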

Two observations support our choice for using the same structure for the detection of concepts. First, if the concept is present and the network is using it, the network has to extract information using its structure so the network structure should be able to detect it. Second, if the concept is not detectable by the existing structure there is no way of it being involved in the network decision. Of course, if the network is not using the concept but it’s present in the layer, the evaluation analysis will detect the concept not being involved in task class decision making.

Though concepts like colour can be easily learned by much simpler network structures, more complex concepts like the presence of objects (a stethoscope) might not be as simple to detect. Since this network is pre-trained (on task classification), using it for simpler concepts is not a restriction. Moreover, the structural coherence of the network is kept intact. In other words, the limitations, powers and local behaviour of the network (as initial parameters) are considered in the detection of concept, keeping the convolutional activations as convolution representations (with the spatial information preserved).

Of course, this evaluation is only done using the samples and does not necessarily mean that the real sets defined by the decision boundary of the concept and target classifier networks have such a relation. In fact, we are only interested in the samples that are on the manifold (see Figure 3). And decision boundary behaviour outside the data manifold distribution does not have any effect on the practical relationship and hence is not important to us. For example in Figure 3, the concept (C) is a sufficient condition for the task class (T) since all samples predicted as concept red are classified as target class.

Fig. 3: The two sets might not have a relationship but on the manifold of data they might have a strong relationship.

IV Quantifying the Relationships

Since the concept classifier and task class are represented by soft decisions (outputs of the two networks), we propose a method to quantify the absence of counter-examples similar to the ROC curve (see Figure 4).

Consider the fact that the logical expression $T \Rightarrow C$ is equivalent to $\neg T \lor C$. For the expression to be true, either $\neg T$ or $C$ has to be true. Assuming the threshold $t$ for both expressions and calculating $\neg T$ by $1 - T$, we get the fact that the expression holds for any sample that is not in the F section.
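Written out per sample, the thresholded test reads as follows. The symbols are chosen here for illustration: $T(x)$ and $C(x)$ are assumed to denote the soft outputs of the task and concept classifiers for a sample $x$, and $t$ the shared threshold.

```latex
\begin{align*}
T \Rightarrow C \;&\equiv\; \neg T \lor C \\
\text{the clause holds for } x \;&\iff\; 1 - T(x) \ge t \;\text{ or }\; C(x) \ge t \\
x \text{ is a counter-example} \;&\iff\; T(x) > 1 - t \;\text{ and }\; C(x) < t
\end{align*}
```

The last line is simply the negation of the second, so the counter-example region grows as $t$ increases.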

For better handling of imbalanced classes we use the adapted F1 score instead of accuracy:


where and

are adapted precision and recall defined as:


TP, TN and F denote the number of samples in the corresponding part of Figure 4.

Fig. 4: The process of creating the quantification curve for the necessary score. In this process we check that if the task class is true with probability at least $1 - t$, then the concept is true with probability of at least $t$. The highest value is achieved when there are no counter-example ($F = 0$) samples.

Based on the introduced parameters, for each threshold $t$, an F1 score can be calculated. The measure of the strength of a necessary relationship is then calculated as the area under the F1 versus threshold curve. Intuitively, the strongest relationships in this measure hold with higher accuracy for smaller thresholds $t$.

For simplicity here, we assumed that the thresholds for $\neg T$ and $C$ are equal, but in general these thresholds can be considered as threshold $t_1$ for $\neg T$ and threshold $t_2$ for $C$. In that case, the quantification curve will be a 3D surface (the F1 score vs. $t_1$ and $t_2$) and the volume under the surface should be used as the measure of the strength of the relationship.

Each of these relationships can logically be converted to an OR ($\lor$) expression and evaluated in the same manner as the necessary relationship. A simpler way is to logically convert them to a necessary evaluation and quantify them with the mentioned process. For instance, $C$ being sufficient for $T$ is equivalent to $\neg C$ being necessary for $\neg T$. The negation of concepts and tasks ($\neg C$, $\neg T$) is calculated by just subtracting them from one, i.e. ($1 - C$, $1 - T$), so for negative necessary measurement, we do the same calculation with the negative of concept values.

For each of our experiments, we create four quantification curves based on the scatterplot. They help quantify the relationship between the concept and task class. Each quantification curve shows the F1 score against different thresholds. We further use the area under the curve (AUC) to summarize each curve into a real-valued score between 0 and 1. The area under the curve is a good estimate of how strong the relationship is. For instance, an area under the curve close to one indicates a very strong relationship, while an area of zero means there is no basis for that relationship. In the next section, we demonstrate the proposed methods over several experiments.
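The four curves and their areas can be sketched as below. Note the substitution: the definitions of the adapted precision and recall are not reproduced in this text, so this sketch uses the plain fraction of samples satisfying the clause as a stand-in for the adapted F1 score. The resulting numbers therefore illustrate the procedure only and are not the scores reported in the experiments. All function names and toy values are hypothetical.

```python
def satisfied_fraction(task, concept, t):
    """Fraction of samples for which (not task) OR concept holds at threshold t."""
    ok = sum(1 for ts, cs in zip(task, concept) if (1.0 - ts) >= t or cs >= t)
    return ok / len(task)


def area_under_curve(task, concept, steps=100):
    """Trapezoidal area under the stand-in score versus threshold curve on [0, 1]."""
    thresholds = [i / steps for i in range(steps + 1)]
    ys = [satisfied_fraction(task, concept, t) for t in thresholds]
    return sum((ys[i] + ys[i + 1]) / 2.0 for i in range(steps)) / steps


def negate(values):
    """Soft negation: subtract each score from one."""
    return [1.0 - v for v in values]


def four_scores(task, concept):
    """Evaluate all four relationships by reducing each to a necessary evaluation."""
    return {
        "necessary": area_under_curve(task, concept),
        "sufficient": area_under_curve(negate(task), negate(concept)),
        "negative_necessary": area_under_curve(task, negate(concept)),
        "negative_sufficient": area_under_curve(negate(task), concept),
    }


# Toy outputs: every task-positive sample is concept-positive, but not the reverse.
toy_task = [1.0, 1.0, 0.0, 0.0]
toy_concept = [1.0, 1.0, 1.0, 0.0]
toy_scores = four_scores(toy_task, toy_concept)
```

With a plain fraction, a single counter-example among four samples still leaves a fairly high score, which gives some intuition for why a measure that handles imbalanced classes better is preferred in the text.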

V Experiments and Results

In this section, we explore the application of the proposed method in the evaluation of the relationship between neural network task classes and concepts in a controlled setting and a real-world setting. The controlled settings use neural networks with an AlexNet structure and the real-world setting uses a pretrained ResNet18. We compare our results with the TCAV [11] and IBD [22] methods as they are the works most closely related to the proposed method.

We have constructed two controlled datasets to simulate the scenario where certain concepts have a positive or negative correlation with the task classes.

VI Coloured MNIST Dataset

MNIST is an image recognition benchmark dataset consisting of ten classes of handwritten digit images [14]. We modify this dataset to add useful or useless additional hints (as colour) to the samples for the neural network to use. Each class may correspond with multiple colours. The colour of each digit in a class is chosen randomly from a set of two colours. We choose the colours of each class (its colour set) in a way that no two classes (colour sets) in the colour space can be linearly separated (see Figure 5). We simulate two scenarios: 1) each digit is coloured using its own set of colours, i.e. the colour concept and task class are fully correlated. We call this dataset ColorDataset1. 2) all colourings are random (from the same set), i.e. there is no relation between the colour concept and task class. We call this dataset ColorDataset2.

Fig. 5: How the two hint colours are chosen for each digit in ColorDataset1. Each digit is associated with two different colours that have the highest distance in colour space.

For generating the concept samples (for training the concept classifier), we shuffle pixels of images (to wipe out the image information) and then add colours. Each colour set (two colours) is considered a concept.
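The concept sample generation described above can be sketched as follows. The exact colouring operation is not specified in the text, so scaling each pixel intensity by an RGB triple is an assumption here, as are the function name and the toy image.

```python
import random


def make_concept_sample(gray_image, colour, rng):
    """Shuffle pixels to wipe out the digit shape, then tint the result with one colour.

    gray_image: list of rows of intensities in [0, 1].
    colour: (r, g, b) triple in [0, 1] (assumed colouring scheme).
    rng: a random.Random instance, so that shuffling is reproducible.
    """
    height, width = len(gray_image), len(gray_image[0])
    flat = [pixel for row in gray_image for pixel in row]
    rng.shuffle(flat)  # destroys spatial structure, keeps the intensity histogram
    return [
        [tuple(flat[r * width + c] * channel for channel in colour) for c in range(width)]
        for r in range(height)
    ]


# Toy 2x3 "image" tinted pure red (hypothetical values).
toy_image = [[0.0, 0.5, 1.0], [0.25, 0.75, 0.0]]
toy_sample = make_concept_sample(toy_image, (1.0, 0.0, 0.0), random.Random(0))
```

Because only the pixel positions change, the shuffled sample keeps the colour information while carrying no usable digit shape.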

VII Images with Captions Dataset

In this dataset, we add a hint caption to two classes of the ImageNet dataset [6, 17], namely class dog and class cat. The hint is added as white text on the image (by changing the pixels of the image), so part of the sample pixels carries some extra information about the class. We consider two scenarios: 1) The caption always reads the same as the image (the word cat for cat images, and dog for dog images). We call this dataset CaptionDataset1. 2) The caption is always a random word and hence does not include any information about the classification task (dogs vs. cats). We call this dataset CaptionDataset2. The captions have random rotation and scaling associated with them. Figure 6 shows two samples of the CaptionDataset1 images.

Fig. 6: Two samples from CaptionDataset1.

For generating the concept samples (caption concept), we shuffle the pixels of images (to wipe out the image information) and then add a caption to the resulting shuffled image. This technique makes sure that only the concept is present in these samples, so that our representation of the concept is as accurate as possible.

VIII Analysis of Concept Influence

In this experiment, we show the effectiveness of the proposed method in detecting the causal relationship between a concept and task classes of neural networks. We train a neural network on each of our datasets. The results for CaptionDataset1 and CaptionDataset2 are shown in Figure 7. Similar results were obtained on ColorDataset1 and ColorDataset2. On the left side, it can be seen that the concept (caption dog) was detected to have a 98% necessary relationship with the class dog. All samples predicted by the task classifier as the dog class (above the red horizontal line) are predicted by the concept classifier to have the concept (i.e. they are on the right of the red vertical line). Plotting the modified F1 score for different values of the threshold gives the necessary quantification curve (blue), which has an area under the curve close to one. The sufficient quantification curve (orange) can be obtained from the scatter plot of $\neg T$ vs. $\neg C$ (instead of $T$ vs. $C$) because $C$ being sufficient for $T$ is equivalent to $\neg C$ being necessary for $\neg T$. The negative necessary quantification curve (green) can be obtained from the scatter plot of $T$ vs. $\neg C$, and the negative sufficient quantification curve (red) can be obtained from the scatter plot of $\neg T$ vs. $C$. The results for CaptionDataset2 are shown on the right of Figure 7. It can be seen that there is no tangible relationship between the dog class and the caption concept, i.e. the AUCs of all four measures are small. This confirms that the proposed method detects the causal relationship between the concept and the task classification.

Fig. 7: Comparison of two networks with the same structure, one trained on CaptionDataset1 (left side) and the other trained on CaptionDataset2 (right side). An area close to one under the necessary curve shows that the concept is necessary for the task class. The value of each AUC is given at the bottom of the figure. These results are extracted from the last layers of a neural network with an AlexNet structure.
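The way the four curves relate to one another can be sketched in a few lines. This is only an illustration under stated assumptions: the score used here (the share of class-positive samples that are also concept-positive) is a simple stand-in for the modified F1 score, the same threshold is applied to both classifiers, and the function names are hypothetical.

```python
import numpy as np

def necessity_at(class_out, concept_out, t):
    """Stand-in score: share of samples assigned to the class (output > t)
    that are also assigned the concept (output > t)."""
    in_class = class_out > t
    if not in_class.any():
        return 0.0  # no sample reaches the class threshold
    return float((concept_out[in_class] > t).mean())

def area_under_curve(class_out, concept_out, thresholds):
    """Trapezoidal area of the score curve, normalised to lie in [0, 1]."""
    scores = np.array([necessity_at(class_out, concept_out, t) for t in thresholds])
    widths = np.diff(thresholds)
    area = np.sum(widths * (scores[1:] + scores[:-1]) / 2.0)
    return float(area / (thresholds[-1] - thresholds[0]))

def relationship_measures(class_out, concept_out, thresholds=None):
    """Four areas: necessary, sufficient, and their negated-concept versions."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    not_concept = 1.0 - concept_out
    return {
        "necessary": area_under_curve(class_out, concept_out, thresholds),
        # Sufficiency is necessity with the two roles swapped.
        "sufficient": area_under_curve(concept_out, class_out, thresholds),
        "negative_necessary": area_under_curve(class_out, not_concept, thresholds),
        "negative_sufficient": area_under_curve(not_concept, class_out, thresholds),
    }

class_out = np.array([1.0, 1.0, 0.0, 0.0])    # toy task classifier outputs
concept_out = np.array([1.0, 1.0, 1.0, 0.0])  # toy concept classifier outputs
measures = relationship_measures(class_out, concept_out)
```

In this toy input every class-positive sample carries the concept, so the necessary area is one, while the sufficient area is lower because one concept-positive sample falls outside the class.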

Having established that the method can detect the usefulness of the hints, we use only CaptionDataset1 and ColorDataset1 in the remaining experiments.

IX Informativeness of the Proposed Relationship Measures

In this experiment, we show how the proposed relationship measures can reveal more than just a correlation between a concept and a task class. We examine several classes of a ten-class neural network trained on our ColorDataset1 to check the causal relationships between different target classes and a particular concept. In particular, we inspect the colour set associated with the class of the handwritten digit 1 (see Figure 5). We expect this concept (the colour set of class 1) to be a necessary condition for class 1 and negative necessary for the other classes. Figure 8 shows the results of this experiment for task classes 0 and 1. The presence of this colour set is an 86% necessary condition for class 1 (see the solid blue curve in the bottom-right sub-figure), while it is a 93% negative necessary condition for class 0 (see the dashed green curve in the bottom-left sub-figure). This confirms that for class 1 it is necessary to have the colour set (if it is not present, the sample is not classified as class 1), but for other classes, such as class 0, it is necessary not to have this colour set (if the colours are present, the output will not be class 0).

Fig. 8: Analysis for class 0 (left) and class 1 (right), showing the causal relationship of a colour concept with two different classes. The concept is necessary for class 1 (right) but inconsistent with class 0 (left), so the negation of the concept is necessary for class 0.

X Comparison with Linear Based Classifier

Many methods use linear classifiers to extract concept information from the middle layers of neural networks. In this experiment, we check the implications of using a linear classifier to detect concepts in network activations. We check how well this assumption holds when detecting a concept in the earlier layers of a neural network trained on our CaptionDataset1. By design, the class dog and the caption dog have a positive correlation (all dog images have the dog caption). Figure 9 shows two scenarios, one with our design (a nonlinear concept classifier) on the left and one with a linear design (a linear classifier as the concept classifier) on the right, and compares the proposed measures with directional derivative measures in capturing the relationship between task class and concept.

Our method shows that, for this layer, the negative sufficient (red curve) and negative necessary (green curve) relationships are both good descriptions of the relationship, since the areas under the green (78%) and red (89%) curves are high. With linear classifiers, the four relationship area measures are almost the same, and the designed relationship cannot be detected. These findings are consistent with the points mentioned in [13].

Fig. 9: Shortcoming of linear classifiers in picking up a concept in the earlier layers of the neural network (right), while the concept is easily detected with the proposed method (left). If a concept is not present, knowing where in the network it disappears can be a helpful tool, for instance in transfer learning.

This experiment shows that linear classifiers fail to extract concept information, specifically in the earlier network layers. The same phenomenon might be observed in the final layers of a neural network when a concept is merged with other concepts to create a higher-level concept.
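A toy illustration of why a linear concept classifier can miss a concept that a nonlinear one picks up. This is not the experiment reported above: the two-dimensional "activations", the XOR-like concept, and the least-squares probe with one hand-added interaction feature (standing in for a nonlinear concept classifier) are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.uniform(-1.0, 1.0, size=(2000, 2))
# Toy concept: present when both coordinates share a sign (XOR-like layout).
labels = np.where(activations[:, 0] * activations[:, 1] > 0, 1.0, -1.0)

def fit_and_score(features, labels):
    """Least-squares probe with a bias term; returns training accuracy."""
    design = np.hstack([features, np.ones((len(features), 1))])
    weights, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return float(np.mean(np.sign(design @ weights) == labels))

# Linear probe on the raw activations.
linear_acc = fit_and_score(activations, labels)

# Nonlinear probe: the same fit after adding one interaction feature.
interaction = (activations[:, 0] * activations[:, 1])[:, None]
nonlinear_acc = fit_and_score(np.hstack([activations, interaction]), labels)
```

The concept is fully encoded in the activations, yet the linear probe stays close to chance because no single hyperplane separates the two label groups; the probe with the interaction feature recovers it almost perfectly.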

XI Comparison with TCAV Method

In many methods, the directional derivative of a classifier with respect to activations or input is considered a good representation of the model’s local behaviour (first-order approximation) [11, 22]. As shown in TCAV [11], the directional derivative can be calculated as the dot product of the concept gradient and the task gradient. These methods use the dot product of the concept classifier gradient and the task classifier gradient as an agreement score on different class samples (linear assumption). Here we demonstrate that this score is not accurate. We test the score on a neural network trained on our CaptionDataset1, examining the relationship between the dog class and the caption cat. By design, these two pieces of information are inconsistent in the training data, i.e. no training sample has both. Figure 10 shows that the proposed measures correctly capture the negative relationship (negative necessary) between the class dog and the concept caption cat: the area under the negative necessary (green) curve is very close to one. The third row of Figure 10 shows the distribution of directional derivatives. Although the concept and class are inconsistent by design, the directional derivatives are positive on all samples of the distribution sample set, showing that this score is not a reliable explanation. The distribution sample set (the points at which the evaluations were done) consists of dog and cat images with both dog and cat captions.

Fig. 10: Results of the caption experiment for linear (right) and nonlinear (left) concept classifiers show that the directional derivative can be misleading in both cases. First row: the joint distribution of concept and class decisions for the nonlinear and linear models. Second row: changes of the relationship measures with respect to different thresholds. Third row: distribution of the directional derivative. Moreover, the new measures give more information about the type of relationship between concept and task class.
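The dot-product agreement score discussed in this section can be sketched as follows. As a simplifying assumption, both the task classifier and the concept classifier are modelled here as logistic units on the activations, so their gradients have a closed form; the function names are hypothetical and this is not the TCAV implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_logistic(w, activations):
    """Gradient of sigmoid(w . a) with respect to each activation vector a."""
    s = sigmoid(activations @ w)
    return (s * (1.0 - s))[:, None] * w  # shape (n_samples, n_features)

def directional_derivative_score(w_task, w_concept, activations):
    """Per-sample dot product of the task gradient and the concept gradient."""
    g_task = grad_logistic(w_task, activations)
    g_concept = grad_logistic(w_concept, activations)
    return np.sum(g_task * g_concept, axis=1)

activations = np.array([[0.5, -1.0], [2.0, 0.3]])
w_task = np.array([1.0, 0.0])
aligned = directional_derivative_score(w_task, np.array([2.0, 0.0]), activations)
opposed = directional_derivative_score(w_task, np.array([-2.0, 0.0]), activations)
```

For these logistic units the sign of the score depends only on the angle between the two weight vectors, which is why a first-order score of this kind carries limited information about how the two classifiers' decisions actually co-occur on the samples.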

To further investigate the directional derivative, we computed the difference between the task classifier output and the concept classifier output and plotted it against the directional derivative. Figure 11 shows that there is no positive or negative correlation between the directional derivative value and the agreement of the concept and task classifiers in either the linear or the nonlinear case.
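One way to summarise such a plot numerically is a Pearson correlation between the directional derivative and the output difference. The helper below is a hypothetical sketch and is not part of the reported analysis; the input values are made up for illustration.

```python
import numpy as np

def derivative_vs_gap_correlation(directional_derivs, task_out, concept_out):
    """Pearson correlation between the directional derivative and the
    per-sample difference of task and concept classifier outputs."""
    gap = np.asarray(task_out) - np.asarray(concept_out)
    return float(np.corrcoef(directional_derivs, gap)[0, 1])

# Made-up values in which the derivative tracks the gap exactly.
r = derivative_vs_gap_correlation(
    [1.0, 2.0, 3.0, 4.0],
    [0.6, 0.7, 0.8, 0.9],
    [0.5, 0.5, 0.5, 0.5],
)
```

A value near zero on real outputs would correspond to the absence of correlation described above.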

Fig. 11: Results of caption experiment for different linear (right) and nonlinear (left) concept classifiers show that directional derivative can be misleading in both cases.

XII Comparison with IBD Method

Since the IBD method [22] is limited to convolutional networks, in this experiment we adapt our method to be comparable to IBD. We examine the last hidden layer of a ResNet18 trained on the Places365 dataset [21], a dataset where each class is a place. This network was the benchmark of the IBD method. We use the same set of concept classifiers they trained (with the parameters they provided). We use 10,000 samples from the Places365 validation set, without their labels, as the distribution sample set. Our concepts come from Broden [4], a dataset with segmentation annotations that the IBD method also used as its benchmark. For a better comparison, we use only the concepts originally used in the IBD benchmark.

This experiment is designed to find the most important concepts for classifying each class of the Places365 dataset. For each concept, a concept classifier is trained, and then each task class (a class of Places365) is examined against each concept. The necessary scores of the concepts for each task class are sorted, and the highest values are reported as the most necessary concepts for the class. The most necessary concepts are then compared against the concepts recommended by IBD, which are obtained by decomposing the decisions into concept space. The top seven are reported for both methods. Three different annotators were then asked to highlight concepts that are not relevant to the class; concepts marked by the majority are considered irrelevant (highlighted in Table I). The concepts are listed from left to right in decreasing importance.
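The ranking and majority-vote steps described above can be sketched as follows. The function names are hypothetical, and the scores and votes are made-up illustrative values, not results from the experiment.

```python
def top_concepts(necessary_scores, k=7):
    """necessary_scores maps concept name -> necessary score for one task class."""
    ranked = sorted(necessary_scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def irrelevant_by_majority(votes):
    """votes maps concept name -> list of booleans (True = marked irrelevant)."""
    return {name for name, marks in votes.items() if sum(marks) * 2 > len(marks)}

# Made-up values for illustration only.
scores = {"grass": 0.91, "pitch": 0.88, "cage": 0.12, "goal": 0.64}
votes = {"grass": [False, False, False], "cage": [True, True, False]}

print(top_concepts(scores, k=3))      # ['grass', 'pitch', 'goal']
print(irrelevant_by_majority(votes))  # {'cage'}
```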

Examining the results of the experiment (judged by a majority of three annotators), it is apparent that our method assigns more reasonable necessary scores to the concepts than IBD obtains from its decomposition process. For instance, for their benchmark class topiary garden, IBD’s top seven concepts include tail and sheep (among five others), which are irrelevant to the class of topiary garden. Our method, on the other hand, suggests plant and tree, which are quite relevant to the topiary garden class. For the soccer field class, our method proposes grass, pitch, grandstand, court, person, post and goal, which are all relevant, whereas IBD suggests pitch, field, cage, ice rink, tennis court, grass, and telephone booth. Among these concepts, cage, ice rink, tennis court, and telephone booth are irrelevant to the soccer field class (see Table I).

Ours: plant, hedge, tree, brush, flower, bush, sculpture
IBD: hedge, brush, tail, palm, flower, sheep, sculpture
Ours: crosswalk, road, sidewalk, post, container, streetlight, traffic light
IBD: crosswalk, minibike, pole, rim, porch, central reservation, van
Ours: armchair, sofa, back, cushion, back pillow, coffee table, ottoman
IBD: armchair, fireplace, inside arm, shade, sofa, frame, back pillow
Ours: pedestal, sales booth, shop, case, bag, bulletin board, food
IBD: sales booth, pedestal, food, fluorescent, shop, shops, apparel
Ours: grass, pitch, grandstand, court, person, post, goal
IBD: pitch, field, cage, ice rink, tennis court, grass, telephone booth
Ours: tree, bush, trunk, cactus, brush, fire, leaves
IBD: tree, trunk, bush, leaves, semidesert, grid, clouds
Ours: person, hand, paper, plaything, fabric, bag, board
IBD: paper, drawing, plaything, painting, hand, board, figurine
Ours: drinking glass, stool, table, spindle, menu, person, plate
IBD: plate, light, stool, sash, napkin, display board, spindle
Ours: shoe, bottle, shelf, box, gym shoe, boot, bag
IBD: shoe, gym shoe, handbag, hat, catwalk, shop window, minibike
Ours: mountain, hill, desert, badlands, rock, valley, land
IBD: hill, badlands, desert, cliff, cloud, mountain, diffusor
Ours: mountain, rock, cliff, hill, badlands, land, desert
IBD: cliff, mountain, badlands, desert, pond, bumper, hill
Ours: sea, sand, land, embankment, rock, mountain, water
IBD: sea, wave, land, mountain pass, sand, cliff, cloud
Ours: bush, river, rock, land, cliff, tree, earth
IBD: river, waterfall, land, pond, leaf, ice, fire
TABLE I: Explanations for several classes using our method and IBD method. The concepts labelled as irrelevant by the majority of three annotators are in bold.

We also realized that the quality of the IBD benchmark concepts had not been verified. We therefore checked for which samples of the distribution sample set each concept is maximized (the output of the concept classifier network is maximal) and for which it is minimized. Figure 12 shows examples of this experiment. In each section, the first row shows the five test images with the minimal concept classifier output, in increasing order of relevance from left to right. The second row shows the five test images that maximize the concept, in increasing order of relevance from left to right. This test serves as a sanity check on concept learning (since the IBD method does not check the accuracy of its learned concepts): it shows what the network considers to be each concept and reveals whether some of the concepts are unreliable.

Fig. 12: Some minimal and maximal samples of concepts.
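Selecting the minimal and maximal samples for a concept amounts to sorting the concept classifier outputs over the distribution sample set. A minimal sketch, with a hypothetical function name and made-up outputs:

```python
import numpy as np

def extreme_samples(concept_outputs, k=5):
    """Indices of the k lowest and the k highest concept classifier outputs,
    each listed in increasing order of output."""
    order = np.argsort(concept_outputs)
    return order[:k], order[-k:]

# Made-up concept classifier outputs for seven samples.
outputs = np.array([0.9, 0.1, 0.5, 0.99, 0.02, 0.7, 0.3])
lowest, highest = extreme_samples(outputs, k=2)
```

The returned indices can then be used to look up and display the corresponding images, one row for the minimal samples and one for the maximal samples.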

For instance, we found that the concepts sheep, cow, hay and elephant are not accurate and are maximized whenever a dirt field is present (such as a horse race track), as shown in Figure 13. These figures follow the same structure as Figure 12. Their second rows show that some images even maximize more than one of these concepts, meaning the network cannot distinguish between them. In other words, in this layer these individual animal concepts are merged to form a higher-level “animal” concept while the original ones are forgotten (apparently the type of animal is not important for Places365 classes). Using concepts without verification decreases the quality of the explanations of the IBD method.

Fig. 13: Some minimal and maximal samples of animal related concepts.
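The overlap between concepts noted above (the same images maximizing several concepts) can be checked by intersecting the top-k sample sets of each pair of concepts. The helper name and the outputs below are assumptions made for illustration.

```python
import numpy as np

def shared_top_samples(outputs_by_concept, k=5):
    """For each pair of concepts, the sample indices found in both top-k sets."""
    top = {
        name: set(np.argsort(out)[-k:].tolist())
        for name, out in outputs_by_concept.items()
    }
    names = sorted(top)
    return {
        (a, b): top[a] & top[b]
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Made-up concept classifier outputs over four samples.
shared = shared_top_samples(
    {"sheep": [0.1, 0.9, 0.8, 0.2], "cow": [0.2, 0.7, 0.95, 0.1]},
    k=2,
)
```

A large intersection for a pair of concepts suggests that, at the inspected layer, the two concept classifiers respond to the same evidence.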

The same phenomenon is observed for the set of concepts of towel, bathtub, toilet tissue, soap dispenser and bidet (Figure 14) and the set of concepts of refrigerator, microwave, knife, kitchen island, and kettle (Figure 15).

Fig. 14: Some minimal and maximal samples of bathroom related concepts.
Fig. 15: Some minimal and maximal samples of kitchen related concepts.

This degrades the accuracy of IBD’s concepts, and the degradation propagates into its decomposition process. These results emphasize the importance of checking the accuracy of a concept prediction, which is one of the very first steps in the proposed method. Our method allows verifying concept classification accuracy and prevents such pitfalls.

For further investigation, we also calculate the deepdream maximal output for each class [15]. Deepdream is a method to calculate an input that maximally activates a certain neural network, by optimizing the input on several scales (or octaves). While deepdream is sometimes a good visualization method to figure out what maximally activates a certain classifier, the method fails to capture higher structures. For instance, in Figure 16, while concepts like sky, tree or fabric make sense, the concepts with a higher structure like wall, floor, windowpane, building, person, or head are nothing like what a human may expect.

Fig. 16: Deepdream results of the concept maximal inputs for the top 12 most common concepts. While some of them make sense, most of them are unexpected.
Fig. 17: Deepdream results of the concept maximal inputs for the next 12 most common concepts. While some of them make sense, most of them are unexpected.

XIII Discussions

The distribution sample set, the set that represents the distribution of likely inputs of the network, plays an important role in our analysis, as all measure evaluations are based on the samples of this set. Checking the network with the same set that it was trained on will most likely confirm the network’s decisions. Most methods that predict the behaviour of the network need such sample sets; for instance, TCAV [11] needs samples from the task class.

The distribution sample set represents likely cases of input and should be a good representation of the inputs that the network will be tested on. For instance, using the job title classification example, if we expect most but not all doctors to wear lab coats, this should be reflected in the distribution sample set. On the other hand, if we don’t expect the network to be tested on images of doctors with stethoscopes in a garage fixing cars, the distribution sample set should not contain such images. Because the distribution sample set does not need any kind of labelling, we can use any set of inputs, such as a held-out part of the data or even inputs recorded from other sources, as long as they are a good representation of likely task classification inputs.

The choice of which layer to inspect is not straightforward. Inspecting later layers is computationally cheaper (since training the concept classifier is cheaper), but there is no guarantee that the concepts are still present in those layers, since the network might have traded them for a combination of concepts more useful for the task classification. For instance, the presence of a stethoscope might not be detectable in the last layer, while the presence of a medical instrument might be (distinguishing images that include a medical instrument from the ones that don’t). For this reason, we start our analysis from the last layer of the network and work our way back until we reach a layer in which the concept is present (classifiable with good accuracy) or reach the first layer (which indicates that the concept is too hard for the network to detect).
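The backward search over layers can be sketched as a simple loop. The function name, the callback `concept_accuracy`, the layer names, and the accuracy threshold of 0.9 are assumptions for illustration; what counts as good accuracy is left to the analyst.

```python
def find_concept_layer(layer_names, concept_accuracy, min_accuracy=0.9):
    """Walk from the last layer to the first and return the first layer
    encountered whose concept classifier reaches min_accuracy.
    Returns None if the concept is not classifiable in any layer."""
    for name in reversed(layer_names):
        if concept_accuracy(name) >= min_accuracy:
            return name
    return None

# Made-up accuracies of a concept classifier trained on each layer.
layers = ["conv1", "conv2", "fc1", "fc2"]
accuracy = {"conv1": 0.97, "conv2": 0.95, "fc1": 0.93, "fc2": 0.71}
print(find_concept_layer(layers, accuracy.get))  # fc1
```

In practice `concept_accuracy` would train and evaluate a concept classifier on the activations of the named layer, which is why starting from the cheaper later layers keeps the search inexpensive.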

XIV Conclusion and Future Work

We proposed a framework for verifying the presence of high-level concepts in the activations of the intermediate layers of neural networks. We also determine the type or nature of the causal relationship between a concept and the neural network task classes by quantification of the causal relationship between the task classes and the concept. We showed the effectiveness of the proposed measurements through several comparative experiments, demonstrating improved performance compared with previous methods.

With our method, tracking a certain concept is possible, and the analysis can show whether the concept was too difficult for the whole network to begin with or was lost in the intermediate layers. Moreover, the analysis shows how much influence the concept has on each layer and potentially where it was forgotten. The main reason this analysis is possible is the structure of the concept classifier.

Most concepts influencing a certain decision are necessary conditions, but for finding equivalence conditions we need to find a concept that is both necessary and sufficient. A combination of necessary conditions to generate equivalence conditions can be the subject of future work.


  • [1] J. Adebayo, J. Gilmer, I. Goodfellow, and B. Kim (2018) Local explanation methods for deep neural networks lack sensitivity to parameter values. arXiv preprint arXiv:1810.03307.
  • [2] G. Alain and Y. Bengio (2016) Understanding intermediate layers using linear classifier probes. arXiv:1610.01644.
  • [3] M. T. Bahadori and D. E. Heckerman (2020) Debiasing concept bottleneck models with instrumental variables. arXiv preprint arXiv:2007.11500.
  • [4] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017) Network dissection: Quantifying interpretability of deep visual representations. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Vol. 2017-Janua, pp. 3319–3327. arXiv:1704.05796v1, ISBN 9781538604571.
  • [5] Z. Chen, Y. Bei, and C. Rudin (2020) Concept whitening for interpretable image recognition. Nature Machine Intelligence 2 (12), pp. 772–782.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
  • [7] R. Fong and A. Vedaldi (2018) Net2vec: quantifying and explaining how concepts are encoded by filters in deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8730–8738.
  • [8] A. Ghorbani, J. Wexler, J. Y. Zou, and B. Kim (2019) Towards automatic concept-based explanations. In Advances in Neural Information Processing Systems, pp. 9277–9286.
  • [9] B. Goodman and S. Flaxman (2017) European union regulations on algorithmic decision-making and a “right to explanation”. AI Magazine 38 (3), pp. 50–57.
  • [10] Y. Goyal, A. Feder, U. Shalit, and B. Kim (2019) Explaining classifiers with causal concept effect (cace). arXiv preprint arXiv:1907.07165.
  • [11] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres (2018) Interpretability beyond feature attribution: Quantitative Testing with Concept Activation Vectors (TCAV). 35th International Conference on Machine Learning, ICML 2018 6, pp. 4186–4195. arXiv:1711.11279v5, ISBN 9781510867963.
  • [12] P. Kindermans, S. Hooker, J. Adebayo, M. Alber, K. T. Schütt, S. Dähne, D. Erhan, and B. Kim (2017) The (un) reliability of saliency methods. arXiv preprint arXiv:1711.00867.
  • [13] P. W. Koh, T. Nguyen, Y. S. Tang, S. Mussmann, E. Pierson, B. Kim, and P. Liang (2020) Concept bottleneck models. In International Conference on Machine Learning, pp. 5338–5348.
  • [14] Y. LeCun (1998) The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
  • [15] A. Mordvintsev, C. Olah, and M. Tyka (2015) Inceptionism: going deeper into neural networks.
  • [16] C. Olah, A. Mordvintsev, and L. Schubert (2017) Feature visualization. Distill 2 (11), pp. e7.
  • [17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252.
  • [18] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings, pp. 1–10. arXiv:1312.6199.
  • [19] J. Wexler, M. Pushkarna, T. Bolukbasi, M. Wattenberg, F. Viégas, and J. Wilson (2019) The what-if tool: interactive probing of machine learning models. IEEE transactions on visualization and computer graphics 26 (1), pp. 56–65.
  • [20] C. Yeh, B. Kim, S. Arik, C. Li, T. Pfister, and P. Ravikumar (2020) On completeness-aware concept-based explanations in deep neural networks. Advances in Neural Information Processing Systems 33.
  • [21] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [22] B. Zhou, Y. Sun, D. Bau, and A. Torralba (2018) Interpretable basis decomposition for visual explanation. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11212 LNCS, pp. 122–138. ISSN 16113349, ISBN 9783030012366.