Recently, convolutional neural networks have been applied to many different tasks, such as image classification and object detection, and have achieved great success. However, the network decision-making process is still a puzzle to us. Therefore, interpretation models have attracted more and more attention. Among them, visual interpretation is a popular research direction owing to its similarity to the way humans understand. Previous visual interpretations of internal features focus on giving the pattern of the network on an individual layer.
Nevertheless, they do not provide an interpretable logical process for network decision-making.
The human decision-making process is logically interpretable. This raises the question of whether the decision-making process of networks could be understood in a human-logic way. To answer it, we should solve the following two issues:
For visual tasks, we build our logic on visual concepts, i.e., color, material, part, object, and scene. The hierarchical structure of visual concepts goes from low to high semantic levels. In comparison, the stratified structure of networks builds from shallow to deep layers. Could the hierarchical structure of visual concepts be used to explain the stratified structure of networks in visual tasks?
In human logic, a decision can be hierarchically deduced into sub-decisions. Could the decision of networks be explained in the form of a hierarchical inference?
In this paper, we propose the framework of Concept HierArchical INference interpretation (CHAIN), which is inspired by the human understanding way. Human logic is always built from interpretable elements, such as visual concepts for images. Accordingly, for the first issue, we propose the concept harmonizing model, which can interpret networks utilizing interpretable visual concepts. In this model, network features from shallow to deep layers are harmonized with visual concepts from low to high semantic levels.
For humans, high-level visual concepts such as scene can be deduced into low-level visual concepts such as object and part. Therefore, for the second issue, we propose the hierarchical inference model to decompose a concept in a deep layer into units in a shallow layer. Subsequently, the concept-harmonized hierarchical inference model is introduced to infer a deep-layer concept from its shallow-layer concepts.
Consequently, CHAIN explains a network decision by representing its concept-harmonized hierarchical inference in a human-understandable way. The main contributions of this work are summarized as follows:
In the CHAIN interpretation, we explain net features in a stratified structure with visual concepts in a hierarchical structure. Specifically, we first build the concept harmonizing model in which visual concepts are aligned with net-units in a depth-stratified way (from deep to shallow layers).
In the CHAIN interpretation, the feature learning of the network from shallow to deep layers is interpreted as a hierarchical logical process of the decision-making from low to high levels. To achieve that, we successively introduce the hierarchical inference model and the concept-harmonized hierarchical inference model. Through the hierarchical inference model, the net learning for a deep-layer concept is inferred from shallow-layer features. Based on the previous models, the concept-harmonized hierarchical inference model can hierarchically deduce a deep-layer concept into shallow-layer concepts.
At the instance level, CHAIN provides the concept-harmonized hierarchical inference of a net decision, which is an understandable logic for the network decision-making process from deep to shallow layers. Moreover, at the class level, the CHAIN interpretation can explain net decisions for a class.
In experiments, we qualitatively and quantitatively analyze the CHAIN interpretation for the intra- and inter-class cases.
II Related Work
Recently, the study of network interpretation has drawn increasing attention and gained popularity. In this section, we review three main branches of network interpretation, as described below:
Input-based network interpretation. The study in this direction explains a network by learning the input regions critical to a particular net output. It utilizes the perturbation mechanism, in which critical input regions are obtained by perturbing the input and observing output changes [4, 5, 6, 7]. Therefore, it builds a mapping from the output space to the input space, which can give us interpretable visualization to understand net decisions. Nonetheless, its visual interpretation is only based on the input space, which also means it cannot explain the internal network mechanism. Meanwhile, the shape of the perturbation patches, which is a super-pixel or a regular grid, restricts the shape of the visual interpretation based on input images.
Feature-based network interpretation.
Another popular interpretation technique is visualizing the internal features of networks. By using the gradient information of net features, it can give a visual interpretation of internal networks. Guided Backpropagation and Deconvolution visualize the image pattern that obtains the largest activation of a particular net-unit. However, these are only general interpretation methods, which means they cannot explain a specific net decision for a given input. In comparison, CAM and Grad-CAM provide a class-discriminative interpretation of a specific net decision by visualizing the linear combination of features weighted by their gradients to the target output. Nevertheless, those approaches heavily depend on the calculation of net gradients, which is not as effective for shallow layers as it is for the last layer. To overcome this, the CHIP model, based on channel-wise perturbation, can distill class-discriminative channels from shallow to deep layers to interpret internal net-features. Depending on the channel-wise perturbation, its performance is not limited by the perturbation-patch shape or the net-layer location.
Concept-based network interpretation. To interpret net-features as human-understandable concepts, researchers proposed the concept-based interpretation. The internal representation of networks can be interpreted with visual concepts by evaluating the overlap between a concept region and the saliency region of a net-feature. However, this is a one-to-one alignment and not a learnable approach. This means it does not consider the one-to-many situation in which networks learn the representation of a visual concept from a combination of net-units. Furthermore, Interpretable basis decomposition was proposed to represent the class-discriminative importance weight for the target class as a linear combination of concept-discriminative importance weights for different concepts. Consequently, it can deduce the class-discriminative feature for the target class into concept-discriminative features for different concepts. It should be noticed that its decomposition target and bases need to utilize net features from the same layer in networks. However, studies show that concepts at different semantic levels should be matched with net-features in different layers. Therefore, in its interpretation for the final convolutional layer, low semantic-level concept bases might be as active as high semantic-level concept bases. Meanwhile, this decomposition is limited to explaining one internal layer, in which the prediction is decomposed into the features in the last convolutional layer. Therefore, it cannot explain the layer structure of the net decision-making process.
Here, we propose the CHAIN interpretation to explain the net decision-making from deep to shallow layers by the concept-harmonized hierarchical inference. Accordingly, layer-stratified net-features are explained by semantic-stratified visual concepts, which means net-features in different layers are aligned with visual concepts at different semantic levels. The proposed interpretation is built by hierarchically inferring concepts from deep to shallow layers. Therefore, it can interpret hierarchical network learning as an interpretable decision-making process over visual concepts from high to low semantic levels.
III-A The Framework of Concept-harmonized Hierarchical Inference Interpretation
As humans, we have a knowledge system of visual concepts from low to high semantic levels. According to it, the human decision-making process can be deduced into a series of sub-decision-making processes from high to low levels. In this paper, we propose CHAIN to build a human-understandable interpretation. CHAIN can explain the net decision-making process through the concept-harmonized hierarchical inference, in which the operation of a network presents an analogy to the working of the human brain.
The CHAIN interpretation can be divided into three stages. In the first stage, we build a link between the semantic-stratified structure of visual concepts and the layer-stratified structure of networks. Specifically, the concept harmonizing model is proposed to harmonize net-units with visual concepts at a similar semantic level. In the second stage, the hierarchical inference model interprets the network structure by disassembling the net-unit from deep to shallow layers. Finally, in the concept-harmonized hierarchical inference model, the network can be explained as an interpretable decision-making process. During this process, a high semantic-level concept in a deep layer is deduced into low semantic-level concepts in its shallow layer. Based on that, the hierarchical inference of net-units from deep to shallow layers is understood by utilizing visual concepts from high to low semantic levels. Consequently, CHAIN can present the optimal concept-harmonized hierarchical inference starting from deep to shallow layers.
In the CHAIN interpretation, the layer structure of networks is interpreted by the hierarchical structure of visual concepts. Meanwhile, for a net prediction being interpreted, CHAIN can provide its interpretable hierarchical inference of the net decision-making.
III-B Concept Harmonizing Model
In the concept harmonizing stage, we design a model for learning the correlation between visual concepts and net-units. Visual concepts are aligned with net-units at a similar semantic level.
Visual concepts. Here, we adopt the Broden dataset, which contains visual concepts at different semantic levels. Specifically, the visual concepts in the concept harmonizing model have five semantic levels, i.e., color, material, part, object, and scene concepts from low to high level. Therefore, network features in five layers are selected to be harmonized with concepts at a similar semantic level. The visual-concept samples are labeled at the pixel level, except for the scene concept, whose labels cover full images.
The dataset of the concept harmonizing model. For each concept, a concept harmonizing model is designed to learn its correlation with units in its corresponding semantic-level layer. In the training of the harmonizing model for a particular concept, image pixels containing the target concept are positive samples; image pixels without the target concept are negative samples.
The training samples for the target concept are fed into the network to obtain features in the corresponding layer. Subsequently, we utilize the net sample-feature in the $l$-th layer, denoted as $\mathbf{f}^{l} = [f_1^{l}, \dots, f_{N_l}^{l}]$, to learn the concept harmonizing model, where $f_i^{l}$ is the $i$-th net unit in the $l$-th layer for the target sample and $N_l$ is the number of units.
Finally, for the harmonizing model of the $k$-th concept in the $l$-th layer, we get the dataset $\{(\mathbf{f}^{l}_n, y_n)\}_{n=1}^{N}$, where for the $n$-th input sample $x_n$, $\mathbf{f}^{l}_n$ is its $l$-th net-layer feature and $y_n$ is its concept label.
The harmonizing weights of units to the visual concept. In the harmonizing model, units in the $l$-th layer are harmonized with the $k$-th concept at the corresponding semantic level by the harmonizing weight $\mathbf{w}_k^{l}$. The harmonizing weight of the $i$-th unit is denoted as $w_{k,i}^{l}$.
Concept harmonizing model. The harmonizing model of the $k$-th concept in the $l$-th layer is learned by optimizing the following problem:

$$\min_{\mathbf{w}_k^{l}} \; \sum_{n=1}^{N} \ell\!\left(y_n, \, \mathbf{w}_k^{l\top} \mathbf{f}^{l}_n\right) + \lambda \left\| \mathbf{w}_k^{l} \right\|_1, \tag{1}$$
where $\lambda$ is the regularization parameter and $y_n \in \{0, 1\}$ is the label representing the absence (0) or presence (1) of the target concept. After the optimization of Eq. (1), we can obtain the optimal correlation between the visual concept and the net-units in the corresponding layer.
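As a minimal sketch, the harmonizing weights can be learned by an L1-regularized logistic regression from unit features to the binary concept label. The function name, learning rate, and step count below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def harmonize_concept(F, y, lam=0.01, lr=0.1, steps=500):
    """Learn harmonizing weights w linking net units (columns of F, shape (N, d))
    to a binary concept label y via L1-regularized logistic regression."""
    n, d = F.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # sigmoid prediction
        g_w = F.T @ (p - y) / n                  # logistic-loss gradient
        g_b = np.mean(p - y)
        w -= lr * g_w
        b -= lr * g_b
        # proximal step for the L1 penalty (soft-thresholding)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b
```

The soft-thresholding step keeps the harmonizing weights sparse, so each concept is explained by a small set of units.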
Afterward, the harmonizing model for the next critical concept in the target layer is optimized, and so on and so forth. Once this is done, we obtain the optimal concept harmonizing of net units in the same layer. Subsequently, the optimization is conducted backward for the concept harmonizing of net units in the lower layer. Finally, the concept harmonizing model can get the optimal concept harmonizing from the highest to the lowest semantic level.
III-C Hierarchical Inference Model
In the previous stage, the concept harmonizing was designed to link units with concepts from high to low semantic levels. Based on it, in the hierarchical inference model, the concept in the deep layer can be inferred into the shallow layer. Therefore, a network decision is interpreted by representing its hierarchical inference.
The inference weights of shallow-layer units to the concept in the deep layer. To explain the internal net structure, our CHAIN interpretation needs to optimize the importance of the shallow-layer units for a particular concept in a deep layer.
For the $k$-th concept in the deep layer, the inference weight vector is $\boldsymbol{\alpha}_k = [\alpha_{k,1}, \dots, \alpha_{k,M}]$, where $\alpha_{k,i}$ represents the importance of the $i$-th unit in the shallow layer to the target concept in the deep layer.
Net perturbation for hierarchical inference. In the CHAIN interpretation, we adopt the net perturbation-based approach of the CHIP model. The inference representation is learned by analyzing the variation of the concept in the deep layer after switching off part of its shallow-layer units. The underlying principle is that the concept response in the deep layer drops dramatically if the forward propagation of important shallow-layer units is blocked.
The pre-trained network is perturbed by shallow-layer gates to learn the inference weights. The shallow layer is associated with a gate layer in which each unit gate controls the state of the corresponding unit in the shallow layer. Here, a binary vector $\mathbf{g} = [g_1, \dots, g_M]$ denotes the unit gates. The $i$-th unit in the shallow layer is turned off if $g_i$ is zero.
The perturbed network is generated by adding the unit gate layer after the shallow layer. We denote the original $i$-th unit in the shallow layer as $\mathbf{U}_i \in \mathbb{R}^{W \times H}$, where $W$ and $H$ are the width and height of the channel and $M$ is the number of units. For the $i$-th unit in the shallow layer, the output of the shallow-layer gate layer is

$$\tilde{\mathbf{U}}_i = g_i \, \mathbf{U}_i.$$
For the $i$-th shallow-layer unit, the global average pooling (GAP) of the gated output is

$$\tilde{u}_i = \frac{1}{WH} \sum_{w=1}^{W} \sum_{h=1}^{H} \tilde{\mathbf{U}}_i(w, h).$$
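The gating and GAP steps above can be sketched as follows, assuming the shallow-layer features are stored as an `(M, H, W)` array; the function name is ours.

```python
import numpy as np

def gated_gap(units, gates):
    """Apply binary unit gates to shallow-layer feature maps, then take
    the global average pooling (GAP) of each gated channel.

    units: array of shape (M, H, W) -- M channel feature maps
    gates: binary array of shape (M,) -- 1 keeps a unit, 0 switches it off
    """
    gated = units * gates[:, None, None]      # zero out switched-off units
    return gated.mean(axis=(1, 2))            # GAP per channel
```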
The concept-harmonized unit. Based on the optimal harmonizing weight, the concept-harmonized unit for the $k$-th concept in the deep layer is defined as

$$v_k = \sum_{j} w_{k,j} \, f_j,$$

where $f_j$ means the $j$-th feature in the deep layer and $w_{k,j}$ is its harmonizing weight.
The net function mapping from shallow-layer features to deep-layer features is denoted as $\phi(\cdot)$. Specifically, the original feature of the $j$-th deep-layer unit is expressed as

$$f_j = \phi_j(\mathbf{U}_1, \dots, \mathbf{U}_M).$$
In the perturbed network, we add control gate layers behind the shallow layer without changing the original weights from the shallow to the deep layer in the pretrained network.
After the net perturbation, the concept-harmonized unit in the deep layer, mapped from the perturbed shallow-layer features, is

$$\tilde{v}_k = \sum_{j} w_{k,j} \, \phi_j(\tilde{\mathbf{U}}_1, \dots, \tilde{\mathbf{U}}_M).$$

In the perturbed net, its global average pooling is denoted as $\tilde{s}_k = \mathrm{GAP}(\tilde{v}_k)$.
The dataset of the hierarchical inference model. In order to learn the inference weights of shallow-layer units, we need to generate the perturbed dataset. Specifically, to learn $\boldsymbol{\alpha}_k$, the perturbed dataset is obtained by the following three steps. The first step is to generate the perturbed networks: for the shallow layer, we randomly sample binary channel-gate vectors, and for the other layers, we freeze the channel gates to be open.
Secondly, we feed each image into each perturbed network and get the features of the shallow and deep layers. In the $t$-th perturbed network based on $\mathbf{g}^{(t)}$, the global average pooling of the shallow layer is denoted as $\tilde{\mathbf{u}}^{(t)}$.
Likewise, in the $t$-th perturbed network, the global average pooling of the $k$-th concept in the deep layer is denoted as $\tilde{s}_k^{(t)}$.
Finally, for the $k$-th concept in the deep layer of the image being interpreted, we get the perturbed dataset $\mathcal{D}_k = \{(\mathbf{g}^{(t)}, \tilde{\mathbf{u}}^{(t)}, \tilde{s}_k^{(t)})\}_{t=1}^{T}$.
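The three dataset-generation steps can be sketched as below. Here `deep_fn` is a hypothetical stand-in for the pretrained network's mapping from gated shallow features to the concept-harmonized deep response, and the Bernoulli gate sampling is an assumption, since the sampling distribution is not specified here.

```python
import numpy as np

def make_perturbed_dataset(units, deep_fn, T=200, p=0.5, seed=0):
    """Generate the perturbation dataset for one deep-layer concept.

    units:   (M, H, W) shallow-layer feature maps of the image
    deep_fn: callable mapping gated shallow features -> scalar concept response
    Returns per-perturbation gates, shallow GAPs, and deep concept responses.
    """
    rng = np.random.default_rng(seed)
    M = units.shape[0]
    gates, shallow, deep = [], [], []
    for _ in range(T):
        g = rng.binomial(1, p, size=M)            # random binary unit gates
        gated = units * g[:, None, None]          # switch off gated-out units
        gates.append(g)
        shallow.append(gated.mean(axis=(1, 2)))   # GAP of gated shallow units
        deep.append(deep_fn(gated))               # concept response after perturbation
    return np.array(gates), np.array(shallow), np.array(deep)
```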
Hierarchical inference model.
The inference representation is optimized by solving a local linear regression problem on the net-perturbation dataset. Given the perturbed dataset $\mathcal{D}_k$ of the $k$-th concept in the deep layer, we formulate the hierarchical inference model as

$$\min_{\boldsymbol{\alpha}_k} \; \sum_{t=1}^{T} \pi(\mathbf{g}^{(t)}) \left(\tilde{s}_k^{(t)} - \boldsymbol{\alpha}_k^{\top} \tilde{\mathbf{u}}^{(t)}\right)^2 + \beta \left\|\boldsymbol{\alpha}_k\right\|_1, \tag{8}$$

where $\beta$ is the regularization parameter.
In the CHAIN model, the first term in our interpretation model is the loss function. In the loss function, $\pi(\mathbf{g})$ is the proximity measure between a binary channel-gate vector $\mathbf{g}$ and the all-one vector $\mathbf{1}$. Specifically, it is defined as

$$\pi(\mathbf{g}) = \exp\!\left(-\frac{\|\mathbf{g} - \mathbf{1}\|_2^2}{\sigma^2}\right).$$
The second term is the sparse regularization term, owing to the inherent sparsity of the network structure. Meanwhile, to make the interpretation model simple enough to be interpretable, the sparsity of the inference weights measures the complexity of the interpretation model.
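Putting the two terms together, the objective in Eq. (8) can be evaluated as sketched below; the exponential form of the proximity kernel and the parameter `sigma` are assumptions.

```python
import numpy as np

def chain_objective(alpha, U, s, g, beta=0.01, sigma=1.0):
    """Evaluate the hierarchical inference objective:
    a proximity-weighted squared loss plus an L1 sparsity penalty.

    alpha: (M,) inference weights; U: (T, M) shallow GAPs;
    s: (T,) deep concept responses; g: (T, M) binary gates.
    """
    pi = np.exp(-np.sum((g - 1.0) ** 2, axis=1) / sigma**2)  # proximity to all-ones gates
    loss = np.sum(pi * (s - U @ alpha) ** 2)                 # weighted squared loss
    return loss + beta * np.sum(np.abs(alpha))               # plus L1 penalty
```

Perturbations that switch off few units sit close to the unperturbed network and therefore receive larger weights, which keeps the linear surrogate locally faithful.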
Here, to solve the optimization problem in Eq. (8), we design a hierarchical inference algorithm by adopting the alternating iteration rule to learn $\boldsymbol{\alpha}_k$.
Hierarchical inference algorithm. The optimization problem can be converted into the equivalent formulation

$$\min_{\boldsymbol{\alpha}_k, \, \mathbf{z}} \; \sum_{t=1}^{T} \pi(\mathbf{g}^{(t)}) \left(\tilde{s}_k^{(t)} - \boldsymbol{\alpha}_k^{\top} \tilde{\mathbf{u}}^{(t)}\right)^2 + \beta \|\mathbf{z}\|_1 \quad \text{s.t.} \quad \boldsymbol{\alpha}_k = \mathbf{z}.$$

The augmented Lagrangian for the above problem is

$$L_\rho(\boldsymbol{\alpha}_k, \mathbf{z}, \boldsymbol{\mu}) = \sum_{t=1}^{T} \pi(\mathbf{g}^{(t)}) \left(\tilde{s}_k^{(t)} - \boldsymbol{\alpha}_k^{\top} \tilde{\mathbf{u}}^{(t)}\right)^2 + \beta \|\mathbf{z}\|_1 + \boldsymbol{\mu}^{\top}(\boldsymbol{\alpha}_k - \mathbf{z}) + \frac{\rho}{2} \|\boldsymbol{\alpha}_k - \mathbf{z}\|_2^2.$$

With the scaled multiplier $\mathbf{m} = \boldsymbol{\mu} / \rho$, the equation can be rewritten as

$$L_\rho(\boldsymbol{\alpha}_k, \mathbf{z}, \mathbf{m}) = \sum_{t=1}^{T} \pi(\mathbf{g}^{(t)}) \left(\tilde{s}_k^{(t)} - \boldsymbol{\alpha}_k^{\top} \tilde{\mathbf{u}}^{(t)}\right)^2 + \beta \|\mathbf{z}\|_1 + \frac{\rho}{2} \|\boldsymbol{\alpha}_k - \mathbf{z} + \mathbf{m}\|_2^2 - \frac{\rho}{2} \|\mathbf{m}\|_2^2.$$

Through a careful choice of the new variable, the initial problem is converted into a simple problem. Given that the optimization is considered over the variable $\boldsymbol{\alpha}_k$, the optimization function can be reduced to

$$\min_{\boldsymbol{\alpha}_k} \; \sum_{t=1}^{T} \pi(\mathbf{g}^{(t)}) \left(\tilde{s}_k^{(t)} - \boldsymbol{\alpha}_k^{\top} \tilde{\mathbf{u}}^{(t)}\right)^2 + \frac{\rho}{2} \|\boldsymbol{\alpha}_k - \mathbf{z} + \mathbf{m}\|_2^2.$$

The solution is

$$\boldsymbol{\alpha}_k = \left(2 \tilde{\mathbf{U}}^{\top} \boldsymbol{\Pi} \tilde{\mathbf{U}} + \rho \mathbf{I}\right)^{-1} \left(2 \tilde{\mathbf{U}}^{\top} \boldsymbol{\Pi} \tilde{\mathbf{s}}_k + \rho (\mathbf{z} - \mathbf{m})\right),$$

where $\tilde{\mathbf{U}}$ stacks the perturbed shallow-layer GAP vectors $\tilde{\mathbf{u}}^{(t)}$ row-wise and $\boldsymbol{\Pi} = \mathrm{diag}\big(\pi(\mathbf{g}^{(1)}), \dots, \pi(\mathbf{g}^{(T)})\big)$. In order to calculate $\mathbf{z}$, the optimization problem to be solved is

$$\min_{\mathbf{z}} \; \beta \|\mathbf{z}\|_1 + \frac{\rho}{2} \|\boldsymbol{\alpha}_k - \mathbf{z} + \mathbf{m}\|_2^2.$$

The solution is the soft-thresholding operator

$$\mathbf{z} = S_{\beta/\rho}(\boldsymbol{\alpha}_k + \mathbf{m}), \qquad S_{\kappa}(x) = \mathrm{sign}(x) \max(|x| - \kappa, 0).$$

The Lagrange multipliers update to

$$\mathbf{m} \leftarrow \mathbf{m} + \boldsymbol{\alpha}_k - \mathbf{z}.$$
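The alternating updates form a standard ADMM loop for a weighted Lasso; a minimal sketch, in which the variable names and the fixed penalty `rho` are our assumptions:

```python
import numpy as np

def admm_lasso(U, s, pi, beta=0.01, rho=1.0, iters=100):
    """Solve the proximity-weighted Lasso with ADMM.

    U: (T, M) shallow GAPs; s: (T,) deep concept responses;
    pi: (T,) proximity weights. Returns the sparse inference weights z.
    """
    T, M = U.shape
    Pi = np.diag(pi)
    A = 2 * U.T @ Pi @ U + rho * np.eye(M)    # fixed system for the alpha-step
    b0 = 2 * U.T @ (pi * s)
    z = np.zeros(M)
    m = np.zeros(M)
    for _ in range(iters):
        alpha = np.linalg.solve(A, b0 + rho * (z - m))            # alpha-update (ridge solve)
        v = alpha + m
        z = np.sign(v) * np.maximum(np.abs(v) - beta / rho, 0.0)  # z-update (soft-threshold)
        m = m + alpha - z                                         # scaled multiplier update
    return z
```

Since the system matrix does not change across iterations, it could also be factorized once (e.g., by Cholesky) for efficiency.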
By the optimization of Eq. (8), we obtain the optimal inference weight $\boldsymbol{\alpha}_k$ of shallow-layer units to the $k$-th concept in the deep layer.
III-D Concept-harmonized Hierarchical Inference Model
In the previous stages, we completed the concept harmonizing and the hierarchical inference separately. By combining them, a network decision can be interpreted by representing its concept-harmonized hierarchical inference.
Concept-harmonized hierarchical inference model. Specifically, we deduce a concept at the high semantic level into concepts at the low semantic level. Based on this, the contributions of low-level concepts in the shallow layer to a high-level concept in the deep layer are computed.
The contribution weight vector of concepts in the shallow layer to the $k$-th concept in the deep layer is defined as $\mathbf{c}_k = [c_{k,1}, \dots, c_{k,K}]$. Specifically, $c_{k,j}$ is the contribution of the $j$-th concept in the shallow layer to the $k$-th concept in the deep layer. The harmonizing set of concepts in the shallow layer is denoted as $\{\mathbf{w}_1, \dots, \mathbf{w}_K\}$, where $\mathbf{w}_j$ denotes the harmonizing weight of the $j$-th concept in the shallow layer. And the inference weight $\boldsymbol{\alpha}_k$ is the importance of shallow-layer units to the $k$-th concept in the deep layer.
The concept-harmonized hierarchical inference model is formulated as

$$\min_{\mathbf{c}_k} \; \Big\| \boldsymbol{\alpha}_k - \sum_{j=1}^{K} c_{k,j} \, \mathbf{w}_j \Big\|_2^2 \quad \text{s.t.} \quad \|\mathbf{c}_k\|_0 \le \epsilon, \tag{19}$$

where $\|\mathbf{c}_k\|_0$ refers to the number of nonzero elements in the vector and is also viewed as the measure of sparsity. Moreover, the concept-harmonized hierarchical inference sparsity is bounded by $\epsilon$.
Based on the contribution weight, a high-level concept is inferred into low-level concepts. Meanwhile, we can give a quantitative analysis of the concept-harmonized hierarchical inference interpretation for a net decision.
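One way to sketch the sparsity-constrained decomposition of the inference weight into concept directions is a greedy, orthogonal-matching-pursuit-style selection; this is an illustrative stand-in, since the paper does not specify the solver here.

```python
import numpy as np

def concept_decompose(alpha, W, sparsity=2):
    """Greedily express the inference weight alpha (M,) as a sparse
    combination of shallow-layer concept harmonizing weights, the
    columns of W (M, K), with at most `sparsity` nonzero contributions."""
    M, K = W.shape
    support, residual = [], alpha.copy()
    c = np.zeros(K)
    for _ in range(sparsity):
        corr = np.abs(W.T @ residual)      # alignment with each concept direction
        corr[support] = -np.inf            # do not reselect chosen concepts
        j = int(np.argmax(corr))
        support.append(j)
        Ws = W[:, support]
        coef, *_ = np.linalg.lstsq(Ws, alpha, rcond=None)  # refit on the support
        residual = alpha - Ws @ coef
    c[support] = coef
    return c
```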
Then, the concept-harmonized hierarchical inference models of the other critical concepts in the deep layer are continuously optimized. Subsequently, the optimization is conducted backward for the concept-harmonized hierarchical inference model from the deep to the shallow layer. Finally, the concept-harmonized hierarchical inference model can get the optimal hierarchical inference representation of concepts from the highest to the lowest semantic level.
The concept directional-derivative. For the inference of part concepts, we only care about the most activated material concept, which has the most significant contribution to the target part concept. The same applies to the inference from the material concept to the color concept. Therefore, we design a simple way to select the most critical shallow-layer concept for the target deep-layer concept.
Specifically, utilizing the concept directional-derivative, we study the contribution of shallow-layer concepts to the deep-layer concept. The concept directional-derivative is defined as

$$D_{\mathbf{w}_j} v_k(\tilde{\mathbf{u}}) = \nabla v_k(\tilde{\mathbf{u}})^{\top} \frac{\mathbf{w}_j}{\|\mathbf{w}_j\|_2},$$

which is the directional derivative of the deep-layer concept function $v_k$ along the concept direction $\mathbf{w}_j$ at the point $\tilde{\mathbf{u}}$ in the shallow-layer feature space. In the concept harmonizing model, $\mathbf{w}_j$ is also the direction of the $j$-th concept in the shallow-layer feature space. In mathematics, the concept directional-derivative represents the instantaneous rate of change of the function $v_k$, moving through $\tilde{\mathbf{u}}$ with a velocity specified by $\mathbf{w}_j$.
In the hierarchical inference model, it is assumed that $v_k(\tilde{\mathbf{u}}) \approx \boldsymbol{\alpha}_k^{\top} \tilde{\mathbf{u}}$. Therefore, the concept directional-derivative can be rewritten as

$$D_{\mathbf{w}_j} v_k = \boldsymbol{\alpha}_k^{\top} \frac{\mathbf{w}_j}{\|\mathbf{w}_j\|_2},$$

which is also defined as the contribution weight $c_{k,j}$ of the $j$-th concept in the shallow layer to the $k$-th concept in the deep layer.
Therefore, we can obtain the most critical shallow-layer concept for the target deep-layer concept by optimizing the following problem:

$$j^{*} = \arg\max_{j} \; \boldsymbol{\alpha}_k^{\top} \frac{\mathbf{w}_j}{\|\mathbf{w}_j\|_2}. \tag{22}$$
It should be noticed that the optimal solution for the shallow-layer concept in Eq. (22) is the same as that in Eq. (19) when the sparsity of $\mathbf{c}_k$ in Eq. (19) is set to 1. It means that the optimization based on the concept directional-derivative can be considered a special case of the concept-harmonized hierarchical inference model in Eq. (19).
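Under the linearized concept response, selecting the most critical shallow-layer concept reduces to maximizing the normalized inner product between the inference weight and each concept direction; a minimal sketch, with an illustrative function name:

```python
import numpy as np

def most_critical_concept(alpha, W):
    """Pick the shallow-layer concept whose direction best aligns with the
    inference weight alpha (M,), via the concept directional derivative.
    W (M, K) holds one harmonizing-weight direction per column."""
    directions = W / np.linalg.norm(W, axis=0, keepdims=True)  # unit concept directions
    scores = directions.T @ alpha                              # alpha^T w_j / ||w_j||
    return int(np.argmax(scores)), scores
```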
III-D1 The instance-level CHAIN interpretation
For the interpretation of a specific net decision for a given input, the concept harmonizing model is first optimized from the highest to the lowest semantic level, so that each net unit is harmonized with a concept at a similar semantic level. Next, in the hierarchical inference model, we deduce the target net output from the deepest to the shallowest layer. Lastly, based on the above, the inference of net-units from deep to shallow layers can be represented as the inference of visual concepts from high to low semantic levels, yielding the concept-harmonized hierarchical inference for the interpretation of a specific net decision.
III-D2 The class-level CHAIN interpretation
For the interpretation of net decisions for images from a specific class, we build the class-level CHAIN by selecting the concepts shared among the instance-level CHAIN interpretations of different instances in the same class. The class-level concept contribution weight is the average of its instance-level weights over a class dataset. Therefore, it interprets the network mechanism from the class-level view.
IV Experiments
In this section, we present qualitative and quantitative analyses of the performance of the proposed interpretation model. In Section IV-B, we provide the instance-level CHAIN interpretation for the net being interpreted. In Section IV-C, the CHAIN interpretation is also applied to explain net predictions at the class level. In Sections IV-C1 and IV-C2, networks are further understood through their intra-class and inter-class CHAIN interpretations at the class level.
IV-A Experimental setting
IV-A1 ResNet on the Places365 scene classification dataset
In the experiment, the CHAIN interpretation is applied to explain ResNet-18, which is pretrained on the ImageNet dataset and finetuned on the Places365 scene classification dataset. ResNet is a convolutional neural network that can learn rich feature representations; it classifies images into 365 scene categories. In the concept harmonizing model, we use five layers (i.e., the output, conv5, conv4, conv3, and conv2) to be harmonized with concepts at five semantic levels (scene, object, part, material, and color).
IV-A2 The concept harmonizing dataset
In the concept harmonizing model, the Broden dataset is utilized as the concept dataset, which is a fully annotated image dataset. The Broden dataset contains labeled visual-concept samples at hierarchical levels. We use concepts at five semantic levels, i.e., the scene, object, part, material, and color concepts from the Broden dataset. The annotations are mostly at the pixel level, except for the scene annotation, which is at the image level. The five semantic-level concepts come from the ADE20K, Pascal-Part, and OpenSurfaces datasets.
IV-A3 Inference distance
In the experiment, we define the inference distance to quantitatively analyze the CHAIN interpretation.
Inference distance of the concept $k$ for the image set $A$ is defined as the average Euclidean distance of inference weights to their center for the corresponding image set, which is calculated as

$$d_k(A) = \frac{1}{|A|} \sum_{x \in A} \big\| \boldsymbol{\alpha}_k(x) - \bar{\boldsymbol{\alpha}}_k(A) \big\|_2,$$

where the center of the inference distance is $\bar{\boldsymbol{\alpha}}_k(A)$, which is calculated by

$$\bar{\boldsymbol{\alpha}}_k(A) = \frac{1}{|A|} \sum_{x \in A} \boldsymbol{\alpha}_k(x).$$

Inference distance between the concept $k_1$ for the image set $A$ and the concept $k_2$ for the image set $B$ is obtained by

$$d(k_1, A; \, k_2, B) = \big\| \bar{\boldsymbol{\alpha}}_{k_1}(A) - \bar{\boldsymbol{\alpha}}_{k_2}(B) \big\|_2.$$
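The inference distances can be computed directly from the per-image inference weights; a minimal sketch, assuming the between-set distance is taken between the two set centers:

```python
import numpy as np

def inference_distance(weights):
    """Average Euclidean distance of inference weights (N, M) to their center."""
    center = weights.mean(axis=0)
    return float(np.mean(np.linalg.norm(weights - center, axis=1))), center

def between_distance(weights_a, weights_b):
    """Distance between the inference-weight centers of two image sets."""
    return float(np.linalg.norm(weights_a.mean(axis=0) - weights_b.mean(axis=0)))
```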
IV-B The instance-level CHAIN interpretation
IV-B1 The instance-level CHAIN interpretation for different classes
In this experiment, we randomly select four images from different classes to explain their net predictions. For these images, the net accurately predicts their scene classes.
Fig. 3 shows the instance-level CHAIN interpretation for images from four classes (i.e., Farm, Orchard, House, and Pasture). For the interpretation of a specific net decision for a given input, CHAIN provides the concept-harmonized hierarchical inference for the network decision-making process from the highest to the lowest semantic level. Meanwhile, the CHAIN interpretation provides visualizations for concepts at each semantic level.
In the bottom right of Fig. 3, for the pasture scene image, CHAIN infers that the pasture scene prediction is based on the house, grass, and fence concepts, which are learned from shallow-layer features at the object level. Moreover, the horse concept at the object level is inferred from the torso concept at the part level, which is deduced from the skin material concept. Finally, the horse concept can be hierarchically deduced down to the orange color concept. The CHAIN interpretation is thus a logical decision-making process that explains the net decision.
Meanwhile, the visualization of concepts in the CHAIN interpretation can localize the corresponding visual parts, which further interprets the net feature learning for visual concepts. In the CHAIN interpretation, the net prediction is interpreted from the scene to the color semantic level. Correspondingly, the scale of the concept visualization decreases from the high to the low semantic level. The reason is that the receptive field in the net feature-learning process decreases from the deep to the shallow layer.
IV-B2 The instance-level CHAIN interpretation within a class
In this section, house images with three types of surroundings are selected as the target to analyze the CHAIN interpretation within a class. Specifically, for each type, we randomly choose two instances to show their results.
Meanwhile, in the CHAIN interpretation for each instance, the sunburst chart presents object concepts (the inner circle) and their corresponding part concepts (the outer circle). The proportion of each visual concept in the inner circle indicates its contribution to the network scene prediction. Similarly, the contribution of a part concept for its object concept is indicated by its proportion in the outer circle.
Fig. 4 shows the instance-level CHAIN interpretation for House-class images in which a house is with a swimming pool. In Fig. 4, the CHAIN interpretations of these two images both include the house and swimming pool concepts at the object level, which is consistent with the visual perception of these images. From visual understanding at the object level, the left image also contains a hedge region, which does not exist in the right image. In contrast, the house roof can be observed more clearly in the right image than in the left image. These differences are reflected in the corresponding CHAIN interpretations. It means the CHAIN interpretation can explain the differences in net feature learning for different images in the set of houses with a swimming pool.
In Fig. 4, the part-level concept of the swimming pool is explained as the water concept for both images. In the left image, the part-level concepts for the house object concept include the bush concept, which is not deduced in the right image.
Fig. 5 presents the instance-level CHAIN interpretation for House-class images in which a house is enclosed by a hedge. For the image set of houses with a hedge, CHAIN can distill the differences and similarities in net feature learning as the concept interpretation for different images. For example, the house and hedge object concepts are shared in the CHAIN interpretations of both images in Fig. 5. In contrast, the roof and chimney object concepts only exist in the interpretation of the left image, in which these two objects are apparent.
Fig. 6 shows the instance-level CHAIN interpretation for House-class images in which a house is by the roadside. From Figs. 4, 5, and 6, it is noted that at the object concept level, the house object concept is the common interpretation for house images with different surroundings. In comparison, the swimming pool, hedge, and curb object concepts are unique interpretations for the corresponding house surroundings. Therefore, at the instance level, the CHAIN interpretation can interpret the differences and similarities of net feature learning within a class.
IV-C The class-level CHAIN interpretation
In this section, we analyze the CHAIN interpretation at the class level. Specifically, in Section IV-C1, houses with three types of surroundings are selected for the intra-class analysis. In Section IV-C2, the orchard and house scene image sets are used for the study of inter-class CHAIN interpretation.
IV-C1 The intra-class CHAIN interpretation
In this experiment, for the house scene class, the intra-class interpretation is analyzed by house images with three typical surroundings (i.e., curb, hedge, and swimming pool). For each type of surrounding, we randomly choose twenty house images as the corresponding image set.
Fig. 7 displays the class-level CHAIN interpretation for the House class. (Top) House images with three types of surroundings, i.e., curb, hedge, and swimming pool. (Middle) Sunburst charts of images with different surroundings. The innermost ring of a sunburst chart shows concepts that are crucial to the House-class prediction. The expansion of a concept section to its outer ring shows the lower-level concepts that are important to the concept itself. Meanwhile, in each sunburst chart, the concepts at each level are sorted in descending order of their contribution to the corresponding higher-level concept. (Bottom) CHAIN interpretation diagram for the House-class images. The portion enclosed by the purple dashed line denotes the house-related concepts shared by the three types of images. Portions in the yellow, green, and red dashed rectangles are the concepts for the different surroundings.
In the three sunburst charts of Fig. 7, at the object level, the house concept has the most significant contribution to the house scene prediction. Meanwhile, the hedge and swimming pool object concepts make the second-largest contributions to the net predictions on the house image sets for hedge and swimming pool, respectively. For the house image set with the curb, the object concept curb also contributes substantially to the house prediction. Therefore, CHAIN can explain intra-class net predictions by presenting their common and unique concepts within a class.
At the bottom of Fig. 7, we present the CHAIN interpretation for the house class, in which the net output is inferred from the scene to the color semantic level. The class-level CHAIN interpretation (at the bottom of Fig. 7) is consistent with the observations from the three sunburst charts. In the class-level CHAIN interpretation, the hedge object concept is deduced to the plant part concept, then to the foliage material concept, and finally to the green color concept. Meanwhile, the hedge object concept is shared in the interpretations of the houses with the hedge and with the swimming pool. From the image level, it is observed that many images in the swimming pool set contain hedge regions, as shown in the second image in the first row and the fourth image in the third row. Therefore, this observation of the CHAIN interpretation is understandable. Similarly, the chimney as a common object concept in the CHAIN interpretation can be found in both image sets for curb and hedge.
Fig. 8 depicts the intra-class CHAIN interpretation for the House class at the scene level. The left plot shows the inference weights of the house concept (scene level) for three image sets (houses with curb, hedge, and swimming pool) in the 3D-PCA space. The right table shows their inference distances.
In the table in Fig. 8, the numbers in bold font (0.1344 and 0.1115) are larger than the others. In the left plot, the red points can be easily separated from the other points, while the green and blue points partially overlap. Therefore, in the CHAIN interpretation at the scene level, the inference of the house concept for the swimming-pool set differs from those for the curb and hedge sets. In terms of visual perception of the scene, the house surroundings in the swimming-pool set indeed differ greatly from those in the curb and hedge sets.
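The analysis above rests on two quantities: a 3D-PCA projection of the per-image inference-weight vectors, and a distance between image sets. A minimal sketch is given below; it is not the authors' released code, and the choice of the centroid Euclidean distance as the set-level distance is our assumption for illustration.

```python
import numpy as np

def pca_3d(X):
    """Project the rows of X (one inference-weight vector per image)
    onto the first three principal components."""
    Xc = X - X.mean(axis=0)           # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:3].T              # coordinates in the 3D-PCA space

def inference_distance(A, B):
    """Distance between two image sets, taken here (as an assumption)
    to be the Euclidean distance between their mean weight vectors."""
    return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))
```

Under this sketch, each entry of a distance table such as the one in Fig. 8 would be `inference_distance` evaluated on a pair of image sets, and each scatter plot would show `pca_3d` applied to the pooled weight vectors.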
Fig. 9 presents the intra-class CHAIN interpretation for the House class at the object level. The left chart plots the inference weights of three object concepts (curb, hedge, and swimming pool) for their corresponding image sets (houses with curb, hedge, and swimming pool) in the 3D-PCA space. The right table shows their inference distances.
In the left plot of Fig. 9, the points of the three colors form three separate clusters. In the right table, the diagonal entries are smaller than the others. Hence, at the object level of the CHAIN interpretation, the inferences for the three object concepts (curb, hedge, and swimming pool) can be easily distinguished from one another.
Fig. 10 shows the intra-class CHAIN interpretation for the House class at the object level. The left chart plots the inference weights of the house concept (object level) for the three image sets (houses with curb, hedge, and swimming pool) in the 3D-PCA space. The right table shows their inference distances.
The left plot in Fig. 10 shows that the points of the three colors mix with each other, which is also confirmed by the right table. This means that, at the object level, the inferences of the house concept for the three image sets are similar. Indeed, the house object itself is similar across the three image sets despite the difference in their surroundings. In summary, the CHAIN interpretation can capture both the similarity and the variance within a class, which is consistent with our visual understanding.
IV-C2 The inter-class CHAIN interpretation
In this experiment, the orchard and house scenes are utilized for the study of the inter-class CHAIN interpretation. For the orchard class, we randomly select twenty images for testing. For the house class, we continue using the previous three house image sets.
Fig. 11 displays the inter-class CHAIN interpretation for the Orchard and House classes. The top row shows samples from the four image sets (orchard, and houses with curb, hedge, and swimming pool) of the two classes. The right chart (in the middle) plots the inference weights of the scene concepts for the four image sets in the 3D-PCA space, and the left table (in the middle) shows their inference distances. Similarly, the bottom row shows the analysis for the object-level concepts.
In the left two tables of Fig. 11, the entries in bold font are larger than the others. In the right two plots, the red points can be easily separated from the other points. At the scene level, the CHAIN interpretation of the orchard scene concept thus differs substantially from that of the house scene. Likewise, there is a large discrepancy between the interpretation of the house object concept and that of the tree object concept. Therefore, the inter-class difference between orchard and house is larger than the intra-class difference, which also aligns with the visual perception of the images. Hence, the CHAIN interpretation can be used for inter-class investigation.
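The inter- versus intra-class comparison above can be summarized numerically. The sketch below is a hedged illustration: the distance values are made-up placeholders, not the numbers from Fig. 11, and only the structure of the comparison is intended to match the text.

```python
import statistics

# Pairwise inference distances between image sets (placeholder values).
# The three house sets give intra-class pairs; any pair involving
# "orchard" is an inter-class pair.
dist = {
    ("curb", "hedge"): 0.03,
    ("curb", "pool"): 0.13,
    ("hedge", "pool"): 0.11,
    ("orchard", "curb"): 0.30,
    ("orchard", "hedge"): 0.28,
    ("orchard", "pool"): 0.33,
}

intra = [v for (a, b), v in dist.items() if "orchard" not in (a, b)]
inter = [v for (a, b), v in dist.items() if "orchard" in (a, b)]

# The claim in the text: inter-class distances exceed intra-class ones.
gap = statistics.mean(inter) - statistics.mean(intra)
```

A positive `gap` corresponds to the observation that the inter-class difference between orchard and house exceeds the intra-class difference among the house sets.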
In this paper, the CHAIN interpretation is proposed to explain the net decision-making process. Specifically, the CHAIN interpretation hierarchically reasons a net decision into visual concepts from the high level to the low level. The hierarchical visual concepts also help explain the layer structure of the network. Beyond the instance-level interpretation, the CHAIN interpretation can also provide inference at the class level. Experimental results demonstrate that the proposed CHAIN model provides reasonable interpretations at both levels.