1 Introduction
Convolutional neural networks (CNNs) [15, 13, 10] have achieved superior performance in many visual tasks, such as object classification and detection. However, beyond discrimination power, model interpretability remains a significant challenge for neural networks. Many studies have been proposed to visualize, analyze, or semanticize feature representations hidden inside a CNN, in order to obtain an in-depth understanding of network representations.
In this study, we propose a new task, i.e. using a decision tree to quantitatively explain the logic of CNN predictions at the semantic level. Note that our decision tree summarizes “generic” decision modes for different images to produce prior explanations of CNN predictions. This is different from the visualization of image pixels that are related to CNN outputs. This task requires us to answer the following three questions.

How many types of patterns are memorized by the CNN? [2] defined six types of patterns for CNNs, i.e. objects, parts, scenes, textures, materials, and colors. We can roughly consider the first two types as “part patterns” and summarize the last four types as “texture patterns.” In this research, we limit our attention to the effects of part patterns in the task of object classification. This is because, compared to texture patterns, part patterns are usually contained in higher conv-layers and contribute to the classification task more directly.

For each input image, which object-part patterns are used for prediction?
Given different images, a CNN may activate different sets of object-part patterns for prediction. Let us take bird classification as an example. The CNN may use head patterns as rationales to classify a standing bird and take wing patterns to distinguish a flying bird. We regard such different selections of object-part patterns as different decision modes of the CNN. We need to mine such decision modes from a CNN as rationales for each CNN prediction.

How to quantitatively measure the contribution of each object-part pattern to a certain prediction? Our task is to identify the exact contribution of each object-part pattern to the prediction, e.g. a head pattern contributes 23%, and a foot pattern contributes 7%.
The above three questions require us 1) to identify the semantic meaning of each neural activation in the feature map of a conv-layer and 2) to quantitatively measure the contribution of different neural activations, which poses significant challenges for state-of-the-art algorithms.
In this paper, we propose to slightly revise the CNN for disentangled representations and learn a decision tree to explain CNN predictions. We are given object images of a certain category as positive samples and random images as negative samples as inputs to learn both the CNN and the decision tree. We do not label any parts or textures as additional supervision^1 (^1 Part annotations are not used to learn the CNN and the decision tree. After the learning procedure, we label object parts for top-layer filters to compute part-level contributions in Equation (14).). Firstly, we add the filter loss proposed in [30] to the CNN, which pushes each filter in the top conv-layer towards the representation of an object part. Secondly, we invent a decision tree to quantitatively explain the decision mode for an input image, i.e. which object parts (filters) are used for prediction and how much they contribute.
As shown in Fig. 1, each node in the decision tree represents a specific decision mode, and the decision tree organizes all decision modes in a coarse-to-fine manner. Nodes near the root mainly represent common decision modes shared by many samples. Nodes near terminal nodes correspond to fine-grained modes of minority samples. In particular, each terminal mode corresponds to gradients of the CNN output w.r.t. different object parts in a certain image.
Compared to terminal fine-grained modes, we are more interested in generic decision modes in high-level nodes. These decision modes usually select significant object-part patterns as rationales for the CNN prediction and ignore insignificant ones, which provides a compact logic of CNN predictions.
Inference: When the CNN makes a prediction for an input image, the decision tree determines a parse tree (see red lines in Fig. 1) to explain the prediction at different levels.
Contributions: In this paper, we focus on a new task, i.e. disentangling CNN representations and learning a decision tree to quantitatively explain the logic of each CNN prediction. We propose a simple yet effective method to learn the decision tree without using any annotations of object parts as additional supervision to learn CNNs. Theoretically, our method is a generic approach to revising CNNs and learning a tight coupling of CNNs and decision trees. Experiments have demonstrated the effectiveness of the proposed method on VGG networks.
2 Related work
In this section, we limit our discussion to the literature of opening the black box of CNN representations.
CNN visualization: Visualization of filters in a CNN is the most direct way of exploring the pattern hidden inside a neural unit. Gradient-based visualization [27, 16] estimates the input image that maximizes the activation score of a neural unit. [7] proposed up-convolutional nets to invert feature maps of conv-layers into images. Unlike gradient-based methods, up-convolutional nets cannot mathematically ensure that the visualization result reflects actual neural representations.
[31] proposed a method to accurately compute the image-resolution receptive field of neural activations in a feature map. The estimated receptive field of a neural activation is smaller than the theoretical receptive field based on the filter size. The accurate estimation of the receptive field is crucial to understanding a filter’s representations.
Network diagnosis: Going beyond visualization, some methods diagnose a pre-trained CNN to obtain an in-depth understanding of CNN representations.
[26] evaluated the transferability of filters in intermediate conv-layers. [1] computed feature distributions of different categories in the CNN feature space. Methods of [8, 19] propagated gradients of feature maps w.r.t. the CNN loss back to the image, in order to estimate image regions that directly contribute to the network output. [17] proposed the LIME model to extract image regions that are used by a CNN to predict a label.
Network-attack methods [12, 23] diagnosed network representations by computing adversarial samples for a CNN. In particular, influence functions [12] were proposed to compute adversarial samples, provide plausible ways to create training samples to attack the learning of CNNs, fix the training set, and further debug representations of a CNN. [14] discovered knowledge blind spots (unknown patterns) of a pre-trained CNN in a weakly-supervised manner. [29] developed a method to examine representations of conv-layers and automatically discover potential, biased representations of a CNN due to dataset bias.
CNN semanticization: Compared to the diagnosis of CNN representations, some studies aim to learn more meaningful CNN representations. Some studies extracted neural units with certain semantics from CNNs for different applications. Given feature maps of conv-layers, [31] extracted scene semantics. Simon et al. mined objects from feature maps of conv-layers [20] and learned object parts [21]. [18] proposed a capsule model, which used a dynamic routing mechanism to parse the entire object into a parsing tree of capsules, where each capsule may encode a specific meaning. [30] proposed to learn CNNs with disentangled intermediate-layer representations. [4, 11] learned interpretable input codes for generative models.
Decision trees for neural networks: [9] proposed to distill representations of a neural network for image classification into a decision tree, but the decision tree did not explain the network logic in a human-interpretable manner. [25] learned a decision tree via knowledge distillation to represent the output feature space of an RNN. This approach used the tree logic to regularize the RNN for better representations.
In spite of the use of tree structures, there are two main differences between the above two studies and our research. Firstly, we focus on a new task of using a tree to semantically explain the logic of each prediction made by a pre-trained CNN. In contrast, decision trees in the above studies are mainly learned for classification and cannot provide semantic-level explanations. Secondly, we summarize decision modes from gradients w.r.t. neural activations of object parts as rationales to explain CNN predictions. Compared to the above “distillation-based” methods, our “gradient-based” decision tree reflects the CNN logic more directly and strictly.
3 Algorithm
3.1 Preliminaries: learning a CNN with disentangled representations
[30] learned disentangled CNN representations by adding a loss to each filter in the top conv-layer, which pushes the representation of the filter towards a specific object part^2 (^2 [30] assumes that positive samples belong to a category and share common parts, while negative samples are random images. A filter mainly encodes parts of positive samples.). Note that people do not need to label object parts for supervision. The CNN assigns each filter to a certain part automatically during end-to-end learning.
Let x ∈ R^{n×n×D} denote the feature map of the top conv-layer produced by the CNN on a given image I, where n is referred to as the height/width of the feature map, and D is referred to as the filter number. x_d ∈ R^{n×n} denotes the feature map of the d-th filter f_d. As shown in Fig. 2, people design n² positive templates {T_μ} for positive samples to denote ideal activation shapes when the target object part of the filter appears at different location candidates μ on x_d. A negative template T⁻ is also used to describe feature maps on negative samples. The loss for the filter f_d is given as the minus mutual information between all feature maps and all templates:

Loss_{f_d} = −MI(X_d; T) = −Σ_{T∈T} p(T) Σ_{x∈X_d} p(x|T) log [ p(x|T) / p(x) ]   (1)

where X_d denotes the set of feature maps of filter f_d on all training samples, and T = {T⁻, T_1, …, T_{n²}} denotes all the templates. The prior probability p(T) is defined as a constant. The fitness between a feature map x and a template T is measured as the conditional likelihood

p(x|T) = (1 / Z_T) exp[ tr(x · T) ]   (2)

where Z_T = Σ_{x∈X_d} exp[ tr(x · T) ]. tr(·) indicates the trace of a matrix, i.e. tr(x · T) = Σ_{ij} x_{ij} t_{ij}, where x_{ij} and t_{ij} denote the elements of the matrices. Please see [30] for technical details.
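As a minimal NumPy sketch of the template fitness in Equation (2): the score of a template is the element-wise sum tr(x · T), and normalizing the exponentiated scores over a candidate template set gives conditional likelihoods. Normalizing over templates rather than over feature maps is a simplification of the partition function Z_T, made here only for illustration.

```python
import numpy as np

def template_fitness(x, templates):
    """Fitness between an n x n feature map x and each template T.
    The score is tr(x . T) = sum_ij x_ij * t_ij, as in Equation (2);
    the softmax over templates is an illustrative simplification of
    the partition function Z_T in the paper."""
    scores = np.array([np.sum(x * T) for T in templates])
    e = np.exp(scores - scores.max())  # stabilized exponentiation
    return e / e.sum()                 # likelihoods over the template set
```

In this toy usage, a feature map activated at the same location as a positive template receives the highest likelihood, which is the behavior the filter loss rewards.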
3.2 Learning a decision tree
We focus on the CNN for single-category classification. Let Ω = Ω⁺ ∪ Ω⁻ denote the indexes of training samples, which consist of positive samples Ω⁺ and negative samples Ω⁻. Given each training image I_i, i ∈ Ω, y_i* and ŷ_i denote the ground-truth and the estimated label of the input image, respectively. We train the CNN based on the log logistic loss.
Part concepts in filters: The loss in Equation (1) ensures that each filter represents a specific object part. Let us focus on the d-th channel x_d of the feature map x, which is produced by a specific filter f_d. The channel represents a disentangled object part. We can rewrite the filter loss in Equation (1) as
Loss_{f_d} = −H(T) + H(T′ = {T⁻, T⁺} | X_d) + Σ_{x∈X_d} p(T⁺, x) H(T⁺ | X = x)   (3)

where the first term −H(T) is a constant. The second term encourages each filter to be exclusively triggered by positive samples. The third term encourages a low entropy of the spatial distribution of neural activations, i.e. each filter can only be activated by a single region of an object. It is assumed that a pattern that repetitively appears at different regions of an object usually represents a repetitive texture, instead of an object part. As shown in Fig. 3, this loss ensures that the d-th filter represents a part of the target object.
Internal logic for CNN prediction: As discussed in [21], the decision mode encoded in fully-connected layers for the output y can be roughly described by piecewise-linear representations:

y ≈ (∂y/∂x) ⊗ x + b   (4)

where ⊗ denotes the convolution. The gradient ∂y/∂x w.r.t. the feature map x can be computed via gradient back-propagation. To simplify the computation, we use D-dimensional vectors h and g to approximate the 3rd-order tensors x and ∂y/∂x, as h_d = Σ_{ij} x_{d,ij} and g_d = Σ_{ij} (∂y/∂x)_{d,ij}, where h_d denotes the d-th element of h and x_{d,ij} denotes the element at location (i, j) in x_d. h_d measures activation magnitudes of the d-th filter. The normalization is conducted for more convincing results. Considering part semantics of filters, activation values in different dimensions of h reflect the signal strength of different object parts, and the gradient g corresponds to the selection of object parts for the CNN prediction. Thus, we use h and g to describe the prediction rationale, i.e. using which object parts for prediction:

y ≈ gᵀh + b   (5)
Without loss of generality, we further normalize the gradient term to a unit vector, i.e. g ← g / ‖g‖, for more reasonable results.
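The vectorization of the activation and gradient tensors above can be sketched as follows. The sum-pooling reduction of each channel to a scalar and the ℓ2 gradient normalization are assumptions consistent with the text, not a verbatim transcription of the paper's formulas.

```python
import numpy as np

def rationale_vectors(feat, grad):
    """Reduce the (D, n, n) feature tensor x and gradient tensor dy/dx
    to D-dim vectors (Section 3.2): h_d measures the activation
    magnitude of the d-th filter; g is the unit-normalized gradient,
    i.e. the CNN's selection of object parts. The per-channel
    sum-pooling is an assumed reduction."""
    h = feat.sum(axis=(1, 2))                 # activation strength per filter
    g = grad.sum(axis=(1, 2))
    g = g / (np.linalg.norm(g) + 1e-12)       # unit vector, as in the text
    return h, g
```

Together, h and g give the per-image rationale used in the piecewise-linear approximation y ≈ gᵀh + b.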
Tree: As shown in Fig. 4, we extract decision modes encoded in fully-connected layers of the CNN and build a decision tree to organize the hierarchy of decision modes. From the top node to terminal nodes, the decision tree encodes decision modes in a coarse-to-fine manner. Each node v in the decision tree represents a common decision mode shared by a group of positive^2 training samples Ω_v. The node may have a set of children nodes to further divide the decision mode of v into fine-grained modes. Decision modes in terminal nodes are close to gradients of specific samples. For each node v, we formulate the rationale of its decision mode as
w = ḡ ∘ α   (6)

where w is referred to as the rationale of the decision mode. ḡ denotes a unit gradient. α ∈ {0, 1}^D is given as a binary vector to select a few filters to construct the decision mode. ∘ denotes element-wise multiplication. ḡ reflects common gradients of all samples within v’s sample set Ω_v, and we compute the parameters α and ḡ via sparse representations.
(7)  
(8) 
Learning: The basic idea of learning a decision tree is to summarize common decision modes from specific decision modes of different samples to represent general rationales for CNN predictions. At the beginning, we initialize the gradient of each positive^2 sample as a terminal node by setting α = 1 and ḡ = g. Thus, we build an initial tree P₀ as shown in Fig. 4, in which the top node takes gradients of all positive^2 samples as children. Then, in each step, we select and merge two nodes v and v′ in the second layer (i.e. children of the top node) to obtain a new node u. v and v′ become u’s children, and u replaces v and v′ as a new child of the top node. In this way, we gradually modify the initial tree P₀ towards the final decision tree P after merging operations as

P₀ → P₁ → P₂ → ⋯ → P   (9)
We formulate the overall objective of learning as follows.
(10) 
where the likelihood term denotes the probability of samples being positive as estimated by the tree P, and indicates the discriminative power of P; a scaling parameter balances the two terms. This objective penalizes the decrease of the discriminative power and encourages the system to use a few decision modes as compact representations for the CNN.
We compute the likelihood of being positive as
(11) 
where ŷ denotes the prediction based on P, i.e. the value estimated by the best child in the second layer, and a constant scaling parameter is applied.
In the t-th step, we merge two nodes v and v′ in the second layer of P_{t−1} to get a new node u and obtain a tree P_t. The increase of the objective E is given as

ΔE = E(P_t) − E(P_{t−1})   (12)
Based on the above equation, we learn the decision tree via a greedy strategy. In each step, we select and merge the two nodes that maximize the value of ΔE. We normalize ΔE using the sample number of the selected nodes to obtain more reasonable clustering performance. Because each node-merging operation only affects values of a few samples in Ω⁺, we can quickly estimate ΔE for each pair of nodes (v, v′).
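The greedy bottom-up merging can be sketched as follows. Because the paper's merging criterion ΔE in Equation (12) depends on the tree's discriminative power (and hence the CNN), we substitute a cosine-similarity criterion between node gradients as a stand-in; this surrogate, and the dict-free node representation, are assumptions for illustration only.

```python
import numpy as np

def merge_decision_modes(gradients, n_merges):
    """Greedy agglomerative merging sketch (Section 3.2, 'Learning').
    Each second-layer node is (member_indices, mean unit gradient).
    As a stand-in for maximizing Delta-E (Equation (12)), we merge
    the pair of nodes with the most similar mean gradients."""
    nodes = [([i], g / np.linalg.norm(g)) for i, g in enumerate(gradients)]
    for _ in range(n_merges):
        best, pair = -np.inf, None
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                sim = float(nodes[a][1] @ nodes[b][1])  # cosine similarity
                if sim > best:
                    best, pair = sim, (a, b)
        members = nodes[pair[0]][0] + nodes[pair[1]][0]
        g = np.mean([gradients[i] for i in members], axis=0)
        g = g / np.linalg.norm(g)  # merged node keeps a unit rationale
        nodes = [n for i, n in enumerate(nodes) if i not in pair] + [(members, g)]
    return nodes
```

Each merge replaces two children of the top node with their parent, mirroring how P₀ is gradually modified towards the final tree P.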
3.3 Interpreting CNNs
Given a testing image I, we use the CNN to make a prediction ŷ. We can use the decision tree to compute quantitative explanations for the rationale of the prediction. During the inference procedure, we infer a parse tree, which starts from the top node, in a top-down manner. Red lines in Fig. 4 show a parse tree. When we select node v to represent the decision mode of I, we can further select its child v̂ that maximizes the compatibility between I’s true gradient g and the node mode as a more fine-grained mode:

v̂ = argmax_{v′ ∈ Children(v)} gᵀ w_{v′}   (13)

where w_{v′} denotes the parameter of node v′ (unlike in Equation (6), we add the subscript to differentiate the parameter from those of other nodes).
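The top-down inference of a parse tree can be sketched as follows. Representing nodes as dicts and taking compatibility as the inner product between the image's unit gradient and a child's rationale are assumptions; the paper only specifies that the most compatible child is chosen at each level.

```python
import numpy as np

def parse_tree(root, g):
    """Top-down parse-tree inference (Equation (13) sketch).
    Starting at the root, repeatedly descend to the child whose
    rationale vector w is most compatible with the image's own unit
    gradient g, here measured by the inner product g . w."""
    path, node = [root], root
    while node.get("children"):
        node = max(node["children"], key=lambda c: float(g @ c["w"]))
        path.append(node)
    return path
```

The returned path corresponds to the red lines in Fig. 4: a chain of progressively finer decision modes explaining one prediction.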
A node v in the decision tree provides an explanation for the prediction of image I at a certain fine-grained level. We compute the vectors ρ and ρ^part to evaluate the contribution of different filters to the prediction:

ρ = h ∘ w_v,   ρ^part = Aᵀ ρ   (14)

where the d-th element of ρ, ρ_d, denotes the contribution to the CNN prediction that is made by the d-th filter. If ρ_d > 0, then the d-th object part makes a positive contribution. If ρ_d < 0, then the d-th filter makes a negative contribution. We use a binary matrix A ∈ {0, 1}^{D×P} to assign each filter in the top conv-layer with a specific object part, where P is the part number. Similarly, the p-th element of ρ^part measures the contribution of the p-th object part.
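The decomposition of a prediction into filter- and part-level contributions can be sketched as follows. The element-wise product for ρ and the aggregation through the binary assignment matrix A are reconstructions consistent with the text around Equation (14), not a verbatim copy of the paper's formula.

```python
import numpy as np

def contributions(h, w, A):
    """Filter- and part-level contributions (Equation (14) sketch).
    rho_d = h_d * w_d is the signed contribution of the d-th filter;
    A is a binary D x P matrix assigning each filter to one of P
    object parts, so A.T @ rho sums filter contributions per part."""
    rho = h * w          # elementwise; positive values support the prediction
    part = A.T @ rho     # aggregate filter contributions into part contributions
    return rho, part
```

In the toy case below, two filters assigned to the first part contribute +0.5 and −2, so that part's net contribution is −1.5.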
4 Experiments
4.1 Implementation details
We learned a disentangled CNN based on the structure of the VGG-16 network [22]. We followed the technique of [30] to modify a VGG-16 network into a disentangled CNN, which changed the top conv-layer of the VGG-16 to a disentangled conv-layer and further added a new disentangled conv-layer on top of it. We used feature maps of the new top conv-layer as the input of our decision tree. We loaded parameters of all thirteen old conv-layers directly from a VGG-16 network that was pre-trained using images in the ImageNet ILSVRC 2012 dataset [6] with a loss for 1000-category classification. We initialized parameters of the new top conv-layer and all fully-connected layers. We then fine-tuned the CNN for single-category classification using three benchmark datasets (which will be introduced later). We simply set the parameters to fixed values in all experiments to enable fair comparisons.

4.2 Datasets
Because the quantitative explanation of CNN predictions requires us to assign each filter in the top conv-layer with a specific object part, we used three benchmark datasets with ground-truth part annotations to evaluate our method. The selected datasets include the PASCAL-Part dataset [5], the CUB200-2011 dataset [24], and the ILSVRC 2013 DET Animal-Part dataset [28]. Just like most part-localization studies [5, 28], we used animal categories, which prevalently contain non-rigid shape deformation, for evaluation. I.e. we selected six animal categories—bird, cat, cow, dog, horse, and sheep—from the PASCAL-Part dataset. The CUB200-2011 dataset contains 11.8K images of 200 bird species. Like in [3, 21, 28], we ignored species labels and regarded all these images as a single bird category. The ILSVRC 2013 DET Animal-Part dataset [28] consists of 30 animal categories among all the 200 categories for object detection in the ILSVRC 2013 DET dataset [6].
4.3 Analysis of object parts for prediction
In this subsection, we analyze the contribution of different object parts to the CNN prediction, given that each filter is assigned a specific object part. The vector ρ^part in Equation (14) specifies the contribution of different object parts to the prediction for an image. For each object part, we computed the ratio of its contribution among all parts.
More specifically, for CNNs based on the ILSVRC 2013 DET Animal-Part dataset, we manually labeled the object part for each filter in the top conv-layer. For CNNs based on the PASCAL VOC Part dataset [5], [30] merged tens of small parts into several major landmark parts for the six animal categories. Given a CNN for a certain category, we used [31] to estimate regions in different images that corresponded to each filter’s neural activations, namely the image receptive field of the filter (please see Figs. 6 and 3). For each filter, we selected the part among all major landmark parts that was closest to the filter’s image receptive field across all positive samples. For the CNN based on the CUB200-2011 dataset, we used ground-truth positions of the breast, forehead, nape, and tail of birds as major landmark parts. Similarly, we assigned each filter in the top conv-layer with the nearest landmark part.
4.4 Evaluation metrics
We used three metrics to evaluate the accuracy of the decision tree. The first metric is the classification accuracy. Because ŷ denotes the prediction based on the best child in the second layer, we regarded ŷ as the output of the tree and evaluated its discrimination power. We used the values of ŷ for classification and compared the resulting classification accuracy with the accuracy of the CNN.
Because the disentangled CNN representation selectively learns object-part patterns in positive samples and treats negative samples as random images, we used the other two metrics to identify whether tree nodes correctly reflect CNN representations of positive samples. The second metric, namely the prediction error, measures the error of the estimated value ŷ w.r.t. the true value y*. We computed the prediction error as the difference between ŷ and y*, normalized by the value range of y*.
The third metric, namely the fitness of filter contribution, compares the ground-truth contribution of different filters in the top conv-layer with the estimated contribution of these filters during the prediction process. When the decision tree uses node v to explain the prediction for an image, the contribution vector ρ in Equation (14) denotes the estimated contribution distribution of different filters, and ρ* corresponds to the ground-truth contribution distribution. We reported the intersection-over-union value between ρ and ρ* to measure the fitness of the ground-truth and the estimated filter contributions. I.e. we computed the fitness as Σ_d min(|ρ_d|, |ρ*_d|) / Σ_d max(|ρ_d|, |ρ*_d|), where ρ_d denotes the d-th element of ρ. We used the non-negative values |ρ_d| and |ρ*_d|, because the vectors ρ and ρ* may have negative elements.
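The fitness metric can be sketched as a soft intersection-over-union between the two contribution distributions; the element-wise min/max form below is an assumption matching the "intersection-over-union on non-negative values" description in the text.

```python
import numpy as np

def contribution_fitness(rho_est, rho_true):
    """Fitness of filter contribution (Section 4.4): a soft IoU
    between the estimated and ground-truth contribution vectors,
    computed on absolute values because contributions may be
    negative. Returns 1.0 for a perfect match."""
    a, b = np.abs(rho_est), np.abs(rho_true)
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()
```

For example, an estimate that spuriously activates one extra filter of equal weight drops the fitness from 1.0 to 0.5.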
Evaluation for nodes in different layers: The above three metrics evaluate decision modes (nodes) in the second layer of the decision tree. Because nodes in lower layers encode more fine-grained decision modes, we extended the three metrics to evaluate nodes in lower layers. When we evaluated nodes in the k-th layer, we temporarily constructed a new tree by removing all nodes above the k-th layer and directly connecting the top node to the nodes in the k-th layer. We could then apply the evaluation to the new tree.
4.5 Experimental results and analysis
Table 1: Structure of the decision tree: numbers of nodes in different layers.

Dataset              2nd   5th   10th  50th   100th
ILSVRC Animal-Part   4.8   31.6  69.1  236.5  402.1
VOC Part             3.8   25.7  59.0  219.5  361.5
CUB200-2011          5.0   32.0  64.0  230.0  430.0

Table 2: Average classification accuracy based on nodes in different layers.

Dataset              CNN   2nd   5th   10th  50th  100th  bottom
ILSVRC Animal-Part   96.7  94.4  89.0  88.7  88.6  88.7   87.8
VOC Part             95.4  94.2  91.0  90.1  89.8  89.4   88.2
CUB200-2011          96.5  91.5  92.2  88.3  88.6  88.9   85.3

Table 3: Average prediction error based on nodes in different layers.

Dataset              2nd    5th    10th   50th   100th  bottom
ILSVRC Animal-Part   0.052  0.064  0.063  0.049  0.034  0.00
VOC Part             0.052  0.066  0.070  0.051  0.035  0.00
CUB200-2011          0.075  0.099  0.101  0.087  0.083  0.00

Table 4: Average fitness of filter contribution based on nodes in different layers.

Dataset              2nd   5th   10th  50th  100th  bottom
ILSVRC Animal-Part   0.23  0.30  0.36  0.52  0.65   1.00
VOC Part             0.22  0.30  0.36  0.53  0.67   1.00
CUB200-2011          0.21  0.26  0.28  0.33  0.37   1.00
Table 1 shows the structure of the decision tree by listing the numbers of nodes in different layers. Fig. 5 visualizes decision modes in the decision tree. Fig. 6 shows distributions of object-part contributions to the CNN prediction, which were estimated using nodes in the second layer of decision trees. Tables 2, 3, and 4 show the average classification accuracy, the average prediction error, and the average fitness of filter contribution based on nodes in different layers. Generally speaking, when we used more fine-grained decision modes to explain the prediction logic, the explanation better fit the actual logic in the CNN, and the prediction error of using decision modes decreased. However, fine-grained decision modes did not exhibit higher classification accuracy, because the objective of our method is to summarize decision modes of a pre-trained CNN rather than to improve the classification accuracy.
5 Conclusion and discussions
In this study, we focus on a new task of using a decision tree to explain the logic in a CNN at the semantic level. We have developed a method to revise a CNN and build a tight coupling of the CNN and a decision tree. The proposed decision tree encodes decision modes of the CNN as quantitative rationales for each CNN prediction. Our method does not need any annotations of object parts or textures in training samples to guide the learning of the CNN. We have tested our method on different benchmark datasets, and experiments have demonstrated the effectiveness of our approach.
Note that, theoretically, the decision tree just provides an approximate explanation for CNN predictions, instead of an accurate reconstruction of CNN representation details. There are two reasons. Firstly, without accurate object-part annotations to supervise the learning of CNNs, [30] can only roughly make each filter represent an object part. The filter may produce incorrect activations in a few challenging samples. Secondly, the decision mode in each node ignores insignificant object-part patterns (filters) to ensure a sparse representation of the decision mode.
References

[1] M. Aubry and B. C. Russell. Understanding deep features with computer-generated imagery. In ICCV, 2015.
[2] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, 2017.
 [3] S. Branson, P. Perona, and S. Belongie. Strong supervision from weak annotation: Interactive training of deformable part models. In ICCV, 2011.
 [4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
 [5] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [7] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. In CVPR, 2016.
 [8] R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In arXiv:1704.03296v1, 2017.
 [9] N. Frosst and G. Hinton. Distilling a neural network into a soft decision tree. In arXiv:1711.09784, 2017.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[11] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[12] P. Koh and P. Liang. Understanding black-box predictions via influence functions. In ICML, 2017.
 [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [14] H. Lakkaraju, E. Kamar, R. Caruana, and E. Horvitz. Identifying unknown unknowns in the open world: Representations and policies for guided exploration. In AAAI, 2017.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
 [16] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.
 [17] M. T. Ribeiro, S. Singh, and C. Guestrin. “why should i trust you?” explaining the predictions of any classifier. In KDD, 2016.
 [18] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In NIPS, 2017.
[19] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In arXiv:1610.02391v3, 2017.
 [20] M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In ICCV, 2015.
 [21] M. Simon, E. Rodner, and J. Denzler. Part detector discovery in deep convolutional neural networks. In ACCV, 2014.
 [22] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2015.
 [23] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In arXiv:1312.6199v4, 2014.
[24] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
 [25] M. Wu, M. C. Hughes, S. Parbhoo, M. Zazzi, V. Roth, and F. DoshiVelez. Beyond sparsity: Tree regularization of deep models for interpretability. In NIPS TIML Workshop, 2017.
 [26] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
 [27] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[28] Q. Zhang, R. Cao, Y. N. Wu, and S.-C. Zhu. Growing interpretable part graphs on convnets via multi-shot learning. In AAAI, 2016.
[29] Q. Zhang, W. Wang, and S.-C. Zhu. Examining CNN representations with respect to dataset bias. In AAAI, 2018.
 [30] Q. Zhang, Y. N. Wu, and S.C. Zhu. Interpretable convolutional neural networks. In arXiv:1710.00935, 2017.
[31] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. In ICLR, 2015.