Convolutional neural networks (CNNs) [15, 13, 10] have achieved superior performance in many visual tasks, such as object classification and detection. However, beyond discrimination power, model interpretability remains a significant challenge for neural networks. Many studies have sought to visualize, analyze, or semanticize the feature representations hidden inside a CNN in order to gain insight into network representations.
In this study, we propose a new task, i.e. using a decision tree to quantitatively explain the logic of CNN predictions at the semantic level. Note that our decision tree summarizes “generic” decision modes for different images to produce prior explanations of CNN predictions. This is different from the visualization of image pixels that are related to CNN outputs. This task requires us to answer the following three questions.
How many types of patterns are memorized by the CNN? Bau et al. [2] defined six types of patterns for CNNs, i.e. objects, parts, scenes, textures, materials, and colors. We can roughly consider the first two types as “part patterns” and summarize the last four types as “texture patterns.” In this research, we limit our attention to the effects of part patterns in the task of object classification. This is because, compared to texture patterns, part patterns are usually contained in higher conv-layers and contribute to the classification task more directly.
For each input image, which object-part patterns are used for prediction?
Given different images, a CNN may activate different sets of object-part patterns for prediction. Let us take the bird classification as an example. The CNN may use head patterns as rationales to classify a standing bird and take wing patterns to distinguish a flying bird. We regard such different selections of object-part patterns as different decision modes of the CNN. We need to mine such decision modes from a CNN as rationales for each CNN prediction.
How to quantitatively measure the contribution of each object-part pattern to a certain prediction? Our task is to identify the exact contribution of each object-part pattern in the prediction, e.g. a head pattern contributes 23%, and a feet pattern contributes 7%.
The above three questions require us 1) to identify the semantic meaning of each neural activation in the feature map of a conv-layer and 2) to quantitatively measure the contribution of different neural activations, both of which pose significant challenges for state-of-the-art algorithms.
In this paper, we propose to slightly revise the CNN for disentangled representations and learn a decision tree to explain CNN predictions. We are given object images in a certain category and random images as positive and negative samples, respectively, to learn both the CNN and the decision tree. We do not label any parts or textures as additional supervision. (Part annotations are not used to learn the CNN and the decision tree. After the learning procedure, we label object parts for top-layer filters to compute part-level contributions in Equation (14).) Firstly, we add the filter loss proposed in [30] to the CNN, which pushes each filter in the top conv-layer towards the representation of an object part. Secondly, we build a decision tree to quantitatively explain the decision mode for an input image, i.e. which object parts (filters) are used for prediction and how much they contribute.
As shown in Fig. 1, each node in the decision tree represents a specific decision mode, and the decision tree organizes all decision modes in a coarse-to-fine manner. Nodes near the root mainly represent common decision modes shared by many samples. Nodes near the terminal nodes correspond to fine-grained modes of minority samples. In particular, each terminal node corresponds to gradients of the CNN output w.r.t. different object parts in a certain image.
Compared to terminal fine-grained modes, we are more interested in generic decision modes in high-level nodes. These decision modes usually select significant object-part patterns as rationales for the CNN prediction and ignore insignificant ones, thereby providing a compact logic of CNN predictions.
Inference: When the CNN makes a prediction for an input image, the decision tree determines a parse tree (see red lines in Fig. 1) to explain the prediction at different levels.
Contributions: In this paper, we focus on a new task, i.e. disentangling CNN representations and learning a decision tree to quantitatively explain the logic of each CNN prediction. We propose a simple yet effective method to learn the decision tree without using any annotations of object parts as additional supervision to learn CNNs. Theoretically, our method is a generic approach to revising CNNs and learning a tight coupling of CNNs and decision trees. Experiments have demonstrated the effectiveness of the proposed method on VGG networks.
2 Related work
In this section, we limit our discussion to the literature of opening the black box of CNN representations.
Visualization of filters in a CNN is the most direct way of exploring the pattern hidden inside a neural unit. Gradient-based visualization [27, 16] estimates the input image that maximizes the activation score of a neural unit. Dosovitskiy and Brox [7] proposed up-convolutional nets to invert feature maps of conv-layers into images. Unlike gradient-based methods, up-convolutional nets cannot mathematically ensure that the visualization result reflects actual neural representations.
Zhou et al. [31] proposed a method to accurately compute the image-resolution receptive field of neural activations in a feature map. The estimated receptive field of a neural activation is smaller than the theoretical receptive field based on the filter size. Accurate estimation of the receptive field is crucial for understanding a filter’s representations.
Going beyond visualization, some methods diagnose a pre-trained CNN to gain insight into its representations.
Yosinski et al. [26] evaluated the transferability of filters in intermediate conv-layers. Aubry and Russell [1] computed feature distributions of different categories in the CNN feature space. The methods of [8, 19] propagated gradients of feature maps w.r.t. the CNN loss back to the image, in order to estimate image regions that directly contribute to the network output. Ribeiro et al. [17] proposed the LIME model to extract image regions that are used by a CNN to predict a label.
Network-attack methods [12, 23] diagnosed network representations by computing adversarial samples for a CNN. In particular, influence functions [12] were proposed to compute adversarial samples, provide plausible ways to create training samples to attack the learning of CNNs, fix the training set, and further debug representations of a CNN. Lakkaraju et al. [14] discovered knowledge blind spots (unknown patterns) of a pre-trained CNN in a weakly-supervised manner. Zhang et al. [29] developed a method to examine representations of conv-layers and automatically discover potential biased representations of a CNN caused by dataset bias.
In contrast to the diagnosis of CNN representations, some studies aim to learn more meaningful CNN representations. Some studies extracted neural units with certain semantics from CNNs for different applications. Given feature maps of conv-layers, Zhou et al. [31] extracted scene semantics. Simon et al. mined objects from feature maps of conv-layers and learned object parts [20, 21]. Sabour et al. [18] proposed a capsule model, which used a dynamic routing mechanism to parse the entire object into a parsing tree of capsules, where each capsule may encode a specific meaning. Zhang et al. [30] proposed to learn CNNs with disentangled intermediate-layer representations. [4, 11] learned interpretable input codes for generative models.
Decision trees for neural networks:
Frosst and Hinton [9] proposed to distill representations of a neural network for image classification into a decision tree, but the decision tree did not explain the network logic in a human-interpretable manner. Wu et al. [25] learned a decision tree via knowledge distillation to represent the output feature space of an RNN. This approach used the tree logic to regularize the RNN for better representations.
Despite the shared use of tree structures, there are two main differences between the above two studies and our research. Firstly, we focus on a new task of using a tree to semantically explain the logic of each prediction made by a pre-trained CNN. In contrast, decision trees in the above studies are mainly learned for classification and cannot provide semantic-level explanations. Secondly, we summarize decision modes from gradients w.r.t. neural activations of object parts as rationales to explain the CNN prediction. Compared to the above “distillation-based” methods, our “gradient-based” decision tree reflects the CNN logic more directly and strictly.
3.1 Preliminaries: learning a CNN with disentangled representations
[30] learned disentangled CNN representations by adding a loss to each filter in the top conv-layer, which pushes the representation of the filter towards a specific object part. ([30] assumes that positive samples belong to a category and share common parts, while negative samples are random images. A filter mainly encodes parts of positive samples.) Note that object parts do not need to be labeled for supervision. The CNN automatically assigns each filter a certain part during end-to-end learning.
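Let $x \in \mathbb{R}^{L\times L\times D}$ denote the feature map of the top conv-layer produced by the CNN on a given image $I$, where $L$ is referred to as the height/width of the feature map, and $D$ is referred to as the filter number. $x_d \in \mathbb{R}^{L\times L}$ denotes the feature map of the $d$-th filter $f$. As shown in Fig. 2, positive templates are designed for positive samples to denote ideal activation shapes when the target object part of the filter appears at different location candidates on $x_d$. A negative template is also used to describe feature maps on negative samples. The loss for the filter $f$ is given as the negative mutual information between all feature maps and all templates:

$$\mathrm{Loss}_f = -MI(\mathbf{X};\mathbf{T}) = -\sum_{T\in\mathbf{T}} p(T)\sum_{x_d\in\mathbf{X}} p(x_d\mid T)\,\log\frac{p(x_d\mid T)}{p(x_d)} \qquad (1)$$

where $\mathbf{X}$ denotes the set of feature maps of filter $f$ on all training samples, and $\mathbf{T}$ denotes all the templates. The prior probability $p(T)$ is defined as a constant. The fitness between a feature map $x_d$ and a template $T$ is measured as the conditional likelihood $p(x_d\mid T)$:

$$p(x_d\mid T) = \frac{1}{Z_T}\exp\big[\mathrm{tr}(x_d\cdot T)\big]$$

where $Z_T=\sum_{x_d\in\mathbf{X}}\exp[\mathrm{tr}(x_d\cdot T)]$. $\mathrm{tr}(\cdot)$ indicates the trace of a matrix, i.e. $\mathrm{tr}(x_d\cdot T)=\sum_{ij}x_{ij}t_{ij}$, where $x_{ij}$ and $t_{ij}$ denote the $(i,j)$-th elements of the matrices $x_d$ and $T$. Please see [30] for technical details.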
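As a rough illustration of the filter loss described above (the negative mutual information between feature maps and templates), the following sketch evaluates it on toy data. The function names `trace_score` and `filter_loss`, the nested-list representation of feature maps, and the normalization of the likelihood over the given set of feature maps are assumptions made for illustration only; [30] remains the reference for the actual formulation.

```python
import math

def trace_score(x, t):
    """tr(x . T) computed as the sum of element-wise products of two L x L maps."""
    return sum(xv * tv for x_row, t_row in zip(x, t) for xv, tv in zip(x_row, t_row))

def filter_loss(feature_maps, templates, priors):
    """Negative mutual information between feature maps and templates.

    feature_maps: list of L x L nested lists (one per training sample)
    templates:    list of L x L nested lists
    priors:       constant prior p(T) for each template (should sum to 1)
    """
    # p(x | T): softmax of the trace score over all feature maps, per template.
    likelihood = []
    for t in templates:
        scores = [math.exp(trace_score(x, t)) for x in feature_maps]
        z_t = sum(scores)
        likelihood.append([s / z_t for s in scores])
    # p(x) = sum_T p(T) p(x | T)
    marginal = [sum(p_t * lik[i] for p_t, lik in zip(priors, likelihood))
                for i in range(len(feature_maps))]
    mutual_info = sum(
        p_t * sum(l * math.log(l / m) for l, m in zip(lik, marginal))
        for p_t, lik in zip(priors, likelihood)
    )
    return -mutual_info
```

When every template is identical, the feature maps carry no information about the template and the loss is zero; templates that each match a different feature map drive the loss below zero.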
3.2 Learning a decision tree
We focus on the CNN for single-category classification. Let $\Omega$ denote the indexes of training samples, which consist of positive samples $\Omega^{+}$ and negative samples $\Omega^{-}$. Given each training image $I_i$ ($i\in\Omega$), $y_i^{*}$ and $y_i$ denote the ground-truth and the estimated label of the input image, respectively. We train the CNN based on the log logistic loss.
Part concepts in filters: The loss in Equation (1) ensures that each filter represents a specific object part. Let us focus on the $d$-th channel of the feature map $x$, $x_d$, which is produced by a specific filter $f$. The channel represents a disentangled object part. We can re-write the filter loss in Equation (1) as

$$\mathrm{Loss}_f = -H(\mathbf{T}) + H(\mathbf{T}'=\{T^{-},\mathbf{T}^{+}\}\mid\mathbf{X}) + \sum_{x_d} p(\mathbf{T}^{+},x_d)\,H(\mathbf{T}^{+}\mid X=x_d)$$

where $T^{-}$ is the negative template, $\mathbf{T}^{+}$ is the set of positive templates, and the first term is a constant. The second term encourages each filter to be exclusively triggered by positive samples. The third term encourages a low entropy of the spatial distribution of neural activations, i.e. each filter should be activated by only a single region of an object. It is assumed that a pattern that repetitively appears at different regions of an object usually represents a repetitive texture, instead of an object part. As shown in Fig. 3, this loss ensures that the $d$-th filter represents a part of the target object.
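Internal logic for CNN prediction: As discussed in prior work, the decision mode encoded in fully-connected layers for an image $I_i$ can be roughly described by piecewise-linear representations:

$$y_i \approx g_i \otimes x_i + b_i$$

where $\otimes$ denotes the convolution (here, the sum of element-wise products over all locations $(h,w,d)$) and $b_i$ is a bias term. The gradient w.r.t. the feature map, $g_i=\partial y_i/\partial x_i$, can be computed via gradient back-propagation. To simplify the computation, we use vectors $\mathbf{x}_i,\mathbf{g}_i\in\mathbb{R}^{D}$ to approximate the 3-order tensors $x_i$ and $g_i$ as $\mathbf{x}_i^{(d)}=\frac{1}{s_d}\sum_{h,w}x_i^{(h,w,d)}$ and $\mathbf{g}_i^{(d)}=\frac{s_d}{L^2}\sum_{h,w}g_i^{(h,w,d)}$, where $\mathbf{x}_i^{(d)}$ denotes the $d$-th element of $\mathbf{x}_i$ and $x_i^{(h,w,d)}$ denotes the element at the location $(h,w,d)$ in $x_i$. $s_d$ measures activation magnitudes of the $d$-th filter. The $s_d$-based normalization is conducted for more convincing results.
Considering part semantics of filters, activation values in different dimensions of $\mathbf{x}_i$ reflect the signal strength of different object parts, and the gradient $\mathbf{g}_i$ corresponds to the selection of object parts for the CNN prediction. Thus, we use $\mathbf{x}_i$ and $\mathbf{g}_i$ to describe the prediction rationale, i.e. which object parts are used for prediction.
Without loss of generality, we further normalize the gradient term to a unit vector for more reasonable results, i.e. $y_i\leftarrow y_i/\|\mathbf{g}_i\|$, $\mathbf{g}_i\leftarrow\mathbf{g}_i/\|\mathbf{g}_i\|$, and $b_i\leftarrow b_i/\|\mathbf{g}_i\|$.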
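The reduction from 3-order tensors to per-filter vectors, followed by the unit-gradient normalization, can be sketched as below. This is a minimal illustration under stated assumptions: the pooling formulas (sum over locations, scaled by a per-filter magnitude `s`), the nested-list tensor layout `[L][L][D]`, and the names `pool_to_vectors` and `normalize_mode` are choices made here, not code from the paper.

```python
import math

def pool_to_vectors(x, g, s):
    """Collapse L x L x D tensors into length-D vectors.

    x, g: nested lists indexed as [h][w][d] (feature map and its gradient)
    s:    per-filter activation magnitude, length D (assumed to be given)
    """
    size = len(x)
    depth = len(s)
    x_vec = [sum(x[h][w][d] for h in range(size) for w in range(size)) / s[d]
             for d in range(depth)]
    g_vec = [s[d] / (size * size) * sum(g[h][w][d] for h in range(size) for w in range(size))
             for d in range(depth)]
    return x_vec, g_vec

def normalize_mode(g_vec, b, y):
    """Rescale gradient, bias and score so that the gradient has unit length."""
    norm = math.sqrt(sum(v * v for v in g_vec))
    return [v / norm for v in g_vec], b / norm, y / norm
```

With a spatially constant gradient, the inner product of the two pooled vectors equals the full sum of element-wise products, which is why the scale `s` cancels between the two definitions.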
Tree: As shown in Fig. 4, we extract decision modes encoded in fully-connected layers of the CNN and build a decision tree to organize the hierarchy of decision modes. From the top node to terminal nodes, the decision tree encodes decision modes in a coarse-to-fine manner. Each node $v$ in the decision tree represents a common decision mode shared by a group of positive training samples $\Omega_v\subset\Omega^{+}$. The node may have a set of children nodes that further divide the decision mode of $v$ into fine-grained modes. Decision modes in terminal nodes are close to gradients of specific samples. For each node $v$, we formulate the rationales of its decision mode as

$$h_v(\mathbf{x}_i)=\mathbf{w}^{\top}\mathbf{x}_i+b,\qquad \mathbf{w}=\boldsymbol{\alpha}\circ\mathbf{g} \qquad (6)$$

where $\mathbf{w}$ is referred to as the rationales of the decision mode. $\mathbf{g}$ denotes a unit gradient, $\|\mathbf{g}\|=1$. $\boldsymbol{\alpha}\in\{0,1\}^{D}$ is given as a binary vector to select a few filters to construct the decision mode. $\circ$ denotes element-wise multiplication. $\mathbf{g}$ reflects common gradients of all samples within $v$’s sample set $\Omega_v$, and we compute parameters $\boldsymbol{\alpha}$ and $b$ via sparse representations.
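To make the node model concrete, the sketch below scores a sample with one decision mode: a binary mask selects a few filters, the masked unit gradient acts as the rationale vector, and a bias is added. The function name `node_score` and the linear form of the score are assumptions for illustration; how the mask and bias are fitted (via sparse representations) is not shown.

```python
def node_score(alpha, g, b, x):
    """Score of one decision mode: (alpha * g) . x + b.

    alpha: binary filter-selection vector (0 or 1 per filter)
    g:     unit gradient shared by the node's samples
    b:     scalar bias of the node
    x:     per-filter activation vector of the sample
    """
    # Rationale vector: keep the gradient only on the selected filters.
    w = [a * gi for a, gi in zip(alpha, g)]
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

For example, `node_score([1, 0, 1], [0.6, 0.0, 0.8], 0.5, [1.0, 2.0, 3.0])` ignores the second filter entirely, so only the first and third object-part patterns affect the score.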
Learning: The basic idea of learning a decision tree is to summarize common decision modes from specific decision modes of different samples to represent general rationales for CNN predictions. At the beginning, we initialize the gradient $\mathbf{g}_i$ of each positive sample $I_i$ as a terminal node by setting $\mathbf{g}=\mathbf{g}_i$ and $\boldsymbol{\alpha}=\mathbf{1}$. Thus, we build an initial tree $Q$ as shown in Fig. 4, in which the top node takes gradients of all positive samples as children. Then, in each step, we select and merge two nodes $v,v'\in V$ in the second layer (i.e. children of the top node) to obtain a new node $u$, where $V$ denotes the set of nodes in the second layer. $v$ and $v'$ become $u$’s children, and $u$ replaces $v$ and $v'$ as a new child of the top node. In this way, we gradually modify the initial tree $\mathcal{P}_0=Q$ towards the final decision tree after $T$ merging operations as $Q=\mathcal{P}_0\rightarrow\mathcal{P}_1\rightarrow\cdots\rightarrow\mathcal{P}_T=\hat{\mathcal{P}}$.
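We formulate the overall objective of learning as follows.

$$\max_{\mathcal{P}}E,\qquad E=\prod_{i\in\Omega^{+}}P(\mathbf{x}_i)\cdot e^{-\beta|V|}$$

where $P(\mathbf{x}_i)$ denotes the likelihood of $\mathbf{x}_i$ being positive that is estimated by the tree $\mathcal{P}$. $\prod_{i}P(\mathbf{x}_i)$ indicates the discriminative power of $\mathcal{P}$. $\beta$ is a scaling parameter. This objective penalizes the decrease of the discriminative power and encourages the system to use a few decision modes as compact representations for the CNN.
We compute the likelihood of $\mathbf{x}_i$ being positive as

$$P(\mathbf{x}_i)=\frac{e^{\gamma\hat{h}(\mathbf{x}_i)}}{\sum_{j\in\Omega}e^{\gamma\hat{h}(\mathbf{x}_j)}}$$

where $\hat{h}(\mathbf{x}_i)=\max_{v\in V}h_v(\mathbf{x}_i)$ denotes the prediction on $\mathbf{x}_i$ based on $\mathcal{P}$, i.e. the value estimated by the best child in the second layer. $\gamma$ is a constant scaling parameter.
In the $t$-th step, we merge two nodes $v,v'\in V$ in the second layer of $\mathcal{P}_t$ to get a new node $u$ and obtain a tree $\mathcal{P}_{t+1}$. Because the merger reduces $|V|$ by one, the increase of $\log E$ is given as

$$\Delta\log E=\sum_{i\in\Omega^{+}}\log\frac{P_{t+1}(\mathbf{x}_i)}{P_t(\mathbf{x}_i)}+\beta$$

Based on the above equation, we learn the decision tree via a greedy strategy. In each step, we select and merge the nodes $v,v'\in V$ that maximize the value of $\Delta\log E/|\Omega_u|$. We normalize $\Delta\log E$ using the sample number of the selected nodes to get more reasonable clustering performance. Because each node-merging operation only affects $\hat{h}$ values of a few examples in $\Omega$, we can quickly estimate $\Delta\log E$ for each pair of nodes $v,v'$.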
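The greedy procedure above can be sketched as follows. This is a simplified stand-in rather than the method itself: a merged node's rationale is taken as the normalized, sample-weighted average of its children's vectors instead of being fitted by sparse representations, merging stops once no pair yields a positive normalized gain (a stopping rule assumed here), and every pair is re-scored from scratch rather than updated incrementally. All function names are hypothetical.

```python
import math
from itertools import combinations

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [p / norm for p in v] if norm > 0 else list(v)

def make_leaf(g, b, index):
    """Terminal node built from the gradient of one positive sample."""
    return {"w": unit(g), "b": b, "samples": [index], "children": []}

def merge(u, v):
    """Stand-in merge: sample-weighted average direction and bias."""
    n_u, n_v = len(u["samples"]), len(v["samples"])
    w = unit([n_u * p + n_v * q for p, q in zip(u["w"], v["w"])])
    b = (n_u * u["b"] + n_v * v["b"]) / (n_u + n_v)
    return {"w": w, "b": b, "samples": u["samples"] + v["samples"], "children": [u, v]}

def log_objective(layer, xs, positives, beta, gamma):
    """log E = sum over positives of log P(x_i), minus beta * number of modes."""
    h_hat = [max(dot(n["w"], x) + n["b"] for n in layer) for x in xs]
    log_z = math.log(sum(math.exp(gamma * h) for h in h_hat))
    return sum(gamma * h_hat[i] - log_z for i in positives) - beta * len(layer)

def greedy_merge(leaves, xs, positives, beta=1.0, gamma=1.0):
    """Repeatedly merge the second-layer pair with the largest normalized gain."""
    layer = list(leaves)
    while len(layer) > 1:
        base = log_objective(layer, xs, positives, beta, gamma)
        best_gain, best_layer = None, None
        for i, j in combinations(range(len(layer)), 2):
            new = merge(layer[i], layer[j])
            trial = [n for k, n in enumerate(layer) if k not in (i, j)] + [new]
            gain = log_objective(trial, xs, positives, beta, gamma) - base
            gain /= len(new["samples"])  # normalize by the merged sample number
            if best_gain is None or gain > best_gain:
                best_gain, best_layer = gain, trial
        if best_gain <= 0:
            break
        layer = best_layer
    return layer
```

With a small `beta`, samples that share a gradient direction are merged while a sample with a different direction stays in its own mode; a larger `beta` trades discriminative power for fewer modes.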
3.3 Interpreting CNNs
Given a testing image $I_i$, we use the CNN to make a prediction $y_i$. We can use the decision tree to compute quantitative explanations for rationales of the prediction. During the inference procedure, we can infer a parse tree, which starts from the top node, in a top-down manner. Red lines in Fig. 4 show a parse tree. When we select node $u$ to represent the decision mode of $I_i$, we can further select its child $\hat{v}$ that maximizes the compatibility between $I_i$’s true gradients and the node mode as a more fine-grained mode:

$$\hat{v}=\arg\max_{v\in\mathrm{Child}(u)}\mathrm{cosine}(\mathbf{g}_i,\mathbf{w}_v)$$

where $\mathbf{w}_v$ denotes the parameter $\mathbf{w}$ of node $v$ (unlike in Equation (6), we add the subscript $v$ to differentiate the parameter from those of other nodes).
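The top-down inference of a parse tree can be sketched as below, assuming that compatibility is measured by cosine similarity between the sample's gradient and each child's rationale vector. The dictionary layout of a node (`"w"`, `"children"`, `"name"`) and the function names are hypothetical.

```python
import math

def cosine(a, b):
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(p * q for p, q in zip(a, b)) / (norm_a * norm_b)

def parse_path(root, g):
    """Walk from the top node to a terminal node, always taking the child
    whose rationale vector is most compatible with the gradient g."""
    path = [root]
    node = root
    while node["children"]:
        node = max(node["children"], key=lambda child: cosine(g, child["w"]))
        path.append(node)
    return path
```

Each node on the returned path explains the same prediction at a different granularity, from the generic mode near the top to the most specific mode at the bottom.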
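A node $v$ in the decision tree provides an explanation for the prediction of image $I_i$ at a certain fine-grained level. We compute the vectors $\boldsymbol{\rho}_i$ and $\boldsymbol{\varrho}_i$ to evaluate the contribution of different filters and different object parts in the prediction.

$$\boldsymbol{\rho}_i=\mathbf{w}_v\circ\mathbf{x}_i,\qquad \boldsymbol{\varrho}_i=A\,\boldsymbol{\rho}_i \qquad (14)$$

where the $d$-th element of $\boldsymbol{\rho}_i$, $\rho_i^{(d)}$, denotes the contribution to the CNN prediction that is made by the $d$-th filter. If $\rho_i^{(d)}>0$, then the $d$-th filter makes a positive contribution. If $\rho_i^{(d)}<0$, then the $d$-th filter makes a negative contribution. We use a matrix $A\in\{0,1\}^{M\times D}$ to assign each filter in the top conv-layer to a specific object part, where $M$ is the part number. Similarly, the $m$-th element of $\boldsymbol{\varrho}_i$, $\varrho_i^{(m)}$, measures the contribution of the $m$-th object part.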
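A small numerical sketch of the two contribution vectors follows, assuming that a filter's contribution is the element-wise product of the node's rationale vector and the sample's activations, and that a binary filter-to-part matrix sums filter contributions into part contributions. The function names and toy values are illustrative only.

```python
def filter_contributions(w, x):
    """Per-filter contribution: rationale weight times activation."""
    return [wi * xi for wi, xi in zip(w, x)]

def part_contributions(assignment, rho):
    """Per-part contribution: sum of the contributions of the filters
    assigned to each part. assignment[m][d] is 1 if filter d belongs to part m."""
    return [sum(a * r for a, r in zip(row, rho)) for row in assignment]

# Three filters, two parts: filters 0 and 2 belong to part 0, filter 1 to part 1.
rho = filter_contributions([0.5, -0.2, 0.3], [2.0, 1.0, 4.0])
parts = part_contributions([[1, 0, 1], [0, 1, 0]], rho)
```

Here the second filter has a negative weight, so its part receives a negative contribution, while the first part accumulates the positive contributions of its two filters.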
4.1 Implementation details
We followed [30] to modify a VGG-16 network into a disentangled CNN, which changed the top conv-layer of the VGG-16 to a disentangled conv-layer and further added a disentangled conv-layer on top of it. We used feature maps of the new top conv-layer as the input of our decision tree. We loaded parameters of all thirteen old conv-layers directly from a VGG-16 network that was pre-trained using images in the ImageNet ILSVRC 2012 dataset with a loss for 1000-category classification. We initialized parameters of the new top conv-layer and all fully-connected layers. We then fine-tuned the CNN for single-category classification using three benchmark datasets (which will be introduced later). We used the same parameter settings in all experiments to enable fair comparisons.
4.2 Datasets
Because the quantitative explanation of CNN predictions requires us to assign each filter in the top conv-layer to a specific object part, we used three benchmark datasets with ground-truth part annotations to evaluate our method. The selected datasets include the PASCAL-Part dataset [5], the CUB200-2011 dataset [24], and the ILSVRC 2013 DET Animal-Part dataset [28]. As in most part-localization studies [5, 28], we used animal categories, which prevalently contain non-rigid shape deformation, for evaluation. Specifically, we selected six animal categories (bird, cat, cow, dog, horse, and sheep) from the PASCAL-Part dataset. The CUB200-2011 dataset contains 11.8K images of 200 bird species. As in [3, 21, 28], we ignored species labels and regarded all these images as a single bird category. The ILSVRC 2013 DET Animal-Part dataset [28] consists of 30 animal categories among all the 200 categories for object detection in the ILSVRC 2013 DET dataset [6].
4.3 Analysis of object parts for prediction
In this subsection, we analyzed the contribution of different object parts in the CNN prediction, when we assigned each filter to a specific object part. The part-level contribution vector in Equation (14) specifies the contribution of different object parts in the prediction. For each object part, we computed the ratio of its contribution to the total contribution of all parts.
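One plausible way to turn part-level contributions into ratios is to divide each contribution magnitude by the sum of all magnitudes. This normalization is an assumption made here for illustration (the exact definition is not recoverable from the text), and `contribution_ratios` is a hypothetical helper name.

```python
def contribution_ratios(part_contributions):
    """Share of each part in the total absolute contribution."""
    total = sum(abs(c) for c in part_contributions)
    if total == 0:
        return [0.0 for _ in part_contributions]
    return [abs(c) / total for c in part_contributions]
```

For example, contributions of `[2.0, -1.0, 1.0]` give ratios of 50%, 25% and 25%, which is the form of statement used in the introduction (e.g. a head pattern contributing a certain percentage).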
More specifically, for CNNs based on the ILSVRC 2013 DET Animal-Part dataset, we manually labeled the object part for each filter in the top conv-layer. For CNNs based on the Pascal VOC Part dataset [5], [28] merged tens of small parts into several major landmark parts for the six animal categories. Given a CNN for a certain category, we used [31] to estimate regions in different images that corresponded to each filter’s neural activations, namely the image receptive field of the filter (please see Figs. 6 and 3). For each filter, we selected, from all major landmark parts, the part that was closest to the filter’s image receptive field across all positive samples. For the CNN based on the CUB200-2011 dataset, we used ground-truth positions of the breast, forehead, nape, and tail of birds as major landmark parts. Similarly, we assigned each filter in the top conv-layer to the nearest landmark part.
4.4 Evaluation metrics
We used three metrics to evaluate the accuracy of the decision tree. The first metric is the classification accuracy. Because $\hat{h}(\mathbf{x}_i)$ denotes the prediction of $y_i$ based on the best child in the second layer, we regarded $\hat{h}(\cdot)$ as the output of the tree and evaluated the discrimination power of $\hat{h}(\cdot)$. We used values of $\hat{h}(\mathbf{x}_i)$ for classification and compared its classification accuracy with the accuracy of the CNN.
Because the disentangled CNN representation selectively learns object-part patterns in positive samples and treats negative samples as random images, we used the other two metrics to identify whether tree nodes correctly reflect CNN representations of positive samples. The second metric, namely the prediction error, measures the error of the estimated value $\hat{h}(\mathbf{x}_i)$ w.r.t. the true value $y_i$. We computed the prediction error as $\mathbb{E}_{i\in\Omega^{+}}\big[|\hat{h}(\mathbf{x}_i)-y_i|\big]/(\max_{i\in\Omega}y_i-\min_{i\in\Omega}y_i)$, where we normalized the error using the value range of $y_i$.
The third metric, namely the fitness of filter contribution, compares the ground-truth contribution of different filters in the top conv-layer with the estimated contribution of these filters during the prediction process. When the decision tree uses node $v$ to explain the prediction for $I_i$, the contribution vector $\boldsymbol{\rho}_i$ in Equation (14) denotes the contribution distribution of different filters. $\mathbf{t}_i=\mathbf{g}_i\circ\mathbf{x}_i$ corresponds to the ground-truth contribution distribution. We reported the intersection-over-union value between $\boldsymbol{\rho}_i$ and $\mathbf{t}_i$ to measure the fitness of the ground-truth and the estimated filter contributions. Specifically, we computed the fitness as $\mathbb{E}_{i\in\Omega^{+}}\Big[\frac{\sum_d\min(\hat{\rho}_i^{(d)},\hat{t}_i^{(d)})}{\sum_d\max(\hat{\rho}_i^{(d)},\hat{t}_i^{(d)})}\Big]$, where $\hat{\rho}_i^{(d)}$ and $\hat{t}_i^{(d)}$ denote non-negative versions of the $d$-th elements of $\boldsymbol{\rho}_i$ and $\mathbf{t}_i$. We used non-negative values because the vectors $\boldsymbol{\rho}_i$ and $\mathbf{t}_i$ may have negative elements.
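The second and third metrics can be sketched as below. Two details are assumptions made for this illustration: negative contributions are clamped to zero to obtain non-negative values, and the value range used for normalization is passed in by the caller. The function names are hypothetical.

```python
def prediction_error(estimates, scores, score_range):
    """Mean absolute gap between tree estimates and CNN scores,
    normalized by the value range of the CNN scores."""
    gaps = [abs(e - y) for e, y in zip(estimates, scores)]
    return sum(gaps) / len(gaps) / score_range

def contribution_fitness(estimated, reference):
    """Intersection-over-union between two filter-contribution vectors,
    computed on their non-negative parts."""
    est = [max(v, 0.0) for v in estimated]
    ref = [max(v, 0.0) for v in reference]
    union = sum(max(a, b) for a, b in zip(est, ref))
    if union == 0:
        return 0.0
    return sum(min(a, b) for a, b in zip(est, ref)) / union
```

In practice both quantities would be averaged over positive samples; the per-sample fitness lies between 0 (no overlap) and 1 (identical non-negative contributions).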
Evaluation for nodes in different layers: The above three metrics evaluate decision modes (nodes) in the second layer of the decision tree. Because nodes in lower layers encode more fine-grained decision modes, we extended these three metrics to evaluate nodes in lower layers. When we evaluated nodes in the $k$-th layer, we temporarily constructed a new tree by removing all nodes above the $k$-th layer and directly connecting the top node to nodes in the $k$-th layer. Thus, we could apply the evaluation to the new tree.
4.5 Experimental results and analysis
Table 1 shows the structure of the decision tree by listing numbers of nodes in different layers of the decision tree. Fig. 5 visualizes decision modes in the decision tree. Fig. 6 shows distributions of object-part contributions to the CNN prediction, which were estimated using nodes in the second layer of decision trees. Tables 2, 3, and 4 show the average classification accuracy, the average prediction error, and the average fitness of filter contribution based on nodes in different layers. Generally speaking, when we used more fine-grained decision modes to explain the prediction logic, the explanation would better fit the actual logic in the CNN, and the prediction error of using decision modes would decrease. However, fine-grained decision modes did not exhibit higher accuracy in classification, because the objective of our method is to summarize decision modes of a pre-trained CNN instead of improving the classification accuracy.
5 Conclusion and discussions
In this study, we focus on a new task of using a decision tree to explain the logic in a CNN at the semantic level. We have developed a method to revise a CNN and built a tight coupling of the CNN and a decision tree. The proposed decision tree encodes decision modes of the CNN as quantitative rationales for each CNN prediction. Our method does not need any annotations of object parts or textures in training samples to guide the learning of the CNN. We have tested our method on different benchmark datasets, and experiments have demonstrated the effectiveness of our approach.
Note that theoretically, the decision tree just provides an approximate explanation for CNN predictions, instead of an accurate reconstruction of CNN representation details. There are two reasons. Firstly, without accurate object-part annotations to supervise the learning of CNNs, [30] can only roughly make each filter represent an object part. The filter may produce incorrect activations in a few challenging samples. Secondly, the decision mode in each node ignores insignificant object-part patterns (filters) to ensure a sparse representation of the decision mode.
References
[1] M. Aubry and B. C. Russell. Understanding deep features with computer-generated imagery. In ICCV, 2015.
[2] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba. Network dissection: Quantifying interpretability of deep visual representations. In CVPR, 2017.
[3] S. Branson, P. Perona, and S. Belongie. Strong supervision from weak annotation: Interactive training of deformable part models. In ICCV, 2011.
[4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
[5] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] A. Dosovitskiy and T. Brox. Inverting visual representations with convolutional networks. In CVPR, 2016.
[8] R. C. Fong and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In arXiv:1704.03296v1, 2017.
[9] N. Frosst and G. Hinton. Distilling a neural network into a soft decision tree. In arXiv:1711.09784, 2017.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[11] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[12] P. Koh and P. Liang. Understanding black-box predictions via influence functions. In ICML, 2017.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[14] H. Lakkaraju, E. Kamar, R. Caruana, and E. Horvitz. Identifying unknown unknowns in the open world: Representations and policies for guided exploration. In AAAI, 2017.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 1998.
[16] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.
[17] M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?” Explaining the predictions of any classifier. In KDD, 2016.
[18] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In NIPS, 2017.
[19] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In arXiv:1610.02391v3, 2017.
[20] M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. In ICCV, 2015.
[21] M. Simon, E. Rodner, and J. Denzler. Part detector discovery in deep convolutional neural networks. In ACCV, 2014.
[22] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[23] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In arXiv:1312.6199v4, 2014.
[24] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011.
[25] M. Wu, M. C. Hughes, S. Parbhoo, M. Zazzi, V. Roth, and F. Doshi-Velez. Beyond sparsity: Tree regularization of deep models for interpretability. In NIPS TIML Workshop, 2017.
[26] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
[27] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[28] Q. Zhang, R. Cao, Y. N. Wu, and S.-C. Zhu. Growing interpretable part graphs on ConvNets via multi-shot learning. In AAAI, 2016.
[29] Q. Zhang, W. Wang, and S.-C. Zhu. Examining CNN representations with respect to dataset bias. In AAAI, 2018.
[30] Q. Zhang, Y. N. Wu, and S.-C. Zhu. Interpretable convolutional neural networks. In arXiv:1710.00935, 2017.
[31] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. In ICLR, 2015.