Visualizing the decision-making process in deep neural decision forest

04/19/2019 ∙ by Shichao Li, et al. ∙ The Hong Kong University of Science and Technology

Deep neural decision forest (NDF) has achieved remarkable performance on various vision tasks by combining decision trees with deep representation learning. In this work, we first trace the decision-making process of this model and visualize saliency maps to understand which portion of the input influences it most, for both classification and regression problems. We then apply NDF to a multi-task coordinate regression problem and demonstrate the distribution of routing probabilities, which is vital for interpreting NDF yet has not been shown for regression problems. The pre-trained model and code for visualization will be available at




1 Introduction

Traditional decision trees [5, 3] are interpretable since they conduct inference by making decisions: an input is routed by a series of splitting nodes and the conclusion is drawn at one leaf node. Training these models follows a local greedy heuristic [5, 3], where a purity metric such as entropy is adopted to select the best splitting function from a candidate set at each splitting node. Hand-crafted features were usually used, and the model's representation learning ability is therefore limited.

Deep neural decision forest (NDF) [4] and its later regression version [6] formulated a probabilistic routing framework for decision trees. As a result, the loss function is differentiable with respect to the parameters used in the splitting functions, enabling gradient-based optimization in a global way. Despite the success of NDF, little effort has been devoted to visualizing its decision-making process. In addition, the deep representation learning ability brought by the soft-routing framework comes at the price of visiting every leaf node in the tree. The model would be more similar to a traditional decision tree, and more interpretable, if only a few leaf nodes contributed to the final prediction. Fortunately, this desired property was demonstrated by the distribution of routing probabilities in [4] for an image classification problem. To the best of our knowledge, this property has not yet been validated for any regression problem.

Figure 1: Illustration of the decision-making process in deep neural decision forest. Input images are routed (red arrows) by splitting nodes and arrive at the predictions given at leaf nodes. The feature extractor computes a deep representation from the input and sends it (blue arrows) to each splitting node for decision making. Best viewed in color.

In this paper, we trace the routing of input images and apply a gradient-based technique to visualize the important portions of the input that affect NDF's decision-making process. We also apply NDF to a new multi-task regression problem and visualize the distribution of routing probabilities to fill this gap. In summary, our contributions are:

  • We trace the decision-making process of NDF and compute saliency maps to visualize which portion of the input influences it more.

  • We utilize NDF on a new regression problem and visualize the distribution of routing probabilities to validate its interpretability.

2 Related works

Traditional classification and regression trees make predictions by decision making, where hand-crafted features [5, 3] are computed to split the feature space and route the input. Deep neural decision forest (NDF) [4] and its regression variant [6] were proposed to equip traditional decision trees with deep feature learning ability. Gradient-based methods were adopted to understand the predictions made by traditional deep convolutional neural networks (CNNs). However, this visualization technique has not yet been applied to NDF. Another orthogonal line of research attempts to learn more interpretable representations [9] and organize the inference process into a decision tree [10]. Our work is different from these since it is more of a visualization-based model diagnosis, and no additional loss function is used in the training phase to drive semantically meaningful feature learning as in [9].

3 Methodology

A deep neural decision forest (NDF) is an ensemble of deep neural decision trees. Each tree consists of splitting nodes and leaf nodes. In general each tree can have an unconstrained topology, but here we specify every tree as a full binary tree for simplicity. We index the nodes sequentially with integers as shown in Figure 1.

A splitting node $i$ is associated with a recommendation (splitting) function $s_i$ that takes the deep features extracted from the input $\mathbf{x}$ and gives a recommendation score (routing probability) $s_i(\mathbf{x}) \in [0, 1]$, the probability that the input is recommended (routed) to its left sub-tree.

We call the unique path from the root node to a leaf node a computation path. Each leaf node $l$ stores one function that maps the input into a prediction vector $\mathbf{p}_l(\mathbf{x})$. To get the final prediction $\mathbf{p}(\mathbf{x})$, each leaf node contributes its prediction vector weighted by the probability $w_l$ of taking its computation path,

$$\mathbf{p}(\mathbf{x}) = \sum_{l \in \mathcal{L}} w_l \, \mathbf{p}_l(\mathbf{x}),$$

where $\mathcal{L}$ is the set of all leaf nodes. The weight $w_l$ can be obtained by multiplying all the recommendation scores given by the splitting nodes along the path. Assume the path consists of a sequence of splitting nodes and one leaf node, $\mathcal{P}_l = (N_1^{c_1}, N_2^{c_2}, \ldots, N_m^{c_m}, l)$, where the superscript $c_j \in \{0, 1\}$ for a splitting node denotes to which child node the input is routed: $c_j = 0$ means the input is routed to the left child and $c_j = 1$ otherwise. Then the weight can be expressed as

$$w_l = \prod_{j=1}^{m} s_{N_j}(\mathbf{x})^{1 - c_j} \left(1 - s_{N_j}(\mathbf{x})\right)^{c_j}.$$

Note that the weights of all leaf nodes sum to 1, and the final prediction is hence a convex combination of all the prediction vectors of the leaf nodes. In addition, we assume the recommendation and mapping functions mentioned above are differentiable and parametrized by $\boldsymbol{\theta}_i$ at node $i$. The final prediction is then a differentiable function with respect to all the parameters, which we omit above for clarity. A loss function defined on the final prediction can hence be minimized with the back-propagation algorithm.
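As a concrete sketch of this soft routing, the following NumPy toy (the breadth-first node indexing and the depth-2 tree are our own illustrative choices, not the paper's implementation) computes the leaf weights from per-node recommendation scores and forms the final prediction as a convex combination of the leaf prediction vectors:

```python
import numpy as np

def leaf_weights(scores):
    """Routing weight of every leaf in a full binary tree.

    scores: recommendation scores (probability of routing LEFT) for the
    2**d - 1 splitting nodes in breadth-first order, for a tree of depth d.
    A leaf's weight is the product of the scores (or one minus them)
    along its root-to-leaf computation path.
    """
    n_leaf = len(scores) + 1
    depth = int(np.log2(n_leaf))
    w = np.ones(n_leaf)
    for leaf in range(n_leaf):
        node = 0                                       # start at the root
        for level in range(depth):
            # bits of the leaf index (MSB first) encode the path: 0 = left
            go_left = ((leaf >> (depth - 1 - level)) & 1) == 0
            w[leaf] *= scores[node] if go_left else 1.0 - scores[node]
            node = 2 * node + (1 if go_left else 2)    # breadth-first child
    return w

scores = np.array([0.9, 0.8, 0.3])          # a fairly decisive depth-2 tree
weights = leaf_weights(scores)              # one weight per leaf, summing to 1

# Final prediction: convex combination of the leaf prediction vectors.
leaf_preds = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]])
final = weights @ leaf_preds
```

Because the weights sum to 1, `final` is itself a valid probability vector whenever the leaf predictions are, which is exactly the convexity property noted above.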

Note that here all computation paths contribute to the final prediction of this model, unlike a traditional decision tree where only one path is taken for each input. We believe the model is more interpretable and more similar to traditional decision trees when only a few computation paths contribute to the final prediction. This has been shown to be the case for a classification problem in [4]. Here we also demonstrate the distribution of routing probabilities for a regression problem.

To understand how the input can influence the decision-making of this model, we take the gradient of the routing probability with respect to the input and name it the decision saliency map (DSM),

$$\mathrm{DSM} = \frac{\partial s_i(\mathbf{x})}{\partial \mathbf{x}}.$$
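To make the DSM concrete, the sketch below uses a hypothetical one-node splitting function (a sigmoid over a linear feature of the input, standing in for the real CNN-based function) and approximates $\partial s / \partial \mathbf{x}$ by central finite differences; in practice one would use autograd, e.g. PyTorch's `torch.autograd.grad`:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def routing_prob(x, w, b):
    """Hypothetical splitting function: sigmoid of a linear feature of x."""
    return sigmoid(np.dot(w, x) + b)

def decision_saliency(x, w, b, eps=1e-5):
    """Central finite-difference approximation of ds/dx -- a stand-in for
    differentiating the real CNN-based splitting function with autograd."""
    g = np.zeros_like(x)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (routing_prob(xp, w, b) - routing_prob(xm, w, b)) / (2 * eps)
    return g

x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.0, -0.5])
dsm = decision_saliency(x, w, 0.1)
# For this toy function the analytic gradient is s * (1 - s) * w.
s = routing_prob(x, w, 0.1)
```

The entry of `dsm` corresponding to the zero weight is itself zero: input dimensions the splitting function ignores get no saliency, which is what the maps in Fig. 2 and Fig. 3 visualize per pixel.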
For classification problems, the prediction vector $\mathbf{p}_l$ of each leaf node is a discrete probability distribution vector whose length equals the number of classes; its $c$-th entry gives the probability that the input belongs to class $c$. For regression problems, $\mathbf{p}_l$ is also a real-valued vector, but its entries do not necessarily sum to 1. The optimization target for classification is to minimize the negative log-likelihood loss over the whole training set containing $N$ instances $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$,

$$L = -\sum_{i=1}^{N} \log \mathbf{p}(\mathbf{x}_i)_{y_i}.$$

For a multi-task regression problem with instances $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, we directly use the squared loss function,

$$L = \sum_{i=1}^{N} \lVert \mathbf{p}(\mathbf{x}_i) - \mathbf{y}_i \rVert_2^2.$$
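Both objectives can be written down in a few lines; a minimal NumPy sketch with made-up predictions and targets:

```python
import numpy as np

# Classification: negative log-likelihood of the final predictions p(x_i),
# each row being a convex combination of leaf distribution vectors.
probs = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
labels = np.array([0, 1])                    # ground-truth classes y_i
nll = -np.log(probs[np.arange(len(labels)), labels]).sum()

# Regression: squared loss between prediction vectors and targets y_i.
preds = np.array([[1.0, 2.0], [0.5, -0.5]])
targets = np.array([[0.9, 2.1], [0.4, -0.3]])
sq_loss = np.sum((preds - targets) ** 2)
```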

In the experiments, we use a deep CNN to extract features from the input and a sigmoid function to compute the recommendation scores from the features. The network parameters and the leaf node prediction vectors are optimized alternately, by back-propagation and by an update rule, respectively. Details about the network architectures, training algorithm and hyper-parameter settings can be found in our supplementary materials (included in the GitHub repository).

Figure 2: Decision saliency maps for MNIST test set. Each row gives the decision-making process of one image, where the left-most image is the input and the others are DSMs along the computation path of the input. Each DSM is computed by taking derivative of the routing probability with respect to the input image. Model prediction is given above the input image and (Na, Pb) means the input arrives at splitting node a with probability b during the decision-making process.
Figure 3: Decision saliency maps for CIFAR-10 test set using the same annotation style as Figure 2.

4 Experiments

4.1 Classification for MNIST and CIFAR-10

Standard datasets provided by PyTorch are used. We use one full binary tree of depth 9 for both datasets, but the complexity of the feature extractor for CIFAR-10 is higher. The Adam optimizer is used with a learning rate of 0.001. Test accuracies for different datasets and feature extractors are shown in Table 1. For each test image, we record the computation path that is taken with the largest probability, and compute DSMs for some random samples as shown in Fig. 2 and Fig. 3. The tree is very decisive, as indicated by the probability of arriving at each splitting node. In addition, the foreground usually affects the decision more, as expected and similar to [7]. Interestingly, the highlights (yellow dots) of the DSMs along a computation path vary a lot for some examples. This means the network looks at different regions of the input while deciding how to route it. Another interesting observation is that the model mis-classifies a dog as a bird when it is not certain about its decision.
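Recording the most probable computation path amounts to a greedy walk down the tree; a small sketch (the breadth-first node indexing is our own convention, and the scores are made up):

```python
import numpy as np

def trace_path(s):
    """Greedily follow the most probable root-to-leaf path in a full binary
    tree, given recommendation scores s (probability of routing left) for
    the splitting nodes in breadth-first order.  Returns (node, probability
    of arriving at that node) pairs, mirroring the (Na, Pb) annotation of
    Fig. 2 and Fig. 3."""
    node, prob, path = 0, 1.0, [(0, 1.0)]
    while node < len(s):                  # splitting nodes precede leaves
        if s[node] >= 0.5:                # route left
            prob *= s[node]
            node = 2 * node + 1
        else:                             # route right
            prob *= 1.0 - s[node]
            node = 2 * node + 2
        path.append((node, prob))
    return path

path = trace_path(np.array([0.9, 0.8, 0.3]))
```

A decisive tree has scores near 0 or 1 at every visited node, so the arrival probability stays close to 1 along the whole path.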

Figure 4: Face alignment using a cascade of NDFs. A coarse shape initialization can be updated to well fit the ground truth after 10 stages. Best viewed in color.
Figure 5: Distribution of recommendation scores for boosted regression with NDF. Three stages are visualized and the model is very decisive as the distribution is peaked around 0 and 1.

4.2 Cascaded regression on 3DFAW

Here we study the decision-making process for a more complex multi-coordinate regression problem on the 3DFAW dataset [2]. To the best of our knowledge, this is the first time NDF is boosted and applied to a multi-task regression problem. For an input image $\mathbf{x}$, the goal is to predict the positions of 66 facial landmarks as a vector $\mathbf{y}$. We start with an initialized shape $\mathbf{y}_0$ and use a cascade of NDFs to update the estimated facial shape stage by stage. The final prediction is

$$\hat{\mathbf{y}} = \mathbf{y}_0 + \sum_{t=1}^{T} \Delta\mathbf{y}_t,$$

where $T$ is the total number of stages and $\Delta\mathbf{y}_t$ is the shape update (model prediction) at stage $t$. We concatenate 66 local patches cropped around the currently estimated facial landmarks as the input, and every leaf node stores a vector as the shape update. We use a cascade length of 10, and in each stage an ensemble of 3 trees is used, each of depth 5. The model prediction is shown in Fig. 4.
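The cascade update is a simple accumulation loop; in the sketch below the per-stage NDF ensembles are stood in for by toy callables that halve the residual to a fixed ground truth (purely illustrative, not the trained model):

```python
import numpy as np

def cascade_predict(x, init_shape, stage_models):
    """Cascaded regression: y_hat = y_0 + sum_t Delta_y_t, where each stage
    predicts a shape update from the input and the current shape estimate."""
    shape = init_shape.copy()
    for model in stage_models:
        shape = shape + model(x, shape)   # Delta_y_t for stage t
    return shape

# Toy stages: each moves the shape halfway toward a fixed ground truth,
# so the residual shrinks by 2**-10 over a cascade of length T = 10.
gt = np.array([1.0, -2.0, 0.5])
stages = [lambda x, s: 0.5 * (gt - s)] * 10
pred = cascade_predict(None, np.zeros(3), stages)
```

This geometric shrinking of the residual is the same qualitative behavior as Fig. 4, where a coarse initialization fits the ground truth well after 10 stages.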

The distribution of recommendation scores for this regression problem is shown in Fig. 5, which is consistent with the results for classification in [4]. This means NDF is also decisive for a regression problem, and the model can approximate the decision-making process of traditional regression trees. The input patches to the model and their corresponding DSMs for a randomly chosen splitting node are shown in Fig. 6. From these maps we can tell which part of the face influences the decision more during the routing of the input.

Dataset Feature extractor Accuracy
MNIST Shallow CNN 99.3%
CIFAR-10 VGG16 [8] 92.4%
CIFAR-10 ResNet50 [1] 93.4%
Table 1: Accuracies for the classification experiments with different feature extractors.
Figure 6: Input patches to NDF for regression and their corresponding DSMs.

5 Conclusion

We visualize saliency maps during the decision-making process of NDF for both classification and regression problems to understand which part of the input has a larger impact on the model's decision. We also apply NDF to a facial landmark regression problem and obtain the distribution of routing probabilities for the first time. The distribution is consistent with the previous classification work and indicates a decisive behavior.

Acknowledgement. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.