are interpretable since they conduct inference by making decisions. An input is routed by a series of splitting nodes and the conclusion is drawn at one leaf node. Training these models follows a local greedy heuristic [5, 3], where a purity metric such as entropy is used to select the best splitting function from a candidate set at each splitting node. Hand-crafted features were typically used, which limits the model’s representation learning ability.
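As an illustration of the greedy heuristic described above, the following minimal sketch selects, for one splitting node, the candidate threshold that minimizes the weighted entropy of the two children. The function names, the scalar feature and the candidate thresholds are hypothetical and chosen only for illustration; they are not taken from the cited works.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels, candidates):
    """Pick the candidate threshold with the lowest weighted child entropy."""
    best, best_score = None, float("inf")
    for t in candidates:
        left = [y for v, y in zip(values, labels) if v < t]
        right = [y for v, y in zip(values, labels) if v >= t]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if score < best_score:
            best, best_score = t, score
    return best, best_score

# Toy data (assumed): one scalar hand-crafted feature per sample.
values = [0.1, 0.2, 0.8, 0.9]
labels = ["a", "a", "b", "b"]
threshold, score = best_threshold(values, labels, candidates=[0.15, 0.5, 0.85])
```

The choice is local: each node is optimized in isolation, without regard to the splits made elsewhere in the tree.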
formulated a probabilistic routing framework for decision trees. As a result, the loss function is differentiable with respect to the parameters of the splitting functions, enabling global gradient-based optimization. Despite the success of NDF, little effort has been devoted to visualizing its decision-making process. In addition, the deep representation learning ability brought by the soft-routing framework comes at the price of visiting every leaf node in the tree. The model would be more similar to a traditional decision tree, and more interpretable, if only a few leaf nodes contributed to the final prediction. Fortunately, this desired property was demonstrated by the distribution of routing probabilities reported for an image classification problem. To the best of our knowledge, it has not yet been validated for any regression problem.
In this paper, we trace the routing of input images and apply a gradient-based technique to visualize the portions of the input that most affect NDF’s decision-making process. We also apply NDF to a new multi-task regression problem and visualize the distribution of routing probabilities to fill this gap. In summary, our contributions are:
- We trace the decision-making process of NDF and compute saliency maps to visualize which portions of the input influence it most.
- We apply NDF to a new regression problem and visualize the distribution of routing probabilities to validate its interpretability.
2 Related works
Traditional classification and regression trees make predictions by decision making, where hand-crafted features [5, 3] are computed to split the feature space and route the input. Deep neural decision forest (NDF) and its regression variant were proposed to equip traditional decision trees with deep feature learning ability. A gradient-based method was adopted to understand the predictions made by a traditional deep convolutional neural network (CNN). However, this visualization technique has not yet been applied to NDF. Another, orthogonal line of research attempts to learn more interpretable representations and to organize the inference process into a decision tree. Our work differs from these in that it is closer to visualization-based model diagnosis, and no additional loss function is used in the training phase to drive semantically meaningful feature learning.
A deep neural decision forest (NDF) is an ensemble of deep neural decision trees. Each tree consists of splitting nodes and leaf nodes. In general, a tree can have an unconstrained topology, but here we specify every tree as a full binary tree for simplicity. We index the nodes sequentially with integers as shown in Figure 1.
A splitting node $S_i$ is associated with a recommendation (splitting) function $R_i$ that extracts deep features from the input $x$ and gives the recommendation score (routing probability) $s_i = R_i(x)$ that the input is recommended (routed) to its left sub-tree.
We denote the unique path from the root node to a leaf node $L_i$ as a computation path $P_i$. Each leaf node stores one function $M_i$ that maps the input into a prediction vector $p_i = M_i(x)$. To get the final prediction $\hat{P}$, each leaf node contributes its prediction vector weighted by the probability $w_i$ of taking its computation path:
$$\hat{P} = \sum_{i \in \mathcal{N}_l} w_i \, p_i,$$
where $\mathcal{N}_l$ is the set of all leaf nodes. The weight can be obtained by multiplying all the recommendation scores given by the splitting nodes along the path. Assume the path $P_i$ consists of a sequence of $n$ splitting nodes and one leaf node, $\{S_{i_1}^{j_1}, S_{i_2}^{j_2}, \dots, S_{i_n}^{j_n}, L_i\}$, where the superscript of a splitting node denotes the child node to which the input is routed. Here $j_m = 0$ means the input is routed to the left child and $j_m = 1$ means it is routed to the right child. Then the weight can be expressed as
$$w_i = \prod_{m=1}^{n} s_{i_m}^{\,1 - j_m} \, (1 - s_{i_m})^{\,j_m}.$$
Note that the weights of all leaf nodes sum to 1, and the final prediction is hence a convex combination of the prediction vectors of the leaf nodes. In addition, we assume the recommendation and mapping functions mentioned above are differentiable and parametrized by $\theta_i$ at node $i$. The final prediction is then a differentiable function of all the parameters, which we omit above for clarity. A loss function defined on the final prediction can hence be minimized with the back-propagation algorithm.
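The soft routing described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: nodes are stored in heap order (children of node `i` are `2i+1` and `2i+2`), which may differ from the indexing in Figure 1, and the routing scores and leaf vectors are made-up numbers rather than outputs of a trained network.

```python
def leaf_weights(scores, depth):
    """Return the path probability of every leaf of a full binary tree.

    scores[i] is the probability that splitting node i routes the input left.
    Heap order is assumed: children of node i are 2i+1 (left) and 2i+2 (right).
    """
    n_split = 2 ** depth - 1
    assert len(scores) == n_split
    reach = [0.0] * (2 ** (depth + 1) - 1)  # probability of reaching each node
    reach[0] = 1.0
    for i in range(n_split):
        reach[2 * i + 1] = reach[i] * scores[i]          # routed left
        reach[2 * i + 2] = reach[i] * (1.0 - scores[i])  # routed right
    return reach[n_split:]  # the last 2**depth nodes are the leaves

def tree_prediction(weights, leaf_vectors):
    """Convex combination of the leaf prediction vectors."""
    dim = len(leaf_vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, leaf_vectors)) for k in range(dim)]

# Depth-2 tree: the root and its two children are splitting nodes (toy values).
scores = [0.9, 0.8, 0.3]
weights = leaf_weights(scores, depth=2)
leaf_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
prediction = tree_prediction(weights, leaf_vectors)
```

Because every leaf weight is a product of routing probabilities, the weights are non-negative and sum to 1 for any choice of scores.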
Note that all computation paths contribute to the final prediction of this model, unlike in a traditional decision tree, where only one path is taken for each input. We believe the model is more interpretable and more similar to traditional decision trees when only a few computation paths contribute to the final prediction. This has previously been shown to be the case for a classification problem. Here we additionally show the distribution of routing probabilities for a regression problem.
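To understand how the input can influence the decision-making of this model, we take the gradient of the routing probability with respect to the input and name it the decision saliency map (DSM),
$$\mathrm{DSM}_i = \frac{\partial s_i}{\partial x},$$
where $s_i$ is the routing probability at splitting node $i$ and $x$ is the input.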
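To make the definition concrete, the sketch below computes a DSM for a toy splitting function, assumed here to be a sigmoid applied to a linear function of the input. This stand-in is an assumption made for illustration: in the model discussed in this paper the features come from a deep CNN and the gradient would be obtained by automatic differentiation. A finite-difference check is included only to confirm the analytic gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def routing_probability(x, w, b):
    """Toy splitting function: sigmoid of a linear feature of the input."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def decision_saliency(x, w, b):
    """Gradient of the routing probability with respect to each input entry."""
    s = routing_probability(x, w, b)
    return [s * (1.0 - s) * wi for wi in w]

def numerical_saliency(x, w, b, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for k in range(len(x)):
        hi = x[:k] + [x[k] + eps] + x[k + 1:]
        lo = x[:k] + [x[k] - eps] + x[k + 1:]
        diff = routing_probability(hi, w, b) - routing_probability(lo, w, b)
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical three-pixel "image" and splitting-node parameters.
x = [0.2, -0.5, 1.0]
w = [1.5, 0.0, -2.0]
b = 0.1
dsm = decision_saliency(x, w, b)
dsm_check = numerical_saliency(x, w, b)
```

Entries of the map with large magnitude mark input locations where a small change would most alter the routing decision; the second entry is zero here because its weight is zero.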
For a classification problem, the prediction vector of each leaf node is a discrete probability distribution whose length equals the number of classes. The $k$-th entry gives the probability that the input belongs to class $k$. For regression problems, the prediction vector is also real-valued, but its entries do not necessarily sum to 1. The optimization target for classification problems is to minimize the negative log-likelihood loss over the whole training set of $N$ instances $(x_i, y_i)$. For a multi-task regression problem with $N$ instances, we directly use the squared loss function.
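The two training objectives can be written compactly as follows. This is a sketch rather than the exact form used in our experiments: averaging over the instances is an assumed normalization, and the function names are hypothetical.

```python
import math

def nll_loss(predictions, labels):
    """Mean negative log-likelihood; predictions[i][c] is P(class c | x_i)."""
    return -sum(math.log(p[y]) for p, y in zip(predictions, labels)) / len(labels)

def squared_loss(predictions, targets):
    """Mean over instances of the squared Euclidean error."""
    total = 0.0
    for p, t in zip(predictions, targets):
        total += sum((pk - tk) ** 2 for pk, tk in zip(p, t))
    return total / len(targets)

# Toy values: two instances, two classes / two regression outputs.
cls_loss = nll_loss([[0.5, 0.5], [0.25, 0.75]], labels=[0, 1])
reg_loss = squared_loss([[1.0, 2.0], [0.0, 0.0]], targets=[[1.0, 0.0], [3.0, 4.0]])
```

Both losses depend on the network parameters only through the final prediction, so either can be minimized by back-propagation through the routing probabilities.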
In the experiments, we use a deep CNN to extract features from the input and a sigmoid function to compute the recommendation scores from the features. The network parameters and the leaf node prediction vectors are optimized alternately, by back-propagation and by an update rule, respectively. Details about the network architectures, training algorithm and hyper-parameter settings can be found in our supplementary materials (included in the GitHub repository).
4.1 Classification for MNIST and CIFAR-10
Standard datasets provided by PyTorch (https://pytorch.org/docs/0.4.0/_modules/torchvision/datasets) are used. We use one full binary tree of depth 9 for both datasets, but the feature extractor for CIFAR-10 is more complex. The Adam optimizer is used with a learning rate of 0.001. Test accuracies for different datasets and feature extractors are shown in Table 1. For each test image we record the computation path with the largest probability of being taken, and compute DSMs for some random samples as shown in Fig. 2 and Fig. 3. The tree is very decisive, as indicated by the probability of arriving at each splitting node. In addition, the foreground usually affects the decision more, as expected and in line with previous saliency-map findings. Interestingly, the highlights (yellow dots) of the different DSMs along the computation path vary considerably for some examples. This means the network looks at different regions of the input while deciding how to route it. Another interesting observation is that the model misclassifies a dog as a bird when it is not certain about its decision.
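Recording the most probable computation path can be sketched as below. The sketch assumes heap-ordered node indices (children of node `i` are `2i+1` and `2i+2`) and enumerates every leaf, which is affordable for a depth-9 tree with 512 leaves. The function name and the scores are hypothetical.

```python
def most_probable_path(scores, depth):
    """Return (path, probability) for the likeliest root-to-leaf path.

    path is a list of (splitting node index, direction) pairs. Heap order
    is assumed: children of node i are 2i+1 (left) and 2i+2 (right).
    """
    best_path, best_prob = None, -1.0
    for leaf in range(2 ** depth):
        node, prob, path = 0, 1.0, []
        for level in range(depth):
            # Bits of the leaf rank, most significant first: 0 = left, 1 = right.
            go_right = (leaf >> (depth - 1 - level)) & 1
            prob *= (1.0 - scores[node]) if go_right else scores[node]
            path.append((node, "right" if go_right else "left"))
            node = 2 * node + 1 + go_right
        if prob > best_prob:
            best_path, best_prob = path, prob
    return best_path, best_prob

# Toy depth-2 tree: the root slightly prefers left, but the right child is decisive.
scores = [0.6, 0.5, 0.1]
path, prob = most_probable_path(scores, depth=2)
```

In this toy case the likeliest path goes right at the root even though the root itself prefers left, which is why the sketch compares full path probabilities instead of following the larger score at each node. When every splitting node is decisive, the two strategies coincide.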
4.2 Cascaded regression on 3DFAW
Here we study the decision-making process for a more complex multi-coordinate regression problem on the 3DFAW dataset. To the best of our knowledge, this is the first time NDF has been boosted and applied to a multi-task regression problem. For an input image $x$, the goal is to predict the positions of 66 facial landmarks as a vector $y$. We start with an initialized shape $\hat{y}^0$ and use a cascade of NDFs to update the estimated facial shape stage by stage. The final prediction is
$$\hat{y} = \hat{y}^0 + \sum_{t=1}^{T} \Delta y^t,$$
where $T$ is the total number of stages and $\Delta y^t$ is the shape update (model prediction) at stage $t$. We concatenate 66 local patches cropped around the currently estimated facial landmarks as input, and every leaf node stores a vector as the shape update. We use a cascade length of 10, and each stage uses an ensemble of 3 trees, each of depth 5. The model prediction is shown in Fig. 4.
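The stage-by-stage update can be sketched as a simple loop. In the sketch below each stage is an arbitrary callable standing in for an NDF fed with local patches; the stand-in stage, the target values and the two-dimensional "shape" are hypothetical and serve only to show the accumulation of shape updates.

```python
def cascade_predict(image, initial_shape, stages):
    """Refine a shape estimate with a sequence of stage regressors.

    Each stage is a callable (image, current_shape) -> shape update.
    """
    shape = list(initial_shape)
    for stage in stages:
        update = stage(image, shape)
        shape = [s + u for s, u in zip(shape, update)]
    return shape

# Hypothetical stand-in stage: move halfway toward a fixed target shape.
target = [4.0, 8.0]

def halfway_stage(image, shape):
    return [0.5 * (t - s) for t, s in zip(target, shape)]

final_shape = cascade_predict(
    image=None, initial_shape=[0.0, 0.0], stages=[halfway_stage] * 10
)
```

Each stage sees the current estimate, so later stages can correct the residual error left by earlier ones.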
The distribution of recommendation scores for this regression problem is shown in Fig. 5 and is consistent with the results previously reported for classification. This means NDF is also decisive for a regression problem and the model can approximate the decision-making process of traditional regression trees. The input patches to the model and their corresponding DSMs for a randomly chosen splitting node are shown in Fig. 6. From these maps we can tell which parts of the face influence the decision more during the routing of the input.
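One simple way to summarize such a distribution is to bin the recommendation scores and measure how many of them lie close to 0 or 1. The sketch below is illustrative only: the bin count, the margin of 0.1 and the score values are assumptions, not the settings or data behind Fig. 5.

```python
def score_histogram(scores, n_bins=10):
    """Count scores in equal-width bins over [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for s in scores:
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    return counts

def decisive_fraction(scores, margin=0.1):
    """Fraction of scores within `margin` of 0 or 1."""
    return sum(1 for s in scores if s <= margin or s >= 1.0 - margin) / len(scores)

# Made-up recommendation scores collected over splitting nodes and inputs.
scores = [0.01, 0.03, 0.97, 0.99, 0.5, 0.92, 0.08, 1.0]
hist = score_histogram(scores)
frac = decisive_fraction(scores)
```

A histogram concentrated in the first and last bins indicates that most inputs are routed almost deterministically, which is the decisive behavior discussed above.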
We visualize saliency maps during the decision-making process of NDF for both classification and regression problems to understand which parts of the input have a larger impact on the model’s decisions. We also apply NDF to a facial landmark regression problem and obtain the distribution of routing probabilities for the first time. The distribution is consistent with the previous classification work and indicates decisive behavior.
Acknowledgement. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
- [1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In , pages 770–778, 2016.
- [2] L. A. Jeni, S. Tulyakov, L. Yin, N. Sebe, and J. F. Cohn. The first 3d face alignment in the wild (3dfaw) challenge. In European Conference on Computer Vision, pages 511–520. Springer, 2016.
- [3] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, June 2014.
- [4] P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulò. Deep neural decision forests. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1467–1475, Dec 2015.
- [5] S. Liao, A. K. Jain, and S. Z. Li. A fast and accurate unconstrained face detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):211–223, Feb 2016.
- [6] W. Shen, Y. Guo, Y. Wang, K. Zhao, B. Wang, and A. L. Yuille. Deep regression forests for age estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- [7] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
- [8] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- [9] Q. Zhang, Y. N. Wu, and S. Zhu. Interpretable convolutional neural networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8827–8836, June 2018.
- [10] Q. Zhang, Y. Yang, Y. N. Wu, and S. Zhu. Interpreting cnns via decision trees. CoRR, abs/1802.00121, 2018.