Improving the Interpretability of Deep Neural Networks with Knowledge Distillation

12/28/2018 ∙ by Xuan Liu, et al. ∙ Dalhousie University

Deep Neural Networks have achieved huge success across a wide spectrum of applications, from language modeling and computer vision to speech recognition. However, good performance alone is no longer sufficient to satisfy the needs of practical deployment, where interpretability is demanded in cases involving ethics and mission-critical applications. The complexity of Deep Neural Networks makes it hard to understand and reason about their predictions, which hinders their further progress. To tackle this problem, we apply the Knowledge Distillation technique to distill Deep Neural Networks into decision trees in order to attain good performance and interpretability simultaneously. We formulate the problem at hand as a multi-output regression problem, and the experiments demonstrate that the student model achieves significantly better accuracy (about 1% to 5%) than vanilla decision trees at the same tree depth. The experiments are implemented on the TensorFlow platform to make them scalable to big datasets. To the best of our knowledge, we are the first to distill Deep Neural Networks into vanilla decision trees on multi-class datasets.




I Introduction

Despite the superior discrimination power of Deep Neural Networks (DNN) in many fields, the logic behind the hidden feature representations before the output layer still remains a black box. Understanding why a specific prediction is made is of utmost importance for end-users to trust and adopt the model, and for system designers to refine the model by performing feature engineering and parameter tuning. This is especially true for high-stakes domains such as clinical decision support, disaster response and recidivism prediction.

For instance, decision trees are preferred over DNN in the health care domain for disease diagnosis due to their ease of interpretation [1] [2] [3]. However, decision trees overfit easily and perform poorly on large heterogeneous electronic health record (EHR) datasets [4]. It is therefore desirable to develop models that find a point where both interpretability and performance are optimized. In recent years, there has been resurgent interest in designing interpretable machine learning models. It should be noted that the interpretability or transparency of a model is still not clearly defined in the literature [5] [6].

An intuitive and natural way to interpret neural networks is through visualization. A number of works have been done in this area [7]. In [8], two tools are introduced: one plots the activations produced on each layer of a trained neural network; the other visualizes the learned features computed by each neuron at each layer of a neural network. A review of visualization methods for interpreting deep convolutional neural nets is provided in [9]. However, recent research [10] shows that it is the space, not the individual units, that contains the semantic information in the higher layers of neural networks, which means that the common approach of activation maximization [7] [11] [12] [13] [14] previously applied for interpretation has flaws. A related suggestion was given in [15] to abandon the idea of inspecting individual hidden units. Thus alternative solutions for interpretation are required.

Network diagnosis [16] is another approach. Earlier research in this area focuses on designing inherently interpretable models such as decision lists [17], decision sets [18], additive models [19], sparse linear models [20], etc. However, this approach severely constrains the selection of algorithms. Besides, although humans can comprehend these models, they fail to model more complex problems with good accuracy.

In this paper, we apply the most recent model-agnostic approach [21], which performs post-hoc explanations on trained models. Past research focuses on either global interpretations [22] [23] or local explanations [24] [25] [26]. We concentrate on global interpretations and adopt knowledge distillation to improve the global interpretation results.

Knowledge distillation refers to the process of transferring the dark knowledge learned by a teacher model (usually sophisticated and large) to a student model (usually shallow and small). Dark knowledge [27] [28] is the salient information hidden in the “soft targets” (predicted probabilities for all classes), which are more informative than the “hard targets” (predicted classes). Perhaps the pioneering work on distilling the knowledge from a neural network into another algorithm is by Craven and Shavlik [29], who used a symbolic algorithm, the decision tree [30], to approximate the functions learned by a neural network with one hidden layer using hard targets.

Knowledge distillation originates from model compression [31]. In [31], the teacher model was built using the ensemble selection algorithm [32] and was then used to label unseen unlabeled data, which became the training data for the student model (also called the transfer data). This approach uses the hard targets produced by the teacher model. A follow-up work [33] distills deep nets into shallow feed-forward nets by adopting the method of “matching logits” (scores before the softmax activations), which avoids the information loss incurred when passing from logits to the probability space. Then the concept of “knowledge distillation” was officially introduced in [27]. It is a more general solution for transferring knowledge from a cumbersome model to a compact model. The authors look for an optimal temperature (which they insert into the softmax layer) by raising the temperature of the final softmax layer of the teacher model until a suitable set of soft targets is generated. Then they apply the same temperature to the student model. They also proved that “matching logits” is actually a special case of their distillation approach.

Afterwards, a number of works followed, such as [34] [35] [36], to name just a few. Most of these works concentrate on distilling complex and deep neural nets into simple and shallow neural nets. They are mainly applied in scenarios such as edge computing hardware and on-the-fly training, where memory, resource, power, time and space constraints must be met without significant loss in performance.

In our work, we employ knowledge distillation for another purpose: interpretation. We resolve the tension between interpretability and accuracy by distilling deep neural nets into vanilla decision trees. This is a work in progress, and as the first step of our attempts we apply the matching logits approach in [33]. The main obstacle to executing this plan is that, for pure classification tasks, decision trees have no logits like those in neural nets that could be used in the loss function. We address this issue by reformulating the task as a multi-output regression problem [37] and achieve significant accuracy improvements (about 1% to 5%) in the experiments. Hence, the success of our approach opens a door for turning inherently interpretable algorithms (which are highly interpretable, but worse in accuracy) into models attaining both accuracy and interpretability simultaneously.

II Related Work

Perhaps the most related work is the model in [15], which uses a type of soft decision tree to mimic the input-output functions of a trained DNN. This soft decision tree produces hierarchical decisions, which are easier to interpret than a DNN that relies on hierarchical features. It is modeled on a hierarchical mixture of experts and trained with gradient descent. The way they design the soft decision tree is quite similar to [38]. Knowledge distillation was then used to improve the soft decision tree’s accuracy. The difference between their approach and ours is that we use a vanilla decision tree as the student model, while their student model is a soft decision tree, which has an architecture similar to neural networks and can be more easily adapted to the original knowledge distillation framework.

In the health care domain, two pipelines [41] [4] are proposed to distill the knowledge from a DNN into Gradient Boosting Trees (GBTs). One of them extracts the logits from a learned DNN and uses the logits and the true labels of the original training data to train a logistic regression algorithm to obtain the soft prediction scores. The next step is to train GBTs with the original training data’s features and the soft predictions. The second pipeline directly applies the soft prediction scores of the trained DNN on the original training data as targets for training a mimic model with GBTs. However, GBTs lack transparency as they rely on post-hoc determinations such as partial dependence [39], which would result in bias in this process [49]. The differences between their approach and ours are apparent. The strategy we applied when training the mimic model is matching logits, not matching soft targets. Also, our student model is a decision tree.

Another approach that distills neural networks into GBTs is in [42]. They tried two student models: tree-based generalized additive models (GA2Ms) [19] [43] [44] and GBTs. The teacher models they adopted are multilayer perceptrons. For the student model’s training process, they applied the method of matching logits instead of the soft targets used in [41] [4]. They investigated both classification and regression problems. However, their model is limited to binary classification problems, and their results are not yet conclusive and not published. Compared to their method, our teacher model is a DNN and our student model is a decision tree. We aim at multi-class classification problems.

Instead of doing post-hoc interpretations, the work in [45] focuses on finding more interpretable neural networks during the training process. They created a new model complexity penalty function, tree regularization, to favor models whose decision boundaries can be well approximated by small decision trees. They measure human simulatability as the average decision path length and make the decision tree loss differentiable by adopting derivative-free optimization techniques [46]. Their experiments show that using tree regularization can achieve high accuracy at low complexity. Our method belongs to the post-hoc interpretations, which is different from what they proposed.

A more recent work [47] combines knowledge distillation and dimension reduction to visualize the results of deep classifiers. They pointed out that t-distributed stochastic neighbor embedding (t-SNE), the method commonly used for visualizing the activations of hidden layers, is problematic. They propose to visualize the data points that are assigned similar probability vectors to give practitioners a sense of how the decisions are made on test cases. They train a simpler and more interpretable classifier using the soft targets generated by a deep classifier. The student model they applied is Naive Bayes.

III Methodology

Unlike common approaches that distill DNN into shallow neural networks, we investigate distillation into non-neural models. The deep models we focus on are Convolutional Neural Networks (CNN). We first introduce some background information about CNN, decision trees and knowledge distillation and then describe our own methodology in detail.

III-A Convolutional Neural Networks

The architecture of CNN is similar to the LGN–V1–V2–V4–IT hierarchy in the ventral pathway of the visual cortex [50]. It is designed to process data that has a known, grid-like topology, e.g. image data that has a 2D grid of pixels. Its typical framework is a stack of convolutional-pooling layers followed by fully connected layers. Also, the results of the convolutional layer have to pass through a nonlinear activation function. A commonly used one is the rectified linear unit (ReLU) [51]. The convolutional and pooling layers originate from the concepts of simple cells and complex cells in visual neuroscience [52].

  • Convolutional layer. The feature maps of the convolutional layer are generated by performing discrete convolutions between a series of weights and the results of the previous layer. These weights are called filter banks [53] or kernels [54]. For a 2-D grayscale input $x$, the value of a specific unit $(i, j)$ of feature map $k$ in the first convolutional layer with kernel size $m \times m$ is calculated as

    $$a^{k}_{i,j} = \sum_{u=0}^{m-1} \sum_{v=0}^{m-1} w^{k}_{u,v}\, x_{i+u,\, j+v} + b^{k} \tag{1}$$

    The first term in (1) is the convolution operation and the second term is the bias for this feature map. If the input image has multiple channels, the first term is summed over all these channels to produce one unit in the corresponding feature map. Within a feature map, all units share the same filter bank and bias. The convolution is complete once the filter bank has slid across the width and height of the input image.

  • ReLU. Following the convolutional layer is a non-saturating nonlinearity function: ReLU. For a specific unit $a$, its value after passing through this function is

    $$f(a) = \max(0, a)$$

    It was reported [55] that for gradient descent training, using ReLU could make training several times faster than saturating nonlinearities such as $\tanh(a)$.

  • Pooling layer. It works as a down-sampling tool and merges semantically similar units into one. At this layer, the value of the output unit is a summary statistic of the nearby outputs of the previous layer. Usually, the max pooling function [56] is applied here. It outputs the maximum value within a rectangular area. For instance, if this rectangular area has size $p \times p$ and is applied without overlap to a feature map of the previous layer of size $n \times n$ (with $n$ divisible by $p$), the resulting feature map will have size $(n/p) \times (n/p)$. The pooling layer reduces the dimensions of the representations and hence helps speed up the training process. In addition, it helps to make the feature maps invariant to small shifts and distortions of the inputs.
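The three layer types described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: it assumes a single grayscale channel, no padding, stride 1 and non-overlapping pooling, and the filter values are hypothetical. As in most deep learning frameworks, the kernel is applied without flipping.

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Slide one kernel over a 2-D grayscale image (no padding, stride 1)."""
    m, n = kernel.shape
    out_h = image.shape[0] - m + 1
    out_w = image.shape[1] - n + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum over the m x n window plus the shared bias.
            out[i, j] = np.sum(kernel * image[i:i + m, j:j + n]) + bias
    return out

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0.0, x)

def max_pool(feature_map, p=2):
    """Non-overlapping p x p max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // p
    w = feature_map.shape[1] // p
    trimmed = feature_map[:h * p, :w * p]
    return trimmed.reshape(h, p, w, p).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6 x 6 "image"
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)          # hypothetical 3 x 3 filter
feature_map = relu(conv2d_valid(image, kernel, bias=0.5))
pooled = max_pool(feature_map, p=2)
```

On this toy input the 6 x 6 image shrinks to a 4 x 4 feature map after the unpadded convolution and to 2 x 2 after pooling, which mirrors the down-sampling role of the pooling layer.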

In the training process, CNN performs backward propagation similar to the regular fully connected networks so that all the weights could be updated.

III-B CART for Regression

There are several versions of the decision tree algorithm. The earliest version, Iterative Dichotomiser 3 (ID3) [57], was proposed by Quinlan in 1986. It uses information gain as its attribute selection measure and requires features to be categorical. C4.5 [58] is a successor of ID3, also by Quinlan, and removes the restrictions of ID3 on features. Classification and Regression Trees (CART) [59] was introduced in 1984 by Breiman et al. Although CART and C4.5 were invented by different authors, they follow similar ideas for training decision trees. Because CART supports numerical target values (regression) and the key to our methodology is to solve a multi-output regression problem, we briefly introduce the CART algorithm here. CART applies a greedy approach which constructs the decision tree in a top-down recursive divide-and-conquer manner. As our experiments apply CART for regression, the description focuses on regression tasks.

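This algorithm partitions the feature space and groups instances with similar labels together. Initially, it constructs a root node with all training samples $S$, with features $x_i \in \mathbb{R}^{d}$ for $i = 1, \dots, N$ and labels $y_i$, and splits nodes into two child nodes recursively. In the following, $S$ also denotes the $N_n$ samples that reach the node $n$ being split. The splitting criterion is $C = (j, t_n)$, where $j$ is the attribute to split on and $t_n$ is the threshold at node $n$. This criterion partitions $S$ into

$$S_l(C) = \{(x_i, y_i) \in S : x_{ij} \le t_n\}, \qquad S_r(C) = S \setminus S_l(C).$$

The impurity at node $n$ is calculated with an impurity function $H(\cdot)$. For our regression task, we applied the Mean Squared Error method to calculate the impurity. Hence, for a child node $c \in \{l, r\}$, $H(S_c)$ is calculated as

$$\bar{y}_c = \frac{1}{N_c} \sum_{(x_i, y_i) \in S_c} y_i, \qquad H(S_c) = \frac{1}{N_c} \sum_{(x_i, y_i) \in S_c} \left(y_i - \bar{y}_c\right)^2,$$

where $N_c$ is the number of instances in the corresponding child node. Hence, based on $H$, the impurity for both nodes can be expressed as

$$G(S, C) = \frac{N_l}{N_n} H\left(S_l(C)\right) + \frac{N_r}{N_n} H\left(S_r(C)\right).$$

Then the parameters in $C$ could be optimized by minimizing $G$:

$$C^{*} = \arg\min_{C} G(S, C).$$

Thus, the optimal attribute and the splitting threshold are found. Then the algorithm recursively splits $S_l(C^{*})$ and $S_r(C^{*})$ until the maximum depth specified by the user is reached, a node becomes pure, or a node holds too few instances to be split further.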
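The split search at a single node can be sketched as follows. This is a simplified illustration under stated assumptions rather than the CART implementation used in the experiments: it searches exhaustively over all features, uses midpoints between consecutive distinct feature values as candidate thresholds, and the helper names `mse_impurity` and `best_split` are hypothetical.

```python
import numpy as np

def mse_impurity(y):
    """Mean squared error of the targets around their mean (0 for an empty node)."""
    if len(y) == 0:
        return 0.0
    return float(np.mean((y - y.mean()) ** 2))

def best_split(X, y):
    """Return (feature, threshold, score) minimizing the weighted child impurity."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)
    for j in range(n_features):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = X[:, j] <= t
            n_left = int(left.sum())
            n_right = n_samples - n_left
            score = (n_left * mse_impurity(y[left])
                     + n_right * mse_impurity(y[~left])) / n_samples
            if score < best[2]:
                best = (j, float(t), score)
    return best

# Toy data: the first feature separates the two target levels perfectly.
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([1.0, 1.0, 5.0, 5.0])
feature, threshold, impurity = best_split(X, y)
```

A full tree would apply `best_split` recursively to the left and right subsets until one of the stopping conditions is met.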

III-C Matching Logits

Knowledge distillation transfers the generalization ability of a complex teacher model to a simple student model. Using the teacher model’s soft targets for distillation can produce much better outcomes than using hard targets. Fig. 1 shows an example of hard and soft targets. Hard targets contain only the predicted label, while soft targets reveal the predicted probabilities for all the classes. Many previous works [22] [31] [60] adopt only the hard targets (the predicted labels of the teacher model) for distillation, whereas soft targets can in fact boost the results significantly.

When we examine Fig. 1 closely, we notice that the probabilities for “cow” and “car” are much smaller than those of “dog” and “cat”. When training student models with the cross-entropy cost function, these much smaller probabilities are so close to zero that they have almost no influence. Taking CNN as an example, the last layer before the softmax layer is a fully connected layer with logits as the output

$$z_k = \sum_{m=1}^{M} w_{km} h_m + b_k$$

Here $z_k$ is the logit for one of the classes $k \in \{1, \dots, K\}$. $M$ is the number of hidden nodes of the preceding layer, whose activations are $h_m$. $w$ and $b$ are the weights and bias respectively. The softmax layer calculates the output probabilities for each class as

$$p_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} \tag{10}$$

The cross-entropy function is then applied to calculate the loss of the model

$$L = -\sum_{k=1}^{K} y_k \log p_k$$

where $y_k$ is the target for class $k$. Hence, to avoid the loss of information, it is desirable to use the logits $z_k$ instead of the predicted probabilities $p_k$. This method is called “matching logits” and the pioneering work was done in [33]. Hinton et al. [27] extended this work to a more general case by inserting a temperature term $T$ into (10)

$$p_k = \frac{\exp(z_k / T)}{\sum_{j=1}^{K} \exp(z_j / T)}$$

and they demonstrated mathematically that, in the high temperature limit and when the logits are zero-meaned separately for each training instance of the student model, matching logits is a special case of using the soft targets for distillation. They showed this by examining the gradient of the cross-entropy function with respect to each logit of the student model, which in this limit is proportional to the difference between the student logit and the teacher logit for the same training instance.

Fig. 1: Examples of hard and soft targets.

For a more detailed derivation, please refer to [27].
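The effect of the temperature on soft targets can be illustrated with a small sketch. The logit values below are hypothetical and only mimic the four classes of the Fig. 1 example; they are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (max subtracted for stability)."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cow", "dog", "cat", "car"]
logits = [-2.0, 6.0, 3.0, -4.0]          # hypothetical teacher logits

soft_t1 = softmax(logits, temperature=1.0)   # ordinary soft targets
soft_t5 = softmax(logits, temperature=5.0)   # softened by a higher temperature
# Hard target: one-hot vector for the most probable class.
hard = [1 if p == max(soft_t1) else 0 for p in soft_t1]
```

At temperature 1 the probabilities of the unlikely classes are tiny, so they contribute little to a cross-entropy loss. Raising the temperature spreads the probability mass, while the logits themselves carry the full ranking information without any rescaling.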

III-D Distilling CNN into Decision Trees

In this work, as the first step of our attempts, we employ the matching logits method when distilling CNN into vanilla decision trees. Fig. 2 illustrates the framework of our method. In this figure, the architecture of the CNN is the one used to train on the MNIST data as in [61]. It comprises two convolutional layers and two pooling layers followed by two fully connected layers: fc1 and fc2. After this deep CNN is trained, we feed the features $x$ of the original training data to the trained model to obtain the corresponding logits $z$. Then we train CART with $x$ as the inputs and $z$ as the targets.

However, some problems arise in deployment. First, for classification tasks, the targets are limited to categorical values rather than numerical, continuous ones. We resolve this by treating the task as a regression problem. Second, even for regression tasks, most algorithms only support single-output regression. For multi-class datasets, this is actually a multi-output regression problem [37]. We apply the algorithm adaptation method, where we use decision trees to directly handle all outputs of a multi-output dataset simultaneously. This is anticipated to produce much better results than the problem transformation method, which transforms the multi-output regression problem into independent single-output problems that are then solved by single-output regression algorithms, because problem transformation methods do not consider the dependencies among the targets.

Fig. 2: Framework of our method.

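So the key novelty of this paper is that we first treat the problem at hand as a multi-output regression problem and then translate the regression results to achieve the goal of classification. Hence, the regression data for CART should have features $x_i \in \mathbb{R}^{d}$ for $i = 1, \dots, N$ and labels $z_i = (z_{i1}, \dots, z_{iK})$, the teacher logits, with $K$ the number of classes. The impurity function in CART for the set $S_c$ of $N_c$ instances in a child node is then calculated as

$$H(S_c) = \frac{1}{N_c} \sum_{(x_i, z_i) \in S_c} \sum_{k=1}^{K} \left(z_{ik} - \bar{z}_{ck}\right)^2, \qquad \bar{z}_{ck} = \frac{1}{N_c} \sum_{(x_i, z_i) \in S_c} z_{ik}.$$

Once CART is trained, in order to obtain the final prediction results on test cases we need to add a softmax layer over the test results of CART to turn numerical test results into categorical ones. Assuming the test result of CART for an instance is $\hat{z} = (\hat{z}_1, \dots, \hat{z}_K)$, the final output probability for class $k$ is therefore

$$p_k = \frac{\exp(\hat{z}_k)}{\sum_{j=1}^{K} \exp(\hat{z}_j)}.$$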
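The whole pipeline of this subsection can be sketched with scikit-learn, whose `DecisionTreeRegressor` accepts a two-dimensional target array and fits one tree on all outputs at once. This is a minimal sketch under stated assumptions: the "teacher" below is a hypothetical fixed random linear map standing in for a trained CNN or MLP, and the data are synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained teacher: a fixed random linear map from
# features to K logits. In the paper the logits come from a trained CNN or MLP.
n_train, n_test, n_features, n_classes = 500, 100, 8, 3
W = rng.normal(size=(n_features, n_classes))

def teacher_logits(X):
    return X @ W

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Z_train = teacher_logits(X_train)            # shape (n_train, n_classes)

# One tree fitted on all K logit columns at once (multi-output regression).
student = DecisionTreeRegressor(max_depth=6, random_state=0)
student.fit(X_train, Z_train)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

Z_pred = student.predict(X_test)             # predicted logits per test instance
probs = softmax(Z_pred)                      # softmax over the tree outputs
y_pred = probs.argmax(axis=1)                # final class decision

# Fraction of test instances where student and teacher pick the same class.
fidelity = float(np.mean(y_pred == teacher_logits(X_test).argmax(axis=1)))
```

Fitting a single tree on the full logit matrix corresponds to the algorithm adaptation method. Fitting one separate tree per logit column would instead be an instance of the problem transformation method.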


IV Experiments

We performed the experiments on two datasets to demonstrate the effectiveness of our distillation approach. All teacher models are implemented on the TensorFlow [62] platform to make them scalable to big datasets.

IV-A Datasets

The two datasets we selected are the MNIST dataset [63] and the Connect-4 dataset from the UCI repository [64]. MNIST is a famous benchmark dataset for deep learning. It contains the pixel values of handwritten digits from 0 to 9. Each instance stands for a $28 \times 28$ grayscale image and contains 784 features when flattened into a one-dimensional vector. The Connect-4 dataset stores information about the two players’ positions in the game of Connect-4. The game has a seven-column, six-row vertically suspended grid. There are two players, and each spot on the grid records whether it has been taken by the first player, taken by the second player, or left blank. The classes are the outcomes for the first player. Details of these datasets can be found in Table I.

Dataset    #Features  #Train  #Test   Labels
MNIST      784        55,000  10,000  0-9
Connect-4  42         57,557  10,000  win, loss, draw
TABLE I: Datasets

IV-B Experimental Setup

The deep learning model we applied to the MNIST dataset is a deep CNN with two convolutional layers followed by two fully connected layers. The parameter settings for this network are listed in Table II. The first convolutional layer uses 32 filters with stride 1 and the ‘same’ padding in TensorFlow. When the stride length is 1, ‘same’ padding generates a feature map with the same size as the input image. This stage therefore produces 32 feature maps, each of size $28 \times 28$. The subsequent max pooling layer over blocks with stride 2 generates 32 feature maps of size $14 \times 14$. The parameter settings for the second convolutional layer and pooling layer are the same as for the first ones, except that this stage generates 64 feature maps. Hence, we have 64 feature maps, each of size $7 \times 7$. Then we flatten these features into a one-dimensional list and apply a fully connected layer, fc1, with 1024 hidden nodes. Immediately after fc1 is the dropout [65] layer, where we set the dropout rate to 0.5. The second fully connected layer, fc2, is the output layer with 10 nodes, each representing one of the digits 0-9. These outputs are also the logits of this model.
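The feature-map sizes in this paragraph follow from simple arithmetic, sketched below. The sketch assumes TensorFlow-style ‘same’ padding for both the convolutional and the pooling layers, under which the output length along each axis is the input length divided by the stride, rounded up; the helper name is hypothetical.

```python
import math

def same_padding_output(size, stride):
    """Output length along one axis under TensorFlow-style 'same' padding."""
    return math.ceil(size / stride)

size = 28                                      # MNIST images: 28 x 28 = 784 pixels
conv1 = same_padding_output(size, stride=1)    # first convolution, stride 1
pool1 = same_padding_output(conv1, stride=2)   # first max pooling, stride 2
conv2 = same_padding_output(pool1, stride=1)   # second convolution, stride 1
pool2 = same_padding_output(conv2, stride=2)   # second max pooling, stride 2

# Number of inputs to fc1 after flattening 64 feature maps.
flattened = pool2 * pool2 * 64
```

With these settings the spatial size halves at each pooling stage, and the flattened vector that feeds fc1 has one entry per unit of the 64 final feature maps.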

The Connect-4 dataset has a class distribution of win (65.83%), loss (24.62%) and draw (9.55%). We randomly sample 10,000 test instances that preserve the original class distribution. The algorithm we applied to the Connect-4 dataset is a multilayer perceptron (MLP) with the parameter settings in Table III. It has three hidden layers, with 256, 128 and 128 hidden nodes respectively, and an output layer with 3 nodes representing the three outcomes of the Connect-4 game. We also apply dropout after each of the hidden layers, with the value set to 0.8. When calculating the training loss, in addition to TensorFlow’s own cross-entropy function, we also add an L2 penalty (regularization term) as in Python’s scikit-learn machine learning tool. This penalty parameter is set to 0.0001, which helps to improve the MLP’s performance.

Network Type: CNN
           conv:filter  conv:stride  pool:block  pool:stride  fc1
MNIST      –            1            –           2            1024
TABLE II: Parameter settings for MNIST

Network Type: MLP
           1st hidden  2nd hidden  3rd hidden  out  dropout
Connect-4  256         128         128         3    0.8
TABLE III: Parameter settings for Connect-4

IV-C Experimental Results

For decision tree classification, we apply the modules in the scikit-learn machine learning tool. When performing classification tasks with a decision tree, there are a variety of parameters to tune, such as the minimum number of samples per leaf, the strategy used to choose the split at each node (either the best split or the best random split) and so on. We select two parameters that influence the performance of a decision tree substantially: the maximum depth of the tree and the function used to measure the impurity of a split (either “gini” or “entropy”). The other parameters are left at their default values in scikit-learn.
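A sweep over these two parameters for the vanilla baseline might look like the sketch below. It is only an illustration of the setup: the small `load_digits` dataset bundled with scikit-learn is used as a stand-in for MNIST, and the split ratio and random seeds are assumptions, so the accuracies it produces are not comparable to those reported in this paper.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 8 x 8 digit images bundled with scikit-learn (not MNIST).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

results = {}
for depth in range(6, 11):
    for criterion in ("gini", "entropy"):
        # All other parameters keep their scikit-learn defaults.
        tree = DecisionTreeClassifier(
            max_depth=depth, criterion=criterion, random_state=0
        )
        tree.fit(X_train, y_train)
        results[(depth, criterion)] = tree.score(X_test, y_test)
```

Each entry of `results` holds the test accuracy for one combination of depth and impurity measure, the same grid that the vanilla columns of the result tables cover.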

For the MNIST dataset, the teacher CNN model achieves an accuracy of 99.25%. The performance for the student model and the vanilla decision tree classification results are shown in Table IV. “Acc_student” represents the accuracy of the student decision tree trained using the logits of the teacher CNN model on TensorFlow.

Tree Depth  Acc_student  Acc_gini  Acc_entropy
6           0.7119       0.6644    0.6849
7           0.7685       0.7534    0.7228
8           0.8125       0.7914    0.8007
9           0.8512       0.8151    0.8304
10          0.8655       0.8445    0.8450
TABLE IV: Test Accuracy Results for MNIST

Fig. 3: Distillation results for MNIST.

“Acc_gini” is the accuracy of the decision tree without distillation when the impurity measure is “gini” in scikit-learn, trained using the same training and test data as the CNN model. “Acc_entropy” is the classification accuracy of the decision tree when the impurity measure is “entropy”. We highlighted the best performance in bold. At every tree depth, the student model outperforms the vanilla decision tree. The same conclusion holds true for the Connect-4 dataset in Table V, where the accuracy of the MLP teacher model is 86.62%. The reason we limit the tree depth to 10 is that we would like to construct interpretable models, and trees deeper than 10 levels become extremely hard for humans to comprehend. We also plot these results in Fig. 3 and Fig. 4 to present them more intuitively.

IV-D Discussion

For the Connect-4 dataset, although we can fine-tune the parameters of the teacher model or switch the teacher model to a CNN to improve the teacher model’s performance, the distillation effect still relies largely on the student model’s own generalization ability. For instance, the teacher model for the MNIST dataset already has a very high accuracy of 99.25%, but the student model’s highest accuracy in Table IV is only 86.55%. However, our experiments show that training a good teacher model does help to boost the distillation results. In our experiments, distillation improves the accuracy by about 1% to 5%. Hence, there is still a long way to go for the student model to match the results of the teacher model.
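The size of the improvement can be checked directly from the reported accuracies. The sketch below copies the rows of Tables IV and V and computes, for each depth, the gain of the student in percentage points over the better and over the weaker of the two vanilla trees; the helper name `gains` is hypothetical.

```python
# Rows are (depth, Acc_student, Acc_gini, Acc_entropy), copied from Tables IV and V.
mnist = [
    (6, 0.7119, 0.6644, 0.6849),
    (7, 0.7685, 0.7534, 0.7228),
    (8, 0.8125, 0.7914, 0.8007),
    (9, 0.8512, 0.8151, 0.8304),
    (10, 0.8655, 0.8445, 0.8450),
]
connect4 = [
    (6, 0.6943, 0.6816, 0.6835),
    (7, 0.6999, 0.6919, 0.6832),
    (8, 0.707, 0.675, 0.6625),
    (9, 0.723, 0.6927, 0.6974),
    (10, 0.7342, 0.7044, 0.7006),
]

def gains(rows, baseline=max):
    """Student accuracy minus a vanilla baseline, in percentage points per depth."""
    return {
        depth: round(100 * (student - baseline(gini, entropy)), 2)
        for depth, student, gini, entropy in rows
    }

mnist_vs_best = gains(mnist, baseline=max)      # against the better vanilla tree
mnist_vs_worst = gains(mnist, baseline=min)     # against the weaker vanilla tree
connect4_vs_best = gains(connect4, baseline=max)
```

Comparing against the better baseline gives the more conservative estimate of what distillation adds at each depth.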

Tree Depth  Acc_student  Acc_gini  Acc_entropy
6           0.6943       0.6816    0.6835
7           0.6999       0.6919    0.6832
8           0.707        0.675     0.6625
9           0.723        0.6927    0.6974
10          0.7342       0.7044    0.7006
TABLE V: Test Accuracy Results for Connect-4

Fig. 4: Distillation results for Connect-4.

We are also curious about the performance of the student models and the vanilla decision trees when the maximum depth of the tree is not specified. In this situation, for the MNIST dataset, we found that the accuracy of the student model was 88.28%, while the decision tree classifier achieved 87.4% with the “gini” criterion. For the Connect-4 dataset, when the teacher model has an accuracy of 83.22%, the student model achieves 79.06% and the vanilla decision tree achieves 77.57% when using “gini” as the impurity measure. We notice that the accuracy improvements are smaller than in the cases where the depth of the trees is specified. This is easy to explain: when the tree depth is not limited, the vanilla decision tree grows much deeper than the student model to arrive at its accuracy. Hence these decision trees are far less interpretable than the student models, because tree depth determines the interpretability of decision trees. After all, our experiments already showed that at the same tree depth, the vanilla decision tree performs worse than the student models.

V Conclusion and Future Work

Based on the fact that inherently interpretable algorithms perform worse than some non-interpretable algorithms such as the deep learning algorithm, this paper presents an approach to improve the accuracy performance of an inherently interpretable algorithm: decision tree. This is achieved by utilizing the dark knowledge hidden in the soft predictions of DNN. We apply the matching logits method which employs the logits of DNN for training student decision tree models. Experiments on two datasets: MNIST and Connect-4 demonstrate the significant improvements on the accuracy of the distilled student model over vanilla decision trees.

Our work is still in progress and there are several directions for future work. First, as specified in [37], there are various methods to solve the multi-output regression problem. The method we adopted is the algorithm adaptation method. It is worthwhile to explore other methods to take full advantage of the power of knowledge distillation. Second, our approach makes it possible to improve the performance of inherently interpretable models in general, and it is therefore rewarding to design new inherently interpretable models that could finally match the performance of non-interpretable models. Last, it would also be worthwhile to add a temperature term into the softmax layer (as introduced in the methodology part) and to use both soft targets and the true labels together (as carried out in [27]) to train the student model.


The authors would like to thank Xiang Jiang, Zhengping Che, Sarah Tan and Nicholas Frosst for helpful discussions.


  • [1] G. Bonner, “Decision making for health care professionals: use of decision trees within the community mental health setting,” Journal of Advanced Nursing, 35(3), pp.349–356,2001.
  • [2] Z. Yao, P. Liu, L. Lei and J. Yin, “R-C4. 5 Decision tree model and its applications to health care dataset,” In Services Systems and Services Management, Proceedings of ICSSSM’05. International Conference on (Vol. 2, pp. 1099-1103). IEEE.2005, June.
  • [3] C.Y. Fan, P.C. Chang, J.J. Lin and J.C. Hsieh, “A hybrid model combining case-based reasoning and fuzzy decision tree for medical data classification,” Applied Soft Computing, 11(1), pp.632–644,2011.
  • [4] Z. Che, S. Purushotham, R. Khemani and Y. Liu, “Interpretable deep models for ICU outcome prediction,” In AMIA Annual Symposium Proceedings, Vol. 2016, p. 371, American Medical Informatics Association, 2016.
  • [5] Z. C. Lipton, “The mythos of model interpretability,” In ICML Workshop on Human Interpretability in Machine Learning (WHI), 2016.
  • [6] F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” Technical Report, arXiv preprint arXiv:1702.08608, 2017.
  • [7] M.D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” In European conference on computer vision, pp. 818–833, Springer, Cham. September 2014.
  • [8] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs and H. Lipson, “Understanding neural networks through deep visualization,” 31st International Conference on Machine Learning, ICML 2015.
  • [9] Q.S. Zhang and S.C. Zhu, “Visual interpretability for deep learning: a survey,” Frontiers of Information Technology & Electronic Engineering, 19(1), pp.27–39, 2018.
  • [10] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199. 2013.
  • [11] I. Goodfellow, H. Lee, Q.V. Le, A. Saxe and A.Y. Ng, “Measuring invariances in deep networks,” In Advances in neural information processing systems, pp. 646–654, 2009.
  • [12] R. Girshick, J. Donahue, T. Darrell and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587, 2014.
  • [13] K. Simonyan, A. Vedaldi and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” arXiv preprint arXiv:1312.6034, 2013.
  • [14] G. Montavon, W. Samek and K.R. Müller, “Methods for interpreting and understanding deep neural networks,” Digital Signal Processing, 2017.
  • [15] N. Frosst and G. Hinton, “Distilling a neural network into a soft decision tree,” arXiv preprint arXiv:1711.09784, 2017.
  • [16] Q. Zhang, Y. Yang, Y.N. Wu and S.C. Zhu, “Interpreting CNNs via decision trees,” arXiv preprint arXiv:1802.00121, 2018.
  • [17] B. Letham, C. Rudin, T. H. McCormick, and D. Madigan, “Interpretable classifiers using rules and bayesian analysis: Building a better stroke prediction model,” The Annals of Applied Statistics, 9(3), 1350–1371, 2015.
  • [18] H. Lakkaraju, S. H. Bach, and J. Leskovec, “Interpretable decision sets: A joint framework for description and prediction,” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1675–1684, ACM, August, 2016.
  • [19] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission,” In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730, ACM, 2015.
  • [20] B. Ustun and C. Rudin, “Supersparse linear integer models for optimized medical scoring systems,” Machine Learning, 102(3), pp.349–391, 2016.
  • [21] M.T. Ribeiro, S. Singh and C. Guestrin, “Anchors: High-precision model-agnostic explanations,” In AAAI Conference on Artificial Intelligence, 2018.
  • [22] M. Craven and J.W. Shavlik, “Extracting tree-structured representations of trained networks,” In Advances in neural information processing systems, pp. 24–30, 1996.
  • [23] H. Lakkaraju, E. Kamar, R. Caruana and J. Leskovec, “Interpretable & Explorable Approximations of Black Box Models,” KDD’17 workshop, 2017.
  • [24] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen and K.R. Müller, “How to explain individual classification decisions,” Journal of Machine Learning Research, 11(Jun), pp. 1803–1831, 2010.
  • [25] E. Strumbelj and I. Kononenko, “An efficient explanation of individual classifications using game theory,” Journal of Machine Learning Research, 11(Jan), pp. 1–18, 2010.
  • [26] M.T. Ribeiro, S. Singh and C. Guestrin, “Why should I trust you?: Explaining the predictions of any classifier,” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144, ACM, August, 2016.
  • [27] G. Hinton, O. Vinyals and J. Dean, “Distilling the knowledge in a neural network,” NIPS Deep Learning Workshop, 2015.
  • [28] A.K. Balan, V. Rathod, K.P. Murphy and M. Welling, “Bayesian dark knowledge,” In Advances in Neural Information Processing Systems, pp. 3438–3446, 2015.
  • [29] M. Craven and J. W. Shavlik, “Extracting tree-structured representations of trained networks,” In Advances in neural information processing systems, pp. 24–30, 1996.
  • [30] J. Ross Quinlan, C4.5: programs for machine learning, Elsevier, 2014.
  • [31] C. Buciluǎ, R. Caruana and A. Niculescu-Mizil, “Model compression,” In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541, ACM, August 2006.
  • [32] R. Caruana, A. Niculescu-Mizil, G. Crew and A. Ksikes, “Ensemble selection from libraries of models,” In Proceedings of the twenty-first international conference on Machine learning, p.18, ACM. July, 2004.
  • [33] J. Ba and R. Caruana, “Do deep nets really need to be deep?” In Advances in neural information processing systems, pp. 2654–2662, 2014.
  • [34] A. Romero, N. Ballas, S.E. Kahou, A. Chassang, C. Gatta and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
  • [35] G. Urban, K.J. Geras, S.E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Philipose and M. Richardson, “Do deep convolutional nets really need to be deep and convolutional?” In ICLR, 2017.
  • [36] G. Zhou, Y. Fan, R. Cui, W. Bian, X. Zhu and K. Gai, “Rocket Launching: A Universal and Efficient Framework for Training Well-performing Light Net,” AAAI, 2018.
  • [37] H. Borchani, G. Varando, C. Bielza and P. Larrañaga, “A survey on multi‐output regression,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(5), pp. 216–233, 2015.
  • [38] O. Irsoy, O.T. Yıldız and E. Alpaydın, “Soft decision trees,” In Pattern Recognition (ICPR), 21st International Conference on, pp. 1819–1822, IEEE, November, 2012.
  • [39] J.H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp.1189–1232, 2001.
  • [40] J.H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Analysis, 38(4), pp.367–378, 2002.
  • [41] Z. Che, S. Purushotham and Y. Liu, “Distilling Knowledge from Deep Networks with Applications to Healthcare Domain,” NIPS Workshop on Machine Learning for Healthcare (NIPS-MLHC), 2015.
  • [42] S. Tan, R. Caruana, G. Hooker and A. Gordo, “Transparent Model Distillation,” arXiv preprint arXiv:1801.08640, unpublished, 2018.
  • [43] Y. Lou, R. Caruana and J. Gehrke, “Intelligible models for classification and regression,” In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 150–158. ACM. August 2012.
  • [44] Y. Lou, R. Caruana, J. Gehrke and G. Hooker, “Accurate intelligible models with pairwise interactions,” In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 623–631, ACM. August 2013.
  • [45] M. Wu, M.C. Hughes, S. Parbhoo, M. Zazzi, V. Roth and F. Doshi-Velez, “Beyond sparsity: Tree regularization of deep models for interpretability,” in press, AAAI 2018.
  • [46] C. Audet and M. Kokkolaras, “Blackbox and derivative-free optimization: theory, algorithms and applications,” Optimization and Engineering, Vol. 17, Issue 1, pp.1-2, March 2016.
  • [47] K. Xu, D.H. Park, C. Yi and C. Sutton, “Interpreting Deep Classifier by Visual Distillation of Dark Knowledge,” in press, ICML 2018.
  • [48] L.V.D. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of machine learning research, 9(Nov), pp.2579–2605, 2008.
  • [49] S. Tan, R. Caruana, G. Hooker and Y. Lou, “Auditing Black-Box Models Using Transparent Model Distillation With Side Information,” AAAI/ACM AIES 2018.
  • [50] D.J. Felleman and D.C. Van Essen, “Distributed hierarchical processing in the primate cerebral cortex,” Cerebral Cortex, 1(1), pp. 1–47, 1991.
  • [51] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
  • [52] D.H. Hubel and T.N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” The Journal of physiology, 160(1), pp.106–154, 1962.
  • [53] Y. LeCun, Y. Bengio and G. Hinton, “Deep learning,” Nature, 521(7553), p. 436, 2015.
  • [54] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, Vol. 1, Cambridge: MIT Press, 2016.
  • [55] A. Krizhevsky, I. Sutskever and G.E. Hinton, “Imagenet classification with deep convolutional neural networks,” In Advances in neural information processing systems, pp. 1097–1105, 2012.
  • [56] Y.T. Zhou and R. Chellappa, “Computation of optical flow using a neural network,” In IEEE International Conference on Neural Networks, Vol. 1998, pp. 71-78, July, 1988.
  • [57] J.R. Quinlan, “Induction of decision trees,” Machine learning, 1(1), pp.81–106., 1986.
  • [58] J.R. Quinlan, C4.5: programs for machine learning, Elsevier, 2014.
  • [59] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, 1st ed. New York: Routledge, 1984.
  • [60] M.T. Ribeiro, S. Singh and C. Guestrin, “Why should i trust you?: Explaining the predictions of any classifier,” In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. ACM., August 2016.
  • [61] X. Liu, X. Wang and S. Matwin, “Interpretable Deep Convolutional Neural Networks via Meta-learning,” 2018 International Joint Conference on Neural Networks, IJCNN, in press, 2018.
  • [62] M. Abadi et al., “TensorFlow: A System for Large-Scale Machine Learning,” In OSDI, Vol. 16, pp. 265–283, 2016.
  • [63] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 86(11), pp. 2278–2324, November, 1998.
  • [64] D. Dua and E. K. Taniskidou, UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science, 2017.
  • [65] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, 15(1), pp. 1929–1958, 2014.