1. Introduction
Authors’ emails: {merida, kalogeratos, mougeot}@cmla.enscachan.fr. This work was funded by the IdAML Chair hosted at ENS Paris-Saclay.
For both modeling cultures, inference and prediction (Breiman2001), model selection and modeling procedure selection are essential steps to obtain a satisfactory solution to the problem at hand; hence, a wide variety of methods and metrics (Ding2018) exist to help users decide which model fits their data and is better to use.
In machine learning, where it is more usual to focus on the prediction ability of models, cross-validation (CV) is a widespread procedure for both model and algorithm selection (Ding2018). Indeed, CV has the advantages of being easy to implement and of simplifying the comparison of models from different families on the basis of a metric and its variability. It is always necessary, though, to choose which candidate models are to be compared. In this paper, we focus on how to select the model family from which to get candidate models, and specifically on challenging a rigid trained model through relaxation.
Currently, there exist packages such as Auto-WEKA (Kotthoff2019) and Auto-sklearn (Feurer2015) that provide tools to automate the algorithm and model selection process. The methods included in these packages apply Bayesian optimization and meta-learning procedures, while considering the modeling algorithm to be a hyperparameter. In particular, they select the most appropriate algorithm and the values for its parameters according to a metric, and within a budget set by the user. Our approach differs in that its output is not a specific trained model, or even a specific model family, but a set of indicators that can help users gain insight regarding the kind of model they need, as portrayed in Fig. 1. The user is thus directly involved in the process of model selection, and this is how our method sets itself apart from existing automated machine learning approaches.
In a similar direction, there are methods (e.g. Ali2006; Smith2002) that have been developed to obtain insight into what kind of data mining algorithm is suitable for a dataset according to certain statistical and information-theoretic measures. Both works employ methods that can be interpreted by users: (Ali2006) generates rules using the C5.0 algorithm, and (Smith2002) clusters datasets according to their characteristics using self-organizing maps and associates each cluster with the algorithms that perform best on average for the datasets lying therein. These methods require a large set of datasets and their evaluation results for different algorithms. The method presented in this paper, however, can be applied directly to a dataset.
Our aim is not to pinpoint a specific algorithm as the best for a given task, but rather to guide the user in selecting a model family, so that the most promising one can be explored further with model selection techniques. More specifically, we propose a methodology that is based on measuring the departure (i.e. disagreement at the decision level) between a rigid reference model and a more flexible one, which is initialized by the former and whose decision boundaries we relax in a controlled way. Here, decision trees (DTs), which are highly interpretable and partition the feature space into hyper-rectangles, are compared with neural networks (NNs), which are capable of approximating nonlinear interactions in the feature space. The procedure explores the space between DTs and NNs by optimizing the parametrization of a neural decision tree (NDT), following the procedure outlined in Fig. 1.
2. Background
2.1. Neural decision trees (NDTs)
NDTs are NNs whose architecture and weight initialization are obtained directly from an input DT, which we call the ‘seed DT’. The variant of NDTs we use here is presented in (Biau2018). In this case, there is no need to search for the right number of layers or the number of neural units in each layer.
NDTs are always formed by four layers: an input layer, two hidden layers, and an output layer. The connections between the layers encode the information extracted from the seed DT. For a dataset with $d$ features and a seed DT with $K$ leaves, we have the architecture and initialization that we discuss next.
First hidden layer. It encodes the information from the inner nodes (or split nodes) of the input DT. It is formed by $K-1$ units, each one representing a split node. Each conditional split in a tree is formed by a feature and a threshold on it. The NDT encodes this information in the weight matrix of the connections between this layer and the input layer: each column of the weight matrix is a one-hot encoding of the feature used in the corresponding split, and the biases are the opposite values of the thresholds.
Second hidden layer. It contains the paths in the DT from the root to the leaves. It consists of $K$ neurons, one for each leaf of the DT. The connections between the units of this layer and those of the previous layer encode the positions of the leaves with respect to each split node. Following the construction of (Biau2018), the elements of the weight matrix take three possible values:
$+1$ if the leaf is in the path of the inner node and is “on the right side” of the split node;
$-1$ if the leaf is in the path of the inner node and is “on the left side” of the split node;
$0$ otherwise.
As for the biases, they take the value $-\ell(k) + \frac{1}{2}$, where $\ell(k)$ is the length of the path from the root of the tree to the leaf $k$.
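The initialization of the two hidden layers can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy tree `TREE`, the tuple encoding of its nodes, and all function names are hypothetical, and the weight values ($+1$, $-1$, $0$) and the bias $-\ell(k) + \frac{1}{2}$ are assumed to follow the construction of (Biau2018). The weight matrices are stored with one row per unit, which is the transpose of the column convention used in the text.

```python
# Hypothetical toy tree: an inner node is (feature, threshold, left, right),
# a leaf is a string label. A sample goes right when x[feature] >= threshold.
TREE = (0, 0.5, "A", (1, 0.3, "B", "C"))


def collect(node, path, splits, leaves):
    """Walk the tree, numbering split nodes and recording each leaf's path."""
    if isinstance(node, str):
        leaves.append((node, list(path)))
        return
    feature, threshold, left, right = node
    k = len(splits)                      # index of this split node
    splits.append((feature, threshold))
    collect(left, path + [(k, -1)], splits, leaves)   # left side: -1
    collect(right, path + [(k, +1)], splits, leaves)  # right side: +1


def build_layers(tree, n_features):
    """Return (W1, b1, W2, b2, leaf_names) encoding the seed tree."""
    splits, leaves = [], []
    collect(tree, [], splits, leaves)
    # First hidden layer: one-hot feature selector, bias = -threshold.
    W1 = [[1.0 if j == f else 0.0 for j in range(n_features)] for f, _ in splits]
    b1 = [-t for _, t in splits]
    # Second hidden layer: +1 / -1 on the leaf's path, 0 elsewhere,
    # bias = -(path length) + 1/2 (assumed, following Biau2018).
    W2, b2 = [], []
    for _, path in leaves:
        row = [0.0] * len(splits)
        for k, side in path:
            row[k] = float(side)
        W2.append(row)
        b2.append(-len(path) + 0.5)
    return W1, b1, W2, b2, [name for name, _ in leaves]


def crisp(z):
    """Crisp threshold activation with outputs in {-1, +1}."""
    return 1.0 if z >= 0 else -1.0


def layer(W, b, x, act):
    """Apply one dense layer followed by the activation `act`."""
    return [act(sum(w * xi for w, xi in zip(row, x)) + bi) for row, bi in zip(W, b)]


def active_leaves(x, layers):
    """Names of the leaf units that fire (+1) for input x."""
    W1, b1, W2, b2, names = layers
    u = layer(W1, b1, x, crisp)
    v = layer(W2, b2, u, crisp)
    return [name for name, vi in zip(names, v) if vi > 0]
```

With crisp activations, exactly one leaf unit should fire for each input, namely the leaf that the seed tree itself would assign; replacing `crisp` by a smooth function is what relaxes the decision boundaries.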
Output layer. For a classification task, the NDT outputs the observed probability of an instance falling in leaf $k$. The values of the weight matrix are computed from $n_k$, the number of training instances assigned to leaf $k$, and $n$, the total number of training instances. The bias is set to a fixed value.
Activation functions.
In a DT, the splits, as well as the leaf memberships, are crisp. For an NDT to behave like a DT, its activation functions have to be crisp as well. However, a crisp function is not differentiable, so it would not be possible to train the NDT using backpropagation. To mitigate this problem, (Biau2018) proposes to approximate the crisp threshold function of the trees with the function $\sigma_\gamma$:
$$\sigma_\gamma(x) = \tanh(\gamma x) \qquad (1)$$
The parameter $\gamma$ controls the smoothness of the function: the higher the value of $\gamma$, the steeper the curve of $\sigma_\gamma$ gets. Moreover, $\sigma_\gamma$ is used as the activation function for both hidden layers of the NDT, each with a different value of the parameter, which we denote by $\gamma_1$ for the first hidden layer and by $\gamma_2$ for the second one.
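The effect of the smoothness parameter can be checked numerically. The sketch below assumes that the activation has the form $\tanh(\gamma x)$, as in (Biau2018); the function names `sigma` and `crisp` are hypothetical and only serve the illustration.

```python
import math


def sigma(x, gamma):
    """Smooth surrogate of the crisp threshold; assumed form tanh(gamma * x)."""
    return math.tanh(gamma * x)


def crisp(x):
    """Crisp threshold with outputs in {-1, +1}."""
    return 1.0 if x >= 0 else -1.0


# Compare the smooth activation with the crisp one for decreasing gamma.
for gamma in (100, 10, 1):
    values = [round(sigma(x, gamma), 3) for x in (-0.5, -0.1, 0.1, 0.5)]
    print(gamma, values)
```

For a large parameter value the outputs are practically indistinguishable from the crisp threshold, whereas for a value of 1 the function is the plain hyperbolic tangent and inputs close to a split threshold produce outputs far from $\pm 1$.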
2.2. Metrics
Having obtained a trained NDT that was initialized by a specific input DT, we can compare their performance on the dataset of interest and measure the departure of the NDT from its seed DT. For this purpose, we specifically measure the agreement between an NDT and its seed DT. We employ different metrics for classification and regression tasks.
For classification, the agreement metric used is Cohen’s $\kappa$ statistic (Cohen1960), which calculates the extent to which the labels found by two classifiers agree, taking into account the probability for them to agree by chance:
$$\kappa = \frac{p_o - p_e}{1 - p_e} \qquad (2)$$
where $p_o$ is the probability that the two classifiers agree, and $p_e$ the probability that they agree by chance. When the models agree completely, $\kappa = 1$; when they agree only as much as expected by chance, $\kappa = 0$; if their agreement is less than what is expected by chance, then $\kappa < 0$.
By abuse of notation, we write $\kappa$ for the agreement of an NDT with the seed DT that initialized it.
Finally, the metric used to measure and compare the performance of different models is the accuracy of the model. The values of both metrics are averaged over a number of experiments.
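As an illustration of the agreement metric, Cohen's statistic can be computed directly from two label vectors. The following is a minimal sketch using the standard definition, with the observed agreement estimated as the fraction of matching labels and the chance agreement estimated from the label frequencies of each classifier; the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences of equal length."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of instances with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over classes of the product of marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both classifiers always output the same single label.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


dt_labels = [0, 0, 1, 1, 1, 0]
ndt_labels = [0, 0, 1, 1, 0, 0]
print(cohen_kappa(dt_labels, ndt_labels))
```

In practice, `sklearn.metrics.cohen_kappa_score` provides the same statistic; the explicit version above is only meant to make the two terms of the formula visible.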
3. Proposed method
3.1. Outline
The idea is then to start from a DT whose training hyperparameters have been tuned (e.g. using CV). The structure of the trained DT is transferred to an NDT with a set of hyperparameters, as shown in Fig. 1, that includes the batch size, the number of epochs, the optimizer, and the $\gamma_1$ and $\gamma_2$ values. During the procedure, the latter two are the only hyperparameters of the NDT that are not fixed. Therefore, finding the best hyperparameter set in our case means finding the values of $\gamma_1$ and $\gamma_2$ for which the best average performance is observed.
$\gamma_1$ and $\gamma_2$ control the smoothness of the activation functions in the NDT: for higher values (e.g. 100) the NDT behaves very similarly to the seed DT, while for lower values it gets closer to an NN with hyperbolic tangent activation functions. Progressively varying $\gamma_1$ and $\gamma_2$ from higher to lower values causes a progressive departure of the NDT model: from the initial DT to an altered model that is relaxed and closer to an NN model. It is then possible to detect the point where better performance is obtained with respect to the accuracy, and to measure, using $\kappa$, how far from a DT the best NDT model obtained is.
The core of the proposed method is then the search for the best $\gamma_1$ and $\gamma_2$ values and their subsequent interpretation, as well as that of the corresponding agreement and accuracy.
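The search described above can be sketched as a simple sweep. Everything in this sketch is an assumption made for illustration: `train_and_score` is a hypothetical user-supplied callable that trains an NDT from a seed DT and returns its accuracy and its agreement with that seed DT, and `gamma2_from_gamma1` is a hypothetical stand-in for the link between the two smoothness parameters, not the authors' function. Only the lower bound of 0.05 is taken from the text.

```python
from statistics import mean


def gamma2_from_gamma1(gamma1, lower_bound=0.05):
    """HYPOTHETICAL link between the two parameters (not the paper's formula)."""
    return max(gamma1 / 10.0, lower_bound)


def sweep(gammas, seeds, train_and_score):
    """Relax the NDT progressively and report the best-performing setting.

    train_and_score(seed, gamma1, gamma2) must return a pair
    (accuracy of the NDT, agreement of the NDT with its seed DT).
    """
    results = {}
    for gamma1 in sorted(gammas, reverse=True):  # from DT-like to NN-like
        gamma2 = gamma2_from_gamma1(gamma1)
        scores = [train_and_score(seed, gamma1, gamma2) for seed in seeds]
        avg_acc = mean(acc for acc, _ in scores)
        avg_kappa = mean(kappa for _, kappa in scores)
        results[gamma1] = (avg_acc, avg_kappa)
    best = max(results, key=lambda g: results[g][0])  # highest average accuracy
    return best, results[best][0], results[best][1], results
```

The three returned indicators (best parameter value, its average accuracy, its average agreement with the seed DT) are the quantities that are then interpreted; the full `results` dictionary allows the accuracy and agreement curves to be plotted against the parameter.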
3.2. Relationship between $\gamma_1$ and $\gamma_2$
In order to simplify the procedure and to provide easy-to-interpret results, we link the value of $\gamma_2$ to that of $\gamma_1$ through a function $f$, such that $\gamma_2 = f(\gamma_1)$.
In (Biau2018) it is suggested to use a value of $\gamma_2$ lower than that of $\gamma_1$, since a smoother activation function in the second hidden layer allows for stronger weight corrections in the first hidden layer during backpropagation.
On the other hand, a very low $\gamma_2$ value would make the function approach the flat zero function. Given the $\gamma_1$ values used in our experiments, we set a reasonable lower bound for $\gamma_2$ at 0.05.
We tested different functions on the datasets presented in Tab. 1 and for different values of $\gamma_1$ using CV. Overall, the function given in Eq. 3 was the one for which the NDTs achieved the best performance most of the time, across datasets and $\gamma_1$ values.
In what follows, we use $\gamma$ to refer to $\gamma_1$, and $\gamma_2$ is then computed internally by:
(3) 
3.3. The algorithm of the procedure
Eq. 3 reduces the process of finding the best pair of $\gamma_1$ and $\gamma_2$ values to that of just estimating the best $\gamma$. As we are interested in a progressive departure from any given DT, we can in fact produce multiple seed DTs and, for each of those, build several NDTs by gradually decreasing the value of $\gamma$. The tested values form an ordered set.
4. Experiments
4.1. Datasets
The proposed method was tested on several datasets, both synthetic and real, the latter taken from the UCI repository. Some of these datasets were chosen because either DTs or NNs are better suited to model them.
First, we mention the synthetic dataset we use: sim_1000_3 contains two classes, which are not linearly separable, and the data instances of each class are generated by a normal distribution with a different mean. For this dataset, we expect that an NDT with a low $\gamma$ will perform the best.
Following the same selection principle for real data, the Gastrointestinal Lesions in Regular Colonoscopy dataset (lesions) was chosen because it is high-dimensional, a setting where DTs can be better at selecting only the most informative features (Brown1993). The mushroom dataset was selected because it can be accurately modeled using simple rules. The rest of the real datasets offer points of comparison for when the more suitable model family is not as clear as above. They are the Spambase (spam) and the Student Performance (studentmath) datasets. The characteristics of the datasets are given in Tab. 1.
4.2. Experimental pipeline
For each dataset, we first used CV to determine the depth of the DTs to be used (reported in Tab. 1), which was in fact the hyperparameter with the biggest impact on DT performance. Aside from removing the categorical variables (except for mushroom, whose variables are all categorical and were encoded) and the instances with missing data, no other preprocessing was applied.
For the NDTs, we used the same number of epochs for all datasets and set the batch size according to the size of each dataset. We used the Adam optimizer with default parameters and, for regularization, early stopping with a fixed patience.
To generate the training, validation, and test sets, we used repeated random subsampling with proportions 50%/25%/25% respectively. We used stratification to ensure that the classes appear in the same proportions in all sets.
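The splitting scheme can be illustrated with a short, dependency-free sketch. The function names and the number of repetitions passed to `repeated_splits` are hypothetical; only the 50%/25%/25% proportions and the use of stratification come from the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.5, 0.25, 0.25), seed=0):
    """Return (train, validation, test) index lists, stratified by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to the test set
    return train, val, test


def repeated_splits(labels, n_repeats):
    """Repeated random subsampling: one independent stratified split per repeat."""
    return [stratified_split(labels, seed=repeat) for repeat in range(n_repeats)]


labels = [0] * 80 + [1] * 40
train, val, test = stratified_split(labels, seed=1)
print(len(train), len(val), len(test))
```

Because the split is performed separately within each class, the class proportions of the full dataset are preserved in all three subsets, up to rounding for small classes.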
4.3. Results
The method outputs the best $\gamma$ value, together with the corresponding average agreement and average accuracy. These values are then to be interpreted to decide which model family is worth exploring further. We present here detailed examples of this interpretation for the sim_1000_3 and lesions datasets.
Fig. 2(a) shows the average performance of the NDT with respect to the $\gamma$ value, and compares it to that of the initial DT. It tells us that, as expected, for higher values of $\gamma$ the NDT has an average performance that is close to that of its seed DT, and as $\gamma$ decreases, the accuracy of the NDT varies. In this case, the accuracy increases monotonically and reaches its maximum for a low value of $\gamma$. For this value, $\sigma_\gamma$ is closer to a $\tanh$ function, and so it is a point where the NDT is close to a classic NN. However, the agreement between the best NDT and the DT is high (0.819, see Tab. 1). This means that a DT and an NN model may not differ much in their decisions; nonetheless, if one is interested in obtaining better performance at the cost of sacrificing explainability, then further exploring the NN family is the most promising direction.
Fig. 2(b) shows both the performance and the agreement decreasing as the value of $\gamma$ decreases. Here the best average accuracy for the NDT is reached for a high value of $\gamma$. In this case the function $\sigma_\gamma$ is very close to a threshold function, and so the NDT behaves similarly to a DT. However, there is no improvement over the seed DT, even for this value, and even for high values of $\gamma$ the agreement $\kappa$ is fairly low. This is due to the setting of $\gamma_2$ according to Eq. 3. Indeed, an NDT behaves most like its seed DT when both $\gamma_1$ and $\gamma_2$ are high. Taking these factors into account, our interpretation of these results is that exploring the DT model family further might be more promising.
Similar reasoning can be developed for the rest of the datasets using the results reported in Tab. 1. Similar graphs can be drawn for all the datasets; however, their interpretation might be less straightforward than for the example of Fig. 2(a). It is worth noting, though, that the preprocessing applied to the datasets can greatly affect the results.
5. Conclusion
The method introduced in this work allows us to determine whether more flexible models can outperform a trained DT on a given learning task. This is done in a controlled way by progressively relaxing the DT’s decision boundaries. Furthermore, the agreement metric provides insight into how far it was necessary to depart from the reference model in order to achieve an improvement in performance (if there is one). This can be useful when deciding between a rigid but more explainable model and a more flexible but less interpretable one. Our method’s starting point is a trained DT. Nonetheless, the optimization of $\gamma_1$ and $\gamma_2$ remains an open problem. This work could be extended by enriching the experimental results and the comparisons with other methods.
References
(1) Ali, S., and Smith, K. A. On learning algorithm selection for classification. Applied Soft Computing 6, 2 (Jan. 2006), 119–138.
(2) Biau, G., Scornet, E., and Welbl, J. Neural random forests. Sankhya A (June 2018).
(3) Breiman, L. Statistical modeling: The two cultures. Statistical Science 16, 3 (Aug. 2001), 199–215. Publisher: Institute of Mathematical Statistics.
(4) Brown, D. E., Corruble, V., and Pittard, C. L. A comparison of decision tree classifiers with backpropagation neural networks for multimodal classification problems. Pattern Recognition 26, 6 (June 1993), 953–961.
(5) Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (Apr. 1960), 37–46.
(6) Ding, J., Tarokh, V., and Yang, Y. Model selection techniques: An overview. IEEE Signal Processing Magazine 35, 6 (Nov. 2018), 16–34.
(7) Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., and Hutter, F. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2962–2970.
(8) Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., and Leyton-Brown, K. Auto-WEKA: Automatic model selection and hyperparameter optimization in WEKA. In Automated Machine Learning. Springer International Publishing, 2019, pp. 81–95.
(9) Smith, K. A., Woo, F., Ciesielski, V., and Ibrahim, R. Matching data mining algorithm suitability to data characteristics using a self-organizing map. In Hybrid Information Systems. Physica-Verlag HD, 2002, pp. 169–179.