In machine learning, the problem of classification consists of inferring a model from observations (also called training examples) that makes it possible to identify which class of a set of classes a new observation belongs to. When the training examples used to infer a model are labelled with the classes to which they belong, the task is called supervised learning. Many machine learning models exist to solve this problem, such as support vector machines, artificial neural networks, decision trees, etc. Because inferring such models is complicated and the number of training examples is generally very large, most existing inference algorithms are heuristic in the sense that they infer models without any guarantee of optimality.
Although these heuristic-based techniques generally work well, there are always cases where a new example is not correctly recognized by the model. In the context of critical systems in which errors are not allowed, such models are difficult to use. This requirement often goes together with the demand that models must also be understandable. Known as eXplainable AI (XAI), this area consists of inferring models capable of explaining their own behaviour. XAI has been the subject of several studies in recent years [10, 15, 20, 3], as well as several events [2, 1]. One approach to obtain an explainable model is to use decision trees because the reasons for classification are clearly defined.
In order to obtain accurate and explainable models, we are interested in the inference of optimal decision trees. The optimality of a decision tree is generally defined by the simplicity of the tree based on the principle of parsimony. Our chosen simplicity criteria are the depth of the tree and the number of nodes. In particular, for a fixed maximum depth of the tree, we want to infer a decision tree with a minimum number of nodes that is consistent with the training examples.
Although decision tree inference is a well-studied classic problem, the majority of known algorithms are heuristic and try to minimize the number of nodes without guaranteeing any optimality [17, 19, 13, 8]. This is because the problem is known to be NP-complete for several definitions of optimality [14, 11]. In addition, the first algorithms to infer optimal decision trees were ineffective in practice. However, in recent years, several studies have focused on improving the performance of optimal decision tree inference algorithms.
The first series of studies was carried out on a problem similar to ours: inferring an optimal decision tree with a given depth such that the total classification error on the training examples is minimized [6, 5, 22]. For this problem, Verwer and Zhang [22] propose a binary linear programming formulation that infers optimal decision trees of depth four in less than ten minutes.
The studies closest to ours are those of Bessiere et al. [7] and Narodytska et al. [16]. These authors were interested in a particular case of our problem: inferring decision trees with a minimal number of nodes without trying to minimize the depth. Bessiere et al. propose a SAT formulation, but experiments show that the method only works for small models, i.e., trees of about fifteen nodes. The authors also propose a method based on constraint programming to minimize the number of nodes, but without necessarily reaching the optimum. Narodytska et al. propose a new SAT formulation that greatly improves the practical performance of optimal decision tree inference. With their new formulation, Narodytska et al. were able to build, for the “Mouse” dataset, a decision tree with a minimum number of nodes in 13 seconds, while this required 577 seconds with the SAT formulation of Bessiere et al. The authors claim that, to the best of their knowledge, their paper is the first presentation of an optimal decision tree inference method based on well-known datasets.
In this paper, we propose an even more efficient method than the last one. Our benchmarks show that we can process the “Mouse” dataset in only 75 milliseconds. Moreover, well-known datasets that were considered too large to infer optimal decision trees from them can now be processed by our algorithm.
The paper is organized as follows. In the next section, we provide definitions related to decision trees needed to formalize the approach. Section 3 provides a new Boolean formulation for passive inference of a decision tree from a set of training examples. We propose in Section 4 an incremental way of generating the Boolean formulas which ensures that the proposed approach scales to large datasets. Section 5 reports several experiments comparing our approach to others. Finally, we conclude in Section 6.
Let be a set of training examples, that is, Boolean valuations of a set of features, and let be a partition of into classes. Note that even if we only consider binary features, we can easily handle non-binary features by encoding them in a binary way. Features that belong to categories can be represented by Boolean features where each one represents the affiliation to one category. If the categories are ordered, then each Boolean feature can represent the affiliation to a smaller or equal category (see Example 1). The second encoding provides constraints of type on numerical features for example. In the following, we denote by the Boolean valuation of the feature in example .
A decision tree is a binary tree where each node is labelled by a single feature and each leaf is labelled by a single class. A decision tree is said to be perfect if all leaves have the same depth and all internal nodes have two children. Formally, we denote a perfect decision tree by where is the set of internal nodes and is the set of leaves and where is the depth of the tree. We denote by the node in the tree and by the root of the tree. Then we define as the left child of and as the right child. In a similar way, if , we define the leaf as the left child of and the leaf as the right child. An illustration of this encoding is depicted in Figure 1.
This way of associating a number to every node and leaf may appear complicated, but it will be useful for our Boolean encoding. We will use the semantics associated with binary coding of node indexes to obtain compact SAT formulas.
If is a decision tree, and is a set of training examples, we say that is consistent with , denoted , if each example is correctly classified by .
Let be a set of training examples where each example has a single integer feature . Let and be the partition of into two classes. Then, we can transform and into and such that each example has four Boolean features . If the feature is true, it means that the example is smaller or equal to for feature ; if the feature is true, it means that the example is smaller or equal to for feature , etc. Thus, with this transformation we obtain and .
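The ordered encoding used in this example can be sketched as follows. This is only an illustration: the function name `ordered_encoding` and the thresholds 1 to 4 are assumptions chosen to give four Boolean features, not values taken from the example.

```python
def ordered_encoding(value, thresholds):
    """Return one Boolean per threshold t, true when value <= t."""
    return [value <= t for t in thresholds]


# Hypothetical thresholds giving four Boolean features per example.
THRESHOLDS = [1, 2, 3, 4]

# Each integer value becomes a monotone Boolean vector:
# once a feature is true, all features for larger thresholds are true too.
encoded = {v: ordered_encoding(v, THRESHOLDS) for v in (1, 3, 4)}
```

With this representation, a node labelled by one Boolean feature behaves like a threshold test on the original integer feature.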
3 Passive Inference
There are two types of methods for solving the problem of inferring a decision tree from examples.
One group constitutes heuristic methods which try to find a relevant feature for each internal node in polynomial time [8, 19, 17, 13]. They are often used in practice because of their efficiency; however, they provide no guarantee of optimality. Thus, better choices in the feature order can lead to smaller decision trees that are consistent with training examples.
Another group includes exact algorithms to determine a decision tree with a minimal number of nodes. It is a much more complicated problem, as it is NP-complete. There are essentially two works that focus on this problem [16, 7].
We propose a SAT formulation that differs from them. Our approach has two main steps. In the first step, we seek a perfect decision tree of minimal depth. In the second step, we add constraints to reduce the number of nodes in order to potentially obtain an imperfect decision tree.
3.1 Inferring perfect decision trees of fixed maximum depth
Our SAT encoding to infer decision trees of a fixed maximum depth is based on the way the nodes are indexed. As mentioned in Section 2, the index of a node depends on its position in the tree. In particular, the root node corresponds to the node , and for each node , the left child corresponds to and the right child to . This coding has a useful capability of providing precise information on the position of a node based on the binary coding of its index. Indeed, reading the binary coding of a node from the highest to the lowest weight bit indicates which branches to take when moving from the root to the node. For example, if the binary coding of is 1011, then the node is reached by taking the right branch of the root, then the left branch, and finally twice the right branch. Note that if a node is a descendant of the right branch of a node then we say that is a right ancestor of .
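The reading of a binary coding as a sequence of branches can be illustrated with a short sketch. The helper names are hypothetical, and treating the leaf index as the plain integer value of the bit string is an assumption made for illustration.

```python
def branches_from_bits(bits):
    """Read a bit string from the most to the least significant bit:
    '1' means the right branch is taken, '0' the left branch."""
    return ["right" if b == "1" else "left" for b in bits]


def index_from_bits(bits):
    """Integer whose binary coding is the given bit string (assumed convention)."""
    return int(bits, 2)


# The coding 1011 from the text: right, left, right, right.
path = branches_from_bits("1011")
```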
The idea of our coding consists in arranging the training examples in the leaves of the tree while respecting the fact that all training examples placed in the same leaf must belong to the same class. Moreover, if an example is placed in a leaf, then all the right ancestors of that leaf can only be labelled by features true for that example (and conversely for the left ancestors).
The encoding idea is formalized using the following types of Boolean variables.
: If the variable is true, it means that the example is assigned to a leaf located in the right branch of a node at depth . If is false, then is assigned to a leaf located in the left branch of that node. Note that with this semantics on the variables , we have the property that the binary coding , denoted , corresponds to the index of the leaf where the example belongs, i.e., if , then the example is assigned to the leaf . We also denote by the number formed by the binary coding of .
: If is true, it means that the node is labelled by the feature .
: If is true, it means that the leaf is labelled by the class .
We then use the following set of clauses to formulate constraints that a perfect decision tree of depth should satisfy.
For each , we have the clauses:
These clauses mean that each node should have at least one feature.
For each and every features such that , we have the clauses:
These clauses mean that each node has at most one feature.
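The two families of clauses above (at least one and at most one feature per node) can be generated as in the following sketch. Clauses are lists of signed integers in the usual DIMACS style; the variable numbering in `feature_var` is a hypothetical choice, not the one used by the prototype.

```python
from itertools import combinations


def feature_var(node, feature, n_features):
    """Hypothetical numbering: id of the variable 'node is labelled by feature'."""
    return node * n_features + feature + 1


def exactly_one_feature_clauses(n_nodes, n_features):
    """Clauses stating that every internal node carries exactly one feature."""
    clauses = []
    for node in range(n_nodes):
        lits = [feature_var(node, f, n_features) for f in range(n_features)]
        clauses.append(lits)                  # at least one feature
        for a, b in combinations(lits, 2):
            clauses.append([-a, -b])          # at most one feature
    return clauses
```

The pairwise at-most-one encoding is quadratic in the number of features per node, which is the simplest choice for a sketch; more compact encodings exist.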
For every and such that , and each , we have:
And for every and such that , and each , we have:
These formulas add constraints that some features cannot be found in certain nodes depending on where the training examples are placed in the decision tree.
We use the binary coding of the index of a leaf to determine which nodes in the tree are its parents.
Thus, all the parent nodes for which the left branch has been taken cannot be labelled by features that must be true, and vice versa.
Note that it is not trivial to translate these formulas into clauses, but we show in Algorithm 1 how it could be done.
This algorithm performs a depth-first search of the perfect decision tree in a recursive way.
The variable corresponds to the index of the current node and constraints such that correspond to the index of a successor.
Each time the algorithm visits a node , it adds constraints on the features that can label this node, based on whether is placed in the left or right branch of .
If is placed in the left branch, then cannot contain a feature that is true for and vice versa with the right branch.
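One possible reading of this depth-first generation is sketched below for a single example. It is not Algorithm 1 itself: the heap-style node numbering (root 1, children 2q and 2q + 1) and the callables `x_var` and `f_var`, which return positive variable ids, are assumptions made for illustration.

```python
def path_clauses(example, e_idx, k, x_var, f_var):
    """Clauses forbidding node labels that contradict the path of one example.

    example: list of 0/1 feature values; k: depth of the perfect tree.
    x_var(e, d): id of 'example e goes right at depth d' (hypothetical).
    f_var(q, f): id of 'node q is labelled by feature f' (hypothetical).
    """
    clauses = []

    def visit(q, d, off_path):
        # off_path holds literals that are true when the example does NOT reach q.
        x = x_var(e_idx, d)
        for f, value in enumerate(example):
            if value:
                # Feature true for the example: forbidden at q if it goes left.
                clauses.append(off_path + [x, -f_var(q, f)])
            else:
                # Feature false for the example: forbidden at q if it goes right.
                clauses.append(off_path + [-x, -f_var(q, f)])
        if d + 1 < k:
            visit(2 * q, d + 1, off_path + [x])       # left child: x is false
            visit(2 * q + 1, d + 1, off_path + [-x])  # right child: x is true

    visit(1, 0, [])
    return clauses
```

Under these assumptions each example contributes one clause per (node, feature) pair, so the number of clauses grows with the number of internal nodes times the number of features.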
For each with and each integer , we have:
And for each , each integer , and we have:
These formulas assign the classes to the leaves according to the places of the training examples in the decision tree. Again, since it is not trivial to translate these formulas into clauses, we show in Algorithm 2 how it could be performed. This algorithm performs a depth-first search of the perfect decision tree such that when it reaches a leaf, corresponds to the index of that leaf. After that, for each leaf, the algorithm generates the constraints that if is present in that leaf, then that leaf must have the same class as .
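The leaf constraints can be sketched in the same style. This is a plausible reading rather than Algorithm 2: it iterates over leaves instead of performing a depth-first search, assumes leaves are numbered from 0 by the binary coding of the path, and uses hypothetical callables `x_var` and `c_var` for variable ids. The second family of clauses (no other class in that leaf) is also an assumption.

```python
def leaf_class_clauses(e_idx, cls, k, n_classes, x_var, c_var):
    """Clauses stating that the leaf holding an example carries its class.

    x_var(e, d): id of 'example e goes right at depth d' (hypothetical).
    c_var(leaf, c): id of 'leaf is labelled by class c' (hypothetical).
    """
    clauses = []
    for leaf in range(2 ** k):
        # Literals that are true when the example is NOT placed in this leaf.
        off_leaf = []
        for d in range(k):
            bit = (leaf >> (k - 1 - d)) & 1   # most significant bit first
            x = x_var(e_idx, d)
            off_leaf.append(-x if bit else x)
        # If the example is in this leaf, the leaf has the example's class...
        clauses.append(off_leaf + [c_var(leaf, cls)])
        # ...and no other class (assumed reading of the second formula).
        for other in range(n_classes):
            if other != cls:
                clauses.append(off_leaf + [-c_var(leaf, other)])
    return clauses
```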
3.2 Minimizing the number of nodes
Perfect decision trees are often unnecessarily large, i.e., they can contain too many nodes. For example, there may be an imperfect decision tree consistent with training examples, with the same depth as the perfect tree, but with fewer nodes.
In order to find a tree with a minimum number of nodes, we show in this section how we can add constraints to set a maximum number of nodes of the tree. The idea is to limit the number of leaves that can be assigned to a class in the perfect tree. Indeed, if a leaf is not assigned to a class, then the parent of this leaf can be replaced by its other child. By applying this algorithm recursively until all leaves are assigned to a class, we get a decision tree with exactly one leaf more than the internal nodes. Thus, limiting the number of nodes to has the same effect as limiting the number of leaves to .
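The replacement of a parent by its remaining child can be sketched as follows. The tree representation is an assumption made for illustration: an internal node is a tuple `(feature, left, right)`, a leaf is a class label, and `None` marks a leaf with no class.

```python
def prune(tree):
    """Remove unlabelled leaves (None) by replacing their parent with the sibling."""
    if not isinstance(tree, tuple):
        return tree                      # a leaf: class label or None
    feature, left, right = tree
    left, right = prune(left), prune(right)
    if left is None:
        return right
    if right is None:
        return left
    return (feature, left, right)


def count(tree):
    """Return (number of internal nodes, number of leaves)."""
    if not isinstance(tree, tuple):
        return (0, 1)
    _, left, right = tree
    left_nodes, left_leaves = count(left)
    right_nodes, right_leaves = count(right)
    return (left_nodes + right_nodes + 1, left_leaves + right_leaves)


# Perfect tree of depth 2 with one unlabelled leaf.
perfect = ("f1", ("f2", "A", None), ("f3", "B", "A"))
pruned = prune(perfect)
```

After pruning, every internal node has two children, so the tree has exactly one more leaf than internal nodes.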
To add the constraint on the maximum number of leaves that can be assigned to a class, we add two types of additional variables: the variables , which are true if a class is assigned to the leaf , and the variables , which are used to count, with unary coding, the number of leaves labelled by a class. The variable will be true if there are at least leaves labelled by a class among the first leaves.
The clauses encoding the new constraint are as follows:
For each and each class , we have the clauses:
These clauses assign to true if the leaf is labelled by a class.
For each and each class , we have the clauses:
These clauses propagate the fact that if is true, then is also true.
For each and each class , we have the clauses:
These clauses increase the value of by one if is true.
Thus is true if there are at least leaves labelled by a class among the first leaves.
Finally, we assign the start and end of the counter :
The first assignment prohibits having more than leaves, so nodes. The second assignment sets the counter to .
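A unary counter of this kind can be sketched as a generic "at most `bound` true literals" constraint. The variable layout below (function `m`, ids starting at `first_free_var`) is an assumption for illustration and may differ from the exact clauses of the paper.

```python
def at_most(lits, bound, first_free_var):
    """Clauses forcing at most `bound` of `lits` to be true (unary counter).

    m(l, i), for l = 1..n and i = 1..bound + 1, stands for
    'at least i of the first l literals are true'.
    """
    n, width = len(lits), bound + 1

    def m(l, i):
        return first_free_var + (l - 1) * width + (i - 1)

    clauses = []
    for l in range(1, n + 1):
        u = lits[l - 1]
        clauses.append([-u, m(l, 1)])                  # one true literal seen
        if l > 1:
            for i in range(1, width + 1):
                clauses.append([-m(l - 1, i), m(l, i)])          # propagate the count
                if i < width:
                    clauses.append([-u, -m(l - 1, i), m(l, i + 1)])  # increase by one
    clauses.append([-m(n, width)])                     # bound + 1 is never reached
    return clauses
```

Here `lits` would be the variables stating that a leaf is labelled by a class, and `bound` the maximum number of labelled leaves allowed.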
The formula for inferring a decision tree of depth (and a specific number of nodes) from training examples with features classed in classes requires literals and clauses.
Note that the numbers of literals and clauses are of the same order of magnitude whether or not the maximum number of nodes is specified. However, finding a tree with a minimum number of nodes will take more time because it requires us to search for this number by a dichotomic search.
4 Incremental Inference
To alleviate the complexity associated with large sets of training examples, we propose an approach which, instead of attempting to process all the training examples at once, iteratively infers a decision tree from a subset of them (initially the empty set) and uses active inference to refine it when it is not consistent with one of the training examples. While active inference usually uses an oracle capable of deciding to which class an example belongs, we assign this role to the training examples. Even if such an oracle is restricted, since it cannot guess the class for all possible input features, it nevertheless leads, as we demonstrate, to an efficient approach for passive inference from training examples. The approach is formalized in Algorithm 3.
An illustration of the execution of this algorithm on a simple example is given in the Appendix.
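The incremental loop can be sketched independently of the SAT encoding. In this sketch, `infer` and `classify` are hypothetical stand-ins for the SAT-based inference step and for tree evaluation; `infer` is assumed to return a model consistent with the subset it receives, or `None` if none exists.

```python
def incremental_inference(examples, infer, classify):
    """Grow a working subset with counterexamples until the model fits all examples.

    examples: list of (features, label) pairs.
    infer(subset): returns a model consistent with subset, or None (assumed).
    classify(model, features): returns the predicted label (assumed).
    """
    subset = []
    while True:
        model = infer(subset)
        if model is None:
            return None, subset          # no model of the requested shape exists
        counterexample = next(
            ((x, y) for x, y in examples if classify(model, x) != y), None)
        if counterexample is None:
            return model, subset         # consistent with every training example
        subset.append(counterexample)    # refine with the misclassified example
```

Because each counterexample is misclassified by a model that is consistent with the current subset, the subset grows strictly at every iteration, so the loop stops after at most as many iterations as there are examples.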
In this paper, we have presented two algorithms to solve two different problems.
The first algorithm, denoted , finds a perfect decision tree of minimal depth. It uses Algorithm 3 without defining . We initially set and, as long as the algorithm does not find a solution, we increase the value of .
The second algorithm, denoted , minimizes the depth of the tree and the number of nodes. It starts by applying to learn the minimum depth required to find a decision tree consistent with the training examples. Then it performs a dichotomic search on the number of nodes allowed between and to find a decision tree with a minimal number of nodes.
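The search on the number of nodes can be sketched as a standard binary search. The predicate `feasible` is a hypothetical stand-in for one call to the inference procedure with a given node limit; the sketch assumes feasibility is monotone (if a limit suffices, any larger limit does too) and that the upper bound is feasible.

```python
def minimal_feasible(lo, hi, feasible):
    """Smallest n in [lo, hi] such that feasible(n) holds.

    Assumes feasible is monotone and feasible(hi) is true.
    """
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid          # a tree with at most mid nodes exists
        else:
            lo = mid + 1      # more than mid nodes are needed
    return lo
```

The number of inference calls is logarithmic in the size of the search interval.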
The main metrics we compare are execution time and accuracy. The accuracy is calculated with a 10-fold cross-validation defined as follows. We divide the dataset into 10 equal parts (plus or minus one element). Then 9 of these parts are used to infer a decision tree, and the last part is used to calculate the percentage of its elements correctly classified by the decision tree. This operation is performed 10 times to try all possible combinations among these ten parts, and the average percentage is calculated.
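The cross-validation procedure described above can be sketched as follows. The callables `train` and `classify` are hypothetical placeholders for the inference algorithm and for tree evaluation, and the dataset is assumed to contain at least as many examples as folds.

```python
def cross_validation_accuracy(examples, train, classify, folds=10):
    """Average test accuracy over `folds` parts whose sizes differ by at most one.

    examples: list of (features, label) pairs, with len(examples) >= folds.
    """
    parts = [examples[i::folds] for i in range(folds)]
    scores = []
    for i, test_part in enumerate(parts):
        # All parts except the i-th one are used for training.
        training = [ex for j, part in enumerate(parts) if j != i for ex in part]
        model = train(training)
        correct = sum(classify(model, x) == y for x, y in test_part)
        scores.append(correct / len(test_part))
    return sum(scores) / folds
```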
The prototype was implemented in C++ calling the SAT solver MiniSAT [9], and we ran it on Ubuntu on a computer with 12 GB of RAM and an i7-2600K processor.
5.1 The “Mouse” dataset
Our first experiment is performed on the “Mouse” dataset that the authors Bessiere et al. shared with us. This dataset has the advantage of having been used with both algorithms and . In Table 1, we compare the time and accuracy for different algorithms. Each entry in rows and corresponds to the average over 100 runs. The first four columns correspond to inferring a decision tree from the whole dataset. The last column corresponds to the 10-fold cross-validation.
|Algo||Time (s)||Examples used||Clauses||k||Nodes||acc.|
By analyzing Table 1, we can notice that our incremental approach is very effective on this dataset. Only 37 examples for and 33 examples for were used to build an optimal decision tree consistent with the entire dataset. Thus, thanks to our incremental approach and an efficient SAT formulation, our algorithms are much faster than and . We could not compare the accuracy because this data is missing in the two respective papers for and .
5.2 The “Car” dataset
Another dataset provided to us and used by Bessiere et al. and Narodytska et al. is “Car”. This dataset is much more complicated and, to the best of our knowledge, no algorithm has been able to infer an optimal decision tree consistent with the entire dataset. Bessiere et al. process this dataset using linear programming (denoted ) to minimize the number of nodes. However, they do not guarantee that the decision tree they find is optimal. The approach used by Narodytska et al. simplifies the dataset by considering only 10% of the data. Thus, they can infer an optimal decision tree consistent with the selected 10% of the data in 684 seconds. Table 2 compares the results of different algorithms. Each entry in rows and corresponds to the average over ten runs. The first four columns correspond to inferring a decision tree from the whole dataset. The last column corresponds to the 10-fold cross-validation.
|Algo||Time (s)||Examples used||k||Nodes||acc.|
We can see in Table 2 that, of the examples in the “Car” dataset, our incremental approach uses fewer than half. Although this number is still much higher than the number of examples used by the algorithm , we can see that our algorithms run faster. Moreover, since our algorithms ensure that the resulting decision trees are consistent with all training examples, we can see that the accuracy remains very high compared to the algorithm which randomly considers only 10% of the training examples. Note that the algorithm performs better than our algorithm , but their algorithm is a heuristic that infers a decision tree without any guarantee of optimality. In addition, because does not have an optimality constraint, it seeks to minimize the number of nodes of a general decision tree without any constraint on the depth of the tree. This way, finds trees with fewer nodes, but deeper than ours.
5.3 Other datasets
As we mentioned in the introduction of the paper, there is a series of algorithms that address a different but very similar problem to ours: inferring an optimal decision tree with a given depth such that the total classification error on the training examples is minimized. In this section, we compare our results against these algorithms. The datasets we use are extracted from the paper of Verwer and Zhang [22] and are available at https://github.com/SiccoVerwer/binoct. Each dataset corresponds to a 5-fold cross-validation. In their paper, Verwer and Zhang compare their approach to two other approaches. The first one is , run from scikit-learn with its default parameter setting but with a fixed maximum depth of the trees generated, and the second one is OCT from Bertsimas and Dunn [5]. The time limit used is 10 minutes for and 30 minutes to 2 hours for . The depth of tree used is between and , but we report in Table 3 the best value among the three depths tried.
|Dataset||time (s)||acc.||k||time (s)||acc.||n||acc.||acc.||acc.|
It should be noted that only 6 of the 16 datasets present in the Verwer and Zhang paper could be executed. The reason is that the decision trees consistent with some datasets are too large and deep to be inferred. In contrast to the algorithms to which we compare ours, we cannot set a maximum tree depth value because all examples must be correctly classified by the tree we infer.
Note that in Table 3, our algorithms and are very fast even when the trees to be inferred are large. In fact, for the dataset “balance-scale”, our algorithms infer decision trees of depth in a few minutes while the other algorithms require more time for trees of depth .
Concerning the accuracy of the trees we infer, it seems that when the depth is small () accuracy is equal for all approaches. However, when the depth becomes bigger, then our algorithms get higher accuracy. The most obvious example is the dataset balance-scale where we get accuracy compared to for .
5.4 Artificial dataset
A final series of experiments was carried out in order to see how the execution times of our two algorithms change with different parameters of the datasets. The parameters we evaluate are the depth of the tree to infer, the number of features, the number of classes and the number of training examples. In order to perform the experiment, we randomly generate decision trees with the desired depth, number of features and number of classes, then we randomly generate training examples from such trees. In the four panels of Figure 5, we set three of these parameters and vary the remaining one to observe the effect on execution time.
In Figure 5, when the number of classes varies, time seems to grow linearly. The algorithm appears to have a coefficient of which means that the fact that the algorithm is able to infer a decision tree that contains several classes does not bring any gain (or loss) in performance compared to the method consisting of inferring only decision trees with two classes and that would infer decision trees (one for each class). However, the algorithm has a coefficient lower than . Thus, the algorithm’s ability to infer decision trees with multiple classes provides a performance gain in this case.
In Figure 5, when the depth varies, time grows exponentially with . However, the number of nodes of a tree is generally exponential in its depth. So, the figure indicates that the inference time increases almost polynomially with the number of nodes of the inferred tree.
In Figure 5, when the number of features varies, time grows almost exponentially with the number of features. It is thus the number of features that seems to have the most impact on the inference time. One way to improve our method is therefore to try to reduce this impact.
In Figure 5, when the number of training examples varies, we observe that time grows rapidly until it reaches its peak. This peak corresponds to the number of examples that the algorithm needs to infer a decision tree consistent with all training examples. Thus, adding more examples will not affect the inference time, unlike the previous approaches using SAT solvers.
We have presented a method that can infer an optimal decision tree for two definitions of optimality. Under the first definition, a decision tree of minimal depth that is consistent with the training examples is optimal. The second definition of optimality adds the constraint that the tree, in addition to having a minimum depth, must also have a minimum number of nodes. Although this optimal decision tree inference problem is known to be NP-complete [14, 11], we proposed an effective method to solve it.
Our first contribution is an effective SAT formulation that allows us to infer perfect decision trees for a fixed depth consistent with training examples. We have also shown how to add constraints in order to set the maximum number of nodes. In this case, the inferred decision tree will no longer necessarily be a perfect tree.
Our second contribution addresses the scalability issue. Indeed, previous approaches using SAT solvers have the disadvantage that the execution time increases significantly with the number of training examples [16, 7]. Thus, we proposed an approach which does not process all the examples at once but instead processes them incrementally. The idea of processing a set of training examples incrementally is to consider one example at a time, generate a decision tree and verify that it is consistent with the remaining examples. If it is not, we choose an example that is incorrectly classified by the decision tree, i.e., a counterexample, and use it to refine the decision tree.
We evaluated our algorithms using various experiments and compared the execution time and quality of decision trees with other optimal approaches. Experimental results show that our approach performs better than other approaches, with shorter execution times, better prediction accuracy and better scalability. In addition, our algorithms have been able to process datasets for which, to the best of our knowledge, there are no other inference methods capable of producing optimal models consistent with these datasets.
- [1] IJCAI Workshop on Explainable Artificial Intelligence. 2017.
- [2] ACM Conference on Fairness, Accountability, and Transparency. 2018.
- [3] Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo Seltzer, and Cynthia Rudin. Learning certifiably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 35–44. ACM, 2017.
- [4] Michal Bartnikowski, Travis J Klein, Ferry PW Melchels, and Maria A Woodruff. Effects of scaffold architecture on mechanical characteristics and osteoblast response to static and perfusion bioreactor cultures. Biotechnology and Bioengineering, 111(7):1440–1451, 2014.
- [5] Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106(7):1039–1082, 2017.
- [6] Dimitris Bertsimas and Romy Shioda. Classification and regression via integer optimization. Operations Research, 55(2):252–271, 2007.
- [7] Christian Bessiere, Emmanuel Hebrard, and Barry O’Sullivan. Minimising decision tree size as combinatorial optimisation. In International Conference on Principles and Practice of Constraint Programming, pages 173–187. Springer, 2009.
- [8] Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and regression trees. Wadsworth International Group, Belmont, CA, 1984.
- [9] Niklas Eén and Niklas Sörensson. An extensible SAT-solver. In International Conference on Theory and Applications of Satisfiability Testing, pages 502–518. Springer, 2003.
- [10] Randy Goebel, Ajay Chander, Katharina Holzinger, Freddy Lecue, Zeynep Akata, Simone Stumpf, Peter Kieseberg, and Andreas Holzinger. Explainable AI: the new 42? In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, pages 295–303. Springer, 2018.
- [11] Thomas Hancock, Tao Jiang, Ming Li, and John Tromp. Lower bounds on learning decision lists and trees. Information and Computation, 126(2):114–122, 1996.
- [12] Simon Haykin. Neural networks, volume 2. Prentice Hall, New York, 1994.
- [13] Gordon V Kass. An exploratory technique for investigating large quantities of categorical data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 29(2):119–127, 1980.
- [14] Laurent Hyafil and Ronald L Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.
- [15] Oscar Li, Hao Liu, Chaofan Chen, and Cynthia Rudin. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [16] Nina Narodytska, Alexey Ignatiev, Filipe Pereira, and Joao Marques-Silva. Learning optimal decision trees with SAT. In IJCAI, pages 1362–1368, 2018.
- [17] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
- [18] Lior Rokach and Oded Z Maimon. Data mining with decision trees: theory and applications, volume 69. World Scientific, 2008.
- [19] Steven L Salzberg. C4.5: Programs for machine learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Machine Learning, 16(3):235–240, 1994.
- [20] Michael Van Lent, William Fisher, and Michael Mancuso. An explainable artificial intelligence system for small-unit tactical behavior. In Proceedings of the National Conference on Artificial Intelligence, pages 900–907. AAAI Press; MIT Press, 2004.
- [21] Vladimir N Vapnik. The nature of statistical learning theory. Springer, 1995.
- [22] Sicco Verwer and Yingqian Zhang. Learning optimal classification trees using a binary linear program formulation. In 33rd AAAI Conference on Artificial Intelligence, 2019.
Illustration of the inferring algorithm
We present here an illustration of the inferring algorithm of a decision tree for depth 2. We use the following dataset and