1 Introduction
Decision trees (breiman1984classification) are highly interpretable models, which makes them favorable in high-stakes domains such as medicine (valdes2016mediboost; martinez2018circadian), criminal justice (bitzer2016analyse), and network analysis (backstrom2006group; armandpour2019robust). They are also resistant, if not completely immune, to the inclusion of many irrelevant predictor variables. However, trees usually do not achieve high accuracy, which somewhat limits their use in practice. The main current approaches to improving the performance of decision trees are growing large trees or using ensemble methods (dietterich2000experimental; dietterich2000ensemble; hastie2009elements; zhou2012ensemble), such as bagging (breiman2001random) and boosting (drucker1996boosting; freund1997decision), all of which come at the price of harming model interpretability. There is thus a trade-off between the accuracy and the interpretability of a decision tree model.
Prior work has attempted to address this challenge and improve the performance of trees by introducing oblique tree models (heath1993induction; murthy1994system). These families of models are generalizations of classical trees in which the decision boundaries are hyperplanes that are not constrained to be axis-parallel and can have an arbitrary orientation. This change in the decision boundaries has been shown to reduce the size of the trees. However, the tree size is often still too large on real datasets to be amenable to interpretation. There has been an extensive body of research on improving the training of oblique trees and enhancing their performance (wickramarachchi2016hhcart; bertsimas2017optimal; carreira2018alternating; lee2019locally), yet their large size remains a challenge.
In this paper, we propose convex polytope decision trees (CPT) to expand the class of oblique trees by extending hyperplane cuts to more flexible geometric shapes. To be more specific, the decision boundaries induced by each internal node of CPT are based on the noisy-OR (pearl2014probabilistic) of multiple linear classifiers. Since noisy-OR has been widely accepted as an interpretable Bayesian model (richens2020improving), our generalization keeps the interpretability of oblique trees intact. Furthermore, CPT’s decision boundaries geometrically resemble a convex polytope (i.e., a high-dimensional convex polygon). Therefore, the decisions at each node have both a logical and a geometrical interpretation. We use the gamma process (ferguson1973bayesian; kingman1992poisson; zhou2018parsimonious), a nonparametric Bayesian prior, to infer the number of polytope facets adaptively at each internal tree node and to regularize the capacity of the proposed CPT. A realization of the gamma process consists of countably infinite atoms, each of which is used to represent a weighted hyperplane of a convex polytope. The shrinkage property of the gamma process encourages simpler decision boundaries, which helps resist overfitting and improves interpretability.
The training of CPT, like that of oblique trees, is a challenging task because it requires learning both the structure (i.e., the topology of the tree and the cutoff for the decision boundaries) and the parameters (i.e., the parameters of noisy-OR). Learning the structure is a discrete optimization problem, involving a search over a potentially large problem space. In this work, we present two fully differentiable approaches for learning CPT models, one based on mutual information maximization, applicable to both binary and multiclass classification, and the other based on variance minimization, applicable to regression. The differentiable training allows one to use modern stochastic gradient descent (SGD) based programming frameworks and optimization methods for learning the proposed decision trees for both classification and regression.
Experimentally, we compare the performance of CPT to state-of-the-art decision tree algorithms (carreira2018alternating; lee2019locally) on a variety of representative regression and classification tasks. We experiment with several real-world datasets from diverse domains, such as computer vision, tabular data, and chemical property data. Experiments demonstrate that CPT outperforms state-of-the-art methods with higher accuracy and smaller size.
Our main contributions include: 1) We propose an interpretable generalization of the family of oblique decision tree models; 2) We regularize the expressive power of CPT, using a nonparametric Bayesian shrinkage prior for each node split function; 3) We provide two scalable and differentiable ways of learning CPT models, one for classification and the other for regression, which efficiently search for the optimal tree; 4) We experimentally evaluate CPT on several different types of predictive tasks, illustrating that this new approach outperforms prior work by achieving higher accuracy with a smaller size.
2 Related Work
Interpretable inference is of utmost importance (e.g., ahmad2018interpretable; zhang2018interpretable; sadeghian2019drum), and decision trees are a popular choice for it. Most of the literature on decision trees has focused on how to train a single tree or an ensemble of multiple ones (hastie2009elements). There has been little work on making decision boundaries more flexible. One of the reasons for this lack of research is that the computational complexity of the problem increases even for simple generalizations of the decision boundaries. For example, with $N$ as the number of data points and $d$ as the dimension of the input space, generalizing the coordinate-wise cut to an oblique hyperplane cut increases the number of possible splits of the data points from $O(Nd)$ to $O(N^{d})$ for a single node alone (vapnik16ja).
Some methods perform hyperplane cuts in an extended feature space, created by concatenating the original features and newly generated ones (ahmad2014decision), to get more flexible decision boundaries. These new features can be engineered or kernel-based and are not designed for interpretability, but rather to gain performance in an ensemble of such trees using Random Subspaces (ho1998random). The remainder of this section reviews the training algorithms for (oblique) trees.
Conventional methods for decision tree induction are greedy: they grow the tree one node at a time. The greedy construction of oblique trees can be done by using coordinate descent to learn the parameters of each split (murthy1994system), or by projecting the feature space to a lower dimension and then using coordinate cuts (menze2011oblique; wickramarachchi2016hhcart). However, the greedy procedure often leads to suboptimal trees.
There have been several attempts at non-greedy optimization, which rely on either fuzzy or probabilistic split functions (suarez1999globally; jordan1994statistical; kontschieder2015deep). The probabilistic trees are sometimes referred to as soft decision trees (frosst2017distilling) and have been applied to computer vision problems (kontschieder2015deep; hehn2019end). In these methods, the assignment of a single sample to a leaf is fuzzy or probabilistic, and gradient descent is used to optimize the tree. Most of these algorithms remain probabilistic at test time, which leads to uninterpretable models because the prediction for each sample is based on multiple leaves of the tree instead of just one. To the best of our knowledge, no probabilistic trees in the literature have been developed for the regression task.
Other advances towards the training of an oblique tree are based on constructing neural networks that reproduce decision trees (yang2018deep; lee2019locally). yang2018deep use a neural network with argmax activations to represent classic decision trees with coordinate cuts, but their approach does not scale to high-dimensional data. lee2019locally use the gradient of a ReLU network with a single hidden unit at each layer and skip-connections to construct an oblique decision tree. They achieve state-of-the-art results on some molecular datasets, but they have to build complete trees for a given depth and need high-depth trees.
In contrast to our method, other training algorithms require the structure of the tree to be given at the beginning. Some of the works in this direction, such as bennett1994global and bertsimas2017optimal, use linear programming, or mixed-integer linear programming, to find a globally optimal tree. Therefore, these methods are computationally expensive and not scalable. norouzi2015efficient derive a convex-concave upper bound on the tree’s empirical loss and optimize that loss using SGD. A recent work (carreira2018alternating) proposes tree alternating optimization, where one directly optimizes the misclassification error over separable subsets of nodes, achieving state-of-the-art empirical performance on some datasets (zharmagambetov2019experimental).
We conclude this section by relating our splitting rule at each internal node to some relevant classification algorithms (aiolli2005multiclass; manwani2010learning; manwani2011polyceptron; wang2011trading; kantchelian2014large; zhou2018parsimonious). The two most related works are the convex polytope machine (CPM) (kantchelian2014large) and the infinite support hyperplane machine (iSHM) (zhou2018parsimonious), which both exploit the idea of learning a decision boundary associated with a convex polytope. In particular, iSHM can be considered as a decision stump (a tree of depth one) of CPT. iSHM is like a single-hidden-layer neural network, which provides different values for different input features (at test time, therefore less interpretable) and is restricted to the binary classification task. By contrast, CPT’s decision function at test time, at each node, assigns only one of two values to each point of the feature space, sending the data to the right or left. We stack those binary classifiers in the form of a decision tree. This provides a locally constant function for any task with a differentiable objective. Thus it is not necessarily restricted to the binary classification task.
3 Convex Polytope Tree and Its Inference Algorithms
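Suppose we are given the training data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, where the pairs $(x_n, y_n)$ are drawn independently from an identical and unknown distribution $\mathcal{P}$. Each $x_n \in \mathbb{R}^{d}$ is a $d$-dimensional data point with a corresponding label $y_n$. In the classification setting, $y_n \in \{1, \ldots, C\}$, and in the regression scenario, $y_n \in \mathbb{R}$. The aim is to learn a function $f$ that will perform well in predicting the label on samples from $\mathcal{P}$.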
Decision tree methods construct the function $f$ by recursively partitioning the feature space to yield a number of hierarchical, disjoint regions, and assign a single label (value) to each region. The final tree is comprised of branch and leaf nodes, where the branch nodes make the splits and the leaf nodes assign values to each related region. For both classical (breiman1984classification; quinlan1986induction; quinlan2014c4) and oblique trees (murthy1993oc1; murthy1994system), the decision boundary at each branch node can be expressed as whether $\beta' x > 0$ or $\beta' x \le 0$. We do not explicitly consider the bias term in the decision boundary because we assume, for the sake of notational brevity, that $x$ includes a constant unitary component corresponding to a bias term. The vector $\beta$, in the case of classical trees, is limited to having just one coordinate equal to one and the rest equal to zero, other than the coordinate corresponding to the bias. However, oblique trees do not place any restriction on $\beta$. In what follows, we will explain how we move beyond the oblique decision trees.
3.1 Convex Polytope Constrained Decision Boundary
By extending the idea of disjunctive interaction (noisy-OR, also commonly referred to as probabilistic-OR) (shwe1991probabilistic; jaakkola1999variational; zhou2018parsimonious) from probabilistic reasoning to the decision tree problem, we make the decision boundaries more flexible while preserving interpretability. To that end, we transform the problem of a node splitting, right or left, to a committee of experts that make individual binary decisions (“Yes” or “No”). Note the probabilistic-OR construction shown below, while being closely related to iSHM (zhou2018parsimonious), is distinct from it in that the Yes/No decisions are latent rather than observed binary labels. The committee votes “Yes” if and only if at least one expert votes “Yes”, otherwise votes “No”. Thus, the final vote at each node is
$$\text{vote} = \bigvee_{i=1}^{K} \text{vote}_i,$$
where $\bigvee$ denotes the logical OR operator. We model each expert as a linear classifier who votes “Yes” with probability
$$p_i = 1 - \left(1 + e^{\beta_i' x}\right)^{-r_i}, \qquad (1)$$
where $r_i$ and $\beta_i$ are parameters of expert $i$. Now assuming that each expert votes independently, we can express the probability of the committee voting “Yes” as
$$P(\text{vote} = \text{“Yes”} \mid \{r_i, \beta_i\}_i, x) = 1 - \prod_{i=1}^{K} (1 - p_i) = 1 - e^{-\sum_{i=1}^{K} r_i \ln\left(1 + e^{\beta_i' x}\right)},$$
where $p_i$ is the probability of expert $i$ voting “Yes” and $K$ is the total number of experts. We can now define the split function at each node by thresholding the committee voting probability:
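$$x \in \begin{cases} S_{R}, & \text{if } P(\text{vote} = \text{“Yes”} \mid \{r_i, \beta_i\}_i, x) > 0.5, \\ S_{L}, & \text{otherwise}, \end{cases}$$
where $S_{R}$ and $S_{L}$ are the related (right and left) splits of the space.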
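The committee construction above can be illustrated with a minimal sketch. It assumes that each expert's "Yes" probability takes the form implied by the committee expression, namely $1 - (1 + e^{\beta_i' x})^{-r_i}$, and that the split thresholds the committee probability at 0.5. The function names and the toy parameters are hypothetical and are not taken from the released code.

```python
import math

def softplus(z):
    # Numerically stable ln(1 + e^z).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def expert_yes_prob(r, beta, x):
    # Assumed expert form: 1 - (1 + e^{beta' x})^{-r}; x carries a trailing 1 for the bias.
    return 1.0 - math.exp(-r * softplus(dot(beta, x)))

def committee_yes_prob(rs, betas, x):
    # Noisy-OR of independent experts: 1 - prod_i (1 - p_i).
    total = sum(r * softplus(dot(beta, x)) for r, beta in zip(rs, betas))
    return 1.0 - math.exp(-total)

def goes_right(rs, betas, x, threshold=0.5):
    # Deterministic split obtained by thresholding the committee probability.
    return committee_yes_prob(rs, betas, x) > threshold

# Toy example (hypothetical values): two experts facing opposite directions.
rs = [1.0, 1.0]
betas = [[1.0, 0.0], [-1.0, 0.0]]
x = [0.0, 1.0]  # one feature equal to 0.0 plus the constant bias component
p = committee_yes_prob(rs, betas, x)
```

In this toy case each expert alone votes "Yes" with probability one half, so the committee probability is larger than that of either expert, which is the disjunctive behavior the noisy-OR is meant to capture.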
To elaborate on the geometric shape and interpretability of the decision boundaries, consider $K = 1$ (i.e., a single expert). In this scenario, the decision boundary becomes a hyperplane, which is perpendicular to $\beta_1$. In fact, the probability function for each expert is based on the signed distance of $x$ to the hyperplane perpendicular to $\beta_i$. The parameter $r_i$ controls how smoothly the probability transitions from 0 to 1, where a larger $r_i$ leads to sharper changes. This class of models with $K = 1$ and $r_1 = 1$ is identical to oblique trees, which are interpretable. The interpretability of $K > 1$ is provided by the fact that linear classifiers and the probabilistic-OR operation are interpretable (richens2020improving).
To geometrically analyze the implied decision regions, we provide the following theorem.
Theorem 1.
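For any $\{r_i, \beta_i\}_{i=1}^{K}$, such that $r_i > 0$ and $\beta_i \in \mathbb{R}^{d+1}$, let:
$$A = \{x : P(x) \le 0.5\},$$
where:
$$P(x) = 1 - e^{-\sum_{i=1}^{K} r_i \ln\left(1 + e^{\beta_i' x}\right)}, \qquad (2)$$
then $A$ is a convex set, confined by a convex polytope.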
The proof is provided in the Appendix.
The above theorem shows that, for $K > 1$, the decision region ($A$) is a convex set confined by a convex $K$-sided polytope. More precisely, each facet of the convex polytope is a hyperplane corresponding to an expert and perpendicular to its $\beta_i$. It is also worth noting that an expert with a larger $r_i$ has more effect on the decision boundary, making sharper changes to the probability function. This can also be perceived as the value of that expert's decision in the committee. Therefore, our method not only has a strong relationship with probabilistic-OR type models that provide interpretability for the model parameters (almond2015bayesian), but also has decision boundaries with interpretable geometric characteristics. We propose a class of models, Convex Polytope Trees (CPT), where each node of the tree follows the above splitting function.
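To make the facet structure concrete, the following short derivation sketches why each expert contributes one bounding half-space. It assumes the committee probability has the noisy-OR form given in Section 3.1 and that the region of interest is where this probability does not exceed a threshold $\tau \in (0, 1)$; the symbol $\tau$ is introduced here only for illustration.

```latex
\begin{align*}
1 - e^{-\sum_{i=1}^{K} r_i \ln(1 + e^{\beta_i' x})} \le \tau
  &\iff \sum_{i=1}^{K} r_i \ln\left(1 + e^{\beta_i' x}\right) \le -\ln(1 - \tau). \\
\intertext{Every term of the sum is non-negative, so each term is bounded by the same constant:}
r_i \ln\left(1 + e^{\beta_i' x}\right) \le -\ln(1 - \tau)
  &\implies \beta_i' x \le \ln\left((1 - \tau)^{-1/r_i} - 1\right), \quad i = 1, \ldots, K. \\
\intertext{For $\tau = 0.5$ the half-spaces become}
\beta_i' x &\le \ln\left(2^{1/r_i} - 1\right), \quad i = 1, \ldots, K.
\end{align*}
```

Under these assumptions the region is contained in the intersection of $K$ half-spaces, one per expert, and a larger $r_i$ moves the corresponding bounding hyperplane, which matches the reading of $r_i$ as the weight of an expert in the committee.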
3.2 Gamma Process Prior
To regularize CPT and encourage simpler decision boundaries, we use a nonparametric Bayesian shrinkage prior. Specifically, we put the gamma process prior (ferguson1973bayesian; kingman1992poisson; zhou2018parsimonious) on the splitting function of each node in the tree. Each realization of the gamma process, consisting of countably infinite weighted atoms whose total weight is a finite gamma random variable, can be described as
$$G = \sum_{i=1}^{\infty} r_i \, \delta_{\beta_i}, \qquad (3)$$
where $\beta_i$ represents an atom with weight $r_i$. More details about the gamma process can be found in kingman1992poisson. We put the prior on the CPT by considering $r_i$ and $\beta_i$ as the parameters of the splitting function related to equation (3.1). Due to the gamma process’s inherent shrinkage property, just a small finite number of experts will have non-negligible weights at each node. This behavior encourages the model to have simpler decision boundaries (i.e., a smaller number of experts or, equivalently, fewer polytope facets) at each node. This improves the interpretability and regularization of the model. The gamma process allows a potentially infinite number of experts at each node. For the convenience of implementation, we truncate the gamma process to a large finite number of atoms.
To further encourage simpler models at each node of the tree, we also put a shrinkage prior on $\beta_k$ of each expert. In particular, we consider the prior:
$$\beta_{jk} \sim \mathcal{N}(0, \sigma_{jk}^{2}), \qquad \sigma_{jk}^{2} \sim \text{InvGamma}(a_\beta, b_{\beta k}), \qquad (4)$$
which encourages sparsity due to the InvGamma distribution on the scale parameter (tipping2001sparse; zhou2018parsimonious).
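The shrinkage effect of a truncated gamma process can be illustrated with a small simulation. This is only a sketch under an assumption: it uses the common finite approximation in which each of the $K$ truncated weights is drawn as a gamma variable with shape $\gamma_0 / K$ and rate $c_0$. The function name, the default hyperparameter values, and the cutoff used to call a weight "active" are all hypothetical.

```python
import random

def sample_truncated_gamma_weights(K, gamma0=1.0, c0=1.0, seed=0):
    # Assumed finite approximation: r_k ~ Gamma(shape=gamma0 / K, scale=1 / c0).
    rng = random.Random(seed)
    return [rng.gammavariate(gamma0 / K, 1.0 / c0) for _ in range(K)]

weights = sample_truncated_gamma_weights(K=50)

# Experts whose weight exceeds a (hypothetical) cutoff are treated as active facets.
active = [r for r in weights if r > 1e-2]
print(f"{len(active)} of {len(weights)} experts have non-negligible weight")
```

Because the shape parameter shrinks as the truncation level grows, most sampled weights are typically very close to zero, which is the behavior that lets the number of effective polytope facets adapt at each node.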
3.3 Training Algorithm
Finding an optimal CPT requires solving a combinatorial, non-differentiable optimization problem. This is due to the large number of ways in which any single node can separate the data. We propose a continuous relaxation of the splitting rule of each node to alleviate this computationally challenging task. In particular, each internal node makes probabilistic rather than deterministic decisions to send samples to its right or left branch. We set the probability of going right the same as (3.1), or any monotonic function of it. We use this probabilistic version to train the tree in a differentiable manner. At test time, we threshold the splitting functions to provide a deterministic tree. Below we explain in detail the proposed training algorithm for the parameters and structure of the tree.
3.3.1 Learning Split Parameters
Assuming the tree structure is given, we first explain how to infer the tree parameters.
For classification, we formulate the training as an optimization problem by considering the mutual information between the two random variables $Y$ (category label) and $L$ (leaf id) as our objective function. This may seem similar to previous literature on learning a classical decision tree, but it differs in two main ways: 1) we develop and optimize the mutual information for a probabilistic rather than deterministic tree, and 2) we learn the parameters of all nodes jointly rather than learning them in a greedy fashion.
We model our probabilistic tree by letting
(5) 
where $\mathcal{L}$ is the set of all leaf nodes and $P(L = l \mid x)$ is the probability of arriving at leaf $l$ given the sample feature $x$. We assume each internal node makes decisions independent of the others and use the probabilities in (3.1) when sending a data sample to the left or right branch. This assumption lets us derive a formula for $P(L = l \mid x)$ as
$$P(L = l \mid x) = \prod_{j \in \mathcal{P}_l} P(d_{lj} \mid x), \qquad (6)$$
where
$$P(d_{lj} \mid x) = p_j(x)^{d_{lj}} \left(1 - p_j(x)\right)^{1 - d_{lj}}, \qquad (7)$$
and $\mathcal{P}_l$ is a path from the root to leaf $l$, $d_{lj}$ encodes the right or left ($1$ or $0$) direction taken at node $j$, and $p_j(x)$ is the probability of going right at node $j$. The mutual information between $Y$ and $L$ can be expressed as
$$I(Y; L) = H(Y) - H(Y \mid L), \qquad (8)$$
where $H(\cdot)$ indicates the entropy of a random variable, and $Y \mid L = l$ follows a categorical distribution. Notice that, since $H(Y)$ does not depend on the tree parameters, to optimize the mutual information, we only need to minimize $H(Y \mid L)$. However, the evaluation of the conditional entropy term requires knowledge of the entire data distribution, thus we cannot directly optimize (3.3.1).
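The leaf-arrival probability described above, a product of independent left/right decisions along the root-to-leaf path, can be sketched as follows. The tree encoding used here (a dictionary from each leaf to a list of `(node, go_right)` pairs) is a hypothetical representation chosen for illustration, not the data structure of the released implementation.

```python
def leaf_arrival_probs(paths, right_prob):
    """Return P(leaf | x) for every leaf.

    paths: maps each leaf id to its root-to-leaf path, a list of
           (node_id, go_right) pairs with go_right in {0, 1}.
    right_prob: maps each internal node id to its probability of sending x right.
    """
    probs = {}
    for leaf, path in paths.items():
        p = 1.0
        for node, go_right in path:
            # Independent node decisions multiply along the path.
            p *= right_prob[node] if go_right else 1.0 - right_prob[node]
        probs[leaf] = p
    return probs

# Toy tree: root 0 with left leaf "A"; its right child 1 has leaves "B" and "C".
paths = {"A": [(0, 0)], "B": [(0, 1), (1, 0)], "C": [(0, 1), (1, 1)]}
probs = leaf_arrival_probs(paths, right_prob={0: 0.8, 1: 0.25})
```

Since every sample must end in exactly one leaf, the returned probabilities sum to one, which is a convenient sanity check when implementing the soft tree.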
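To make the training possible, we approximate the true data distribution with the empirical one to get
$$\hat{P}(L = l) = \frac{1}{N} \sum_{n=1}^{N} P(L = l \mid x_n). \qquad (9)$$
Denote $\mathbb{1}[\cdot]$ as an indicator function. By using Bayes’ rule, we derive $\hat{\pi}_l$, the estimated probability vector of the categorical distribution for $Y \mid L = l$, as
$$\hat{\pi}_{lc} = \hat{P}(Y = c \mid L = l) = \frac{\sum_{n=1}^{N} \mathbb{1}[y_n = c] \, P(L = l \mid x_n)}{\sum_{n=1}^{N} P(L = l \mid x_n)}. \qquad (10)$$
Now by using (10), we can approximate the entropy term as
$$\hat{H}(Y \mid L = l) = -\sum_{c=1}^{C} \hat{\pi}_{lc} \ln \hat{\pi}_{lc}. \qquad (11)$$
Therefore, we can provide an estimator for $H(Y \mid L)$ as
$$\hat{H}(Y \mid L) = \sum_{l \in \mathcal{L}} \hat{P}(L = l) \, \hat{H}(Y \mid L = l). \qquad (12)$$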
By minimizing $\hat{H}(Y \mid L)$ with respect to the tree parameters, we in fact maximize the mutual information $I(Y; L)$. As discussed in Section 3.2, we also regularize CPT by adding a penalty term to (12). We consider the negative log probability of the gamma process prior truncated to $K$ atoms, where $\{r_k, \beta_k\}_{k=1}^{K}$ are the parameters of an internal node's splitting function.
The penalty term for each internal node can be mathematically formulated as
$$\sum_{k=1}^{K} \left( -\left(\frac{\gamma_0}{K} - 1\right) \ln r_k + c_0 \, e^{\ln r_k} \right) + \left(a_\beta + \frac{1}{2}\right) \sum_{j=0}^{d} \sum_{k=0}^{K} \ln\left(1 + \frac{\beta_{jk}^{2}}{2 b_{\beta k}}\right).$$
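A per-node penalty of this kind can be sketched in a few lines. The sketch assumes the first sum is the negative log density of gamma-distributed weights with shape $\gamma_0 / K$ and rate $c_0$, parameterized through $\ln r_k$, and that the second sum is the Student-t style term over all coordinates of all $K$ experts. The function name and the default hyperparameter values are hypothetical.

```python
import math

def node_penalty(log_r, beta, gamma0=1.0, c0=1.0, a_beta=1e-6, b_beta=None):
    """Penalty for one internal node.

    log_r: list of ln r_k for the K experts.
    beta:  list of K weight vectors, each of length d + 1 (bias included).
    """
    K = len(log_r)
    if b_beta is None:
        b_beta = [1.0] * K  # hypothetical default scale for every expert
    # Gamma part: -(gamma0/K - 1) * ln r_k + c0 * exp(ln r_k), summed over experts.
    gamma_part = sum(-(gamma0 / K - 1.0) * lr + c0 * math.exp(lr) for lr in log_r)
    # Sparsity part: (a_beta + 1/2) * sum of ln(1 + beta_jk^2 / (2 * b_beta_k)).
    sparsity_part = (a_beta + 0.5) * sum(
        math.log1p(b_jk ** 2 / (2.0 * b_beta[k]))
        for k, beta_k in enumerate(beta)
        for b_jk in beta_k
    )
    return gamma_part + sparsity_part

penalty = node_penalty(log_r=[0.0, -3.0], beta=[[0.5, -1.0, 0.0], [0.0, 0.0, 0.0]])
```

Working with $\ln r_k$ rather than $r_k$ keeps the weights positive without a constraint, which is convenient when the penalty is added to a loss optimized by gradient descent.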
The above procedure provides a differentiable way of learning the branch node parameters, which dictate how a data sample arrives at a leaf node. At the end of the training algorithm, we also need to assign the leaf node parameters, which determine how the tree predicts a sample. We pass the whole training set through the tree and assign the empirical distribution of all categories at each leaf node as its node parameters. This way of determining the leaf parameters has been shown to achieve the highest AUC in binary classification (ferri2002learning).
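The classification objective described above, an empirical conditional entropy of the label given the leaf under soft leaf assignments, can be sketched as follows. It assumes the leaf-arrival probabilities have already been computed for every training sample; the function name and the list-of-lists layout are hypothetical.

```python
import math

def soft_conditional_entropy(leaf_probs, labels, num_classes):
    """Estimate H(Y | L) from soft leaf assignments.

    leaf_probs[n][l] is the probability that sample n arrives at leaf l;
    labels[n] is an integer class in {0, ..., num_classes - 1}.
    """
    n_samples = len(labels)
    total = 0.0
    for l in range(len(leaf_probs[0])):
        mass = sum(row[l] for row in leaf_probs)  # soft count of samples in leaf l
        if mass <= 0.0:
            continue
        entropy = 0.0
        for c in range(num_classes):
            # Soft class frequency inside leaf l.
            pi = sum(row[l] for row, y in zip(leaf_probs, labels) if y == c) / mass
            if pi > 0.0:
                entropy -= pi * math.log(pi)
        total += (mass / n_samples) * entropy
    return total

pure = soft_conditional_entropy([[1.0, 0.0], [0.0, 1.0]], labels=[0, 1], num_classes=2)
mixed = soft_conditional_entropy([[0.5, 0.5], [0.5, 0.5]], labels=[0, 1], num_classes=2)
```

A tree that routes each class to its own leaf drives the estimate to zero, while a tree whose routing ignores the label leaves it at the entropy of the label itself; every operation is a sum, product, or logarithm of the leaf probabilities, so the estimate remains differentiable in the split parameters.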
For regression, we replace the mutual information optimization with a variance reduction criterion. To be more specific, we learn the tree parameters with
(13) 
such that
(14) 
and the calculation of is exactly the same as in the classification case. The leaf parameters are set as the mean response of the points arriving at that leaf.
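One natural reading of the variance reduction criterion is a leaf-weighted variance of the response under soft leaf assignments, mirroring the classification objective. The sketch below implements that reading; it is an assumption about the form of the objective rather than a transcription of it, and the function name and data layout are hypothetical.

```python
def soft_weighted_variance(leaf_probs, targets):
    """Leaf-weighted response variance under soft leaf assignments.

    leaf_probs[n][l] is the probability that sample n arrives at leaf l;
    targets[n] is the real-valued response of sample n.
    """
    n_samples = len(targets)
    total = 0.0
    for l in range(len(leaf_probs[0])):
        mass = sum(row[l] for row in leaf_probs)  # soft count of samples in leaf l
        if mass <= 0.0:
            continue
        # Soft mean and variance of the response inside leaf l.
        mean = sum(row[l] * y for row, y in zip(leaf_probs, targets)) / mass
        var = sum(row[l] * (y - mean) ** 2 for row, y in zip(leaf_probs, targets)) / mass
        total += (mass / n_samples) * var
    return total

targets = [1.0, 1.0, 5.0, 5.0]
separated = soft_weighted_variance([[1, 0], [1, 0], [0, 1], [0, 1]], targets)
uninformative = soft_weighted_variance([[0.5, 0.5]] * 4, targets)
```

As in the classification case, the soft mean computed inside each leaf is also what a hard assignment would use as the leaf prediction, namely the mean response of the points arriving at that leaf.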
Note for both scenarios, we use the threshold of 0.5 to do deterministic splits at the test time. Now that we know how to train a tree given its structure and how to use it at the test time, the next section will describe how to find an optimal topology and initial parameters. Algorithm 1 of the Appendix summarizes the training method.
3.3.2 Topology Learning
We start by assuming the tree structure to be just a root node and its two child leaf nodes. We train this tree using the algorithm explained in Section 3.3.1. After training, we split the training set into two subsets (right and left) by thresholding the assigned probability. We choose the threshold that achieves the maximum mutual information in the classification task (or the maximum variance reduction in the regression task) in the deterministic tree. To be more specific, for the classification task, we set the threshold such that it minimizes
where
The is the ’th smallest probability assigned by the root node to the data samples. We further split each child node by considering it as root and applying the above algorithm using its data samples. We stop splitting a node with very few data samples and stop growing the tree when we reach a certain predefined maximum depth.
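The threshold search for the classification case can be sketched as a scan over the sorted probabilities assigned by the node, keeping the cut whose two subsets have the smallest size-weighted label entropy. Placing the threshold at the midpoint between consecutive sorted probabilities is an assumption made here for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def label_entropy(labels):
    # Entropy (in nats) of the empirical label distribution.
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def best_threshold(probs, labels):
    """Return (threshold, score) minimizing the weighted entropy of the hard split."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    sorted_labels = [labels[i] for i in order]
    n = len(labels)
    best_score, best_thr = float("inf"), None
    for i in range(1, n):
        lo, hi = probs[order[i - 1]], probs[order[i]]
        if lo == hi:
            continue  # no threshold can separate identical probabilities
        left, right = sorted_labels[:i], sorted_labels[i:]
        score = (i / n) * label_entropy(left) + ((n - i) / n) * label_entropy(right)
        if score < best_score:
            best_score, best_thr = score, 0.5 * (lo + hi)
    return best_thr, best_score

threshold, score = best_threshold([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
```

Only cuts between distinct sorted probabilities are considered, so the scan evaluates at most one candidate per training sample at the node.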

Note that during the proposed greedy tree-structure training, we do not perform any parameter refining using the method presented in Section 3.3.1. Refining the parameters each time a new node is added can further improve the accuracy, but it comes at the price of increased computational complexity.
4 Experiments
In this section, we empirically assess the qualitative and quantitative performances of CPT on datasets from various domains. We show that CPT learns significantly smaller trees than its counterparts and makes more robust and accurate predictions. That is partly due to the high variance of the leaf node’s prediction in classical (oblique) trees, resulting from fewer data samples in each partition. However, since CPT usually has fewer leaf nodes, each partition has a significant proportion of the dataset.
4.1 Synthetic Dataset
The aim of this section is to provide an illustrative example of why CPT achieves better performance when compared to other decision tree based algorithms. To that end, we consider a dataset of 2,000 points, as shown in Figure 1. The data samples are independent draws from the uniform distribution on a two-dimensional space, with the data points labeled as red if they lie between two concentric circles, and blue otherwise.
We compare CPT with LCN (lee2019locally), which is a state-of-the-art oblique tree method, and CART (breiman1984classification), which is a canonical axis-aligned tree algorithm. Figure 1 from left to right shows the decision boundaries of CART, LCN, and CPT, respectively. CART and LCN are trained for three different maximum depth parameters, 2, 6, and 10; the corresponding plots are shown from top to bottom. Due to the large number of nodes in LCN, we do not directly plot its decision boundaries. Instead, we illustrate the regions assigned to each of its leaf nodes with different shades of gray: the darker the gray, the more red labels in that region. It is worth noting that both LCN and CART are restricted to partitioning the feature space into disjoint convex polytopes and assigning each region to a leaf node. However, CPT does not have such a limitation, and each region can be the result of applying any set of logical operations on a set of convex polytopes. Figure 1 clearly shows that both LCN and CART need a large depth to successfully classify the data, while CPT only needs two splits. To be more specific, the AUC results for CART are with leaves and the AUC results for LCN are with leaves. By contrast, CPT achieves the AUC of (the highest score) with just 3 leaf nodes at maximum depth 2.
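A dataset of this kind can be generated with a few lines of code. The sampling square and the two radii below are hypothetical values chosen only to illustrate the construction, since the exact domain and radii are not reproduced here.

```python
import random

def make_ring_dataset(n=2000, r_inner=0.4, r_outer=0.8, seed=0):
    """Sample n points uniformly from [-1, 1]^2 (assumed domain) and label them.

    A point is "red" if it lies between the two concentric circles of radius
    r_inner and r_outer (hypothetical radii), and "blue" otherwise.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for _ in range(n):
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        radius = (x * x + y * y) ** 0.5
        points.append((x, y))
        labels.append("red" if r_inner < radius < r_outer else "blue")
    return points, labels

points, labels = make_ring_dataset()
```

The red region is an annulus, which is the difference of two convex sets. That is why a model that can combine convex regions logically needs only a couple of splits here, while a partition into many small convex cells needs a deep tree.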
Figure 1(c) shows that CPT uses a heptagon (a 7-sided 2D convex polytope) for each split. However, this number was not fixed at training time. We only limit the maximum number of polytope sides to $K$, the gamma process truncation level at each node. The model owes this adaptive shrinkage to the gamma process prior, which improves the simplicity and interpretability of the model.
4.2 Classification and Regression
We evaluate the performance of CPT for regression, binary classification, and multiclass classification tasks. For regression and binary classification, we conduct experiments on datasets from MoleculeNet (wu2018moleculenet). We follow the literature to construct features (wu2018moleculenet; lee2019locally) and use the same training, validation, and testing split as lee2019locally. For multiclass classification, we perform experiments on four benchmark datasets from LibSVM (chang2001libsvm), including MNIST, Connect4, SensIT, and Letter. We employ the provided training, validation, and testing sets when available; otherwise, we create them under the criterion specified in previous works (norouzi2015efficient; hehn2019end). The dataset statistics are summarized in Table 1.
4.2.1 Compared Baselines
We evaluate the performance of CPT against several state-of-the-art decision tree methods, including FTEM (hehn2019end, “End-to-end learning of decision trees and forests”), TAO (carreira2018alternating, “Oblique decision trees trained via alternating optimization”), and LCN (lee2019locally, “Oblique decision trees from derivatives of ReLU networks”). We also consider several additional baselines, including CART (cart84), HHCART (wickramarachchi2016hhcart), GUIDE (loh2014fifty), and CO2 (norouzi2015efficient). Moreover, to empirically show the importance of flexible boundaries, we also add a baseline CPT with $K = 1$, i.e., with hyperplane cuts.
Our algorithm is implemented in PyTorch and can be trained by gradient-based methods. We use Adam (kingma2014adam) optimization for inferring the tree split parameters. A 10-fold cross-validation on the combined train and validation set is used to learn the hyperparameters, namely the maximum number of polytope sides, the number of training epochs, the learning rate, and the batch size. However, we decide the depth of the tree based on the performance of CPT on the validation set during training, which can be perceived as early stopping for trees. Following the literature, we use the Area Under the receiver operating characteristic Curve (AUC) on the test set as the evaluation metric for binary classification, accuracy (ACC) for multiclass classification, and root-mean-squared error (RMSE) for regression.
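For reference, the three evaluation metrics can be written in plain Python as below. This is a minimal sketch with hypothetical function names; the AUC is computed through its pairwise-ranking definition (the fraction of positive and negative pairs ranked correctly, with ties counted as one half), which is quadratic in the number of samples and is meant for illustration rather than for large test sets.

```python
def auc(scores, labels):
    # Probability that a random positive is scored above a random negative.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def accuracy(predictions, labels):
    # Fraction of samples whose predicted class matches the true class.
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def rmse(predictions, targets):
    # Root-mean-squared error between predictions and real-valued targets.
    return (sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)) ** 0.5

example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
example_acc = accuracy([1, 0, 1], [1, 1, 1])
example_rmse = rmse([1.0, 2.0], [1.0, 4.0])
```

Higher is better for AUC and ACC, while lower is better for RMSE, which matters when reading the classification and regression columns of the results tables side by side.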
Finally, we report the average and standard error of each method’s performance by repeating our experiments for 10 random seed initializations. More details about our implementation and the exact values of hyperparameters for each dataset are presented in the Appendix. Our code to reproduce the results is provided at https://github.com/rezaarmand/Convex_Polytope_Trees.
4.2.2 Experimental Results
Tables 2 and 3 present the results for a variety of decision tree based algorithms. Some results are quoted from previous works (lee2019locally; hehn2019end; zharmagambetov2019experimental). The depth and leaf numbers are averaged over 10 repetitions of training, and then rounded. For some large datasets, namely MNIST, SensIT, Connect4, and Letter, we fix the depth parameter as opposed to adaptively tuning it based on the validation set on each run. The reason for some missing values in Table 2 is that some methods, such as TAO, did not release their code, so we could not report their performance on datasets they had not been evaluated on. Regarding the hyperparameter tuning for CPT, the total number of different hyperparameter tuning setups for each dataset was less than 25 (5*5) cases. For the baseline methods, we quote the best results tuned and reported by the authors (e.g., LCN, according to their paper, does at least (3*11) hyperparameter tuning). Also, for TAO, we report the best results by the authors. For each dataset, the best result and those with no statistically significant difference (by using a two-sample $t$-test and a $p$-value of 0.05) are highlighted.
From the results, it is evident that the added flexibility in splitting rules combined with an efficient training algorithm allows CPT to outperform the baseline algorithms. Our method achieves state-of-the-art performance while, notably, using significantly shallower trees. For instance, CPT obtains the best performance on Connect4 with only depth 2 and 4 leaves, while other methods need a depth of at least 8. It also improves the regression performance on the PDBbind dataset by a large margin. Although LCN achieves competitive results in terms of accuracy on some datasets, it needs to grow the tree’s size exponentially, significantly sacrificing the model interpretability. That is mainly because LCN, in contrast to our model, always learns a complete tree and generally needs to have a considerably large depth to achieve competitive results. For instance, consider its enormous size when trained on MNIST and Connect4 in Table 3.
5 Conclusion
We propose convex polytope trees (CPT) as a generalization to the class of oblique trees that improves their accuracy and shrinks their size, which consequently provides more robust predictions. CPT owes its performance to two main components: flexible decision boundaries and an efficient training algorithm. The proposed training algorithm well addresses the challenge to learn not only the parameters of the tree but also its structure. Moreover, we demonstrate the efficacy and efficiency of CPT on a variety of tasks and datasets. The empirical successes of CPT show promise for further research on other interpretable generalizations of decision boundaries. This can lead to a significant performance gain for the family of decision tree models. Another promising direction for future work is investigating the combination of CPT with various ensemble methods, such as boosting.
References
Appendix A Proofs
In this section, we show why the splitting function at each internal node results in a convex set confined within a convex polytope. We start by proving a lemma which is needed to prove the main theorem.
Lemma 2.
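For any $\{r_i, \beta_i\}_{i=1}^{K}$, such that $r_i > 0$ and $\beta_i \in \mathbb{R}^{d+1}$, the function:
$$f(x) = \sum_{i=1}^{K} r_i \ln\left(1 + e^{\beta_i' x}\right) \qquad (15)$$
is convex over its domain $\mathbb{R}^{d+1}$.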
Proof.
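Since the sum of convex functions is also convex, it suffices to show that each term of $f$ is a convex function. We demonstrate this by using the following theorem: “A function is convex iff its second derivative is a positive semidefinite matrix over the domain.” One can omit the $r_i$’s in the following calculations because a positive scalar does not change the convexity.
The first derivative of each term is:
$$\nabla_x \ln\left(1 + e^{\beta_i' x}\right) = \frac{e^{\beta_i' x}}{1 + e^{\beta_i' x}} \, \beta_i, \qquad (16)$$
and by taking the derivative of the above vector, we will have:
$$\nabla_x^{2} \ln\left(1 + e^{\beta_i' x}\right) = \frac{e^{\beta_i' x}}{\left(1 + e^{\beta_i' x}\right)^{2}} \, \beta_i \beta_i', \qquad (17)$$
where $\beta_i \beta_i'$ is a matrix in $\mathbb{R}^{(d+1) \times (d+1)}$. Since the scalar $e^{\beta_i' x} / (1 + e^{\beta_i' x})^{2}$ is positive for any $x$, we just need to show that the matrix $\beta_i \beta_i'$ is positive semidefinite. To that end, we prove for any $v \in \mathbb{R}^{d+1}$:
$$v' \beta_i \beta_i' v \ge 0.$$
And, that can be shown by:
$$v' \beta_i \beta_i' v = (\beta_i' v)^{2} \ge 0.$$
Therefore the proof of the lemma is complete.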
∎
Theorem 3.
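For any $\{r_i, \beta_i\}_{i=1}^{K}$, such that $r_i > 0$ and $\beta_i \in \mathbb{R}^{d+1}$, let:
$$A = \{x : P(x) \le 0.5\},$$
where:
$$P(x) = 1 - e^{-\sum_{i=1}^{K} r_i \ln\left(1 + e^{\beta_i' x}\right)}, \qquad (18)$$
then $A$ is a convex set, confined by a convex polytope.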
Proof.
We start by showing is a convex set. Note that, due to the duality
(19) 
By the definition of a convex set, we just need to prove the following:
(20) 
Let and be:
(21) 
Since is monotonically increasing with respect to , replacing by and by in (20), results in a mathematically equivalent expression. Now, we can prove the new statement using Jensen’s inequality. To be more specific, based on Lemma 2 ( is convex) and Jensen’s inequality, we have:
(22) 
So if :
proving is convex.
It remains to show that the set is confined within a convex polytope. This can be shown by:
which completes the proof. ∎
Appendix B Additional Details on Experimental Settings
As mentioned in the paper, we train CPT in a probabilistic manner and switch to a deterministic tree at test time. To make the transition smoother, we conduct annealing during training. To be more specific, we transform the probability function at each node to , where:
(23) 
Larger results in a sharper change of probability from 0 to 1, and controls where that change happens. During training, we gradually increase to make the gap between the probabilistic and deterministic trees progressively smaller. We also learn like the other parameters of the model using SGD. Notice that the change of to keeps the mathematical and geometrical interpretation of CPT intact. That is because any thresholding of has an equivalent counterpart for since and are strictly monotonic functions of each other.
Appendix C Algorithm
Input: Data , initial tree from GreedyTopologyLearner algorithm, maximum number of polytope sides , hyperparameters of the gamma process prior