Splitting matters: how monotone transformation of predictor variables may improve the predictions of decision tree models

11/14/2016 ∙ by Tal Galili, et al. ∙ Tel Aviv University

It is widely believed that the prediction accuracy of decision tree models is invariant under any strictly monotone transformation of the individual predictor variables. However, this statement may be false when predicting new observations with values that were not seen in the training-set and are close to the location of the split point of a tree rule. The sensitivity of the prediction error to the split point interpolation is high when the split point of the tree is estimated based on very few observations, reaching 9% misclassification error when only 10 observations are used for constructing a split, and shrinking to about 1% when 100 observations are used. This study compares the performance of alternative methods for split point interpolation and concludes that the best choice is taking the mid-point between the two closest points to the split point of the tree. Furthermore, if the (continuous) distribution of the predictor variable is known, then using its probability integral transformation ("quantile transformation") will reduce the model's interpolation error by up to about a half on average. Accordingly, this study provides guidelines for both developers and users of decision tree models (including bagging and random forest).


1 Introduction

Algorithms for decision tree learning (DTL) construct a decision tree model (DTM), which uses predictor variables (also known as features or measurements) to match an item to its target value. Decision tree models are very commonly used for predictive modeling in statistics, data mining, and machine learning. DTL is often performed by searching for a series of decision rules which split (partition) the space spanned by the features into disjoint regions of the space. Each partition region of the space is associated with some prediction of the item's target value, which could be categorical or numerical. In these tree structures, leaves include the predicted targets and branches represent conjunctions of features that lead to these predictions. A short survey of decision tree models in general, and the CART methodology in particular, is provided in the appendix in section 5.1.

A classical survey of decision tree adaptations is provided by S. K. Murthy (Murthy, 1998), and a 2013 survey by S. Lomax and S. Vadera concluded that there are over 50 different algorithms for decision tree learning (including ID3, CHAID, C5.0, oblique trees, etc.) (Lomax and Vadera, 2013). A drawback of using a single decision tree is that it tends to either give biased predictions or over-fit the data. In recent decades, better alternatives have been presented by extending a single decision tree to an ensemble of decision trees. Leo Breiman is responsible for two celebrated extensions of the CART model - Bagging (Breiman, 1996) and Random Forest (Breiman, 2001). Another prominent extension, by Freund and Schapire (Freund and Schapire, 1995), is the idea of boosting as implemented in AdaBoost. Comparison of these methods has shown that each model may be better suited to different scenarios (Banfield et al., 2007). Further extensions include gradient boosting (Friedman, 2001).

It is a commonly held paradigm that algorithms based on decision tree models are generally invariant under strictly monotone transformations of the individual predictor variables. As a result, scaling and/or more general transformations are not considered an issue to be concerned with (see page 352 in Hastie et al. (2001), page 20 in Timofeev (2004), or page 181 in Friedman (2006)). However, the CART book includes a more limited statement (see page 57 in Breiman et al. (1984)): "In a standard data structure it (the decision tree) is invariant under all monotone transformations of individual ordered variables". Both statements are true if the training data set is supported by all the possible values of the features that would be found when predicting future observations. Specifically, if the training data sets use some ordered explanatory variables that include all the possible values of these features, then using a strictly monotonic transformation should not make a difference in the model's prediction accuracy.

The aforementioned general statement may not hold when the tree is used for generating an interpolated prediction which is close to the split point location. In general, an interpolated prediction occurs when the model predicts the label of an observation based on a value of the predictor variable that has not been observed in the training data set. This can sometimes happen for discrete predictor variables and will always happen for continuous observations. An interpolated prediction which is close to the split point location is illustrated in the following example. Consider data with only one predictor variable that can get integer values from 1 to 10, and a deterministic dependent variable that gets 0 if and 1 otherwise. Suppose that the data used for training the decision tree had only observations with -values 1, 2, and 10 (and -values 1, 1, and 0). If the model would be used to predict a new observation with it would correctly classify it as 1. But what should the model give when used to interpolate a prediction for an observation that is close to the split point 9, such as ? There are three methods used in practice by decision tree learners for making such interpolated predictions. The first method, "Sweep Left", classifies any observation above (but not including) 2 as 0 (), and estimates the cutoff quantile () of the decision tree as . Under the second method, "Sweep Right", only observations with the values would be classified as 0 (), with cutoff . The third method, with cutoff between the previous two, classifies observations according to the rule . In the above example, will correctly classify as 0, while would mistakenly classify the observation as 1. Both of these methods would give the same prediction regardless of whatever monotone transformation would be used on . The third method, , would wrongly classify the observation as 0. However, if the variable is transformed to be , then the new rule would classify observations with as 0. In this case, will be correctly classified as 0. In this example, the monotone transformation influenced (specifically, helped) the interpolated prediction of the decision tree rule for an observation that was close to . It is evident that different transformations could either help or damage the prediction under different possible values of , through their influence on the way will make the interpolation prediction. It will be shown that the distribution from which new observations of arrive could be used as a guide for the best transformation to use in order to minimize the misclassification error of the model on new observations.
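To make the three rules concrete, the following is a minimal R sketch (our own illustration, not taken from any particular package) that classifies a new value under each interpolation rule, given the two training values that straddle the split (2 and 10 in the example above); the boundary convention at the cutoff itself is our own assumption.

```r
# A sketch of the three split-point interpolation rules for the toy example:
# the two training values straddling the split are 2 (label 1) and 10 (label 0).
classify_new <- function(x_new, left = 2, right = 10,
                         method = c("mid", "sweep_left", "sweep_right")) {
  method <- match.arg(method)
  if (method == "sweep_right") {
    # only values at or beyond the right-hand training point go to the "0" side
    return(ifelse(x_new >= right, 0L, 1L))
  }
  # sweep left uses the left training point as the cutoff, "mid" uses their mid-point
  cutoff <- if (method == "sweep_left") left else (left + right) / 2
  ifelse(x_new > cutoff, 0L, 1L)  # values above the cutoff go to the "0" side
}

# A new observation falling between the two training values:
sapply(c("sweep_left", "mid", "sweep_right"),
       function(m) classify_new(7, method = m))
# under this sketch the three rules return 0, 0 and 1 respectively
```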

All three described methods for split point interpolation prediction are used in practice. A survey of existing R packages (R Core Team, 2015) reveals variability in how different decision tree implementations interpolate their split point. The packages tree (Ripley, 2016), rpart (Therneau et al., 2015), oblique.tree (Truong, 2013), randomForest (Liaw and Wiener, 2002), and Rborist (Seligman, 2016) all use . The packages C50 (Kuhn et al., 2015), partykit (Hothorn et al., 2006) (i.e., the ctree function (Hothorn and Zeileis, 2015)), RWeka (i.e., the J48 function which implements Quinlan's C4.5 (Hornik et al., 2009)), xgboost (Chen et al., 2016), and ranger (Wright, 2016) all use . Lastly, the package evtree (Grubinger et al., 2014) uses .
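One simple way to check which convention a given implementation follows is to fit a stump on data with a wide gap between the two predictor values that straddle the class change and read off the reported cutoff. The sketch below does this with rpart; the control settings are only there to force a split on such a tiny sample, and the reported split value can then be compared against the left point, the right point, and their mid-point.

```r
library(rpart)

# Toy data mirroring the introduction: a wide gap between the two x-values
# (2 and 10) that straddle the change in class.
train <- data.frame(x = c(1, 2, 10), y = factor(c(1, 1, 0)))

fit <- rpart(y ~ x, data = train, method = "class",
             control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

print(fit)              # the printed rule ("x < c" / "x >= c") shows the cutoff c
fit$splits[, "index"]   # the numeric cutoff(s) stored in the fitted object
```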

As mentioned earlier, the various methods of split point interpolation may disagree on discrete variables and are sure to differ on continuous variables - for new observations that are close to the split point location. There has been an active discussion in the literature on how to treat continuous variables for DTL. Until now, however, research has mostly focused on discretization techniques for dealing with the computational complexity of testing too many potential splits (see Chickering et al. (2001); Liu et al. (2002); Kotsiantis and Kanellopoulos (2006); Fayyad and Irani (1992); Kohavi and Sahami (1996)). As far as we know, no previous work has explored how to perform split point interpolation so as to minimize the near-split-point interpolated misclassification error of the model for continuous (or discrete) predictor variables.

To address this issue, we formalize in section (2) the problem of split point interpolation by introducing the supervised uniform distribution, with random () and a deterministic binary (), where is unknown. A variety of estimators for are proposed, and their statistical properties are investigated. The concluding recommendation will be to use the middle point for interpolation (). The effect of near-split-point interpolated prediction error is directly influenced by the sample size used for training the decision tree in each split. For example, if only 10 observations are used for estimating the split point, then could lead to 9% misclassification error while a model using would reduce it to approximately 4%. Using 100 observations for training will reduce the misclassification error by a factor of 10 (as compared to 10 observations) to around 0.9% and 0.4% respectively.

Moreover, as discussed earlier in the introduction, the estimator is sensitive to monotone transformations on . Section (3) illustrates by simulation that if the distribution of a continuous predictor variable is (at least approximately) known, then using its probability integral transformation (termed here "quantile transformation") will reduce the model's interpolation error by up to half on average (depending on the split point location and the original distribution of ). A Bayesian interpretation reveals that using the quantile transformation procedure brings the decision tree closer to taking the median of the posterior distribution of the split point quantile (under a uniform prior on ). This method is most powerful when the split point is interpolated in an area of the distribution with a monotone but very non-linear density, in which case the median could be far from the mid-point prediction of the two observations in the training set that are used for inducing a node's split point interpolation. The simulation study also explores cases where the cumulative distribution function (CDF) is estimated in various ways, and a case where the distribution of used for training is different from the one from which future observations are drawn.

Our method is most effective when a decision tree makes a split based on a small training sample, and it would also be more useful for improving algorithms relying on large trees with many nodes, such as bagging or random forests. Our study provides guidelines for designers of decision tree models to use in their implementation (instead of or ). Also, users of decision trees are advised to use the quantile transformation we propose when there is good parametric knowledge about the distribution of the predictor variables.

2 Estimating the cutoff quantile in the supervised uniform distribution

2.1 The supervised standard uniform distribution

The problem of split point interpolation can be formulated as follows. Consider a random vector whose entries are i.i.d. standard uniform random variables , and an unknown parameter strictly between and to be estimated from the following additional data. Each is augmented by the indicator variable with value if and otherwise. Denote by the vector with entries . Let be the largest entry in that is Left of , and the smallest entry that is Right of . Clearly, and . Let , the i-th order statistic of , where and , be augmented by and . Since is the number of observations from that are less than , then . The density function of a single observation is

(1)

and the joint density of and is

(2)

Accordingly, the expectation of a general integrable function is

(3)

It will later be shown that the estimation of should be based on , and formula (2.1) will be instrumental in analytically deriving the bias and variance of various estimators of .

We say that the pair is drawn from the supervised uniform distribution defined by the endpoints () and the parameter (). This section will focus on the standard case where and , which could later be extended to any other distribution.
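As a small illustration of this setup (our own sketch, with an arbitrary choice of the cutoff), the following R code draws a supervised standard uniform sample and extracts the largest observation Left of the cutoff and the smallest observation Right of it, using the boundary values 0 and 1 when one of the two sides is empty.

```r
# Sketch: simulate the supervised standard uniform distribution and compute
# the two bracketing order statistics (L = largest value left of the cutoff,
# R = smallest value right of it).
simulate_supervised_uniform <- function(n, theta) {
  x <- runif(n)                  # i.i.d. standard uniform predictors
  y <- as.integer(x <= theta)    # deterministic binary label (ties have probability 0)
  L <- if (any(y == 1)) max(x[y == 1]) else 0
  R <- if (any(y == 0)) min(x[y == 0]) else 1
  list(x = x, y = y, L = L, R = R)
}

set.seed(1)
simulate_supervised_uniform(n = 10, theta = 0.3)[c("L", "R")]
```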

In terms of interpretation, at each node of a decision tree, the DTL is responsible for defining based on a split in . The split in (as defined in that node) partitions the space, based on optimizing some criterion (such as misclassification, gini, or impurity) with regards to some predicted variable. Regardless of the original distribution of , as long as its cumulative distribution function is known (or estimated, closely enough, through parametric assumptions about the distribution using labeled and unlabeled observations), then the monotone probability integral transformation (which we will term the quantile transformation) could be used to get . Hence, once the DTL defines from the response variable, then the couple follows the standard supervised uniform distribution. Furthermore, once a split is made, the following (conditional) split will be on observations that also come from a uniform distribution (since observations from a conditional uniform distribution on an interval are also uniform). Hence, for the following discussion, it is not important how a DTL specifically decides on the partitions as long as the split properly partitions the space into two non-overlapping sets of observations with a different -behavior ( for when and otherwise). The results in this study are applicable to any method of decision tree learning, be it a single tree or an ensemble, as long as it is based on a DTL that recursively partitions the space.

Several potential estimators of shall be introduced in the following sections, and their performance will be explored via the Mean Squared Error (MSE) and Mean Absolute Error (MAE). The latter will be evaluated because, for the purpose of classification trees, care is often taken to minimize the prediction misclassification error of a binary response variable (). If the cost of an erroneous classification is symmetric (i.e. incorrect classification of and are treated equally), then the risk function of depends on the area under the density function between and . Here, is an estimator of the cutoff point parameter for observations from . If then a new observation that is between the estimated and the real () will be misclassified as 0. The chance this would happen when (i.e. the expected error rate) is simply . Integrating this (and the mirror case of misclassification as ) over is simply the Mean Absolute Error (MAE) function .
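To spell this argument out in symbols (a sketch in our own notation, writing θ for the true cutoff, θ̂ for its estimate and X_new for a new standard uniform observation): a misclassification occurs exactly when X_new falls between θ̂ and θ, so

\[
P\bigl(\text{misclassify } X_{\mathrm{new}} \mid \hat{\theta}\bigr)
 \;=\; P\bigl(\min(\hat{\theta},\theta) < X_{\mathrm{new}} \le \max(\hat{\theta},\theta)\bigr)
 \;=\; |\hat{\theta}-\theta|,
\qquad
\mathrm{Risk}(\hat{\theta}) \;=\; E\,|\hat{\theta}-\theta| \;=\; \mathrm{MAE}.
\]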

2.2 The minimal sufficient non-complete statistic for the family of distributions is (L, R)

The goal is to estimate from observations coming from the supervised standard uniform distribution. While can be estimated by the natural unbiased estimator , the statistic ignores information from , in which case can be improved. A minimal sufficient statistic that will capture most efficiently all relevant information about the parameter can be found by studying the likelihood function. Since are i.i.d. pairs, the likelihood function can be written as

(4)

The Fisher-Neyman factorization theorem implies that the two-dimensional statistic is a minimal sufficient statistic for (see appendix 5.4 for a partial proof). Hence, it is enough to consider estimators of functions of which are exclusively based on .

A natural candidate for estimating would be the maximum likelihood estimator (MLE). Since the likelihood function is a rectangular function that gets its maximal value for whichever satisfies , then the MLE is the closed set . The MLE is agnostic towards any point estimator for that is a weighted average of and . But this does not mean that every such combination of and is equally good in estimating . If the estimation of an unknown parameter (such as ) relies on a non-complete minimal sufficient statistic, it can happen that the MLE would be inefficient (both asymptotically and for finite sample sizes) and that Rao-Blackwell improvements would be non-unique and improvable. See Galili and Meilijson (2016) for a discussion of this behavior. In the case of the supervised uniform distribution the statistic is a two-dimensional minimal sufficient statistic for estimating a scalar parameter () and is suspected to be not complete (a partial proof for is given in the appendix, section 5.4). Since the MLE does not offer a specific point estimator for split point interpolation, alternative point estimators are explored in the next sections.
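In the same notation (and under the convention that a label of 1 means the observation lies at or below the cutoff; the handling of the endpoints is a measure-zero convention), the likelihood in eq. (4) is flat on the set of cutoffs compatible with the observed labels, which is why the maximum is attained on the whole interval between the two bracketing order statistics:

\[
\mathcal{L}(\theta) \;=\; \prod_{i=1}^{n} \Bigl( \mathbf{1}\{x_i \le \theta\}^{\,y_i}\;\mathbf{1}\{x_i > \theta\}^{\,1-y_i} \Bigr)
\;=\; \mathbf{1}\{\, L \le \theta < R \,\},
\]

so any value in this interval (in particular L, R, or their mid-point) maximizes the likelihood.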

2.3 An unbiased estimator for using and (via Rao-Blackwell)

The Rao-Blackwell theorem (Rao, 1945; Blackwell, 1947) offers a procedure (coined "Rao-Blackwellization" seemingly by Berkson (1955)) for improving a crude unbiased estimator of a parameter into a better one (in mean-squared-error or any other convex loss function), by taking the conditional expectation of given some sufficient statistic , i.e., (this is a statistic because is sufficient).

The unbiased estimator can be improved by Rao-Blackwell based on the minimal sufficient couple , to yield .

If , all the observations are to the left of and therefore . If , all the observations are to the right of and therefore . If both and are strictly between 0 and 1, then it is clear that at least one observation is to the left of (contributing to ) and at least one is to the right of (contributing to ). This leaves observations from a Bernoulli distribution with success probability to be on the left of . Hence . Combining these results, the Rao-Blackwell unbiased improvement of the estimator is

(5)  θ̂_RB = E[Ȳ | L, R] = (1/n) · [ 1 + (n − 2) · L / (1 − (R − L)) ]  when 0 < L and R < 1,  with θ̂_RB = 1 if R = 1 and θ̂_RB = 0 if L = 0

If and happen to be very close to each other then the estimator for from eq. (5) is seen to obtain the value , showing that may be outside the feasibility interval if the latter is short enough.

The variance of coincides with its and, for , is

(6)

Specifically, for the variance is . This is as it should be, since for this case . More details are provided in the appendix in section 5.2.1 (specifically see eq. (5.2.1) and eq. (38)).

Since is not complete, the Lehmann-Scheffé theorem (Lehmann and Scheffé, 1950, 1955) does not hold, and this estimator may or may not have minimal variance among the unbiased estimators of . In fact, there may not exist an unbiased estimator of with uniformly minimal variance. As will be seen in the next section, there exist estimators of with very small bias but with that is noticeably smaller than that of for all values of (other than and ). Such is the case for , with of order of magnitude (constant in ), which coincides with the RHS of eq. (2.3) for but exceeds it otherwise, although never reaching as much as twice. In order to produce the MAE for curves in Figure 2 they were calculated using Monte Carlo methods, since we could not derive them analytically.

2.4 Estimating using and separately

In this section and are considered individually for estimating , and their bias, variance, MSE, and MAE terms are evaluated. The following calculations suggest that there may not be a way to create unbiased estimators when solely relying on or .

The expectation of can be evaluated applying to eq. (2.1) which leads to

(7)

Combined with (see eq. (5.2.2) in section 5.2.2), the variance is

(8)

leading to

(9)

which (for large ) is four times the MSE of the Bayes estimator and at least twice the MSE of the RB estimator. It should be observed in eq. (39) that the bias of , as an estimator of , is of the same order of magnitude as its standard error.

Also, the MAE is

(10)

For the sake of completeness, and are given by

(11)

and

(12)

leading to

(13)

And also, by using eq. (43), the MAE is

(14)

As expected, the variance, MSE, and MAE of and are obtained from each other by exchanging the roles of and .

2.5 An optimal Bayes rule for estimating using and together

From an ad-hoc Bayesian perspective, consider as prior on the uniform distribution . Since the likelihood is constant wherever positive, the posterior distribution of is . This makes the Bayes estimator (under squared loss) the posterior expectation

(15)  θ̂_Bayes = E[θ | L, R] = (L + R) / 2

As a proper Bayes rule, this estimator is biased but admissible. From eq. (39) and (43), its expectation (as a function of ) is

(16)

Its variance is

(17)

And combining the two leads to the following MSE

Variance and MSE are symmetric around , as expected.

The Mean Absolute Error function is minimized by taking the median of the posterior (uniform) distribution, which, yet again, is , leading to the following MAE

(18)

2.6 Sweeping estimators for

As mentioned in the introduction, a modification of the Bayes estimator that is commonly applied as a split rule lets be taken as rather than if , and as rather than if . This modified estimator will be called Swept Bayes (), as it is obtained from by "sweeping" mass to the endpoints. The motivation for the terminology comes from the similar concept "balayage" or "sweeping" used in martingale theory (Meilijson, 2012). Similarly, an estimator alternative to will get value (instead of ) if and the corresponding gets (instead of ) if .

(19)
(20)
(21)

The expectation of and resembles that of and , but with an added term to the bias.

(22)
(23)
(24)

The MSE of is:

(25)

Similar to eq. (16) and eq. (2.5), the bias and MSE of , given in eq. (24) and eq. (25), reveal a symmetry of the values around .

2.7 Comparison of estimators for

Since different software packages choose different methods for split point interpolation, the question is which of the methods is best among , , , their swept versions, or the Rao-Blackwell estimator? From the previous section, when the sample size () is large, the Root Mean Square Error (RMSE) of the different estimators is as follows

(26)
(27)
(28)

From equations (26, 27, 28) it is clear that all four estimators are -consistent in both RMSE and in probability, and the same holds true for the sweep versions of the estimators. For reasonably large and , , but increases as deviates from 0.5. Figure 1 illustrates the RMSE behavior of the four estimators for various values of under sample sizes .
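A rough back-of-the-envelope check of these rates (our own approximation, not the paper's exact expressions): for θ away from the endpoints, the gaps θ − L and R − θ behave approximately like independent Exponential(n) variables A and B, so

\[
\mathrm{MSE}(\hat{\theta}_R) \approx E[B^{2}] \approx \frac{2}{n^{2}},
\qquad
\mathrm{MSE}\!\left(\frac{L+R}{2}\right) \approx E\!\left[\Bigl(\frac{B-A}{2}\Bigr)^{2}\right] \approx \frac{1}{2n^{2}},
\]

i.e. the one-sided estimators have roughly four times the MSE (twice the RMSE) of the mid-point estimator, consistent with the factor-of-four statement in section 2.4.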

Figure 1: The RMSE of the four estimators, across , for different sample sizes ().

Figure 8 (in the appendix, section 5.5.1) incorporates the sweep estimators as dashed lines. The unbiased Rao-Blackwell and the three swept versions are calibrated so as to decide that there is only one class in the population whenever this is the case in the sample, whatever the sample size. This property makes these estimators have vanishing MSE at extreme values of . But this very same property makes these estimators pay a relatively high price in terms of RMSE at a range of nearly extreme values of . The Bayes rule has stable, nearly constant RMSE, paying a price for ignoring the one-class scenario only at extreme . Figure 8 shows (for sample sizes ) that the one-sided estimators of (relying on only L or R) have roughly twice the RMSE of the symmetric-type estimators and should thus be avoided. Furthermore, the results in Figure 8 reveal that the Swept Bayes estimator () fully dominates the Rao-Blackwell estimator () in RMSE, for all possible values of . For the sake of brevity, the rest of this paper will ignore the sweep estimators, since assertions for comparing to , and will be similar when discussing their swept versions.

Repeating this analysis on the MAE reveals a similar behavior to that shown for the MSE - . While the MAE for was not derived analytically, it can still be compared to the other estimators of as displayed in Figure 2, where the curve was estimated through simulation ( simulations per point). This figure shows a similar pattern as was seen for the RMSE in Figure 1.

Figure 2: The MAE of the four estimators, across , for different sample sizes ().

In the context of decision tree models for predicting a binary outcome with an observed predictor variable from (i.e. supervised uniform), the MAE (Mean Absolute Error) is the expected misclassification error of the model. When comparing the four estimators based on MAE, it is clear that the Bayes estimator is better (i.e. has lower MAE) than the Rao-Blackwell estimator for most values of , except for close to 0 or 1. The larger the sample size, the closer needs to be to the edges of the support in order for Rao-Blackwell to improve on Bayes, and if then the two estimators' performance coincides. The and estimators produce double the MAE of the Bayes estimator, unless is near 0 or 1. These findings lead to the conclusion that unless is known to lie close to 0 or 1, the best estimator to use is (or , which is simpler to implement).
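The comparison can be reproduced with a short Monte Carlo sketch (our own code, mirroring the setup behind Figure 2 rather than the original simulation; the Rao-Blackwell formula follows the derivation in section 2.3):

```r
# Monte Carlo sketch: MAE of four split-point estimators in the supervised
# standard uniform model (left point, right point, mid-point, Rao-Blackwell).
mae_of_estimators <- function(n, theta, n_sim = 20000) {
  err <- replicate(n_sim, {
    x <- runif(n)
    y <- x <= theta
    L <- if (any(y))  max(x[y])  else 0
    R <- if (any(!y)) min(x[!y]) else 1
    rb <- if (R == 1) 1 else if (L == 0) 0 else
      (1 + (n - 2) * L / (1 - (R - L))) / n       # Rao-Blackwell form from section 2.3
    abs(c(left = L, right = R, mid = (L + R) / 2, rao_blackwell = rb) - theta)
  })
  rowMeans(err)
}

mae_of_estimators(n = 10, theta = 0.8)
```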

3 Beyond the supervised uniform distribution

3.1 Transforming predictors to the supervised uniform distribution

Previous sections demonstrated the superiority of for interpolating the split point when . Under a uniform prior over the split point , this estimator is Bayes-optimal for squared error (MSE) as well as for absolute difference error functions (MAE) since it is the mean and median of the posterior distribution. It gives the best point-wise result in MSE and MAE over all non-extreme values of .

Using would also be the best solution for any since it would still be the median (and mean) of the posterior distribution. However, this would no longer hold when comes from a non-uniform cumulative distribution . If is known, it is possible to transform back into the uniform distribution using (termed quantile transformation) and thus return to the supervised uniform distribution problem ( ). The proposed algorithm is to first use the quantile transformation on the predictor variables before training the decision tree (), and then apply the same quantile transformation on the new observations () before predicting their outcome using the trained decision tree model. Since the quantile transformation is monotone, the performance when predicting the training data using the trained decision tree model would be invariant to whether the transformation was used or not. However, for the predictions of new observations (for most possible cases of ), using the quantile transformation is expected to improve the interpolated misclassification error of the model (MAE).
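A sketch of the proposed procedure when the CDF is known (here taken, purely for illustration, to be the exponential CDF, with an arbitrary labeling rule and hypothetical sample sizes): transform the predictor through the CDF before training, and push every new observation through the same transformation before predicting.

```r
library(rpart)

# Assumed, for illustration only: x ~ Exponential(1) with a deterministic label.
F_known <- function(x) pexp(x, rate = 1)      # the known CDF of the predictor

set.seed(42)
x_train <- rexp(30)
y_train <- factor(as.integer(x_train > 1.5))  # toy labeling rule

# 1) quantile-transform the predictor, 2) train, 3) transform new data identically.
fit_q <- rpart(y ~ u, data = data.frame(u = F_known(x_train), y = y_train),
               method = "class", control = rpart.control(minsplit = 5))

x_new <- rexp(5)
predict(fit_q, newdata = data.frame(u = F_known(x_new)), type = "class")
```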

Let (and respectively) be the maximal (minimal) observation Left (Right) of the quantile in the -scale. In our proposed algorithm, can be expressed in terms of the statistic () as . If F is not known, one could still use as a split-point on the -scale. This will typically be somewhat distant from as it would be like estimating in the uniform-scale as

(29)

For general , is likely to be distant from the median of the posterior distribution, and therefore give suboptimal misclassification error when predicting new observations (MAE). This shall be explored in the following sections via simulations.

Whether using or , using only or for estimating would be the same as using or , which would give (approximately) double the MAE (for most values of ) and should therefore be avoided.

3.2 The benefit of transforming the predictor variable in various distributions

In order to investigate the benefit of knowing , allowing the use of (or ) instead of , several simulations were conducted for various known distributions. Each simulation checked a range of possible locations and sample sizes - measuring the Mean Absolute Error (MAE, or misclassification error - since this is measured on the quantile scale) for each of the two estimators ( vs ).

Each simulation measured, on the quantile scale, the misclassification error (MAE) for a range of scenarios with sample sizes 2, 10, 20, and 100, on eight Beta distributions - as depicted in Figure 3. The split point is always taken to be all multiples of 0.01 in . Each iteration in the simulation draws observations from and calculates the absolute difference between and (already depicted in Figure 2), as well as between and . This was repeated times and averaged to produce the (MAE) lines in the figures.

Figure 3: The densities of six parameter combinations for the Beta distribution.
Figure 4: The misclassification error (Mean Absolute Error) of a model estimating using the original data or the quantile transformed data, for a range of positions and sample sizes. In the figure and .

As can be seen from Figure 4, the advantage of using over is highly dependent on the shape of the distribution, the location of the split (), and the sample size. Although can have a lower MAE than at some -ranges, generally performs better and can never be fully dominated by the performance of (as Bayes estimators are admissible). Using instead of offers a gain that increases the steeper the density is near the split point (). For example, notice the hump for and near (second row from the top, and third column from the left, in Figure 4). is expected to make approximately 4% misclassification error, as compared to approximately 6% if using . Figure 3 (third column from the left) displays a steep decline of this density near its 80% quantile 0.248. In such a case acts similarly to in the quantile scale, while (generated by the quantile transformation ) gives better results, invariant in , by estimating as the median of the posterior distribution (which, as was shown in the previous section, is the optimal flat-prior Bayes solution for minimizing the MAE).

The Bayes estimator is less precise when the true value of is near the edges of the support (near 0 or 1). From Figure 9 it seems that the raw -scale estimator (not quantile-transformed) is less biased near the edges of , thus helping the estimator gain more precision in these areas over the Bayes estimator. Similar simulations were conducted for other commonly used distributions (Cauchy, Standard Normal, Double Exponential, Chi-squared with , Standard Exponential, Log-Normal, and a mixture of two normals), with the results presented in Figure 11. From both figures (11) and (4) it is clear that a larger sample size both reduces the MAE (as expected) and also changes the locations in which is better than . For example, in unimodal and symmetric distributions (such as Beta(2,2), Beta(10,10), Cauchy, Standard Normal, and the Double Exponential), the larger the sample size gets the more similar the MAE of both estimators becomes for values of close to 0.5, while for values of closer to 0 and 1, demonstrates better MAE than . We note that in no case does one estimator completely dominate the other, and also that only for does one estimator perform twice as well as the other for near 0.5 (i.e. in this case performs similarly to using only or ).

Lastly, bimodal distributions are compared in simulation. These are mixture models of two normal distributions with different means and four combinations of variances and proportions; their densities are presented in Figure 6. The simulation results given in Figure 5 reveal that if is located in the "middle point" between the two modes, then will outperform , indicating that staying in the original -scale will help yield the best results in such cases. However, this performance gain is offset by a large MAE for if happens to be near a mode.

Figure 5: The misclassification error (Mean Absolute Error) of a model estimating using the original data or the quantile transformed data, for a range of positions and sample sizes, for four combinations of two mixed densities of the standard normal distribution in chances of and with another distribution (chances of and ) with and or . In the figure and .
Figure 6: The densities for mixtures of two normal distributions

3.3 Transforming the predictor variable when its distribution is unknown

It is often the case, in real-world data, that the distribution of the predictor variable is not known and should be estimated from labeled and (if available) unlabeled observations. Once estimated, the predictor variable(s) could be transformed to using the quantile transformation () for improving the interpolated misclassification error. Some previous studies have already proposed semi-supervised methods for improving decision trees (such as Criminisi et al. (2012); Tanha et al. (2015)) but not for improving the split-point interpolation error.

Sometimes the observations come from a postulated parametric family (such as the Normal Distribution), but the exact parameters need to be estimated from the data (e.g. ). In the following simulation the observations are normally distributed, and the quantile transformation applied is based on estimated parameters. Figure 7 demonstrates how powerful an (adequate!) parametric assumption can be. Already by estimating and on the labeled data, the estimated quantile transformation (red line) performs almost as well as the true one (green line), both outperforming -scale methods (blue line) in prediction accuracy.

Figure 7: The misclassification error (Mean Absolute Error) of a model estimating using the original data which comes from a standard normal distribution (blue), estimation after the quantile transformed data (green), and after assuming the distribution is normal and using the estimated and to estimate the exact distribution (red). This is done for a range of positions and sample sizes (2, 10, 20, 100). In the figure , , and refers to using the CDF with the estimated parameters.
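A minimal sketch of this parametric version (our own illustration of the idea behind Figure 7, not the original simulation code): estimate the Normal parameters from the labeled predictor values and use the fitted CDF as the quantile transformation for both training and prediction.

```r
# Quantile transformation with estimated Normal parameters.
x_labeled <- rnorm(20)                        # labeled training predictors (assumed Normal)
mu_hat <- mean(x_labeled)
sd_hat <- sd(x_labeled)
F_hat  <- function(x) pnorm(x, mean = mu_hat, sd = sd_hat)

u_train <- F_hat(x_labeled)   # transformed predictors used to grow the tree
u_new   <- F_hat(rnorm(5))    # the same transformation applied before predicting
```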

Even when a normal assumption cannot be justified, for some unimodal data it might be reasonable to assume normality after performing a Box-Cox (Box and Cox, 1964) or a Yeo-Johnson (Yeo and Johnson, 2000) transformation (for example, using the caret R package (Kuhn, 2008)). The unimodality of the distribution can be tested (for example, using the dip test (Hartigan and Hartigan, 1985; Maechler, 2015)), and the normality could be tested as well (using the Shapiro-Wilk or Kolmogorov-Smirnov tests for normality (Shapiro and Wilk, 1965; Lilliefors, 1967)).

When no parametric assumptions are made, could be estimated using the empirical CDF. Simulation results, presented in Figure 12 and Figure 13 in the appendix, indicate that unlabeled observations from seem sufficient in order to yield an empirical CDF that is precise enough for building a decision tree model based on labeled observations. In such cases, the empirical quantile transformation of could give the superior performance of over (as presented in the following case study). But these figures also show that estimating the CDF by adding only a few unlabeled observations to the labeled ones is detrimental to performance. Other semi-parametric smoothing kernels may also prove to be useful.
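When no parametric form is assumed, the empirical CDF built from a large pool of unlabeled observations can play the same role. A minimal sketch (with hypothetical sample sizes and an arbitrary generating distribution):

```r
# Empirical quantile transformation estimated from unlabeled observations.
x_unlabeled <- rexp(5000)       # large pool of unlabeled predictor values
x_labeled   <- rexp(20)         # small labeled sample used to grow the tree

F_hat <- ecdf(x_unlabeled)      # stats::ecdf returns a step-function estimate of the CDF

u_labeled <- F_hat(x_labeled)   # transformed predictors for training
u_new     <- F_hat(rexp(5))     # the same transformation at prediction time
```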

3.4 Case study - Deciding if a day is rainy or not

The weatherAUS dataset comes bundled with the rattle R package (Williams, 2011). The data includes 35,000 daily observations from over 45 Australian weather stations. The data was originally obtained from the Australian Commonwealth Bureau of Meteorology and processed to create a sample dataset for illustrating data mining using R and Rattle. The data in the package has been processed to provide a target variable RainTomorrow on whether there is rain on the following day (No/Yes) and also includes a risk variable RISK_MM which is the amount of rain recorded during the next day. The variable RISK_MM is non-negative, with a sharply declining, right-tailed density function (similar to an Exponential Distribution). As is expected, when the amount of rain in a given day is 0, the RainTomorrow variable will indicate that there was no rain. However, a positive amount of measured rain does not necessarily mean that a day would be classified as rainy. In fact, the data shows that a day would be declared rainy only when observations had more than 1.1 (or any location between 1.0 and 1.2) units of rain, and otherwise it would be labeled that there was no rain that day. This split position () is in the 78% quantile of the distribution. Could a sample smaller than 35,000 observations be enough for making this distinction? A simulation experiment was conducted based on the dataset; observations were sampled from the full dataset, including the amount of rain (RISK_MM) and whether the day was rainy or not. The sample was used for finding a split rule, and then the complete dataset of 35,000 observations was used to check the misclassification error of the prediction. This was repeated times, every time with a new sample of observations. The split rule and misclassification errors were averaged over the simulation runs. This was repeated when using the original scale of X (RISK_MM), after using the empirical CDF based on all 35,000 observations (the quantile transformation), and also by using and .

Number of      Average estimated split point           Interpolated misclassification error
observations   x-scale   quantile   Left    Right      x-scale   quantile   Left     Right
               mid       mid                           mid       mid
10             2.65      2.62       0.40    4.89       0.0417    0.0409     0.0608   0.0640
20             1.70      1.67       0.60    2.80       0.0241    0.0233     0.0350   0.0423
100            1.12      1.06       0.93    1.31       0.0021    0.0019     0.0045   0.0056
1000           1.1       1.0        1.0     1.2        0.0       0.0        0.0      0.0
Table 1: The effect of the quantile transformation on predicting whether or not it would rain, using the amount of measured rain (RISK_MM). The simulation tested samples of 10, 20, 100, and 1000 labeled observations from 35,000 daily rainfall observations. The estimation of the CDF used all 35,000 observations (ignoring their labels). samples were drawn for the estimations. The "x-scale mid" columns give the estimation in the original scale, the "quantile mid" columns give the estimation after the quantile transformation, while the "Left" and "Right" columns use only the Left and Right locations for estimation.

The results, given in Table 1, show that the proposed scenario, in which there is a wealth of unlabeled observations for estimating the CDF, can be leveraged for improving prediction based on only 10, 20 or 100 labeled observations. It is clear that using either or is superior to and , even without knowing the exact distribution of . If the CDF of is known, then there is a small gain to be made when using instead of . However, once the number of labeled observations is 1000, all estimators give perfect results. This might be explained by the fact that while there are 35,000 observations, there are only 477 unique values in the sample. This hints that the observations have been rounded, and using 1000 observations is already enough to capture all the precision offered by the data through the ECDF of the underlying distribution.
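A compact sketch of one cell of this experiment (our own code, assuming the weatherAUS columns described above, with our own choices for the number of repetitions and the handling of degenerate samples; the paper's exact simulation code is not reproduced here):

```r
library(rattle)                       # provides the weatherAUS data set
data(weatherAUS)

x_all <- weatherAUS$RISK_MM
y_all <- weatherAUS$RainTomorrow == "Yes"
keep  <- !is.na(x_all) & !is.na(y_all)
x_all <- x_all[keep]; y_all <- y_all[keep]

F_hat <- ecdf(x_all)                  # empirical CDF from all observations (labels ignored)

one_run <- function(n) {
  idx <- sample(length(x_all), n)
  x <- x_all[idx]; y <- y_all[idx]
  if (all(y) || all(!y))              # degenerate sample: only one class drawn
    return(c(x_scale = NA, q_scale = NA))
  L <- max(x[!y]); R <- min(x[y])     # rainy days have the larger RISK_MM values
  cut_x <- (L + R) / 2                # mid-point split in the original scale
  cut_q <- (F_hat(L) + F_hat(R)) / 2  # mid-point split in the quantile scale
  c(x_scale = mean((x_all > cut_x) != y_all),
    q_scale = mean((F_hat(x_all) > cut_q) != y_all))
}

set.seed(2016)
rowMeans(replicate(200, one_run(20)), na.rm = TRUE)
```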

4 Discussion

This study offered several actionable items for both authors and users of statistical algorithms that rely on decision tree models. For writers of such statistical software, the theoretical work in section (2) leads to the recommendation to use the minimal sufficient statistic for split point interpolation - preferably via (or more practically, ). Even when the distribution of is not uniform, this split point interpolation will yield better results than only using or for predicting observations with new values. This (average) improvement of interpolated predictions will sometimes take place for discrete variables but will always be relevant for continuous predictor variables.

For users of decision tree algorithms, the simulation results in section (3) indicate that knowledge about the distribution of can be used to improve the prediction of the algorithm on future observations (if the DTL uses ), by using the quantile transformation to get . The most significant gain for transforming and to and occurs when the number of observations used for deciding on a split is small. Transforming to the quantile scale will be most beneficial when the density near the split is skewed (thus, making the median of the posterior distribution close to one of the edges of either or ). The method is less likely to be helpful if is essentially separated by into two clusters (see the mixture cases in Figure 5).

When the distribution of is not known, adequate parametric modeling may permit improvement in prediction accuracy even when estimation is based on the labeled data exclusively (see Figure 7, even for ).

While this study focused on a single split for a binary deterministic response variable (), the results are in fact indicative of any type of recursive binary decision tree, be it a multi-class problem or a regression problem. A recursive binary decision tree is such that at every node the split point interpolation problem is the same as the one we have dealt with, in the sense that one is estimated at a time (for this subset of the data). The limitation is that now the support of the observations is not fixed (but may depend on previous splits), but if the split is away from the support of the conditional distribution (and given that there are observations from both sides of ) then it should not influence the conclusions in this study. As for a non-binary response, for example, if and while then the split point interpolation problem of estimating is still similar to everything that was discussed until now. The main difference is that the performance might use a different metric than the misclassification error used in this paper. When the decision tree model uses multiple splits, as is often the case, the benefit of using the quantile transformation on the predictor variables depends on the relation between the number of observations and the complexity of the estimated model. See section 5.7 in the appendix for some simulation results.

Lastly, it is often the case in real-world problems that the response variable is a stochastic (rather than deterministic) monotone function of , in which case the pair is not well defined, and the supervised uniform distribution model should undergo some isotonic generalization. This is a subject for further research.

We thank Professor Yoav Benjamini, Professor Eilon Solan, Professor Dafna Shahaf, Professor Saharon Rosset, Yoni Sidi, Barak Brill, Marilyn Friedes, and Deborah Galili for their valuable input on this manuscript. This work was supported in part by the European Research Council under EC–EP7 European Research Council grant PSARPS-297519, and in part by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project).


References

  • Banfield et al. (2007) Robert E Banfield, Lawrence O Hall, Kevin W Bowyer, and W Philip Kegelmeyer. A comparison of decision tree ensemble creation techniques. IEEE transactions on pattern analysis and machine intelligence, 29(1):173–180, 2007.
  • Berkson (1955) Joseph Berkson. Maximum likelihood and minimum χ² estimates of the logistic function. Journal of the American Statistical Association, 50(269):130–162, 1955.
  • Blackwell (1947) David Blackwell. Conditional expectation and unbiased sequential estimation. The Annals of Mathematical Statistics, 18(1):105–110, 1947.
  • Box and Cox (1964) George EP Box and David R Cox. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), pages 211–252, 1964.
  • Breiman (1996) Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
  • Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • Breiman et al. (1984) Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. Classification and regression trees. CRC press, 1984.
  • Chen et al. (2016) Tianqi Chen, Tong He, and Michael Benesty. xgboost: Extreme Gradient Boosting, 2016. URL https://CRAN.R-project.org/package=xgboost. R package version 0.4-4.
  • Chickering et al. (2001) David Maxwell Chickering, Christopher Meek, and Robert Rounthwaite. Efficient determination of dynamic split points in a decision tree. In Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, pages 91–98. IEEE, 2001.
  • Criminisi et al. (2012) Antonio Criminisi, Jamie Shotton, and Ender Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends® in Computer Graphics and Vision, 7(2–3):81–227, 2012.
  • Fayyad and Irani (1992) Usama M Fayyad and Keki B Irani. On the handling of continuous-valued attributes in decision tree generation. Machine learning, 8(1):87–102, 1992.
  • Freund and Schapire (1995) Yoav Freund and Robert E Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory, pages 23–37. Springer, 1995.
  • Friedman (2001) Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
  • Friedman (2006) Jerome H Friedman. Recent advances in predictive (machine) learning. Journal of classification, 23(2):175–197, 2006.
  • Galili (2016) Tal Galili. edfun: Creating Empirical Distribution Functions, 2016. https://cran.r-project.org/package=edfun, https://github.com/talgalili/edfun/, http://www.r-statistics.com/tag/edfun/.
  • Galili and Meilijson (2016) Tal Galili and Isaac Meilijson. An Example of an Improvable Rao–Blackwell Improvement, Inefficient Maximum Likelihood Estimator, and Unbiased Generalized Bayes Estimator. The American Statistician, 70(1):108–113, jan 2016. ISSN 0003-1305. doi: 10.1080/00031305.2015.1100683. URL http://dx.doi.org/10.1080/00031305.2015.1100683http://www.tandfonline.com/doi/full/10.1080/00031305.2015.1100683.
  • Grubinger et al. (2014) Thomas Grubinger, Achim Zeileis, and Karl-Peter Pfeiffer. evtree: Evolutionary learning of globally optimal classification and regression trees in R. Journal of Statistical Software, 61(1):1–29, 2014. URL http://www.jstatsoft.org/v61/i01/. R package version 1.0-0.
  • Hartigan and Hartigan (1985) John A Hartigan and PM Hartigan. The dip test of unimodality. The Annals of Statistics, pages 70–84, 1985.
  • Hastie et al. (2001) Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
  • Ho (1998) Tin Kam Ho. The random subspace method for constructing decision forests. IEEE transactions on pattern analysis and machine intelligence, 20(8):832–844, 1998.
  • Hornik et al. (2009) Kurt Hornik, Christian Buchta, and Achim Zeileis. Open-source machine learning: R meets weka. Computational Statistics, 24(2):225–232, 2009. R package version 0.4-27.
  • Hothorn and Zeileis (2015) Torsten Hothorn and Achim Zeileis. partykit: A modular toolkit for recursive partytioning in r. Journal of Machine Learning Research, 16:3905–3909, 2015. R package version 1.0-0.
  • Hothorn et al. (2006) Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical statistics, 15(3):651–674, 2006.
  • Hyafil and Rivest (1976) Laurent Hyafil and Ronald L Rivest. Constructing optimal binary decision trees is np-complete. Information Processing Letters, 5(1):15–17, 1976.
  • Kohavi and Sahami (1996) Ron Kohavi and Mehran Sahami. Error-based and entropy-based discretization of continuous features. In KDD, pages 114–119, 1996.
  • Kotsiantis and Kanellopoulos (2006) Sotiris Kotsiantis and Dimitris Kanellopoulos. Discretization techniques: A recent survey. GESTS International Transactions on Computer Science and Engineering, 32(1):47–58, 2006.
  • Kuhn (2008) Max Kuhn. Caret package. Journal of Statistical Software, 28(5), 2008.
  • Kuhn et al. (2015) Max Kuhn, Steve Weston, Nathan Coulter, and Mark Culp. C code for C5.0 by R. Quinlan. C50: C5.0 Decision Trees and Rule-Based Models, 2015. URL https://CRAN.R-project.org/package=C50. R package version 0.1.0-24.
  • Lehmann and Scheffé (1955) E. L. Lehmann and Henry Scheffé. Completeness, similar regions, and unbiased estimation: Part II. Sankhyā: The Indian Journal of Statistics (1933-1960), 15(3):219–236, 1955. ISSN 00364452. URL http://www.jstor.org/stable/25048243.
  • Lehmann and Scheffé (1950) E. L. Lehmann and Henry Scheffé. Completeness, similar regions, and unbiased estimation: Part I. Sankhyā: The Indian Journal of Statistics, 10(4):305–340, 1950.
  • Liaw and Wiener (2002) Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002. URL http://CRAN.R-project.org/doc/Rnews/. R package version 4.6-12.
  • Lilliefors (1967) Hubert W Lilliefors. On the kolmogorov-smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318):399–402, 1967.
  • Liu et al. (2002) Huan Liu, Farhad Hussain, Chew Lim Tan, and Manoranjan Dash. Discretization: An enabling technique. Data mining and knowledge discovery, 6(4):393–423, 2002.
  • Lomax and Vadera (2013) Susan Lomax and Sunil Vadera. A survey of cost-sensitive decision tree induction algorithms. ACM Computing Surveys (CSUR), 45(2):16, 2013.
  • Maechler (2015) Martin Maechler. diptest: Hartigan’s Dip Test Statistic for Unimodality - Corrected, 2015. URL https://CRAN.R-project.org/package=diptest. R package version 0.75-7.
  • Meilijson (2012) Isaac Meilijson. From classical to modern probability: Cimpa summer school 2001. volume 54, chapter 6, pages 207–218. Birkhäuser, 2012.
  • Morgan and Sonquist (1963) James N Morgan and John A Sonquist. Problems in the analysis of survey data, and a proposal. Journal of the American statistical association, 58(302):415–434, 1963.
  • Murthy (1998) Sreerama K Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data mining and knowledge discovery, 2(4):345–389, 1998.
  • Quinlan (1993) J Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
  • R Core Team (2015) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. URL https://www.R-project.org/. R version 3.3.1.
  • Rao (1945) C. R. Rao. Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37(3):81–91, 1945.
  • Ripley (2016) Brian Ripley. tree: Classification and Regression Trees, 2016. URL https://CRAN.R-project.org/package=tree. R package version 1.0-37.
  • Seligman (2016) Mark Seligman. Rborist: Extensible, Parallelizable Implementation of the Random Forest Algorithm, 2016. URL https://CRAN.R-project.org/package=Rborist. R package version 0.1-3.
  • Shapiro and Wilk (1965) Samuel Sanford Shapiro and Martin B Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52(3/4):591–611, 1965.
  • Tanha et al. (2015) Jafar Tanha, Maarten Someren, and Hamideh Afsarmanesh. Semi-supervised self-training for decision tree classifiers. International Journal of Machine Learning and Cybernetics, pages 1–16, 2015.
  • Therneau et al. (2015) Terry Therneau, Beth Atkinson, and Brian Ripley. rpart: Recursive Partitioning and Regression Trees, 2015. URL https://CRAN.R-project.org/package=rpart. R package version 4.1-10.
  • Timofeev (2004) Roman Timofeev. Classification and regression trees (CART) theory and applications. PhD thesis, Humboldt University, Berlin, 2004.
  • Truong (2013) Alfred Truong. oblique.tree: Oblique Trees for Classification Data, 2013. URL https://CRAN.R-project.org/package=oblique.tree. R package version 1.1.1.
  • Williams (2011) Graham J. Williams. Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, 2011. R package version 4.1.0.
  • Wright (2016) Marvin N. Wright. ranger: A Fast Implementation of Random Forests, 2016. URL https://CRAN.R-project.org/package=ranger. R package version 0.5.0.
  • Wu et al. (2008) Xindong Wu, Vipin Kumar, J Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J McLachlan, Angus Ng, Bing Liu, S Yu Philip, et al. Top 10 algorithms in data mining. Knowledge and information systems, 14(1):1–37, 2008.
  • Yeo and Johnson (2000) In-Kwon Yeo and Richard A Johnson. A new family of power transformations to improve normality or symmetry. Biometrika, 87(4):954–959, 2000.

5 Appendix

5.1 Introduction to Decision Tree Models and CART

5.1.1 Introduction to Decision Tree Models

Decision tree learning (DTL) is any algorithm which constructs a decision tree model (DTM) as a predictive model for mapping observations about an item (also known as explanatory or predictor variables) to conclusions about the item's target value (also known as the dependent or response variable). Decision tree models (or classifiers) are very commonly used for predictive modeling in statistics, data mining and machine learning. DTL is often performed by searching for a series of decision rules which partition the space spanned by the independent variables into disjoint regions of the space. Each partitioned region of the space is attributed with some prediction of the dependent variable. In these tree structures, leaves represent the predicted target and branches represent conjunctions of features that lead to those predictions. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees, and when the target variable can take a finite set of values the models are called classification trees. Generally speaking, decision trees are primarily intended as prediction models for cases where we have no preconceived notion about the structure of the model that would fit the data. The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality (Hyafil and Rivest, 1976). Hence, many DTL methods use a mixture of randomization and greedy (e.g. forward step-wise) approaches for searching the data space, in the hope of stumbling upon "useful" patterns for the prediction of interest. In order to protect from over-fitting to the training data at hand, DTL algorithms employ various methods for restricting the model's complexity while optimizing some measure of prediction accuracy.

Research on DTL dates back to works in the social sciences from the 60s by Morgan and Sonquist (Morgan and Sonquist, 1963), which was later improved in the 80s by the "Classification and Regression Trees" (CART) methodology of Breiman et al. (Breiman et al., 1984), and in the 90s by Quinlan's C4.5 (Quinlan, 1993). As time passed, DTL started to be used in more fields, and today decision trees are applied in multiple disciplines such as statistics, pattern recognition, decision theory, signal processing, machine learning and artificial neural networks. A survey of decision tree adaptations is provided by S. K. Murthy (Murthy, 1998), and a more recent survey by S. Lomax and S. Vadera concluded that, by 2013, there were already over 50 different algorithms for producing decision trees (including ID3, CHAID, C5.0, oblique trees, etc.) (Lomax and Vadera, 2013). This work uses the CART DTL methodology for illustration and for some of the simulations, although the conclusions from this work are equally applicable to most of the other algorithms.

5.1.2 Introduction to CART

Depicted in 2008 as one of the top ten algorithms in data mining (Wu et al., 2008), CART (Breiman et al., 1984) offers a good example of a DTL. This section outlines the major steps in the CART algorithm. The construction of the decision tree starts in the root node, which consists of the entire learning set. Then:

(1) all possible variables are scanned for possible splits;
(2) the variable and split with the best "impurity" measure is picked - possible impurity measures are misclassification, Gini or entropy for classification trees and mean square error for regression trees - and the node is split into two child nodes by the chosen rule on the chosen predictor variable;
(3) steps 1 and 2 are recursively repeated for each of the child nodes until a predefined stopping rule is met for all terminal nodes (e.g., pre-pruning, or a node reached a certain minimal number of observations);
(4) each node is assigned a prediction by funneling the training dataset to it, and picking the majority class for nominal dependent variables or the average for numerical variables;
(5) from the complete tree, a series of nested (pruned) sub-trees is defined based on misclassification error when using the training data-set;
(6) a k-fold cross validation (CV) is performed by which the above process is repeated k times (each time on a fraction of the sample size);
(7) using the CV hold-out samples on their respective models (each on a fraction of the sample size), a complexity parameter is determined so that the cross-validated error is minimized, and the entire tree is pruned by the cost-complexity trade-off.

The prediction of a new observation is made by funneling it through the nodes (based on their corresponding values in the explanatory variables) until it falls into one of the terminal nodes (i.e., a leaf node which has no splits), where it is given a prediction based on some aggregate of values from the training set (see step (4) above).
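As an illustration of these steps with a standard implementation (a sketch using the rpart package and its cross-validated complexity table, not the original CART software):

```r
library(rpart)

# Grow a classification tree (steps 1-4), inspect the cross-validated
# complexity table (steps 5-6), and prune at the complexity parameter that
# minimizes the cross-validated error (step 7).
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)                                            # CV error of the nested sub-trees

best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

predict(pruned, newdata = iris[1:3, ], type = "class")  # funnel new observations to leaves
```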

While the CART methodology offers a useful search mechanism in the model space, it has several known limitations when used for prediction. Small CART trees tend to give biased predictions while large trees have high variance and tend to over-fit the data. While CART's cost-complexity pruning offers a reasonable compromise on tree size, better alternatives have since been presented by extending a single decision tree to an ensemble of decision trees. Leo Breiman is responsible for two celebrated extensions of CART - Bagging (Breiman, 1996) and Random Forest (Breiman, 2001). Both of these methods draw many bootstrap samples from the training data in order to grow full-sized (un-pruned) decision trees - and then aggregate their predictions (majority vote for classification or averaging for regression trees). Bagging samples can choose among all the potential features when deciding on a split, while random forest chooses at each split only among a randomly chosen subset of the features. Another (intermediate) method is the random subspace method by Ho (Ho, 1998), which samples the features only once for each of the bagged samples (instead of at each node, as is done in random forest). These extensions search a wider range of models, and by combining large trees with aggregation over many trees they reduce the bias and variance of the model - often yielding superior predictive performance over CART. Another prominent alternative, suggested by Freund and Schapire (Freund and Schapire, 1995), is the idea of boosting as implemented in algorithms such as AdaBoost. Comparison of these methods has shown that each may be superior to others in alternative scenarios (Banfield et al., 2007). Further extensions include gradient boosting and others.

5.2 Integrals for section 2

5.2.1

As an exercise, is indeed unbiased:

(30)  E[Ȳ] = E[Y_1] = P(X_1 ≤ θ) = θ

Needed calculations for the variance of