This paper presents a new ensemble learning method for classification problems called projection pursuit random forest (PPF). PPF uses the PPtree algorithm introduced in Lee et al. (2013). In PPF, trees are constructed by splitting on linear combinations of randomly chosen variables. Projection pursuit is used to choose a projection of the variables that best separates the classes. Utilizing linear combinations of variables to separate classes takes the correlation between variables into account, which allows PPF to outperform a traditional random forest when separation between groups occurs in combinations of variables.
The method presented here can be used in multi-class problems and is implemented into an R (R Core Team, 2018) package, PPforest, which is available on CRAN.
This paper presents the projection pursuit random forest (PPF), a new ensemble learning method for classification problems, built on combinations of predictors in the tree construction.
PPF builds on the projection pursuit tree (PPtree) algorithm (Lee et al., 2013), available in the R package PPtreeViz (Lee, 2018a) which fits a single multi-class tree to the data. Projection pursuit is used to find the linear combination of variables that best separates groups, and many different rules to make the actual split are provided.
Trees that use linear combinations of predictors in a split are known in the literature as oblique trees (Kim and Loh, 2001; Brodley and Utgoff, 1995; Tan and Dowe, 2004; Truong, 2009; Lee et al., 2013). These algorithms use different approaches for finding the linear combinations of predictors upon which to make a split, including random coefficient generation, linear discriminant analysis, and linear support vector machines. In principle, any of these could also serve as a base underlying PPF.
For each split, a random sample of predictors is selected, and then an optimal linear combination for separating the classes is computed using a projection pursuit index. The algorithm is targeted at problems where classes can be separated by linear combinations of predictors, which define separating hyperplanes that are oblique to the axes rather than orthogonal to them. Additionally, PPF accommodates class imbalance by using stratified bootstrap samples, and variable importance measures are computed using the coefficients of the projections. PPF can be used for multi-class problems and is implemented in an R package called PPforest. Only the LDA and PDA projection pursuit indexes are available in PPF.
In the machine learning literature, considerable work has been done on algorithms for building forests from oblique trees (Tan and Dowe, 2006; Menze et al., 2011; Do et al., 2010). Their performance is reported to be better than that of random forests, which matches what we have found with our algorithm. A limitation of building on these approaches is the lack of readily available software.
This paper is organized as follows. Section 2 explains the projection pursuit tree underlying PPF. Section 3 describes the PPF algorithm and its diagnostics, including how to compute variable importance, along with the implementation details. Section 4 evaluates the algorithm using a simulation study and performance on benchmark machine learning data in comparison with other methods. Section 5 compares the diagnostics relative to random forests, and Section 6 discusses the choice of parameters. Section 7 discusses possible extensions and future directions.
2 Background on the projection pursuit tree
The projection pursuit algorithm searches for a low-dimensional projection that optimizes a continuous function measuring some aspect of interest; for PPF, this is class separation. Friedman and Tukey (1973) coined the term “projection pursuit”, but the ideas existed earlier than this (Kruskal, 1969). Lee et al. (2005) developed an index, derived from linear discriminant analysis, for finding projections that separate classes. Let $X_{ij}$ be a $p$-dimensional data vector, the $j$-th observation of the $i$-th class, for $i = 1, \dots, g$, where $g$ is the number of classes, and $j = 1, \dots, n_i$, where $n_i$ is the number of observations in class $i$. The LDA index is defined as follows:

$$I_{LDA}(A) = 1 - \frac{|A^{\top} W A|}{|A^{\top}(W + B)A|},$$
where $B = \sum_{i=1}^{g} n_i (\bar{X}_{i.} - \bar{X}_{..})(\bar{X}_{i.} - \bar{X}_{..})^{\top}$ is the between-group sums of squares, and $W = \sum_{i=1}^{g}\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i.})(X_{ij} - \bar{X}_{i.})^{\top}$ is the within-group sums of squares. If the LDA index value is high, there is a large difference between classes.
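The LDA index for a one-dimensional projection can be sketched as follows. This is a hypothetical re-implementation, not the PPforest code: for a 1-D projection $a$, the determinants reduce to scalar quadratic forms, so the index is simply $1 - a^{\top}Wa / a^{\top}(W+B)a$.

```python
# Minimal sketch of the 1-D LDA projection pursuit index (hypothetical
# re-implementation, not the PPforest code).

def scatter_matrices(X, y):
    """Between-group (B) and within-group (W) sums-of-squares matrices."""
    p = len(X[0])
    grand = [sum(row[j] for row in X) / len(X) for j in range(p)]
    B = [[0.0] * p for _ in range(p)]
    W = [[0.0] * p for _ in range(p)]
    for c in sorted(set(y)):
        Xc = [row for row, lab in zip(X, y) if lab == c]
        mc = [sum(row[j] for row in Xc) / len(Xc) for j in range(p)]
        for j in range(p):
            for k in range(p):
                B[j][k] += len(Xc) * (mc[j] - grand[j]) * (mc[k] - grand[k])
                W[j][k] += sum((r[j] - mc[j]) * (r[k] - mc[k]) for r in Xc)
    return B, W

def lda_index(a, X, y):
    """1-D LDA index: 1 - a'Wa / a'(W+B)a."""
    B, W = scatter_matrices(X, y)
    p = len(a)
    quad = lambda M: sum(a[j] * M[j][k] * a[k] for j in range(p) for k in range(p))
    return 1.0 - quad(W) / quad([[W[j][k] + B[j][k] for k in range(p)] for j in range(p)])

# Two classes separated along the first variable only: projecting onto x1
# scores much higher than projecting onto the noise variable x2.
X = [[0.0, 0.3], [0.2, -0.1], [0.1, 0.2], [5.0, 0.1], [5.2, -0.2], [4.9, 0.0]]
y = [0, 0, 0, 1, 1, 1]
```

A projection aligned with the class separation drives the index toward 1, while a projection onto pure noise stays near 0.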
A second index, PDA, was developed to address large $p$, small $n$ data (Lee and Cook, 2010). The main idea in the construction of this index is that when $n < p$, or the variables are highly correlated, the maximum likelihood variance-covariance matrix estimator will be close to singular, and this will affect the inverse calculation. The PDA index adjusts the variance-covariance matrix calculation, and is defined as follows:

$$I_{PDA}(A, \lambda) = 1 - \frac{|A^{\top}\{(1-\lambda)W + n\lambda I_p\}A|}{|A^{\top}\{(1-\lambda)(W + B) + n\lambda I_p\}A|},$$
where $A$ is an orthonormal projection onto a $k$-dimensional space, $\lambda \in [0, 1)$ is a pre-determined parameter, $B$ is the between-class sums of squares, and $W$ is the within-class sums of squares, as above.
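The PDA adjustment can be sketched for the 1-D case, assuming the shrinkage form $(1-\lambda)W + n\lambda I_p$ from Lee and Cook (2010); the $W$ and $B$ matrices here are illustrative inputs that would come from the class scatter computation on standardized variables.

```python
# Sketch of the 1-D PDA index (hypothetical re-implementation): shrink W
# toward n*lam*I before forming the LDA-style ratio, so the denominator
# stays well-conditioned when W is nearly singular (large p, small n).

def pda_index_1d(a, W, B, n, lam):
    p = len(a)
    quad = lambda M: sum(a[j] * M[j][k] * a[k] for j in range(p) for k in range(p))
    shrink = lambda M: [[(1 - lam) * M[j][k] + (n * lam if j == k else 0.0)
                         for k in range(p)] for j in range(p)]
    WB = [[W[j][k] + B[j][k] for k in range(p)] for j in range(p)]
    return 1.0 - quad(shrink(W)) / quad(shrink(WB))

# Illustrative scatter matrices: most of the between-class spread is on
# the first variable.
W = [[0.1, 0.0], [0.0, 1.0]]
B = [[4.0, 0.0], [0.0, 0.01]]
```

With $\lambda = 0$ the index reduces to the LDA form; increasing $\lambda$ trades separation for numerical stability.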
The PPtree algorithm uses a multi-step approach to fit a multi-class model by finding linear combinations to split on. Figure 1 compares the boundaries that would result from a classification tree fitted using the rpart algorithm (Therneau et al., 2010) and the PPtree algorithm.
Figure 2 illustrates the PPtree algorithm for three classes, and the algorithm steps are detailed below. Let $(X_i, y_i),\ i = 1, \dots, n$, be the data set, where $X_i$ is a $p$-dimensional vector of explanatory variables and $y_i$ represents class information with $y_i \in \{1, \dots, G\}$.
1. Optimize a projection pursuit index to find an optimal one-dimensional projection, $\alpha^*$, for separating all classes in the current data, yielding projected data $z_i = \alpha^{*\top} X_i$.
2. On the projected data, $z_i$, redefine the problem as a two-class problem by comparing the projected class means, and assign a new label, either $G_1$ or $G_2$, to each observation, generating a new class variable $y_i^*$. The new groups $G_1$ and $G_2$ can each contain more than one original class.
3. Find an optimal one-dimensional projection $\alpha^*$, using $(X_i, y_i^*)$, to separate the two groups $G_1$ and $G_2$. The best separation of $G_1$ and $G_2$ determined in this step provides the decision rule for the node:

if $\alpha^{*\top} X_i < c$ then assign the observation to the left node, else assign it to the right node,

where the cutoff $c$ is computed from the means of the projected data in $G_1$ and $G_2$.
4. For each group, all the previous steps are repeated until $G_1$ and $G_2$ each contain only one of the original classes. The depth of a PPtree is at most the number of classes.
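The node logic of steps 2 and 3 can be sketched as follows. This is a hypothetical re-implementation: the grouping rule (split the sorted projected class means at their largest gap) and the midpoint cutoff are one plausible choice among the rules PPtree provides.

```python
# Sketch of a PPtree node (hypothetical re-implementation): relabel the
# original classes into two groups using their projected means, then split
# at the midpoint of the two projected group means.

def two_group_relabel(z, y):
    """Assign each original class to G1 or G2 by its projected mean,
    splitting the sorted class means at their largest gap."""
    classes = sorted(set(y))
    means = {c: sum(zi for zi, yi in zip(z, y) if yi == c) /
                sum(1 for yi in y if yi == c) for c in classes}
    order = sorted(classes, key=lambda c: means[c])
    gaps = [means[order[i + 1]] - means[order[i]] for i in range(len(order) - 1)]
    cut = gaps.index(max(gaps))
    g1 = set(order[:cut + 1])
    return ["G1" if yi in g1 else "G2" for yi in y]

def split_rule(z, labels):
    """Cutoff c = midpoint of the two projected group means."""
    m1 = sum(zi for zi, l in zip(z, labels) if l == "G1") / labels.count("G1")
    m2 = sum(zi for zi, l in zip(z, labels) if l == "G2") / labels.count("G2")
    c = (m1 + m2) / 2
    return ["left" if zi < c else "right" for zi in z], c
```

On projected data with three well-separated classes, the class farthest from the other two forms one group and the remaining two classes are sent to the other side, to be split again at the next level.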
3 Projection pursuit random forest
This section provides the definition of PPF for classification and the algorithm. Diagnostics for the classifier are also defined.
Let $X = (X_1, \dots, X_p)$ be the random vector of predictor variables and $Y$ the output random variable, taking values in a finite set $\mathcal{Y}$ such that $|\mathcal{Y}| = G$. The training sample is defined as $D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$, a set of i.i.d. random variables distributed as $(X, Y)$. The objective is to build a classifier which predicts $Y$ from $X$ using $D_n$, given an ensemble of classifiers $h_1, \dots, h_K$.
A projection pursuit classification random forest can be defined as a collection of randomized classification trees $\{h(\cdot, \Theta_k),\ k = 1, \dots, K\}$, where the $\Theta_k$ are i.i.d. random vectors. $\Theta_k$ captures the two sources of randomness in the $k$-th tree (random variable selection and random bootstrap sampling): it records which variables were selected in each partition and which cases were selected in the bootstrap sample.
For each tree, $h(\cdot, \Theta_k)$, a single vote is collected for the most popular class given the selected predictor variables. Equation 3.1 defines the PPF estimator based on combining the trees:

$$\bar{h}(x) = \mathrm{E}_{\Theta}[h(x, \Theta)], \qquad (3.1)$$

where $\mathrm{E}_{\Theta}$ is the expectation with respect to $\Theta$, conditionally on $X$ and the data set $D_n$. In practice, the PPF estimator is evaluated by generating $K$ random trees and taking the average of the individual outcomes. This procedure is justified in a similar way to the original random forest defined by Breiman (2001), and is based on the Law of Large Numbers (Athreya and Lahiri, 2006).
Equation 4 describes the prediction of a new observation $x$ by majority vote over the ensemble:

$$\hat{y}(x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \sum_{k=1}^{K} I\left(h(x, \Theta_k) = y\right). \qquad (4)$$
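The majority vote can be sketched as a small helper (hypothetical names; each “tree” here is just a function from a case to a class label):

```python
# Sketch of the PPF prediction rule: each tree votes and the modal class
# wins (hypothetical helper, not the PPforest API).
from collections import Counter

def ppf_predict(x, trees):
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three toy trees: two vote "A", one votes "B", so the ensemble says "A".
trees = [lambda x: "A", lambda x: "A", lambda x: "B"]
```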
1. Let $n$ be the total number of cases in the training set $D_n$. $K$ stratified bootstrap samples from $D_n$ are taken: for each class $g$, independently and uniformly re-sample cases, with replacement, from $D_n^{(g)}$ (the training data set for group $g$) with size $n_g$, to create a stratified bootstrap data set $D_n^{*k}$.
2. Use a bootstrap sample to grow a PPtree to the largest extent possible, without pruning. (Note that the depth of the PPtree is at most $G$, where $G$ is the number of classes.)
(a) Start with all the cases of $D_n^{*k}$ in the root node.
(b) Draw a simple random sample of $m$ predictor variables from the set of all $p$ predictor variables, where $m \le p$.
(c) Find the optimal one-dimensional projection to separate all the classes in the current node.
(d) If there are more than two classes, reduce the number of classes to two by comparing the projected means, and assign new labels, $G_1$ and $G_2$, to each case (the new response $y^*$).
(e) Find the optimal one-dimensional projection, $\alpha^*$, using the bootstrap data set with the relabeled response, $y^*$, to separate $G_1$ and $G_2$. The linear combination is computed by optimizing a projection pursuit index to get a projection of the randomly selected variables that best separates the classes. Two index options are available: LDA or PDA.
(f) Compute the decision boundary $c$. Eight different rules for defining the cutoff value at each node can be used; all the rules are defined in Lee (2018b).
(g) Keep $\alpha^*$ and $c$.
(h) Separate the data into two groups using the new labels $G_1$ and $G_2$.
(i) Repeat from (b) to (h) while $G_1$ or $G_2$ contain more than one of the original classes.
3. Repeat step 2 for $k = 1, \dots, K$.

The output is the ensemble of PPtrees, $\{h(\cdot, \Theta_k)\}_{k=1}^{K}$.
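The stratified bootstrap in step 1 can be sketched as follows (a hypothetical re-implementation): resampling with replacement within each class means every bootstrap sample preserves the original class sizes, which is how PPF accommodates class imbalance.

```python
# Sketch of a stratified bootstrap (hypothetical re-implementation of
# step 1): resample with replacement within each class separately.
import random

def stratified_bootstrap(X, y, rng):
    idx = []
    for c in sorted(set(y)):
        members = [i for i, yi in enumerate(y) if yi == c]
        idx.extend(rng.choice(members) for _ in members)
    return [X[i] for i in idx], [y[i] for i in idx]

# An imbalanced toy set: 3 cases of class 0, 5 of class 1.
X = [[float(i)] for i in range(8)]
y = [0, 0, 0, 1, 1, 1, 1, 1]
Xb, yb = stratified_bootstrap(X, y, random.Random(0))
```

The resampled class counts always match the originals, whereas a plain bootstrap would let the minority class shrink (or vanish) in some samples.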
Split values on the projected data can be computed by one of eight methods, which use the group means or medians, with sample size and variance or IQR weighting.
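Two of the eight cutoff rules can be sketched in hypothetical form (the exact definitions are in Lee (2018b)): a plain midpoint of the projected group means, and a sample-size weighted variant that pulls the boundary toward the smaller group's mean.

```python
# Two illustrative cutoff rules for the projected data (hypothetical
# forms, not the exact PPtreeViz definitions).

def cutoff_midpoint(z1, z2):
    """Unweighted midpoint of the two projected group means."""
    return (sum(z1) / len(z1) + sum(z2) / len(z2)) / 2

def cutoff_weighted(z1, z2):
    """Weight each mean by the other group's size, pulling the boundary
    toward the smaller group's mean."""
    n1, n2 = len(z1), len(z2)
    m1, m2 = sum(z1) / n1, sum(z2) / n2
    return (n2 * m1 + n1 * m2) / (n1 + n2)
```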
Figure 3 has a diagram illustrating the PPforest algorithm.
The initial code for PPforest was developed entirely in R. It was subsequently profiled using profvis (Chang and Luraschi, 2016), and two code optimization strategies were employed: translating the main functions into Rcpp (Eddelbuettel et al., 2011) and parallelization using plyr. The microbenchmark package was used to compare speed before and after optimization. Figure 4 shows the performance before and after optimization. Computation time increases linearly with the number of groups. The improvement is between 3- and 9-fold for this range of parameters. The machine used for this comparison was a MacBook Pro with a 2.4 GHz Intel Core i7 processor and 8GB of 1867MHz LPDDR3 memory.
The parameters varied in the timing comparison were: number of classes; observations by class; number of variables; number of trees (50, 500); number of cores; and PPforest version (R only, C code).
3.4 PPF diagnostics
The process of bagging and combining results from multiple trees produces numerous diagnostics which can provide substantial insight into the class structure in high dimensions. Because ensemble methods are composed of many models fitted to subsets of the data, many statistics can be calculated and analyzed as a separate data set, providing the ability to understand how the model is working. The diagnostics of interest are the error rate, variable importance measures, the vote matrix, and the proximity matrix.
3.4.1 Error rate
Using the out-of-bag (oob) cases from bagged trees in the forest construction allows ongoing estimates of the generalization error for an ensemble of trees, described in Breiman (2001).
Given a training data set $D_n$, $K$ bootstrap samples from $D_n$ are taken. For each bootstrap sample $D_n^{*k}$ ($k = 1, \dots, K$), a tree classifier is constructed, and a majority vote is used to get the PPF predictor. The oob cases are used to get the error rate estimates: for each case in $D_n$, the votes are aggregated only over the classifiers whose bootstrap samples do not contain that case. The resulting classifier is called the out-of-bag classifier, and its error rate (the out-of-bag error rate) is the estimate of the generalization error.
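The oob error computation can be sketched as follows (hypothetical data layout): for each case, aggregate votes only from the trees whose bootstrap sample missed it, then compare the majority vote with the true class.

```python
# Sketch of the oob error estimate (hypothetical helper and data layout).
from collections import Counter

def oob_error(y, votes, in_bag):
    """votes[k][i]: class predicted by tree k for case i;
    in_bag[k][i]: True if case i was in tree k's bootstrap sample."""
    wrong = used = 0
    for i, yi in enumerate(y):
        oob = [votes[k][i] for k in range(len(votes)) if not in_bag[k][i]]
        if oob:  # only cases left out by at least one tree contribute
            used += 1
            wrong += Counter(oob).most_common(1)[0][0] != yi
    return wrong / used

# Three trees, three cases: case 2 is misclassified by its oob trees.
y = ["A", "B", "B"]
votes = [["A", "B", "A"], ["A", "A", "A"], ["B", "B", "A"]]
in_bag = [[False, False, True], [False, True, False], [True, False, False]]
```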
3.4.2 Variable importance
PPF calculates variable importance in two ways: (1) permuted importance using accuracy, and (2) importance based on projection coefficients of standardized variables. The permuted variable importance is comparable to the measure defined in the classical random forest algorithm. It is computed using the oob cases $\mathcal{B}_k$ of tree $k$ for each predictor variable. The permuted importance of the variable $X_j$ in the tree $k$ can be defined as:

$$IMP_k(X_j) = \frac{\sum_{i \in \mathcal{B}_k} I\left(y_i = \hat{y}_i^{(k)}\right) - \sum_{i \in \mathcal{B}_k} I\left(y_i = \hat{y}_{i,\pi_j}^{(k)}\right)}{|\mathcal{B}_k|},$$

where $\hat{y}_i^{(k)}$ is the predicted class for the observation $i$ in the tree $k$, and $\hat{y}_{i,\pi_j}^{(k)}$ is the predicted class for the observation $i$ in the tree $k$ after permuting the values for variable $X_j$. The global permuted importance measure is the average importance over all the trees in the forest. This measure is based on comparing the accuracy of classifying oob observations using the true predictor values with the accuracy after permuting (scrambling) the values of one variable.
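The per-tree permuted importance can be sketched as follows (hypothetical helper): oob accuracy minus oob accuracy after permuting the values of one variable.

```python
# Sketch of the permuted importance of one variable in one tree
# (hypothetical re-implementation).
import random

def permuted_importance(tree, X_oob, y_oob, j, rng):
    acc = sum(tree(x) == yi for x, yi in zip(X_oob, y_oob)) / len(y_oob)
    col = [x[j] for x in X_oob]
    rng.shuffle(col)  # scramble variable j across the oob cases
    X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X_oob, col)]
    acc_perm = sum(tree(x) == yi for x, yi in zip(X_perm, y_oob)) / len(y_oob)
    return acc - acc_perm

# A stump that only looks at variable 0: permuting variable 1 cannot
# change its accuracy, so that importance is exactly zero.
stump = lambda x: "A" if x[0] < 0.5 else "B"
X_oob = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y_oob = ["A", "A", "B", "B"]
```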
For the second importance measure, the coefficients of each projection are examined. The magnitude of these values indicates importance if the variables have been standardized. The variable importance for a single tree is computed as a weighted sum of the absolute values of the projection coefficients across nodes, where the weights take the number of classes in each node into account (Lee et al., 2013). The importance of the variable $X_j$ in the PPtree $k$ can be defined as:

$$IMP_k(X_j) = \sum_{t=1}^{T_k} w_t\, |\alpha_{tj}|,$$

where $\alpha_{tj}$ is the projection coefficient for node $t$ and variable $j$, $w_t$ is the node weight, and $T_k$ is the total number of node partitions in the tree $k$.
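The per-tree coefficient importance can be sketched as follows. The node weights here are hypothetical (the text only says they account for the number of classes at each node); the point is the weighted sum of absolute coefficients.

```python
# Sketch of the coefficient-based importance for one tree (hypothetical
# weighting scheme).

def coef_importance(nodes, p):
    """nodes: list of (alpha, weight) pairs, alpha a length-p projection;
    the weight stands in for the number of classes below the node."""
    total = sum(w for _, w in nodes)
    return [sum(w * abs(alpha[j]) for alpha, w in nodes) / total
            for j in range(p)]

# Two nodes: the first (weight 3, three classes below it) leans on
# variable 1; the second (weight 2) uses only variable 2.
nodes = [([0.9, 0.1], 3), ([0.0, 1.0], 2)]
```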
The global variable importance for a PPforest can then be defined in different ways. The most intuitive is the average of the single-tree importance measure across all the trees in the forest.
Alternatively, a global importance measure can be defined for the forest as a weighted mean of the absolute values of the projection coefficients across all nodes in every tree, where the weights are based on the projection pursuit index value in each node and on 1 − (the OOB error of each tree).
3.4.3 Vote matrix
An uncertainty measure for each observation, across models, is the proportion of times that a case is predicted to be in each class. If a case is always predicted to be one class, there is no uncertainty about its group; if this matches the true class, it is correctly labeled. Cases that are proportionately predicted to be multiple classes indicate difficult-to-classify observations. These cases may be important: they might indicate that special attention is needed in some neighborhoods of the data space, or, more simply, they could be measurement errors in the data.
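The vote matrix can be sketched as follows (hypothetical layout): one row per case, one column per class, holding the proportion of trees voting for that class.

```python
# Sketch of the vote matrix (hypothetical helper and data layout).

def vote_matrix(all_votes, classes):
    """all_votes[i]: list of per-tree predicted classes for case i."""
    return [[v.count(c) / len(v) for c in classes] for v in all_votes]

# Case 0 is unanimous; case 1 is split across the three classes and is
# the kind of observation the diagnostic flags as hard to classify.
votes = [["A", "A", "A", "A"], ["A", "B", "B", "C"]]
```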
3.4.4 Proximity matrix
In a tree, each pair of observations either shares a terminal node or not. Tallying this up across all trees in a forest gives the proximity matrix, an $n \times n$ matrix of the proportions of trees in which each pair of observations shares a terminal node. A proximity matrix can be considered a similarity matrix. It is typically used for a follow-up cluster analysis, to assess the strength of the class structure and whether there are additional unlabeled clusters.
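The proximity computation can be sketched as follows (hypothetical data layout): given the terminal node reached by each case in each tree, the proximity of a pair is the proportion of trees in which they land in the same terminal node.

```python
# Sketch of the proximity matrix (hypothetical helper and data layout).

def proximity(leaf):
    """leaf[k][i]: terminal node reached by case i in tree k."""
    K, n = len(leaf), len(leaf[0])
    return [[sum(leaf[k][i] == leaf[k][j] for k in range(K)) / K
             for j in range(n)] for i in range(n)]

# Two trees over three cases: cases 0 and 1 share a leaf in tree 0 only,
# cases 1 and 2 in tree 1 only, cases 0 and 2 never.
leaf = [[1, 1, 2], [1, 2, 2]]
```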
These diagnostics are used to assess model complexity; individual model contributions; variable importance and dimension reduction; and uncertainty in prediction associated with individual observations.
4 Performance comparison
This section presents simulation results and a benchmark data study to examine the predictive performance of PPF in comparison to other methods. In the benchmark data study, PPF is compared with PPtree, CART and RF. The simulation results are designed to compare PPF with RF on data with linear projections defining class differences.
4.1 Benchmark data study
The performance of PPF is compared with the classification methods PPtree, CART and RF using 10 benchmark data sets taken from the UCI Machine Learning archive (Lichman, 2013). Table 2 presents summary information about the benchmark data: number of groups, cases, and predictors for each data set. The imbalance between groups is measured by the range of group size proportions, and correlation is the average of all pairwise correlation coefficients among predictor variables.
For each benchmark data set, a portion of the observations is randomly chosen for training, while the remainder is used as test data for computing predictive error. This procedure is repeated 200 times and the mean error rate is reported in Table 1. In PPF, the proportion of variables selected at each node partition is a tuning parameter. Three different values were used (0.6, 0.9 and the RF default). The test error reported for PPF is the best of these.
The results show that PPF has a better performance in the test data set than the other methods for the crab, fishcatch, leukemia, lymphoma, olive and wine data, while the RF test error is smaller for glass, image, NCI60 and parkinson data.
4.2 Boundary comparison with random forest
To illustrate why and where PPF outperforms RF, results from a small simulation are shown. We expect PPF to outperform RF when the separation between classes is in linear combinations of variables. The simulated data is similar to the crab data.
Each 2D simulated data set was rotated from 0 through 90 degrees, and 20 replications were conducted. The average (and standard deviation) of the error was computed. Figure 6 shows the boundaries for two of the rotations generated by the RF and PPF models, along with the summary of the errors by rotation angle. PPF uniformly outperforms RF in this scenario and produces better boundaries.
5 Diagnostics comparison
The diagnostics computed by PPF (Section 3) and RF are compared for the lymphoma data, which helps to explain why and how PPF outperforms RF on this data.
5.1 Variable importance
Figure 7 illustrates how the variable importance measures differ, using the lymphoma data, for which PPF outperformed RF. There are three groups, and it is a high-dimension, low-sample-size data set. With PPF, the PDA index is used, and 60% of the variables are available at each node. The number of trees used is the same as the RF default. Only the ten most important variables are shown. Some variables are common to both lists, and some differ. Showing just the first two variables from each list is sufficient to illustrate the different types of boundaries induced by the classifiers. The two ways of computing importance in PPF do produce a different hierarchy of variables. With the global average importance, Gene35 and Gene50 are the top two, and these distinguish the small group FL best. With the global weighted importance, Gene35 and Gene44 are featured, and together these find a big gap between DLBCL and the other two groups. PPF is utilizing the association between variables to classify groups, as would be expected.
5.2 Vote matrix
Figure 8 shows the vote matrices returned by PPF and RF for the three classes of the lymphoma data, represented in two ways: as a ternary plot and as a side-by-side jittered dotplot. The vote matrix has three columns corresponding to the proportion of times a case was predicted to be class B-CLL, DLBCL or FL, and thus is constrained to lie in a 2D triangle in 3D space. The ternary diagram is created using a Helmert transformation of the vote matrix to capture this 2D subspace. The way to read it is: points near a vertex are clearly predicted to be one class, points along an edge are confused between two classes, and points in the middle are confused between all three classes. PPF provides more distinct classification of observations than RF, because the points are more concentrated at the vertices and along one edge.
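The ternary coordinates can be sketched via one common Helmert-style transformation (the paper's exact rotation may differ): a 3-part vote row sums to 1, so it lives on a 2D plane that the transformation maps to plotting coordinates.

```python
# Sketch of ternary-plot coordinates for a 3-class vote row (one common
# Helmert-style convention; the paper's rotation may differ).
import math

def ternary_coords(v):
    x = (v[0] - v[1]) / math.sqrt(2)
    y = (v[0] + v[1] - 2 * v[2]) / math.sqrt(6)
    return (x, y)
```

The centroid (maximum confusion between the three classes) maps to the origin, and the three vertices (certain classification) map to points equidistant from it.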
The side-by-side jittered dotplot is an alternative representation that can readily be used for any number of classes. The proportion of times each case is classified to a group is displayed vertically, along a horizontal axis representing the categorical class variable. Points are jittered a little horizontally to better see the distribution of proportions, and colour represents the true class. Points concentrated at the top indicate cases that are clearly grouped into a class, and if the colour matches the true class then these are correct classifications. The message is similar to the ternary diagram: DLBCL is much more clearly distinguished by PPF, and FL is actually distinguishable from B-CLL by PPF but confused by RF.
Figure 9 shows multidimensional scaling plots of the proximity matrix produced by PPF and RF classification of the lymphoma data. PPF provides the cleaner proximities. This means that more frequently observations from the same class reside in the same terminal node of the trees making up the PPF, than those of RF.
6 Parameter selection
The primary parameters for PPF are mostly the same as those for RF: the number of trees, and the number of variables used in each node partition, with the addition of $\lambda$ when PDA is used as the index.
Figure 10 (left) shows the effect of the proportion of variables for the benchmark data comparison. The average error over 200 training/test splits is shown. For all data sets, error is lower when more variables are used. Most converge to a low error rate when half the variables are included.
The right plot compares the number of trees needed to optimise the OOB error for both PPF and RF on the lymphoma data. Both need around 100 trees to produce their best performance.
7 Conclusion

This article has presented a new ensemble method, PPF, for classification problems, built on an oblique tree classifier (PPtree). PPF takes the correlation between variables into account. The forest algorithm enhances the single-tree performance, adding diagnostics to assess variable importance, confusion of observations between groups, and proximity of observations. It is best suited to medium-sized data sets, in both number of observations and variables.
The benchmark data study showed that PPF predictive performance is always at least as good, or better, than CART and PPtree, and often better than RF. Simulation results show that PPF performs better than RF when the classes are separated by a linear combination of variables and when the correlation between variables increases. The variable importance diagnostic shows that different variables are combined to create the classification using a PPF than RF.
There are several directions in which the work could be extended. The two projection pursuit indexes, LDA and PDA, could readily be supplemented by other indexes; an example would be adding a regression index for a continuous response. Another direction is to adapt the PPtree algorithm to allow more splits than the number of classes. This constraint currently protects the single-tree model from overfitting; bagging already provides some protection against overfitting, and we expect relaxing the constraint would enable deeper non-linear boundaries to be constructed by PPF. Lastly, because the accuracy of each tree is collected, automatic pruning of poorly performing trees is a possibility.
- Amit and Geman (1997) Amit, Yali, and Donald Geman. 1997. Shape quantization and recognition with randomized trees. Neural computation 9 (7): 1545–1588.
- Athreya and Lahiri (2006) Athreya, Krishna B, and Soumendra N Lahiri. 2006. Measure theory and probability theory. Springer.
- Breiman (1996) Breiman, Leo. 1996. Bagging predictors. Machine learning 24 (2): 123–140.
- Breiman (2001) Breiman, Leo. 2001. Random forests. Machine learning 45 (1): 5–32.
- Breiman et al. (1996) Breiman, Leo, et al. 1996. Heuristics of instability and stabilization in model selection. The Annals of Statistics 24 (6): 2350–2383.
- Brodley and Utgoff (1995) Brodley, Carla E, and Paul E Utgoff. 1995. Multivariate decision trees. Machine Learning 19 (1): 45–77.
- Chang and Luraschi (2016) Chang, W., and J. Luraschi. 2016. profvis: Interactive Visualizations for Profiling R Code. Version 0.3.2.
- Do et al. (2010) Do, Thanh-Nghi, Philippe Lenca, Stéphane Lallich, and Nguyen-Khang Pham. 2010. Classifying very-high-dimensional data with random forests of oblique decision trees. In Advances in knowledge discovery and management, 39–55. Springer.
- Eddelbuettel et al. (2011) Eddelbuettel, Dirk, Romain François, J Allaire, John Chambers, Douglas Bates, and Kevin Ushey. 2011. Rcpp: Seamless r and c++ integration. Journal of Statistical Software 40 (8): 1–18.
- Friedman and Tukey (1973) Friedman, Jerome H, and John W Tukey. 1973. A projection pursuit algorithm for exploratory data analysis.
- Ho (1998) Ho, Tin Kam. 1998. The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on 20 (8): 832–844.
- Kim and Loh (2001) Kim, Hyunjoong, and Wei-Yin Loh. 2001. Classification trees with unbiased multiway splits. Journal of the American Statistical Association 96 (454).
- Kruskal (1969) Kruskal, Joseph B. 1969. Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new ‘index of condensation’. In Statistical computation, 427–440. New York: Academic Press.
- Lee (2018a) Lee, Eun-Kyung. 2018a. PPtreeViz: An R package for visualizing projection pursuit classification trees. Journal of Statistical Software 83 (8): 1–30. doi:10.18637/jss.v083.i08.
- Lee (2018b) Lee, Eun-Kyung. 2018b. PPtreeViz: An R package for visualizing projection pursuit classification trees. Journal of Statistical Software 83 (1): 1–30.
- Lee and Cook (2010) Lee, Eun-Kyung, and Dianne Cook. 2010. A projection pursuit index for large p small n data. Statistics and Computing 20 (3): 381–392.
- Lee et al. (2005) Lee, Eun-Kyung, Dianne Cook, Sigbert Klinke, and Thomas Lumley. 2005. Projection pursuit for exploratory supervised classification. Journal of Computational and Graphical Statistics 14 (4).
- Lee et al. (2013) Lee, Yoon Dong, Dianne Cook, Ji-won Park, Eun-Kyung Lee, et al.. 2013. PPtree: Projection pursuit classification tree. Electronic Journal of Statistics 7: 1369–1386.
- Lichman (2013) Lichman, M. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.
- Menze et al. (2011) Menze, Bjoern H, B Michael Kelm, Daniel N Splitthoff, Ullrich Koethe, and Fred A Hamprecht. 2011. On oblique random forests. In Machine learning and knowledge discovery in databases, 453–469. Springer.
- R Core Team (2018) R Core Team. 2018. R: A language and environment for statistical computing. Vienna, Austria. R Foundation for Statistical Computing. https://www.R-project.org/.
- Tan and Dowe (2004) Tan, Peter J, and David L Dowe. 2004. MML inference of oblique decision trees. In AI 2004: Advances in Artificial Intelligence, 1082–1088. Springer.
- Tan and Dowe (2006) Tan, Peter J, and David L Dowe. 2006. Decision forests with oblique decision trees. In Micai 2006: Advances in artificial intelligence, 593–603. Springer.
- Therneau et al. (2010) Therneau, Terry M, Beth Atkinson, and Maintainer Brian Ripley. 2010. The rpart package.
- Truong (2009) Truong, Alfred. 2009. Fast growing and interpretable oblique trees via probabilistic models. Univ. of Oxford, A thesis submitted for the degree of Doctor of Philosophy, Trinity term.
- Wickham et al. (2015) Wickham, H., R. François, and RStudio. 2015. dplyr: A Grammar of Data Manipulation. http://cran.r-project.org/web/packages/dplyr/index.html.
- Wickham (2009) Wickham, Hadley. 2009. ggplot2: Elegant graphics for data analysis. useR series. Springer.
- Xie (2015) Xie, Yihui. 2015. Dynamic documents with R and knitr, 2nd edn. Boca Raton, Florida: Chapman and Hall/CRC. ISBN 978-1498716963. http://yihui.name/knitr/.