Ensemble Models with Trees and Rules

12/16/2011 ∙ by Deniz Akdemir, et al.

In this article, we have proposed several approaches for post processing a large ensemble of prediction models or rules. The results from our simulations show that the post processing methods we have considered here are promising. We have used the techniques developed here for estimation of quantitative traits from markers, on the benchmark "Boston Housing" data set and in some simulations. In most cases, the produced models had better prediction performance than, for example, the ones produced by the random forest or the rulefit algorithms.




1 Review of Ensemble Methods

Ensemble learning ([12], [9], [14]) provides solutions to complex statistical prediction problems by simultaneously using a number of models. By bounding false idealizations, focusing on regularities and stable common behavior, ensemble modeling approaches provide solutions that as a whole outperform the single models.

Early developments in ensemble learning include Bagging (bootstrap aggregating) ([1]) and random forest ([4]) by Breiman, and AdaBoost ([5]) by Freund and Schapire. These methods involve “random” sampling of the “space of models” to produce an ensemble of base learners, followed by a “post-processing” of these to construct a final prediction model.

In this article, we review several different approaches for ensemble post-processing and propose some new ones. The main point of this article is that the base learners in an ensemble can be used as inputs to any regression model. The choices of different models herein are based on the experience and preferences of the authors.

In the remainder of this section, we review the recently proposed importance sampled learning ensembles (ISLE) framework [7] for ensemble model generation. Rule ensembles are also reviewed. In Section 2, we propose new ensemble post-processing methods, including partial least squares regression, multivariate kernel smoothing, and the use of out-of-bag observations. Section 3 is reserved for examples and simulations in which we compare the methods proposed here with the existing ones. Some remarks about hyperparameter choice and directions for future research are provided in Section 4.

1.1 ISLE Approach

Given a learning task and a relevant data set, we can generate a set of models from a predetermined model family. Bagging bootstraps the training data set [1] and produces a model for each bootstrap sample. Random forest ([11, 4]) creates a diverse set of models by randomly selecting a few aspects of the data set while generating each model. AdaBoost [5] and ARCing [3] iteratively build models by varying case weights (up-weighting cases with large current errors and down-weighting those accurately estimated) and employ the weighted sum of the estimates of the sequence of models. There have been few attempts to unify these ensemble learning methods. One such framework is the ISLE due to Popescu & Friedman [7].

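We are to produce a regression model to predict the continuous outcome variable $y$ from a vector of input variables $x = (x_1, \dots, x_p)$. We will generate models $f(x \mid \theta)$ from a given model family $\mathcal{F} = \{f(x \mid \theta) : \theta \in \Theta\}$ indexed by the parameter $\theta$. The final ensemble models considered by the ISLE framework have an additive form:

$$F(x) = w_0 + \sum_{m=1}^{M} w_m f(x \mid \theta_m), \tag{1}$$

where $\{f(x \mid \theta_m)\}_{m=1}^{M}$ are base learners selected from $\mathcal{F}$. ISLE uses a two-step approach to produce $F(x)$. The first step involves sampling the space of possible models to obtain $\{\theta_m\}_{m=1}^{M}$. The second step proceeds with combining the base learners by choosing the weights $\{w_m\}_{m=0}^{M}$ in (1).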

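The pseudo code to produce $M$ models under the ISLE framework is given below:

ISLE($M$, $\nu$, $\eta$)

  1. Set $F_0(x) = 0$.

  2. For $m = 1, \dots, M$:
     $(c_m, \theta_m) = \arg\min_{c, \theta} \sum_{i \in S_m(\eta)} L\big(y_i, F_{m-1}(x_i) + c\, f(x_i \mid \theta)\big)$,
     $T_m(x) = f(x \mid \theta_m)$,
     $F_m(x) = F_{m-1}(x) + \nu\, c_m T_m(x)$.

  3. Return $\{T_m(x)\}_{m=1}^{M}$ and $F_M(x)$.

Here $L(\cdot, \cdot)$ is a loss function, $S_m(\eta)$ is a subset of the indices $\{1, \dots, n\}$ chosen by a sampling scheme $\eta$, and $0 \le \nu \le 1$ is a memory parameter.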
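The generation loop can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes squared-error loss, a single input variable, regression stumps as the model family, and subsampling without replacement as the sampling scheme. The names `fit_stump` and `isle` are hypothetical. Under squared-error loss the multiplier on each base learner is absorbed into the stump's leaf values, so each stump is simply fitted to the current residuals.

```python
import random

def fit_stump(xs, targets, idx):
    """Least-squares stump f(x | theta) fitted on the rows in idx (one input variable)."""
    values = sorted({xs[i] for i in idx})
    mean_all = sum(targets[i] for i in idx) / len(idx)
    best = (float("inf"), None, mean_all, mean_all)
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2.0
        left = [targets[i] for i in idx if xs[i] <= cut]
        right = [targets[i] for i in idx if xs[i] > cut]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, cut, ml, mr)
    _, cut, ml, mr = best
    if cut is None:                       # no valid split: constant learner
        return lambda x: mean_all
    return lambda x: ml if x <= cut else mr

def isle(xs, ys, M=50, nu=0.1, eta=0.5, seed=0):
    """Return the base learners T_1..T_M and the fitted values of F_M."""
    rng = random.Random(seed)
    n = len(xs)
    F = [0.0] * n                         # F_0(x) = 0
    learners = []
    for _ in range(M):
        # sampling scheme (assumption): a fraction eta of rows, without replacement
        idx = rng.sample(range(n), max(2, int(eta * n)))
        # squared-error loss: fit the learner to the residuals y - F_{m-1}
        resid = [y - f for y, f in zip(ys, F)]
        T_m = fit_stump(xs, resid, idx)
        learners.append(T_m)
        # memory parameter nu controls how much of T_m enters F_m
        F = [f + nu * T_m(x) for f, x in zip(F, xs)]
    return learners, F

# Toy data: a step function of one input variable.
xs = [i / 20.0 for i in range(40)]
ys = [1.0 if x > 1.0 else 0.0 for x in xs]
learners, F_M = isle(xs, ys)
mse = sum((y - f) ** 2 for y, f in zip(ys, F_M)) / len(ys)
```

Setting `nu=0` removes the dependence on earlier learners, so each stump is fitted directly to the responses on its own subsample, which is the bagging-like end of the framework. A positive `nu` gives boosting-like behavior.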

The classic ensemble methods of Bagging, Random Forest, AdaBoost, and Gradient Boosting are special cases of the ISLE ensemble model generation procedure [15]. In Bagging and Random Forests the weights in (1) are set to predetermined values, i.e. $w_0 = 0$ and $w_m = 1/M$ for $m = 1, \dots, M$. Boosting calculates these weights in a sequential fashion at each step by having positive memory $\nu$, estimating $c_m$, and takes $F_M(x)$ as the final prediction model.

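Friedman & Popescu [7] recommend learning the weights $\{w_m\}_{m=0}^{M}$ using lasso [17]. Let $T = (T_m(x_i))$ be the $n \times M$ matrix of predictions for the $n$ observations by the $M$ models in an ensemble. The weights $(w_0, w = \{w_m\}_{m=1}^{M})$ are obtained from

$$(\hat{w}_0, \hat{w}) = \arg\min_{w_0, w} \; (y - w_0 1_n - T w)^\top (y - w_0 1_n - T w) + \lambda \sum_{m=1}^{M} |w_m|. \tag{2}$$

Here $\lambda > 0$ is the shrinkage parameter; larger values of $\lambda$ decrease the number of models included in the final prediction model. The final ensemble model is given by

$$F(x) = \hat{w}_0 + \sum_{m=1}^{M} \hat{w}_m T_m(x).$$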
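The lasso step can be sketched with plain coordinate descent. This is an illustrative solver under stated assumptions, not the software used in the paper. It minimizes the residual sum of squares plus the shrinkage parameter times the sum of absolute weights, with an unpenalized intercept. The function names and the toy prediction matrix are hypothetical.

```python
def soft(a, t):
    """Soft-thresholding operator."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0

def lasso_weights(T, y, lam, sweeps=200):
    """Coordinate descent for RSS + lam * sum(|w_m|); T is a list of n rows of M predictions."""
    n, M = len(T), len(T[0])
    w0, w = sum(y) / n, [0.0] * M
    for _ in range(sweeps):
        for j in range(M):
            # partial residual that leaves learner j out of the fit
            r = [y[i] - w0 - sum(w[k] * T[i][k] for k in range(M) if k != j)
                 for i in range(n)]
            rho = sum(T[i][j] * r[i] for i in range(n))
            z = sum(T[i][j] ** 2 for i in range(n))
            w[j] = soft(rho, lam / 2.0) / z if z > 0 else 0.0
        # the intercept is not penalized
        w0 = sum(y[i] - sum(w[k] * T[i][k] for k in range(M)) for i in range(n)) / n
    return w0, w

# Toy ensemble: learner 0 predicts the response exactly, learner 1 is uninformative.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
T = [[yi, (-1.0) ** i] for i, yi in enumerate(y)]
w0, w = lasso_weights(T, y, lam=4.0)
```

With this toy input the uninformative learner receives a weight of exactly zero while the informative one is shrunk slightly below one, which is the model-selection effect described above.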


1.2 Rule Ensembles

The base learners in the preceding sections can come from any regression model family; in practice, they are usually regression trees. Each decision tree in the ensemble partitions the input space using the product of indicator functions of “simple” regions based on several input variables. A tree with $K$ terminal nodes defines a $K$-partition of the input space, where membership in a specific node, say node $k$, can be determined by applying the conjunctive rule

$$r_k(x) = \prod_{j=1}^{p} I(x_j \in s_{jk}), \tag{3}$$

where $I(\cdot)$ is the indicator function and $x_j$, $j = 1, \dots, p$, are the input variables. The regions $s_{jk}$ are intervals for a continuous variable and a subset of the possible values for a categorical variable.
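As a small illustration, a conjunctive rule can be coded as a product of indicator functions. The helper `make_rule` and the example regions are hypothetical, and variables not named in the rule are treated as unrestricted.

```python
def make_rule(conditions):
    """conditions maps a variable index to a membership test for that variable's region."""
    def rule(x):
        value = 1
        for j, in_region in conditions.items():
            value *= int(in_region(x[j]))   # indicator of x_j falling in its region
        return value
    return rule

# Variable 0 is continuous (an interval), variable 2 is categorical (a subset of levels).
rule = make_rule({
    0: lambda v: 2.0 <= v < 5.0,
    2: lambda v: v in {"a", "b"},
})

inside = rule([3.0, 10.0, "a"])    # both conditions hold
outside = rule([6.0, 10.0, "a"])   # the interval condition fails
```

Evaluating every extracted rule on every training observation in this way gives a binary matrix with one row per observation and one column per rule.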

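Given a set of decision trees, rules can be extracted from these trees to define a collection of rules. Let $R = (r_l(x_i))$ be the $n \times L$ matrix of rules for the $n$ observations by the $L$ rules in the ensemble. The rulefit algorithm of Friedman & Popescu [8] uses the weights $(\hat{\alpha}_0, \hat{\alpha})$ that are estimated from

$$(\hat{\alpha}_0, \hat{\alpha}) = \arg\min_{\alpha_0, \alpha} \; (y - \alpha_0 1_n - R \alpha)^\top (y - \alpha_0 1_n - R \alpha) + \lambda \sum_{l=1}^{L} |\alpha_l| \tag{4}$$

in the final prediction model

$$F(x) = \hat{\alpha}_0 + \sum_{l=1}^{L} \hat{\alpha}_l r_l(x).$$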


2 Post Processing Ensembles Revisited

We can use the base learners in an ensemble as input variables in any regression method. Since the number of models in an ensemble can easily exceed the number of observations, we prefer regression methods that can handle high dimensional input. A few such approaches like principal components, partial least squares regression, multivariate kernel smoothing and weighting are illustrated in this section. We will compare these approaches to the existing standards random forests and rulefit in the next section.

2.1 Principal Components and Partial Least Squares Regression

The models in an ensemble are all aligned with the response variable and therefore we should expect that they are correlated with each other. Principal component regression (PCR) and partial least squares regression (PLSR) are two techniques which are suitable for high dimensional regression problems where the predictor variables exhibit multicollinearity.

PCR and PLSR decompose the input matrix $T$ into orthogonal scores $S$ and loadings $P$,

$$T = S P^\top,$$

and regress $y$ on the first few columns of the scores $S$ using ordinary least squares. This leads to biased but low variance estimates of the regression coefficients in model (1). PLSR incorporates information on both $T$ and $y$ in the loadings.

Both of these methods behave as shrinkage methods [6] where the amount of shrinkage is controlled by the number of components (loadings) included. An obvious question is how many components are needed to obtain the best generalization for the prediction of new observations. This is, in general, determined by resampling techniques such as cross-validation or bootstrapping.

The illustrations in the following section demonstrate the good performance of PLSR for post processing trees or rules. PLSR, as opposed to lasso, achieves shrinkage without forcing sparsity on the input variables. The ensemble learners are all “directed” towards the output variable and therefore they exhibit strong multicollinearity. This is a case where we would expect PLSR to work better than lasso.
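For readers who want to see the mechanics, here is a minimal sketch of single-response PLSR using the NIPALS recursion. It is one common formulation and not necessarily the variant used for the results in this article. The names `pls1_fit` and `pls1_predict` are hypothetical, and `X` stands for the matrix of base learner predictions (or rules).

```python
def pls1_fit(X, y, n_comp):
    """PLS regression for one response; returns column means, response mean, components."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    comps = []
    for _ in range(n_comp):
        # weight vector: direction of maximal covariance with the response
        w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:                  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]   # scores
        tt = sum(v * v for v in t)
        load = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(yc[i] * t[i] for i in range(n)) / tt
        # deflate inputs and response before extracting the next component
        Xc = [[Xc[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        yc = [yc[i] - q * t[i] for i in range(n)]
        comps.append((w, load, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    """Predict one new row by replaying the score and deflation steps."""
    x_mean, y_mean, comps = model
    xc = [a - b for a, b in zip(x, x_mean)]
    yhat = y_mean
    for w, load, q in comps:
        t = sum(a * b for a, b in zip(xc, w))
        yhat += q * t
        xc = [a - t * b for a, b in zip(xc, load)]
    return yhat

# Toy check: two inputs and an exactly linear response, y = 1 + 2*x1 - x2.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 3.0, 0.0, 2.0, 4.0]
full = pls1_predict(pls1_fit(X, y, n_comp=2), [3.0, 2.0])
shrunk = pls1_predict(pls1_fit(X, y, n_comp=1), [3.0, 2.0])
```

Using as many components as the rank of the input matrix reproduces ordinary least squares, while fewer components give the shrunken fit. The number of components plays the role of the tuning parameter selected by cross-validation.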

The coefficients of the tree ensemble model in (2) or the rule ensemble model in (4) can be used to evaluate the importances of trees, rules and individual input variables [8]. For the tree ensembles, the importance of the $m$th tree is evaluated as

$$I_m = |\hat{w}_m| \cdot \mathrm{sd}(T_m(x)),$$

where $I_m$ measures the importance of the tree and $\mathrm{sd}(T_m(x))$ denotes the standard deviation of the output of the $m$th tree over the individuals in the training sample. For the rule ensembles, the importance of the $l$th rule is calculated similarly as

$$I_l = |\hat{\alpha}_l| \sqrt{s_l (1 - s_l)},$$

where $s_l$ is the support of rule $r_l$, i.e. the fraction of training observations that satisfy it. The individual variable importances are calculated from the sum of the importances of the trees or rules which contain that variable.
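The two importance measures can be computed in a few lines. The sketch below assumes the tree importance is the absolute weight times the standard deviation of that tree's training predictions, and the rule importance is the absolute weight times the square root of support times one minus support, following [8]. The function names and toy inputs are hypothetical.

```python
from statistics import pstdev

def tree_importances(weights, T):
    """|w_m| times the standard deviation of column m of the prediction matrix T."""
    return [abs(weights[m]) * pstdev([row[m] for row in T])
            for m in range(len(weights))]

def rule_importances(weights, R):
    """|alpha_l| * sqrt(s_l * (1 - s_l)), with s_l the support of rule l in R."""
    out = []
    for l in range(len(weights)):
        support = sum(row[l] for row in R) / len(R)
        out.append(abs(weights[l]) * (support * (1.0 - support)) ** 0.5)
    return out

# Toy rule matrix: 4 observations by 2 rules, with hypothetical fitted weights.
R = [[1, 0], [1, 1], [0, 1], [0, 1]]
alpha = [2.0, -1.0]
imp_rules = rule_importances(alpha, R)
imp_as_trees = tree_importances(alpha, R)
```

For a binary rule the population standard deviation equals the square root of support times one minus support, so the two functions agree on a rule matrix. This is why the text describes the rule importance as being calculated "similarly".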

The PLSR model is in the same additive form as in (1); therefore the weights in the model can be used to calculate tree, rule, or variable importances the same way they were calculated for the lasso post processing approach.

2.2 Multivariate Kernel Smoothing

We will concentrate on kernel smoothing using the Nadaraya-Watson estimator. For a detailed presentation of the subject, we refer the reader to [16]. The Nadaraya-Watson estimator is a weighted sum of the observed responses $y_1, \dots, y_n$. Let the values of the base learners at an input point $x$ be written in an $M$-dimensional vector $T(x) = (T_1(x), \dots, T_M(x))^\top$. The final prediction model at input point $x$ can be obtained as

$$F(x) = \frac{\sum_{i=1}^{n} K_h\big(T(x) - T(x_i)\big)\, y_i}{\sum_{i=1}^{n} K_h\big(T(x) - T(x_i)\big)}.$$

The kernel function $K_h(\cdot)$ is a symmetric function that integrates to one, and $h$ is the smoothing parameter. In practice, the kernel function and the smoothing parameter are usually selected using the cross validated or bootstrap performances for a range of kernel functions and smoothing parameter values.
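A minimal sketch of the Nadaraya-Watson prediction in the space of base learner outputs follows. It assumes an isotropic Gaussian kernel with a single bandwidth `h`. The function name `nw_predict` and the toy values are hypothetical. The kernel's normalizing constant cancels in the ratio, so it is omitted.

```python
import math

def nw_predict(t_new, T_train, y_train, h):
    """Kernel-weighted average of training responses.

    t_new   : base learner outputs at the new input point
    T_train : list of base learner output vectors at the training points
    h       : smoothing parameter (bandwidth)
    """
    weights = []
    for t_i in T_train:
        d2 = sum((a - b) ** 2 for a, b in zip(t_new, t_i))
        weights.append(math.exp(-0.5 * d2 / (h * h)))
    total = sum(weights)
    if total == 0.0:
        # every weight underflowed: fall back to the overall mean
        return sum(y_train) / len(y_train)
    return sum(w * y for w, y in zip(weights, y_train)) / total

# Two base learners evaluated at three training points.
T_train = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
y_train = [0.0, 1.0, 2.0]
local_fit = nw_predict([0.0, 0.0], T_train, y_train, h=0.01)    # follows the nearest point
global_fit = nw_predict([0.0, 0.0], T_train, y_train, h=1e6)    # close to the overall mean
```

Small bandwidths make the prediction follow the nearest training points in learner space, while large bandwidths pull it toward the overall mean response.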

2.3 Weighting Ensembles using Out-of-Bag Observations

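As we have mentioned earlier, most of the important early ensemble methods combine the base models using weights. Both bagging and random forest algorithms use equal weighting. Estimating $w = (w_1, \dots, w_M)$ by minimizing

$$\sum_{i=1}^{n} \Big( y_i - \sum_{m=1}^{M} w_m T_m(x_i) \Big)^2$$

subject to the constraint $w_m \ge 0$, $m = 1, \dots, M$, gives the Stacking approach of Wolpert [18] and Breiman [2]. In stacking, the final prediction model is given by

$$F(x) = \sum_{m=1}^{M} \hat{w}_m T_m(x).$$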

The ensemble generation algorithms based on bootstrapping the observations build each base learner from the observations in its bootstrap sample, which leaves the out-of-bag observations for evaluating the generalization performance of that particular learner. The following weighting scheme down-weights the base learners that have bad generalization performance. For each of the $M$ base learners $T_1, \dots, T_M$, the out-of-bag observations are those left out of that learner's bootstrap sample. The prediction of the response at input value $x$ is a weighted combination of the base learner predictions $T_m(x)$. This involves keeping track of the out-of-bag performance of each model in the ensemble and obtaining the weights by applying a kernel with smoothing parameter $h$ to the out-of-bag errors.

The value of $h$ controls the smoothness of the model. For large values of this parameter the kernel method will assign approximately equal weights to the learners and hence it is equivalent to random forest weighting. Smaller values of the parameter assign higher weights to the models with small out-of-bag errors. It is customary to choose the $h$ that minimizes the cross-validated or bootstrapped errors. In addition, it is sometimes beneficial to eliminate the models with the lowest weights from the final ensemble.
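One plausible form of such a weighting scheme is sketched below. The exact weight formula is an assumption made for illustration and is not taken from the article: each learner's out-of-bag root mean squared error is passed through a Gaussian kernel with bandwidth `h`, and the resulting weights are normalized to sum to one. All function names are hypothetical.

```python
import math

def oob_rmse(predict, X, y, in_bag):
    """Root mean squared error of one learner on the rows outside its bootstrap sample."""
    oob = [i for i in range(len(y)) if i not in in_bag]
    return (sum((y[i] - predict(X[i])) ** 2 for i in oob) / len(oob)) ** 0.5

def oob_weights(oob_errors, h):
    """Assumed kernel weights: Gaussian in the out-of-bag error, normalized to sum to one."""
    e_min = min(oob_errors)
    # subtracting the smallest squared error avoids underflow for small h
    raw = [math.exp(-(e * e - e_min * e_min) / (2.0 * h * h)) for e in oob_errors]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_prediction(predictions, weights):
    """Weighted combination of the base learner predictions at one input point."""
    return sum(w * p for w, p in zip(weights, predictions))

errors = [0.5, 1.0, 2.0]                    # hypothetical out-of-bag errors of three learners
nearly_equal = oob_weights(errors, h=1e3)   # large h: close to random forest weighting
selective = oob_weights(errors, h=0.05)     # small h: best learner dominates
```

Under this assumed form, a large bandwidth reproduces equal weighting and a small bandwidth concentrates the weight on the learners with the smallest out-of-bag errors, matching the behavior described above.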

3 Illustrations

The following ensemble models are compared in this section:

  1. r(pslr): Partial Least Squares Regression with Rules,

  2. t(pslr): Partial Least Squares Regression with Trees,

  3. r(lasso): lasso with Rules,

  4. t(lasso): lasso with Trees,

  5. w(oob): Weighting Using Out-of-Bag performance,

  6. wt(oob): Weighting Using Out-of-Bag performance (best of the trees),

  7. rf: Random Forest,

  8. ksr: Kernel Smoothing with Rules,

  9. kst: Kernel Smoothing with Trees.

In all of these models, the hyperparameters are set using 10 fold cross validation in the training sample.

Our first example involves the Fusarium head blight (FHB) data set that is available from the author upon request. A very detailed explanation of this data set is given in [13].

Example 3.1.

FHB is a plant disease caused by the fungus Fusarium graminearum and results in tremendous losses by reducing grain yield and quality. In addition to the decrease in grain yield and quality, FHB also contaminates the crop with mycotoxins. Therefore, improved FHB resistance is an important breeding goal. Our aim is to build a prediction model for FHB resistance in barley based on available genetic variables. The FHB data set included FHB measurements along with 2251 single nucleotide polymorphisms (SNP) on 622 elite North American barley lines. The 10 fold cross validated accuracies, measured by the correlations of true responses to the predicted values, are displayed in Figure 1.
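The accuracy measure used here, the correlation between true and predicted responses, can be sketched as follows. The round-robin fold assignment in `kfold_indices` is an assumption, since the article does not state how the folds were formed. Both function names are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation between true responses a and predictions b."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def kfold_indices(n, k=10):
    """Assign rows 0..n-1 to k test folds in round-robin order (assumed scheme)."""
    return [list(range(fold, n, k)) for fold in range(k)]

folds = kfold_indices(622)                               # 622 barley lines, 10 folds
perfect = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # perfectly aligned predictions
reversed_fit = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]) # perfectly misaligned predictions
```

In a full evaluation, each fold in turn would be held out, the ensemble and its post-processing fitted on the remaining rows, and `pearson` applied to the held-out responses and their predictions.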

Figure 1: 10 fold cross validated accuracies measured by correlation for the FHB data set. The ensemble of rules with PLSR has slightly higher accuracy compared to its alternative rules with lasso. The number of trees was set to Maximum depth allowed for each tree or rule was set to 5.

Example 3.2.

In our second example we repeat the following experiment 100 times. Elements of the input matrix X are independently generated from a distribution. The elements of the coefficient matrix were also generated independently from and of these were selected randomly and set to zero. dimensional response vector was generated according to where was generated from so that the signal-to-noise ratio was about 2 to 1. The data were separated into training data and test data in the ratio of 2 to 1. The box plots in Figure 2 compare the different approaches to ensemble post processing in terms of the accuracies in the test data set.

Figure 2: Box plots comparing the different approaches to ensemble post processing for the scenario in Example 3.2. The number of trees generated was 200, and the maximum depth parameter was set to 2.

Example 3.3.

In this example, we repeat the experiment in Friedman & Popescu ([8]). Elements of the input matrix are independently generated from distribution. dimensional response vector was generated according to where was generated from The data was separated as training data and test data in the ratio of 2 to 1. The box plots in Figure 3 compare the test data performances of the different approaches over 100 replications of the experiment.

Figure 3: Box plots comparing the different approaches to ensemble post processing for the scenario in Example 3.3. The number of trees was set to 200, and the maximum depth parameter was set to 2.

Example 3.4.

In order to compare the performance of prediction models we use the benchmark data set “Boston Housing” ([10]). This data set includes n=506 observations and p=14 variables. The response variable is the median house value, which is predicted from the remaining 13 variables in the data set. 10 fold cross validated accuracies are displayed by the box plots in Figure 4. The PLSR approach has the best cross validated prediction performance.

Figure 4: 10 fold cross validated accuracies for the “Boston Housing” data are displayed by the box plots. The PLSR approach has the best cross validated prediction performance. We have generated 300 trees by the ISLE approach and the maximum depth parameter was set to 4. For the methods that use a kernel function we have uniformly used the Gaussian kernel. The sparsity parameters of the lasso or PLSR and the kernel width parameters were obtained by minimizing 10 fold cross validated errors in the training data.

4 Conclusion

In this article, we have proposed several approaches for post processing a large ensemble of prediction models or rules. The approach taken here is to treat the ensemble models or the rules as base learners and use them as input variables in the regression problem. Some weighting approaches to ensemble models are also considered.

The results from our simulations and benchmark experiments show that these post processing methods are promising. In most cases, the proposed models had better prediction performances than the ones given by the popular random forest or the rulefit algorithms. PLSR with rules uniformly produced the models with best prediction performances. The ensembles based on rules extracted from trees, in general, had better performances.

The complexity of trees or rules in the ensemble increases with the number of nodes from the root to the final node (depth). The maximum depth is an important parameter since it controls the degree of interactions between the input variables incorporated by the ensemble model, and its value should be set carefully. It might also be useful to use some degree of cost pruning while generating the trees by the ISLE algorithm.

One last remark: This article argues that individual trees or rules should be treated as input variables to the statistical learning problem. It is almost always possible to incorporate other input variables like the original variables or their functions to our prediction model. The rulefit algorithm of Friedman & Popescu optionally includes the input variables along with the rules in an additive model and uses lasso regression to estimate the coefficients in the model. Integrating additional input variables into the final ensemble is also straightforward with PLSR and kernel smoothing.


I take this opportunity to express my gratitude to the people who have been instrumental in the successful completion of this article. This research was also supported by the USDA-NIFA-AFRI Triticeae Coordinated Agricultural Project, award number 2011-68002-30029.


  • [1] L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
  • [2] L. Breiman. Stacked regressions. Machine learning, 24(1):49–64, 1996.
  • [3] L. Breiman. Arcing classifier (with discussion and a rejoinder by the author). The annals of statistics, 26(3):801–849, 1998.
  • [4] L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • [5] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning-International Workshop then Conference-, pages 148–156. Morgan Kaufmann Publishers, Inc., 1996.
  • [6] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning, volume 1. Springer Series in Statistics, 2001.
  • [7] J.H. Friedman and B.E. Popescu. Importance sampled learning ensembles. Journal of Machine Learning Research, 94305, 2003.
  • [8] J.H. Friedman and B.E. Popescu. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3):916–954, 2008.
  • [9] L.K. Hansen and P. Salamon. Neural network ensembles. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(10):993–1001, 1990.
  • [10] D. Harrison, D.L. Rubinfeld, et al. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81–102, 1978.
  • [11] T.K. Ho. Random decision forests. In Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on, volume 1, pages 278–282. IEEE, 1995.
  • [12] T.K. Ho, J.J. Hull, and S.N. Srihari. Combination of structural classifiers. 1990.
  • [13] J.L. Jannink, KP Smith, and AJ Lorenz. Potential and optimization of genomic selection for fusarium head blight resistance in six-row barley. Crop Science, 52(4):1609–1621, 2012.
  • [14] EM Kleinberg. Stochastic discrimination. Annals of Mathematics and Artificial intelligence, 1(1):207–239, 1990.
  • [15] G. Seni and J.F. Elder. Ensemble methods in data mining: improving accuracy through combining predictions. Synthesis Lectures on Data Mining and Knowledge Discovery, 2(1):1–126, 2010.
  • [16] K. Takezawa. Introduction to nonparametric regression. Wiley Online Library, 2006.
  • [17] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • [18] D.H. Wolpert. Stacked generalization*. Neural networks, 5(2):241–259, 1992.