Tomal et al. (2015)
introduced a novel approach for building diverse classification models, for the ensembling of classification models in the context of rare-class detection in two-class classification problems. They proposed an algorithm to divide the often large number of features (or explanatory variables) into subsets adaptively and build a base classifier (e.g. Random Forests) on each subset. The various classification models are then ensembled to produce one model, which ranks the new samples by their probabilities of belonging to the rare class of interest. The essence of the algorithm is to automatically choose the subset groups such that variables in the same group work well together for classification tasks; such groups are called phalanxes.
In this paper, we propose a different class of phalanxes for application in general regression tasks. We define a “Regression Phalanx” – a subset of features that work well together for regression (or prediction). We then propose a novel algorithm, with hierarchical clustering of features at its core, that automatically builds Regression Phalanxes from high-dimensional data sets and builds a regression model for each phalanx for further ensembling.
In principle, any given regression method can be used as the base regression model. The goal is to improve the prediction accuracy of the base method. In this paper, we mainly focus on two well-established regression models: Lasso (Tibshirani, 1996) and Random Forests (Breiman, 2001). These two methods are known to have superior performance in various regression and prediction applications. For each application in this paper, we first compare the performances of Lasso and Random Forests (RF). The better performing method between the two is then chosen as the base regression model for building Regression Phalanxes.
The idea of ensembling Regression Phalanxes is promising because each Regression Phalanx is relatively low-dimensional. Thus, each variable makes a more significant contribution in the fitted model. Compared to training a full model where variables compete with each other in contributing to the final fit, more useful variables are likely to contribute to the ensembled regression model.
Our proposed phalanx-forming procedure resembles a hierarchical clustering of features (instead of samples), where “similarity” between a pair of features (or subsets of features) is defined by how well they work together in the same regression or prediction model. With properly defined similarity measures, features can be then hierarchically merged into different phalanxes.
The rest of paper is organized as follows. Section 2 presents the details of our proposed algorithm for building Regression Phalanxes. Section 3 presents a simple illustrative example, which forms the basis for simulation studies in Section 4. In Section 5, we demonstrate the performance of Regression Phalanxes on four additional real data sets. Finally, we conclude with some remarks and discussion of future work.
2 Phalanx-formation algorithm
In this section, the details of the Regression Phalanx formation algorithm are presented. The procedure is an agglomerative approach to build Regression Phalanxes, which is, in essence, a hierarchical clustering of variables. There are four key steps:
Initial grouping. Form initial groups from the original variables ().
Screening of initial groups. Screen out the underperforming initial groups to obtain groups.
Hierarchical merging into phalanxes. Hierarchically merge the screened groups into candidate phalanxes.
Screening of candidate phalanxes. Screen out the underperforming candidate phalanxes to obtain final phalanxes.
A sketch of the procedure is presented in Figure 1. Each step of the phalanx-forming procedure is explained in more details in the following sections.
2.1 Initial grouping
This is an optional step. If this step is omitted, each initial group contains a single individual feature and the number of initial groups equals the number of features. As a result, the following steps in the phalanx-formation steps become more computational intensive since the time complexity is quadratic in the number of groups. Thus, it is recommended that the features be grouped into fewer initial groups if they lend themselves to natural grouping (e.g. initial groups can be identified by features with similar names). Also, if an initial group only contains a binary feature, its corresponding model is likely to be weak since it can only predict two possible values. On the other hand, an initial group with binary features can produce up to different predictions. Thus, we recommend the grouping of the binary features into initial groups. If the data set contains a large number of features but no obvious hints for natural grouping, we can still use hierarchical clustering to obtain the initial groups. For binary features, Tomal et al. (2015) proposed to use the Jaccard dissimilarity index, defined between binary features and as
Here is the number of observations where and both take the value 1, and is the number of observations where or take the value 1. It is easy to see that . For continuous features (or a mix of binary and continuous features), we propose to use the “1-Abs(Correlation)” dissimilarity measure. That is, the dissimilarity between variable and is calculated as
This step partitions the original features into initial groups .
2.2 Screening of initial groups
High-dimensional data are likely to contain noise features which contribute little or even negatively to the prediction task. In such cases, initial groups need to be screened so that noisy initial groups do not participate in the following steps. We first introduce some notation and then we present two tests for the screening of the initial groups.
We first define some notations to be used in the screening procedure. Denote by the assessment criterion of a given regression task. Typically is defined as the mean squared error (MSE) of prediction
where and are the observed values and their predictions, respectively, of the response at test points. The data available for the application in Section 5.2 allow separate test data, but usually the test points will be generated by cross validation or related methods using the training data.
To assess accuracy based on training data only, different strategies are used for the two candidate regression methods, Lasso and RF.
In Lasso, the predictions are produced by -fold Cross-Validation (we choose through out the paper). More specifically, the data set and the corresponding are randomly grouped into folds , , with observations respectively (). Then the predictions for , namely is obtained by
where is the Lasso model fit from the folds other than the -th fold, and
is the corresponding feature vector of. We denote the assessment criterion for Lasso as the Cross-Validation MSE (CV-MSE).
Random Forests (RF).
In RF, can be obtained from the out-of-bag (OOB) predictions. The prediction is obtained from only the trained trees that do not have in their bootstrapped sample, hence, the prediction is called the out-of-bag prediction. We denote the assessment criterion for RF as the out-of-bag MSE (OOB-MSE).
We further denote as the vector of predictions from the base regression model (e.g. Lasso or RF) using only the variables in . Let denote the assessment measure. Denote by the predictions when the model is fit using the variables in and , and denote
as the resulting performance. Similarly, let
be the performance of the ensemble of the predictions from and .
2.2.2 Tests for screening initial groups
A group survives the screening if it passes the two tests described as follows. A group passes the first test if its own performance is “strong”, i.e. high prediction accuracy. A group survives the second test if it produces “significant combining improvement”, i.e., after combining with another group , the model trained on the combined variables (from and ) produces significantly better accuracy comparing to that from .
We use a null permutation distribution to establish the baseline for strong individual performance and significant combining improvement. Denote
as the vector of permuted response values in which the original vector of response variable valuesis randomly permuted relative to the features. Then the counterparts of and are calculated with as the response and denoted as and respectively. Denote the quantile of as and the quantile of by . Then a group survives the screening if it passes both the following two tests:
is strong itself:
The rationale is that the strength of an individual initial group should be competitive with the quantile of the strengths of initial groups with a randomly permuted response.
improves the strength of any other group when and are combined to build a regressor:
The rationale is that the improvement from combining and should be competitive with the quantile of combining improvements of initial groups with a randomly permuted response. The quantile is to adjust for the tests for each initial group.
After the screening of initial groups, a list of surviving groups is relabeled as for the next step.
2.3 Hierarchical formation of candidate phalanxes
We use simple greedy hierarchical clustering techniques to merge Groups , , into phalanxes, which proves to be effective in all of our applications. Each iteration merges the pair of groups and that minimizes
Here indicates that combining and to build a single model provides more strength than ensembling two models built separately on and . The number of groups, , decreases by 1 after each merge. The merging process stops when for all , indicating that further merging damages the performance and the resulting groups, i.e. candidate phalanxes , should be now considered for ensembling.
2.4 Screening of candidate phalanxes
Searching for the best subset of candidate phalanxes for further ensembling is a combinatorial problem. In order to reduce the computational cost, we establish the search path in a forward selection fashion. We initialize with the candidate phalanx with the smallest value of the criterion . At each stage all the remaining candidate phalanxes will be considered for ensembling with phalanxes in one at a time, and the one with the best ensembling performance with will be added to . The ensembling performance is calculated as follows. The predictions of from are denoted by , and the ensembled prediction for is calculated as
The ensemble’s performance is then calculated from the ensembled predictions.
For ease of description, we assume the order of entry to be , and the corresponding ensembling performance as each candidate is added to be respectively. Then the set of candidate phalanxes corresponding to the best ensembling performance, say , will be selected, and the remaining candidates are screened out.
After the screening of weak phalanxes, the surviving phalanxes are the final phalanxes, and they will be ensembled in the last step.
2.5 Ensembling of Regression Phalanxes
We fit a model for each of the phalanxes of variables, and obtain predictions from them. (For RF, we can increase the number of trees and get better OOB predictions as final predictions.) For a test point, the predictions from the ensemble of regression phalanxes (ERPX) are averaged to give the final prediction.
3 A simple illustrative example: Octane
The octane data set (Esbensen et al., 1996) consists of NIR absorbance spectra over 226 wavelengths ranging from 1102 nm to 1552 nm with measurements every two nanometers. For each of the 39 production gasoline samples the octane number was measured.
It is known that the octane data set contains six outliers (cases 25, 26, 36–39) to which alcohol was added. We omit those outliers to obtain 33 clean samples.
We apply Lasso and RF separately on the data set, then we choose Lasso as the base regression model since it produces better results: the MSE of Lasso is about 0.085 compared with about 0.27 for RF.
The R package “glmnet” (Friedman et al., 2010) is used for fitting Lasso models and the penalty parameter is chosen by using “cv.glmnet” with method “lambda.1se”. Since cross validation is used for choosing the tuning parameter for Lasso, the Phalanx formation procedure has inherited randomness. Therefore, we apply both ERPX and the original Lasso to the Octane data set three times each. For ERPX, we skip the initial grouping step since the number of features is relatively small and the features are all continuous. Different numbers of surviving groups and final phalanxes are obtained. The accuracies of both methods are assessed by CV-MSE. For each run, results for mean CV-MSE over 20 CV repetitions are presented in Table 1.
We can see that the CV-MSE values from ERPX are much smaller than those obtained from the original Lasso models, which confirms that ERPX can boost the performance of the base regression model.
In the following section, we generate synthetic data sets based on the octane data. We simulate data favoring Lasso and RF respectively as the base regression model and show that we are able to improve the performance over the base regression model using the proposed ERPX.
4 Simulation studies
In this section, we present several numerical experiments to demonstrate the performance of the proposed ERPX. We simulate data by first emulating the feature structure of the octane data. Then we generate response data as a function of the features in two different ways, to represent linear and nonlinear relationships with the features, respectively.
The data sets are generated based on the octane data as follows.
, , where as in the octane data set.
is sampled from multivariate normal distribution, where consists of column means of the octane data.
The covariance matrix is equal to the sample covariance matrix as observed or from a perturbation of the data. Specifically, we add different levels of noise into the octane data, and for each noise level we take the resulting sample covariance matrix for . We consider three noise levels:
No noise: is the sample covariance matrix of the octane data.
Medium noise: Random samples from are added to each element of the octane data, where
is the minimum feature standard deviation among all the 226 features of the octane data.is then the sample covariance matrix of the modified octane data.
High noise: Random samples from are added to each element of the octane data. is then the sample covariance matrix of the modified octane data.
For each noise level we simulate the feature set 100 times. The strategy for generating depends on the choice of the base regression model.
4.1 Lasso as base regression model
To favor Lasso, for each of the 100 simulated , we generate the response variable in a linear pattern as follows.
Randomly select 10 features from the columns of .
For each feature , generate the corresponding coefficient
from a Uniform distribution.
Generate as .
Generate as , where is a scale constant , is the original response variable from the octane data set, and contains noises generated from .
For each set of simulated data , we apply ERPX with base regression models Lasso and RF respectively. Since the response variable is generated as a linear combination of the predictors, Lasso is expected to perform at least as good as RF. The performance is measured by 5-fold CV-MSE and OOB-MSE for Lasso and RF respectively.
|Average number of|
We can see from Table 2, for all the simulation settings, regardless of the choice of the base regression model, ERPX produces more accurate predictions than the corresponding base regression model with large relative margins. Somewhat surprisingly, RF slightly outperforms Lasso in around 70% of the 300 simulated data sets (and also on average). This could be caused by the use of different metrics, OOB-MSE versus CV-MSE, when the sample size is small.
4.2 Random Forests as base regression model
To generate the response variable from highly nonlinear relationships favoring RF, we deploy the following strategy.
Randomly select 10 features from the columns of .
Generate two sets of coefficients and from a Uniform distribution .
Since the initial values in are generated from a mixture of two linear patterns, they cannot be easily modelled by linear methods such as Lasso.
Generate , where , a scale constant, and contains noises generated from .
For each set of simulated data , we again apply ERPX with either Lasso or RF as the base regression model. In this case, we expect RF to outperform Lasso. We can see from Table 2, for all the simulation settings, RF outperforms Lasso as anticipated given the simulation set up, and ERPX improves upon both base regression models by large relative margins.
|Average number of|
5 Additional real data examples
The prediction performance of ERPX is illustrated on the following data sets.
5.1 AID364 data set
Assay AID364 is a cytotoxicity assay conducted by the Scripps Research Institute Molecular Screening Center. There are 3,286 compounds used in our study, with their inhibition percentages recorded. Visit http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=364 for details. Because toxic reactions can occur in many different ways, this assay is expected to present modelling challenges. We consider five sets of descriptors for the assay, to make 5 data sets. The descriptor sets are the following: atom pairs (AP), with 380 variables; Burden numbers (BN, Burden, 1989), with 24 variables; Carhart atom pairs (CAP, Carhart et al., 1985), with 1585 variables; fragment pairs (FP), with 580 variables; and pharmacophores fingerprints (PH), with 120 variables. The Burden numbers are continuous descriptors, and the other four are bit strings where each bit is set to “1” when a certain feature is present and “0” when it is not. See Liu et al. (2005) and Hughes-Oliver et al. (2010) for further explanation of the molecular properties captured by the descriptor sets.
The initial groups for the descriptor sets are determined by their features names. For example, related features for FP present similar names such as AR_01_AR, AR_02_AR, , AR_07_AR, and such features will form the initial groups. We perform our proposed ERPX on each of the five descriptor sets. The base regression model is chosen to be RF due to its superior performance over Lasso in this case (see Table 9 in Appendix). (RE: WILL)
ERPX and the original RF are run on each of the five descriptor sets as well as the five descriptor sets combined as a whole set, three times each. The results are presented in Table 4. As we can see, ERPX provides superior prediction accuracy over the original RF, and the margin gets bigger with all five descriptor sets merged together. This is because ERPX can exploit the “richness” of features to improve prediction accuracy.
Due to the naming schema of the features, the descriptor sets lend themselves well to obtaining initial groups. However, we show that ERPX still produces comparable prediction accuracies with initial groups obtained from the hierarchical clustering approach described in Section 2.1. We choose the same number of initial groups as shown in Table 4 to facilitate comparison. Table 5 presents the new results for descriptor sets other than BN (since no initial grouping is needed). As we can see, the prediction accuracies are comparable to those in Table 4 where the initial grouping is done according to feature names.
5.2 CLM data set
The community land model with carbon/nitrogen biogeochemistry (CLM-CN) is a state-of-the-art land surface model to make future climate projections. Sargsyan et al. (2014) performed simulations using CLM-CN for a single plant functional type: temperate evergreen needleleaf forest. The outputs were 100-year-later projected values of several quantities of interest (QoIs). The simulator was run 10,000 times for different settings of 79 input parameters, out of which 9983 runs succeeded.
We make predictions for two QoIs, leaf area index (LAI) and total vegetation carbon (TOTVEGC), from the 79 input variables. The two QoIs are log scaled since their sample distributions are highly right-skewed. The 9983 runs contain 38% and% of zeros, respectively. Therefore, we add constants and to the two QoIs respectively before we apply the log scaling (these values are roughly equal to the minima of the respective non-zero values). Since the climate projections are affected by a number of uncertainties in CLM-CN, predicting the LAI and TOTVEGC is a challenging task. We choose RF as the base regression model due to its superior performance versus Lasso in this example (see Table 10 in Appendix) (RE: WILL). We apply both ERPX and the original RF to demonstrate their performance. For ERPX, we skip the initial grouping step since there are only 79 variables.
The 9983 observations are randomly split into training and testing sets with 5000 and 4983 observations respectively. We repeat the random splitting 20 times to obtain 20 different pairs of training and testing sets. For each random split, both ERPX and RF are trained on the randomly sampled training set and applied to the corresponding test set. Therefore, we can generate boxplots of both the 20 OOB-MSEs from the training sets and the 20 test MSEs from the testing sets. The boxplots are shown in Figure 2. It is clear that ERPX provides more accurate predictions than RF, for both TOTVECG and LAI as response variables.
We also present the detailed results from the first three splits in Table 6. Since the number of phalanxes is always one in all the runs, the mainly difference between ERPX and RF is the screening of initial groups. By screening out most of the less important initial groups, ERPX is able to produce more accurate prediction than RF with only a small number of “strong” initial groups.
5.3 Gene expression data set
Scheetz et al. (2006) conducted a study of mammalian eye diseases where the gene expressions of the eye tissues from 120 twelve-week-old male F2 rats were recorded. A gene coded as TRIM32 is of particular interest here since it is responsible for causing Bardet-Biedl syndrome.
According to Scheetz et al. (2006), only 18976 probe sets exhibited sufficient signal for reliable analysis and at least 2-fold variation in expressions. The intensity values of these genes are evaluated on the logarithm scale and normalized using the method in Bolstad et al. (2003). It is believed from previous studies that TRIM32 is only linked to a small number of genes, so following Scheetz et al. (2006)
we concentrate mainly on the top 5000 genes with the highest marginal sample variance.
Again, we choose RF (MSE around 0.0128) as the base regression model over lasso (MSE around 0.0131). We apply ERPX and the original RF to this data set three times each. The results are presented in Table 7. For ERPX, we skip the initial grouping step.
It is clear that ERPX is providing more accurate predictions than RF for this data set.
5.4 Glass data set
The glass data set (Lemberge et al., 2000) was obtained from an electron probe X-ray microanalysis (EPXMA) of archaeological glass samples. A total of 180 glass samples were analyzed and each glass sample has a spectrum with 640 wavelengths. The goal is to predict the concentrations of several major constituents of glass, namely, Na2O, SiO2, K2O and CaO, from the spectrum. For different responses, the choice of the base regression model varies, since neither RF nor Lasso performs uniformly better accross the response variables. We apply ERPX and the corresponding base regression model to each of the six responses three times each. The results are presented in Table 8. As we can see, ERPX improves the prediction accuracy of the corresponding base regression method.
In this paper, we propose a novel framework called ensemble of Regression Phalanxes (ERPX). We propose to divide a often large number of features into subsets called Regression Phalanxes. Separate predictive models are built using features in each Regression Phalanx and they are further ensembled.
The proposed approach is widely applicable. We have demonstrated it on a variety of applications spanning chemistry, drug discovery, climate-change ecology, and gene expression. The simulated examples and real applications demonstrate that ERPX can take advantage of the richness of the features and produce gains in prediction accuracy over effective base regression model such as Lasso and RF.
- Bolstad et al. (2003) Bolstad, B. M., Irizarry, R. A., Åstrand, M., and Speed, T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193.
- Breiman (2001) Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
- Burden (1989) Burden, F. R. (1989). Molecular identification number for substructure searches. Journal of Chemical Information and Computer Sciences, 29(3):225–227.
- Carhart et al. (1985) Carhart, R. E., Smith, D. H., and Venkataraghavan, R. (1985). Atom pairs as molecular features in structure-activity studies: definition and applications. Journal of Chemical Information and Computer Sciences, 25(2):64–73.
- Esbensen et al. (1996) Esbensen, K., Midtgaard, T., and Schönkopf, S. (1996). Multivariate Analysis in Practice: A Training Package. Camo As.
- Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1.
- Hughes-Oliver et al. (2010) Hughes-Oliver, J. M., Brooks, A. D., Welch, W. J., Khaledi, M. G., Hawkins, D., Young, S. S., Patil, K., Howell, G. W., Ng, R. T., and Chu, M. T. (2010). Chemmodlab: a web-based cheminformatics modeling laboratory. In silico biology, 11(1-2):61–81.
- Lemberge et al. (2000) Lemberge, P., De Raedt, I., and Janssens, K. H. (2000). Quantitative analysis of 16-17th century archaeological glass vessels using pls regression of epxma and mu-xrf data. Journal of chemometrics, 14(5):751–764.
- Liu et al. (2005) Liu, K., Feng, J., and Young, S. S. (2005). Powermv: a software environment for molecular viewing, descriptor generation, data analysis and hit evaluation. Journal of chemical information and modeling, 45(2):515–522.
- Sargsyan et al. (2014) Sargsyan, K., Safta, C., Najm, H. N., Debusschere, B. J., Ricciuto, D., and Thornton, P. (2014). Dimensionality reduction for complex models via bayesian compressive sensing. International Journal for Uncertainty Quantification, 4(1).
- Scheetz et al. (2006) Scheetz, T. E., Kim, K.-Y. A., Swiderski, R. E., Philp, A. R., Braun, T. A., Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., Casavant, T. L., et al. (2006). Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences, 103(39):14429–14434.
- Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
- Tomal et al. (2015) Tomal, J. H., Welch, W. J., Zamar, R. H., et al. (2015). Ensembling classification models based on phalanxes of variables with applications in drug discovery. The Annals of Applied Statistics, 9(1):69–93.