Building an effective classifier for a specific problem is a difficult task. To be successful, a variety of aspects need to be taken into account: the structure of the data, the information that can be used for prediction, the number of labeled examples available for induction and the noise level, among others. Another crucial choice is the type of predictor to be used. The strategies implemented by the different classifiers are diverse. For instance, decision trees adopt a divide-and-conquer approach in which the original prediction task is recursively divided by partitioning the attribute space into disjoint regions. Within each of these regions, the prediction problem is simpler than the original. A neural network provides a global sub-symbolic representation of the decision problem in terms of the set of synaptic weights. Another illustration is the strategy adopted in kernel methods, such as Support Vector Machines (SVMs). In SVMs the original problem is embedded into an extended feature space. In this extended space, the discrimination problem is solved by finding the maximum margin hyperplane that separates the classes, except for, possibly, a few instances. In practice, one often finds that combining the outputs of individual classifiers leads to more accurate predictions. Hence the popularity of ensemble methods [Dietterich2000, Banfield et al.2007, Bauer and Kohavi1999]. A necessary condition to obtain such improvements is that the ensemble members be diverse. In addition, the individual predictors should be complementary, in the sense that each of them tends to make errors on different test instances.
Homogeneous ensembles are composed of classifiers of the same type. Ensembles composed of classifiers of different types are called heterogeneous. The strategies to generate diversity among the base classifiers are different for homogeneous and for heterogeneous ensembles. In homogeneous ensembles, the main difficulty is to generate diversity even though the same learning algorithm is used. To this end, one can use bootstrap techniques (e.g. bagging [Breiman1996a]), randomized steps in the base learning algorithm (e.g. the random subspace method used in random forests [Breiman2001]), noise injection in the class labels (e.g. ECOC [Dietterich and Bakiri1991]) or adaptive emphasis protocols (e.g. boosting [Freund et al.1999]). These techniques, which have been exploited mainly in the context of homogeneous ensembles, can also be used to achieve further diversity in heterogeneous ensembles [Lu et al.2015]. However, since different learning algorithms are used to generate the base learners, heterogeneous ensembles are intrinsically diverse. In this case, the main difficulty resides in determining the optimal way to combine the predictions of the different models in the ensemble.
Broadly speaking, the methods to build heterogeneous ensembles can be grouped into two categories. In the first family of methods, a fixed number of different models are combined. A second strategy is to build a collection of models with different parametrizations and then select the best subset to include in the final ensemble. In [de Oliveira et al.2013] a static heterogeneous ensemble is proposed. In this study, 5 different base classifiers are combined: a Support Vector Machine (SVM), a multilayer perceptron (MLP), logistic regression, K nearest neighbors and a decision tree. The parameters and architecture of the individual classifiers are determined using 10-fold cross-validation. The proposed approach shows good results in the specific application of lithofacies classification. In [Nanni et al.2015], a combination of several carefully optimized strong learners, such as deep neural networks, SVMs, AdaBoost and Gaussian processes, is proposed. The study shows that the proposed combination performs well with respect to any of its constituents on several image classification and UCI tasks. However, the problem of determining the number of classifiers of each type that need to be used is not solved in a fully satisfactory manner. Furthermore, the optimal composition of the ensemble is problem-dependent. A possible way to overcome this difficulty is to create a library of classifiers and then select a subset for the final ensemble [Caruana et al.2004, Partalas et al.2010, Haque et al.2016]. For instance, in [Caruana et al.2004] a library of 2000 different models trained with a wide range of different parametrizations is built. The models included in the library are both individual classifiers and ensembles. The ensemble methods used include boosted trees with different decision tree algorithms and ensemble sizes, and bagged trees with different base decision trees. In addition, the individual trees of the bagged ensembles were also added to the library. Other individual classifiers included are SVMs trained with different parameters, multilayer perceptrons, etc. An iterative greedy selection algorithm is then applied to this library of models to build the final ensemble. The procedure starts with an empty ensemble.
Then, at each iteration, the model whose inclusion maximizes a performance measure (such as AUC or accuracy on a validation set) is added to the ensemble, until all models in the library have been aggregated. Finally, the ensemble with the best performance on the validation set is selected as the final combination. Tsoumakas et al. have made several interesting contributions in this line of research [Tsoumakas et al.2004, Partalas et al.2010]. For instance, in [Partalas et al.2010] the authors propose a greedy selection method from a library composed of 200 classifiers: 60 neural networks, 60 nearest neighbor classifiers, 80 SVMs and 20 decision trees. For each type of classifier, a parameter grid was defined and a single model was trained for each node in the grid. In their proposal, the ensemble is grown incrementally by selecting from the library one classifier at a time. At each step, the selection is made in terms of both individual accuracy and complementarity with the rest of the classifiers in the ensemble. In the problems investigated, such heterogeneous ensembles were found to be more accurate than their constituents. In [Haque et al.2016] a genetic algorithm is proposed to select the optimum structure of a heterogeneous ensemble from 20 different base models. These selection techniques, also known as ensemble pruning, have also been extensively applied to homogeneous ensembles [Tsoumakas et al.2009].
In this work we propose to analyze heterogeneous ensembles in which the individual classifiers are selected from homogeneous ensembles. The goal is to build a family of heterogeneous ensembles that can be smoothly transformed into each other. To this end, a family of heterogeneous ensembles of size T is built by pooling different fractions of base classifiers from M homogeneous ensembles of different types. Depending on the proportion of classifiers of each type, a particular heterogeneous combination is created. This family of heterogeneous ensembles can be represented in a regular simplex in M dimensions. The M vertices of this simplex represent the different homogeneous ensembles. The optimal fraction of each type of classifier for the final ensemble is found by performing a search in this simplex.
The paper is organized as follows: in Section 2, the design process to build optimal heterogeneous ensembles by pooling from homogeneous ensembles is described; Section 3 presents a comprehensive empirical evaluation of the proposed methodology and a comparison with the corresponding homogeneous ensembles and with individual classifiers. Finally, the conclusions of the present work are summarized.
2 From homogeneous to heterogeneous ensembles
In this study we analyze heterogeneous ensembles built by pooling individual classifiers from different homogeneous ensembles. For this, we first train M ensembles of size T, each composed of a different type of base classifier. The heterogeneous ensemble of size T is created by pooling $t_i$ classifiers from the i-th homogeneous ensemble, for i = 1, ..., M, where $t_i \geq 0$ is the number of base classifiers pooled from that ensemble and $\sum_{i=1}^{M} t_i = T$. The optimum percentage of each type of base classifier can be obtained by cross-validation or out-of-bag error in a grid search over the space of compositions $(t_1, \ldots, t_M)$. Note, however, that there are $\binom{T+M-1}{M-1}$ different heterogeneous ensembles that can be built in this manner and that this number can be rather large even for small values of T and M. For instance, for M = 3 and T = 101, 5253 different heterogeneous ensembles can be built. In order to reduce the search space, the ensembles can be evaluated only at regular intervals of the number of base classifiers of each type, which reduces the number of possible ensemble configurations accordingly. Finally, the ensemble composition with minimum validation error is selected as the optimal ensemble. In the case that more than one ensemble configuration attains the same minimum validation error, the average of the compositions of all these minima is selected as the optimal heterogeneous ensemble.
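The size of the search space and the selection rule described above can be sketched in a few lines of Python. This is only an illustration: the function names, the `step` argument and the dictionary of validation errors are assumptions introduced here, not part of the original study. When T is not a multiple of `step`, the last component of a composition absorbs the remainder.

```python
from itertools import product
from math import comb


def count_compositions(T, M):
    """Number of ways to pool T classifiers from M homogeneous ensembles."""
    return comb(T + M - 1, M - 1)


def grid_compositions(T, M, step):
    """Yield tuples (t_1, ..., t_M) that sum to T.

    The first M - 1 entries are multiples of step (hypothetical interval);
    the last entry takes whatever is left.
    """
    levels = range(0, T + 1, step)
    for head in product(levels, repeat=M - 1):
        rest = T - sum(head)
        if rest >= 0:
            yield head + (rest,)


def select_composition(errors):
    """Return the composition with minimum validation error, averaging ties.

    errors maps each composition tuple to its validation (e.g. out-of-bag)
    error. Ties are detected by exact equality, which is adequate when all
    errors are computed as counts over the same number of instances.
    """
    best = min(errors.values())
    minima = [c for c, e in errors.items() if e == best]
    M = len(minima[0])
    return tuple(sum(c[i] for c in minima) / len(minima) for i in range(M))


print(count_compositions(101, 3))  # 5253, the figure quoted in the text
```

For example, `select_composition({(10, 0, 0): 0.2, (0, 10, 0): 0.1, (0, 0, 10): 0.1})` returns `(0.0, 5.0, 5.0)`, the average of the two tied minima.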
For this study, we have used three homogeneous ensembles: random forests (RF), ensembles of support vector machines (SVMs) [Cortes and Vapnik1995] and ensembles of multilayer perceptrons (MLPs). All base classifiers of these ensembles are created using random samples from the training set to allow for a fast validation of the optimum heterogeneous ensemble by means of out-of-bag validation [Breiman1996b]. In order to generate ensembles of SVMs the following randomized procedure is used. First, several sets of partially optimized parameters for the SVMs are obtained. More details on how these sets are obtained are given below. Then, the ensemble is built in batches of SVMs. Each batch uses a different set of parameters and each individual SVM is trained on a different random sample drawn without replacement, of size 50% of the original training set (i.e. subbagging). In this way the variability among the SVMs is increased. With respect to standard bootstrap samples, subbagging has the advantage that the base models can be trained faster. The speedup is approximately a factor of 4, given the nearly quadratic training time of SVMs. In addition, the performance of both sampling strategies, bootstrapping and subbagging, has been shown to be equivalent [Friedman and Hall2007, Martínez-Muñoz and Suárez2010]. To obtain the sets of partially optimized parameters, we first define a parameter grid. Next, a subbagging sample is generated. One SVM is trained for each combination of parameters in the grid and validated on the left-out set. Finally, the set of parameters with the lowest error is kept for building the ensemble. This process is repeated once for each set of parameters required. The same procedure is used to generate the ensembles of MLPs. The training time complexity of the ensemble depends on the size of the parameter grid, on the number of parameter sets, on the size of the ensemble, on the sampling rate and on the complexity of the base classifier. Even though an ensemble of SVMs (or MLPs) is created, this procedure can be faster than training a single SVM by grid search and cross-validation, which is the most common way of training an SVM [Ben-Hur and Weston2010, Hsu et al.2003]. In the next section we will show the validity of this procedure to generate homogeneous ensembles of SVMs and MLPs, and also of the procedure to obtain heterogeneous ensembles from them.
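The randomized procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study (which relies on R packages): the names `subsample`, `partially_optimized_ensemble`, `n_sets` and `batch_size` are hypothetical, and a toy one-dimensional k-nearest-neighbour classifier stands in for the SVM or MLP so that the example runs with the standard library only.

```python
import random


def subsample(n, rate, rng):
    """Split indices 0..n-1 into a sample drawn without replacement and the rest."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(rate * n)
    return idx[:cut], idx[cut:]


def error(model, X, y, idx):
    """Fraction of the instances in idx that the model misclassifies."""
    return sum(model(X[i]) != y[i] for i in idx) / len(idx)


def train_knn(k, Xs, ys):
    """Toy stand-in for an SVM or MLP: 1-D k-nearest-neighbour with parameter k."""
    def predict(x):
        nearest = sorted(range(len(Xs)), key=lambda i: abs(Xs[i] - x))[:k]
        votes = sum(ys[i] for i in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return predict


def partially_optimized_ensemble(X, y, grid, train, n_sets, batch_size,
                                 rate=0.5, seed=0):
    """Build n_sets * batch_size models from n_sets partially optimized parameter sets."""
    rng = random.Random(seed)
    n = len(X)
    ensemble = []
    for _set in range(n_sets):
        # Partial optimization: train one model per grid point on a single
        # subsample and keep the parameters with the lowest left-out error.
        inbag, left_out = subsample(n, rate, rng)
        Xs, ys = [X[i] for i in inbag], [y[i] for i in inbag]
        best = min(grid, key=lambda p: error(train(p, Xs, ys), X, y, left_out))
        # Batch: each member uses the selected parameters and its own subsample.
        for _member in range(batch_size):
            inbag, _unused = subsample(n, rate, rng)
            model = train(best, [X[i] for i in inbag], [y[i] for i in inbag])
            # The in-bag indices are stored so that out-of-bag validation is possible.
            ensemble.append((best, inbag, model))
    return ensemble


rng = random.Random(1)
X = [rng.uniform(-1, 1) for _ in range(200)]
y = [1 if x > 0 else 0 for x in X]
ens = partially_optimized_ensemble(X, y, grid=[1, 3, 5], train=train_knn,
                                   n_sets=2, batch_size=3)
print(len(ens))  # 6 models: 2 parameter sets times 3 models per batch
```

Each grid point is trained only once per parameter set, on half of the data, which is where the saving with respect to a full cross-validated grid search comes from.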
3 Experimental Results
In this section we present the empirical analysis of heterogeneous ensembles built as combinations of base classifiers from homogeneous ensembles. Furthermore, we validate the procedure to obtain SVM (and MLP) ensembles by partial optimization of their training parameters. We carried out the analysis on 19 datasets from the UCI repository [Bache and Lichman2013]. In all tested datasets, except for the synthetic problems, the training and test sets were generated by random stratified sampling of the original sets. In the synthetic classification problems (Ringnorm, Threenorm and Twonorm), the training and test examples are sampled at random using independent realizations. The results reported are averages over repeated executions; for Breast, Chess, German, Tic-tac-toe, Ozone and Spambase, the averages are over fewer executions due to computational limitations.
Three (M = 3) homogeneous ensembles of size T were trained. Specifically, the ensembles used are: standard random forest [Breiman2001], and partially optimized ensembles of support vector machines and of multilayer perceptrons. We have used the e1071, RSNNS and randomForest R packages for creating SVMs, MLPs and RF, respectively. Under these settings, the number of possible configurations of the heterogeneous ensemble is $\binom{T+2}{2}$. To reduce the computational burden in the identification of the optimum combination of base classifiers, we evaluated the heterogeneous ensembles only at regular intervals of the number of base learners of each type, which reduces the number of evaluations required. In addition, since the base classifiers of the three analyzed ensembles were generated using random subsamples from the training set, the optimum heterogeneous configuration is obtained by out-of-bag validation to further reduce the computational cost. The values of the hyperparameters for the SVMs with an RBF kernel are selected from a predefined grid. For the MLPs, the number of neurons in the hidden layer was optimized over a predefined set of values. For building the partially optimized ensembles, the sets of hyperparameters were obtained using out-of-bag validation. For random forest, the default parameters were used.
3.1 Homogeneous ensembles of SVMs and MLPs
In order to validate the procedure to generate the partially optimized ensembles, a comprehensive comparison with respect to an optimized single base learner was carried out. For this purpose, a single SVM and a single MLP were trained using within-train 10-fold cross-validation and grid search over the same sets of parameters given above. The average errors for these experiments are shown in Table 1 for a single SVM and MLP, and for the homogeneous ensembles composed of SVMs (shown as E-SVM in the table) and of MLPs (shown as E-MLP). For each dataset, the best method is highlighted in boldface and the second best method is underlined. In addition, an overall comparison of the methods is shown in Figure 1 by means of the procedure proposed by Demšar [Demšar2006]. In this diagram, the average rank of each method is shown. A horizontal solid line connecting several methods indicates that their differences in average rank are not statistically significant according to a Nemenyi test (p-value = 0.05).
|Dataset||E-SVM||E-MLP||RF||SIM||[% SVM, % MLP, % Trees]||entropy/max|
|Australian||13.69 ± 2.1||14.16 ± 1.9||13.02 ± 2.1*||13.51 ± 2.0||[ 24.5 , 16.7 , 58.8 ]||0.87|
|Boston||12.20 ± 2.3||12.31 ± 2.0||12.85 ± 2.1||12.23 ± 2.0||[ 39.2 , 23.1 , 37.7 ]||0.98|
|Breast||3.37 ± 1.1||3.24 ± 1.1||3.30 ± 1.1||3.30 ± 1.0||[ 27.6 , 29.0 , 43.4 ]||0.98|
|Bupa||27.85 ± 3.4||28.32 ± 3.7||27.17 ± 3.6||27.27 ± 3.5||[ 21.4 , 15.6 , 63.0 ]||0.83|
|Chess||0.81 ± 0.3||0.92 ± 0.3||1.72 ± 0.4||0.76 ± 0.2||[ 34.7 , 22.7 , 42.6 ]||0.97|
|Colic||33.20 ± 1.4||31.30 ± 3.3||16.53 ± 2.9*||17.20 ± 3.0||[ 3.7 , 4.4 , 91.9 ]||0.3|
|German||24.59 ± 1.6||24.70 ± 1.9||23.94 ± 1.8*||24.31 ± 1.9||[ 16.0 , 28.1 , 55.8 ]||0.89|
|Heart||15.38 ± 3.0||16.34 ± 3.1||16.62 ± 2.9||15.50 ± 3.1||[ 33.5 , 23.8 , 42.7 ]||0.97|
|Hepatitis||15.83 ± 3.0||15.42 ± 4.3||15.12 ± 3.6||15.19 ± 3.6||[ 25.6 , 28.7 , 45.7 ]||0.97|
|Ionosphere||5.73 ± 1.7||11.28 ± 2.5||6.69 ± 1.7||5.84 ± 1.7||[ 64.4 , 13.7 , 21.9 ]||0.81|
|Ozone||5.60 ± 0.3||5.48 ± 0.5||5.67 ± 0.3||5.37 ± 0.4||[ 17.8 , 52.8 , 29.4 ]||0.91|
|Parkinsons||10.71 ± 3.7||13.66 ± 3.9||11.08 ± 4.0||10.71 ± 3.9||[ 44.1 , 12.9 , 43.0 ]||0.9|
|Pima||22.68 ± 1.8*||23.10 ± 2.1||23.12 ± 2.0||22.95 ± 1.8||[ 44.5 , 18.1 , 37.4 ]||0.94|
|Ringnorm||1.58 ± 0.4*||16.41 ± 1.5||5.87 ± 1.0||1.70 ± 0.5||[ 62.2 , 11.2 , 26.6 ]||0.81|
|Spambase||6.63 ± 0.4||5.90 ± 0.4||5.11 ± 0.4||4.97 ± 0.3||[ 12.2 , 11.1 , 76.8 ]||0.64|
|Sonar||17.78 ± 4.9||20.65 ± 4.6||18.88 ± 4.8||17.97 ± 4.4||[ 39.1 , 16.9 , 44.0 ]||0.94|
|Threenorm||14.10 ± 0.7*||16.93 ± 0.9||16.67 ± 1.0||14.36 ± 0.8||[ 52.1 , 10.8 , 37.1 ]||0.86|
|Tic-tac-toe||1.83 ± 0.7||1.83 ± 0.7||2.35 ± 1.1||1.46 ± 0.7*||[ 12.5 , 11.2 , 76.3 ]||0.65|
|Twonorm||2.44 ± 0.3*||2.94 ± 0.4||3.90 ± 0.5||2.55 ± 0.4||[ 42.5 , 22.7 , 34.8 ]||0.97|
From Table 1, it can be observed that the ensemble of MLPs clearly outperforms the single MLP. The differences are favourable to the ensemble of MLPs except for Ionosphere and Parkinsons. The differences between the single SVM and its ensemble counterpart are not as pronounced as the ones observed for MLPs. The ensemble of SVMs obtains a better result than a single SVM in 12 out of 19 datasets. The same result can be observed in Figure 1, where the average rank of E-SVM is slightly better than that of a single SVM. Even though this difference is not statistically significant, the analysis shows that this procedure to build ensembles of SVMs is not detrimental. When using MLPs as the base classifiers, we observe that the differences with respect to a single MLP are statistically significant. In addition, with these settings, we have observed that training E-SVM is about 2 times faster than training a single SVM using grid search and 10-fold cross-validation. On the other hand, the ensemble of MLPs is about 10 times slower than the single MLP due to the linear complexity of MLP training with respect to the number of training instances.
3.2 Heterogeneous ensemble pooled from homogeneous ensembles
In this section the performance of the proposed procedure to build heterogeneous ensembles by pooling from homogeneous ensembles is analyzed. The objective is to find the optimum proportion of each of the possible base classifiers to include in the final heterogeneous ensemble. Each of the possible selected proportions, which corresponds to a different heterogeneous ensemble, can be represented by a point in a regular simplex in M dimensions. This is shown in Figure 2 for three representative datasets: Heart, Colic and Tic-tac-toe. Each plot in Figure 2 shows, in a simplex with three vertices, the average test error for the different combinations of base classifiers using a color map. Darker colors indicate higher average error, as indicated by the color legend at the top right of each plot. The three vertices in the plots correspond to the three tested homogeneous ensembles. The vertices in the upper left, right and bottom left of the plot correspond to E-SVM, E-MLP and random forest, respectively. A displacement away from one of these vertices smoothly transforms the corresponding homogeneous ensemble into a heterogeneous one. The horizontal axis shows the number of selected MLPs in the heterogeneous ensemble, while the vertical axis indicates the number of SVMs minus the number of random trees. In addition, all plots show the average of the positions selected using out-of-bag validation (marked with an 'o' sign) and the average position of the best test errors (marked with a 'T' sign).
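The mapping from a composition to the plot coordinates described above is simple enough to state as code. The following sketch assumes a hypothetical ensemble size T = 100 purely for illustration; the function name is also an assumption.

```python
def simplex_to_plot(n_svm, n_mlp, n_rf):
    """Map a composition to plot coordinates.

    Horizontal axis: number of MLPs.
    Vertical axis: number of SVMs minus number of random trees.
    """
    return (n_mlp, n_svm - n_rf)


T = 100  # hypothetical ensemble size, for illustration only
print(simplex_to_plot(T, 0, 0))  # (0, 100): E-SVM vertex, upper left
print(simplex_to_plot(0, T, 0))  # (100, 0): E-MLP vertex, right
print(simplex_to_plot(0, 0, T))  # (0, -100): random forest vertex, bottom left
```

The three printed points reproduce the positions of the vertices given in the text.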
In the plots of Figure 2, different behaviours of the combination of base classifiers can be observed. In Heart (left plot), the best position is quite centered, showing that a heterogeneous ensemble composed of base classifiers of different types is beneficial to improve the generalization performance of the ensemble. However, this is not a general trend, as can be observed in the center plot (Colic). In this case, the best result is clearly located at one of the vertices of the simplex, namely the one that corresponds to the homogeneous random forest ensemble. It is also important to note that the optimum location need not be close to the best homogeneous ensemble. For instance, in Tic-tac-toe, the location of the minimum error is very close to the random forest vertex in spite of the fact that this homogeneous ensemble presents the worst average performance. Finally, we can observe that the average location of the minima identified using out-of-bag validation is quite close to the location of the minima in test. We have also observed, however, that for the smaller datasets the identification of the optimum point is less accurate.
In Table 2, the average test errors for the homogeneous ensembles of SVMs (E-SVM) and MLPs (E-MLP), random forest (RF) and the proposed strategy (SIM) over the investigated problems are reported. The best and second best results for each dataset are highlighted in boldface and underlined, respectively. In addition, the table shows the average percentage of classifiers of each type selected by out-of-bag validation for the heterogeneous ensembles. The percentages are shown in the same order as the ensembles, that is, % of SVMs, % of MLPs and % of random trees. The last column of the table indicates the entropy of the selected percentages of classifiers divided by the maximum entropy, which corresponds to the uniform composition [33.3, 33.3, 33.3].
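The normalized entropy in the last column can be reproduced from the reported percentages. The sketch below assumes the Shannon entropy of the average composition (the function name is illustrative); the base of the logarithm cancels in the ratio.

```python
from math import log


def normalized_entropy(percentages):
    """Shannon entropy of a composition divided by that of the uniform composition."""
    total = sum(percentages)
    probs = [p / total for p in percentages if p > 0]
    entropy = -sum(p * log(p) for p in probs)
    return entropy / log(len(percentages))


print(round(normalized_entropy([24.5, 16.7, 58.8]), 2))  # 0.87 (Australian row)
print(round(normalized_entropy([39.2, 23.1, 37.7]), 2))  # 0.98 (Boston row)
```

A value close to 1 indicates that the three types of classifiers are selected in similar proportions, whereas a value close to 0 indicates that the selected ensemble is nearly homogeneous.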
As shown in Table 2, the proposed method is the best or the second best method for all datasets. E-SVM also achieves rather good results but it is somewhat less consistent. E-SVM is the method that obtains the highest number of best performances (in 9 datasets) but its performance is the worst in 4 datasets. Finally, random forest and E-MLP obtain 5 and 1 best results, respectively. These results are summarized using a Demšar plot [Demšar2006] in Figure 3. From this diagram, it can be observed that the proposed procedure is significantly better than random forest and E-MLP (as given by a Nemenyi test with p-value = 0.05). The proposed methodology has a better average rank than E-SVM but the difference is not statistically significant.
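The quantities behind a Demšar diagram are the average rank of each method and the critical difference of the Nemenyi test. The sketch below computes both under stated assumptions: the function names are illustrative, ranks are computed from the rounded means reported in Table 2 (so they may differ slightly from those in the authors' figure), and the constant 2.569 is the critical value for four methods at the 0.05 level tabulated in [Demšar2006].

```python
from math import sqrt


def ranks(errors):
    """Rank the errors of one dataset (1 = lowest); ties share the average rank."""
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    result = [0.0] * len(errors)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and errors[order[end + 1]] == errors[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            result[order[j]] = shared
        pos = end + 1
    return result


def average_ranks(table):
    """Average the per-dataset ranks of each method over all datasets (rows)."""
    per_row = [ranks(row) for row in table]
    return [sum(col) / len(table) for col in zip(*per_row)]


def nemenyi_cd(k, n, q_alpha):
    """Critical difference in average rank for k methods compared over n datasets."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))


# Three rows of Table 2 (E-SVM, E-MLP, RF, SIM): Australian, Breast, Chess.
rows = [
    [13.69, 14.16, 13.02, 13.51],
    [3.37, 3.24, 3.30, 3.30],
    [0.81, 0.92, 1.72, 0.76],
]
print(ranks(rows[1]))  # [4.0, 1.0, 2.5, 2.5]
print(round(nemenyi_cd(k=4, n=19, q_alpha=2.569), 3))  # 1.076
```

With four methods and 19 datasets, two methods are thus considered significantly different when their average ranks differ by more than about 1.08.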
4 Conclusions
In this study, a continuous family of heterogeneous ensembles of size T with varying proportions of base classifiers of different types is analyzed. To this end, we first generate M different homogeneous ensembles. Diversification in these ensembles is obtained by using both subsampling and randomization techniques. Then a heterogeneous ensemble is built by pooling classifiers from these homogeneous ensembles. The proportions of classifiers of different types in the heterogeneous combination can be represented by a point in a simplex in M dimensions. Each of the M vertices of this simplex corresponds to one of the homogeneous ensembles. The optimal proportion of base classifiers in the final ensemble, which is strongly problem-dependent, can be estimated using out-of-bag data.
In the empirical evaluation carried out, the proposed strategy consistently exhibits excellent performance. In the problems investigated, it is either the first or the second most accurate method. The results show that the proposed combination achieves a better average rank than any of the homogeneous ensembles, namely random forest, ensembles of MLPs and ensembles of SVMs. In addition, the differences in average rank are statistically significant except for the ensemble of SVMs, which is second best.
References
- [Bache and Lichman2013] K. Bache and M. Lichman. UCI machine learning repository, 2013.
- [Banfield et al.2007] R. E. Banfield, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. A comparison of decision tree ensemble creation techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):173–180, 2007.
- [Bauer and Kohavi1999] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1):105–139, Jul 1999.
- [Ben-Hur and Weston2010] Asa Ben-Hur and Jason Weston. A User’s Guide to Support Vector Machines, pages 223–239. Humana Press, Totowa, NJ, 2010.
- [Breiman1996a] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug 1996.
- [Breiman1996b] Leo Breiman. Out-of-bag estimation. Technical report, Statistics Department, University of California, 1996.
- [Breiman2001] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, Oct 2001.
- [Caruana et al.2004] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML ’04, pages 18–, New York, NY, USA, 2004. ACM.
- [Cortes and Vapnik1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
- [de Oliveira et al.2013] Joacir Marques de Oliveira, Eulanda Miranda dos Santos, José Reginaldo Hughes Carvalho, and Leyne Abuim de Vasconcelos Marques. Ensemble of heterogeneous classifiers applied to lithofacies classification using logs from different wells. In International Joint Conference on Neural Networks, IJCNN, pages 1–6, Dallas, TX, USA, 2013.
- [Demšar2006] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
- [Dietterich and Bakiri1991] Thomas G. Dietterich and Ghulum Bakiri. Error-correcting output codes: A general method for improving multiclass inductive learning programs. In Proceeding of AAAI-91, pages 572–577. AAAI Press, 1991.
- [Dietterich2000] Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, pages 1–15, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg.
- [Freund et al.1999] Yoav Freund, Robert Schapire, and N. Abe. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5):771–780, 1999.
- [Friedman and Hall2007] Jerome H. Friedman and Peter Hall. On bagging and nonlinear estimation. Journal of Statistical Planning and Inference, 137(3):669 – 683, 2007.
- [Haque et al.2016] Mohammad Nazmul Haque, Nasimul Noman, Regina Berretta, and Pablo Moscato. Heterogeneous ensemble combination search using genetic algorithm for class imbalanced data classification. PloS one, 11(1):e0146116, 2016.
- [Hsu et al.2003] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University, 2003.
- [Lu et al.2015] Z. Lu, X. Wu, and J. C. Bongard. Active learning through adaptive heterogeneous ensembling. IEEE Transactions on Knowledge and Data Engineering, 27(2):368–381, 2015.
- [Martínez-Muñoz and Suárez2010] Gonzalo Martínez-Muñoz and Alberto Suárez. Out-of-bag estimation of the optimal sample size in bagging. Pattern Recognition, 43(1):143 – 152, 2010.
- [Nanni et al.2015] Loris Nanni, Sheryl Brahnam, Stefano Ghidoni, and Alessandra Lumini. Toward a general-purpose heterogeneous ensemble for pattern classification. Computational intelligence and neuroscience, 2015:85, 2015.
- [Partalas et al.2010] Ioannis Partalas, Grigorios Tsoumakas, and Ioannis Vlahavas. An ensemble uncertainty aware measure for directed hill climbing ensemble pruning. Machine Learning, 81(3):257–282, Dec 2010.
- [Tsoumakas et al.2004] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Effective voting of heterogeneous classifiers. In Jean-François Boulicaut, Floriana Esposito, Fosca Giannotti, and Dino Pedreschi, editors, Machine Learning: ECML 2004, pages 465–476, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.
- [Tsoumakas et al.2009] Grigorios Tsoumakas, Ioannis Partalas, and Ioannis Vlahavas. An Ensemble Pruning Primer, pages 1–13. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.