The combination of multiple classifiers using ensemble methods is increasingly important for making progress on a variety of difficult prediction problems. We present a comparative analysis of several ensemble methods through two case studies in genomics, namely the prediction of genetic interactions and protein functions, to demonstrate their efficacy on real-world datasets and draw useful conclusions about their behavior. These methods include simple aggregation, meta-learning, cluster-based meta-learning, and ensemble selection using heterogeneous classifiers trained on resampled data to improve the diversity of their predictions. We present a detailed analysis of these methods across four genomics datasets and find that the best of these methods offer statistically significant improvements over the state of the art in their respective domains. In addition, we establish a novel connection between ensemble selection and meta-learning, demonstrating how both of these disparate methods establish a balance between ensemble diversity and performance.
Ensemble methods combining the output of individual classifiers [1, 2] have been immensely successful in producing accurate predictions for many complex classification tasks [3, 4, 5, 6, 7, 8, 9]. The success of these methods is attributed to their ability to both consolidate accurate predictions and correct errors across many diverse base classifiers. Diversity is key to ensemble performance: if there is complete consensus, the ensemble cannot outperform the best base classifier, yet an ensemble lacking any consensus is unlikely to perform well due to weak base classifiers. Successful ensemble methods establish a balance between the diversity and accuracy of the ensemble [11, 12]. However, it remains largely unknown how different ensemble methods achieve this balance to extract the maximum information from the available pool of base classifiers [11, 13]. A better understanding of how different ensemble methods utilize diversity to increase accuracy on complex datasets is needed, which we attempt to address in this paper.
Popular methods like bagging and boosting generate diversity by sampling from or assigning weights to training examples, but generally utilize a single type of base classifier to build the ensemble. However, such homogeneous ensembles may not be the best choice for problems where the ideal base classifier is unclear. One may instead build an ensemble from the predictions of a wide variety of heterogeneous classifiers; two prominent approaches for doing so are stacking [16, 17] and ensemble selection [18, 19]. Stacking constructs a higher-level predictive model over the predictions of base classifiers, while ensemble selection uses an incremental strategy to select base predictors for the ensemble while balancing diversity and performance. Due to their ability to utilize heterogeneous base classifiers, these approaches have shown superior performance across several application domains [6, 20].
Computational genomics is one such domain where classification problems are especially difficult. This is due in part to incomplete knowledge of how the cellular phenomenon of interest is influenced by the variables and measurements used for prediction, as well as a lack of consensus regarding the best classifier for specific problems. Even from a data perspective, the frequent presence of extreme class imbalance, missing values, heterogeneous data sources of different scale, overlapping feature distributions, and measurement noise further complicates classification. These difficulties suggest that heterogeneous ensembles constructed from a large and diverse set of base classifiers, each contributing to the final predictions, are ideally suited for this domain. Thus, in this paper we use real-world genomic datasets (detailed in Section II-A) to analyze and compare the performance of ensemble methods for two important problems in this area: 1) the prediction of protein functions, and 2) the prediction of genetic interactions, both using high-throughput genomic datasets. Constructing accurate predictive models for these problems is notoriously difficult for the above reasons, and even small improvements in predictive accuracy have the potential for large contributions to biomedical knowledge. Indeed, such improvements uncovered the functions of mitochondrial proteins and several other critical protein families. Similarly, the computational discovery of genetic interactions between the human genes EGFR-IFIH1 and FKBP9L-MOSC2 potentially enables novel therapies for glioblastoma, the most aggressive type of brain tumor in humans.
Working with important problems in computational genomics, we present a comparative analysis of several methods used to construct ensembles from large and diverse sets of base classifiers. Several aspects of heterogeneous ensemble construction that have not previously been addressed are examined in detail including a novel connection between ensemble selection and meta-learning, the optimization of the diversity/accuracy tradeoff made by these disparate approaches, and the role of calibration in their performance. This analysis sheds light on how variants of simple greedy ensemble selection achieve enhanced performance, why meta-learning often out-performs ensemble selection, and several directions for future work. The insights obtained from the performance and behavior of ensemble methods for these complex domain-driven classification problems should have wide applicability across diverse applications of ensemble learning.
We begin by detailing our datasets, experimental methodology, and the ensemble methods studied (namely ensemble selection and stacking) in Section II. This is followed by a discussion of their performance in terms of standard evaluation metrics in Section III. We next examine how the roles of diversity and accuracy are balanced in ensemble selection and establish a connection with stacking by examining the weights assigned to base classifiers by both methods (Section IV-A). In Section IV-B, we discuss the impact of classifier calibration on heterogeneous ensemble performance, an important issue that has only recently received attention. We conclude and indicate directions for future work in Section V.
For this study we focus on two important problems in computational genomics: The prediction of protein functions, and the prediction of genetic interactions. Below we describe these problems and the datasets used to assess the efficacy of various ensemble methods. A summary of these datasets is given in Table I.
A key goal in molecular biology is to infer the cellular functions of proteins. To keep pace with the rapid identification of proteins due to advances in genome sequencing technology, a large number of computational approaches have been developed to predict various types of protein functions. These approaches use various genomic datasets to characterize the cellular functions of proteins or their corresponding genes. Protein function prediction is essentially a classification problem using features defined for each gene or its resulting protein to predict whether the protein performs a certain function (the positive class) or not (the negative class). We use the gene expression compendium of Hughes et al. to predict the functions of roughly 4,000 baker’s yeast (S. cerevisiae) genes. The three most abundant functional labels from the list of Gene Ontology Biological Process terms compiled by Myers et al. are used in our evaluation. The three corresponding prediction problems are referred to as PF1, PF2, and PF3 respectively and are suitable targets for classification case studies due to their difficulty. These datasets are publicly available from Pandey et al.
Genetic interactions (GIs) are a category of cellular interactions that are inferred by comparing the effect of the simultaneous knockout of two genes with the effect of knocking them out individually. The knowledge of these interactions is critical for understanding cellular pathways, evolution, and numerous other biological processes. Despite their utility, a general paucity of GI data exists for several organisms important for biomedical research. To address this problem, Pandey et al. used ensemble classification methods to predict GIs between genes from S. cerevisiae (baker’s yeast) using functional relationships between gene pairs such as correlation between expression profiles, extent of co-evolution, and the presence or absence of physical interactions between their corresponding proteins. We use the data from this study to assess the efficacy of heterogeneous ensemble methods for predicting GIs from a set of 152 features (see Table II for an illustration) and measure the improvement of our ensemble methods over this state of the art.
A total of 27 heterogeneous classifier types are trained using the statistical language R in combination with its various machine learning packages, as well as the RWeka interface to the data mining software Weka (see Table III). Among these are classifiers based on boosting and bagging, which are themselves a type of ensemble method, but whose performance can be further improved by inclusion in a heterogeneous ensemble. Classifiers are trained using 10-fold cross-validation where each training split is resampled with replacement 10 times, then balanced using undersampling of the majority class. The latter is a standard and essential step to prevent learning decision boundaries biased to the majority class in the presence of extreme class imbalance such as ours (see Table I). In addition, a 5-fold nested cross-validation is performed on each training split to create a validation set for the corresponding test split. This validation set is used for the meta-learning and ensemble selection techniques described in Section II-C. The final result is a pool of 270 classifiers.
Performance is measured by combining the predictions made on each test split resulting from cross-validation into a single set and calculating the area under the Receiver Operating Characteristic curve (AUC). The performance of the 27 base classifiers for each dataset is given in Table III, where the bagged predictions for each base classifier are averaged before calculating the AUC. These numbers become important in later discussions since ensemble methods involve a tradeoff between the diversity of predictions and the performance of base classifiers constituting the ensemble.
The predictions of each base classifier become columns in a matrix where rows are instances: the entry at row i, column j is the probability of instance i belonging to the positive class as predicted by classifier j. We evaluate ensembles using AUC by applying the mean across rows to produce an aggregate prediction for each instance.
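This mean-aggregation evaluation can be sketched as follows. The prediction matrix and labels are made-up, and the rank-based AUC helper is our own stand-in for the evaluation used in the paper's R pipeline:

```python
import numpy as np

def rank_auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) statistic; ties get average ranks."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks for tied scores
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# P[i, j] = probability classifier j assigns instance i to the positive class
P = np.array([[0.9, 0.8, 0.6],
              [0.2, 0.4, 0.1],
              [0.7, 0.6, 0.9],
              [0.3, 0.1, 0.4]])
y = np.array([1, 0, 1, 0])

ensemble = P.mean(axis=1)   # mean across classifiers for each instance
print(rank_auc(ensemble, y))  # perfect separation here -> 1.0
```

The same evaluation applies unchanged to any of the aggregation or selection schemes below, since each ultimately produces one score per instance.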
Meta-learning is a general technique for improving the performance of multiple classifiers by using the meta information they provide. A common approach to meta-learning is stacked generalization (stacking), which trains a higher-level (level 1) classifier on the outputs of base (level 0) classifiers.
Using the standard formulation of Ting and Witten, we perform meta-learning using stacking with a level 1 logistic regression classifier trained on the probabilistic outputs of multiple heterogeneous level 0 classifiers. Though other classifiers may be used, a simple logistic regression meta-classifier helps avoid overfitting, which typically results in superior performance. In addition, its coefficients have an intuitive interpretation as the weighted importance of each level 0 classifier.
The layer 1 classifier is trained on a validation set created by the nested cross-validation of a particular training split and evaluated against the corresponding test split to prevent the leaking of label information. Overall performance is evaluated as described in Section II-B.
In addition to stacking across all classifier outputs, we also evaluate stacking using only the aggregate output of each resampled (bagged) base classifier. For example, the outputs of all 10 SVM classifiers are averaged and used as a single level 0 input to the meta learner. Intuitively this combines classifier outputs that have similar performance and calibration, which allows stacking to focus on weights between (instead of within) classifier types.
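A sketch of this aggregated-stacking scheme, with scikit-learn's LogisticRegression standing in for our R meta-learner; the synthetic level 0 outputs and all dimensions here are illustrative, not drawn from our datasets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_val, n_types, n_bags = 200, 3, 10

# Synthetic level 0 outputs: preds[i, t, b] is the probability from the
# b-th resampled (bagged) copy of classifier type t on validation instance i.
y_val = rng.integers(0, 2, n_val)
signal = y_val[:, None, None] * 0.4
preds = np.clip(0.3 + signal + rng.normal(0, 0.2, (n_val, n_types, n_bags)), 0, 1)

# Aggregated stacking: average the bagged copies of each classifier type,
# then fit the level 1 logistic regression on the per-type means.
level0 = preds.mean(axis=2)                  # shape (n_val, n_types)
meta = LogisticRegression().fit(level0, y_val)

# The fitted coefficients act as learned weights over classifier types.
print(meta.coef_)
ensemble_probs = meta.predict_proba(level0)[:, 1]
```

Non-aggregated stacking is the same fit applied to all 30 bagged columns directly, leaving the meta-learner to weight within classifier types as well as between them.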
A variant on traditional stacking is to first cluster classifiers with similar predictions, then learn a separate level 1 classifier for each cluster. Alternately, classifiers within a cluster can first be combined by taking their mean (for example) and then learning a level 1 classifier on these per-cluster averaged outputs. This is a generalization of the aggregation approach described in Section II-C2, but using a distance measure instead of restricting each cluster to bagged homogeneous classifiers. We use hierarchical clustering with 1 − r (where r is Pearson’s correlation) as a distance measure. We found little difference between alternate distance measures based on Pearson and Spearman correlation and so present results using only this formulation.
For simplicity, we refer to the method of stacking within clusters and taking the mean of level 1 outputs as intra-cluster stacking. Its complement, inter-cluster stacking, averages the outputs of classifiers within a cluster then performs stacking on the averaged level 0 outputs. The intuition for both approaches is to group classifiers with similar (but ideally non-identical) predictions together and learn how to best resolve their disagreements via weighting. Thus the diversity of classifier predictions within a cluster is important, and the effectiveness of this method is tied to a distance measure that can utilize both accuracy and diversity.
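The clustering step can be sketched as follows, with SciPy's hierarchical clustering standing in for our R implementation; the synthetic prediction matrix (two correlated families of classifiers) is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
# Columns are classifier prediction vectors; build two correlated families.
base_a, base_b = rng.normal(size=100), rng.normal(size=100)
P = np.column_stack([base_a + rng.normal(0, 0.1, 100) for _ in range(3)] +
                    [base_b + rng.normal(0, 0.1, 100) for _ in range(3)])

# Distance between classifiers i and j is 1 - r, with r = Pearson correlation.
dist = 1 - np.corrcoef(P, rowvar=False)
np.fill_diagonal(dist, 0)        # remove floating-point residue on the diagonal
Z = linkage(squareform(dist, checks=False), method='average')
clusters = fcluster(Z, t=2, criterion='maxclust')
print(clusters)   # the two correlated families land in separate clusters
```

Intra-cluster stacking would then fit one meta-learner per cluster label and average the level 1 outputs; inter-cluster stacking averages columns within each cluster and stacks over the cluster means.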
Ensemble selection is the process of choosing a subset of all available classifiers that perform well together, since including every classifier may decrease performance. Testing all possible classifier combinations quickly becomes infeasible for ensembles of any practical size and so heuristics are used to approximate the optimal subset. The performance of the ensemble can only improve upon that of the best base classifier if the ensemble has a sufficient pool of accurate and diverse classifiers, and so successful selection methods must balance these two requirements.
We establish a baseline for this approach by performing simple greedy ensemble selection, sorting base classifiers by their individual performance and iteratively adding the best unselected classifier to the ensemble. This approach disregards how well the classifier actually complements the performance of the ensemble.
Improving on this approach, Caruana et al.’s ensemble selection (CES) [18, 19] begins with an empty ensemble and iteratively adds new predictors that maximize its performance according to a chosen metric (here, AUC). At each iteration, a number of candidate classifiers are randomly selected and the performance of the current ensemble including the candidate is evaluated. The candidate resulting in the best ensemble performance is selected and the process repeats until a maximum ensemble size is reached. The evaluation of candidates according to their performance with the ensemble, instead of in isolation, improves the performance of CES over simple greedy selection.
Additional improvements over simple greedy selection include 1) initializing the ensemble with the top-performing base classifiers, and 2) allowing classifiers to be added multiple times. The latter is particularly important since, without replacement, the best classifiers are added early and ensemble performance then decreases as poor predictors are forced into the ensemble. Replacement gives more weight to the best performing predictors while still allowing for diversity. We seed the ensemble with a small number of top classifiers to reduce the effect of multiple bagged versions of a single high performance classifier dominating the selection process, and (for completeness) evaluate all candidate classifiers instead of sampling.
Ensemble predictions are combined using a cumulative moving average to speed the evaluation of ensemble performance for each candidate predictor. Selection is performed on the validation set produced by nested cross-validation and the resulting ensemble evaluated as described in Section II-B.
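A condensed Python rendition of this selection loop (our own sketch, not the original implementation): it seeds with the top individual classifiers, evaluates all candidates with replacement as described above, and maintains a cumulative sum so the moving-average ensemble prediction is cheap to update. The synthetic pool mixes informative and uninformative classifiers:

```python
import numpy as np

def rank_auc(scores, labels):
    """AUC via the rank-sum statistic (ties are rare with real-valued scores)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ces(P, y, init=2, max_size=20):
    """Forward selection with replacement over validation predictions P[i, j]."""
    solo = np.array([rank_auc(P[:, j], y) for j in range(P.shape[1])])
    chosen = list(np.argsort(solo)[::-1][:init])  # seed with top classifiers
    running = P[:, chosen].sum(axis=1)            # cumulative sum -> moving average
    while len(chosen) < max_size:
        # Score each candidate by the AUC of the ensemble that would include it.
        scores = [rank_auc((running + P[:, j]) / (len(chosen) + 1), y)
                  for j in range(P.shape[1])]
        best = int(np.argmax(scores))
        chosen.append(best)            # replacement: a classifier may be re-added
        running += P[:, best]
    return chosen, rank_auc(running / len(chosen), y)

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 300)
good = y[:, None] * 0.5 + rng.normal(0, 0.3, (300, 5))   # informative pool
noise = rng.normal(0, 1, (300, 15))                      # uninformative pool
chosen, auc = ces(np.column_stack([good, noise]), y)
```

Repeated indices in `chosen` implement the implicit weighting discussed in Section IV-A: a classifier selected k times contributes k shares to the moving average.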
The diversity of predictions made by members of an ensemble determines the ensemble’s ability to outperform the best individual, a long-accepted property which we explore in the following sections. We measure diversity using Yule’s Q-statistic by first creating predicted labels from thresholded classifier probabilities, yielding a 1 for values greater than 0.5 and 0 otherwise. Given the predicted labels produced by each pair of classifiers D_i and D_j, we generate a contingency table counting how often each classifier produces the correct label in relation to the other:
| | D_j correct (1) | D_j incorrect (0) |
| D_i correct (1) | N11 | N10 |
| D_i incorrect (0) | N01 | N00 |
The pairwise Q statistic is then defined as:

Q_{i,j} = (N11·N00 − N01·N10) / (N11·N00 + N01·N10)

This produces values tending towards 1 when D_i and D_j correctly classify the same instances, 0 when they do not, and −1 when they are negatively correlated. We evaluated additional diversity measures such as Cohen’s kappa-statistic but found little practical difference between the measures (in agreement with Kuncheva et al.) and focus on Q for its simplicity. Multicore performance and diversity measures are implemented in C++ using the Rcpp package. This proves essential for their practical use with large ensembles and nested cross-validation.
For graphical clarity, we adjust raw Q values using the transformation Q′ = (1 − Q)/2 so that 0 represents no diversity and 1 represents maximum diversity.
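The computation of Q and its rescaling can be sketched as follows (the labels and predictions are made-up; the pair shown is statistically independent, giving Q = 0):

```python
import numpy as np

def yules_q(pred_i, pred_j, y):
    """Yule's Q from the 2x2 contingency table of correct/incorrect labels."""
    ci, cj = (pred_i == y), (pred_j == y)        # correctness indicators
    n11 = np.sum(ci & cj);  n00 = np.sum(~ci & ~cj)
    n10 = np.sum(ci & ~cj); n01 = np.sum(~ci & cj)
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

y      = np.array([1, 1, 1, 0, 0, 0, 1, 0])
pred_a = np.array([1, 1, 0, 0, 0, 1, 1, 0])
pred_b = np.array([1, 0, 1, 0, 1, 1, 1, 1])

q = yules_q(pred_a, pred_b, y)   # 0.0 for this independent pair
diversity = (1 - q) / 2          # rescaled: 0 = no diversity, 1 = maximum
```

Note the rescaling maps Q = 1 (identical correctness patterns) to 0 and Q = −1 (fully negatively correlated) to 1, matching the convention used in our figures.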
Performance of the methods described in Section II-C is summarized in Table IV. Overall, aggregated stacking is the best performer and edges out CES for all our datasets. The use of clustering in combination with stacking also performs well for certain cluster sizes. Intra-cluster stacking performs best with cluster sizes 2, 14, 20, and 15 for GI, PF1, PF2, and PF3, respectively. Inter-cluster stacking is optimal for sizes 24, 33, 33, and 36 on the same datasets. Due to the size of the GI dataset, only 10% of the validation set is used for non-aggregate stacking and cluster stacking methods. Performance levels off beyond 10% and so this approach does not significantly penalize these methods. This step was not necessary for the other methods and datasets.
| Method | GI | PF1 | PF2 | PF3 |
| Best Base Classifier | 0.79 | 0.68 | 0.72 | 0.78 |
AUC of ensemble learning methods for protein function and genetic interaction datasets. Methods include mean aggregation, greedy ensemble selection, selection with replacement (CES), stacking with logistic regression, aggregated stacking (averaging resampled homogeneous base classifiers before stacking), stacking within clusters then averaging (intra), and averaging within clusters then stacking (inter). The best performing base classifier (random forest for the GI dataset and GBM for PFs) is given for reference. Starred values are generated from a subsample of the validation set due to its size; see text for detail.
|Method A||Method B||p-value|
|Best Base Classifier||CES||0.001902|
|Best Base Classifier||Stacking (Aggregated)||0.000136|
|Greedy Selection||Mean Aggregation||0.029952|
|Greedy Selection||Stacking (Aggregated)||0.005364|
|Inter-Cluster Stacking||Mean Aggregation||0.047336|
|Inter-Cluster Stacking||Stacking (Aggregated)||0.003206|
|Intra-Cluster Stacking||Stacking (Aggregated)||0.001124|
|Mean Aggregation||Stacking (Aggregated)||0.000022|
|Stacking (Aggregated)||Stacking (All)||0.002472|
| Group | Method | Summed Rank |
| cd | Best Base Classifier | 11 |
Ensemble selection also performs well, though we anticipate issues of calibration (detailed in Section IV-B) could have a negative impact since the mean is used to aggregate ensemble predictions. Greedy selection achieves best performance for ensemble sizes of 10, 14, 45, and 38 for GI, PF1, PF2, and PF3, respectively. CES is optimal for sizes 70, 43, 34, and 56 for the same datasets. Though the best performing ensembles for both selection methods are close in performance, simple greedy selection is much worse for non-optimal ensemble sizes than CES and its performance typically degrades after the best few base classifiers are selected (see Section IV-A). Thus, on average CES is the superior selection method.
In agreement with Altman et al., we find the mean is the highest performing simple aggregation method for combining ensemble predictions. However, because we are using heterogeneous classifiers that may have uncalibrated outputs, the mean combines predictions made with different scales or notions of probability. This explains its poor performance compared to the best base classifier in a heterogeneous ensemble and emphasizes the need for ensemble selection or weighting via stacking to take full advantage of the ensemble. We discuss the issue of calibration in Section IV-B.
Thus, we observe consistent performance trends across these methods. However, to draw meaningful conclusions it is critical to determine if the performance differences are statistically significant. For this we employ the standard methodology given by Demšar to test for statistically significant performance differences between multiple methods across multiple datasets. The Friedman test first determines if there are statistically significant differences between any pair of methods over all datasets, followed by a post-hoc Nemenyi test to calculate a p-value for each pair of methods. This is the non-parametric equivalent of ANOVA combined with a Tukey HSD post-hoc test, where the assumption of normally distributed values is removed by using rank transformations. As many of the assumptions of parametric tests are violated by machine learning algorithms, the Friedman/Nemenyi test is preferred despite reduced statistical power.
Using the Friedman/Nemenyi approach with a fixed significance cutoff, the pairwise comparisons between our ensemble and non-ensemble methods are shown in Table V. For brevity, only methods with statistically significant performance differences are shown. The ranked performance of each method across all datasets is shown in Table VI. Methods sharing a label in the group column have statistically indistinguishable performance based on their summed rankings. This table shows that aggregated stacking and CES have the best performance, while CES and pure greedy selection have similar performance. However, aggregated stacking and greedy selection do not share a group, as their summed ranks are too distant, and thus have a significant performance difference. The remaining approaches, including non-aggregated stacking, are statistically similar to mean aggregation, which motivates our inclusion of cluster-based stacking, whose performance may improve given a more suitable distance metric. These rankings statistically reinforce the general trends presented earlier in Table IV.
We note that nested cross-validation, relative to a single validation set, improves the performance of both stacking and CES by increasing the amount of meta data available as well as the bagging that occurs as a result. Both effects reduce overfitting, but performance is still typically better with smaller ensembles. More nested folds increase the quality of the meta data and thus affect the performance of these methods as well, though computation time increases substantially and motivates our selection of five nested folds.
Finally, we emphasize that each method we evaluate out-performs the previous state of the art AUC of 0.741 for GI prediction. In particular, stacked aggregation results in the prediction of 988 additional genetic interactions at a 10% false discovery rate. In addition, these heterogeneous ensemble methods out-perform random forests and gradient boosted regression models, which are themselves homogeneous ensembles. This demonstrates the value of heterogeneous ensembles for improving predictive performance.
The relationship between ensemble diversity and performance has immediate impact on practitioners of ensemble methods, yet has not formally been proven despite extensive study [11, 13]. For brevity we analyze this tradeoff using GI and PF3 as representative datasets, though the trends observed generalize to PF1 and PF2.
Figure 1 presents a high-level view of the relationship between performance and diversity, plotting the diversity of pairwise classifiers against their performance as an ensemble by taking the mean of their predictions. This figure shows the complicated relationship between diversity and performance that holds for each of our datasets: two highly diverse classifiers are more likely to perform poorly due to lower prediction consensus. There are exceptions, and these tend to include well-performing base classifiers such as random forests and gradient boosted regression models (shown in red in Figure 1) which achieve high AUC on their own and stand to gain from a diverse partner. Diversity works in tension with performance, and while improving performance depends on diversity, the wrong kind of diversity limits performance of the ensemble.
Figure 2 demonstrates this tradeoff by plotting ensemble diversity and performance as a function of the iteration number of the simple greedy selection and CES methods detailed in Section II for the GI (top figure) and PF3 (bottom figure) datasets. These figures reveal how CES (top curve) successfully exploits the tradeoff between diversity and performance while a purely greedy approach (bottom curve) actually decreases in performance over iterations after the best individual base classifiers are added. This is shown via coloring, where CES shifts from red to yellow (better performance) as its diversity increases while greedy selection grows darker red (worse performance) as its diversity only slightly increases. Note that while greedy selection increases ensemble diversity around iteration 30 for PF3, overall performance continues to decrease. This demonstrates that diversity must be balanced with accuracy to create well-performing ensembles.
To illustrate using the PF3 panel of Figure 2, the first six classifiers chosen by CES (in order) are rf.1, rf.7, gbm.2, RBFClassifier.0, MultilayerPerceptron.9, and gbm.3, where numbers indicate bagged versions of a base classifier. RBFClassifier.0 is a low performance, high diversity classifier while the others are the opposite (see Table III for a summary of base classifiers). This ensemble shows how CES tends to repeatedly select base classifiers that improve performance, then selects a more diverse and typically worse performing classifier. Here the former are different bagged versions of a random forest while the latter is RBFClassifier.0. This manifests in the left part of the upper curve where diversity is low and then jumps to its first peak. After this, a random forest is added again to balance performance and diversity drops until the next peak. This process is repeated while the algorithm approaches a weighted equilibrium of high performing, low diversity and low performing, high diversity classifiers.
This agrees with recent observations that diversity enforces a kind of regularization for ensembles [13, 49]: Performance stops increasing when there is no more diversity to extract from the pool of possible classifiers. We see this in the PF3 panel of Figure 2 as performance reaches its peak, where small oscillations in diversity represent re-balancing the weights to maintain performance past the optimal ensemble size.
Since ensemble selection and stacking are top performers and can both be interpreted as learning to weight different base classifiers, we next compare the most heavily weighted classifiers selected by CES (Weight_CES) with the coefficients of a level 1 logistic regression meta-learner (Weight_LR). We compute Weight_CES as the normalized counts of classifiers included in the ensemble, resulting in greater weight for classifiers selected multiple times. These weights for PF3 are shown in Table VII.
Nearly the same classifiers receive the most weight under both approaches (though logistic regression coefficients were not restricted to positive values so we cannot directly compare weights between methods). However, the general trend of the relative weights is clear and explains the oscillations seen in Figure 2: High performance, low diversity classifiers are repeatedly paired with higher diversity, lower performing classifiers. A more complete picture of selection emerges by examining the full list of candidate base classifiers (Table VIII) with the most weighted ensemble classifiers shown in bold. The highest performing, lowest diversity GBM and RF classifiers appear at the top of the list while VFI and IBk are near the bottom. Though there are more diverse classifiers than VFI and IBk, they were not selected due to their lower performance.
This example illustrates how diversity and performance are balanced during selection, and also gives new insight into the nature of stacking due to the convergent weights of these seemingly different approaches. A metric incorporating both measures should increase the performance of hybrid methods such as cluster-based stacking, which we plan to investigate in future work.
A key factor in the performance difference between stacking and CES is illustrated by stacking’s selection of MultilayerPerceptron instead of RBFClassifier for PF3. This difference in the relative weighting of classifiers, or the exchange of one classifier for another in the final ensemble, persists across our datasets. We suggest this is due to the ability of the layer 1 classifier to learn a function on the probabilistic outputs of base classifiers and compensate for potential differences in calibration, resulting in the superior performance of stacking.
A binary classifier is said to be well-calibrated if, among the predictions made with confidence p, a fraction p are correct. However, accuracy and calibration are related but not the same: a binary classifier that flips a fair coin for a balanced dataset will be calibrated but not accurate. Relatedly, many well-performing classifiers do not produce calibrated probabilities. Measures such as AUC are not sensitive to the calibration of base classifiers, and the effects of calibration on heterogeneous ensemble learning have only recently been studied. This section further investigates this relationship.
To illustrate a practical example of calibration, consider a support vector machine. An uncalibrated SVM outputs the signed distance of an instance from the separating hyperplane, which is not a true posterior probability of the instance belonging to a class, but it is commonly converted to one using Platt’s method. In fact, this is analogous to fitting a layer 1 logistic regression to the uncalibrated SVM outputs, with a slight modification to avoid overfitting. This approach is not restricted to SVMs, and additional methods such as isotonic regression are commonly used for both binary and multi-class problems.
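This can be sketched with scikit-learn's built-in calibration wrapper, where `method='sigmoid'` corresponds to Platt's method; the dataset is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Raw SVM decision values are margins (signed distances), not probabilities.
svm = SVC().fit(X, y)
margins = svm.decision_function(X)

# Platt's method fits a sigmoid (a regularized logistic fit) to the margins;
# here via sklearn's calibration wrapper with internal cross-validation.
platt = CalibratedClassifierCV(SVC(), method='sigmoid', cv=3).fit(X, y)
probs = platt.predict_proba(X)[:, 1]
```

Swapping `method='sigmoid'` for `method='isotonic'` gives the isotonic-regression alternative mentioned above.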
Regardless of the base classifier, a lack of calibration may affect the performance of ensemble selection methods such as CES, since the predictions of many heterogeneous classifiers are combined using simple aggregation rules such as the mean. Several methods exist for evaluating the calibration of probabilistic classifiers. One such method, the Brier score, assesses how close (on average) a classifier's probabilistic output f_i is to the correct binary label y_i:

BS = (1/N) Σ_{i=1}^{N} (f_i − y_i)²

over all N instances. This is simply the mean squared error evaluated in the context of probabilistic binary classification. Lower scores indicate better calibration.
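The Brier score is straightforward to compute directly; a minimal sketch:

```python
# Minimal sketch of the Brier score: mean squared error between predicted
# probabilities and binary labels. Lower scores indicate better calibration.
import numpy as np

def brier_score(probs, labels):
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))

# A confident, correct classifier scores near 0; a fair-coin classifier on a
# balanced dataset scores exactly 0.25, matching the example above.
print(brier_score([0.9, 0.1, 0.8], [1, 0, 1]))  # → 0.02
print(brier_score([0.5, 0.5], [1, 0]))          # → 0.25
```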
Figure 3 plots the Brier scores for each base classifier against its performance for the GI dataset, as well as the ensemble Brier scores for each iteration of CES and greedy selection. This shows that classifiers and ensembles with calibrated outputs generally perform better. Note in particular the calibration and performance of simple greedy selection: its initial iterations sit in the upper left of the panel, where high-performing, well-calibrated base classifiers are chosen for the ensemble, but move to the lower right as sub-optimal classifiers are forced into the ensemble. In contrast, CES starts with points in the lower right and moves to the upper left as both ensemble calibration and performance improve at each iteration. The upper left of the CES plot suggests the benefit of additional classifiers outweighs a loss in calibration during its final iterations.
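The CES iterations discussed above follow a simple greedy loop: at each step, add (with replacement) the base classifier whose inclusion most improves a validation metric on the mean-aggregated ensemble prediction. A simplified sketch, with hypothetical names and omitting refinements such as bagging over the model library:

```python
# Illustrative sketch of Caruana-style ensemble selection (CES).
# `preds` maps classifier name -> validation-set probability predictions.
import numpy as np

def ces(preds, labels, metric, n_iter=20):
    selected = []                                  # multiset of chosen names
    ensemble = np.zeros_like(labels, dtype=float)  # running mean prediction
    for _ in range(n_iter):
        best_name, best_score = None, -np.inf
        for name, p in preds.items():
            # Candidate ensemble: mean prediction if this classifier is added.
            cand = (ensemble * len(selected) + p) / (len(selected) + 1)
            score = metric(labels, cand)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)  # selection WITH replacement
        ensemble = (ensemble * (len(selected) - 1) + preds[best_name]) / len(selected)
    return selected, ensemble

# Toy usage: negative Brier score as the metric to maximize.
labels = np.array([0.0, 1.0, 0.0, 1.0])
preds = {"good": np.array([0.1, 0.9, 0.2, 0.8]),
         "bad":  np.array([0.5, 0.5, 0.5, 0.5])}
chosen, _ = ces(preds, labels, lambda y, p: -np.mean((p - y) ** 2), n_iter=5)
```

Selecting with replacement lets the procedure weight strong classifiers more heavily, which is how CES arrives at the implicit classifier weights compared against stacking below.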
Stacking produces a layer 1 classifier with approximately half the Brier score (0.083) of CES or the best base classifiers. Since this approach learns a function over probabilities it is able to adjust to the different scales used by potentially ill-calibrated classifiers in a heterogeneous ensemble. This explains the difference in the final weights assigned by stacking and CES to the base classifiers in Table VII: Though the relative weights are mostly the same, logistic regression is able to correct for the lack of calibration across classifiers and better incorporate the predictions of MultilayerPerceptron whereas CES cannot. In this case, a calibrated MultilayerPerceptron serves to improve performance of the ensemble and thus stacking outperforms CES.
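The stacking setup described above, a logistic regression trained over base classifiers' probabilistic outputs, can be sketched with scikit-learn's StackingClassifier. The dataset and base classifiers here are illustrative placeholders, not our experimental ensemble:

```python
# Sketch: stacking with a logistic-regression layer 1 learner trained on the
# cross-validated probability outputs of heterogeneous base classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("mlp", MLPClassifier(max_iter=500, random_state=0))],
    final_estimator=LogisticRegression(),  # learns a map over probabilities,
    stack_method="predict_proba",          # correcting for calibration scales
    cv=5,
).fit(X_tr, y_tr)
probs = stack.predict_proba(X_te)[:, 1]
```

Because the layer 1 logistic regression sees each base classifier's probabilities as input features, it can rescale an ill-calibrated but informative classifier rather than discard it, which is the mechanism proposed above for stacking's advantage over CES.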
In summary, this section demonstrates the tradeoff between performance and diversity made by CES and examines its connection with stacking. There is significant overlap in the relative weights of the most important base classifiers selected by both methods. From this set of classifiers, stacking often assigns more weight to a particular classifier as compared to CES and this result holds across our datasets. We attribute the superior performance of stacking to this difference, originating from its ability to accommodate differences in classifier calibration that are likely to occur in large heterogeneous ensembles. This claim is substantiated by its significantly lower Brier score compared to CES as well as the correlation between ensemble calibration and performance. This suggests the potential for improving ensemble methods by accommodating differences in calibration.
The aim of ensemble techniques is to combine diverse classifiers in an intelligent way such that the predictive accuracy of the ensemble is greater than that of the best base classifier. Since enumerating the space of all classifier combinations quickly becomes infeasible for even relatively small ensemble sizes, other methods for finding well-performing ensembles have been widely studied and applied in the last decade.
In this paper we apply a variety of ensemble approaches to two difficult problems in computational genomics: The prediction of genetic interactions and the prediction of protein functions. These problems are notoriously difficult for their extreme class imbalance, prevalence of missing values, integration of heterogeneous data sources of different scale, and overlap between feature distributions of the majority and minority classes. These issues are amplified by the inherent complexity of the underlying biological mechanisms and incomplete domain knowledge.
We find that stacking and ensemble selection approaches offer statistically significant improvements over the previous state of the art for GI prediction and moderate improvements over tuned random forest classifiers, which are particularly effective in this domain. Here, even small improvements in accuracy can contribute directly to biomedical knowledge after wet-lab verification: These include 988 additional genetic interactions predicted by aggregated stacking at a 10% false discovery rate. We also uncover a novel connection between stacking and Caruana et al.'s ensemble selection method (CES) [18, 19], demonstrating how these two disparate methods converge to nearly the same final base classifier weights by balancing diversity and performance in different ways. We explain how variations in these weights relate to the calibration of base classifiers in the ensemble, and describe how stacking improves accuracy by accounting for differences in calibration. This connection also shows that the utilization of diversity is an emergent, rather than explicit, property of how CES maximizes ensemble performance. It suggests several directions for future work: formalizing the effects of calibration on heterogeneous ensemble performance, modifying CES to explicitly incorporate diversity, and formulating the diversity/performance tradeoff as an optimization problem to improve cluster-based stacking methods.
We thank the Genomics Institute at Mount Sinai for their generous financial and technical support.
M. Friedman, “The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance,” Journal of the American Statistical Association, vol. 32, no. 200, pp. 675–701, 1937.
B. Zadrozny and C. Elkan, “Transforming Classifier Scores Into Accurate Multiclass Probability Estimates,” in Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining, 2002, pp. 694–699.