Numerous machine learning algorithms with roots in various areas of computer science and related fields have been developed for solving different classes of problems. They include linear models, support vector machines, decision trees, ensemble algorithms like boosting or random forests, neural networks etc
. There are also many feature selection techniques used to prepare the input data for the predictive algorithm. Different methods can extract and utilise different parts of the information contained in the data set under scrutiny. It has been shown that in many cases an ensemble machine learning model, which combines several different predictions, can outperform the component learners. This can be realised in the form of wisdom of crowds or in a more systematic way as a Super Learner . The wisdom of crowds is one of the principles underlying the design of DREAM Challenges, where multiple team contribute diverse algorithms and methodologies for solving complex biomedical problems 
. Super learning was proposed by van der Laan et al. and implemented as SuperLearner R language package. It utilises cross validation to estimate the performance of the component algorithms and dependencies between them. One may notice that wisdom of crowds can be formally cast as a special example of super learning. The goal of the current study is to examine several methods for estimate the performance of the super learner algorithm. In particular it explores methods for minimising bias with minimal computational effort.
1.1 Super learning – basic notions and algorithms
The input of any machine learning algorithm is a set of
observations, usually assumed to be independent and identically distributed, of the response variableand a set of explanatory variables , . A predictive algorithm (that includes also the feature selection algorithm and the set of hyper-parameters) can be trained using a data set to produce a model . Then, the model can be applied to another set of variables to obtain a vector of predictions .
-fold cross validation is an almost unbiased technique to estimate performance of a machine learning algorithm when applied to unseen data. Algorithm 1 allows to compute a vector of predictions for all the observations in the data set. The predictions can be compared with the original decision variable to estimate the performance of the learning algorithm.
In super learning, we use multiple machine learning algorithms , . The cross validation procedure leads to vectors of predictions . The idea is to treat them as new explanatory variables and apply some machine learning algorithm to build a second-order predictive model (see Algorithm 2). The new features are expected to be strongly and linearly connected with the response variable, so linear models with non-negative weights seem to be appropriate super learning algorithms. Indeed they are default methods in SuperLearner R package. Note, however, that any method that selects a weighted subset of the elementary learning algorithms can be formally considered as and example of super learning. This includes also selection of the best-performing algorithm, which is a common application of cross validation, and can be considered as a special case of super learning . Other examples can be unweighted mean of best performing algorithms, as in the wisdom of crowds, or even a simple mean of all learning algorithms used.
1.2 Performance estimation for super learning – nested cross validation
Super learning, as every machine learning method, is sensitive to overfitting. Therefore, an unbiased estimate of the performance of ensemble models is necessary. The obvious and most reliable method to obtain it is an external cross validation. The entire procedure is called nested cross validation, as it contains two levels of CV loop (see Algorithm3).
Nested cross validation is implemented in SuperLearner R package as a default method of performance estimation. The authors of Super learner algorithm recommend to use 10-fold internal cross validation . The external CV should also be at least 10-fold, to avoid a meaningful reduction of the sample size. Even if one restricts himself to a single loop of the external CV, the entire procedure requires all the learners to be run 110 times. Although the routine is easy to parallelise, the computational complexity is very large.
1.3 Bootstrap bias correction for super learning
An alternative approach to verification of complex machine learning algorithms was proposed by Tsamardinos et al. . It is called the Bootstrap Bias Correction and bases on Efron’s bootstrapping technique . The method was originally proposed to estimate the bias caused by a choice of the optimal set of hyper-parameters for a predictive model. However, as it has been mentioned, this task can be considered as a special case of super learning. What is more, the procedure is general and does not involve any actions specific to selection of hyper-parameters. Thus, the Tsamardinos’ method can be easily generalised for any kind of super learning.
The idea is shown in Algorithm 4. The cross validated predictions for all the algorithms are computed only once. Then, the combining models are computed many times for samples that are drawn with replacement from the predictions. The predictions of each combination model are then tested on it’s out-of-bag observations.
In this method, all base predictions come from the same cross-validation run, hence the entire sample is somewhat co-dependent. Therefore learning on the outcome will be overfitted. On the other hand, the training set contains duplicates what introduces additional noise. The effective data set size is smaller than the sample. This effect should decrease the overfitting and cancel the bias of the performance estimate.
The major advantage of BBC method is its small computational complexity. The elementary predictions are computed only once, multiple runs are required only for the relatively simple combining procedure.
In the current study, we apply bootstrap bias corrected super learning algorithm to synthetic and real-world data sets and compare the results with other verification methods, including the nested cross validation.
2 Materials and Methods
The tests were focused on binary classification tasks. The area under ROC curve (AUC) was used as a quality measure of predictions, since it does not depend on the class balance in the data set. The methods, however, can be applied also for multi-class classification or regression tasks, with different quality metrics.
2.1.1 Artificial data
The methodology was developed on the synthetic data set. The data set was created as follows:
First, the two vectors of expected values and two covariance matrices were randomly generated. These parameter sets are common for all the observations.
The requested number of instances of the binary decision variable was randomly chosen, with the same probability for both classes.
For each class of the decision variable, the explanatory variables are drawn from a multivariate normal distribution with the corresponding set of parameters.
This procedure allows to create an arbitrary big sample with the same statistical properties and verify directly the predictions on the unseen data. The parameters were tuned to emulate the strength of linear, quadratic and pairwise interactions as well as the dependencies between variables that may appear in the real-world data sets.
The data sets we had generated contained 5000 explanatory variables and 50, 100, 150, 200 observations. Statistic tests indicate from 2 to 20 relevant variables, depending on the sample size.
2.1.2 Biomedical data sets
The tests were performed on seven data sets that contain measurements of gene expression and copy number variation for four cancer types. These data sets correspond to biological questions with different levels of difficulty. These are:
data sets obtained from the CAMDA 2017 Neuroblastoma Data Integration Challenge (http://camda.info):
CNV – 39 115 array comparative genomic hybridization (aCGH) copy number variation profiles, data set limited to a cohort of 145 patients,
MA – 43 349 GE profiles analysed with Agilent 44K microarrays, data set limited to a cohort of 145 patients,
RNA-seq – 60 778 RNA-seq GE profiles at gene level, data set limited to a cohort of 145 patients.
The data collection procedures and design of experiments were previously described in the original studies [22, 19, 18, 4, 14]. Data sets are also accessible in Gene Expression Omnibus . The relevant question for these data sets is predicting the final clinical status of the patient using molecular data. This is difficult problem.
data sets with The Cancer Genome Atlas database generated by the TCGA Research Network (https://www.cancer.gov/tcga) that contain RNA-sequencing data for various types of cancer [12, 5, 13, 3, 15, 6]:
BRCA RNA – data set contains 1205 samples and 20223 probes (112 normal and 1093 tumor),
BRCA CNV – data set contains 1314 samples and 21106 probes (669 normal and 645 tumor),
HNSC – data set contains 564 samples and 20235 probes (44 normal and 520 tumor),
LUAD – data set contains 574 samples and 20172 probes (59 normal and 515 tumor).
In this case the relevant question is discerning normal tissue from tumor using data set at hand. It is much easier question since both genetic profile and gene expression patterns are highly modified in cancer cells in comparison with normal tissue.
For each data set we applied the full protocol of super learning and nested cross validation. We used the default 10-fold setup for both external and internal cross validation and 100 repeats of resampling procedure. For the artificial data sets, we performed the entire routine for 100 different sets drawn from the same distribution. In the case of real data sets, the protocol was repeated 100 times on the same data for different cross validation splits.
2.2.1 Base machine learning algorithms
Six popular machine learning algorithms were used as base learners:
Random Forest ,
Support Vector Machine (SVM) ,
Naive Bayesian classifier,
kNN classifier for .
All the algorithms are implemented from the standard R packages, with the default parameters. The parameters used are obviously not optimal, but the performance optimisation is not the subject of the current study.
For each algorithm, the input data were prepared by the feature selection algorithm. To this end, we applied Welch
-test for each explanatory variable and chose 100 variables with the biggest value of the test statistic This procedure is very sensitive to overfitting, so any bias in the verification methods should be clearly visible.
2.2.2 Methods of super learning
Six methods of combining various prediction results via super learner approach were tested:
two default methods implemented in SuperLearner R package:
NNLS: non-negative least squares,
non-negative logistic regression,
two “toy example” methods, that are, however, commonly used:
Mean: average of all the base results (the method does not introduce any overfitting),
Best 1: choice of the best-fitted model (this special case is mentioned in the original van der Laan paper and corresponds directly to the original purpose of the bootstrap method by Tsamardinos),
Best : average of best-performing models, where is optimised on the training set – usually is set as 3-4,
RF: Random Forest algorithm.
2.2.3 Methods of verification
The estimate of quality was performed using the following methods (from the most biased up to unbiased one):
Training set: measure the performance of the combined classifier on the same data that were used to build the combined model (the results are obviously overestimated);
Independent CV: compute the results of base learners only once, then verify the combining algorithm in a second, independent cross validation (due to the common information in the training and validation sets the results are also overestimated);
BBC SL: apply the bootstrap bias correction method (the method proposed in the current study),
Nested CV: apply the nested cross validation (as a gold-standard);
New data: directly measure the performance for new data, drawn from the same distribution (the oracle for artificial data sets).
3.1 Artificial data sets
3.1.1 Base learners – cross validation vs. direct verification
The way of building the artificial data sets allows for comparison between the cross validation results and the actual results obtained for unseen data. This comparison is shown in Table 1.
|50 obs., MSE=0.01||100 obs., MSE=0.01|
|150 obs., MSE=0.005||200 obs., MSE=0.005|
As it could be expected, the performance of all the machine learning algorithms improves with the sample size. In particular AdaBoost could not cope with 50 observations only.
Interestingly, a considerable negative bias of the cross validated estimations of AUC was observed in comparison with the model trained on the entire sample. To correct this difference, additional models were built using the training sets reduced to 90% of the original sample. Quality of these models, agree much better with the cross-validated estimates. That means that most of the bias of the cross validation procedure comes from the smaller size of training sets. The small remaining bias is most likely due to negative correlations of fluctuations in training and validation sets in cross-validation.
3.1.2 Super learning
The performance of diverse methods of super learning is shown in Table 2
. For this particular data set, the super learning technique needs at least 100 observations to outperform the best single result and at least 200 observations to perform better then a simple average of all the base results. Random forest proves to be a poor super learning method. The non-negative linear models perform the best (surprisingly, the ordinary least squares method was better at classification than the logistic regression, that is specialised for this task). However, the simple best-method performs almost as well.
The differences between various performance estimates are small, but some regular patterns are noticeable. As previously, the performance of the models learned on the entire data set and applied to new data is better than measured in the nested cross validation. Moreover, the performance measured for models built using the reduced training set is slightly better as well. The biased performance estimate on the training set is significantly bigger than other estimates. For 100 and more observations, BBC SL method leads to the results very close to the nested cross validation. The results of the independent CV are more unstable, often overestimated.
These results are, however, ambiguous: the bias of cross validation methods due to the smaller size of training sets seems to be stronger than any bias due to overfitting. Nevertheless, the proposed BBC SL algorithm proved better than any other simple verification algorithm and close to the “gold standard” nested cross validation.
|SL||Training||Indep.||BBC||Nested||New data||New data|
|50 observations, MSE=0.01|
|100 observations, MSE=0.01|
|200 observations, MSE=0.005 for Best 1 and Best , 0.004 for others|
3.2 Biomedical data
. The direct measurement of performance for unseen data is impossible in this case, hence a nested cross-validation is a reference. The estimated standard deviation of the distribution of results is shown for each data set. The error of the mean value is smaller, but it is hard to estimate, since the measurements are not mutually independent. As for artificial data, the performance of Random Forest for combining the base results was very poor, and the non-negative logistic regression performed very close to NNLS method. Thus, both were not shown in the tables.
For all the tested data sets ensemble methods outperform the best single classifier. However, in all the cases the best-performing method was a simple average over all the base results. The linear model and -best proved nearly as good in some cases. This result is obviously not a rule, our artificial data sets give an example for better performance of linear combinations of classifiers.
Another interesting point is the stability of the results. The most repeatable values of AUC are produced by mean of all the classifiers, while choice of the best single one and mean of best ones are the most unstable. The effect is not visible on the training set, but clearly appears when nested CV or BBC SL verification methods are used (see especially Table 4).
The performance measured using the proposed BBC SL method is almost exactly the same as obtained by nested cross validation. Two simpler methods of the performance estimation report inflated results.
Figure 1) shows the performance of three combining methods for the BRCA RNA data set in more detailed way. The “Best single” method is simply a common practice of choosing the best-performing classifier to this particular purpose. In a simple cross validation, the result seems to be better than any ensemble model. However, when the external validation is applied, the performance drops significantly and becomes unstable. It turns out, that as simple operation as choice between six classification algorithms is a considerable source of overfitting. Both nested cross validation and the proposed BBC SL method show this clearly. For this particular data set, the optimal strategy is the average of all the base predictions.
|CNV, typical MSE=0.02|
|MA, typical MSE=0.01|
|RNA, typical MSE=0.01|
Super learner as proposed in the  is computationally demanding approach that relies on multiple cross-validation and application of multiple learning algorithms. In the original formulation of the algorithm verification of the quality of the final model involves repeating entire procedure within nested cross-validation, what significantly increases the computational cost of the modelling procedure. The current study shows that the nested cross-validation can be replaced by using resampling protocol, which gives equivalent results.
In almost all examined cases, the ensemble of learners gives better results than selection of a single best method for prediction. Interestingly, the simplest method of combining results of all algorithms, namely the the simple average of predictions of all base learners seems to be a very good choice for obtaining a stable non-overfitted estimate. Two simplest methods for assigning weights to base learners, namely a simple linear combination or selection of unweighted average of k-best models have better performance that other methods and should be explored along simple mean of all base methods.
One should note, that Super learner was applied here to merge results of algorithms that are very good predictors themselves. The differences between predictions of these classifiers are concentrated on a handful of difficult cases. The simple average works best, because there are too few independent data point to build reliable model for more advanced methods.
-  (2001) Random forests. Machine Learning 45 (1), pp. 5–32. Cited by: 1st item.
-  (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3), pp. 1–27. Cited by: 3rd item.
-  (2015) Comprehensive molecular portraits of invasive lobular breast cancer. Cell 163 (2), pp. P506–519. Cited by: 2nd item.
-  (2012) Age-dependent accumulation of genomic aberrations and deregulation of cell cycle and telomerase genes in metastatic neuroblastoma. International Journal of Cancer 131 (7), pp. 1591–1600. External Links: Cited by: 1st item.
-  (2014) Comprehensive molecular profiling of lung adenocarcinoma.. Nature 511 (), pp. 543–550. External Links: Cited by: 2nd item.
-  (2013) Comprehensive molecular characterization of clear cell renal cell carcinoma.. Nature 499 (), pp. 43–49. External Links: Cited by: 2nd item.
-  (2002-01) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 30 (1), pp. 207–210. External Links: Cited by: 1st item.
-  (1994) An introduction to the bootstrap. CRC Press. Cited by: §1.3.
-  (2014) Do we need hundreds of classifiers to solve real world classification problems?. The Journal of Machine Learning Research 15 (1), pp. 3133–3181. Cited by: §1.
A desicion-theoretic generalization of on-line learning and an application to boosting.
European Conference on Computational Learning Theory, pp. 23–37. Cited by: 4th item.
-  (2010) Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 (1), pp. 1–22. Cited by: 2nd item.
-  (2012) Comprehensive genomic characterization of squamous cell lung cancers.. Nature 489 (), pp. 519–525. External Links: Cited by: 2nd item.
-  (2014) Comprehensive molecular portraits of human breast tumours.. Nature 490 (), pp. 61–70. External Links: Cited by: 2nd item.
-  (2013-04) Hox-c9 activates the intrinsic pathway of apoptosis and is associated with spontaneous regression in neuroblastoma. Cell Death & Disease 4, pp. e586. External Links: Cited by: 1st item.
-  (2015) Comprehensive genomic characterization of head and neck squamous cell carcinomas.. Nature 517 (), pp. 576–582. External Links: Cited by: 2nd item.
-  (2012) Wisdom of crowds for robust gene network inference. Nature Methods 9 (8), pp. 796. Cited by: §1.
-  (2016) Crowdsourcing biomedical research: leveraging communities as innovation engines. Nature Reviews Genetics 17 (8), pp. 470. Cited by: §1.
-  (2012) High genomic instability predicts survival in metastatic high-risk neuroblastoma. Neoplasia 14 (9), pp. 823 – IN10. External Links: Cited by: 1st item.
-  (2014) Chromosome 17/17q gain and unaltered profiles in high resolution array-cgh are prognostically informative in neuroblastoma. Genes, Chromosomes and Cancer 53 (8), pp. 639–649. External Links: Cited by: 1st item.
-  (2018) Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Machine Learning 107 (12), pp. 1895–1922. Cited by: §1.3.
-  (2007) Super learner. Statistical Applications in Genetics and Molecular Biology 6 (1). Cited by: §1.1, §1.2, §1, §4.
-  (2015-06) Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biology 16, pp. 133. External Links: Cited by: 1st item.