1 Introduction
In supervised learning, ensemble approaches are popular techniques for the prediction of tabular data. The most prominent ensemble methods include averaging ensembles like Random Forest
[5] or Random Subspace [11], as well as gradient boosting techniques
[9]. While Random Subspace is a modelagnostic approach, i.e. an approach that be can combined with any type of base model, Random Forest is designed specifically for the aggregation of randomized decision trees. One advantage of decision trees is their interpretability, as feature importance scores can be easily derived from a trained treebased model. On the other hand, while modelagnostic approaches are more general and flexible, they cannot be used, at least in a straightforward way, to derive feature importances, as soon as they are combined with models other than trees. Note that while gradient boosting is a modelagnostic approach, it is designed to aggregate weak models, and is hence typically used with shallow decision trees.
In this paper, we propose a modelagnostic ensemble approach for supervised learning. The proposed approach is based on a parametric version of Random Subspace, in which each base model is learned from a feature subset sampled according to a Bernoulli distribution. We formulate the training procedure as an optimization problem where the goal is to identify the parameters of the Bernoulli distribution that minimize the generalization error of the ensemble model, and we show that this optimization problem can be solved using gradient descent even when the base models are not differentiable. The optimization of the Bernoulli distribution is however intractable, as the computation of the exact output of the full ensemble model would require the training of one model for each possible feature subset. To render the parameter optimization tractable, we use Monte Carlo sampling to approximate the ensemble model output. We further use an importance sampling approach that circumvents frequent retraining of the base models after each update of the gradient descent.
While the degree of randomization is controlled by a hyperparameter in standard Random Subspace, it has the advantage to be automatically tuned in our parametric version. Furthermore, modelagnostic feature importance scores can be easily derived from the trained ensemble model. We show the good performance of the proposed approach, both in terms of prediction and feature ranking, on simulated and realworld datasets. We also show that our approach can be successfully used for the reconstruction of gene regulatory networks.
2 Methods
We assume a supervised setting, where we have at our disposal a learning set containing inputoutput pairs
drawn from an unknown probability distribution. Let us denote by
the number of input variables. The output can be either continuous (regression problem) or discrete (classification problem). Our goal is to train a modelagnostic predictive model, while deriving for each input variable a score that measures its importance for the output prediction.To achieve this goal, we propose a parametric version of a supervised learning method that combines Random Subspace [11] with Bagging [4], which we call RSB. The original, nonparametric RSB method consists in learning an ensemble of predictive models, where each model is built from a bootstrap sample of the original dataset and a randomly chosen subset of input variables (with
), sampled according to a uniform distribution. This method has been shown to yield competitive predictive performance with respect to Random Forest, with the advantage of being applicable with any type of base model
[17]. Here, instead of using a uniform distribution, we adopt a parametric distribution for the selection of the input features, and feature importance scores are derived through the identification of the distribution parameters that yield the lowest generalization error. In the following, after introducing the parametric RSB model (Section 2.1), we show how this model can be trained in a tractable way (Section 2.2) and we discuss our approach with respect to related works (Section 2.3).2.1 The Parametric RSB Approach
Let us denote by
a binary vector of length
encoding a subset of selected input variables: if the th variable is selected and otherwise, . In the approach that we propose, each indicator variable is assumed to follow a Bernoulli distribution with parameter. The probability mass function for
is then given by:(1) 
where is the probability of selecting the th variable and . Let be the set of all the possible feature subsets, where is the cardinality of .
We assume an ensemble method that consists in averaging base models trained independently of each other using subsets of features drawn from . Let us denote by some functional space corresponding to a given learning algorithm and by the subset of functions from that only depend on the variables indicated by . Let be the base model learned by this learning algorithm from the feature subset (). Asymptotically, the prediction of the ensemble model for a given input is given by:
(2) 
For a fixed , a practical approximation of can be obtained by MonteCarlo sampling, i.e. by drawing feature subsets from and then training a model from each of these subsets, using the chosen learning algorithm.
If all the ’s are equal, the resulting ensemble method is very close to the Random Subspace approach [11], the only difference being that the number of selected features will be slightly randomized from one model to the next. In this work, we would like however to identify the parameters that yield the most accurate expected predictions
over our training set. Given a loss function
, the corresponding optimization problem can be formulated as follows:(3)  
Our approach takes inspiration from the variational optimization method, which seeks to minimize an objective function by minimizing the expectation of that function with respect to a parametric distribution [18]. A nice advantage is that the selection probabilities after optimization can be interpreted as measures of variable importances: useless variables are expected to get low selection probabilities, while the most important ones are expected to get selection probabilities close to 1.
2.2 Training the Parametric RSB Model
We propose to solve the optimization problem in Eq (3) using gradient descent. We first show how to compute the gradient of the objective function assuming that all the models are given. We then explain how to estimate this gradient by using Monte Carlo sampling and show how to incrementally update this gradient estimate using importance sampling. Precise pseudocodes of the algorithms are given in Appendix A and our Python implementation is available at https://github.com/vahuynh/PRSB.
2.2.1 Computing the Gradient
Let us assume that we have complete access to the models . Let us denote and . We then have, :
(4) 
where we have defined, for the simplicity of notations:
(5)  
(6)  
(7)  
(8) 
(resp. ) is thus the expected output of a model that does not take (resp. takes) as input the th variable. Assuming that the loss function is differentiable w.r.t. the expected output , the derivative of the objective function w.r.t. is given by:
(9)  
(10) 
The above derivative can be easily interpreted in the context of a gradient descent approach. For example, when is positive, the loss decreases with a lower model prediction . This means that if , the model without variable will give a lower loss than the model with variable . In that case, the derivative is positive and a gradient descent step (i.e. , where is the learning rate) will decrease the value of .
2.2.2 Estimating the Gradient
Given the current selection probabilities , the exact computation of , and , as required to compute the gradient, is obviously intractable as it implies training
models. An unbiased estimation of
can however be obtained by Monte Carlo sampling, i.e. by averaging the output over an ensemble of models (with ), where each model is trained using a subset of features sampled from . From this ensemble of models, one can then build similarly Monte Carlo approximations of and :(11)  
(12) 
where (resp. ) is the number of models where (resp. ), with .
It remains to be explained on which data the models are trained. Using the same training set as the one used to optimize the parameters would lead to biased predictions and hence to overfitting. To avoid having to split the available data in two independent sets, we use in practice a tenfold crossvalidation procedure to compute in an unbiased way the predictions for each object of the learning set. We furthermore build each base model from a bootstrap sample following the original RSB method. Note that in the case where is the empty set, which can happen when all the parameters are very low, we set to a constant model that always returns the mean value of the output in the bootstrap sample (for regression problems) or the majority class (for classification problems).
2.2.3 Updating the Gradient
The above procedure allows to estimate the gradient and to perform one gradient descent step. However, after this step, the distribution parameters are updated to and a new gradient estimation requires to compute and . To be able to compute the approximations in Eqs (11) and (12), new models must thus in principle be learned by sampling each from the new distribution . This would result in a very computationally expensive algorithm where new models are learned after each parameter update.
In order to estimate the effect of a change in the feature selection probabilities
without learning new models, we use the importance sampling approximations of the expectations. Given a new vector of feature selection probabilities , any expectation under can be approximated through . We have, for any input :(13)  
(14)  
(15) 
where the feature subsets in Eq (15) have been sampled from . Similarly, we have:
(16)  
(17) 
Using these approximations, the expectations , and can be estimated for any with the same ensemble of models , i.e. the ones learned at the previous learning stage, when the were sampled from . Standard gradient descent techniques can then be used to identify the parameter set that minimizes . Since, , must be between 0 and 1, we use the projected gradient descent technique, where is projected into the space after each step of the gradient descent.
Importance sampling approximation can be extended to the case where the feature subsets are sampled from different distributions with parameters , respectively. Let us denote by the number of subsets sampled from , with . We adopt the following multiple importance sampling approximation [19]:
(18) 
Using the multiple importance sampling approximation allows us, at each learning stage, to replace only a small number of models with new ones, rather than learning models each time, thereby saving some computing time.
Note that the (multiple) importance sampling approximation consists of a weighted average of the functions , using weights . When becomes very different from all the vectors , some models might be hardly used for the importance sampling approximation because they have a very low weight . The effective number of used models (or effective sample size) can be computed as [8]:
(19) 
With imbalanced weights, the importance sampling approximation is equivalent to averaging over models. When is too low, the gradient estimation thus becomes unreliable. In such a case, we stop the gradient descent and we move to the next learning stage. In practice, we stop as soon as drops below , with a maximum of 100 gradient descent steps.
2.3 Discussion
The proposed algorithm has the advantage of being modelagnostic in that any supervised learning method can be used to fit the models. Despite the use of gradient descent, no hypothesis of differentiability is required for the model family. The framework can also be easily adapted to any differentiable loss.
Computational complexity and convergence
The complexity of the gradient descent procedure for updating the parameters is linear with respect to the number of samples, the number of features and the number of base models in the ensemble. The costliest step is the construction of the base models. We have to learn initial models, followed by new models at each iteration, where is the number of reference distributions. If is the number of iterations, it means that models have to be grown in total. The complexity of the construction of these models depends on the type of model, but note that each model is grown only from a potentially small subset of features. To fix ideas, the total number of models learned for the simulated datasets (see Section 3.1) is provided in Table S3.
Note that deriving a general theoretical analysis of the convergence of our approach is a difficult task, as the convergence depends on the type of base model. However, empirically we have observed the decrease and convergence of the objective function on all tested datasets.
Regularization
While we have not used any regularization term in (3), incorporating one is straightforward (and this will be exploited for the inference of gene regulatory networks). A natural regularization term to enforce sparsity could be simply the sum , which can be nicely interpreted as , i.e., the average size of the subsets drawn from . Adding this term to (3) with a regularization coefficient would simply consists in adding to the gradient in (10). We did not include such regularization in our experiments below to reduce the number of hyperparameters. Despite the lack of regularization, the algorithm has a natural propensity for selecting few features due to the use of tenfold crossvalidation to estimate the predictions: incorporating a useless feature will indeed often deteriorates the quality of the predictions and lead to a decrease of the corresponding . Note however that the sparsity of the resulting selection weights will depend on the robustness of the learning algorithm to the presence of irrelevant features. This will be illustrated in our experiments.
Related works
Our method has direct connections with Random Subspace ensemble methods [11, 17]. In addition to providing a feature ranking, it has the obvious added flexibility w.r.t. Random Subspace that the feature sampling distribution (and thus also the subspace size) is automatically adapted to improve performance. Our approach also falls into the general family of wrapper feature selection methods. Among this family, the closest works are the approaches based on Estimation of Distribution Algorithms (EDA, [2]
). EDA are generic optimization techniques that belong to the family of evolutionary algorithms. When applied to feature selection, they iteratively generate and evaluate a population of feature subsets
. In particular, EDA identify a probability distribution from the best solutions found at the previous iteration and use this probability distribution to generate new solutions. EDA and our parametric RSB approach are thus similar in the sense that they both iteratively sample feature subsets from an explicit probability distribution and update the latter. However, the goal of EDA is to minimize the expected value of the loss function:(20) 
while we are trying to minimize the loss of the ensemble model (see Eq.(3)). Both approaches also greatly differ in the optimization technique: EDA iteratively update from the best solutions in the current population, while our approach is based on gradient descent and importance sampling. Finally and most importantly, contrary to our approach, EDA focuses exclusively on the identification of important features and does not train any predictive model.
Our optimization procedure has also some links with variational optimization (VO, [18]). VO is a general technique for minimizing a function that is nondifferentiable or combinatorial. It is based on the bound:
(21) 
Instead of minimizing with respect to , one can thus minimize the upper bound with respect to . Provided the distribution is rich enough, this is equivalent to minimizing . In the same spirit as VO, the introduction of the distribution and the use of an ensemble model in our problem formulation in (3) can be interpreted as a way to define a more tractable proxy to the original combinatorial feature selection problem, i.e.:
(22) 
Like in VO, the formulation in (3) allows us to use gradient descent optimization despite the fact that the models are not necessarily differentiable. Note however that our objective function is not an upper bound of (22), as an ensemble is more flexible than its individual constituents.
3 Results
In all our experiments, we use ensembles of models and we initialize each to 0.05, so that each feature is expected to be selected five times over the ensemble.We noticed that using lower initial values prevents several features to be selected in the first iterations of the algorithm, hence resulting in convergence issues, while higher values result in larger computing times, as each base model must be trained using a larger number of features. Unless otherwise stated, we use reference distributions for the importance sampling approximation, thereby learning 10 new models at each learning step. We do 20 restarts of the training algorithm and keep the optimized
that yields the lowest objective function value. For regression problems we use the mean square error as loss function, while for classification problems we use the crossentropy. All the inputs are normalised to have zero mean and unit variance prior to training. As base model
, we use either a CART decision tree [6], anearest neighbors (kNN) model
[1] withor a support vector machine (SVM)
[3]with a radial basis function kernel and the (inverse) regularization parameter
set to 1.We report the predictive performance with the mean square error for regression problems and the misclassification rate for classification problems. A ranking of features can be obtained by sorting them by decreasing value of optimized importances . If the relevant variables are known, the feature ranking can be evaluated using the area under the precisionrecall curve (AUPR). A perfect ranking (i.e. all the relevant features have a higher importance than the irrelevant ones) yields an AUPR of 1, while a random ranking has an AUPR close to the proportion of relevant features.
We compare our approach to three baselines: the standard RSB method, Random Forest (RF) and Gradient Boosting with Decision Trees (GBDT). When evaluating feature rankings, we also compare ourselves to Estimation of Distribution Algorithms (EDA). Implementation details for these methods are provided in Appendix B.
3.1 Simulated Problems
Problem  Type  

Checkerboard  Regression  304  4 
Friedman  Regression  305  5 
Hypercube  Classification  305  5 
Linear  Classification  310  10 
Hypercube  

Model  TS error  /  
tree  Single  0.24 0.09  
RSB  0.18 0.12  259.10 70.11  
PRSB  0.16 0.11  12.16 2.18  
kNN  Single  0.44 0.05  
RSB  0.41 0.07  64.70 39.88  
PRSB  0.11 0.02  6.85 1.85  
SVM  Single  0.34 0.10  
RSB  0.34 0.09  187.90 78.95  
PRSB  0.14 0.05  12.76 2.46 
. Values are mean and standard deviation over 10 datasets.
We simulate four problems, shown in Table 1. For each problem, we generate 10 datasets, each with 300 training samples and 500 test samples (see Appendix C for the detailed simulation protocol).
Hypercube  

Method  TS error  AUPR 
RF  0.18 0.11  0.81 0.25 
GBDT  0.17 0.13  0.78 0.27 
PRSBtree  0.16 0.11  0.79 0.23 
PRSBkNN  0.11 0.02  0.91 0.09 
PRSBSVM  0.14 0.05  0.86 0.14 
Table 2 shows for the Hypercube problem the misclassification rates obtained on the test set with three model types: the single base model built using all the features, the standard RSB ensemble model, where the number of randomly sampled features is tuned by 10fold crossvalidation, and our parametric RSB method (PRSB). PRSB yields the lowest error for any type of base model. The improvement of performance over standard RSB is larger in the case of kNN and SVM, compared to decision trees. This can be explained by the fact a decision tree, contrary to kNN and SVM, has an inner feature selection mechanism and is hence able to maintain a good performance even in the presence of irrelevant features. Therefore, for a given irrelevant feature , the difference between and (Eq (16) and Eq (17) respectively) will be lower in the case of trees, which can prevent the corresponding to decrease towards zero during the gradient descent. Similar results can be observed on the three other simulated problems (Figure S1).
Note also that the standard RSB model greatly improves over the single model only in the case of decision trees. The decision tree being a model that is prone to high variance, its prediction performance is indeed usually improved by using ensemble methods [4, 5, 10]. On the other hand, since kNN and SVM have a sufficiently low variance, their performance is not improved with a standard ensemble.
While the degree of randomization is controlled by the parameter (i.e. the number of randomly sampled features for each base model) in standard RSB, it has the advantage to be automatically tuned in parametric RSB. Tables 2 and S1 indicate the sum of optimized parameters for each PRSB model, which is equivalent to , i.e. the average number of selected variables per base model. By comparing this average number to the parameter value of RSB, we can see that PRSB effectively selects a much lower number of features, while no explicit constraint on sparsity is used during model training. The average number of selected variables remains however slightly higher than the actual number of relevant features, indicating that a certain degree of randomization is introduced during model construction.
Except for the Friedman problem, PRSB outperforms or is equivalent to RF and GBDT both in terms of predictive performance and feature ranking (Tables 3 and S2). Overall, our method also returns better feature rankings than EDA (Figure S2).
One parameter of the PRSB training algorithm is the number of reference distributions used to compute the importance sampling approximations. Table S3 shows that, except for the Checkerboard problem, setting improves both the feature ranking and the prediction error, compared to . In particular, PRSB with Q = 1 yields a higher AUPR than GBDT on the Friedman problem, when combined with kNN or SVM. This gain in performance comes however with a high computational cost, since with , new models must be learned at each iteration, against for .
3.2 RealWorld Problems
We use here 15 realworld datasets (7 regression datasets and 8 binary classification datasets) retrieved from the Penn Machine Learning Benchmarks (PMLB) repository
[16]. We chose to restrict to datasets containing only continuous features and we selected all the real datasets (i.e. not simulated) having more than 300 samples. Original dataset sizes are indicated in Table S4. For each dataset, we add 500 irrelevant features, generated by randomly permuting the values of the original variables.Base model = tree  Base model = kNN  Base model = SVM  

Single  RSB  PRSB  Single  RSB  PRSB  Single  RSB  PRSB  
cpu_act  16.37  7.22  36.48  289.10  262.05  73.61  263.56  263.22  208.78 
ERA  5.19  2.59  2.87  4.38  3.79  3.06  3.69  3.69  3.16 
GeographicalOriginalofMusic  0.71  0.32  0.39  0.72  0.69  0.56  0.45  0.47  0.50 
pm10  1.22  0.62  0.69  0.89  0.77  0.76  0.76  0.76  0.73 
rmftsa_ladata  5.68  3.73  4.36  7.75  7.30  5.26  7.72  7.72  5.47 
satellite_image  1.39  0.60  0.68  2.21  1.96  0.46  1.31  1.32  0.94 
wind  24.29  11.44  13.60  28.31  25.17  15.29  16.12  16.17  14.80 
breast_cancer_wisconsin  0.10  0.05  0.06  0.18  0.14  0.11  0.07  0.08  0.06 
bupa  0.47  0.41  0.46  0.50  0.49  0.42  0.48  0.43  0.41 
clean2  0.00  0.00  0.00  0.07  0.07  0.04  0.03  0.04  0.00 
diabetes  0.33  0.25  0.24  0.37  0.35  0.30  0.33  0.33  0.34 
haberman  0.37  0.27  0.27  0.35  0.26  0.26  0.26  0.26  0.26 
phoneme  0.21  0.15  0.11  0.33  0.29  0.10  0.27  0.27  0.29 
spambase  0.12  0.06  0.09  0.35  0.31  0.17  0.14  0.14  0.23 
spectf  0.27  0.19  0.24  0.37  0.30  0.28  0.27  0.27  0.27 
Table 4 shows the average prediction errors, obtained with 10fold crossvalidation. As it was already observed on the simulated datasets, PRSB improves over RSB for almost all the realworld datasets when used in combination with kNN and SVM, while no improvement is observed in the case of trees. Again, this can be explained by the fact that a decision tree is robust to the presence of irrelevant features, while kNN and SVM are not. PRSB models are also much sparser than RSB ensembles, when comparing the (expected) number of selected features per base model (Table S5). PRSB remains however outperformed by RF and GBDT in terms of predictive performance (Table S6). The difference is however not statistically significant when using a tree as base model, according to the Friedman and Nemenyi tests (Figure 1, value of the Friedman test: 9.6e4, [7]).
Since the truly relevant variables of each problem are unknown, the quality of a feature ranking is evaluated by considering all the original (i.e. nonpermuted) features as relevant. In terms of AUPR, PRSB outperforms EDA but remains globally outperformed by RF and GBDT (Table S7). PRSB combined with kNN is however not statistically different from RF and GBDT (Figure 2, value of the Friedman test: 0.0040). Note also that while PRSB combined with trees are globally better than kNN and SVM in terms of predictive performance, it is not the case in terms of feature ranking.
3.3 Application to gene network inference
An open problem in computational biology is the reconstruction of gene regulatory networks, which attempt to explain the joint variability in the expression levels of a group of genes through a sparse pattern of interactions. One approach to gene network reconstruction is the application of a feature selection approach that identifies the regulators of each target gene. Such approach is used by GENIE3, one of the current stateoftheart network inference algorithms [12]. This method learns for each target gene a RF model predicting its expression from the expressions of all the candidate regulators, and identifies the regulators of that target gene through the RFbased feature importance scores. The PRSB and EDA approaches can be used in the same way for gene network inference, with however the advantage that the base models are not restricted to decision trees. Furthermore, while in GENIE3 the different models, corresponding to the different target genes, are learned independently of each other, the PRSB method has the additional advantage that it can be extended to introduce a global constraint on the topology of the network.
More formally, let be the number of genes, among which there are candidate regulators, and let and be respectively the expression levels of the candidate regulators and of the target genes in the th sample (). Our goal is to identify a matrix , where is the weight of the regulatory link directed from the th candidate regulator to the th gene. In the context of our PRSB approach, we seek to identify the matrix that minimizes the following objective function:
(23) 
where is the expression of the th gene in the th sample and denotes the th column of the matrix . The second term in the above objective function is a joint regularizer (with a coefficient ) that enforces structured sparsity, by enforcing the selection of as few rows as possible in [13]. Using this joint regularizer will result in modular networks where only a few regulators control the expressions of the different genes, a property often encountered in real gene regulatory networks.
We evaluate the ability of PRSB to reconstruct the five 100gene networks of the DREAM4 Multifactorial Network challenge [15, 14], for which GENIE3 was deemed the best performer. The DREAM4 networks are artificial networks for which the true regulatory links are known and an AUPR can thus be computed given a ranking of links. The first two networks were generated by extracting subnetworks from the Escherichia coli network, while the three remaining networks were extracted from the Saccharomyces cerevisiae network. To reconstruct each network, a simulated gene expression dataset with 100 samples was made available to the challenge participants.
Figure S3 shows, for different values of the regularization coefficient , the sum of of PRSB, i.e. the average number of selected candidate regulators per base model. As the coefficient determines the number of used candidate regulators, a suitable value of can be identified if one has a rough idea of the number of regulators per target gene in the studied network. For the complete E. coli and S. cerevisiae networks (composed of regulatory interactions that were experimentally confirmed), the observed average number of regulators is 0.46 and 0.66, respectively (many genes do not have known regulators) [14]. For each network, we thus select the largest value of (amongst {20, 50, 100}) such that the average sum is higher than . The resulting AUPRs are given in Table 5. The highest AUPRs are obtained with the regularized PRSB combined with SVM and kNN. Note that without any regularization (), PRSB still outperforms EDA, while being slightly inferior to the original RFbased GENIE3.
Net1  Net2  Net3  Net4  Net5  

GENIE3  RF  0.18  0.14  0.26  0.24  0.23 
EDA  tree  0.06  0.07  0.10  0.11  0.10 
kNN  0.07  0.10  0.09  0.11  0.09  
SVM  0.09  0.08  0.12  0.10  0.10  
PRSB  tree  0.15  0.11  0.25  0.20  0.20 
kNN  0.17  0.15  0.24  0.21  0.24  
SVM  0.18  0.14  0.24  0.22  0.23  
PRSB  tree  0.15  0.13  0.26  0.25  0.22 
kNN  0.19  0.19  0.28  0.28  0.28  
SVM  0.22  0.22  0.29  0.26  0.29 
4 Conclusions
We proposed a modelagnostic ensemble method based on the idea of averaging base models independently trained on feature subsets sampled from a Bernoulli distribution. We show that the parameters of the latter distribution can be trained using gradient descent even if the base models are not differentiable. The required iterative gradient computations can furthermore be performed efficiently by exploiting importance sampling. The resulting approach has interesting features: it is model agnostic, it can use any combination of differentiable loss function and regularization term, and it provides variable importance scores. Experiments show that PRSB always improves over standard RSB when combined with kNN or SVM. On the simulated datasets, PRSB generally outperforms its competitors (RF, GBDT and EDA) both in terms of predictive performance and feature ranking quality. On the real problems, PRSB with trees is not statistically different from RF and GBDT. We also showed that an appropriate regularization strategy allows PRSB to outperform the stateoftheart GENIE3 in the inference of gene regulatory networks.
While we adopted an ensemble strategy, the same optimization technique, combining gradient descent and importance sampling, can be used to solve the feature selection problem as defined in (20
) and addressed also by EDA. It would be interesting to investigate this approach and compare it with the ensemble version explored in this paper. Note however that it would require to exploit a stronger learning algorithm, because it would not benefit from the ensemble averaging effect. Applying this technique, and its associated derivation of feature importance scores, on top of modern deep learning models would be also highly desirable given the challenge to explain these models. This would require however to develop specific strategies to reduce the non negligible computational burden of the approach. Finally, exploiting more complex feature subset distributions, beyond independent Bernoulli distributions, would be also very interesting but adapting the optimization strategy might not be trivial.
References
 [1] N. S. Altman. An introduction to kernel and nearestneighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
 [2] R. Armañanzas, I. Inza, R. Santana, Y. Saeys, J. Flores, J. Lozano, Y. Van de Peer, R. Blanco, V. Robles, C. Bielza, and P. Larrañaga. A review of estimation of distribution algorithms in bioinformatics. BioData Mining, 1(1):6, 2008.

[3]
B. E. Boser, I. Guyon, and V. N. Vapnik.
A training algorithm for optimal margin classifiers.
InProceedings of the 5th Annual ACM Workshop on Computational Learning Theory
, pages 144–152. ACM Press, 1992.  [4] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
 [5] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
 [6] L. Breiman, J. H. Friedman, R. A. Olsen, and C. J. Stone. Classification and Regression Trees. Wadsworth International (California), 1984.
 [7] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(1):1–30, 2006.
 [8] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, New York, 2001.
 [9] J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
 [10] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 36(1):3–42, 2006.
 [11] T. K. Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
 [12] V. A. HuynhThu, A. Irrthum, L. Wehenkel, and P. Geurts. Inferring regulatory networks from expression data using treebased methods. PLoS ONE, 5(9):e12776, 2010.
 [13] R. Jenatton, J.Y. Audibert, and F. Bach. Structured variable selection with sparsityinducing norms. Journal of Machine Learning Research, 12:2777–2824, 2011.
 [14] D. Marbach, J. C. Costello, R. Küffner, N. Vega, R. J. Prill, D. M. Camacho, K. R. Allison, the DREAM5 Consortium, M. Kellis, J. J. Collins, and G. Stolovitzky. Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796–804, 2012.
 [15] D. Marbach, R. J. Prill, T. Schaffter, C. Mattiussi, D. Floreano, and G. Stolovitzky. Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107(14):6286–6291, 2010.
 [16] R. S. Olson, W. La Cava, P. Orzechowski, R. J. Urbanowicz, and J. H. Moore. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining, 10(1):36, 2017.
 [17] P. Panov and S. Džeroski. Combining bagging and random subspaces to create better ensembles. In M. R. Berthold, J. ShaweTaylor, and N. Lavrač, editors, Advances in Intelligent Data Analysis VII, pages 118–129, Berlin, Heidelberg, 2007. Springer.

[18]
J. Staines and D. Barber.
Optimization by variational bounding.
In
Proceedings of the 2013 European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2013)
, pages 473–478, 2013.  [19] E. Veach and L. J. Guibas. Optimally combining sampling techniques for monte carlo rendering. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques (SIGGRAPH 95), pages 419–428. AddisonWesley, 1995.