The standard procedure for model selection in prediction problems is cross-validation with data splitting. However, such an approach is known to be sub-optimal [Džeroski and Ženko, 2004, Dietterich, 2000, Sill et al., 2009]: one might achieve more accurate predictions by combining different regression estimators rather than by selecting the best one. Stacking methods [Zhou, 2012] are a way of overcoming this drawback of standard model selection.
A well-known stacking method was introduced by Breiman [Breiman, 1996]. This approach consists of taking a linear combination of base regression estimators. That is, the stacked regression has the shape $\hat{y}(\mathbf{x}) = \sum_{i=1}^{k} \theta_i g_i(\mathbf{x})$, where $\theta_1, \ldots, \theta_k$ are weights estimated from data and $\mathbf{x}$ represents the features.
Even though this linear stacking method leads to combined estimators that are easy to interpret, it may be sub-optimal when models have different local accuracy, i.e., when the performance of these estimators varies over the feature space. Example 1.1 illustrates this situation.
Consider predicting a label $y$ based on a single feature, $x$, using the data in Figure 1. We fit two least squares estimators: a linear fit and a quadratic fit. Neither model is uniformly better; the linear fit has better performance in some regions of the feature space, while the quadratic fit performs better in others. One may take this into account when creating the stacked estimator by assigning different weights to each regression according to $x$: a larger weight can be assigned to the linear fit in the regions where it performs well, and a lower weight elsewhere.
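This local behavior can be illustrated numerically. The sketch below (in Python; the data-generating process and the two-region split are our own invented example, not the data of Figure 1) fits linear and quadratic least squares estimators and shows that combining them with region-dependent weights achieves a lower training error than either model alone:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 300)
# Hypothetical truth: linear on the left, curved on the right
y = np.where(x < 0, x, x + x**2) + rng.normal(0, 0.3, 300)

# Two base least squares fits, as in the example
g1 = np.polynomial.Polynomial.fit(x, y, 1)  # linear
g2 = np.polynomial.Polynomial.fit(x, y, 2)  # quadratic

def mse(pred):
    return float(np.mean((y - pred) ** 2))

# Region-dependent weights: solve least squares for y ~ w1*g1(x) + w2*g2(x)
# separately on x < 0 and x >= 0.
preds = np.column_stack([g1(x), g2(x)])
stacked = np.empty_like(y)
for region in (x < 0, x >= 0):
    w, *_ = np.linalg.lstsq(preds[region], y[region], rcond=None)
    stacked[region] = preds[region] @ w

print(mse(g1(x)), mse(g2(x)), mse(stacked))
```

Because each regional least squares solve can always reproduce either base model exactly, the locally weighted combination is guaranteed to do at least as well as the best single fit on the training data.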
It is well known that different regression methods may perform better in different regions of the feature space. For instance, because local estimators do not suffer from boundary effects, they achieve good performance close to the edges of the feature space [Fan and Gijbels, 1992]. Random forests, on the other hand, implicitly perform feature selection, and thus may have better performance in regions where some features are not relevant [Breiman, 2001].
In this work we improve Breiman’s approach so that it takes local accuracy into account. That is, we develop a meta-learner that learns which models have higher importance in each region of the feature space. We achieve this by allowing each parameter $\theta_i$ to vary as a function of the features $\mathbf{x}$. In this way, the meta-learner can adapt to each region of the feature space, which yields higher predictive power. Our approach keeps the local interpretability of the linear stacking model.
The remainder of this work is organized as follows. Section 2 introduces the notation used in the paper, as well as our method. Section 3 gives details on its implementation. Section 4 applies our method to a variety of datasets to evaluate its performance. Section 5 concludes the paper.
2 Notation and Motivation
The stacking method proposed by Breiman [Breiman, 1996] is a linear combination of regression functions for a label $Y$. More precisely, let $G(\mathbf{x}) = (g_1(\mathbf{x}), \ldots, g_k(\mathbf{x}))$ be a vector of regression estimators, that is, each $g_i(\mathbf{x})$ is an estimate of $\mathbb{E}[Y \mid \mathbf{x}]$. The linear stacked regression is defined as
$$\hat{y}(\mathbf{x}) = \sum_{i=1}^{k} \theta_i \, g_i(\mathbf{x}), \qquad (2.1)$$
where $\theta_1, \ldots, \theta_k$ are meta-parameters. One way to estimate the meta-parameters from data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ is through the least squares method, computed in a leave-one-out setup:
$$\hat{\theta} = \arg\min_{\theta} \sum_{j=1}^{n} \Big( y_j - \sum_{i=1}^{k} \theta_i \, g_i^{(-j)}(\mathbf{x}_j) \Big)^2, \qquad (2.2)$$
where $g_i^{(-j)}(\mathbf{x}_j)$ is the prediction for $y_j$ made by the $i$-th regression fitted without the $j$-th instance. This hold-out approach is important: if the base regression functions are constructed using the same data used to estimate $\theta$, the stacked estimator can over-fit the training data.
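A minimal sketch of this estimation scheme, using scikit-learn with 10-fold cross-validation in place of exact leave-one-out (the synthetic data and the particular base models are our own arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)

bases = [LinearRegression(),
         RandomForestRegressor(n_estimators=50, random_state=0)]

# Column i holds out-of-fold predictions of base estimator i: each prediction
# for instance j comes from a model fitted without j's fold, mimicking the
# leave-one-out predictions g_i^{(-j)}(x_j).
G = np.column_stack([cross_val_predict(m, X, y, cv=10) for m in bases])

# Breiman-style meta-parameters by least squares on the held-out predictions
theta, *_ = np.linalg.lstsq(G, y, rcond=None)
print(theta)
```

Breiman's additional constraint (nonnegative weights summing to one) would require a constrained solver instead of the plain `lstsq` call; the unconstrained version above illustrates only the hold-out structure.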
To make the stacked estimator easier to interpret, Breiman [Breiman, 1996] also requires the $\theta_i$'s to be weights, that is, $\theta_i \geq 0$ for every $i$ and $\sum_{i=1}^{k} \theta_i = 1$.
Even though Breiman’s solution works in a variety of settings, it does not take into account that each regression method may perform better in distinct regions of the feature space. To overcome this limitation, we propose Neural Network Stacking (NNS), which generalizes Breiman’s approach by allowing the $\theta_i$'s in Equation 2.1 to vary with $\mathbf{x}$. That is, our meta-learner has the shape
$$\hat{y}(\mathbf{x}) = \sum_{i=1}^{k} \theta_i(\mathbf{x}) \, g_i(\mathbf{x}), \qquad (2.3)$$
where each $\theta_i: \mathbb{R}^d \to \mathbb{R}$. In other words, the NNS is a local linear meta-learner. Example 2.1 shows that NNS can substantially decrease the prediction error of Breiman’s approach.
We fit both Breiman’s linear meta-learner and our NNS local linear meta-learner to the models fitted in Example 1.1. Figure 2 shows that Breiman’s meta-learner is not able to fit the true regression satisfactorily because both estimators have poor performance on specific regions of the data. On the other hand, feature-varying weights yield a better fit.
Our goal is to find $\theta(\mathbf{x}) = (\theta_1(\mathbf{x}), \ldots, \theta_k(\mathbf{x}))$ that minimizes the mean squared risk
$$R(\theta) = \mathbb{E}\big[ (Y - \hat{y}(\mathbf{x}))^2 \big],$$
where $\hat{y}(\mathbf{x})$ is defined as in Equation 2.3.
We estimate $\theta(\mathbf{x})$ via an artificial neural network. This network takes $\mathbf{x}$ as input and produces the output $\theta(\mathbf{x})$, which is then used to obtain $\hat{y}(\mathbf{x})$. To estimate the weights of the network, we introduce a loss function that captures the goal of achieving a small risk $R(\theta)$:
$$L = \sum_{j=1}^{n} \Big( y_j - \sum_{i=1}^{k} \theta_i(\mathbf{x}_j) \, g_i(\mathbf{x}_j) \Big)^2.$$
Notice that the base regression estimators are used only when evaluating the loss function; they are not inputs of the network. With this approach, we allow each $\theta_i(\mathbf{x})$ to be a complex function of the data. We call this method Unconstrained Neural Network Stacking (UNNS). Figure 3 illustrates a UNNS that stacks 2 base estimators in a regression problem with four features.
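The following sketch shows the core of a UNNS training loop in PyTorch under simplifying assumptions (synthetic data, a fixed number of epochs, and a small one-hidden-layer network; none of these choices are prescribed by the method itself):

```python
import torch
from torch import nn

torch.manual_seed(0)
n, d, k = 256, 4, 2                 # instances, features, base estimators (toy sizes)
X = torch.randn(n, d)               # features
G = torch.randn(n, k)               # held-out base predictions g_i(x_j), precomputed
y = G.mean(dim=1) + 0.1 * torch.randn(n)

# UNNS meta-learner: the network maps x to theta(x); the base predictions G
# enter only through the loss, never as network inputs.
net = nn.Sequential(nn.Linear(d, 100), nn.ReLU(), nn.Linear(100, k))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(200):
    theta = net(X)                  # (n, k) feature-varying coefficients
    y_hat = (theta * G).sum(dim=1)  # sum_i theta_i(x_j) g_i(x_j)
    loss = ((y - y_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the squared-error loss is written directly on tensors, PyTorch backpropagates through both the network outputs and the stacking sum with no extra machinery.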
In addition to the linear stacking, this approach allows the user to easily take advantage of the neural network architecture by directly adding a network output node, $b(\mathbf{x})$, to the stacking. That is, we also consider a variation of UNNS which takes the shape
$$\hat{y}(\mathbf{x}) = \sum_{i=1}^{k} \theta_i(\mathbf{x}) \, g_i(\mathbf{x}) + b(\mathbf{x}).$$
This has some similarity to adding a single neural network estimator to the stacking; however, we use the same architecture to create the additional term, which reduces computation time. Algorithm 1 shows how this method is implemented. To avoid over-fitting, the $g_i$'s and the $\theta_i$'s are estimated using different folds of the training set.
To achieve an interpretable stacked solution, we follow Breiman’s suggestion and consider a second approach to estimate the $\theta_i$'s, which consists of minimizing the loss under the constraint that the $\theta_i(\mathbf{x})$'s are weights, that is, $\theta_i(\mathbf{x}) \geq 0$ and $\sum_{i=1}^{k} \theta_i(\mathbf{x}) = 1$ for every $\mathbf{x}$. Unfortunately, it is challenging to impose this restriction directly on the solution of the neural network. Instead, we use a different parametrization of the problem, which is motivated by Theorem 3.1.
Theorem 3.1. The solution of
$$\min_{\theta(\mathbf{x})} \mathbb{E}\Big[ \big( Y - \theta(\mathbf{x})^{\top} G(\mathbf{x}) \big)^2 \Big]$$
under the constraint that $\theta_i(\mathbf{x}) \geq 0$ and $\sum_{i=1}^{k} \theta_i(\mathbf{x}) = 1$ is given by
$$\theta^*(\mathbf{x}) = \frac{M_{\mathbf{x}}^{-1} \mathbf{1}}{\mathbf{1}^{\top} M_{\mathbf{x}}^{-1} \mathbf{1}}, \qquad (3.1)$$
where $\mathbf{1}$ is a $k$-dimensional vector of ones and
$$M_{\mathbf{x}} = \mathbb{E}\big[ (Y\mathbf{1} - G(\mathbf{x}))(Y\mathbf{1} - G(\mathbf{x}))^{\top} \mid \mathbf{x} \big].$$
Theorem 3.1 shows that, under the given constraints, $\theta^*(\mathbf{x})$ is uniquely defined by $M_{\mathbf{x}}$.
Now, because $M_{\mathbf{x}}$ is a covariance matrix, it is positive definite, and thus the Cholesky decomposition can be applied to it. It follows that $M_{\mathbf{x}} = B_{\mathbf{x}} B_{\mathbf{x}}^{\top}$, where $B_{\mathbf{x}}$ is a lower triangular matrix.
This suggests that we estimate $\theta^*(\mathbf{x})$ by first estimating $B_{\mathbf{x}}$ and then plugging the estimate back into Equation 3.1. That is, to obtain a good estimator under the restrictions above, the output of the network is set to be the entries of $B_{\mathbf{x}}$ rather than the weights themselves¹. We name this method Constrained Neural Network Stacking (CNNS). Figure 4 illustrates a CNNS that stacks 2 base regressors (that is, $B_{\mathbf{x}}$ is a $2 \times 2$ lower triangular matrix) in a regression problem with four features.

¹ Since the gradients of all matrix operations are implemented for PyTorch tensor classes, the additional operations of the CNNS method are automatically backpropagated once PyTorch's backward method is called on the loss evaluation.
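This parametrization can be sketched as follows (PyTorch; the function name and the small jitter added for numerical stability are our own choices). Note that with an arbitrary matrix $B$ the resulting weights are only guaranteed to sum to one; their nonnegativity relies on the structure of $M_{\mathbf{x}}$ discussed above:

```python
import torch

def b_to_weights(b_entries, k):
    """Map k*(k+1)/2 network outputs per instance to stacking weights via
    theta = M^{-1} 1 / (1^T M^{-1} 1), with M = B B^T and B lower triangular.
    The jitter added to M is our own numerical-stability choice."""
    n = b_entries.shape[0]
    B = torch.zeros(n, k, k)
    idx = torch.tril_indices(k, k)
    B[:, idx[0], idx[1]] = b_entries                 # fill the lower triangle
    M = B @ B.transpose(1, 2) + 1e-6 * torch.eye(k)  # positive definite by construction
    Minv1 = torch.linalg.solve(M, torch.ones(n, k, 1))
    return Minv1.squeeze(-1) / Minv1.sum(dim=1)      # rows sum to one

torch.manual_seed(0)
theta = b_to_weights(torch.randn(8, 3), k=2)  # 8 instances, 2 base estimators
print(theta.sum(dim=1))                       # each row sums to 1
```

Since every step is a differentiable tensor operation, gradients flow from the loss through `theta` back to the network outputs `b_entries` automatically.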
Algorithm 2 shows the implementation of this method. As with UNNS, we also explore a variation which adds an extra network output $b(\mathbf{x})$ to $\hat{y}(\mathbf{x})$.
Figure 5 illustrates the full training process. For simplicity, the neural network early stopping patience criterion is set to a single epoch and the additional parameter $b(\mathbf{x})$ is not used.
3.1 Comparison with standard stacking methods
Most stacking methods create a meta-regression model by applying a regression method directly to the outputs of the individual predictions. In particular, the meta-regression method can be a neural network. Such a procedure differs from NN-Stacking in the shape of both the input and the output of the network. While standard stacking uses the base regression estimates $G(\mathbf{x})$ as input and $\hat{y}(\mathbf{x})$ as output, NN-Stacking uses the features $\mathbf{x}$ as input and either the weights $\theta(\mathbf{x})$ (for UNNS) or the matrix $B_{\mathbf{x}}$ (for CNNS) as outputs; the base regression estimates are used only in the loss function. Thus, the NN-Stacking method leads to more interpretable models. Section 4 compares these methods in terms of their predictive power. We also point out that our approach has some similarity to that of Sill et al. [2009], which allows each $\theta_i$ to depend on meta-features computed from $\mathbf{x}$ using a specific parametric form. Neural networks, on the other hand, provide a richer family of functions to model such dependencies (in fact, they are universal approximators [Csáji, 2001]).
3.2 Selecting base regressors
Consider the extreme case where $g_i(\mathbf{x}) = g_j(\mathbf{x})$ for some $i \neq j$, that is, the case in which two base regressors generate the same prediction over the whole feature space. Now, suppose that one fits an NNS (either CNNS or UNNS) in this case. Then any solution with weight pair $(\theta_i(\mathbf{x}), \theta_j(\mathbf{x}))$ yields the same predictions as the solution with $(\theta_i(\mathbf{x}) + \theta_j(\mathbf{x}), 0)$. Thus, one of the regressions can be dropped from the stacking with no loss in predictive power.
In practice, our experiments (Section 4) show that regression estimators that produce strongly correlated predictions do not contribute to the meta-learner. This suggests that one should choose base regressors of considerably distinct natures.
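A simple diagnostic along these lines flags near-duplicate base estimators by the correlation of their held-out prediction errors (a heuristic sketch; the function name, the 0.95 threshold, and the synthetic residuals are our own choices):

```python
import numpy as np

def redundant_pairs(errors, threshold=0.95):
    """Return pairs (i, j) of base estimators whose prediction errors are
    almost perfectly correlated; one member of each pair can likely be
    dropped from the stacking without hurting predictive power."""
    corr = np.corrcoef(errors)   # errors[i] = held-out residuals of estimator i
    k = corr.shape[0]
    return [(i, j) for i in range(k) for j in range(i + 1, k)
            if corr[i, j] >= threshold]

e = np.random.default_rng(1).normal(size=(3, 100))
e[2] = e[0] + 1e-3 * e[1]        # estimator 2 nearly duplicates estimator 0
print(redundant_pairs(e))        # -> [(0, 2)]
```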
3.3 Implementation details
A Python package that implements the methods proposed in this paper is available at github.com/randommm/nnstacking. The scripts for the experiments in Section 4 are available at github.com/vcoscrato/NNStacking. We work with the following specifications for the artificial neural networks:
Optimizer: we use the Adam algorithm [Kingma and Ba, 2014] and decrease its learning rate after the validation loss stops improving for a user-defined number of epochs.
Initialization: we use the Xavier Gaussian method proposed by Glorot and Bengio  to sample the initial parameters of the neural network.
Stopping criterion: to reduce the risk of over-fitting the neural networks, we use early stopping with a 90%/10% train/validation split for small datasets and a higher split factor (a larger proportion of training instances) for larger datasets, with a patience of 10 epochs without improvement on the validation set.
Dropout: We use dropout (with a rate of 50%) to address the problem of over-fitting [Hinton et al., 2012].
Software: we use PyTorch [Paszke et al., 2017].
Architecture: as a default we use a network with 3 hidden layers of size 100; these values were experimentally found to be suitable in our experiments (Section 4).
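Under these specifications, the default network can be sketched as follows (the ELU activation, layer ordering, and function name are our assumptions; the text above fixes only the depth, width, and dropout rate):

```python
import torch
from torch import nn

def default_stacking_net(d_in, k, hidden=100, depth=3, dropout=0.5):
    """Sketch of the default architecture described above: `depth` hidden
    layers of width `hidden`, each followed by dropout at rate 0.5."""
    layers, width = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(width, hidden), nn.ELU(), nn.Dropout(dropout)]
        width = hidden
    layers.append(nn.Linear(width, k))  # one output per base estimator
    return nn.Sequential(*layers)

net = default_stacking_net(d_in=13, k=6)
out = net(torch.randn(5, 13))
print(out.shape)   # torch.Size([5, 6])
```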
We compare stacking methods for the following UCI datasets:
The GPU kernel performance dataset (241600 instances, 13 features) [Nugteren and Codreanu, 2015],
The music year prediction dataset [Dheeru and Karra Taniskidou, 2017] (515345 instances and 90 features),
The blog feedback dataset [Buza, 2014] (60021 instances, 280 features),
The superconductivity dataset [Hamidieh, 2018] (21263 instances, 80 features).
First, we fit the following regression estimators (that will be stacked):
Three linear models: least squares with L1, L2, and no penalization [Friedman et al., 2001],
Two tree-based models: bagging and random forests [Friedman et al., 2001],
One boosting model: gradient boosting regression (GBR) [Meir and Rätsch, 2003].
The tuning parameters of these estimators are chosen by cross-validation using scikit-learn [Pedregosa et al., 2011].
Using these base estimators, we then fit four variations of NNS (both CNNS and UNNS, with and without the additional term $b(\mathbf{x})$) using the following specifications:
Tuning: three different architectures were tested for each neural network approach. The layer size was fixed at 100 and the number of hidden layers was set to 1, 3, and 10. We chose the architecture with the lowest validation mean squared error.
Train/validation/test split: for all datasets, we use 75% of the instances to fit the models, among which 10% are used for early stopping. The remaining 25% of the instances are used as a test set to compare the performance of the various models. The train/test split is performed at random. The cross-validated predictions (the matrix $G$ denoted in Algorithm 1) are obtained using 10-fold cross-validation on the training data.
Total fitting time: we compute the total fitting time (in seconds, including the time for cross-validating the network architecture) of each method on two cores of an AMD Ryzen 7 1800X processor running at 3.6GHz.
We compare our methods with Breiman’s linear stacking and the usual neural net stacking model described in Section 3.1. In addition to these, we also include a comparison with a direct neural network regression that has $\mathbf{x}$ as its input and $\hat{y}$ as its output.
The comparisons are made by evaluating the mean squared error (MSE, $n^{-1} \sum_{j} (y_j - \hat{y}(\mathbf{x}_j))^2$) and the mean absolute error (MAE, $n^{-1} \sum_{j} |y_j - \hat{y}(\mathbf{x}_j)|$) of each model on the test set.
4.1 GPU kernel performance dataset
Table 1 shows the results obtained for the GPU kernel performance dataset. Our UNNS method outperforms both Breiman’s stacking and the usual meta-regression stacking approaches in terms of MSE. Moreover, the UNNS model is also the best in terms of MAE, even though the gap between the models is smaller in this case. Our stacking methods also perform better than all base estimators, which suggests that each base model performs better in a distinct region of the feature space.
Figure 6 shows a boxplot with the distribution of the fitted $\theta_i(\mathbf{x})$'s for UNNS. Many fitted values fall outside the range $[0, 1]$, which explains why UNNS gives better results than Breiman’s method and CNNS (which both restrict the $\theta_i$'s to be proper weights).
Table 2 shows the correlation between the prediction errors of the base estimators. The linear estimators had an almost perfect pairwise correlation, which indicates that removing up to two of them from the stacking would not affect predictions. Indeed, after refitting UNNS without ridge regression and the lasso, we obtain exactly the same results. We also refit the best UNNS removing all of the linear estimators, to check whether poorly performing estimators make the stacking results worse. Although the resulting point estimates of the errors are lower than those obtained in Table 1, the confidence intervals overlap, which leads to the conclusion that the poor performance of the linear estimators does not damage the stacked estimator.
4.2 Music year dataset
Table 3 shows the accuracy metrics for the music year dataset. In this case, CNNS gave the best results, both in terms of MSE and MAE. For this dataset, Breiman’s stacking was worse than gradient boosting, one of the base regressors; the same happens with the usual meta-regression neural network approach. On the other hand, NNS could find a representation that combines the already powerful GBR estimator with less powerful ones in a way that leverages their individual performances.
All base estimators had highly correlated prediction errors (Table 4). In particular, two of the linear estimators could be removed from the stacking without affecting its performance. However, when all three linear estimators were removed, both the MSE and the MAE of the best NNS increased.
Figure 7 shows that the fitted NNS weights have a large dispersion. This illustrates the flexibility added by our method: models of very distinct natures (e.g., ridge regression, which imposes a linear shape on the regression function, and random forests, which are fully non-parametric) can complement each other, receiving weights of different magnitudes depending on the region of the feature space in which the new instance lies.
4.3 Blog feedback dataset
Table 5 shows the results for the blog feedback dataset. All stacked estimators had similar performance in terms of MSE. However, UNNS had slightly worse performance with respect to MAE. This may happen because the NNS is designed to minimize the MSE and not the MAE. Overall, for this small dataset, the NNS shows no improvement over Breiman’s stacking or the usual meta-regression neural network.
GBR had the lowest MSE among the base estimators, while bagging and random forests had the lowest MAE. This explains why these models have larger fitted weights (Figure 8). Moreover, the linear models’ prediction errors had an almost perfect pairwise correlation (Table 6), which suggests that removing up to two of them from the NNS would not impact its performance. The linear estimators also performed poorly compared to the other base regressors. We thus refit the best NNS for this data after removing them, and achieve an MSE of 531.88 and an MAE of 5.31. We conclude that the linear estimators neither damaged nor improved the NNS.
4.4 Superconductivity dataset
The results for the superconductivity dataset (Table 7) were similar to those obtained for the blog feedback data: the NNS methods perform slightly better than Breiman’s in terms of MSE, and worse in terms of MAE. Moreover, both tree-based models had the best MSE among base estimators, competing with the GBR in terms of MAE. Hence, they got larger fitted weights (Figure 9).
Table 8 shows that GBR did not have a high error correlation with the tree-based estimators (0.72 in both cases). This helps explain why, despite its higher MSE, GBR receives large fitted weights for some instances. One can also note that bagging and random forests had an almost perfect error correlation, which implies that removing one of them would lead to no change in the NNS. Finally, removing the linear models did not change the MSE and MAE of the stacking methods.
5 Conclusion and future extensions
NN-Stacking is a stacking tool with good predictive power that keeps the simplicity of interpretation of Breiman’s method. The key idea is to take advantage of the fact that distinct base models often perform better in different regions of the feature space, and thus to allow the weight associated with each model to vary with $\mathbf{x}$.
Our experiments show that both CNNS and UNNS can be suitable in different settings: in cases where the base estimators do not capture the complexity of the whole data, the freedom adopted by UNNS can lead to a larger improvement in performance. On the other hand, when the base estimators already have high performance, UNNS and CNNS have similar predictive power, but the restrictions imposed by CNNS guarantee a more interpretable solution. Both CNNS and UNNS have comparable computational cost.
In our experiments, we observe that NNS improves over standard stacking approaches especially on large datasets. This can be explained by the fact that NNS methods have higher complexity (i.e., a larger number of parameters) than the other approaches, so a larger sample size is needed to estimate them satisfactorily. The experiments also show that including weak regression methods (such as linear methods) typically does not harm the performance of NNS. In a few cases, however, adding such weak regressors slightly increases the prediction errors of the stacked estimators. This suggests that adding a penalization to the loss function that encourages the $\theta_i$'s to be zero may lead to improved results.
Future work includes extending these ideas to classification problems, as well as developing a leave-one-out version based on super learners [Van der Laan et al., 2007]. We also intend to develop a regularization method for the estimation of the population moments $M_{\mathbf{x}}$ to avoid over-fitting, as well as to study asymptotic properties of the estimator of $\theta^*(\mathbf{x})$.
Victor Coscrato and Marco Inácio are grateful for the financial support of CAPES: this study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. Rafael Izbicki is grateful for the financial support of FAPESP (grant 2017/03363-8) and CNPq (grant 306943/2017-4). The authors are also grateful for the suggestions given by Rafael Bassi Stern.
- Leo Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.
- Saso Džeroski and Bernard Ženko. Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54(3):255–273, 2004.
- Thomas G. Dietterich. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pages 1–15. Springer, 2000.
- Joseph Sill, Gábor Takács, Lester Mackey, and David Lin. Feature-weighted linear stacking. arXiv preprint arXiv:0911.0460, 2009.
- Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. CRC Press, 2012.
- Jianqing Fan and Irene Gijbels. Variable bandwidth and local linear regression smoothers. The Annals of Statistics, pages 2008–2036, 1992.
- Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
- Balázs Csanád Csáji. Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary, 24:48, 2001.
- Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research - Proceedings Track, 9:249–256, 2010.
- Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs), 2015.
- Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 2015. PMLR. URL http://proceedings.mlr.press/v37/ioffe15.html.
- Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, 2012.
- Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch, 2017.
- Cedric Nugteren and Valeriu Codreanu. CLTune: A generic auto-tuner for OpenCL kernels. In 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, pages 195–202. IEEE, 2015.
- Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
- Krisztian Buza. Feedback prediction for blogs. In Data Analysis, Machine Learning and Knowledge Discovery, pages 145–152. Springer, 2014.
- Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. arXiv preprint arXiv:1803.10260, 2018.
- Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.
- Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning, pages 118–183. Springer, 2003.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Mark J. Van der Laan, Eric C. Polley, and Alan E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007.
Appendix A Proofs
Proof of Theorem 3.1. Notice that
$$R(\theta) = \mathbb{E}\Big[ \big( Y - \theta(\mathbf{x})^{\top} G(\mathbf{x}) \big)^2 \Big] = \mathbb{E}\Big[ \mathbb{E}\Big[ \big( Y - \theta(\mathbf{x})^{\top} G(\mathbf{x}) \big)^2 \mid \mathbf{x} \Big] \Big].$$
Hence, in order to minimize $R(\theta)$, it suffices to minimize $\mathbb{E}[( Y - \theta(\mathbf{x})^{\top} G(\mathbf{x}) )^2 \mid \mathbf{x}]$ for each $\mathbf{x}$. Now, because $\mathbf{1}^{\top} \theta(\mathbf{x}) = 1$, it follows that
$$\mathbb{E}\Big[ \big( Y - \theta(\mathbf{x})^{\top} G(\mathbf{x}) \big)^2 \mid \mathbf{x} \Big] = \mathbb{E}\Big[ \big( \theta(\mathbf{x})^{\top} (Y \mathbf{1} - G(\mathbf{x})) \big)^2 \mid \mathbf{x} \Big] = \theta(\mathbf{x})^{\top} M_{\mathbf{x}} \, \theta(\mathbf{x}), \qquad (A.1)$$
where $M_{\mathbf{x}} = \mathbb{E}[ (Y\mathbf{1} - G(\mathbf{x}))(Y\mathbf{1} - G(\mathbf{x}))^{\top} \mid \mathbf{x} ]$. Using Lagrange multipliers, the optimal weights can be found by minimizing
$$\theta(\mathbf{x})^{\top} M_{\mathbf{x}} \, \theta(\mathbf{x}) - \lambda \big( \mathbf{1}^{\top} \theta(\mathbf{x}) - 1 \big),$$
and therefore the optimal solution satisfies $\theta(\mathbf{x}) = \frac{\lambda}{2} M_{\mathbf{x}}^{-1} \mathbf{1}$. Substituting this into the constraint $\mathbf{1}^{\top} \theta(\mathbf{x}) = 1$, we obtain
$$\frac{\lambda}{2} = \frac{1}{\mathbf{1}^{\top} M_{\mathbf{x}}^{-1} \mathbf{1}},$$
which yields the optimal solution
$$\theta^*(\mathbf{x}) = \frac{M_{\mathbf{x}}^{-1} \mathbf{1}}{\mathbf{1}^{\top} M_{\mathbf{x}}^{-1} \mathbf{1}}.$$
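The closed-form solution can be checked numerically: for a random positive definite $M$, the candidate $\theta^*$ satisfies the constraint exactly, and its gradient $2M\theta^*$ is proportional to $\mathbf{1}$, as the Lagrangian stationarity condition requires (a verification sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
M = A @ A.T + 3 * np.eye(3)   # a random positive definite "error moment" matrix
ones = np.ones(3)

theta = np.linalg.solve(M, ones)
theta /= ones @ theta         # theta* = M^{-1} 1 / (1^T M^{-1} 1)

# KKT check: the constraint 1^T theta = 1 holds exactly, and the gradient
# 2 M theta* is a constant vector (proportional to 1), so theta* is the
# stationary point of the Lagrangian.
grad = M @ theta
print(theta.sum(), np.ptp(grad))
```

Any feasible perturbation (a direction whose entries sum to zero) can only increase the quadratic form, confirming that the stationary point is the constrained minimum.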