1 Introduction
The standard procedures for model selection in prediction problems are cross-validation and data splitting. However, such an approach is known to be suboptimal [Džeroski and Ženko, 2004, Dietterich, 2000, Sill et al., 2009]: one might achieve more accurate predictions by combining different regression estimators rather than by selecting the best one. Stacking methods [Zhou, 2012] are a way of overcoming this drawback of standard model selection.
A well-known stacking method was introduced by Breiman [1996]. This approach consists in taking a linear combination of base regression estimators. That is, the stacked regression has the shape $g^*(\mathbf{x}) = \sum_{i=1}^{k} \theta_i g_i(\mathbf{x})$, where
$g_1, \dots, g_k$ are the individual regression estimators (such as random forests, linear regression, or support vector regression),
$\theta_1, \dots, \theta_k$ are weights that are estimated from data, and $\mathbf{x}$ represents the features.
Even though this linear stacking method leads to combined estimators that are easy to interpret, it may be suboptimal in cases where models have different local accuracy, i.e., situations where the performance of these estimators varies over the feature space. Example 1.1 illustrates this situation.
Example 1.1.
Consider predicting a label $y$ based on a single feature, $x$, using the data in Figure 1. We fit two least squares estimators: a linear fit and a quadratic fit. Neither model is uniformly better: the linear fit has better performance on one region of the feature space, while the quadratic fit yields better performance on another. One may take this into account when creating the stacked estimator by assigning different weights to each regression according to $x$: a larger weight can be assigned to the linear fit on the region where it performs well, and a lower weight elsewhere.
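The situation in Example 1.1 can be reproduced with a small NumPy sketch. The data below are synthetic (the data of Figure 1 are not reproduced here, so a piecewise curve stands in for them), and the region boundary is an assumption for illustration only:

```python
import numpy as np

# Synthetic stand-in for Example 1.1: the true regression is roughly
# linear on part of the feature space and curved on another part.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y_true = np.where(x < 5, x, 0.2 * (x - 5) ** 2 + 5)
y = y_true + rng.normal(0, 0.3, x.size)

# Two global least squares fits: linear and quadratic.
lin = np.poly1d(np.polyfit(x, y, 1))
quad = np.poly1d(np.polyfit(x, y, 2))

def mse(model, lo, hi):
    """Mean squared error of `model` restricted to the region [lo, hi)."""
    mask = (x >= lo) & (x < hi)
    return float(np.mean((y[mask] - model(x[mask])) ** 2))

# Compare the models' errors region by region: neither fit need win on
# both regions, which is exactly what motivates feature-varying weights.
print("left region :", mse(lin, 0, 5), mse(quad, 0, 5))
print("right region:", mse(lin, 5, 10), mse(quad, 5, 10))
```

Since the linear model is nested inside the quadratic one, the quadratic fit always has the lower *global* training error; the interesting behavior only appears when the errors are broken down by region.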
It is well known that different regression methods may perform better on different regions of the feature space. For instance, because local estimators do not suffer from boundary effects, they achieve good performance close to the edges of the feature space [Fan and Gijbels, 1992]. Random forests, on the other hand, implicitly perform feature selection, and thus may have better performance in regions where some features are not relevant [Breiman, 2001].
In this work we improve Breiman's approach so that it can take local accuracy into account. That is, we develop a meta-learner that is able to learn which models have higher importance on each region of the feature space. We achieve this goal by allowing each parameter $\theta_i$ to vary as a function of the features $\mathbf{x}$. In this way, the meta-learner can adapt to each region of the feature space, which yields higher predictive power. Our approach keeps the local interpretability of the linear stacking model.
The remainder of this work is organized as follows. Section 2 introduces the notation used in the paper, as well as our method. Section 3 gives details on its implementation. Section 4 applies our method to a variety of datasets to evaluate its performance. Section 5 concludes the paper.
2 Notation and Motivation
The stacking method proposed by Breiman [1996] is a linear combination of regression functions for a label $Y$. More precisely, let $\mathbf{g} = (g_1, \dots, g_k)$ be a vector of regression estimators, that is, $g_i(\mathbf{x})$ is an estimate of $\mathbb{E}[Y \mid \mathbf{x}]$. The linear stacked regression is defined as

$$g^*(\mathbf{x}) = \sum_{i=1}^{k} \theta_i\, g_i(\mathbf{x}), \qquad (2.1)$$

where $\theta_1, \dots, \theta_k$ are meta-parameters. One way to estimate the meta-parameters from data $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)$ is through the least squares method, computed in a leave-one-out setup:

$$\hat{\theta} = \arg\min_{\theta} \sum_{j=1}^{n} \Big( y_j - \sum_{i=1}^{k} \theta_i\, g_i^{(-j)}(\mathbf{x}_j) \Big)^2, \qquad (2.2)$$

where $g_i^{(-j)}(\mathbf{x}_j)$ is the prediction for $y_j$ made by the $i$-th regression fitted without the $j$-th instance. It is important to use this hold-out approach: if the base regression functions were constructed using the same data as $\hat{\theta}$, the stacked estimator could overfit the training data.
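The hold-out construction in Equation 2.2 can be sketched as follows. Exact leave-one-out is replaced here by 10-fold cross-validation (a common, cheaper substitute, and the one used in Section 4); the base models, data, and fold count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k_folds = 200, 10
x = rng.uniform(-3, 3, n)
y = np.sin(x) + rng.normal(0, 0.1, n)

# Out-of-fold predictions for two base models (degree-1 and degree-3
# polynomial fits): each prediction for instance j comes from base
# models fitted WITHOUT instance j's fold.
oof = np.zeros((n, 2))
folds = np.array_split(rng.permutation(n), k_folds)
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    for i, deg in enumerate((1, 3)):
        coef = np.polyfit(x[train_idx], y[train_idx], deg)
        oof[test_idx, i] = np.polyval(coef, x[test_idx])

# Stacking meta-parameters by (unconstrained) least squares on the
# out-of-fold prediction matrix, as in Equation 2.2.
theta, *_ = np.linalg.lstsq(oof, y, rcond=None)
print("stacking weights:", theta)
```

Because the least squares solution minimizes over all linear combinations, the stacked in-sample error can never exceed that of either base model's out-of-fold predictions alone.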
In order for the stacked estimator to be easier to interpret, Breiman [1996] also requires the $\theta_i$'s to be weights, that is, $\theta_i \geq 0$ for every $i$ and $\sum_{i=1}^{k} \theta_i = 1$.
Even though Breiman's solution works in a variety of settings, it does not take into account that each regression method may perform better in distinct regions of the feature space. In order to overcome this limitation, we propose Neural Network Stacking (NNS), which generalizes Breiman's approach by allowing each $\theta_i$ in Equation 2.1 to vary with $\mathbf{x}$. That is, our meta-learner has the shape

$$G(\mathbf{x}) = \sum_{i=1}^{k} \theta_i(\mathbf{x})\, g_i(\mathbf{x}), \qquad (2.3)$$

where $\theta_i : \mathbb{R}^d \to \mathbb{R}$. In other words, the NNS is a local linear meta-learner. Example 2.1 shows that NNS can substantially decrease the prediction error of Breiman's approach.
Example 2.1.
We fit both Breiman's linear meta-learner and our NNS local linear meta-learner to the models fitted in Example 1.1. Figure 2 shows that Breiman's meta-learner is not able to fit the true regression satisfactorily because both estimators have poor performance on specific regions of the data. Feature-varying weights, on the other hand, yield a better fit.
3 Methodology
Our goal is to find $\theta(\mathbf{x}) = (\theta_1(\mathbf{x}), \dots, \theta_k(\mathbf{x}))$ that minimizes the mean squared risk

$$R(\theta) = \mathbb{E}\big[(Y - G(\mathbf{X}))^2\big],$$

where $G$ is defined as in Equation 2.3.
We estimate $\theta(\mathbf{x})$ via an artificial neural network. This network takes $\mathbf{x}$ as input and produces the output $\theta(\mathbf{x})$, which is then used to obtain $G(\mathbf{x})$. To estimate the weights of the network, we introduce an appropriate loss function that captures the goal of having a small risk. This is done by using the loss function

$$\hat{L}(\theta) = \sum_{j=1}^{n} \Big( y_j - \sum_{i=1}^{k} \theta_i(\mathbf{x}_j)\, g_i(\mathbf{x}_j) \Big)^2.$$

Notice that the base regression estimators are used only when evaluating the loss function; they are not inputs of the network. With this approach, we allow each $\theta_i(\mathbf{x})$ to be a complex function of the data. We call this method Unconstrained Neural Network Stacking (UNNS). Figure 3 illustrates a UNNS that stacks 2 base estimators in a regression problem with four features.
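The UNNS loss can be made concrete with a deliberately simplified stand-in: instead of a neural network, take each $\theta_i(x)$ to be an affine function of a single feature (the simplest feature-varying family), in which case the loss above is minimized in closed form. This sketch is ours, not the paper's implementation — the actual method replaces the affine map with a network trained by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(0, 10, n)
y = np.where(x < 5, x, 0.2 * (x - 5) ** 2 + 5) + rng.normal(0, 0.2, n)

# Two (deliberately mismatched) base estimators fitted on the same data.
g1 = np.poly1d(np.polyfit(x, y, 1))   # global linear fit
g2 = np.poly1d(np.polyfit(x, y, 2))   # global quadratic fit
G = np.column_stack([g1(x), g2(x)])   # base predictions, shape (n, 2)

# theta_i(x) = a_i + b_i * x  ==>  the stacked prediction
# sum_i theta_i(x) g_i(x) is LINEAR in (a_1, b_1, a_2, b_2),
# so the UNNS-style loss is minimized by ordinary least squares.
phi = np.column_stack([np.ones(n), x])                 # (n, 2)
design = np.einsum('ni,nl->nil', G, phi).reshape(n, 4)  # [g1, g1*x, g2, g2*x]
w, *_ = np.linalg.lstsq(design, y, rcond=None)

def theta(x_new):
    """Feature-varying weights at x_new (not constrained to sum to one)."""
    return np.array([w[0] + w[1] * x_new, w[2] + w[3] * x_new])
```

Since the constant weights $(1, 0)$ and $(0, 1)$ lie inside this family, the feature-varying fit can never have a larger training error than either base estimator alone.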
In addition to the linear stacking, this approach allows the user to easily take advantage of the neural network architecture by directly adding a network output node, $\theta_0(\mathbf{x})$, to the stacking. That is, we also consider a variation of UNNS which takes the shape

$$G(\mathbf{x}) = \theta_0(\mathbf{x}) + \sum_{i=1}^{k} \theta_i(\mathbf{x})\, g_i(\mathbf{x}).$$

This has some similarity to adding a single neural network estimator to the stacking. However, we use the same architecture to create the additional term, which mitigates computation time. Algorithm 1 shows how this method is implemented. In order to avoid overfitting, the $g_i$'s and the $\theta_i$'s are estimated using different folds of the training set.
In order to achieve an interpretable stacked solution, we follow Breiman's suggestion and consider a second approach to estimating the $\theta_i$'s, which consists in minimizing the loss under the constraint that the $\theta_i$'s are weights, that is, $\theta_i(\mathbf{x}) \geq 0$ and $\sum_{i=1}^{k} \theta_i(\mathbf{x}) = 1$. Unfortunately, it is challenging to impose this restriction directly on the solution of the neural network. Instead, we use a different parametrization of the problem, which is motivated by Theorem 3.1.
Theorem 3.1.
The solution of

$$\min_{\theta(\mathbf{x})}\; \mathbb{E}\Big[\Big(Y - \sum_{i=1}^{k} \theta_i(\mathbf{X})\, g_i(\mathbf{X})\Big)^2\Big]$$

under the constraint that $\theta_i(\mathbf{x}) \geq 0$ and $\sum_{i=1}^{k} \theta_i(\mathbf{x}) = 1$ is given by

$$\theta(\mathbf{x}) = \frac{M_{\mathbf{x}}^{-1} \vec{1}}{\vec{1}^{\top} M_{\mathbf{x}}^{-1} \vec{1}}, \qquad (3.1)$$

where $\vec{1}$ is a $k$-dimensional vector of ones and $M_{\mathbf{x}}$ is the $k \times k$ matrix with entries

$$m_{i,j}(\mathbf{x}) = \mathbb{E}\big[e_i(\mathbf{X})\, e_j(\mathbf{X}) \mid \mathbf{X} = \mathbf{x}\big],$$

with $e_i(\mathbf{x}) = Y - g_i(\mathbf{x})$.
Theorem 3.1 shows that, under the given constraints, $\theta(\mathbf{x})$ is uniquely defined by $M_{\mathbf{x}}$. Now, because $M_{\mathbf{x}}$ is a covariance matrix, it is positive definite, and thus the Cholesky decomposition can be applied to it. It follows that $M_{\mathbf{x}} = B_{\mathbf{x}} B_{\mathbf{x}}^{\top}$, where $B_{\mathbf{x}}$ is a lower triangular matrix. This suggests estimating $\theta(\mathbf{x})$ by first estimating $B_{\mathbf{x}}$ and then plugging the estimate back into Equation 3.1. That is, in order to obtain a good estimator under the above-mentioned restrictions, the output of the network is set to be $B_{\mathbf{x}}$ rather than the weights themselves.¹
¹ Since the gradients for all matrix operations are implemented for PyTorch tensor classes, the additional operations of the CNNS method are automatically backpropagated once PyTorch's backward method is called on the loss evaluation.
We name this method Constrained Neural Network Stacking (CNNS). Figure 4 illustrates a CNNS that stacks 2 base regressors (that is, $B_{\mathbf{x}}$ is a $2 \times 2$ lower triangular matrix) in a 4-feature regression problem. Algorithm 2 shows the implementation of this method. As with UNNS, we also explore a variation which adds an extra network output $\theta_0(\mathbf{x})$ to $G(\mathbf{x})$.
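The reparametrization can be sketched in NumPy as follows (the function name and the example entries are ours): given the $k(k+1)/2$ raw network outputs that fill the lower-triangular $B_{\mathbf{x}}$, the weights are recovered by forming $M_{\mathbf{x}} = B_{\mathbf{x}} B_{\mathbf{x}}^{\top}$ and applying Equation 3.1:

```python
import numpy as np

def weights_from_cholesky_factor(b_entries, k):
    """Map k*(k+1)/2 raw network outputs to stacking weights via Eq. 3.1.

    The entries fill a lower-triangular B (row-major), M = B B^T, and
    theta = M^{-1} 1 / (1^T M^{-1} 1). The weights sum to one by
    construction, for any B with a nonzero diagonal.
    """
    B = np.zeros((k, k))
    B[np.tril_indices(k)] = b_entries
    M = B @ B.T
    ones = np.ones(k)
    m_inv_1 = np.linalg.solve(M, ones)
    return m_inv_1 / (ones @ m_inv_1)

# Example with k = 2 base regressors: three raw outputs fill B.
theta = weights_from_cholesky_factor(np.array([1.0, 0.3, 0.8]), k=2)
print(theta, theta.sum())
```

This makes the sum-to-one constraint structural rather than something the optimizer has to learn, which is the point of routing the network output through $B_{\mathbf{x}}$.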
Figure 5 illustrates the full training process. For simplicity, the neural network early stopping patience criterion is set to a single epoch and the additional parameter $\theta_0(\mathbf{x})$ is not used.
3.1 Comparison with standard stacking methods
Most stacking methods create a meta-regression model by applying a regression method directly to the outputs of the individual predictions; in particular, the meta-regression method can be a neural network. Such a procedure differs from NN-Stacking in the shape of both the input and the output of the network. While standard stacking uses the base regression estimates $g_1(\mathbf{x}), \dots, g_k(\mathbf{x})$ as input and the final prediction as output, NN-Stacking uses the features $\mathbf{x}$ as input and either the weights $\theta(\mathbf{x})$ (for UNNS) or $B_{\mathbf{x}}$ (for CNNS) as outputs. The base regression estimates enter only through the loss function. Thus, the NN-Stacking method leads to more interpretable models. Section 4 compares these methods in terms of their predictive power. We also point out that our approach has some similarity to Sill et al. [2009], which allows each $\theta_i$ to depend on meta-features computed from $\mathbf{x}$ through a specific parametric form. Neural networks, on the other hand, provide a richer family of functions to model such dependencies (in fact, they are universal approximators; Csáji [2001]).
3.2 Selecting base regressors
Consider the extreme case where $g_i(\mathbf{x}) = g_{i'}(\mathbf{x})$ for every $\mathbf{x}$ and some $i \neq i'$, that is, the case in which two base regressors generate the same prediction over the whole feature space. Now, suppose that one fits an NNS (either CNNS or UNNS) in this case. Then $\theta_i(\mathbf{x})\, g_i(\mathbf{x}) + \theta_{i'}(\mathbf{x})\, g_{i'}(\mathbf{x}) = \big(\theta_i(\mathbf{x}) + \theta_{i'}(\mathbf{x})\big)\, g_i(\mathbf{x})$. Thus, one of the regressions can be dropped from the stacking with no loss in predictive power.
In practice, our experiments (Section 4) show that regression estimators that have strongly correlated results do not contribute to the meta-learner. This suggests that one should choose base regressors of considerably distinct natures.
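The diagnostic used in Section 4 — the correlation between the base estimators' prediction errors — can be sketched as follows, with synthetic predictions standing in for real base regressors:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
y = rng.normal(0, 1, n)
pred_a = y + rng.normal(0, 0.5, n)        # base estimator A
pred_b = pred_a + rng.normal(0, 0.01, n)  # near-duplicate of A
pred_c = y + rng.normal(0, 0.5, n)        # estimator with independent errors

# Correlation matrix of the prediction errors: a near-perfect off-diagonal
# entry flags a redundant base regressor that can be dropped.
errors = np.column_stack([pred_a, pred_b, pred_c]) - y[:, None]
corr = np.corrcoef(errors.T)
print(np.round(corr, 2))
```

A high `corr[0, 1]` here indicates that either A or B can be removed from the stacking with little loss, mirroring the analysis of Tables 2, 4, 6, and 8.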
3.3 Implementation details
A Python package that implements the methods proposed in this paper is available at github.com/randommm/nnstacking. The scripts for the experiments in Section 4 are available at github.com/vcoscrato/NNStacking. We work with the following specifications for the artificial neural networks:

Optimizer: we use the Adam algorithm [Kingma and Ba, 2014] and decrease its learning rate after the validation loss stops improving for a user-defined number of epochs.

Initialization: we use the Xavier Gaussian method proposed by Glorot and Bengio [2010] to sample the initial parameters of the neural network.

Layer activation and regularization: we use ELU [Clevert et al., 2015]
as the activation function, and do not use regularization.

Normalization: we use batch normalization [Ioffe and Szegedy, 2015] to speed up the training process.
Stopping criterion: to reduce the risk of overfitting the neural networks, we use a 90%/10% train/validation split for early stopping on small datasets and a higher split factor (increasing the proportion of training instances) on larger datasets, with a patience of 10 epochs without improvement on the validation set.

Dropout: We use dropout (with a rate of 50%) to address the problem of overfitting [Hinton et al., 2012].

Software: we use PyTorch [Paszke et al., 2017].

Architecture: as default values we use a network with 3 hidden layers of size 100; these values were experimentally found to be suitable in our experiments (Section 4).
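The default architecture's forward pass can be sketched without a deep learning framework. The actual implementation uses PyTorch with batch normalization and dropout, which are omitted here; the layer sizes match the defaults above, while the input and output dimensions are illustrative:

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU activation [Clevert et al., 2015]."""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

rng = np.random.default_rng(5)
d_in, hidden, k = 4, 100, 2  # 4 features, 2 base estimators to weight

def xavier(n_in, n_out):
    """Xavier (Glorot) Gaussian initialization, as used in the paper."""
    return rng.normal(0, np.sqrt(2.0 / (n_in + n_out)), (n_in, n_out))

# 3 hidden layers of 100 units, plus a linear output layer.
Ws = [xavier(d_in, hidden), xavier(hidden, hidden),
      xavier(hidden, hidden), xavier(hidden, k)]
bs = [np.zeros(w.shape[1]) for w in Ws]

def forward(x):
    """Map a batch of feature vectors to k raw theta(x) outputs (UNNS)."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = elu(h @ W + b)
    return h @ Ws[-1] + bs[-1]

theta = forward(rng.normal(size=(8, d_in)))  # weights for a batch of 8
```

For CNNS, the output layer would instead emit the $k(k+1)/2$ entries of $B_{\mathbf{x}}$, with the weights recovered through Equation 3.1.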
4 Experiments
We compare stacking methods for the following UCI datasets:

The GPU kernel performance dataset (241600 instances, 13 features) [Nugteren and Codreanu, 2015],

The music year prediction dataset [Dheeru and Karra Taniskidou, 2017] (515345 instances and 90 features),

The blog feedback dataset [Buza, 2014] (60021 instances, 280 features),

The superconductivity dataset [Hamidieh, 2018] (21263 instances, 80 features).
First, we fit the following regression estimators (that will be stacked):

Three linear models: lasso (L1 penalization), ridge (L2 penalization), and ordinary least squares (no penalization) [Friedman et al., 2001],

Two tree-based models: bagging and random forests [Friedman et al., 2001],

A gradient boosting method (GBR) [Meir and Rätsch, 2003].
The tuning parameters of these estimators are chosen by cross-validation using scikit-learn [Pedregosa et al., 2011].
Using these base estimators, we then fit four variations of NNS (both CNNS and UNNS, with and without the additional output $\theta_0(\mathbf{x})$) using the following specifications:

Tuning: three different architectures were tested for each neural network approach. The layer size was fixed at 100 and the number of hidden layers was set to 1, 3, and 10. We choose the architecture with the lowest validation mean squared error.

Train/validation/test split: for all datasets, we use 75% of the instances to fit the models, among which 10% are used for early stopping. The remaining 25% of the instances are used as a test set to compare the performance of the various models. The train/test split is performed at random. The cross-validated predictions (the matrix of out-of-fold base predictions in Algorithm 1) are obtained using 10-fold cross-validation on the training data.

Total fitting time: we compute the total fitting time (in seconds, including the time for cross-validating the network architecture) of each method on two cores of an AMD Ryzen 7 1800X processor running at 3.6 GHz.
We compare our methods with Breiman's linear stacking and the usual neural network stacking model described in Section 3.1. In addition to these, we also include a comparison with a direct neural network regression that takes the features $\mathbf{x}$ as input and predicts $y$ directly.
The comparisons are made by evaluating the mean squared error (MSE, $\frac{1}{n}\sum_{j}(y_j - \hat{y}_j)^2$) and the mean absolute error (MAE, $\frac{1}{n}\sum_{j}|y_j - \hat{y}_j|$) of each model on a test set. We also compute the standard error of each of these metrics, which enables one to compute confidence intervals for the errors of each method.
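The evaluation above can be sketched in a few lines; the synthetic labels and predictions are placeholders, and approximate confidence intervals follow as estimate ± 2 × standard error:

```python
import numpy as np

def mse_mae_with_se(y_true, y_pred):
    """Test-set MSE and MAE, each with the standard error of its mean."""
    sq = (y_true - y_pred) ** 2
    ab = np.abs(y_true - y_pred)
    n = y_true.size
    return (sq.mean(), sq.std(ddof=1) / np.sqrt(n),
            ab.mean(), ab.std(ddof=1) / np.sqrt(n))

# Placeholder test set: predictions off by Gaussian noise of sd 0.5.
rng = np.random.default_rng(4)
y = rng.normal(size=1000)
pred = y + rng.normal(0, 0.5, 1000)
mse, mse_se, mae, mae_se = mse_mae_with_se(y, pred)
print(f"MSE {mse:.3f} (SE {mse_se:.3f}), MAE {mae:.3f} (SE {mae_se:.3f})")
```

Two methods whose intervals overlap — as in the refits discussed in Sections 4.1 and 4.3 — cannot be declared different on this evidence.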
4.1 GPU kernel performance dataset
Table 1 shows the results obtained for the GPU kernel performance dataset. Our UNNS method outperforms both Breiman's stacking and the usual meta-regression stacking approaches in terms of MSE. Moreover, the UNNS model is also the best one in terms of MAE, even though the gap between the models is smaller in this case. Our stacking methods also perform better than all base estimators. This suggests that each base model performs better on a distinct region of the feature space.
Figure 6 shows a boxplot with the distribution of the fitted $\theta_i(\mathbf{x})$'s for UNNS. Many fitted values fall outside the interval $[0, 1]$, which explains why UNNS gives better results than Breiman's method and CNNS (both of which restrict the $\theta_i$'s to be proper weights).
Table 2 shows the correlation between the prediction errors of the base estimators. The linear estimators had an almost perfect pairwise correlation, which indicates that removing up to two of them from the stacking would not affect predictions. Indeed, after refitting UNNS without ridge regression and the lasso, we obtain exactly the same results. We also refit the best UNNS with all of the linear estimators removed, to check whether poorly performing estimators worsen the stacking results. Although the resulting point estimates of the errors are lower than those obtained in Table 1, the confidence intervals intersect, which leads to the conclusion that the poor performance of the linear estimators does not damage the stacked estimator.
4.2 Music year dataset
Table 3 shows the accuracy metrics for the music year dataset. In this case, CNNS gave the best results, both in terms of MSE and MAE. For this dataset, Breiman's stacking was worse than gradient boosting, one of the base regressors; the same happens with the usual meta-regression neural network approach. On the other hand, NNS could find a representation that combines the already powerful GBR estimator with less powerful ones in a way that leverages their individual performances.
All base estimators had highly correlated prediction errors (Table 4). In particular, two of the linear estimators could be removed from the stacking without affecting its performance. However, when all three linear estimators were removed, both the MSE and the MAE of the best NNS increased.
Figure 7 shows that the fitted NNS weights have a large dispersion. This illustrates the flexibility added by our method: models of very distinct natures (e.g., ridge regression, which imposes a linear shape on the regression function, and random forests, which are fully nonparametric) can complement each other, receiving weights of different magnitudes depending on the region of the feature space in which the new instance lies.
4.3 Blog feedback dataset
Table 5 shows the results for the blog feedback dataset. All stacked estimators had similar performance in terms of MSE. However, UNNS had slightly worse performance with respect to MAE. This may happen because NNS is designed to minimize the MSE, not the MAE. Overall, for this small dataset, NNS shows no improvement over Breiman's stacking or the usual meta-regression neural network.
GBR had the lowest MSE among the base estimators, while bagging and random forests had the lowest MAE. This explains why these models have larger fitted weights (Figure 8). Moreover, the prediction errors of the linear models had an almost perfect correlation (Table 6), which suggests that removing up to two of them from the NNS would not impact its performance. The linear estimators also performed poorly when compared to the other base regressors. We thus refit the best NNS for this data after removing these estimators, and achieve an MSE of 531.88 and an MAE of 5.31. We conclude that the linear estimators neither damaged nor improved the NNS.
4.4 Superconductivity dataset
The results for the superconductivity dataset (Table 7) were similar to those obtained for the blog feedback data: the NNS methods perform slightly better than Breiman's in terms of MSE, and worse in terms of MAE. Moreover, both tree-based models had the best MSE among the base estimators, competing with GBR in terms of MAE; hence, they received larger fitted weights (Figure 9).
Table 8 shows that GBR's errors did not have a high correlation with those of the tree-based estimators (0.72 in both cases). This is another reason why, despite its higher MSE, GBR receives large fitted weights for some instances. One can also note that bagging and random forests had an almost perfect error correlation, which implies that removing one of them would lead to no changes in the NNS. Finally, removing the linear models did not change the MSE or the MAE of the stacking methods.
5 Conclusion and future extensions
NN-Stacking is a stacking tool with good predictive power that keeps the simplicity of interpretation of Breiman's method. The key idea is to take advantage of the fact that distinct base models often perform better in different regions of the feature space, and thus to allow the weight associated with each model to vary with $\mathbf{x}$.
Our experiments show that both CNNS and UNNS can be suitable in different settings: in cases where the base estimators do not capture the complexity of the whole data, the freedom adopted by UNNS can lead to a larger improvement in performance. On the other hand, when the base estimators already have high performance, UNNS and CNNS have similar predictive power, but the restrictions imposed by CNNS guarantee a more interpretable solution. Both CNNS and UNNS have comparable computational costs.
In our experiments, we observe that NNS improves over standard stacking approaches especially on large datasets. This can be explained by the fact that NNS methods have higher complexity (i.e., a larger number of parameters) than the other approaches, so a larger sample size is needed to estimate them satisfactorily. The experiments also show that including weak regression methods (such as linear methods) might decrease the errors of NNS. In a few cases, however, adding such weak regressors slightly increases the prediction errors of the stacked estimators. This suggests that adding a penalization to the loss function that encourages the $\theta_i$'s to be zero may lead to improved results.
Future work includes extending these ideas to classification problems, as well as developing a leave-one-out version based on super learners [Van der Laan et al., 2007]. We also intend to develop a regularization method for the estimation of the population moments to avoid overfitting, as well as to study the asymptotic properties of the estimator of $\theta(\mathbf{x})$.
Acknowledgments
Victor Coscrato and Marco Inácio are grateful for the financial support of CAPES: this study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. Rafael Izbicki is grateful for the financial support of FAPESP (grant 2017/03363-8) and CNPq (grant 306943/2017-4). The authors are also grateful for the suggestions given by Rafael Bassi Stern.
References
Breiman [1996] Leo Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.

Džeroski and Ženko [2004] Saso Džeroski and Bernard Ženko. Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54(3):255–273, 2004.

Dietterich [2000] Thomas G. Dietterich. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pages 1–15. Springer, 2000.
 Sill et al. [2009] Joseph Sill, Gábor Takács, Lester Mackey, and David Lin. Featureweighted linear stacking. arXiv preprint arXiv:0911.0460, 2009.
 Zhou [2012] ZhiHua Zhou. Ensemble methods: foundations and algorithms. CRC press, 2012.
 Fan and Gijbels [1992] Jianqing Fan and Irene Gijbels. Variable bandwidth and local linear regression smoothers. The Annals of Statistics, pages 2008–2036, 1992.
 Breiman [2001] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 Csáji [2001] Balázs Csanád Csáji. Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary, 24:48, 2001.
 Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 Glorot and Bengio [2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research  Proceedings Track, 9:249–256, 01 2010.
 Clevert et al. [2015] DjorkArné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus), 2015.
 Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07 2015. PMLR. URL http://proceedings.mlr.press/v37/ioffe15.html.
 Hinton et al. [2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing coadaptation of feature detectors. CoRR, 2012.
 Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch, 2017.
 Nugteren and Codreanu [2015] Cedric Nugteren and Valeriu Codreanu. Cltune: A generic autotuner for opencl kernels. In 2015 IEEE 9th International Symposium on Embedded Multicore/Manycore SystemsonChip, pages 195–202. IEEE, 2015.
 Dheeru and Karra Taniskidou [2017] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Buza [2014] Krisztian Buza. Feedback prediction for blogs. In Data analysis, machine learning and knowledge discovery, pages 145–152. Springer, 2014.
 Hamidieh [2018] Kam Hamidieh. A datadriven statistical model for predicting the critical temperature of a superconductor. arXiv preprint arXiv:1803.10260, 2018.
 Friedman et al. [2001] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, NY, USA:, 2001.
 Meir and Rätsch [2003] Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced lectures on machine learning, pages 118–183. Springer, 2003.
 Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikitlearn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
 Van der Laan et al. [2007] Mark J Van der Laan, Eric C Polley, and Alan E Hubbard. Super learner. Statistical applications in genetics and molecular biology, 6(1), 2007.
Appendix A Proofs
Proof of Theorem 3.1. Notice that

$$\mathbb{E}\Big[\Big(Y - \sum_{i=1}^{k} \theta_i(\mathbf{X})\, g_i(\mathbf{X})\Big)^2\Big] = \mathbb{E}\Big[\mathbb{E}\Big[\Big(Y - \sum_{i=1}^{k} \theta_i(\mathbf{X})\, g_i(\mathbf{X})\Big)^2 \,\Big|\, \mathbf{X}\Big]\Big].$$

Hence, in order to minimize the risk, it suffices to minimize the inner conditional expectation for each $\mathbf{x}$. Now, because $\sum_{i=1}^{k} \theta_i(\mathbf{x}) = 1$, it follows that

$$Y - \sum_{i=1}^{k} \theta_i(\mathbf{x})\, g_i(\mathbf{x}) = \sum_{i=1}^{k} \theta_i(\mathbf{x}) \big(Y - g_i(\mathbf{x})\big) = \theta(\mathbf{x})^{\top} e(\mathbf{x}),$$

where $e(\mathbf{x}) = \big(Y - g_1(\mathbf{x}), \dots, Y - g_k(\mathbf{x})\big)^{\top}$, and therefore

$$\mathbb{E}\big[\big(\theta(\mathbf{x})^{\top} e(\mathbf{x})\big)^2 \mid \mathbf{x}\big] = \theta(\mathbf{x})^{\top} M_{\mathbf{x}}\, \theta(\mathbf{x}).$$

Using Lagrange multipliers, the optimal weights can be found by minimizing

$$\theta(\mathbf{x})^{\top} M_{\mathbf{x}}\, \theta(\mathbf{x}) - \lambda \big(\vec{1}^{\top} \theta(\mathbf{x}) - 1\big). \qquad (A.1)$$

Now,

$$\frac{\partial}{\partial \theta(\mathbf{x})} \Big[ \theta(\mathbf{x})^{\top} M_{\mathbf{x}}\, \theta(\mathbf{x}) - \lambda \big(\vec{1}^{\top} \theta(\mathbf{x}) - 1\big) \Big] = 2 M_{\mathbf{x}}\, \theta(\mathbf{x}) - \lambda \vec{1},$$

and therefore the optimal solution satisfies $\theta(\mathbf{x}) = \frac{\lambda}{2} M_{\mathbf{x}}^{-1} \vec{1}$. Substituting this into the constraint $\vec{1}^{\top} \theta(\mathbf{x}) = 1$, we obtain

$$\frac{\lambda}{2} = \frac{1}{\vec{1}^{\top} M_{\mathbf{x}}^{-1} \vec{1}},$$

and hence

$$\theta(\mathbf{x}) = \frac{M_{\mathbf{x}}^{-1} \vec{1}}{\vec{1}^{\top} M_{\mathbf{x}}^{-1} \vec{1}},$$

which yields the optimal solution in Equation 3.1.