1 Introduction
Combination models have been proven theoretically and empirically superior to single models in the state-of-the-art literature (Brown et al., 2005; Chandra & Yao, 2006b). The explosive growth of data challenges traditional single models to fit the true, complex distribution (Dietterich, 2000). It is difficult for a single model to cope with huge amounts of high-dimensional data, so its future predictions are inaccurate; Ding et al. (2018) likewise prove that there is no perfect model for all data. Many single models are prone to overfitting the training data in order to reduce errors, a situation that combination models can avoid (Perrone, 1993). In addition, scientists often train data with several different models, implement complex tuning efforts in each model, and ultimately select the best one. These lengthy steps indicate the large amount of time consumed in selecting the most suitable model for a given dataset in the absence of prior knowledge (Liu et al., 2000). Combination models do not reduce the time required to train individual submodels; however, in model selection only the best model is kept, and the remaining models, which may still contain important information, are discarded. Combination models fully exploit the information of multiple submodels to produce more robust predictions, which avoids the above problems to some extent. So far, combination models have achieved remarkable results in several research areas, such as meteorological forecasting (e.g., Lange et al., 2006; Zhang & Hanby, 2007; Xiao et al., 2018), energy (Baumeister & Kilian, 2015), finance (Kim et al., 2006), and agriculture (Wen & Guyer, 2012). In recent years, solutions based on combination models have often achieved good results in Kaggle competitions (e.g., Taieb & Hyndman, 2014; Hoch, 2015; Bojer & Meldgaard, 2020), which shows that combination models are promising.

After reviewing a large volume of advanced literature, we categorize combination models into two classes: ensemble models and predictions combination. Ensemble models sample in the input space, for example by filtering features according to feature importance (Mendes-Moreira et al., 2012) or by sampling the training data using cross-validation (LeBlanc & Tibshirani, 1996), and they generate a strong learner by combining multiple weak learners.
Practitioners are increasingly studying ensemble models, and ensemble learning is becoming a popular research field. Typical examples of ensemble models are Bagging (Breiman, 1996), Boosting (Freund et al., 1996), and Stacking (Wolpert, 1992).
Bagging, also known as Bootstrap Aggregation, is an intuitive ensemble algorithm proposed by Breiman (1996). Bagging samples different subsets of equal size from the entire training data, then uses these subsets to train homogeneous models in parallel. The predictions of these models are combined by majority voting as the final output. Random forest (Ho, 1995) is an application of Bagging that uses decision trees as submodels.
Similar to Bagging, Boosting repeatedly samples the data to create several weak learners, which are combined using majority voting. For classification, the difference is that Boosting produces three classifiers per iteration. The first classifier is trained on a randomly selected portion of the training data; the second classifier is trained on a sample of which only half can be correctly classified by the first classifier; the third classifier is trained on the data for which the predictions of the first two classifiers contradict each other. The final three-way majority vote produces the prediction results for this iteration.
The Stacking method was proposed by Wolpert (1992). It trains a hierarchical model to improve prediction. The first-layer learners are trained by cross-validation. Their outputs are fed into the second-layer learners, which, in addition to being trained on the existing data, identify the misclassifications (for classification problems) or errors (for regression problems) of the first-layer learners. The third-layer learners combine the training results from the previous two layers to produce a more robust model.
While predictions combination is also involved in ensemble models, the concept of predictions combination is broader, and its components need not all be weak learners. The candidate models of a predictions combination are heterogeneous models trained on the same input space. The outputs of all models are weighted and averaged to obtain more robust results. For example, neural networks tend to fall into local minima during training, which can be mitigated if the predictions of multiple neural networks are combined (e.g., Perrone, 1993; Alhamdoosh & Wang, 2014). In some pioneering studies, predictions combination is referred to as hybrid ensemble, but hybrid ensemble also covers the content of ensemble models. For the sake of distinction, this paper follows the term predictions combination. For load prediction, Salgado et al. (2006) used several support vector machines and neural networks as candidate models, ranked and filtered these models by mean absolute percentage error, and finally weighted the individual models with a single-layer linear neural network. Their hybrid ensemble model improves performance by 25% over the best single predictor.
Verma & Hassan (2011) practiced a hybrid Self-Organising Map and K-Means approach for medical classification and proposed two fusion strategies based on multilayer perceptrons.
Ala’raj & Abbod (2016) took five common approaches (neural networks, support vector machines, random forests, decision trees, and naive Bayes) as base classifiers and combined the predictions of each model using a consensus approach. The experimental results demonstrate the ability of the proposed method to improve the accuracy of credit scoring prediction.
Qi & Tang (2018) constructed a hybrid ensemble model for predicting slope stability in geology. Gaussian process classification, quadratic discriminant analysis, support vector machines, artificial neural networks, adaptively boosted decision trees, and k-nearest neighbours were utilized as submodels, and a genetic algorithm was used to calculate classification weights for each model. The hybrid ensemble model they designed was shown to outperform any single model, even though each single model already had its own optimal combination of parameters.
For predictions combination, Perrone (1993) demonstrates that weighted averaging performs better than basic simple averaging and can improve accuracy and reduce model covariance by removing the results of nearly similar models. Nevertheless, as studies of predictions combination have increased, some problems have arisen. Perrone (1993) also indicates that if a predictions combination contains too many submodels, it may bring about the opposite effect. LeBlanc & Tibshirani (1996) state that if the least squares regression method is used to assign weights to each submodel, the best-performing model may be assigned a weight of 1 while all the others have a weight of 0. In this case, there is little need to use the predictions combination.
Since then, scholars have identified model diversity as the key to the success of predictions combination. Brown (2004) states that it is important to examine the reasons for this success, especially the ability to automatically exploit the strengths and weaknesses of components within the combination; the diversity of components deserves to be explored in depth. Webb & Zheng (2004) demonstrate that increasing the diversity of members within a combination without increasing their test error inevitably reduces the prediction error of the combination. Chandra & Yao (2006b) emphasize that diversity and accuracy are key to constructing a predictions combination, and similar ideas are experimentally validated elsewhere (e.g., Alhamdoosh & Wang, 2014; Peng et al., 2020). Krawczyk & Wozniak (2014), on the other hand, suggest both building a diverse pool of models and finding the optimal model combination. Brown et al. (2005) provide quantitative methods for the diversity of predictions combination using ambiguity decomposition and bias-variance-covariance decomposition, which are described in Section 2.

Several methods to increase the diversity of submodels have been proposed. For ensemble models, practitioners often use cross-validation to obtain submodels, or choose different combinations of parameters for homogeneous models, followed by majority voting or weighted averaging of the model predictions. Cross-validation, however, provides limited improvement in model effectiveness: Stone (1974) proved as early as 1974 that estimators generated by cross-validation behave similarly. For classification,
Liu et al. (2000) trained evolutionary ensembles in which the k-means method was applied to cluster the nodes of the neural network; representative nodes within each cluster were then filtered out, a step that ensured the diversity of nodes. Chandra & Yao (2006a) developed the Pairwise Failure Crediting (PFC) method to measure model diversity by calculating the degree to which a model differs from the others. Sirovetnukul et al. (2011) pointed out that a predictions combination can learn negative knowledge from less well-performing models, which were easily ignored and removed in previous studies; this knowledge can help the models converge to better solutions while also producing diverse results.

Much empirical evidence demonstrates the effectiveness of Negative Correlation Learning (NCL) in increasing model diversity and improving combination models (e.g., Liu & Yao, 1999; Liu et al., 2000; Chandra & Yao, 2006b; Sirovetnukul et al., 2011; Alhamdoosh & Wang, 2014; Peng et al., 2020). NCL introduces a correlation penalty term into the error function of each model within the combination to measure the deviation of the model from the whole. All submodels can be trained simultaneously and interactively on the same training set, and the final result achieves a bias-variance-covariance balance. The prosperity of NCL in combination models has certainly proven it to be a successful solution. Current applications of NCL focus on ensemble models, especially neural networks. NCL is involved in the training process of each model with the intention of diversifying each submodel. Within the framework of a specific ensemble model, the training pattern is the same for each submodel, although diversity can still be obtained.
Zhao et al. (2010) and Mendes-Moreira et al. (2012) point out that, in addition to homogeneous models, heterogeneous models can also be used as candidate models to obtain diversity in the model selection phase with greater generalization. However, few studies have applied NCL to model selection and predictions combination. Tang et al. (2009) exploited a genetic algorithm (GA) to select models, with NCL as a penalty term modifying the objective function of the GA, but the GA ran too slowly and tended to fall into local optima.

To bridge these gaps, a generic predictions combination scheme is designed in this paper to solve the regression problem. Twelve well-established regression prediction methods, including ensemble models, generalized linear regression models, etc., are added to the model pool. Each predictor is trained, after which its predictions are generated. Cross-validation and grid search are applied in the training process of each submodel to fully train the predictor and obtain the optimal parameters. Thereafter, we view the process of model selection and combination as a nonconvex optimization problem, which is solved using the Gekko optimizer (Beal et al., 2018). NCL is added as a penalty term to the objective function of the optimization problem. Two weighting methods, error inverse weighting and error exponential weighting (Armstrong, 2001), are used to fine-tune the weights of the predictions combination. Our proposed predictions combination scheme achieves excellent results on three publicly available datasets, where the NCL and weighting methods each contribute positively.

The main contributions of this paper are twofold:

Theoretically, this paper classifies combination models into two categories, ensemble models and predictions combination, based on whether the input space is sampled or not.

Practically, this paper proposes a predictions combination scheme that incorporates NCL. Submodels with diversity are selected from the model pool by the novel nonconvex optimization solver Gekko. The two fine-tuned weighting methods also help to improve the performance of the predictions combination compared to the simple averaging method.
The rest of the paper is organized as follows. Section 2 introduces the theories and methods involved in our proposed framework. Section 3 presents the framework of predictions combination accounting for model diversity. In Section 4, we systematically investigate the application of the proposed method on three publicly available datasets and analyze the contributions of NCL and the two fine-tuned weighting methods to model performance improvement, respectively. Section 5 gives some discussion and Section 6 concludes the paper.
2 Related Works
In a regression problem, there is a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ containing $n$ samples. The target of the problem is to find a function $f$ that maps $x$ to $y$ based on the samples in $D$:

(1) $\hat{y}_i = f(x_i), \quad i = 1, \dots, n$

In machine learning, $f$ is also referred to as a model or an estimator. The mean squared error (MSE) is a general measure of accuracy, and minimizing it is the goal of the regression model:

(2) $\mathrm{MSE}(f) = \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2$
2.1 Ambiguity Decomposition
In the following analysis, we refer to Brown et al. (2005)’s two measures of predictions combination diversity: ambiguity decomposition and bias-variance-covariance decomposition. First, in a general scenario, $m$ submodels in a pool can form a predictions combination $\bar{f}$ by weighted averaging, where $\bar{f}$ is a convex combination of all components:

(3) $\bar{f}(x) = \sum_{j=1}^{m} w_j f_j(x), \quad w_j \ge 0, \quad \sum_{j=1}^{m} w_j = 1$

where $f_j(x)$ is the prediction of the $j$th model in the pool. Then the overall MSE can be calculated as a weighted average of the $\mathrm{MSE}(f_j)$ of each submodel, with the same weights as in $\bar{f}$:

(4) $\overline{\mathrm{MSE}} = \sum_{j=1}^{m} w_j \, \mathrm{MSE}(f_j) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} w_j \left( f_j(x_i) - y_i \right)^2$

Hence, according to the above derivation, $\mathrm{MSE}(\bar{f})$ can be viewed as two components:

(5) $\mathrm{MSE}(\bar{f}) = \sum_{j=1}^{m} w_j \, \mathrm{MSE}(f_j) - \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} w_j \left( f_j(x_i) - \bar{f}(x_i) \right)^2$

Formula 5 indicates that $\mathrm{MSE}(\bar{f})$ is less than the weighted average MSE of all submodels whenever the submodels are not identical, since the second (ambiguity) term is then positive. This fact reveals that the larger the differences between the submodels and the predictions combination, the larger the ambiguity term and the smaller the error of the predictions combination. In particular, without an established criterion to judge the best model in advance, it is more efficient to use the predictions combination directly, even if the error of one particular model is smaller than that of the combination.
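As a quick numerical sanity check of Formula 5, the decomposition can be verified on synthetic data; the sketch below uses NumPy and is illustrative only, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 4                                     # samples, submodels
y = rng.normal(size=n)                            # true targets
preds = y + rng.normal(scale=0.5, size=(m, n))    # m sets of noisy predictions
w = np.array([0.4, 0.3, 0.2, 0.1])                # convex weights, sum to 1

f_bar = w @ preds                                 # weighted-average combination
mse_comb = np.mean((f_bar - y) ** 2)              # MSE of the combination
weighted_mse = np.sum(w * np.mean((preds - y) ** 2, axis=1))
ambiguity = np.sum(w * np.mean((preds - f_bar) ** 2, axis=1))

# Formula 5: MSE(combination) = weighted MSE of submodels - ambiguity term
assert np.isclose(mse_comb, weighted_mse - ambiguity)
```

Because the ambiguity term is non-negative, the combination's MSE can never exceed the weighted average of the submodels' MSEs.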
2.2 Bias-variance-covariance Decomposition
The mean squared error of the submodels and the predictions combination is employed in the ambiguity decomposition to measure diversity, i.e., the larger the ambiguity term, the more diverse the combination. However, as the number of submodels increases, the more diverse they are, the more likely they are to deviate from the true values. This situation increases the first term of Formula 5, in which case increasing the diversity of the predictions combination is no longer so beneficial. Thus, how to balance the diversity and the accuracy of the submodels becomes of interest. Brown et al. (2005)’s bias-variance-covariance decomposition is a well-defined trade-off.

For simplicity, given the simple-average form of the predictions combination, $\bar{f} = \frac{1}{m} \sum_{j=1}^{m} f_j$, the bias-variance-covariance decomposition is given by the following equation:

(6) $E\big[ (\bar{f} - y)^2 \big] = \overline{\mathrm{bias}}^2 + \frac{1}{m} \overline{\mathrm{var}} + \left( 1 - \frac{1}{m} \right) \overline{\mathrm{cov}}$

in which $\overline{\mathrm{bias}}$, $\overline{\mathrm{var}}$, and $\overline{\mathrm{cov}}$ are the averaged bias, variance, and covariance of the models in the predictions combination, respectively. The formulas of the three terms are as follows:

(7) $\overline{\mathrm{bias}} = \frac{1}{m} \sum_{j=1}^{m} \left( E[f_j] - y \right)$

(8) $\overline{\mathrm{var}} = \frac{1}{m} \sum_{j=1}^{m} E\big[ (f_j - E[f_j])^2 \big]$

(9) $\overline{\mathrm{cov}} = \frac{1}{m(m-1)} \sum_{j=1}^{m} \sum_{k \ne j} E\big[ (f_j - E[f_j]) (f_k - E[f_k]) \big]$
Unlike ambiguity decomposition, where the accuracy of each submodel needs to be considered, the bias-variance-covariance decomposition shows that the error of the predictions combination can, in theory, be reduced by decreasing the covariance without increasing the bias and variance. In addition, the covariance term can be negative, implying that negative correlations between submodels can contribute to the predictions combination.
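The decomposition in Formula 6 can also be checked numerically. The following sketch simulates correlated submodel outputs at a fixed input point over many trials (a toy setup of our own, not from the paper) and verifies that the sample moments satisfy the identity exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
m, T = 3, 20000           # submodels, Monte Carlo trials
y = 1.0                   # true value at a fixed input point
# correlated submodel outputs: a shared noise component plus private noise
base = rng.normal(scale=0.3, size=T)
f = np.stack([y + 0.1 * j + base + rng.normal(scale=0.2, size=T)
              for j in range(m)])

f_bar = f.mean(axis=0)                       # simple-average combination
lhs = np.mean((f_bar - y) ** 2)              # E[(f_bar - y)^2]

mu = f.mean(axis=1, keepdims=True)           # per-model expectation E[f_j]
bias = np.mean(mu - y)                       # averaged bias (Formula 7)
var = np.mean(np.mean((f - mu) ** 2, axis=1))          # averaged variance (8)
cov_terms = [np.mean((f[j] - mu[j]) * (f[k] - mu[k]))  # averaged covariance (9)
             for j in range(m) for k in range(m) if j != k]
cov = np.mean(cov_terms)

rhs = bias ** 2 + var / m + (1 - 1 / m) * cov
assert np.isclose(lhs, rhs)
```

The shared `base` noise makes the covariance term large and positive; replacing it with anticorrelated noise would drive `cov` negative and shrink the combination error, as the text argues.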
2.3 Negative Correlation Learning
The ambiguity decomposition and the bias-variance-covariance decomposition present two ways to quantify the diversity of the predictions combination, while the bias-variance-covariance decomposition also reveals the role of negative correlation among submodels in reducing the combination error. Neither decomposition explains how to achieve diversity among submodels; this is solved by negative correlation learning (NCL), proposed by Liu & Yao (1999).
NCL was originally designed as a method for neural network ensembles. It adds a penalty term to the objective function of each individual network and trains all the networks simultaneously and interactively before combining them. The purpose of this training pattern is not to obtain multiple accurate and independent neural networks, but to capture the correlations and derive subnetworks with negative correlations using penalty terms, which in turn form a robust combination. Again, suppose there are $m$ subnetworks in the neural network ensemble and $n$ samples in the dataset. For each network, the objective function during training takes the form of the MSE, to which NCL adds a penalty term. The objective function of the $j$th network is given by the following equation:

(10) $e_j = \frac{1}{n} \sum_{i=1}^{n} \left[ \left( f_j(x_i) - y_i \right)^2 - \lambda \left( f_j(x_i) - \bar{f}(x_i) \right)^2 \right]$

where $\lambda$ is the negative correlation factor with a value between 0 and 1, and the second term is the penalty term. Formula 10 measures both the difference between each submodel and the true value $y$, and the difference between each submodel and the predictions combination $\bar{f}$, where $\lambda$ controls the strength of the negative correlation. When $\lambda$ equals 0, the objective function is equivalent to the MSE; when $\lambda$ equals 1, the negative correlation strength of the objective function is maximized. In summary, NCL presents a new perspective on obtaining diversity in neural network ensemble training, since each submodel can interact with the others under the control of the penalty term.
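Formula 10 can be computed directly from a matrix of predictions. The sketch below (our illustration; the function name and data are assumptions, not the paper's code) evaluates the per-model NCL objectives for a given weight vector:

```python
import numpy as np

def ncl_objective(preds, w, y, lam=0.5):
    """Per-model NCL objectives e_j (Formula 10): MSE minus a lambda-weighted
    penalty that rewards deviation from the combination f_bar."""
    f_bar = w @ preds                                # combined prediction
    mse_term = np.mean((preds - y) ** 2, axis=1)     # accuracy of each model
    penalty = np.mean((preds - f_bar) ** 2, axis=1)  # diversity w.r.t. f_bar
    return mse_term - lam * penalty

rng = np.random.default_rng(2)
y = rng.normal(size=50)
preds = y + rng.normal(scale=0.3, size=(4, 50))      # 4 submodels, 50 samples
w = np.full(4, 0.25)

e0 = ncl_objective(preds, w, y, lam=0.0)
# with lambda = 0, each e_j reduces to the plain MSE of model j
assert np.allclose(e0, np.mean((preds - y) ** 2, axis=1))
```

Raising `lam` toward 1 subtracts a larger diversity bonus, so a model far from the combination is penalized less than its raw MSE would suggest.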
3 Predictions Combination of Regression with Negative Correlation Learning
As one of the most fundamental mathematical problems, regression has a considerable number of well-established models designed from different perspectives. Predictions combination solves regression problems by taking a weighted average of the predictions of multiple models. Under certain circumstances this approach produces better results than any of the submodels, and it improves accuracy through error hedging, since the predictions combination fully takes the information from multiple models into account. However, predictions combination places high demands on model diversity: if all candidates are identical, the combination no longer works. In this paper, by introducing negative correlation learning into the predictions combination for regression, submodels with diversity are selected from the model pool, then combined and assigned weights, ultimately improving prediction accuracy.
Specifically, the research in this paper covers model pool construction, model training methods, the design of the optimization problem for predictions combination, and weight fine-tuning and evaluation. These contents are introduced in this section.
3.1 Model Pool Construction
Many ensemble models adopt cross-validation to train homogeneous models and perform majority voting to select the models that work well. In contrast, this paper draws on the conclusion of Mendes-Moreira et al. (2012) that heterogeneous candidate models provide greater diversity and generalize better in the model selection phase than homogeneous ones. In constructing the model pool, 12 common regression prediction models are selected in this paper, including ensemble models such as Random Forest, AdaBoost, and GBDT. Still following the notation of Section 2, the linear regression model views $f(x_i)$ as a linear combination of the inputs $x_{i1}, \dots, x_{il}$, in which there are $l$ features for each sample:

(11) $f(x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_l x_{il}$
Or, in matrix form, $\hat{Y} = X\beta + \beta_0$, in which $X$ is the input sample matrix, $\beta$ is the vector of regression coefficients, $\hat{Y}$ is the vector of predictions, and $\beta_0$ is the constant term. Following these notations, we present the 12 regression models in Table 5 in Appendix A. They are Simple Linear Regression (SLR) (Zou et al., 2003), Ridge Regression (RR) (Hoerl & Kennard, 1970), Lasso Regression (LR) (Santosa & Symes, 1986), Bayesian Regression (BR) (Box & Tiao, 2011), Stochastic Gradient Descent Regression (SGDR) (Bottou, 2010), Polynomial Regression (PR) (Stigler, 1974), Random Forest Regression (RFR) (Ho, 1995), Adaptive Boosting Regression (ABR) (Solomatine & Shrestha, 2004), Gradient Boosting Decision Tree (GBDT) (Friedman, 2001), Support Vector Regression (SVR) (Drucker et al., 1997), Decision Tree Regression (DTR) (Wu et al., 2008), and Multilayer Perceptron Regression (MLP) (Rosenblatt, 1961).
3.2 Model Training Methods
In this paper, the 12 models in the model pool are trained individually. To achieve the optimal performance of each model, grid search (Chicco, 2017) and cross-validation (Geisser, 1975) are used. Generally, when training a machine learning model, one would manually try different combinations of parameters to seek good performance. However, this practice is time-consuming and laborious, and it is difficult to find the best combination of parameters manually. Grid search automates all this work. Before applying grid search, the ranges of values of the different parameters of the model are defined and discretized. Subsequently, in combination with cross-validation, grid search traverses the parameter space defined in the previous step and evaluates the performance of each candidate. Grid search thus automates parameter selection, saves time, and obtains the optimal combination of parameters in a given parameter space based on the training set.
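The combination of grid search and cross-validation described above can be expressed compactly with scikit-learn's `GridSearchCV`; the paper does not name its implementation, so the library, the toy data, and the parameter grid below are our assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for one of the paper's datasets
X, y = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=0)

# discretized parameter space, searched exhaustively with 5-fold CV
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_)   # optimal combination within the grid
```

Each of the 12 pool models would get its own `param_grid`; the fitted `search.best_estimator_` is then used to generate that model's predictions.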
During model training, the fitted results tend to perform better on the training set and worse on a test set that is not involved in training. Cross-validation was developed to reduce this overfitting. The training set is divided into several equal parts; one part at a time is left out as the validation set, while the rest are used to train the model. Once each part of the data has served as the validation set, all the trained models are averaged to form a new model with better fitting ability on new data. Figure 1 shows the 5-fold cross-validation approach.
3.3 Optimization Problem for Predictions Combination
In this paper, we design an optimization problem to implement the predictions combination, in which candidate submodels are automatically selected and assigned weights. For a dataset of $n$ samples, the $m$ submodels in the pool are trained separately and produce $m$ sets of predictions, which form a matrix $P \in \mathbb{R}^{m \times n}$. Our aim is to obtain a weight vector $w$ for these predictions; the predictions combination is generated by the product of $w$ and $P$: $\bar{f} = w^{\top} P$.

Each submodel corresponds to an optimization objective function $e_j$, which can be calculated by Formula 10. The objective values of all submodels form a vector $e = (e_1, \dots, e_m)$, so that the overall objective function is

(12) $\min_{w} \; \sum_{j=1}^{m} e_j, \quad \text{s.t.} \quad \sum_{j=1}^{m} w_j = 1, \quad w_j \ge 0$
Each term $e_j$ in the objective function contains $\bar{f}$, which varies with the weight vector $w$ at each iteration; this makes the problem a nonconvex optimization problem. To solve it, we use the Gekko optimizer designed by Beal et al. (2018). As an algebraic modeling language, Gekko excels at solving dynamic optimization problems. In addition, Gekko is a Python library that integrates model building, analysis tools, and optimization visualization. Gekko provides a number of user-friendly built-in solvers that interact efficiently with the optimization models. The gradient-based interior-point optimizer IPOPT is the best known of these and is used as Gekko's default solver. For more information about IPOPT, please refer to Wächter & Biegler (2006).
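To illustrate the shape of this optimization problem without depending on the paper's Gekko/IPOPT setup, the sketch below solves the same constrained problem with SciPy's SLSQP as a stand-in solver; the synthetic predictions, $\lambda$ value, and objective form (Formula 12 built from Formula 10) are our assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, m = 100, 5
y = rng.normal(size=n)
# m submodels of varying accuracy form the prediction matrix P (m x n)
P = np.stack([y + rng.normal(scale=s, size=n)
              for s in (0.2, 0.3, 0.4, 0.5, 0.6)])
lam = 0.5

def total_objective(w):
    """Sum of per-model NCL objectives e_j; f_bar depends on w, so the
    problem is nonconvex in w."""
    f_bar = w @ P
    e = np.mean((P - y) ** 2, axis=1) - lam * np.mean((P - f_bar) ** 2, axis=1)
    return e.sum()

cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)  # weights sum to 1
res = minimize(total_objective, x0=np.full(m, 1 / m), method="SLSQP",
               bounds=[(0.0, 1.0)] * m, constraints=cons)
w = res.x
assert np.isclose(w.sum(), 1.0, atol=1e-6)
```

In the paper's pipeline, Gekko's IPOPT plays the role of SLSQP here; models whose optimal weight lands at (or near) zero are effectively deselected from the combination.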
3.4 Weight Finetuning and Evaluation
Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) are three common indexes used to evaluate the accuracy of regression models. Their formulas are as follows:

(13) $\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 }$

(14) $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$

(15) $\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right|$
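The three indexes translate directly into NumPy (a minimal sketch with made-up numbers, shown only to fix the conventions used later in Table 2):

```python
import numpy as np

def rmse(y, y_hat):
    return np.sqrt(np.mean((y_hat - y) ** 2))   # Formula 13

def mae(y, y_hat):
    return np.mean(np.abs(y_hat - y))           # Formula 14

def mape(y, y_hat):
    return np.mean(np.abs((y_hat - y) / y))     # Formula 15

y = np.array([100.0, 200.0, 400.0])
y_hat = np.array([110.0, 190.0, 440.0])
print(rmse(y, y_hat), mae(y, y_hat), mape(y, y_hat))
```

Note that MAPE is undefined when any true value $y_i$ is zero, which is not an issue for the strictly positive targets (prices, life expectancy, sales) used in this paper.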
Gekko generates the weights for each candidate model when solving the optimization problem (Formula 12). For better performance of the predictions combination on the three evaluation indexes, we give several weight fine-tuning schemes, drawing on the error inverse weights and error exponential weights designed by Armstrong (2001); their validity and selection are verified in subsequent experiments.

Let $E_j$ be the predictive error of the $j$th submodel, which can be one of RMSE, MAE, and MAPE. The error inverse weights (EIW) and error exponential weights (EEW) are written in the following forms:

(16) $w_j^{\mathrm{EIW}} = \frac{1 / E_j}{\sum_{k=1}^{m} 1 / E_k}$

(17) $w_j^{\mathrm{EEW}} = \frac{e^{-E_j}}{\sum_{k=1}^{m} e^{-E_k}}$

$w_j^{\mathrm{EIW}}$ and $w_j^{\mathrm{EEW}}$ are the weights of the $j$th submodel in Equation 16 and Equation 17, respectively. The intuitive goal of these two weight fine-tuning schemes is to assign higher weights to the models that predict well. Specifically, given the weight $w_j$ of the $j$th of the $m$ models generated by Gekko, the fine-tuned weights under EIW and EEW become:

(18) $\tilde{w}_j^{\mathrm{EIW}} = \frac{w_j \, w_j^{\mathrm{EIW}}}{\sum_{k=1}^{m} w_k \, w_k^{\mathrm{EIW}}}$

(19) $\tilde{w}_j^{\mathrm{EEW}} = \frac{w_j \, w_j^{\mathrm{EEW}}}{\sum_{k=1}^{m} w_k \, w_k^{\mathrm{EEW}}}$
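The fine-tuning step then amounts to rescaling the optimizer's weights and renormalizing. The sketch below is our reading of Equations 16-19 (the renormalized product is an assumption, as is all the toy data), not the paper's code:

```python
import numpy as np

def fine_tune(w, errors, scheme="EIW"):
    """Rescale optimizer weights w by error-inverse (EIW) or
    error-exponential (EEW) weights, then renormalize to sum to 1."""
    if scheme == "EIW":
        adj = 1.0 / errors            # Equation 16 (before normalization)
    else:
        adj = np.exp(-errors)         # Equation 17 (before normalization)
    adj = adj / adj.sum()
    w_new = w * adj                   # Equations 18-19: product, then
    return w_new / w_new.sum()        # renormalize

w = np.array([0.5, 0.3, 0.2])         # weights from the optimizer
errors = np.array([0.2, 0.4, 0.8])    # e.g. validation RMSE per submodel
w_eiw = fine_tune(w, errors, "EIW")

assert np.isclose(w_eiw.sum(), 1.0)
assert w_eiw[0] > w[0]                # the most accurate model gains weight
```

Either scheme leaves the weights on the simplex while shifting mass toward low-error submodels; EEW discounts large errors more gently than EIW when the errors are small.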
3.5 Framework of the Predictions Combination
Figure 2 demonstrates our proposed predictions combination framework incorporating NCL through an example. First, a model pool consisting of common regression models is constructed. Each model generates a set of predictions for the test set, where grid search is used to find the optimal combination of parameters from the parameter space and cross-validation is used to improve the robustness of the model. After these steps, each single model achieves the best results within its capability. Subsequently, an NCL-fused objective function is designed for model selection and weighting, to find submodels whose predictions are negatively correlated and thus enhance the diversity within the predictions combination. The weight of each model is automatically updated in the process of solving the objective function with the Gekko optimizer. After the candidate submodels and their corresponding weights are obtained, error inverse weights and error exponential weights are applied to further fine-tune the weights, assigning higher weights to models with good predictions while reducing the weights of models with poorer predictions. Finally, the weight-adjusted submodels are combined to produce a new set of predictions. In the following experiments, we explore the performance of this predictions combination.
4 Experiments
The experiments in this section are designed to validate the optimization problem incorporating the NCL method for improving the accuracy of prediction combinations. The following subsections expand on the datasets, experimental design, and analysis of the results.
4.1 Datasets and Preprocessing
Three public datasets from Kaggle were chosen for this paper: CarPrice (https://www.kaggle.com/nehalbirla/vehicledatasetfromcardekho?select=car+data.csv), LifeExpectancy (https://www.kaggle.com/kumarajarshi/lifeexpectancywho), and Walmart (https://www.kaggle.com/vik2012kvs/walmartdataretailanalysis). Kaggle (https://www.kaggle.com/) is an active online community of data scientists and machine learning practitioners that regularly publishes machine learning competitions, provides public datasets, and more. CarPrice contains used-car transaction data with vehicle and driving information; the dependent variable is the car price. LifeExpectancy is a dataset of population life expectancy and related factors for each country from 2000 to 2015 according to the World Health Organization, including the development status of countries, adult and infant mortality, alcohol consumption, and more. Walmart, a Walmart sales dataset, includes external factors such as temperature, fuel price, CPI, and unemployment in addition to stores and dates. All three datasets can be used for regression, and their basic information is shown in Table 1.
Dataset  # Samples  # Features  Max of Y  Min of Y  Mean of Y  Median of Y  Std of Y 
CarPrice  4,332  7  8,900,000  20,000  504,784.98  350,500  578,800.10 
LifeExpectancy  2,938  21  89  36.3  69.22  72.1  9.52 
Walmart  6,435  7  3,818,686  209,986.30  1,046,964.88  960,746  564,322.80 
For the three datasets, we performed One-Hot encoding of the nominal variables among the features and represented the ordinal variables with consecutive numerical ranks. In this paper, we assume that the distribution of the training set samples is representative of the overall distribution, so for the training set, data standardization was applied to all variables other than the nominal ones. The reason for not using data normalization is to avoid the effect of extreme values on the whole. The standardized data follow a normal distribution with a mean of 0 and a standard deviation of 1. For the validation set and test set, the data were standardized using the mean and standard deviation of the training set. The equation of data standardization is as follows:

(20) $x' = \frac{x - \mathrm{mean}}{\mathrm{std}}$

where $x$ is the original data, $x'$ is the standardized data, and mean and std are the mean and standard deviation of the $n$ training samples.
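The key point of Equation 20 is that the validation and test sets reuse the training statistics rather than their own; a minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# statistics come from the training set only (Equation 20)
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std   # reuse training statistics

assert np.allclose(X_train_std.mean(axis=0), 0.0, atol=1e-12)
assert np.allclose(X_train_std.std(axis=0), 1.0)
```

Standardizing the test set with its own statistics would leak information from unseen data into preprocessing, which the split in the next paragraph is designed to prevent.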
When dividing the data, we randomly selected 80% of all samples as the training set for training all models in the pool separately. 16% of the data was used as the validation set for verifying our proposed NCL-based model selection and weighting optimization method and for deriving the weights. The remaining 4% naturally becomes the test set, used to evaluate the performance of the predictions combination on brand-new data.
4.2 Weightbased Predictions Combination
4.2.1 The Basic Predictions Combination: Simple Average
At the very beginning, each of the three datasets was trained on by the 12 models from the model pool. Parameter grid search and cross-validation during this stage ensure, to some extent, that each model exploits its capabilities as much as possible. The optimal combination of parameters for each model selected by the grid search is given in Table 6 in Appendix B. RMSE, MAE, and MAPE are employed as the three prediction indexes to evaluate the models. For comparison, the simple-average predictions of the candidate models are added in Figure 3, Figure 4, and Figure 5, in which the sums of the indexes are ranked in increasing order.
As shown in Figure 3, Figure 4, and Figure 5, the best models for the three datasets are GBDT, MLP, and GBDT, respectively. A single model carries strong uncertainty, due to factors such as the size and features of the dataset itself. For example, PR performs the worst on the first two datasets compared with the other models, but it ranks third in prediction on the third dataset; MLP gives the best prediction on the second dataset, yet it is the second-worst model on the first dataset. This fact again verifies that there is no perfect model for all datasets, even though each model has already reached its best under its own parameter combination.
To reduce the effect of model uncertainty, a method that simply averages all model predictions is often used in predictions combination. This naive approach is sometimes effective, i.e., it can achieve results comparable to the optimal model in the pool. When working with a new dataset, it is undoubtedly time-consuming to determine the predictive effect of each model individually, and a direct simple average is an effective approach in this case. In our experiments, however, the simple average ranked only fourth, eighth, and tenth in performance on the three datasets, respectively. This unsatisfactory performance necessitates continued exploration of ways to optimize the predictions combination to achieve higher accuracy.
4.2.2 Weight Fine-tuning for All Models
The first attempt to optimize the predictions combination is to adjust the weights of the submodels. An intuitive idea is to assign high weights to models that perform well and low weights to those that perform poorly, while constraining the weights of all models to sum to 1. The error inverse weights (EIW) and error index weights (EEW) from Section 3 are used to fine-tune the weights. The three error indexes on the validation set, RMSE, MAE, and MAPE, are used in turn as the Error term in EIW and EEW.
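A minimal sketch of the EIW idea, assuming it normalizes the reciprocals of the validation errors so the weights sum to 1 (the exact formulas for EIW and EEW are given in Section 3):

```python
import numpy as np

def eiw_weights(errors):
    """Error inverse weights (EIW): weight each submodel by the reciprocal of
    its validation error, normalized so the weights sum to 1 (assumed form)."""
    inv = 1.0 / np.asarray(errors, dtype=float)
    return inv / inv.sum()

# e.g. validation RMSE of three submodels (illustrative numbers)
w = eiw_weights([0.2, 0.4, 0.4])
```

A model with half the error of another thus receives twice its weight, which matches the intuition of rewarding well-performing submodels.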
Table 2: Results of weight fine-tuning for all models.

Model          CarPrice                LifeExpectancy          Walmart
               RMSE    MAE     MAPE    RMSE    MAE     MAPE    RMSE    MAE     MAPE
Best_model     0.3744  0.1908  1.0161  0.1852  0.1098  0.4036  0.2264  0.1190  0.4530
Simple_avg     0.4099  0.2519  1.3972  0.2965  0.1963  0.8136  0.3372  0.2342  0.6613
FT_EIW_RMSE    0.3705  0.2338  1.3654  0.1948  0.1319  0.4976  0.2737  0.1623  0.5868
FT_EIW_MAE     0.3736  0.2348  1.3659  0.1956  0.1327  0.5054  0.2655  0.1518  0.5765
FT_EIW_MAPE    0.4328  0.2654  1.2814  0.1998  0.1350  0.4931  0.3250  0.2234  0.6007
FT_EEW_RMSE    0.3800  0.2395  1.3890  0.2106  0.1461  0.5575  0.2947  0.1873  0.6138
FT_EEW_MAE     0.3954  0.2459  1.3962  0.2387  0.1639  0.6527  0.2963  0.1891  0.6154
FT_EEW_MAPE    0.4689  0.2907  1.1772  0.1993  0.1367  0.4939  0.3165  0.2149  0.5677
The experimental results in Table 2 demonstrate some interesting phenomena. First, as shown in Section 4.2.1, the simple average performs worse than the best submodel, possibly because some submodels have excessive errors. After weight fine-tuning, the weights of these error-prone submodels are reduced accordingly, bringing different gains on the three prediction indexes. Specifically, in both weight fine-tuning schemes, EIW and EEW, RMSE-based and MAE-based fine-tuning reduce the RMSE and MAE errors more obviously than MAPE-based fine-tuning. This pattern is more evident for EIW: FT_EIW_RMSE, FT_EIW_RMSE, and FT_EIW_MAE achieve the best results on the three datasets, respectively, when only the RMSE and MAE indexes are considered. In addition, MAPE-based fine-tuning performs excellently in reducing the MAPE error. However, FT_EIW_MAPE and FT_EEW_MAPE should be used with caution when weighting all submodels, as they do not yield high gains in RMSE and MAE compared with RMSE-based and MAE-based fine-tuning, and may even perform worse than the simple average, e.g., on the CarPrice dataset.
4.3 NCL-based Predictions Combination
In practice, not all submodels in the model pool are necessary for constructing the predictions combination. Submodels with negative correlations should be selected to increase the internal diversity of the predictions combination and thus improve prediction accuracy. In this section, the possibility of implementing an NCL-based predictions combination is explored.
4.3.1 Effect of NCL Penalty Intensity
For each submodel, the second term in its objective function (Equation 10) is the NCL penalty term, where λ is a negative correlation factor that measures the penalty intensity. When λ equals 0, the objective function is just an ordinary MSE without any penalty; when λ equals 1, the objective function has the maximum penalty intensity. In Figure 6, Figure 7, and Figure 8 we plot how the three accuracy indexes vary with λ on each dataset. The four lines in each subplot represent the best model for the dataset, the simple average, the optimal weight fine-tuning scheme containing all models, and the result after NCL-based selection and weighting.
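Equation 10 is not reproduced here; as a sketch, assuming the penalty takes the classic form of Liu & Yao (1999), the objective for submodel i is its MSE plus λ times the penalty p_i = (f_i − f̄) Σ_{j≠i} (f_j − f̄):

```python
import numpy as np

def ncl_objective(pred_i, preds_all, y, lam):
    """MSE of submodel i plus lam times the classic NCL penalty
    (Liu & Yao, 1999): p_i = (f_i - f_bar) * sum_{j != i} (f_j - f_bar).
    Assumed form; the paper's exact Equation 10 may differ."""
    m = preds_all.shape[0]
    f_bar = preds_all.mean(axis=0)                 # ensemble mean prediction
    others = preds_all.sum(axis=0) - pred_i        # sum of f_j over j != i
    penalty = (pred_i - f_bar) * (others - (m - 1) * f_bar)
    return float(np.mean((pred_i - y) ** 2 + lam * penalty))
```

With lam = 0 this reduces to plain MSE; larger lam rewards submodels whose errors are negatively correlated with the rest of the pool.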
On CarPrice, NCL is not superior when λ takes a small value. As λ increases, the NCL line in Figure 6 drops particularly fast. In terms of RMSE, NCL consistently outperforms the best model; for MAE, NCL surpasses the best model once λ exceeds 0.2; for MAPE, the NCL line breaks through the baseline of the best model when λ reaches 0.5. The accuracy of NCL is highest when λ = 0.6, after which it decreases as λ increases further.
On LifeExpectancy, the NCL line hits the RMSE baseline of the best model when λ exceeds 0.02 and maintains its outperformance thereafter. On the MAE and MAPE errors, NCL comes particularly close to the best model when λ equals 0.05 and 0.07, respectively.
On Walmart, the NCL line remains fairly stable in RMSE and MAE, almost in line with the best model, and slightly outperforms it for some values of λ. However, in MAPE the NCL error keeps increasing as λ grows, indicating that increasing λ brings little benefit to MAPE accuracy.
In summary, this section illustrates the importance of λ in NCL-based predictions combination. NCL is evidently far superior to both the simple average and the weighted average over all models, and with an appropriate λ it can easily produce predictions that exceed or remain close to the best model. The effect of NCL in improving accuracy is less obvious when λ is small. As λ increases, NCL begins to show its superiority in the predictions combination. However, if λ is too large, this superiority may become a burden that brings no further error reduction.
4.3.2 Model Subset Weighting and Negative Correlation
After selecting appropriate λ values and using Gekko to optimize the objective function of the predictions combination, we obtained different schemes suitable for different datasets. The predictions combination contains the model selection and weighting results, which are generated automatically by the optimizer. For the three datasets, after choosing the λ that makes NCL perform best on all prediction indexes, the results are shown in Table 3. A vector of length 12 represents the model selection and weighting result; each position of the vector corresponds to a submodel, in the order given in Section 3.1. If an element equals 0, the submodel at that position is not selected; the larger the element, the more important the corresponding submodel is in the predictions combination.
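The paper performs this optimization with Gekko; as a stand-in sketch, `scipy.optimize.minimize` (SLSQP) shows the shape of the constrained problem — nonnegative weights summing to 1 — with the NCL penalty omitted and plain ensemble MSE as the objective for brevity:

```python
import numpy as np
from scipy.optimize import minimize

def combine_weights(preds, y):
    """Find nonnegative weights summing to 1 that minimize the MSE of the
    weighted predictions combination (NCL penalty omitted in this sketch)."""
    m = preds.shape[0]

    def loss(w):
        # MSE of the weighted combination w @ preds against the targets y
        return float(np.mean((w @ preds - y) ** 2))

    res = minimize(loss, np.full(m, 1.0 / m),          # start from equal weights
                   bounds=[(0.0, 1.0)] * m,            # each weight in [0, 1]
                   constraints=[{"type": "eq",
                                 "fun": lambda w: w.sum() - 1.0}],
                   method="SLSQP")
    return res.x

preds = np.array([[1.0, 2.0, 3.0],    # submodel A (illustrative)
                  [3.0, 4.0, 5.0]])   # submodel B (illustrative)
y = np.array([2.0, 3.0, 4.0])         # targets midway between A and B
w = combine_weights(preds, y)
```

Weights driven to 0 by the optimizer correspond to submodels being dropped from the combination, which is how the selection and weighting emerge from a single optimization.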
Table 3: Results of model selection and weighting.

Dataset          λ      Results of Model Selection and Weighting
CarPrice         0
                 0.6
LifeExpectancy   0
                 0.05
Walmart          0
                 0.42
On all three datasets, when λ = 0, the optimization of the objective function always favors selecting a single model. When λ = 0.6, CarPrice selects GBDT, SVR, and DTR as the model subset for the predictions combination; when λ = 0.05, LifeExpectancy selects BR, GBDT, and MLP; when λ = 0.42, Walmart selects the subset of GBDT, DTR, and MLP. These selections do not correspond to the top models in the accuracy ranking given in Section 4.2.1, demonstrating that NCL is not just a simple method of ranking and selecting submodels.
The following heat maps show that the subsets of models selected by our proposed method do have negative correlations. First, for the test set of each dataset, the predictions combination, denoted ŷ_c, is generated according to the model selection and weighting results in Table 3. The combined prediction is subtracted from each submodel's predictions ŷ_i to measure the degree of model bias, i.e., b_i = ŷ_i − ŷ_c. Next we derive the Pearson correlation coefficients for all b_i as well as ŷ_c, which are presented as heat maps.
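The bias-and-correlation computation can be sketched as follows (the prediction values and weights below are illustrative, not those of Table 3):

```python
import numpy as np

# rows: per-submodel predictions on the test set (illustrative numbers)
preds = np.array([[1.0, 2.0, 4.0, 3.0],
                  [2.0, 1.0, 3.0, 5.0],
                  [4.0, 3.0, 1.0, 2.0]])
weights = np.array([0.5, 0.25, 0.25])        # hypothetical selection weights

y_c = weights @ preds                        # predictions combination
bias = preds - y_c                           # b_i = y_hat_i - y_c, per submodel

# Pearson correlation matrix over all b_i plus y_c, as drawn in the heat maps
corr = np.corrcoef(np.vstack([bias, y_c]))
```

Off-diagonal entries below zero in `corr` are exactly the negative correlations the selected subset is expected to exhibit.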
In Figure 9, Figure 10, and Figure 11, the darker a color block, the stronger the positive correlation of the element at the corresponding position; the lighter, the stronger the negative correlation. The right part of each figure shows that the selected subset is the most negatively correlated combination of submodels in the model pool. These experimental results verify that the model selection and weighting method incorporating NCL can indeed select a subset of models with negative correlations.
4.3.3 Weight Fine-tuning for the Predictions Combination with NCL
The predictions combination with NCL performs excellently on all three datasets. In this section we explore whether there is room to further improve its accuracy. We fine-tune the weights generated by the Gekko optimizer according to the six schemes described in Section 4.2.2, and the results are presented in Table 4.
Table 4: Results of weight fine-tuning for the predictions combination with NCL.

Model            CarPrice                LifeExpectancy          Walmart
                 RMSE    MAE     MAPE    RMSE    MAE     MAPE    RMSE    MAE     MAPE
Best_model       0.3744  0.1908  1.0161  0.1852  0.1098  0.4036  0.2264  0.1190  0.4530
Proposed Method  0.3096  0.1793  0.9837  0.1765  0.1115  0.4191  0.2261  0.1183  0.4598
FT_EIW_RMSE      0.3099  0.1793  0.9920  0.1766  0.1116  0.4189  0.2261  0.1184  0.4586
FT_EIW_MAE       0.3097  0.1792  0.9870  0.1771  0.1112  0.4139  0.2262  0.1184  0.4583
FT_EIW_MAPE      0.3095  0.1792  0.9786  0.1753  0.1113  0.4273  0.2262  0.1185  0.4555
FT_EEW_RMSE      0.3097  0.1793  0.9872  0.1772  0.1117  0.4171  0.2261  0.1184  0.4596
FT_EEW_MAE       0.3096  0.1793  0.9845  0.1772  0.1117  0.4165  0.2261  0.1184  0.4596
FT_EEW_MAPE      0.3096  0.1792  0.9832  0.1765  0.1116  0.4215  0.2262  0.1185  0.4543
For the first two datasets, fine-tuning the weights brings further gains over our proposed method. The EIW-based fine-tuning schemes work better than the EEW-based ones, especially on the RMSE and MAE indexes. For Walmart, although the proposed method outperforms the best submodel in RMSE and MAE, none of the weight fine-tuning methods can further improve the RMSE and MAE accuracy of the predictions combination. However, for the MAPE index, some gains can still be obtained if the MAPE-based schemes are used to fine-tune the weights.
5 Discussion
Ensemble models and predictions combination, two classes of combination models, have made great progress in both research and practice. However, predictions combination still faces the problems of choosing an appropriate model subset and assigning weights to the submodels. Simply averaging the predictions of all submodels does not achieve the expected results, and even the corrections offered by weighted averaging are limited. This study proposes a novel predictions combination method that automatically selects models and generates appropriate weights, yielding performance comparable to the optimal submodels.
Diversity is essential for the success of a predictions combination. Unlike previous studies that increase model diversity while training homogeneous models, this study explores selecting a subset of negatively correlated models from a heterogeneous model pool. The proposed approach uses a negative correlation learning penalty term to control the optimization of the objective function during model selection; a reasonable penalty strength helps select a diverse subset of models. Weight fine-tuning then further optimizes the weights of the predictions combination, helping it balance diversity and accuracy.
The empirical results of this paper support that considering diversity in the model selection process is effective. The predictions combination incorporating NCL is far more accurate than simple averaging and some intuitive weighted averaging methods on RMSE, MAE, and MAPE, and it can match or even exceed the best submodels when an appropriate value of the NCL factor λ is selected. Another advantage of this study is that the prediction results do not depend on a particular class of models, since any advanced model, such as GBDT, can be added to the model pool. In addition, compared with a genetic algorithm, the Gekko optimizer adopted in the model selection process has higher computational efficiency, and the weights of the submodels are given automatically during optimization.
A limitation of this paper is that the 12 models in the model pool do not cover all established models in the regression field, which also leaves researchers the freedom to substitute candidate models in follow-up work. This paper also lacks a deeper exploration of how the type and number of submodels in the model pool affect the predictions combination.
6 Conclusion
We developed a predictions combination approach incorporating model diversity. Negative correlation learning acts as a penalty term for the objective function to be optimized, assisting in the model selection process to find those subsets with diversity. Experiments on three publicly available regression datasets confirm the effectiveness of this approach.
First of all, the proposed method is user-friendly. Its framework is easy to understand, and practitioners no longer need to evaluate individual models against various accuracy indexes to select the best one, nor do they need to weight the candidate models blindly, since the predictions combination with NCL can, under an appropriate penalty strength, deliver prediction accuracy that approximates or exceeds that of the best submodel. In addition, the predictions of any model can be added to the model pool as an element of the calculation; even if that model performs poorly, the method will discard it automatically. Therefore, our proposed method has practical implications.
Appendix A
Model descriptions and objective functions (notation: X is the feature matrix, y the target vector, w the regression coefficients):

SLR — SLR seeks to minimize the sum of squared residuals between the data and the predictions when fitting the model; it assumes that the better the fit, the closer the data points and predictions lie on the graph. Objective: min_w ||Xw − y||_2^2, where ||·||_2 is the Euclidean norm.

RR — RR adds a regular term to the objective function to find a better solution from the solution space compared with SLR. Objective: min_w ||Xw − y||_2^2 + α||w||_2^2, where α is the parameter of the regular term.

LR — Lasso is short for least absolute shrinkage and selection operator. LR can be applied for variable selection and regularization to achieve better results with fewer variables. Objective: min_w (1/2n)||Xw − y||_2^2 + α||w||_1, where α is the parameter of the regular term and ||·||_1 is the l1-norm.

BR — BR utilizes the principle of Bayesian inference, assuming that the error terms of the regression model obey a normal distribution. The prior distribution of the data has some specific form, and the posterior probabilities of the model parameters can be computed. BR assumes that y follows a Gaussian distribution: p(y | X, w, α) = N(Xw, α), where α is a latent variable obtained in the process of model inference.

SGDR — SGDR does not correspond to a particular machine learning model but is a training method and optimization technique. Objective: min_w (1/n) Σ_i L(y_i, f(x_i)) + α R(w), where the loss L can be chosen from different models, like SLR, and R is the regular term, like the l1-norm or l2-norm. w can be updated by w ← w − η(α ∂R(w)/∂w + ∂L/∂w), where η is the learning rate; the intercept b is updated in the same way.

PR — PR incorporates a polynomial combination of features when training a linear model, and this approach is suitable for large-scale data situations. For two features and degree 2, the fitted model is ŷ = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2. Objective: min_w ||Zw − y||_2^2, where Z is the expanded feature matrix and w the regression coefficients.

RFR — RFR improves model performance by ensembling multiple decision trees. A single decision tree tends to grow too deep in training and produces large variance. In RFR, multiple decision trees are trained separately on different subsets of the training set, and the training results are averaged so that the variance is reduced. Prediction: ŷ(x) = (1/B) Σ_b T_b(x), the average prediction of the B trees for the observation.

ABR — ABR trains a series of weak learners and weights their outputs to get more accurate results. The sample weights are initialized as a simple average; during training, the model assigns high weights to the well-performing learners and low weights to the poor ones, and each round it focuses more on the samples that were predicted badly. Prediction: F(x) = Σ_m β_m h_m(x), where h_m is the weak learner and β_m its weight.

GBDT — GBDT ensembles a series of fixed-size decision trees, each of which improves on its predecessors. It uses the model combination F_M(x) = Σ_m γ_m h_m(x), where h_m and γ_m are the prediction and weight of the m-th submodel, together with the forward stagewise update F_m(x) = F_{m−1}(x) + γ_m h_m(x), where each h_m minimizes the loss Σ_i L(y_i, F_{m−1}(x_i) + h(x_i)) given the current model F_{m−1} and the fit of the training samples.

SVR — SVR only considers part of the training data, since data points whose deviation lies within the ε margin are ignored when constructing the loss function. Objective: min_{w,b} (1/2)||w||^2 + C Σ_i (ξ_i + ξ_i*), subject to y_i − w·φ(x_i) − b ≤ ε + ξ_i, w·φ(x_i) + b − y_i ≤ ε + ξ_i*, and ξ_i, ξ_i* ≥ 0.

DTR — DTR builds a tree structure, mapping features to branches and target values to leaf nodes. Given the data Q at node m of the decision tree, the node produces a candidate split θ consisting of a feature and a threshold. Data below the threshold goes down the left branch, and the rest goes down the right branch; the data in the left and right child nodes are denoted Q_left and Q_right. An impurity function H is chosen to calculate the impurity of the data Q at node m, and the model is improved by optimizing, at each split, G(Q, θ) = (n_left/n_m) H(Q_left(θ)) + (n_right/n_m) H(Q_right(θ)), where n_left, n_right, and n_m are the sizes of the data at the left child node, the right child node, and node m. For regression, the impurity function can be chosen as the mean squared error or the mean absolute error.

MLP — MLP employs a multilayer neural network to map multidimensional data into one dimension. MLP consists of an input layer, hidden layers, and an output layer. Each neuron in a hidden layer acts as a regressor performing a weighted summation of the previous layer's outputs, followed by an activation function. MLP updates its parameters using stochastic gradient descent during backpropagation. Loss: (1/2)||ŷ − y||_2^2 + (α/2)||W||_2^2, where W are the regression coefficients and α is the penalty intensity factor.
Appendix B
Table 6: Optimal parameter combinations selected by grid search (pf is short for polynomial features).

SLR
- CarPrice: n_jobs: 1
- LifeExpectancy: n_jobs: 1
- Walmart: n_jobs: 1

RR
- CarPrice: alpha: 0.5, max_iter: 100, solver: lsqr, tol: 0.0001
- LifeExpectancy: alpha: 0.5, max_iter: 100, solver: auto, tol: 0.0001
- Walmart: alpha: 0.5, max_iter: 100, solver: lsqr, tol: 0.0001

LR
- CarPrice: alpha: 0.5, max_iter: 100, positive: True, precompute: True, selection: cyclic, tol: 0.0001, warm_start: True
- LifeExpectancy: alpha: 0.5, max_iter: 500, positive: False, precompute: False, selection: random, tol: 0.01, warm_start: False
- Walmart: alpha: 0.5, max_iter: 100, positive: True, precompute: True, selection: cyclic, tol: 0.0001, warm_start: True

BR
- CarPrice: alpha_1: 0.0001, alpha_2: 1e-06, compute_score: True, fit_intercept: False, lambda_1: 1e-06, lambda_2: 0.0001, n_iter: 100, tol: 0.001
- LifeExpectancy: alpha_1: 0.0001, alpha_2: 1e-06, compute_score: True, fit_intercept: True, lambda_1: 1e-06, lambda_2: 0.0001, n_iter: 100, tol: 0.0001
- Walmart: alpha_1: 0.0001, alpha_2: 1e-06, compute_score: True, fit_intercept: True, lambda_1: 1e-06, lambda_2: 0.0001, n_iter: 100, tol: 0.01

SGDR
- CarPrice: alpha: 1e-05, learning_rate: adaptive, loss: squared_epsilon_insensitive, max_iter: 1000, penalty: l2, tol: 0.001
- LifeExpectancy: alpha: 0.0001, learning_rate: adaptive, loss: epsilon_insensitive, max_iter: 1500, penalty: l1, tol: 0.0001
- Walmart: alpha: 1e-05, learning_rate: adaptive, loss: squared_loss, max_iter: 1000, penalty: elasticnet, tol: 0.01

PR
- CarPrice: pf_degree: 3, pf_include_bias: False, pf_interaction_only: False, pf_order: F
- LifeExpectancy: pf_degree: 2, pf_include_bias: True, pf_interaction_only: True, pf_order: F
- Walmart: pf_degree: 2, pf_include_bias: False, pf_interaction_only: False, pf_order: F

RFR
- CarPrice: bootstrap: True, max_depth: 4, min_samples_leaf: 3, min_samples_split: 2, n_estimators: 50
- LifeExpectancy: bootstrap: True, max_depth: 4, min_samples_leaf: 3, min_samples_split: 3, n_estimators: 200
- Walmart: bootstrap: True, max_depth: 4, min_samples_leaf: 3, min_samples_split: 4, n_estimators: 100

ABR
- CarPrice: learning_rate: 1, loss: square, n_estimators: 10
- LifeExpectancy: learning_rate: 0.1, loss: linear, n_estimators: 100
- Walmart: learning_rate: 0.01, loss: exponential, n_estimators: 50

GBDT
- CarPrice: learning_rate: 0.1, loss: huber, max_depth: 4, min_samples_split: 3, n_estimators: 200
- LifeExpectancy: learning_rate: 0.1, loss: ls, max_depth: 4, min_samples_split: 3, n_estimators: 200
- Walmart: learning_rate: 0.5, loss: huber, max_depth: 4, min_samples_split: 2, n_estimators: 200

SVR
- CarPrice: C: 2, degree: 2, gamma: scale, kernel: rbf
- LifeExpectancy: C: 2, degree: 2, gamma: scale, kernel: linear
- Walmart: C: 2, degree: 2, gamma: scale, kernel: rbf

DTR
- CarPrice: max_features: auto, min_samples_leaf: 3, min_samples_split: 2, splitter: random
- LifeExpectancy: max_features: auto, min_samples_leaf: 3, min_samples_split: 2, splitter: best
- Walmart: max_features: auto, min_samples_leaf: 3, min_samples_split: 2, splitter: random

MLP
- CarPrice: activation: tanh, alpha: 1e-05, learning_rate: adaptive, solver: adam
- LifeExpectancy: activation: tanh, alpha: 0.0001, learning_rate: constant, solver: adam
- Walmart: activation: relu, alpha: 0.0001, learning_rate: adaptive, solver: adam
References
 Ala’raj & Abbod (2016) Ala’raj, M., & Abbod, M. F. (2016). A new hybrid ensemble credit scoring model based on classifiers consensus system approach. Expert Systems with Applications, 64, 36–55.
 Alhamdoosh & Wang (2014) Alhamdoosh, M., & Wang, D. (2014). Fast decorrelated neural network ensembles with random weights. Information Sciences, 264, 104–117.
 Armstrong (2001) Armstrong, J. S. (2001). Principles of forecasting: a handbook for researchers and practitioners volume 30. Springer Science & Business Media.
 Baumeister & Kilian (2015) Baumeister, C., & Kilian, L. (2015). Forecasting the real price of oil in a changing world: a forecast combination approach. Journal of Business & Economic Statistics, 33, 338–351.
 Beal et al. (2018) Beal, L. D., Hill, D. C., Martin, R. A., & Hedengren, J. D. (2018). Gekko optimization suite. Processes, 6, 106.
 Bojer & Meldgaard (2020) Bojer, C. S., & Meldgaard, J. P. (2020). Kaggle forecasting competitions: An overlooked learning opportunity. International Journal of Forecasting.
 Bottou (2010) Bottou, L. (2010). Largescale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010 (pp. 177–186). Springer.
 Box & Tiao (2011) Box, G. E., & Tiao, G. C. (2011). Bayesian inference in statistical analysis volume 40. John Wiley & Sons.
 Breiman (1996) Breiman, L. (1996). Bagging predictors. Machine learning, 24, 123–140.

 Brown (2004) Brown, G. (2004). Diversity in neural network ensembles. Ph.D. thesis Citeseer.
 Brown et al. (2005) Brown, G., Wyatt, J., Harris, R., & Yao, X. (2005). Diversity creation methods: a survey and categorisation. Information Fusion, 6, 5–20.

 Chandra & Yao (2006a) Chandra, A., & Yao, X. (2006a). Ensemble learning using multiobjective evolutionary algorithms. Journal of Mathematical Modelling and Algorithms, 5, 417–445.
 Chandra & Yao (2006b) Chandra, A., & Yao, X. (2006b). Evolving hybrid ensembles of learning machines for better generalisation. Neurocomputing, 69, 686–700.
 Chicco (2017) Chicco, D. (2017). Ten quick tips for machine learning in computational biology. BioData mining, 10, 1–17.
 Dietterich (2000) Dietterich, T. G. (2000). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1–15). Springer.
 Ding et al. (2018) Ding, J., Tarokh, V., & Yang, Y. (2018). Model selection techniques: An overview. IEEE Signal Processing Magazine, 35, 16–34.
 Drucker et al. (1997) Drucker, H., Burges, C. J., Kaufman, L., Smola, A., Vapnik, V. et al. (1997). Support vector regression machines. Advances in neural information processing systems, 9, 155–161.
 Freund et al. (1996) Freund, Y., Schapire, R. E. et al. (1996). Experiments with a new boosting algorithm. In icml (pp. 148–156). Citeseer volume 96.
 Friedman (2001) Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, (pp. 1189–1232).
 Geisser (1975) Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American statistical Association, 70, 320–328.
 Ho (1995) Ho, T. K. (1995). Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition (pp. 278–282). IEEE volume 1.
 Hoch (2015) Hoch, T. (2015). An ensemble learning approach for the kaggle taxi travel time prediction challenge. In DC@ PKDD/ECML.
 Hoerl & Kennard (1970) Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
 Kim et al. (2006) Kim, M.J., Min, S.H., & Han, I. (2006). An evolutionary approach to the combination of multiple classifiers to predict a stock price index. Expert Systems with Applications, 31, 241–247.
 Krawczyk & Wozniak (2014) Krawczyk, B., & Wozniak, M. (2014). Experiments on simultaneous combination rule training and ensemble pruning algorithm. In 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL) (pp. 1–6). IEEE.
 Lange et al. (2006) Lange, M., Focken, U., Meyer, R., Denhardt, M., Ernst, B., & Berster, F. (2006). Optimal combination of different numerical weather models for improved wind power predictions. In 6th International Workshop on LargeScale Integration of Wind Power and Transmission Networks for Offshore Wind Farms, Delft.
 LeBlanc & Tibshirani (1996) LeBlanc, M., & Tibshirani, R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 91, 1641–1650.
 Liu & Yao (1999) Liu, Y., & Yao, X. (1999). Ensemble learning via negative correlation. Neural networks, 12, 1399–1404.

 Liu et al. (2000) Liu, Y., Yao, X., & Higuchi, T. (2000). Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation, 4, 380–387.
 MendesMoreira et al. (2012) MendesMoreira, J., Soares, C., Jorge, A. M., & Sousa, J. F. D. (2012). Ensemble approaches for regression: A survey. ACM Computing Surveys (CSUR), 45, 1–40.
 Peng et al. (2020) Peng, T., Zhang, C., Zhou, J., & Nazir, M. S. (2020). Negative correlation learning-based RELM ensemble model integrated with OVMD for multi-step ahead wind speed forecasting. Renewable Energy.
 Perrone (1993) Perrone, M. P. (1993). Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Ph.D. thesis Citeseer.
 Qi & Tang (2018) Qi, C., & Tang, X. (2018). A hybrid ensemble method for improved prediction of slope stability. International Journal for Numerical and Analytical Methods in Geomechanics, 42, 1823–1839.
 Rosenblatt (1961) Rosenblatt, F. (1961). Principles of neurodynamics. perceptrons and the theory of brain mechanisms. Technical Report Cornell Aeronautical Lab Inc Buffalo NY.
 Salgado et al. (2006) Salgado, R. M., Pereira, J. J., Ohishi, T., Ballini, R., Lima, C., & Von Zuben, F. J. (2006). A hybrid ensemble model applied to the shortterm load forecasting problem. In The 2006 IEEE International Joint Conference on Neural Network Proceedings (pp. 2627–2634). IEEE.
 Santosa & Symes (1986) Santosa, F., & Symes, W. W. (1986). Linear inversion of bandlimited reflection seismograms. SIAM Journal on Scientific and Statistical Computing, 7, 1307–1330.

 Sirovetnukul et al. (2011) Sirovetnukul, R., Chutima, P., Wattanapornprom, W., & Chongstitvatana, P. (2011). The effectiveness of hybrid negative correlation learning in evolutionary algorithm for combinatorial optimization problems. In 2011 IEEE International Conference on Industrial Engineering and Engineering Management (pp. 476–481). IEEE.
 Solomatine & Shrestha (2004) Solomatine, D. P., & Shrestha, D. L. (2004). AdaBoost.RT: a boosting algorithm for regression problems. In 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541) (pp. 1163–1168). IEEE volume 2.
 Stigler (1974) Stigler, S. M. (1974). Gergonne’s 1815 paper on the design and analysis of polynomial regression experiments. Historia Mathematica, 1, 431–439.
 Stone (1974) Stone, M. (1974). Crossvalidatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36, 111–133.
 Taieb & Hyndman (2014) Taieb, S. B., & Hyndman, R. J. (2014). A gradient boosting approach to the kaggle load forecasting competition. International journal of forecasting, 30, 382–394.
 Tang et al. (2009) Tang, K., Lin, M., Minku, F. L., & Yao, X. (2009). Selective negative correlation learning approach to incremental learning. Neurocomputing, 72, 2796–2805.
 Verma & Hassan (2011) Verma, B., & Hassan, S. Z. (2011). Hybrid ensemble approach for classification. Applied Intelligence, 34, 258–278.
 Wächter & Biegler (2006) Wächter, A., & Biegler, L. T. (2006). On the implementation of an interiorpoint filter linesearch algorithm for largescale nonlinear programming. Mathematical programming, 106, 25–57.
 Webb & Zheng (2004) Webb, G. I., & Zheng, Z. (2004). Multistrategy ensemble learning: Reducing error by combining ensemble learning techniques. IEEE Transactions on Knowledge and Data Engineering, 16, 980–991.
 Wen & Guyer (2012) Wen, C., & Guyer, D. (2012). Imagebased orchard insect automated identification and classification method. Computers and Electronics in Agriculture, 89, 110–115.
 Wolpert (1992) Wolpert, D. H. (1992). Stacked generalization. Neural networks, 5, 241–259.
 Wu et al. (2008) Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Philip, S. Y. et al. (2008). Top 10 algorithms in data mining. Knowledge and information systems, 14, 1–37.
 Xiao et al. (2018) Xiao, L., Dong, Y., & Dong, Y. (2018). An improved combination approach based on adaboost algorithm for wind speed time series forecasting. Energy Conversion and Management, 160, 273–288.
 Zhang & Hanby (2007) Zhang, Y., & Hanby, V. I. (2007). Shortterm prediction of weather parameters using online weather forecasts. In Building simulation. volume 2007.
 Zhao et al. (2010) Zhao, Q. L., Jiang, Y. H., & Xu, M. (2010). Incremental learning by heterogeneous bagging ensemble. In International Conference on Advanced Data Mining and Applications (pp. 1–12). Springer.
 Zou et al. (2003) Zou, K. H., Tuncali, K., & Silverman, S. G. (2003). Correlation and simple linear regression. Radiology, 227, 617–628.