State-of-the-art literature has shown combination models to be theoretically and empirically superior to single models (Brown et al., 2005; Chandra & Yao, 2006b). The explosive growth of data makes it hard for traditional single models to fit the true, complex distribution (Dietterich, 2000). A single model struggles to cope with large volumes of high-dimensional data, so its predictions on future data are inaccurate; Ding et al. (2018) likewise show that there is no perfect model for all data. Many single models are prone to overfitting the training data in the effort to reduce errors, a situation that can be avoided by combination models (Perrone, 1993). In addition, scientists often train several different models on the data, tune each model extensively, and ultimately select the best one. These lengthy steps mean that, in the absence of prior knowledge, a large amount of time is consumed in selecting the most suitable model for a given dataset (Liu et al., 2000). Combination models do not reduce the time required to train individual sub-models. However, model selection keeps only the best model and discards the remaining models, which may still contain important information, whereas combination models fully consider the information of multiple sub-models to produce more robust predictions, which can avoid the above problems to some extent. So far, combination models have achieved remarkable results in several research areas, such as meteorological forecasting (e.g., Lange et al., 2006; Zhang & Hanby, 2007; Xiao et al., 2018), energy (Baumeister & Kilian, 2015), finance (Kim et al., 2006), and agriculture (Wen & Guyer, 2012). In recent years, solutions based on combination models have often achieved good results in Kaggle competitions (e.g., Taieb & Hyndman, 2014; Hoch, 2015; Bojer & Meldgaard, 2020), which shows that combination models are promising.
After reviewing a large body of literature, we categorize combination models into two classes: ensemble models and predictions combination. Ensemble models sample in the input space, for example by filtering features according to feature importance (Mendes-Moreira et al., 2012) or by sampling the training data using cross-validation (LeBlanc & Tibshirani, 1996), and they generate a strong learner by combining multiple weak learners. Practitioners are increasingly studying ensemble models, and ensemble learning is becoming popular as a research field. Typical examples of ensemble models are Bagging (Breiman, 1996), Boosting (Freund et al., 1996), and Stacking (Wolpert, 1992).
Bagging, also known as Bootstrap Aggregation, is an intuitive ensemble algorithm proposed by Breiman (1996). Bagging samples different subsets of equal size from the entire training data, then uses these subsets to train homogeneous models in parallel. The predictions of these models are combined by majority voting as the final output. Random forest (Ho, 1995) is an application of Bagging that uses decision trees as sub-models.
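As a rough illustration of the procedure just described, the following minimal sketch draws equal-sized bootstrap subsets, trains one toy learner per subset, and combines the outputs by majority vote. The threshold "stump" learner, the helper names, and the tiny one-dimensional dataset are hypothetical stand-ins chosen for brevity; they are not the models used in this paper.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement, i.e. one equal-sized subset."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Toy base learner (assumption): threshold halfway between the class means."""
    xs0 = [x for x, label in sample if label == 0]
    xs1 = [x for x, label in sample if label == 1]
    if not xs0 or not xs1:
        # Degenerate subset with a single class: always predict that class.
        only = 0 if xs0 else 1
        return lambda x: only
    mean0 = sum(xs0) / len(xs0)
    mean1 = sum(xs1) / len(xs1)
    threshold = (mean0 + mean1) / 2
    if mean1 >= mean0:
        return lambda x: 1 if x > threshold else 0
    return lambda x: 0 if x > threshold else 1


def bagging_predict(models, x):
    """Combine the sub-model outputs by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
data = [(0.1, 0), (0.4, 0), (0.5, 0), (0.9, 0),
        (2.1, 1), (2.4, 1), (2.8, 1), (3.0, 1)]
# Eleven homogeneous sub-models, each trained on its own bootstrap subset.
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(11)]
```

Replacing the stump with a decision tree, and additionally sampling features, would move this sketch toward a random forest.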
Similar to Bagging, Boosting repeatedly samples the data to create several weak learners, which are combined using majority voting. For classification, the difference is that Boosting produces three classifiers per iteration. The first classifier is trained on a randomly selected portion of the training data; the second classifier is trained on data selected so that only half of it is correctly classified by the first classifier; the third classifier is trained on data for which the first two classifiers disagree. The final three-way majority vote produces the prediction for this iteration.
The Stacking method was proposed by Wolpert (1992). It trains a hierarchical model to improve the prediction. The first-layer learners are trained by cross-validation. Their outputs are fed into the second-layer learners, which identify misclassifications (for classification problems) or errors (for regression problems) of the first-layer learners in addition to training on the existing data. The third-layer learners combine the training results from the previous two layers to produce a more robust model.
While predictions combination is also involved in ensemble models, the concept of predictions combination is broader and its components need not all be weak learners. The candidate models of a predictions combination are heterogeneous models trained on the same input space. The outputs of all models are combined by weighted averaging to obtain more robust results. For example, neural networks tend to fall into local minima during training, which can be mitigated if the predictions of multiple neural networks are combined (e.g., Perrone, 1993; Alhamdoosh & Wang, 2014).
In some pioneering studies, predictions combination is referred to as hybrid ensemble, but hybrid ensemble also covers ensemble models. For the sake of distinction, this paper uses the term predictions combination. For load prediction, Salgado et al. (2006) used several support vector machines and neural networks as candidate models, ranked and filtered these models by mean absolute percentage error, and finally weighted the individual models with a single-layer linear neural network. Their hybrid ensemble model improves performance by 25% over the best single predictor. Verma & Hassan (2011) took five common approaches (neural networks, support vector machines, random forests, decision trees, and naive Bayes) as base classifiers and combined the predictions of each model using a consensus approach. The experimental results demonstrate the ability of the proposed method to improve the accuracy of credit scoring prediction. Qi & Tang (2018) constructed a hybrid ensemble model for predicting slope stability in geology. Gaussian process classification, quadratic discriminant analysis, support vector machine, artificial neural networks, adaptive boosted decision trees, and k-nearest neighbours were utilized as sub-models, and a genetic algorithm was used to calculate classification weights for each model. The hybrid ensemble model they designed was shown to outperform any single model, even though each single model already had its own optimal combination of parameters.
For predictions combination, Perrone (1993) demonstrates that weighted averaging performs better than basic simple averaging, and that removing the results of nearly identical models can improve accuracy and reduce model covariance. Nevertheless, as the study of predictions combination has grown, some problems have arisen. Perrone (1993) also indicates that if a predictions combination contains too many sub-models, it may have the opposite effect. LeBlanc & Tibshirani (1996) state that if least squares regression is used to assign weights to the sub-models, the best-performing model may be assigned a weight of 1 while all the others receive a weight of 0. In this case, there is little need to use the predictions combination.
Since then, scholars have identified model diversity as the key to the success of predictions combination. Brown (2004) states that it is important to examine the reasons for this success, especially the ability to automatically exploit the strengths and weaknesses of the components within the combination. The diversity of components deserves to be explored in depth. Webb & Zheng (2004) demonstrate that increasing the diversity of members within a combination without increasing their testing error inevitably leads to a reduction in the prediction error of the combination. Chandra & Yao (2006b) emphasize that diversity and accuracy are key to constructing a predictions combination, and similar ideas have been experimentally validated (e.g., Alhamdoosh & Wang, 2014; Peng et al., 2020). Krawczyk & Wozniak (2014), on the other hand, suggest combining the construction of a diverse pool of models with the search for the optimal model combination. Brown et al. (2005) provide quantitative methods for the diversity of predictions combination using ambiguity decomposition and bias-variance-covariance decomposition, which will be described in Section 2.
Several methods have been proposed to increase the diversity of sub-models. For ensemble models, practitioners often use cross-validation to obtain sub-models, or choose different combinations of parameters for homogeneous models, followed by majority voting or weighted averaging of the model predictions. Cross-validation, however, provides limited improvement in model effectiveness, and Stone (1974) proved that estimators generated by cross-validation behave similarly. For classification, Liu et al. (2000) trained evolutionary ensembles in which the k-means method was applied to cluster the nodes of the neural network. Representative nodes within each cluster were then picked out, and this step ensured the diversity of nodes. Chandra & Yao (2006a) developed the Pairwise Failure Crediting (PFC) method to measure model diversity by calculating the degree to which a model differs from other models. Sirovetnukul et al. (2011) pointed out that predictions combination can learn negative knowledge from poorly performing models, which were easily ignored and removed in previous studies, and that this knowledge can help the models converge to better solutions while also producing diverse results.
Much empirical evidence demonstrates the effectiveness of Negative Correlation Learning (NCL) in increasing model diversity and improving combination models (e.g., Liu & Yao, 1999; Liu et al., 2000; Chandra & Yao, 2006b; Sirovetnukul et al., 2011; Alhamdoosh & Wang, 2014; Peng et al., 2020). NCL introduces a correlation penalty term in the error function of each model within the combination to measure the deviation of that model from the overall combination. All sub-models can be trained simultaneously and interactively on the same training set, and the final results achieve a bias-variance-covariance balance. The widespread adoption of NCL in combination models indicates that it is a successful solution. Current applications of NCL focus on ensemble models, especially neural networks. NCL is involved in the training process of each model with the intention of diversifying each sub-model. In the framework of a specific ensemble model, the training pattern is the same for each sub-model, although diversity can be obtained. Zhao et al. (2010) and Mendes-Moreira et al. (2012) point out that in addition to homogeneous models, heterogeneous models can also be used as candidate models to obtain diversity in the model selection phase with greater generalization. However, few studies have applied NCL to model selection and predictions combination. Tang et al. (2009) used a genetic algorithm (GA) to select models, with NCL as a penalty term modifying the objective function of the GA, but the GA ran too slowly and tended to fall into local optima.
To bridge these gaps, this paper designs a generic predictions combination scheme for the regression problem. Twelve well-established regression prediction methods, including ensemble models, generalized linear regression models, etc., are added to the model pool. Each predictor is trained, after which the predictions are generated. Cross-validation and grid search are applied to the training process of each sub-model to fully train the predictor and obtain the optimal parameters. Thereafter, we view the process of model selection and combination as a non-convex optimization problem, which is solved using the Gekko optimizer (Beal et al., 2018). NCL is added as a penalty term to the objective function of the optimization problem. Two weighting methods, error inverse weighting and error exponential weighting (Armstrong, 2001), are used to fine-tune the weights of the predictions combination. Our proposed predictions combination scheme achieves excellent results on three publicly available datasets, where NCL and the weighting methods each contribute positively.
The main contributions of this paper are two-fold:
Theoretically, this paper classifies combination models into two categories, ensemble models and predictions combination, based on whether the input space is sampled or not.
Practically, this paper proposes a predictions combination scheme that incorporates NCL. Sub-models with diversity are selected from the model pool by the novel non-convex optimization solver Gekko. The two fine-tuned weighting methods also help to improve the performance of the predictions combination compared to the simple averaging method.
The rest of the paper is organized as follows. Section 2 introduces the theories and methods involved in our proposed framework. Section 3 presents the framework of predictions combination accounting for model diversity. In Section 4, we systematically investigate the application of the proposed method on three publicly available datasets and analyze the contribution of NCL and two fine-tuned weighting methods to model performance improvement, respectively. Section 5 gives some discussions and Section 6 concludes the paper.
2 Related works
In a regression problem, there is a dataset containing n samples of the form $\{(x_i, y_i)\}_{i=1}^{n}$. The target of the problem is to find a function f that maps x to y based on the samples in the dataset:
$$y = f(x) \tag{1}$$
In machine learning, $f$ is also referred to as a model or an estimator. The mean squared error (MSE) is a general measure of accuracy, which is minimized as the goal of the regression model:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(f(x_i) - y_i\right)^2 \tag{2}$$
2.1 Ambiguity Decomposition
In the following analysis, we refer to the two measures of predictions combination diversity given by Brown et al. (2005): ambiguity decomposition and bias-variance-covariance decomposition. First, in a general scenario, m sub-models in a pool can form a predictions combination by weighted average. $f_{ens}$ is a convex combination of all components:
$$f_{ens} = \sum_{j=1}^{m} w_j f_j, \qquad w_j \ge 0, \quad \sum_{j=1}^{m} w_j = 1 \tag{3}$$
where $f_j$ is the j-th prediction of the model pool. Then the overall squared error can be calculated as a weighted average of the squared error of each sub-model, with the same weights as in $f_{ens}$:
$$\sum_{j=1}^{m} w_j (f_j - y)^2 = \sum_{j=1}^{m} w_j (f_j - f_{ens} + f_{ens} - y)^2 = \sum_{j=1}^{m} w_j (f_j - f_{ens})^2 + (f_{ens} - y)^2 \tag{4}$$
where the cross term vanishes because $\sum_j w_j (f_j - f_{ens}) = 0$. Hence, according to the above derivation, the error of $f_{ens}$ can be viewed as two components:
$$(f_{ens} - y)^2 = \sum_{j=1}^{m} w_j (f_j - y)^2 - \sum_{j=1}^{m} w_j (f_j - f_{ens})^2 \tag{5}$$
Formula 5 indicates that the error of $f_{ens}$ is less than the weighted average error of all sub-models whenever the sub-models are not identical, so that the second (ambiguity) term is positive. This fact reveals that the larger the difference between each sub-model and the predictions combination, the larger the ambiguity term and the smaller the error of the predictions combination. In particular, without an established criterion to judge the best model in advance, it is more efficient to use the predictions combination directly, even if the error of one particular model is smaller than that of the predictions combination.
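The ambiguity decomposition described above can be checked numerically. The sketch below is a minimal illustration for a single sample; the function names, the three predictions, and the weights are hypothetical values chosen only to make the identity visible.

```python
def combine(preds, weights):
    """Convex combination of sub-model predictions for one sample."""
    return sum(w * p for w, p in zip(weights, preds))


def ambiguity_terms(preds, weights, y):
    """Return (ensemble error, weighted average error, ambiguity term)."""
    f_ens = combine(preds, weights)
    ensemble_error = (f_ens - y) ** 2
    weighted_error = sum(w * (p - y) ** 2 for w, p in zip(weights, preds))
    ambiguity = sum(w * (p - f_ens) ** 2 for w, p in zip(weights, preds))
    return ensemble_error, weighted_error, ambiguity


preds = [2.0, 3.5, 5.0]      # hypothetical sub-model predictions
weights = [0.5, 0.3, 0.2]    # non-negative, sum to 1
y = 3.0                      # true value

ens_err, avg_err, amb = ambiguity_terms(preds, weights, y)
# The ensemble error equals the weighted error minus the ambiguity term.
print(ens_err, avg_err - amb)
```

Because the ambiguity term is non-negative, the combination's error can never exceed the weighted average error of its members under these assumptions.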
2.2 Bias-variance-covariance Decomposition
The mean squared error (MSE) of the sub-models and the predictions combination is employed in the ambiguity decomposition to measure diversity, i.e., the larger the ambiguity term, the more diverse the combination. However, as the number of sub-models increases, the more diverse they are, the more likely they are to deviate from the true value. This situation leads to an increase in the first term on the right-hand side of Equation 5, in which case increasing the diversity of the predictions combination is less beneficial. Thus the question of how to balance the diversity and the accuracy of the sub-models becomes of interest. The bias-variance-covariance decomposition of Brown et al. (2005) expresses this tradeoff in a well-defined way.
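For simplicity, given the simple average form of the predictions combination, $f_{ens} = \frac{1}{m}\sum_{j=1}^{m} f_j$ with $w_j = \frac{1}{m}$, the bias-variance-covariance decomposition is given by the following equation:
$$E\left[(f_{ens} - y)^2\right] = \overline{bias}^2 + \frac{1}{m}\,\overline{var} + \left(1 - \frac{1}{m}\right)\overline{cov} \tag{6}$$
in which $\overline{bias}$, $\overline{var}$, and $\overline{cov}$ are the averaged bias, variance, and covariance of the models in the predictions combination, respectively. The formulas of the three terms are as follows:
$$\overline{bias} = \frac{1}{m}\sum_{j=1}^{m}\left(E[f_j] - y\right) \tag{7}$$
$$\overline{var} = \frac{1}{m}\sum_{j=1}^{m} E\left[\left(f_j - E[f_j]\right)^2\right] \tag{8}$$
$$\overline{cov} = \frac{1}{m(m-1)}\sum_{j=1}^{m}\sum_{k \ne j} E\left[\left(f_j - E[f_j]\right)\left(f_k - E[f_k]\right)\right] \tag{9}$$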
Unlike the case of ambiguity decomposition where the accuracy of each sub-model needs to be considered, the bias-variance-covariance decomposition can theoretically reduce the error of the predictions combination by decreasing the covariance without increasing the bias and variance. In addition, the covariance term can be negative, implying that negative correlations between sub-models can contribute to the predictions combination.
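The bias-variance-covariance decomposition for a simple average can also be verified on a small table of outputs. In the sketch below, each row is one hypothetical trial (for example, one training set) and each column is one sub-model; expectations are replaced by averages over trials. All names and numbers are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def bvc_terms(outputs, y):
    """outputs[t][j] is the prediction of sub-model j in trial t."""
    m = len(outputs[0])
    cols = [[row[j] for row in outputs] for j in range(m)]
    mus = [mean(c) for c in cols]
    # Averaged bias, variance and covariance of the m sub-models.
    bias = mean([mu - y for mu in mus])
    var = mean([mean([(v - mu) ** 2 for v in c]) for c, mu in zip(cols, mus)])
    cov = sum(
        mean([(a - mus[j]) * (b - mus[k]) for a, b in zip(cols[j], cols[k])])
        for j in range(m) for k in range(m) if j != k
    ) / (m * (m - 1))
    return bias, var, cov


outputs = [
    [2.0, 3.0, 4.5],
    [2.5, 2.0, 4.0],
    [1.5, 3.5, 3.0],
    [3.0, 2.5, 3.5],
]
y = 3.0
m = len(outputs[0])

bias, var, cov = bvc_terms(outputs, y)
decomposed = bias ** 2 + var / m + (1 - 1 / m) * cov
# Direct expected squared error of the simple-average combination.
direct = mean([(mean(row) - y) ** 2 for row in outputs])
print(decomposed, direct)
```

A negative `cov` lowers `decomposed` without touching the bias and variance terms, which is the effect the text attributes to negatively correlated sub-models.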
2.3 Negative Correlation Learning
The ambiguity decomposition and the bias-variance-covariance decomposition present two ways to quantify the diversity of the predictions combination, and the bias-variance-covariance decomposition additionally reveals the role of negative correlation among sub-models in reducing the error of the predictions combination. Neither decomposition explains how to achieve diversity among sub-models; this is addressed by negative correlation learning (NCL), proposed by Liu & Yao (1999).
NCL was originally designed as a method for neural network ensembles. It adds a penalty term to the objective function of each individual network and trains all the networks simultaneously and interactively before combining them. The purpose of this training pattern is not to obtain multiple accurate and independent neural networks, but to capture the correlations and derive sub-networks with negative correlations using penalty terms, which in turn form a robust combination. As before, there are m sub-networks in the neural network ensemble and n samples in the dataset. For each network, the objective function during training is of the form MSE, and NCL adds a penalty term to it. The form of the network's objective function is given by the following equation:
$$e_j = \frac{1}{n}\sum_{i=1}^{n}\left(f_j(x_i) - y_i\right)^2 + \lambda\, p_j, \qquad p_j = \frac{1}{n}\sum_{i=1}^{n}\left(f_j(x_i) - f_{ens}(x_i)\right)\sum_{k \ne j}\left(f_k(x_i) - f_{ens}(x_i)\right) \tag{10}$$
where $\lambda$ is the negative correlation factor with a value between 0 and 1, and the second term is the penalty term. Formula 10 measures both the difference between each sub-model and the true value $y$, and the difference between each sub-model and the predictions combination, where $\lambda$ controls the strength of the negative correlation. When $\lambda$ equals 0, the objective function is equivalent to MSE; when $\lambda$ equals 1, the negative correlation strength of the objective function is maximized. In summary, NCL presents a new perspective on obtaining diversity in the training of neural network ensembles, because the sub-models interact with each other under the control of the penalty term.
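To make the role of the penalty concrete, the following sketch evaluates an NCL-style objective for one sub-model. It assumes the penalty of Liu & Yao (1999) with a simple-average combination and no 1/2 scaling factor; the function name and the two-model toy data are hypothetical.

```python
def ncl_objective(preds, y, j, lam):
    """NCL-style objective of sub-model j (assumed form).

    preds[k][i] is the prediction of sub-model k on sample i,
    y[i] is the true value, and lam is the negative correlation factor.
    """
    m, n = len(preds), len(y)
    total = 0.0
    for i in range(n):
        # Simple-average combination for sample i.
        f_ens = sum(preds[k][i] for k in range(m)) / m
        # Sum of the other sub-models' deviations from the combination.
        others = sum(preds[k][i] - f_ens for k in range(m) if k != j)
        penalty = (preds[j][i] - f_ens) * others
        total += (preds[j][i] - y[i]) ** 2 + lam * penalty
    return total / n


preds = [[1.0, 2.0],   # sub-model 0
         [3.0, 2.0]]   # sub-model 1
y = [2.0, 2.0]

plain_mse = ncl_objective(preds, y, j=0, lam=0.0)   # reduces to the MSE
half = ncl_objective(preds, y, j=0, lam=0.5)
full = ncl_objective(preds, y, j=0, lam=1.0)
print(plain_mse, half, full)
```

In this toy case the two sub-models err in opposite directions on the first sample, so the penalty is negative and the objective shrinks as `lam` grows.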
3 Predictions Combination of Regression with Negative Correlation Learning
As one of the most fundamental mathematical problems, regression has a considerable number of well-established models designed from different perspectives. Predictions combination is a way to solve regression problems by taking a weighted average of the predictions of multiple models. This approach produces better results than any of the sub-models under certain circumstances and improves accuracy by error hedging, since the predictions combination fully takes the information from multiple models into account. However, the predictions combination places high demands on model diversity, because if all candidates are the same, the predictions combination no longer helps. In this paper, by introducing negative correlation learning into the predictions combination of regression, sub-models with diversity are selected from the model pool and then combined and assigned weights, with the aim of improving prediction accuracy.
Specifically, the research in this paper contains these aspects: model pool construction, model training methods, optimization problem design for predictions combination, weight fine-tuning and evaluation. These contents will be introduced in this section.
3.1 Model Pool Construction
Many ensemble models adopt cross-validation to train homogeneous models and perform majority voting to select the models that work well. In contrast, this paper draws on the conclusion of Mendes-Moreira et al. (2012) that heterogeneous models, and not only homogeneous ones, can serve as candidates, which provides diversity and better performance in the model selection phase. In constructing the model pool, 12 common regression prediction models are selected in this paper, including ensemble models such as Random Forest, AdaBoost, and GBDT. Still following the notation of Section 2, the linear regression model views $y$ as a linear combination of the inputs $x$, in which there are l features for each sample:
$$\hat{y} = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_l x_l + b \tag{11}$$
Or, written in the matrix form of linear regression, $\hat{y} = X\beta + b$, in which $X$ is the input samples, $\beta$ is the regression coefficients, $\hat{y}$ is the predictions, and $b$ is the constant term. Following these notations, we present the 12 regression models in Table 5 in Appendix A. They are Simple Linear Regression (SLR) (Zou et al., 2003), Ridge Regression (RR) (Hoerl & Kennard, 1970), Lasso Regression (LR) (Santosa & Symes, 1986), Bayesian Regression (BR) (Box & Tiao, 2011), Stochastic Gradient Descent Regression (SGDR) (Bottou, 2010), Polynomial Regression (PR) (Stigler, 1974), Random Forest Regression (RFR) (Ho, 1995), Adaptive Boosting Regression (ABR) (Solomatine & Shrestha, 2004), Gradient Boosting Decision Tree (GBDT) (Friedman, 2001), Support Vector Regression (SVR) (Drucker et al., 1997), Decision Tree Regression (DTR) (Wu et al., 2008), and Multilayer Perceptron Regression (MLP) (Rosenblatt, 1961).
3.2 Model Training Methods
In this paper, the 12 models in the model pool are trained individually. To achieve optimal performance of each model, grid search (Chicco, 2017) and cross-validation (Geisser, 1975) are used. When training a machine learning model, one would generally set different combinations of parameters manually to seek good performance. However, this practice is time-consuming and laborious, and it is difficult to find the best combination of parameters manually. Grid search automates this work. Before applying grid search, the range of values of each parameter in the model is defined by hand and discretized. Subsequently, in combination with cross-validation, grid search evaluates performance at each point of the parameter space defined in the previous step. It thereby saves time and obtains the optimal combination of parameters in the given parameter space based on the training set.
During model training, the fitted results tend to perform better on the training set and worse on the test set that is not involved in the training. Cross-validation was developed to reduce this overfitting. The training set is divided into several equal parts; one part at a time is held out as the validation set, while the rest are used to train the model. Once each part of the data has served as the validation set, all the trained models are averaged to form a new model, which fits new data better. Figure 1 shows the 5-fold cross-validation approach.
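The interplay of grid search and k-fold cross-validation can be sketched in a few lines. The example below is only illustrative: it uses a hypothetical one-feature ridge model without intercept, a hand-defined grid of three `alpha` values, and the common variant in which the validation errors of the folds are averaged to score each grid point.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_1d(xs, ys, alpha):
    """Closed-form slope of a one-feature ridge regression without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def cv_mse(xs, ys, alpha, k=5):
    """Average validation MSE over k folds for one parameter value."""
    errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        beta = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], alpha)
        errors.append(sum((beta * xs[i] - ys[i]) ** 2 for i in fold) / len(fold))
    return sum(errors) / len(errors)


def grid_search(xs, ys, alphas, k=5):
    """Evaluate every grid point and return the best one with all scores."""
    scores = {alpha: cv_mse(xs, ys, alpha, k) for alpha in alphas}
    return min(scores, key=scores.get), scores


xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]            # noiseless toy target
best_alpha, scores = grid_search(xs, ys, alphas=[0.0, 1.0, 10.0])
print(best_alpha, scores)
```

On this noiseless toy target the unregularized grid point wins; with real data the preferred value depends on the noise level and the fold assignment.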
3.3 Optimization Problem for Predictions Combination
In this paper, we design an optimization problem to implement the predictions combination, in which candidate sub-models are automatically selected and assigned weights. For a dataset of n samples, the m sub-models in the pool are trained separately and produce m sets of predictions, which form a matrix $P \in \mathbb{R}^{n \times m}$. Our aim is to obtain a weight vector $w \in \mathbb{R}^{m}$ for these predictions, and the predictions combination is generated by the product of $P$ and $w$: $f_{ens} = Pw$.
Each sub-model corresponds to an optimization objective function $e_j$, which can be calculated by Equation 10. The optimized objective values of all sub-models form a vector $e$, from which the overall objective function (Equation 12) is built.
Each term in the objective function contains $f_{ens}$, which varies with the weight vector at each iteration. This makes the optimization problem non-convex. To solve this problem, we use the Gekko optimizer designed by Beal et al. (2018). As an algebraic modeling language, Gekko excels at solving dynamic optimization problems. In addition, Gekko is a Python library that integrates model building, analysis tools, and optimization visualization. Gekko provides several user-friendly built-in solvers and allows efficient interaction between them and the optimization models. The gradient-based interior point optimizer IPOPT is the best known of these, and is used as the default optimizer by Gekko. For more information about IPOPT, please refer to Wächter & Biegler (2006).
3.4 Weight Fine-tuning and Evaluation
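Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) are three common indexes used to evaluate the accuracy of regression models. With $\hat{y}_i$ denoting the prediction for the i-th sample, the formulas of RMSE, MAE, and MAPE are as follows:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \tag{13}$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right| \tag{14}$$
$$MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right| \tag{15}$$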
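For reference, the three indexes can be computed as in the minimal sketch below. The function names are illustrative, MAPE is reported in percent, and the inputs are assumed to contain no zero true values.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (true values must be non-zero)."""
    n = len(y_true)
    return 100.0 * sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / n


y_true = [2.0, 4.0]
y_pred = [1.0, 5.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
```

RMSE penalizes large errors more heavily than MAE, while MAPE is scale-free, which is why the three can rank the same models differently.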
Gekko generates the weights for each candidate model when solving the optimization problem (Equation 12). To improve the performance of the predictions combination on the three evaluation indexes, we propose several weight fine-tuning schemes, drawing on the error inverse weights and error exponential weights designed by Armstrong (2001); their validity is verified, and the best scheme selected, in the subsequent experiments.
Let $e_j$ be the predictive error of the j-th sub-model, which can be one of RMSE, MAE, and MAPE. The error inverse weights (EIW) and error exponential weights (EEW) are written separately in the following form:
$w_j^{EIW}$ and $w_j^{EEW}$ are the weights for the j-th sub-model in Equation 16 and Equation 17, respectively. The intuitive goal of these two types of weight fine-tuning schemes is to assign higher weights to those models that predict well. Specifically, given the weight $w_j$ for the j-th of the m models generated by Gekko, the weights after fine-tuning by $w_j^{EIW}$ and $w_j^{EEW}$ become:
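The sketch below illustrates one plausible reading of the two schemes. The exact formulas are assumptions, not taken from this paper: inverse weights are taken proportional to the reciprocal of the error, exponential weights proportional to the exponential of the negative error, and fine-tuning is assumed to multiply the solver weights by the tuning weights and renormalize. All function names are hypothetical.

```python
import math


def error_inverse_weights(errors):
    """Assumed EIW form: proportional to 1 / error, normalized to sum to 1."""
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [v / total for v in inverse]


def error_exponential_weights(errors):
    """Assumed EEW form: proportional to exp(-error), normalized to sum to 1."""
    expo = [math.exp(-e) for e in errors]
    total = sum(expo)
    return [v / total for v in expo]


def fine_tune(base_weights, tuning_weights):
    """Assumed rule: element-wise product of the weights, then renormalize."""
    product = [b * t for b, t in zip(base_weights, tuning_weights)]
    total = sum(product)
    return [p / total for p in product]


errors = [1.0, 2.0, 4.0]           # e.g. validation RMSE of three sub-models
solver_weights = [0.5, 0.5, 0.0]   # hypothetical weights from the optimizer

eiw = error_inverse_weights(errors)
eew = error_exponential_weights(errors)
tuned = fine_tune(solver_weights, eiw)
print(eiw, eew, tuned)
```

Under these assumptions a sub-model that the optimizer has already dropped (weight 0) stays excluded, and among the remaining sub-models the weight shifts toward the one with the smaller error.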
3.5 Framework of the Predictions Combination
Figure 2 demonstrates our proposed predictions combination framework incorporating NCL by an example. First a model pool consisting of common regression models is constructed. Each model generates a set of predictions for the test set, where a grid search is used to find the optimal combination of parameters from the parameter space and cross-validation is used to improve the robustness of the model. After these steps, each single model achieves the best results within its capability. Subsequently, a fused NCL objective function was designed for model selection and weighting to find those sub-models whose predictions have negative correlations, thus enhancing the diversity within the predictions combination. The weights of each model are automatically updated in the process of solving the objective function using the Gekko optimizer. After the candidate sub-models and their corresponding weights are obtained, both error inverse weights and error exponential weights are applied to further fine-tune the weights in order to assign higher weights to models with good predictions while reducing the weights of models with less good predictions. Finally, the weight-adjusted sub-models are combined to gain a new set of predictions. In the following experiments, we will explore the performance of this predictions combination.
The experiments in this section are designed to validate the optimization problem incorporating the NCL method for improving the accuracy of prediction combinations. The following subsections expand on the datasets, experimental design, and analysis of the results.
4.1 Datasets and Preprocessing
Three public datasets from Kaggle (https://www.kaggle.com/) were chosen for this paper: CarPrice (https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho?select=car+data.csv), Life-Expectancy (https://www.kaggle.com/kumarajarshi/life-expectancy-who), and Walmart (https://www.kaggle.com/vik2012kvs/walmart-dataretail-analysis). Kaggle is an active online community of data scientists and machine learners that often hosts machine learning competitions, provides public datasets, and more. CarPrice is used-car transaction data containing vehicle and driving information; the dependent variable is the car price. Life-Expectancy is a dataset of population life expectancy and related factors for each country from 2000 to 2015, including the development status of countries, adult and infant mortality, alcohol consumption, and more, according to the World Health Organization. Walmart is Walmart's sales dataset, which includes external factors such as temperature, fuel, CPI, and unemployment in addition to stores and dates. All three datasets can be used for regression, and their basic information is shown in Table 1.
| Dataset | # Samples | # Features | Max of Y | Min of Y | Mean of Y | Median of Y | Std of Y |
For the three datasets, we performed one-hot encoding of the nominal variables in the features, and represented the ordinal variables with consecutive numerical ranks. In this paper, we assume that the distribution of the training set samples is representative of the overall distribution, so for the training set, data standardization was applied to all variables other than the nominal variables. Data normalization was not used, in order to avoid the effect of extreme values on the whole. The standardized training data have a mean of 0 and a standard deviation of 1. For the validation set and test set, the data were standardized using the mean and standard deviation of the training set. The equation of data standardization is as follows:
$$x' = \frac{x - \mu}{std} \tag{20}$$
where $x$ is the original data and $x'$ is the standardized data, with $\mu$ the mean and $std$ the standard deviation of the n samples.
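A minimal sketch of this standardization step follows: the statistics are estimated on the training column only and then reused for validation and test data. The helper names and the toy column are hypothetical, and the population standard deviation (dividing by n) is an assumption.

```python
import math


def fit_standardizer(train_column):
    """Estimate mean and (population) standard deviation on the training data."""
    n = len(train_column)
    mu = sum(train_column) / n
    std = math.sqrt(sum((v - mu) ** 2 for v in train_column) / n)
    return mu, std


def standardize(column, mu, std):
    """Apply x' = (x - mu) / std with statistics fitted on the training set."""
    return [(v - mu) / std for v in column]


train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 9.0]

mu, std = fit_standardizer(train)
train_std = standardize(train, mu, std)
test_std = standardize(test, mu, std)   # reuses the training statistics
print(mu, std, test_std)
```

Reusing the training statistics keeps information from the validation and test sets out of the preprocessing step.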
When dividing the data, we randomly selected 80% from all samples as the training set for training all models in the pool separately. 16% of the data was used as the validation set for verifying our proposed NCL-based model selection and weighting optimization method and training to derive the weights. The remaining 4% of the data naturally becomes the test set to test the performance of the predictions combination as brand new data.
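The 80% / 16% / 4% division can be sketched as a single shuffle followed by slicing. The function name, the fixed seed, and the sample count of 1000 are illustrative assumptions.

```python
import random


def split_indices(n, seed=0, train_frac=0.80, val_frac=0.16):
    """Shuffle the sample indices once and cut them into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(round(n * train_frac))
    n_val = int(round(n * val_frac))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]     # the remaining samples
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Keeping the three index sets disjoint ensures that the test samples play no part in training the sub-models or in deriving the combination weights.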
4.2 Weight-based Predictions Combination
4.2.1 The Basic Predictions Combination: Simple Average
At the very beginning, the 12 models from the model pool were trained on each of the three datasets. Parameter grid search and cross-validation during this period ensure to some extent that each model exploits its capabilities as much as possible. The optimal combination of parameters for each model selected by the grid search is given in Table 6 in Appendix B. RMSE, MAE, and MAPE, as three prediction indexes, are employed to evaluate the models. For comparison, the simple average predictions of the candidate models are added in Figure 3, Figure 4, and Figure 5, in which the models are ranked in increasing order of the sum of the indexes.
As shown in Figure 3, Figure 4, and Figure 5, the best models for the three datasets are GBDT, MLP, and GBDT, respectively. A single model carries strong uncertainty, due to factors such as the size and features of the dataset itself. For example, PR performs the worst of all models on the first two datasets, but it ranks third on the third dataset; MLP has the best prediction on the second dataset, yet it is the second worst model on the first dataset. This fact again verifies that there is no perfect model for all datasets, even when each model has already reached the best of its own parameter combinations.
To reduce the effect of model uncertainty, a method that simply averages over all model predictions is often used in predictions combination. This naive approach is sometimes effective, i.e., it can achieve results comparable to the optimal model in the model pool. When working with a new data set, it is undoubtedly time-consuming to determine the predictive effect of each model individually, and a direct simple average is an effective approach in this case. In our experiments, the simple average ranked fourth, eighth and tenth in performance on the three datasets, respectively. This unsatisfactory performance necessitates continued exploration of ways to optimize the predictions combination to achieve higher accuracy.
4.2.2 Weight Fine-tuning for All Models
The first attempt to optimize the predictions combination is to adjust the weights of each sub-model. An intuitive idea for fine-tuning the weights is to assign high weights to those models that perform well and low weights to those that perform poorly, and to constrain the weights of all models to sum to 1. The error inverse weights (EIW) and error exponential weights (EEW) from Section 3 are used to fine-tune the weights. The three error indexes on the validation set, RMSE, MAE, and MAPE, are each used as the error term in EIW and EEW.
The experimental results in Table 2 demonstrate some interesting phenomena. First, as shown in Section 4.2.1, the simple average performs worse than the best sub-model, which may be due to some sub-models with excessive errors. After weight fine-tuning, the weights of these error-excessive sub-models are reduced accordingly, bringing different gains in the three prediction indexes. Specifically, in both types of weight fine-tuning schemes, EIW and EEW, fine-tuning based on RMSE and MAE reduces the RMSE and MAE errors more clearly than fine-tuning based on MAPE. This pattern is more evident for EIW, whose schemes achieved the best results on the three datasets when only the RMSE and MAE indexes were considered. In addition, MAPE-based fine-tuning performs excellently in reducing MAPE errors. However, the MAPE-based schemes should be used with caution when taking a weighted average of all sub-models, as they do not yield high gains in RMSE and MAE errors compared to RMSE-based and MAE-based fine-tuning, and may even perform worse than the simple average, e.g., on the CarPrice dataset.
4.3 NCL-based Predictions Combination
In practice, not all sub-models in the model pool may be necessary to construct the predictions combination. Those sub-models with negative correlations should be selected to increase the internal diversity of the predictions combination and thus improve the prediction accuracy. In this section, the possibility of implementing NCL-based predictions combination will be explored.
4.3.1 Effect of NCL Penalty Intensity
For each sub-model, the second term in its objective function to be optimized (Equation 10) is the NCL penalty term, in which the negative correlation factor measures the penalty intensity. When the factor equals 0, the objective function is just an ordinary MSE without any penalty; when it equals 1, the objective function has the maximum penalty intensity. In Figure 6, Figure 7, and Figure 8 we show how the three accuracy indexes vary with the factor in each dataset. The four lines in each subplot represent the best model for the dataset, the simple average, the optimal weight fine-tuning scheme containing all models, and the result after NCL selection and weighting.
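To illustrate how a penalty factor trades accuracy against diversity, the sketch below uses the classical per-sample NCL loss of Liu & Yao (1999), in which the penalty for sub-model i is (f_i - f_mean) multiplied by the sum of (f_j - f_mean) over the other sub-models. This is not Equation 10 of this paper, which may differ in weighting and scaling; the function name, the symbol lam, and the numbers are assumptions for illustration.

```python
def ncl_loss(preds, target, i, lam):
    """Classical per-sample NCL loss of sub-model i (Liu & Yao, 1999 form).

    preds:  predictions of all sub-models for one sample
    target: true value for that sample
    lam:    penalty factor; lam = 0 reduces the loss to (half) the squared error
    """
    mean_pred = sum(preds) / len(preds)
    others = sum(p - mean_pred for j, p in enumerate(preds) if j != i)
    penalty = (preds[i] - mean_pred) * others
    return 0.5 * (preds[i] - target) ** 2 + lam * penalty

# Illustrative values: three sub-models, one sample.
preds = [2.0, 4.0, 6.0]
no_penalty = ncl_loss(preds, target=3.0, i=0, lam=0.0)
full_penalty = ncl_loss(preds, target=3.0, i=0, lam=1.0)
```

Because the deviations from the mean sum to zero, the penalty equals minus the squared deviation of sub-model i from the mean, so a larger factor rewards predictions that differ from the ensemble mean.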
In CarPrice, NCL is not superior when the NCL factor takes a small value. As the factor increases, the NCL line in Figure 6 drops quickly. In terms of RMSE, NCL consistently outperforms the best model; for MAE, NCL surpasses the best model once the factor is greater than 0.2; as for MAPE, the NCL line starts to break through the baseline of the best model when the factor increases to 0.5. NCL reaches its highest accuracy at one value of the factor, after which its accuracy decreases as the factor increases further.
In Life-Expectancy, the NCL line hits the RMSE baseline of the best model when the NCL factor is greater than 0.02 and continues to outperform the best model thereafter. For MAE and MAPE, NCL comes particularly close to the best model when the factor equals 0.05 and 0.07, respectively.
In Walmart, the NCL line remains fairly stable in RMSE and MAE, almost in line with the best model, and slightly outperforms the best model for some values of the NCL factor. However, the MAPE error of NCL keeps increasing as the factor increases, which indicates that increasing the factor has little benefit for MAPE accuracy.
In summary, this section illustrates the importance of the NCL factor in NCL-based predictions combination. NCL is evidently far superior to both the simple average and the weighted average over all models, and with appropriate values of the factor it can produce predictions that exceed or remain close to the best model. The effect of NCL on accuracy is less obvious when the factor is small. As the factor increases, NCL begins to show its superiority in the predictions combination. However, if the factor is too large, the penalty may become a burden that brings no further gains in error reduction.
4.3.2 Model Subset Weighting and Negative Correlation
After selecting appropriate values of the NCL factor and using Gekko to optimize the objective function of the predictions combination, we obtained different schemes suitable for different datasets. The predictions combination contains the results of selecting and weighting the models, which are automatically generated by the optimizer. For the three datasets, after selecting the factor value that makes NCL perform best on all prediction indexes, the results are shown in Table 3. A vector of length 12 is used to represent the results of model selection and weighting, and the position of each element of the vector represents a sub-model, with the order of the sub-models shown in Section 3.1. If an element is equal to 0, the sub-model corresponding to that position is not selected; the larger the value of the element, the higher the importance of the corresponding sub-model in the predictions combination.
|Dataset||Results of Model Selection and Weighting|
In all three datasets, when , the optimization result of the objective function always favors the selection of one model. When , CarPrice selects GBDT, SVR and DTR as the model subset for the predictions combination; when , Life-Expectancy selects BR, GBDT and MLP; when , Walmart selects the model subset of GBDT, DTR and MLP. These selection results do not correspond to the top-ranked models by prediction accuracy given in Section 4.2.1, demonstrating that NCL is not just a simple method of ranking and selecting sub-models.
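A length-12 selection-and-weighting vector of the kind described above can be read as in the following sketch. The model order is assumed to follow the order of Appendix A (the authoritative order is the one in Section 3.1), and the weights are made up for illustration; they are not the values reported in Table 3.

```python
# Assumed sub-model order (taken from Appendix A; Section 3.1 is authoritative).
MODEL_ORDER = ["SLR", "RR", "LR", "BR", "SGDR", "PR",
               "RFR", "ABR", "GBDT", "SVR", "DTR", "MLP"]

def decode_selection(weights, names=MODEL_ORDER, tol=1e-9):
    """Map a weight vector to {model name: weight}, dropping unselected (zero) entries."""
    if len(weights) != len(names):
        raise ValueError("weight vector length must match the number of sub-models")
    return {name: w for name, w in zip(names, weights) if abs(w) > tol}

# Hypothetical optimizer output: only GBDT, SVR and DTR receive non-zero weight.
weights = [0, 0, 0, 0, 0, 0, 0, 0, 0.6, 0.3, 0.1, 0]
selected = decode_selection(weights)
```

The small tolerance treats numerically negligible weights as unselected, which is a practical choice when the vector comes from a continuous optimizer.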
The following heat maps show that the subsets of models selected by our proposed method do have negative correlation. First, for the test set of each dataset, the predictions combination is generated according to the model selection and weighting results in Table 3. The predictions of each sub-model are subtracted from the combined prediction to measure the degree of model bias. Next we derive the Pearson correlation coefficients of these deviations, which are presented as heat maps.
In Figure 9, Figure 10, and Figure 11, a darker color block indicates a stronger positive correlation between the elements at the corresponding position, while a lighter block represents a stronger negative correlation. The right part of each figure shows that the selected subset of models is the most negatively correlated combination of sub-models in the model pool. This part of the experimental results verifies that the model selection and weighting method incorporating NCL can indeed select the subset of models with negative correlation.
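The correlation matrix behind such heat maps can be sketched as follows, assuming the deviation of each sub-model is the combined prediction minus that sub-model's prediction. The helper names and the two-model toy data are hypothetical; plotting is omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def deviation_correlations(model_preds, combined):
    """Correlation matrix of (combined - sub-model) deviations across test samples."""
    deviations = [[c - p for p, c in zip(preds, combined)] for preds in model_preds]
    return [[pearson(a, b) for b in deviations] for a in deviations]

# Toy example: two models whose errors cancel around an equal-weight combination.
model_preds = [[1.0, 2.0, 4.0],
               [3.0, 2.0, 0.0]]
combined = [2.0, 2.0, 2.0]
corr = deviation_correlations(model_preds, combined)
```

In this toy case the two deviations are exact opposites, so the off-diagonal correlation is -1, the extreme case of the negative correlation the heat maps are meant to reveal.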
4.3.3 Weight Fine-tuning for Predictions Combination with NCL
The predictions combination with NCL performs excellently on all three datasets. In this section we explore whether its accuracy can be improved further. We fine-tune the weights generated by the Gekko optimizer according to the six schemes shown in Section 4.2.2, and the results are presented in Table 4.
For the first two datasets, fine-tuning the weights brings greater benefits compared to our proposed method. The EIW-based fine-tuning scheme works better than the EEW-based one, especially on both RMSE and MAE indexes. For Walmart, although the proposed method outperforms the best sub-model in RMSE and MAE, any of the weight fine-tuning methods can no longer improve the RMSE and MAE accuracy of the predictions combination. However, for the MAPE indicator, some gains can still be obtained if is used to fine-tune the weights.
Ensemble models and predictions combination, two classes of combination models, have made great progress in both research and practice. However, predictions combination still faces the problems of choosing an appropriate model subset and assigning weights to the sub-models. Simply averaging the predictions of all sub-models does not achieve the expected results, and even corrections using weighted averaging bring limited improvement. This study proposes a novel method for predictions combination that automatically selects models and generates appropriate weights, yielding performance comparable to the optimal sub-models.
Diversity is essential for the success of a predictions combination. Unlike previous studies that increase model diversity when training homogeneous models, this study explores selecting a subset of models with negative correlations from a heterogeneous pool of models. The proposed approach uses a negative correlation learning penalty term to control the optimization of the objective function during model selection. A reasonable penalty strength helps to select a diverse subset of models. Weight fine-tuning then further optimizes the weights of the predictions combination, helping it balance diversity and accuracy.
The empirical results of this paper support that considering diversity in the model selection process is an effective move. The predictions combination incorporating NCL is far more accurate than simple averaging and some intuitive weighted averaging methods on RMSE, MAE and MAPE, and it can match or even exceed the best sub-models when an appropriate value of the NCL factor is selected. Another advantage of this study is that the prediction results do not depend on a particular class of models, since any advanced model, such as GBDT, can be added to the model pool. In addition, compared with the genetic algorithm, the Gekko optimizer adopted in this paper for the model selection process has higher computing efficiency, and the weights of each sub-model are given automatically during this process.
A limitation of this paper is that the 12 models in the model pool do not cover all established models in the regression field, although this leaves researchers free to replace the candidate models. This paper also lacks a more in-depth exploration of how the effect of the predictions combination relates to the type and number of sub-models in the model pool.
We developed a predictions combination approach incorporating model diversity. Negative correlation learning acts as a penalty term for the objective function to be optimized, assisting in the model selection process to find those subsets with diversity. Experiments on three publicly available regression datasets confirm the effectiveness of this approach.
First of all, this proposed method is user-friendly. Its framework is easy enough to understand, and practitioners no longer need to evaluate the effectiveness of individual models by various accuracy indexes to select the best one, nor do they need to blindly weight the candidate models, since the predictions combination with the addition of NCL can fully demonstrate prediction accuracy that approximates or exceeds that of the best sub-model with appropriate penalty strength. In addition, the predictions from any model can be added to the model pool as an element for calculation. Even if that model does not work well, this method will discard it automatically. Therefore, our proposed method has practical implications.
Appendix A
|SLR||SLR seeks to minimize the sum of squares of the residuals of the data and the predictions when fitting the model, and it assumes that the better the fit, the closer the data points and predictions will be on the graph.||where is the Euclidean norm.|
|RR||RR adds a regular term to the objective function to find a better solution from the solution space compared to SLR.||, where is the parameter of regular term.|
|LR||Lasso is short for least absolute shrinkage and selection operator. LR can be applied for variable selection and regularization to achieve better results with fewer variables.||, where is the parameter of regular term and is the l1-norm.|
|BR||BR utilizes the principle of Bayesian inference, assuming that the error terms of the regression model obey a normal distribution. The prior distribution of the data has some specific form and the posterior probabilities of the model parameters can be computed. BR assumes that y is a Gaussian distribution of.||where is a latent variable that can be obtained in the process of model inference.|
|SGDR||SGDR does not correspond to a certain machine learning model, but is a training method and optimization technique.||, where L can be chosen as different model, like SLR; is the regular term, like l1-norm or l2-norm. can be updated by , where is learning rate and is the intercept distance.|
|PR||PR incorporates a polynomial combination of features when training a linear model, and this approach is suitable for large-scale data situations. The fit model is .||where is the expanded features and is regression coefficients.|
|RFR||RFR improves model performance by ensembling multiple decision trees. A single decision tree tends to grow too deep in training and produces large variance. In RFR, multiple decision trees are trained separately for different subsets of the training set, and the training results are averaged so that the variance is reduced.||, where is the average prediction of the observation.|
|ABR||ABR trains a series of weak learners and weights their outputs to get more accurate results. The weights of the training samples are initialized uniformly, and during the training process the model assigns low weights to the samples that are predicted well and high weights to those that are predicted poorly, so that each new learner focuses more on the samples that were previously predicted incorrectly.||, where is the objective function of the weak learner, whose weight is .|
|GBDT||GBDT ensembles a series of fixed-size decision trees, each of which can be improved after training. It uses a model combination method , where and are the prediction and weight of the sub-model. GBDT uses a forward calculation method||, where the loss function L is related to the current model and the fit results of training samples on model .|
|SVR||SVR only considers part of the training data, since data points beyond the decision boundary are ignored when constructing the loss function.||
|DTR||DTR builds a tree structure, maps features to branches and target values to leaf nodes. Given the data of the node of the decision tree as Q, the node produces a candidate set containing features and threshold. If the data is less than the threshold, it goes down the left branch and vice versa along the right branch. The data in the left child node is denoted as and that in the right child node is denoted as . The impurity function is chosen to calculate the impurity of the data Q at node m. The model is improved by optimizing the impurity function at each training session.||, where , , and are the size of the data of left child node, right child node, and the node m. For regression, the impurity function can be chosen as mean square error and mean absolute error.|
|MLP||MLP employs a multilayer neural network structure to map multi-dimensional data into one dimension. MLP consists of an input layer, a hidden layer and an output layer. Each neuron in the hidden layer is equivalent to a regressor that computes a weighted sum of the data in the previous layer and then applies an activation function. The MLP updates the parameters using stochastic gradient descent during backpropagation.||, where is the regression coefficient and is the penalty intensity factor.|
Appendix B
|SLR||n_jobs: 1||n_jobs: 1||n_jobs: 1|
|PR||pf_degree (pf is short for polynomialfeatures): 3,
- Ala’raj & Abbod (2016) Ala’raj, M., & Abbod, M. F. (2016). A new hybrid ensemble credit scoring model based on classifiers consensus system approach. Expert Systems with Applications, 64, 36–55.
- Alhamdoosh & Wang (2014) Alhamdoosh, M., & Wang, D. (2014). Fast decorrelated neural network ensembles with random weights. Information Sciences, 264, 104–117.
- Armstrong (2001) Armstrong, J. S. (2001). Principles of forecasting: a handbook for researchers and practitioners volume 30. Springer Science & Business Media.
- Baumeister & Kilian (2015) Baumeister, C., & Kilian, L. (2015). Forecasting the real price of oil in a changing world: a forecast combination approach. Journal of Business & Economic Statistics, 33, 338–351.
- Beal et al. (2018) Beal, L. D., Hill, D. C., Martin, R. A., & Hedengren, J. D. (2018). Gekko optimization suite. Processes, 6, 106.
- Bojer & Meldgaard (2020) Bojer, C. S., & Meldgaard, J. P. (2020). Kaggle forecasting competitions: An overlooked learning opportunity. International Journal of Forecasting.
- Bottou (2010) Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010 (pp. 177–186). Springer.
- Box & Tiao (2011) Box, G. E., & Tiao, G. C. (2011). Bayesian inference in statistical analysis volume 40. John Wiley & Sons.
- Breiman (1996) Breiman, L. (1996). Bagging predictors. Machine learning, 24, 123–140.
- Brown (2004) Brown, G. (2004). Diversity in neural network ensembles. Ph.D. thesis Citeseer.
- Brown et al. (2005) Brown, G., Wyatt, J., Harris, R., & Yao, X. (2005). Diversity creation methods: a survey and categorisation. Information Fusion, 6, 5–20.
- Chandra & Yao (2006a) Chandra, A., & Yao, X. (2006a). Ensemble learning using multi-objective evolutionary algorithms. Journal of Mathematical Modelling and Algorithms, 5, 417–445.
- Chandra & Yao (2006b) Chandra, A., & Yao, X. (2006b). Evolving hybrid ensembles of learning machines for better generalisation. Neurocomputing, 69, 686–700.
- Chicco (2017) Chicco, D. (2017). Ten quick tips for machine learning in computational biology. BioData mining, 10, 1–17.
- Dietterich (2000) Dietterich, T. G. (2000). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1–15). Springer.
- Ding et al. (2018) Ding, J., Tarokh, V., & Yang, Y. (2018). Model selection techniques: An overview. IEEE Signal Processing Magazine, 35, 16–34.
- Drucker et al. (1997) Drucker, H., Burges, C. J., Kaufman, L., Smola, A., Vapnik, V. et al. (1997). Support vector regression machines. Advances in neural information processing systems, 9, 155–161.
- Freund et al. (1996) Freund, Y., Schapire, R. E. et al. (1996). Experiments with a new boosting algorithm. In icml (pp. 148–156). Citeseer volume 96.
- Friedman (2001) Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, (pp. 1189–1232).
- Geisser (1975) Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American statistical Association, 70, 320–328.
- Ho (1995) Ho, T. K. (1995). Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition (pp. 278–282). IEEE volume 1.
- Hoch (2015) Hoch, T. (2015). An ensemble learning approach for the kaggle taxi travel time prediction challenge. In DC@ PKDD/ECML.
- Hoerl & Kennard (1970) Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
- Kim et al. (2006) Kim, M.-J., Min, S.-H., & Han, I. (2006). An evolutionary approach to the combination of multiple classifiers to predict a stock price index. Expert Systems with Applications, 31, 241–247.
- Krawczyk & Wozniak (2014) Krawczyk, B., & Wozniak, M. (2014). Experiments on simultaneous combination rule training and ensemble pruning algorithm. In 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL) (pp. 1–6). IEEE.
- Lange et al. (2006) Lange, M., Focken, U., Meyer, R., Denhardt, M., Ernst, B., & Berster, F. (2006). Optimal combination of different numerical weather models for improved wind power predictions. In 6th International Workshop on Large-Scale Integration of Wind Power and Transmission Networks for Offshore Wind Farms, Delft.
- LeBlanc & Tibshirani (1996) LeBlanc, M., & Tibshirani, R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 91, 1641–1650.
- Liu & Yao (1999) Liu, Y., & Yao, X. (1999). Ensemble learning via negative correlation. Neural networks, 12, 1399–1404.
- Liu et al. (2000) Liu, Y., Yao, X., & Higuchi, T. (2000). Evolutionary ensembles with negative correlation learning. IEEE Transactions on Evolutionary Computation, 4, 380–387.
- Mendes-Moreira et al. (2012) Mendes-Moreira, J., Soares, C., Jorge, A. M., & Sousa, J. F. D. (2012). Ensemble approaches for regression: A survey. Acm computing surveys (csur), 45, 1–40.
- Peng et al. (2020) Peng, T., Zhang, C., Zhou, J., & Nazir, M. S. (2020). Negative correlation learning-based relm ensemble model integrated with ovmd for multi-step ahead wind speed forecasting. Renewable Energy.
- Perrone (1993) Perrone, M. P. (1993). Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Ph.D. thesis Citeseer.
- Qi & Tang (2018) Qi, C., & Tang, X. (2018). A hybrid ensemble method for improved prediction of slope stability. International Journal for Numerical and Analytical Methods in Geomechanics, 42, 1823–1839.
- Rosenblatt (1961) Rosenblatt, F. (1961). Principles of neurodynamics. perceptrons and the theory of brain mechanisms. Technical Report Cornell Aeronautical Lab Inc Buffalo NY.
- Salgado et al. (2006) Salgado, R. M., Pereira, J. J., Ohishi, T., Ballini, R., Lima, C., & Von Zuben, F. J. (2006). A hybrid ensemble model applied to the short-term load forecasting problem. In The 2006 IEEE International Joint Conference on Neural Network Proceedings (pp. 2627–2634). IEEE.
- Santosa & Symes (1986) Santosa, F., & Symes, W. W. (1986). Linear inversion of band-limited reflection seismograms. SIAM Journal on Scientific and Statistical Computing, 7, 1307–1330.
- Sirovetnukul et al. (2011) Sirovetnukul, R., Chutima, P., Wattanapornprom, W., & Chongstitvatana, P. (2011). The effectiveness of hybrid negative correlation learning in evolutionary algorithm for combinatorial optimization problems. In 2011 IEEE International Conference on Industrial Engineering and Engineering Management (pp. 476–481). IEEE.
- Solomatine & Shrestha (2004) Solomatine, D. P., & Shrestha, D. L. (2004). Adaboost. rt: a boosting algorithm for regression problems. In 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541) (pp. 1163–1168). IEEE volume 2.
- Stigler (1974) Stigler, S. M. (1974). Gergonne’s 1815 paper on the design and analysis of polynomial regression experiments. Historia Mathematica, 1, 431–439.
- Stone (1974) Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36, 111–133.
- Taieb & Hyndman (2014) Taieb, S. B., & Hyndman, R. J. (2014). A gradient boosting approach to the kaggle load forecasting competition. International journal of forecasting, 30, 382–394.
- Tang et al. (2009) Tang, K., Lin, M., Minku, F. L., & Yao, X. (2009). Selective negative correlation learning approach to incremental learning. Neurocomputing, 72, 2796–2805.
- Verma & Hassan (2011) Verma, B., & Hassan, S. Z. (2011). Hybrid ensemble approach for classification. Applied Intelligence, 34, 258–278.
- Wächter & Biegler (2006) Wächter, A., & Biegler, L. T. (2006). On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical programming, 106, 25–57.
- Webb & Zheng (2004) Webb, G. I., & Zheng, Z. (2004). Multistrategy ensemble learning: Reducing error by combining ensemble learning techniques. IEEE Transactions on Knowledge and Data Engineering, 16, 980–991.
- Wen & Guyer (2012) Wen, C., & Guyer, D. (2012). Image-based orchard insect automated identification and classification method. Computers and Electronics in Agriculture, 89, 110–115.
- Wolpert (1992) Wolpert, D. H. (1992). Stacked generalization. Neural networks, 5, 241–259.
- Wu et al. (2008) Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Philip, S. Y. et al. (2008). Top 10 algorithms in data mining. Knowledge and information systems, 14, 1–37.
- Xiao et al. (2018) Xiao, L., Dong, Y., & Dong, Y. (2018). An improved combination approach based on adaboost algorithm for wind speed time series forecasting. Energy Conversion and Management, 160, 273–288.
- Zhang & Hanby (2007) Zhang, Y., & Hanby, V. I. (2007). Short-term prediction of weather parameters using online weather forecasts. In Building simulation. volume 2007.
- Zhao et al. (2010) Zhao, Q. L., Jiang, Y. H., & Xu, M. (2010). Incremental learning by heterogeneous bagging ensemble. In International Conference on Advanced Data Mining and Applications (pp. 1–12). Springer.
- Zou et al. (2003) Zou, K. H., Tuncali, K., & Silverman, S. G. (2003). Correlation and simple linear regression. Radiology, 227, 617–628.