1 Introduction
Loss functions are fundamental components of machine learning systems and are used to train the parameters of the learner model. Since standard training methods aim to determine the parameters that minimize the average value of the loss over an annotated training set, loss functions are crucial for successful training xiao2017ramp ; zhao2010convex . Bayesian estimators are obtained by minimizing the expected loss function. Different loss functions lead to different Bayes-optimal estimators with possibly different characteristics. Thus, in each environment the choice of the underlying loss function is important, as it will impact the performance uhlich2012bayesian ; wang2003multiscale . Letting $\hat{\theta}$ denote the estimate of a true parameter $\theta$, a loss function is a nonnegative function which assigns a loss value to each estimate, indicating how inaccurate the estimate is steinwart2008support . Loss functions assign a value to each sample, indicating how much that sample contributes to solving the optimization problem. Each loss function comes with its own advantages and disadvantages. In order to put our results in context, we start by reviewing three popular loss functions (0–1, Ramp and Sigmoid) and give an overview of their advantages and disadvantages.
If an outlier is given a very large value by the loss function, it might dramatically affect the decision function hajiabadi2017extending . The 0–1 loss function is known as a robust loss because it assigns the value 1 to all misclassified samples, including outliers, and thus an outlier does not influence the decision function, leading to a robust learner. On the other hand, the 0–1 loss penalizes all misclassified samples equally with the value 1, and since it does not enhance the margin, it is not an appropriate choice for applications where the margin matters xiao2017ramp . The Ramp loss function is defined similarly to the 0–1 loss, with the only difference that it also penalizes some correctly classified samples, namely those with small margins. This minor difference makes the Ramp loss appropriate for applications where the margin matters tang2018ramp ; xiao2017ramp . On the downside, the Ramp loss function is not differentiable, and hence not well suited for optimization purposes.
The Sigmoid loss function is almost the same as the 0–1 loss, except that it is differentiable, which in turn makes optimization significantly easier. However, it assigns a (very small) nonzero value to correctly classified samples, meaning that those samples also contribute to solving the optimization problem. Hence, in spite of easier optimization, the Sigmoid loss leads to a less sparse optimization problem than the 0–1 loss vapnik1998statistical .
There are many other examples of loss functions showing that while a loss function might be good for certain applications, it might be unsuitable for many others. Inspired by ensemble methods, in this paper we propose the use of an ensemble of loss functions during the training stage. The ensemble technique is one of the most influential learning approaches. Theoretically, it can boost weak learners, whose accuracies are slightly better than random guessing, into arbitrarily accurate strong learners napoles2017rough . This method is effective when it is difficult to design a powerful learning algorithm directly bai2014bayesian ; zhang2016bayesian ; mannor2001weak . As a meta-learning framework, it can be applied to almost all machine learning algorithms to improve their prediction accuracy.
Our goal in this paper is to propose a new ensemble loss function, which we later apply to a simple regressor. Half-Quadratic (HQ) minimization, a fast alternating-direction method, is used to learn the regressor's parameters. In each iteration, HQ approximates the (convex or nonconvex) cost function with a convex one and pursues the optimization geman1992constrained . Our main contributions are as follows.

Inspired by ensemble-induced methods, we propose an ensemble loss whose properties are inherited from its base loss functions. Moreover, we show that each base loss is a special case of our proposed loss function.

We develop both online and offline learning frameworks to find the weights associated with each loss function, and thus to build an ensemble loss function. For a particular class of base losses, we prove that the resulting ensemble loss function is Bayes consistent and robust.
This paper is structured as follows. We review existing loss functions and several promising ensemble regressors in Section 2, with one subsection devoted to each. We briefly explain Half-Quadratic (HQ) programming in Section 3. Our proposed framework is discussed in Section 4, and implementation and test results are provided in Section 5. Finally, we conclude in Section 6 with a list of problems for future work.
2 A Review of Loss Functions and Ensemble Regressors
This work draws on two broad areas of research: loss functions and ensemble-based regression methods. In this section, both areas are covered.
2.1 A Review of Loss Functions
In machine learning, loss functions are divided into two categories: margin-based and distance-based steinwart2008support . Margin-based loss functions are used for classification feng2016robust ; zhang2001text ; khan2013semi ; bartlett2006convexity , while distance-based loss functions are generally used for regression. In this paper, we focus only on distance-based loss functions.
Distance-Based Loss Functions
Let $x$, $y$ and $\hat{y}$ denote an input, the corresponding true label and the estimated label, respectively. A distance-based loss function $L(y, \hat{y}) = \varphi(e)$ is a penalty function of the quantity $e = y - \hat{y}$, which is called the distance sangari2016convergence ; chen2017kernel . The risk associated with the loss function is defined as
$R(f) = \mathbb{E}_{(x,y)\sim P}\left[L(y, f(x))\right],$
where $P(x, y)$ is the joint probability distribution over $X$ and $Y$. The ultimate aim of a learning algorithm is to find a function $f$ among a fixed class of functions for which the risk $R(f)$ is minimal steinwart2008support ; painsky2016isotonic . Generally $R(f)$ cannot be computed because the distribution $P$ is unknown. However, an approximation of $R(f)$, called the empirical risk, can be computed by averaging the loss function over a training set containing $N$ samples holland2016minimum :
$R_{emp}(f) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i)).$
A loss function should be such that minimizing the associated risk leads to the Bayes decision function. A loss is said to be Bayes consistent if, as the sample size increases, the resulting function converges to the Bayes decision function zhang2004statistical ; buja2005loss ; friedman2000additive . The Bayes decision function is fully explained in the cited papers.
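As a concrete illustration, the empirical risk above is just an average of per-sample penalties. A minimal sketch in Python (function and variable names are ours, not the paper's):

```python
import numpy as np

def empirical_risk(phi, y_true, y_pred):
    """R_emp = (1/N) * sum_i phi(e_i), with distance e_i = y_i - yhat_i."""
    e = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(phi(e)))

# Two classic distance-based losses phi(e)
square = lambda e: e ** 2
absolute = lambda e: np.abs(e)

y = np.array([1.0, 2.0, 3.0])
yhat = np.array([1.5, 2.0, 2.0])
r_sq = empirical_risk(square, y, yhat)     # (0.25 + 0.0 + 1.0) / 3
r_abs = empirical_risk(absolute, y, yhat)  # (0.5 + 0.0 + 1.0) / 3
```

Note how the square loss already penalizes the largest residual four times more heavily than the absolute loss does, which foreshadows the sensitivity discussion below.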
Name  Description

Square lopez2018robust  $(1 - yf(x))^2$
Hinge bartlett2008classification  $\max(0,\, 1 - yf(x))$
Exp.  $e^{-yf(x)}$
Logistic fan2008liblinear  $\log\left(1 + e^{-yf(x)}\right)$
Table 1 and Fig. 1 illustrate well-known examples of Bayes-consistent loss functions masnadi2010design ; masnadi2009design . Fig. 1 shows that all of these functions are convex and unbounded. As mentioned in the previous section, loss functions assign a value to each sample indicating how much that sample contributes to solving the optimization problem. Unbounded loss functions assign large values to samples with large errors and are thus more sensitive to noise. Hence, under unbounded loss functions, robustness deteriorates in noisy environments.
Fig. 2 shows two unbounded loss functions (the Exp. loss and the Logistic loss) and a bounded one (the Savage loss). SavageBoost, which uses the Savage loss, leads to a more robust learner than AdaBoost and LogitBoost, which use the Exp. and the Logistic loss respectively masnadi2009design . Several researchers have noted that although convex loss functions make optimization easier, robustness deteriorates in the presence of outliers miao2016rboost . For example, while LS-SVR uses the Square loss and is sensitive to outliers, RLS-SVR uses a nonconvex least squares loss to overcome this limitation wang2014robust .
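The sensitivity difference is easy to see numerically. Below is a hedged sketch: the Savage loss is written in one common parameterization (the exact constant varies across papers), while the Exp. and Logistic losses are the standard ones; $v$ denotes the margin $yf(x)$:

```python
import math

exp_loss = lambda v: math.exp(-v)                         # unbounded as v -> -inf
logistic_loss = lambda v: math.log(1 + math.exp(-v))      # unbounded (grows ~ |v|)
savage_loss = lambda v: 1.0 / (1 + math.exp(2 * v)) ** 2  # bounded above by 1

# A grossly misclassified outlier (large negative margin):
v = -10.0
# exp_loss(v) is huge, so one outlier can dominate the empirical risk;
# savage_loss(v) saturates near 1, so it cannot dominate.
```

Training on the exponential loss therefore lets a single bad sample steer the decision function, whereas a bounded loss caps each sample's contribution.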
There are many other distance-based loss functions which are not Bayes consistent but have been widely used in the literature. Some of these are shown in Table 2.
Name  Description

Absolute vapnik2013nature  $|e|$
Huber huber1964robust  $\frac{1}{2}e^2$ if $|e| \le \delta$; $\delta\left(|e| - \frac{1}{2}\delta\right)$ otherwise
C-loss liu2006correntropy  $\beta\left[1 - \exp\left(-\frac{e^2}{2\sigma^2}\right)\right]$
$\epsilon$-insensitive Ramp loss tang2018ramp  bounded, margin-enhancing ramp version of the $\epsilon$-insensitive loss
The Huber loss function, a combination of the Square and the Absolute loss, is shown in Table 2. It has been utilized in robust regression, where it yields a significant improvement in robustness compared with standard regression. The only difference between the Absolute and Huber loss functions is at the point $e = 0$: while the Absolute loss is not differentiable at $e = 0$, this problem is fully addressed by the Huber loss huber1964robust .
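In code, the Huber loss (here with threshold $\delta$, our notation) is quadratic near zero, which restores differentiability at $e = 0$, and linear in the tails:

```python
import numpy as np

def huber(e, delta=1.0):
    """0.5*e^2 for |e| <= delta, delta*(|e| - 0.5*delta) otherwise.
    The two pieces meet with matching value and slope at |e| = delta."""
    a = np.abs(e)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

# Quadratic inside the threshold, linear outside:
inside = float(huber(0.5))   # 0.5 * 0.25 = 0.125
outside = float(huber(3.0))  # 1.0 * (3.0 - 0.5) = 2.5
```

The linear tail is what limits the influence of large residuals relative to the Square loss, which would assign 4.5 to the same error of 3.0.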
Correntropy, which is rooted in Renyi's entropy, is a local similarity measure. It is based on the probability of how similar two random variables are in a neighbourhood of the joint space, where the neighbourhood can be adjusted by a kernel bandwidth. The C-loss function shown in Table 2 is inspired by the Correntropy criterion and is known as a robust loss function liu2006correntropy . Several researchers have used the C-loss function to improve the robustness of their learning algorithms peng2017maximum ; liu2007correntropy ; zhao2012adaptive ; chen2014steady ; chen2016generalized . The $\epsilon$-insensitive Ramp loss, inspired by the Ramp loss function xiao2017ramp , was proposed in tang2018ramp . It is a robust, margin-enhancing loss function which was applied to linear and kernel Support Vector Regression (SVR) in tang2018ramp . The weight assigned to each sample is bounded above by a predefined value, so the negative effect of outliers can be effectively reduced.
Loss functions can also be used for Dimensionality Reduction (DR), a kind of feature reduction technique. For example, in xie2018matrix the nuclear norm (N norm) is used as the loss function in solving the DR problem based on a matrix regression model. The nuclear norm preserves the low-rank information of the samples, yields low-dimensional data, and is therefore a good choice for DR purposes.
While each individual loss function has its own advantages and disadvantages, in this paper we propose an ensemble loss function which combines several individual losses. By doing so, we hope to produce a strong ensemble loss whose advantages are inherited from each individual loss function. In the next section, ensemble learning and several promising ensemble methods are discussed.
2.2 Ensemble Learning
Ensemble learning combines the outputs of several models to make a prediction, aiming to improve overall accuracy and robustness over individual models dudek2016pattern ; mendes2012ensemble . Most ensemble methods focus on classification problems; relatively few have paid attention to regression tasks. Ensemble methods comprise two steps: (1) a generation step, in which a set of base learners is built, and (2) an integration step, which combines these base learners to make the final prediction. Base learners can be combined statically or dynamically. A static combination uses predefined combination rules for all instances, while a dynamic one uses different rules for different instances ko2008dynamic ; cruz2015meta . In the following, the most promising ensemble methods are discussed.
Two of the most appealing ensemble methods for regression trees are Bagging and Random Forest. They are the most commonly used algorithms due to their simplicity, consistency and accuracy breiman1996bagging ; breiman2001random . Breiman's Random Forest first constructs a multitude of decision trees and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. The trees are modified to incorporate randomness: the split used at each node selects from a randomly chosen feature subset. Moreover, the subset considered at one node is completely independent of the subset at the previous node.
There are some interesting ensemble methods for neural networks. One, based on negative correlation, is called Evolutionary Ensembles with Negative Correlation Learning (EENCL) liu2000evolutionary . It randomly changes the weights of an existing neural network by mutation, and it determines the ensemble size automatically. The importance of this approach is due to its theoretical foundations. Another interesting neural-network method, presented in 2003 islam2003high , builds the base learners and the ensemble simultaneously, thereby saving time compared with methods that first generate base learners and then combine them. There is also research based on local experts, which specialize in local prediction and aim to achieve better prediction accuracy than globally best models. Recently, a Locally Linear Ensemble Regressor (LLER) has been proposed which first divides the original dataset into locally linear regions via an EM procedure and then builds a linear model per region to form the ensemble model kang2018locally .
In this paper, we dynamically combine several loss functions; the weight associated with each individual loss is obtained during the training phase using Half-Quadratic (HQ) optimization. In the following section, we provide an overview of HQ optimization.
3 Half-Quadratic Optimization
Let $J$ be a function of the error vector $e = (e_1, \ldots, e_n)$, where $e_i$ is the $i$th entry of $e$. In machine learning problems, we often aim to minimize an optimization problem like
(1)  $\min_{w} \sum_{i=1}^{n} \varphi(e_i) + \lambda\,\psi(w),$
where $w$ is usually the vector of the learner's parameters, $\varphi$ is a loss function which can be convex or nonconvex, and $\psi$ is a convex penalty function on $w$, which is optional and serves as the regularization term. According to Half-Quadratic (HQ) optimization he2014half , for a fixed $e$, we have
(2)  $\varphi(e) = \min_{p}\; Q(e, p) + \psi_{\varphi}(p),$
where $\psi_{\varphi}$ is the convex conjugate function of $\varphi$ and $Q(e, p)$ is an HQ function which is modeled by the additive or the multiplicative form. The additive and multiplicative forms of $Q$ are respectively formulated as $Q_A(e, p) = (e - p)^2$ and $Q_M(e, p) = p\,e^2$. Letting $p = (p_1, \ldots, p_n)$, the vector form of Equation 2 is as follows
(3)  $\sum_{i=1}^{n} \varphi(e_i) = \min_{p} \sum_{i=1}^{n} \left[ Q(e_i, p_i) + \psi_{\varphi}(p_i) \right],$
(4)  $\min_{w, p}\; J(w, p) = \sum_{i=1}^{n} \left[ Q(e_i, p_i) + \psi_{\varphi}(p_i) \right] + \lambda\,\psi(w).$
Assuming the variable $w$ is fixed, the optimal value of $p$ is obtained by the minimizer function $\delta(\cdot)$, which is computed as $p_i = \delta(e_i)$ and is related only to $\varphi$ (some specific forms of $\varphi$ and the corresponding $\delta$ are shown in Fig. 3). For each $e$, the value $\delta(e)$ is such that
$\varphi(e) = Q(e, \delta(e)) + \psi_{\varphi}(\delta(e)).$
The optimization problem 4 is convex, since it is the summation of three convex functions, and it can be minimized in alternating steps as follows:
$p^{(t+1)} = \delta\!\left(e^{(t)}\right), \qquad w^{(t+1)} = \arg\min_{w} \sum_{i=1}^{n} Q\!\left(e_i, p_i^{(t+1)}\right) + \lambda\,\psi(w),$
where $t$ denotes the iteration number. At each iteration, the objective function decreases until convergence. The HQ method is fully explained in geman1995nonlinear ; he2014half .
4 The Proposed Method
Let $\{(x_i, y_i)\}_{i=1}^{N}$ denote all samples, where $x_i$ are the inputs and $y_i$ are the labels. Let $w$ be the parameters of the regressor, which produces the predicted label $\hat{y}_i = f(x_i; w)$. Let $\varphi_1, \ldots, \varphi_K$ denote $K$ weak loss functions. We aim to find an optimal $w$ and the best weights $\alpha_1, \ldots, \alpha_K$ associated with each loss function. We need to add a further constraint to avoid near-zero values for all weights. Our proposed ensemble loss function is defined in Equation 5.
(5)  $L_{ens}(e) = \sum_{k=1}^{K} \alpha_k\,\varphi_k(e)$
Fig. 4 shows the model of our proposed method, with the bold box representing the novelty of this paper. In the training phase, the weights associated with the loss functions are learned and our proposed ensemble loss function is formed. To ease computation, we select each base loss from the M-estimator functions introduced in Fig. 3.
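Evaluating the ensemble loss of Equation 5 is then a weighted sum of base penalties. A small sketch with two M-estimator-style losses (the parameter values and names here are ours):

```python
import numpy as np

welsch = lambda e, c=1.0: (c ** 2 / 2) * (1 - np.exp(-(e / c) ** 2))
l2 = lambda e: 0.5 * e ** 2

def ensemble_loss(e, losses, alphas):
    """L_ens(e) = sum_k alpha_k * phi_k(e), with nonnegative alphas."""
    return sum(a * phi(e) for a, phi in zip(alphas, losses))

e = np.array([0.1, 1.0, 10.0])
vals = ensemble_loss(e, [welsch, l2], [0.4, 0.6])
# For the large error e = 10, the bounded Welsch term contributes at most
# 0.4 * 0.5 = 0.2, while the l2 term contributes 0.6 * 50 = 30.
```

The weights thus interpolate between the behaviours of the base losses: raising the Welsch weight bounds the penalty on gross errors, raising the l2 weight restores quadratic growth.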
The expected risk of the proposed loss function is defined as
(6)  $R_{ens}(f) = \mathbb{E}\!\left[\sum_{k=1}^{K} \alpha_k\,\varphi_k(e)\right] = \sum_{k=1}^{K} \alpha_k\,\mathbb{E}\left[\varphi_k(e)\right].$
As is well known, under the Square loss the Bayes estimator is the posterior mean, and under the Absolute loss it is the posterior median. As shown in Equation 6, the Bayes estimator of our proposed loss function is a weighted combination of the Bayes estimators associated with the individual loss functions. For example, if we ensemble the Square and the Huber loss functions, the Bayes estimator of our ensemble loss is a trade-off between the posterior mean and the posterior median.
In the next two subsections, two properties of our proposed loss function are discussed. The third subsection contains the training scheme.
4.1 Bayes Consistency
The advantages of using a Bayes-consistent loss function are explained in Section 2. In the following, we prove that under some conditions our proposed loss function is Bayes consistent.
Theorem 1
If each base loss function $\varphi_k$ is Bayes consistent, then the ensemble loss $L_{ens}$ is also Bayes consistent.
Note
In binary classification, it is known that the decision function $f^*$ is Bayes consistent if the following equation holds bartlett2006convexity :
(7)  $\operatorname{sign}\!\left(f^*(x)\right) = \operatorname{sign}\!\left(\eta(x) - \tfrac{1}{2}\right), \qquad \eta(x) = P(y = 1 \mid x).$
Proof
Let $\eta(x)$ be $P(y = 1 \mid x)$. Assuming a margin loss $\varphi_k(yf)$, the conditional expected risk under each individual loss function is written as Equation 8 masnadi2009design :
(8)  $C_k(\eta, f) = \eta\,\varphi_k(f) + (1 - \eta)\,\varphi_k(-f).$
Each loss function considered here is convex. Thus, the optimal decision function $f_k^*$ can be obtained by setting the derivative of $C_k(\eta, f)$ with respect to $f$ to zero.
The conditional expected risk for our proposed ensemble loss function is
$C_{ens}(\eta, f) = \sum_{k=1}^{K} \alpha_k\,C_k(\eta, f).$
Since $C_{ens}$ is a linear combination of convex functions with nonnegative weights, it is also convex. For example, for $K = 2$, the expected risk of the linear combination of two convex loss functions given an input is
$C_{ens}(\eta, f) = \alpha_1\,C_1(\eta, f) + \alpha_2\,C_2(\eta, f),$
where $C_1$ and $C_2$ are as shown in Equation 8. The optimal decision function $f_{ens}^*$ under our ensemble loss is obtained by setting the derivative of $C_{ens}$ to zero.
Fig. 5 shows an example of two convex functions $C_1$ and $C_2$ with optimal points $f_1^*$ and $f_2^*$ respectively (say $f_1^* \le f_2^*$). From Fig. 5, the following are straightforward:
(9)  $C_1'(f) \ge 0$ and $C_2'(f) \ge 0$ for $f \ge f_2^*$,
(10)  $C_1'(f) \le 0$ and $C_2'(f) \le 0$ for $f \le f_1^*$.
Considering Equations 9 and 10, we easily conclude that $f_{ens}^*$ must lie between the two points $f_1^*$ and $f_2^*$ and can be formulated as $f_{ens}^* = \beta f_1^* + (1 - \beta) f_2^*$ for some $\beta \in [0, 1]$; that is, $f_{ens}^*$ is a convex combination of $f_1^*$ and $f_2^*$. Since Equation 7 holds for each $f_k^*$, it also holds for such a combination. Therefore, $L_{ens}$ is also Bayes consistent. This proof extends to $K$ loss functions in the same way.
4.2 Robustness
Robust loss function
A distance-based loss function is said to be robust if there is a constant $c$ such that the loss function does not assign large values to samples with $|e| > c$ genton1998highly .
In the following, a mathematical account of robustness is provided. Assuming a linear learner $f(x) = \langle w, x \rangle$, the empirical risk is formulated as follows:
$R_{emp}(w) = \frac{1}{N}\sum_{i=1}^{N} \varphi(e_i), \qquad e_i = y_i - \langle w, x_i \rangle,$
where $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors. To obtain the optimal value of $w$, the following equation has to be set to zero:
$\sum_{i=1}^{N} \varphi'(e_i)\,x_i = 0.$
Given the weight function $v(e) = \varphi'(e)/e$, the above equation can be reformulated as follows:
(11)  $\sum_{i=1}^{N} v(e_i)\,e_i\,x_i = 0.$
A loss function is robust if the following holds genton1998highly :
(12)  $\left|v(e)\,e\right| = \left|\varphi'(e)\right| \le M \quad \text{for all } e \text{ and some constant } M,$
that is, the influence of any single sample on the estimating equation is bounded.
Theorem 2
If all individual loss functions are robust, our proposed ensemble loss function is also robust.
Proof
For the ensemble loss function, the weight function is written as follows:
$v_{ens}(e) = \sum_{k=1}^{K} \alpha_k\,v_k(e).$
Hence $\left|v_{ens}(e)\,e\right| \le \sum_{k} \alpha_k \left|v_k(e)\,e\right|$; since each term is bounded by Equation 12 and the weights $\alpha_k$ are nonnegative and finite, the ensemble loss also satisfies Equation 12.
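Robustness can also be probed numerically through the influence $\varphi'(e) = v(e)\,e$. The derivatives below are the standard ones for the Welsch and Cauchy M-estimators; each stays bounded (here it even vanishes) for large $|e|$, so any nonnegative combination does too, whereas the $\ell_2$ influence grows without bound:

```python
import math

welsch_infl = lambda e, c=1.0: e * math.exp(-(e / c) ** 2)  # -> 0 as |e| grows
cauchy_infl = lambda e, c=1.0: e / (1 + (e / c) ** 2)       # -> 0 as |e| grows
l2_infl = lambda e: e                                       # unbounded

def ensemble_infl(e, alphas, infls):
    """Influence of the ensemble: sum_k alpha_k * phi_k'(e)."""
    return sum(a * f(e) for a, f in zip(alphas, infls))

big = 1e3
bounded = ensemble_infl(big, [0.5, 0.5], [welsch_infl, cauchy_infl])
```

An ensemble built only from robust base losses therefore keeps a bounded influence, while mixing in an $\ell_2$ component reintroduces sensitivity to gross errors.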
4.3 Training Phase using our Proposed Loss Function
We aim to minimize the empirical risk of our proposed loss function as follows:
(13)  $\min_{w} \sum_{i=1}^{N} L_{ens}(e_i) = \min_{w} \sum_{i=1}^{N} \sum_{k=1}^{K} \alpha_k\,\varphi_k(e_i),$
where $L_{ens}(e_i)$ denotes the value of the ensemble loss function for the $i$th sample, $\alpha_k$ denotes the weight associated with the $k$th loss function and $\varphi_k(e_i)$ denotes the value of the $k$th loss function for the $i$th sample. We first omit $\alpha$ from Equation 13 and will later show that the weights associated with each loss function appear implicitly through HQ optimization. According to HQ optimization, Equation 13 is restated as follows
where each loss is expressed through its HQ representation as a pointwise maximum of convex functions, and is consequently convex; $e_i$ is the error associated with the $i$th sample and $p$ is the auxiliary variable. By substituting the multiplicative HQ function, the following equation is obtained.
(14)  $\min_{w, P}\; \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ P_{ik}\,e_i^2 + \psi_k(P_{ik}) \right],$
where $P$ is a matrix of $N$ rows and $K$ columns and $P_{ik}$ denotes the element of the $i$th row and $k$th column. To solve the above problem, we make use of the HQ optimization algorithm, which is iterative and alternating. Each iteration consists of two steps, as follows he2014half
First
by considering $w$ constant, we arrive at the following optimization problem with one variable
(15)  $P^{(t+1)} = \arg\min_{P} \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ P_{ik}\,e_i^2 + \psi_k(P_{ik}) \right].$
The value of $P$ is calculated using the minimizer function $\delta$, which is precalculated for the special loss functions and their corresponding conjugate functions listed in Fig. 3. The optimal value of each entry is obtained by $P_{ik} = \delta_k(e_i)$.
Second
given the obtained value of $P$, we optimize over $w$ as follows
(16)  $w^{(t+1)} = \arg\min_{w} \sum_{i=1}^{N} s_i\,e_i^2, \qquad s_i = \sum_{k=1}^{K} P^{(t+1)}_{ik},$
where $P^{(t)}$ and $w^{(t)}$ are respectively the optimal values of the matrix $P$ and the vector $w$ at the $t$th iteration. The matrix $X^{T} S X$, with $S = \operatorname{diag}(s_1, \ldots, s_N)$, is generally positive semi-definite (PSD) and might not be invertible. To make it positive definite (PD), we add to it a diagonal matrix with strictly positive diagonal entries meyer2000matrix . The refinement of Equation 16 is as follows
(17)  $w^{(t+1)} = \left( X^{T} S X + \varepsilon I \right)^{-1} X^{T} S\,y,$
where $\varepsilon$ is a very small positive number and $I$ is the identity matrix. Equations 15 and 17 are performed iteratively in an alternating manner until convergence. $P^{(t)}$ denotes the matrix $P$ at the $t$th iteration, and $P_{ik}$, the element of the $i$th row and $k$th column of $P$, can be considered as the weight of the $i$th sample associated with the $k$th loss function. Thus, we can calculate the total weight of the $k$th loss function as $\alpha_k \propto \sum_{i=1}^{N} P_{ik}$.
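Putting the two steps together, the training loop can be sketched as follows. This is our reading of the procedure, under the assumptions that the regressor is linear, that each base loss contributes a multiplicative-form minimizer function, and that the per-loss weight is recovered by averaging the corresponding column of P; the function and variable names are ours:

```python
import numpy as np

def relf_train(X, y, minimizers, iters=30, eps=1e-8):
    """Alternate Eq.-15-style updates of the auxiliary matrix P (one
    column per base loss) with Eq.-17-style regularized weighted least
    squares for w; return w and normalized per-loss weights."""
    n, d = X.shape
    w = np.zeros(d)
    P = np.ones((n, len(minimizers)))
    for _ in range(iters):
        e = y - X @ w
        P = np.column_stack([delta(e) for delta in minimizers])  # n x K
        s = P.sum(axis=1)                       # total weight per sample
        A = X.T @ (s[:, None] * X) + eps * np.eye(d)
        w = np.linalg.solve(A, X.T @ (s * y))
    alphas = P.mean(axis=0)
    return w, alphas / alphas.sum()

# Minimizer functions: Welsch's delta(e) = exp(-e^2/c^2); for l2 the
# multiplicative weight is constant.
welsch_min = lambda e, c=2.0: np.exp(-(e / c) ** 2)
l2_min = lambda e: np.ones_like(e)

x = np.arange(-5.0, 5.5, 0.5)
y = 2.0 * x
y[5] += 40.0                                    # one outlier
w_hat, alphas = relf_train(x[:, None], y, [welsch_min, l2_min])
```

Column means of P play the role of the per-loss weights reported in Tables 3 and 4: the Welsch column downweights the outlier while the l2 column does not, and the fitted slope stays near the true value.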
Proposition 1
The sequence $\left\{ J\!\left(w^{(t)}, P^{(t)}\right) \right\}$ generated by Algorithm 1 converges.
Proof
According to Equation 15 and the properties of the minimizer function, we have
$J\!\left(w^{(t)}, P^{(t+1)}\right) \le J\!\left(w^{(t)}, P^{(t)}\right)$
for a fixed $w^{(t)}$. And by Equation 17, for a fixed $P^{(t+1)}$ we have
$J\!\left(w^{(t+1)}, P^{(t+1)}\right) \le J\!\left(w^{(t)}, P^{(t+1)}\right).$
Combining the above inequalities gives us
$J\!\left(w^{(t+1)}, P^{(t+1)}\right) \le J\!\left(w^{(t)}, P^{(t)}\right).$
Since the cost function is bounded below, the sequence decreases monotonically and finally converges he2014half .
4.4 Toy Example
To clarify our proposed approach, we carry out an experiment on a small synthetic dataset. Inputs $x$ are generated with step size 0.5. Labels lie on a line defined by $y = ax + \varepsilon$, where $\varepsilon$ denotes noise. We ensemble two loss functions, Welsch and $\ell_2$, selected from the functions listed in Fig. 3. We conduct our experiments in the presence of two kinds of noise: (1) zero-mean Gaussian noise and (2) outliers.
The weights $\alpha_{Welsch}$ and $\alpha_{\ell_2}$ associated with the base loss functions are iteratively updated according to Equations 15 and 17 respectively. These weights are presented in Table 3. The weight associated with the Welsch loss function decreases in the presence of outliers, while the weight of the $\ell_2$ function increases. The Welsch and $\ell_2$ functions are plotted in Fig. 6.
  Gaussian noise  Outliers

$\alpha_{\ell_2}$  0.1546  0.3578
$\alpha_{Welsch}$  0.8453  0.6422
$\hat{a}$  1.9975  1.9969
As shown in Fig. 6, Welsch is a bounded function while $\ell_2$ is unbounded. The $\ell_2$ loss therefore assigns a larger value to a sample with a large error than the Welsch loss does. Consequently, to decrease the influence of outliers on the predicted function, the weight obtained for the $\ell_2$ loss is smaller than the corresponding weight for the Welsch function in the presence of outliers.
We also combine three, four and five of the base loss functions in Fig. 3; the weights associated with each loss function are reported in Table 4. The results show that, among these five loss functions, the Fair loss contributes the most to the ensemble loss function. Also, the estimate $\hat{a}$ has remained close to its true value in all experiments.
Number of base loss functions  Kind of noise  $\ell_2$  Welsch  Huber  Fair  Log-cosh

3  Gaussian  0.26  0.58  0.14  –  – 
3  Outlier  0.28  0.62  0.09  –  – 
4  Gaussian  0.15  0.36  0.07  0.38  – 
4  Outlier  0.16  0.37  0.05  0.41  – 
5  Gaussian  0.13  0.28  0.05  0.30  0.21 
5  Outlier  0.11  0.27  0.04  0.33  0.23 
5 Experiments
In this section, we evaluate how our proposed loss function performs in regression problems. To make our proposed ensemble loss function stronger, we have chosen base loss functions that are diversified by their behaviour against outliers. The Welsch function is bounded and assigns small values to samples with large errors. The Huber loss penalizes samples linearly with respect to $|e|$, while the $\ell_2$ function heavily penalizes samples with large $|e|$. Therefore, these three loss functions, which differ completely in their behaviour against outliers, have been chosen as the base functions of our proposed ensemble loss.
We have conducted our experiments on several benchmark datasets, briefly described in Table 5. Mean Absolute Error (MAE) is used to compare the results and is calculated as $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $n$ is the number of test samples and $y_i$, $\hat{y}_i$ are the true and estimated labels respectively. In all experiments, we have used 10-fold cross-validation for model selection. This means that the original dataset is partitioned into 10 disjoint subsets; we then run 10 iterations, in each of which 9 subsets are used for training and the remaining one for testing, so that every subset is used exactly once for testing. The best model parameters are then selected.
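The evaluation protocol is straightforward to sketch; the fold-splitting helper below is our own illustrative implementation of the scheme just described:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over the test samples."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def kfold_indices(n, k=10, seed=0):
    """Partition n sample indices into k disjoint folds; each fold
    serves exactly once as the test set."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = kfold_indices(100, k=10)
score = mae([1.0, 2.0, 5.0], [2.0, 2.0, 3.0])  # (1 + 0 + 2) / 3 = 1.0
```

Because the folds are disjoint and jointly cover the dataset, every sample contributes to exactly one test score, which is what makes the averaged MAE an honest estimate.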
The data have been normalized beforehand. Table 5 shows the experimental results on the benchmark datasets in the natural situation (without adding outliers to the samples). The performance of all regressors is roughly similar across the datasets; however, RELF achieves the best results on most of them.
Dataset name  # of samples  # of features  Lasso  LARS  SVR  RELF 

Airfoil selfnoise  1503  6  3.5076  3.5147  3.7954  3.4937 
Energy Efficient Dataset  768  8  1.9741  1.9789  1.4021  1.9476 
Red Wine Quality Dataset  1599  12  2.6605  2.6637  2.7630  2.6561 
White Wine Quality Dataset  4898  12  1.9741  1.9789  1.4021  1.9733 
Abalone Dataset  4177  8  2.0345  1.8693  1.5921  1.5549 
Bodyfat_Dataset  252  13  0.91823  0.84867  0.6599  0.52465 
Building_Dataset  4208  14  1.017  1.0183  2.4750  1.0024 
Engine_Dataset  1199  2  1.3102  1.4113  0.9833  0.88694 
Vinyl_Dataset  506  13  1.1501  1.0598  0.9854  0.23952 
Simplefit_Dataset  94  1  0.63349  0.51819  0.4836  0.38523 
5.1 Discussion about the Convergence of HQ Optimization Method
In this section, the convergence of the HQ optimization method in our algorithm is studied experimentally. Table 6 shows values of the cost function $J$ in successive iterations. We define in Equation 18 a decrease ratio which shows how much of the total cost reduction is achieved by iteration 10 relative to iteration 30. This ratio has been calculated for each dataset separately and is shown in Table 6. The results show that the HQ optimization method converges quickly, within 30 iterations in all cases. Moreover, $w^{(10)}$ is already a good approximation of the optimal point. We have also provided the CPU time in seconds for each dataset in Table 6.
(18)  $\mathrm{Decrease\ Ratio} = \dfrac{J^{(1)} - J^{(10)}}{J^{(1)} - J^{(30)}}$
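Reading the decrease ratio as the fraction of the total cost reduction already achieved by iteration 10, i.e. $(J^{(1)} - J^{(10)})/(J^{(1)} - J^{(30)})$, reproduces the tabulated value for the Airfoil row of Table 6:

```python
def decrease_ratio(j1, j10, j30):
    """(J(1) - J(10)) / (J(1) - J(30)): how much of the total observed
    cost reduction is achieved within the first 10 iterations."""
    return (j1 - j10) / (j1 - j30)

# Airfoil self-noise row of Table 6:
r = decrease_ratio(617.9089, 615.2434, 615.2070)  # ~ 0.9865
```

A ratio near 1 means the remaining 20 iterations contribute almost nothing, which is the sense in which the method "quickly converges".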
Dataset name  1  2  …  10  …  29  30  Decrease Ratio  CPU time

Airfoil self-noise  617.9089  615.9067  …  615.2434  …  615.2074  615.2070  0.986528  0.3125
Energy Efficient  298.9774  294.7328  …  291.9451  …  291.9144  291.8603  0.988085  0.6094
Red Wine Quality  658.9971  657.8419  …  657.5159  …  657.4989  657.4986  0.988455  0.2656
White Wine Quality  324.2221  291.8042  …  291.8410  …  291.6763  291.6837  0.995166  1
Abalone  1.5604  1.5490  …  1.5362  …  1.5331  1.5330  0.883212  0.6250
Bodyfat  98.6765  98.4771  …  98.4393  …  98.4389  98.4389  0.998316  0.0469
Building  3.3932  3.3650  …  3.3434  …  3.3400  3.3399  0.934334  0.8438
Engine  1.6640  1.6026  …  1.5500  …  1.5397  1.5394  0.914928  0.5313
Vinyl  2.5582  2.5574  …  2.5573  …  2.5573  2.5573  1  1.9063
Simplefit  44.2226  43.4881  …  43.1884  …  43.1652  43.1649  0.977782  0.0156
5.2 Discussion about Robustness
In this section, we investigate the robustness of our proposed loss function through experiments on the benchmark datasets listed in Table 5. 10-fold cross-validation has been used to tune the hyperparameters. To study robustness, we add outliers to the training and validation samples: the data are corrupted at two levels of contamination, in each case by randomly selecting a fixed percentage of the samples and adding outliers to their labels. Tables 7 and 8 show, for the two levels respectively, the increase ratio of the MAE in the face of outliers compared with the natural situation (without outliers). The best results are shown in bold. Lasso, LARS and SVR are three well-known regression methods used for comparison.
Dataset Name  Lasso  LARS  SVR  RELF 

Airfoil selfnoise  1.0127  1.0191  1.0227  1.0016 
Energy Efficient Dataset  1.4025  1.3641  1.0971  1.0082 
Red Wine Quality Dataset  1.0683  1.0544  1.0235  1.0139 
White Wine Quality Dataset  1.2941  1.2897  1.0971  1.0602 
Abalone Dataset  1.0264  1.1246  1.0183  1.0077 
Bodyfat_Dataset  1.5125  1.5279  1.0242  1.0692 
Building_Dataset  1.0016  1.0014  1.0143  1.0004 
Engine_Dataset  1.1374  1.0242  1.0569  1.0089 
Vinyl_Dataset  1.2345  1.148  1.1987  1.0978 
Simplefit_Dataset  1.3122  1.8628  1.0023  1.2359 
Dataset Name  Lasso  LARS  SVR  RELF 

Airfoil selfnoise  5.9316  6.6159  1.0213  1.0178 
Energy Efficient Dataset  2.544  2.0297  1.1555  1.0016 
Red Wine Quality Dataset  1.408  2.0033  1.0270  1.0024 
White Wine Quality Dataset  2.1278  2.0297  1.1555  1.0219 
Abalone Dataset  2.0264  3.0233  1.0338  1.0318 
Bodyfat_Dataset  2.3491  1.1265  1.0338  1.2882 
Building_Dataset  1.985  1.1438  1.0151  1.0022 
Engine_Dataset  2.1156  2.0233  1.3665  1.0061 
Vinyl_Dataset  3.5624  2.2455  1.2785  1.0651 
Simplefit_Dataset  8.674  8.3125  1.4424  1.5212 
Table 7 lists the ratio of the MAE value for data with outliers to the MAE value for the original data. Except on two datasets, RELF is the least influenced by outliers among the compared regressors. Table 8 presents the corresponding results at the higher contamination level: RELF is again the least influenced on all but two datasets, and SVR obtains better results than Lasso and LARS.
5.3 Comparison with StateoftheArt Ensemble Regressors
We also investigate the effectiveness of our proposed method in comparison with several promising ensemble regressors through experiments on benchmark datasets. The datasets have been selected from the LIBSVM data page (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). We compare RELF with four ensemble regression methods: Locally Weighted Linear Regression (LWLR) zhang2016online , Locally Linear Ensemble for Regression (LLER) kang2018locally , Artificial Neural Network (ANN) rodriguez2015machine and Random Forest (RF) rodriguez2015machine . The first two models are based on combining local experts, and the latter two are nonlinear. In the following, the implementation settings for each method are provided. LWLR, which is based on the local-expert approach, builds a local linear regressor for each test sample based on its neighbours. Each linear model is trained so that it assigns higher weights to training samples that are closer to the given test sample; the weights are calculated with a Gaussian kernel, as fully explained in schaal2002scalable .
For the ANN, the number of hidden nodes is selected from a predefined set. If the loss on the validation batch fails to decrease during six successive epochs, the training phase ends.
To implement RF, we set the number of trees to 100 and the bootstrap sample size to 80 percent of the training samples for each tree. The minimum size of leaf nodes is selected from {0.1, 0.2, 0.5, 1, 2, 5} percent of the training samples.
In all experiments, we use 10-fold cross-validation for parameter tuning, and the data are initially normalized into the range [-1, 1]. Root Mean Square Error (RMSE) is used to compare the results and is calculated as $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, where $n$ is the number of test samples and $y_i$, $\hat{y}_i$ are the true and estimated labels respectively. Table 9 provides some extra information about each dataset.
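A minimal sketch of this preprocessing and metric (column-wise min-max scaling; a constant feature column would need special handling, which we omit):

```python
import numpy as np

def minmax_scale(X):
    """Scale each feature column into [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0

def rmse(y_true, y_pred):
    """Root mean square error over the test samples."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
Xs = minmax_scale(X)   # each column now spans exactly [-1, 1]
```

Scaling every feature to the same range keeps the regularized least-squares step and the distance-based kernels of the compared methods from being dominated by features with large raw magnitudes.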
Dataset  Source  Number of Data  Number of Feature 

BodyFat  Source: StatLib / bodyfat  252  14 
Abalone  Source: UCI / Abalone  4,177  8 
Cadata  Source: StatLib / houses.zip  20,640  8 
Cpusmall  Source: Delve / compactiv  8,192  12 
Housing  Source: UCI / Housing (Boston)  506  13 
Mg  Source: [GWF01a]  1,385  6 
The five mentioned models have been numerically compared on the 6 datasets, and the results are plotted as bar charts in Figure 7. ANN and RF are nonlinear models, while LWLR and LLER are locally linear. RELF yields the lowest RMSE value on almost all datasets, except for Abalone and Cpusmall, on which ANN and RF respectively obtain the lowest RMSE.