RELF: Robust Regression Extended with Ensemble Loss Function

10/25/2018 ∙ by Hamideh Hajiabadi, et al. ∙ Ferdowsi University of Mashhad

Ensemble techniques are powerful approaches that combine several weak learners to build a stronger one. As a meta-learning framework, ensemble techniques can easily be applied to many machine learning methods. Inspired by ensemble techniques, in this paper we propose an ensemble loss function applied to a simple regressor. We then propose a half-quadratic learning algorithm to find the parameters of the regressor and the optimal weights associated with each loss function. Moreover, we show that our proposed loss function is robust in noisy environments. For a particular class of loss functions, we show that our proposed ensemble loss function is Bayes consistent and robust. Experimental evaluations on several datasets demonstrate that our proposed ensemble loss function significantly improves the performance of a simple regressor in comparison with state-of-the-art methods.




1 Introduction

Loss functions are fundamental components of machine learning systems and are used to train the parameters of the learner model. Since standard training methods aim to determine the parameters that minimize the average value of the loss given an annotated training set, loss functions are crucial for successful training xiao2017ramp ; zhao2010convex . Bayesian estimators are obtained by minimizing the expected loss function. Different loss functions lead to different Bayes-optimal estimators with possibly different characteristics. Thus, in each environment the choice of the underlying loss function is important, as it will impact the performance uhlich2012bayesian ; wang2003multiscale .

Letting $\hat{\theta}$ denote the estimate of a true parameter $\theta$, the loss function $L(\theta, \hat{\theta})$ is a positive function which assigns a loss value to each estimation, indicating how inaccurate the estimation is steinwart2008support . Loss functions assign a value to each sample, indicating how much that sample contributes to solving the optimization problem. Each loss function comes with its own advantages and disadvantages. In order to put our results in context, we start by reviewing three popular loss functions (0-1, Ramp and Sigmoid) and give an overview of their advantages and disadvantages.

If an outlier is given a very large value by the loss function, it might dramatically affect the decision function hajiabadi2017extending . The 0-1 loss function is known as a robust loss because it assigns value 1 to all misclassified samples, including outliers, and thus an outlier does not influence the decision function, leading to a robust learner. On the other hand, the 0-1 loss penalizes all misclassified samples equally with value 1, and since it does not enhance the margin, it is not an appropriate choice for applications where the margin is important xiao2017ramp .

The Ramp loss function is defined similarly to the 0-1 loss function, with the only difference that it also penalizes some correctly classified samples, namely those with small margins. This minor difference makes the Ramp loss function appropriate for applications where the margin is important tang2018ramp ; xiao2017ramp . On the downside, the Ramp loss function is not differentiable everywhere, and hence is not well suited to gradient-based optimization.

The Sigmoid loss function is almost the same as the 0-1 loss function, except that it is differentiable, which in turn makes optimization significantly easier. However, it assigns a (very small) non-zero value to correctly classified samples, meaning that those samples also contribute to solving the optimization problem. Hence, in spite of easier optimization, the Sigmoid loss function leads to a less sparse optimization problem in comparison with the 0-1 loss function vapnik1998statistical .
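The three losses reviewed above can be compared side by side in a minimal sketch. The parameterizations below are assumptions chosen for illustration (the margin is written as `z = y * f(x)`, the Ramp loss is clipped to [0, 1], and the Sigmoid loss is `1 / (1 + exp(z))`); the cited papers may scale or shift them differently.

```python
import numpy as np

def zero_one_loss(z):
    """0-1 loss on the margin z = y * f(x); z <= 0 counts as misclassified (assumed convention)."""
    return (np.asarray(z, dtype=float) <= 0).astype(float)

def ramp_loss(z):
    """Ramp loss: like 0-1, but correct samples with margin below 1 are also penalized."""
    return np.clip(1.0 - np.asarray(z, dtype=float), 0.0, 1.0)

def sigmoid_loss(z):
    """Smooth surrogate of the 0-1 loss; strictly positive even for correct samples."""
    return 1.0 / (1.0 + np.exp(np.asarray(z, dtype=float)))

margins = np.array([-2.0, 0.5, 2.0])
print(zero_one_loss(margins), ramp_loss(margins), sigmoid_loss(margins))
```

The sample with margin 0.5 is correctly classified, so the 0-1 loss ignores it, while the Ramp loss still penalizes it and the Sigmoid loss gives every sample a non-zero value.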

There are many other examples of different loss functions showing that while a loss function might be good for certain applications, it might be unsuitable for many others. Inspired by ensemble methods, in this paper we propose the use of an ensemble of loss functions during the training stage. The ensemble technique is one of the most influential learning approaches. Theoretically, it can boost weak learners, whose accuracies are slightly better than random guesses, into arbitrarily accurate strong learners napoles2017rough . This method would be effective when it is difficult to design a powerful learning algorithm directly bai2014bayesian ; zhang2016bayesian ; mannor2001weak . As a meta-learning framework, it can be applied to almost all machine learning algorithms to improve their prediction accuracies.

Our goal in this paper is to propose a new ensemble loss function, which we later apply to a simple regressor. Half-Quadratic (HQ) minimization, which is a fast alternating direction method, is used to learn the regressor's parameters. In each iteration, HQ approximates the convex or non-convex cost function with a convex one and then optimizes it geman1992constrained . Our main contributions are as follows.

  • Inspired by ensemble-induced methods, we propose an ensemble loss whose properties are inherited from its base loss functions. Moreover, we show that each loss is a special case of our proposed loss function.

  • We develop both online and offline learning frameworks to find the weights associated with each loss function, and so to build an ensemble loss function. For a particular class of base losses, we prove that the resulting ensemble loss function is Bayes consistent and robust.

This paper is structured as follows. We review some existing loss functions and several promising ensemble regressors in Section 2, which contains one subsection for each. We briefly explain Half-Quadratic (HQ) programming in Section 3. Our proposed framework is discussed in Section 4, for which we provide implementation and test results in Section 5. Finally, we conclude in Section 6 with a list of problems for future work.

2 A Review of Loss Functions

This work draws on two broad areas of research: loss functions and ensemble-based regression methods. This section reviews both.

2.1 A Review of Loss Functions

In machine learning, loss functions are divided into two categories, margin-based and distance-based steinwart2008support . Margin-based loss functions are used for classification feng2016robust ; zhang2001text ; khan2013semi ; bartlett2006convexity , while distance-based loss functions are generally used for regression. In this paper, we focus only on distance-based loss functions.

Distance-Based Loss Functions

Let $x \in X$, $y \in Y$ and $f(x)$ denote an input, the corresponding true label and the estimated label respectively. A distance-based loss function $L(y, f(x)) = \psi(y - f(x))$ is a penalty function where $r = y - f(x)$ is called the distance sangari2016convergence ; chen2017kernel . The risk associated with the loss function $L$ is described as

$$R_{L,P}(f) = \int_{X \times Y} L(y, f(x)) \, dP(x, y)$$

where $P$ is the joint probability distribution over $X$ and $Y$. The ultimate aim of a learning algorithm is to find a function $f^*$ among a fixed class of functions $F$ for which the risk is minimal steinwart2008support ; painsky2016isotonic ,

$$f^* = \arg\min_{f \in F} R_{L,P}(f)$$

Generally $R_{L,P}(f)$ cannot be computed because the distribution $P$ is unknown. However, an approximation of $R_{L,P}(f)$, which is called the empirical risk, can be computed by averaging the loss function on the training set containing $n$ samples holland2016minimum ,

$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))$$
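As a small illustration of the empirical risk, the sketch below averages a distance-based loss over a handful of samples. The function name `empirical_risk` and the two example losses are illustrative choices, not code from the paper.

```python
import numpy as np

def empirical_risk(loss, y_true, y_pred):
    """Average of a distance-based loss over the residuals r = y - f(x)."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(loss(r)))

def square(r):
    return r ** 2

def absolute(r):
    return np.abs(r)

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(empirical_risk(square, y_true, y_pred))    # mean of [0, 0, 4]
print(empirical_risk(absolute, y_true, y_pred))  # mean of [0, 0, 2]
```

The single large residual contributes quadratically under the Square loss and only linearly under the Absolute loss, which is the sensitivity difference the later sections build on.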

A loss function should be such that minimizing the associated risk leads to the Bayes decision function. The loss is said to be Bayes consistent if, as the sample size increases, the resulting function converges to the Bayes decision function zhang2004statistical ; buja2005loss ; friedman2000additive . The Bayes decision function is explained in detail in the cited papers.

Square lopez2018robust
Hinge bartlett2008classification
Logistic fan2008liblinear
Table 1: Well-known Bayes Consistent Loss Functions

Table 1 and Fig.1 illustrate known examples of Bayes consistent loss functions masnadi2010design ; masnadi2009design . Fig.1 shows that all of these functions are convex and unbounded. As we mentioned in the previous section, loss functions assign a value to each sample which indicates how much that sample contributes to solving the optimization problem. Unbounded loss functions assign large values to samples with large errors and thus they are more sensitive to noise. Hence, under unbounded loss functions, the robustness deteriorates in noisy environments.

Figure 1: Well-known Bayes consistent loss functions: Hinge (blue), Square (green), Logistic (red) and Exp. (orange)

Fig. 2 shows two unbounded loss functions (the Exp. loss and the Logistic loss) and a bounded one (the Savage loss). SavageBoost, which uses the Savage loss function, leads to a more robust learner in comparison with AdaBoost and LogitBoost, which use the Exp. loss and the Logistic loss function respectively masnadi2009design . Several researchers have suggested that although convex loss functions make optimization easier, their robustness deteriorates in the presence of outliers miao2016rboost . For example, while LS-SVR uses the Square loss and is sensitive to outliers, RLS-SVR uses a non-convex least squares loss function to overcome this limitation of LS-SVR wang2014robust .

There are many other distance-based loss functions which are not considered Bayes consistent but have been widely used in the literature. Some of these are shown in Table 2.

Figure 2: The Exp. loss (red), the Logistic loss (blue) and the Savage loss function (green)

Name                       Description
Absolute                   vapnik2013nature
Huber                      huber1964robust
C-loss                     liu2006correntropy
ε-insensitive Ramp-loss    tang2018ramp
Table 2: Some widely used loss functions

The Huber loss function, which combines the Square and the Absolute loss functions, is shown in Table 2. It has been used in robust regression, where it significantly improved robustness compared with standard regression. The main difference between the Absolute and Huber loss functions is around the point $r = 0$. While the Absolute loss function is not differentiable at $r = 0$, the Huber loss function is quadratic near zero and is therefore differentiable there huber1964robust .
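A minimal sketch of the Huber loss and its derivative illustrates the point about differentiability at zero. The threshold name `c` and the scaling `0.5 * r**2` are a common convention and are assumed here; the exact form used in Table 2 did not survive extraction.

```python
import numpy as np

def huber_loss(r, c=1.0):
    """Quadratic for |r| <= c, linear beyond c (assumed standard parameterization)."""
    r = np.asarray(r, dtype=float)
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r ** 2, c * a - 0.5 * c ** 2)

def huber_grad(r, c=1.0):
    """Derivative of the Huber loss: r inside the quadratic zone, +/- c outside."""
    return np.clip(np.asarray(r, dtype=float), -c, c)

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(huber_loss(residuals))
print(huber_grad(residuals))  # continuous through r = 0, unlike sign(r) for the Absolute loss
```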

Correntropy, which is rooted in Rényi's entropy, is a local similarity measure. It is based on the probability of how similar two random variables are in a neighbourhood of the joint space. The size of this neighbourhood can be adjusted by a kernel bandwidth. The C-loss function, which is shown in Table 2, is inspired by the Correntropy criterion and is known as a robust loss function liu2006correntropy . Several researchers have used the C-loss function to improve the robustness of their learning algorithms peng2017maximum ; liu2007correntropy ; zhao2012adaptive ; chen2014steady ; chen2016generalized .

The ε-insensitive Ramp-loss, which is inspired by the Ramp loss function xiao2017ramp , is proposed in tang2018ramp . It is a robust and margin-enhancing loss function which was applied to linear and kernel Support Vector Regression (SVR) in tang2018ramp . The weights assigned to each sample are limited to be no higher than a pre-defined value, and thus the negative effect of outliers can be effectively reduced.

Loss functions can also be used for Dimensionality Reduction (DR), which is a kind of feature reduction technique. For example, in xie2018matrix the nuclear norm (N norm) is used as the loss function in solving the DR problem based on a matrix regression model. The N norm preserves the low-rank information of samples well and results in low-dimensional data, which makes it a good choice for DR purposes.

While each individual loss function has its own advantages and disadvantages, in this paper we propose an ensemble loss function which is a combination of several individual loss functions. By doing so, we hope to produce a strong ensemble loss function whose advantages are inherited from each individual loss function. In the next section, ensemble learning and several promising ensemble methods are discussed.

2.2 Ensemble Learning

Ensemble learning combines the outputs of several models to make a prediction. It aims to improve the overall accuracy and robustness over individual models dudek2016pattern ; mendes2012ensemble . Most ensemble methods focus on classification problems, and relatively few address regression tasks. Ensemble methods comprise two steps: (1) a generation step, in which a set of base learners is built, and (2) an integration step, in which these base learners are combined in order to make the final prediction. Base learners can be combined statically or dynamically. A static combination uses predefined combination rules for all instances, while a dynamic one makes use of different rules for different instances ko2008dynamic ; cruz2015meta . In the following, the most promising ensemble methods are discussed.

Two of the most appealing ensemble methods for regression trees are Bagging and Random Forest. They are among the most commonly used algorithms due to their simplicity, consistency and accuracy breiman1996bagging ; breiman2001random . Breiman's Random Forest approach constructs a multitude of decision trees and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Randomness is incorporated into the trees by choosing the split at each node from a randomly selected feature subset. Moreover, the subset considered at one node is independent of the subset considered at the previous node.

There are some interesting ensemble methods for neural networks. One of them, which is based on negative correlation, is called Evolutionary Ensembles with Negative Correlation Learning (EENCL) liu2000evolutionary . It randomly changes the weights of an existing neural network by using mutation. It also determines the ensemble size automatically. The importance of this approach is due to its theoretical foundations. Another interesting method for neural networks, presented in 2003 islam2003high , builds the base learners and the ensemble simultaneously. Therefore, it saves time compared with methods that first generate base learners and then combine them.

Other research is based on local experts, which are specialized in local prediction and aim to achieve better prediction accuracy than globally best models. Recently, a Locally Linear Ensemble Regressor (LLER) has been proposed, which first divides the original dataset into locally linear regions by employing an EM procedure and then builds a linear model for each region in order to form the ensemble model kang2018locally .

In this paper, we dynamically combine several loss functions, and the weights associated with each individual loss are obtained during the training phase using Half-Quadratic (HQ) optimization. In the following section, we provide an overview of HQ optimization.

3 Half-Quadratic Optimization

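Let $\phi_v(\cdot)$ be a function of a vector $v$, defined as $\phi_v(v) = \sum_j \phi(v_j)$ where $v_j$ is the $j$th entry of $v$. In machine learning problems, we often aim to solve an optimization problem like

$$\min_{v} \; \phi_v(v) + J(v) \qquad (1)$$

where $v$ is usually the learner's parameters, $\phi(\cdot)$ is a loss function which can be convex or non-convex, and $J(\cdot)$ is an optional convex penalty function on $v$ which acts as the regularization term. According to Half-Quadratic (HQ) optimization he2014half , for a fixed $v_j$, we have

$$\phi(v_j) = \min_{p_j} \; Q(v_j, p_j) + \psi(p_j) \qquad (2)$$

where $\psi(\cdot)$ is the convex conjugate function of $\phi(\cdot)$ and $Q(v_j, p_j)$ is an HQ function which is modeled by the additive or the multiplicative form. The additive and the multiplicative forms for $Q(v_j, p_j)$ are respectively formulated as $Q_A(v_j, p_j) = (v_j \sqrt{c} - p_j / \sqrt{c})^2$ and $Q_M(v_j, p_j) = \frac{1}{2} p_j v_j^2$. Let $Q_v(v, p) = \sum_j Q(v_j, p_j)$; the vector form of Equation 2 is as follows

$$\phi_v(v) = \min_{p} \; Q_v(v, p) + \sum_j \psi(p_j) \qquad (3)$$

By substituting Equation 3 for $\phi_v(v)$ in Equation 1, the following cost function is obtained,

$$\min_{v} \; \{\phi_v(v) + J(v)\} = \min_{v, p} \; Q_v(v, p) + \sum_j \psi(p_j) + J(v) \qquad (4)$$

Assuming the variable $v$ fixed, the optimal value for $p$ is obtained by the minimizer function $\delta(\cdot)$, which is computed as

$$p_j = \delta(v_j) = \arg\min_{p_j} \; \{Q(v_j, p_j) + \psi(p_j)\}$$

and it is only related to $\phi(\cdot)$ (some specific forms of $\phi(\cdot)$ and the corresponding $\delta(\cdot)$ are shown in Fig. 3). For each $v_j$ the value of $\delta(v_j)$ is such that

$$Q(v_j, \delta(v_j)) + \psi(\delta(v_j)) \le Q(v_j, p_j) + \psi(p_j) \quad \text{for all } p_j$$

The optimization problem 4 is convex since it is the summation of three convex functions, and it can be minimized by alternating steps as follows

$$p_j^{t+1} = \delta(v_j^{t}), \qquad v^{t+1} = \arg\min_{v} \; Q_v(v, p^{t+1}) + J(v)$$

where $t$ denotes the iteration number. At each iteration, the objective function is decreased until convergence. The HQ method is fully explained in geman1995nonlinear ; he2014half .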
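The alternating scheme can be tried on a one-dimensional toy problem: estimating a location parameter `mu` that minimizes the sum of Welsch losses of `y_i - mu`. This is only a sketch under stated assumptions: the multiplicative form with HQ function `0.5 * p * v**2`, for which the minimizer function is `phi'(v) / v`; a Welsch loss scaled as `(sigma**2 / 2) * (1 - exp(-v**2 / sigma**2))`; and no regularization term. The names `welsch_delta` and `hq_location` are hypothetical.

```python
import numpy as np

def welsch_delta(v, sigma=1.0):
    """phi'(v) / v for phi(v) = (sigma^2 / 2) * (1 - exp(-v^2 / sigma^2))."""
    return np.exp(-(np.asarray(v, dtype=float) / sigma) ** 2)

def hq_location(y, delta, n_iter=50):
    """Alternate between the auxiliary variables p and the location estimate mu."""
    y = np.asarray(y, dtype=float)
    mu = float(np.median(y))  # starting point (an arbitrary choice for this sketch)
    for _ in range(n_iter):
        p = delta(y - mu)                       # step 1: p given the current residuals
        mu = float(np.sum(p * y) / np.sum(p))   # step 2: weighted least squares given p
    return mu

y = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 100.0])  # one gross outlier
print(hq_location(y, welsch_delta), float(np.mean(y)))
```

With `p` fixed, step 2 is an ordinary weighted least-squares problem, which is what makes each iteration cheap; the outlier receives a weight close to zero and barely affects the estimate, unlike the plain mean.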

Figure 3: Some estimators and the corresponding minimizer functions $\delta(\cdot)$ for the additive and multiplicative HQ forms he2014half


4 The Proposed Method

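Let $\{(x_i, y_i)\}_{i=1}^{n}$ denote all samples, where $x_i \in \mathbb{R}^d$ are the inputs and $y_i \in \mathbb{R}$ are the labels. Let $w$ be the parameters of the regressor, which estimates the predicted label by $\hat{y}_i = w^T x_i$. Let $\{\phi_j\}_{j=1}^{m}$ denote $m$ weak loss functions. We aim to find an optimal $w$ and the best weights, $\lambda = (\lambda_1, \dots, \lambda_m)$, associated with each loss function. We need to add a further constraint to avoid yielding near zero values for all weights. Our proposed ensemble loss function is defined in Equation 5.

$$L(y_i, \hat{y}_i) = \sum_{j=1}^{m} \lambda_j \phi_j(y_i - \hat{y}_i), \quad \text{s.t.} \; \sum_{j=1}^{m} \lambda_j = 1, \; \lambda_j \ge 0 \qquad (5)$$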
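The ensemble loss can be sketched as a weighted sum of base losses evaluated on the residual. The sketch assumes the weights are non-negative and sum to one, and it uses two illustrative base losses (a scaled square loss and a Welsch-type bounded loss); the helper name `ensemble_loss` is hypothetical.

```python
import numpy as np

def ensemble_loss(r, base_losses, weights):
    """Weighted sum of base losses on residual r; weights must lie on the simplex."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to one")
    r = np.asarray(r, dtype=float)
    return sum(lam * phi(r) for lam, phi in zip(weights, base_losses))

def square(r):
    return 0.5 * r ** 2

def welsch(r):
    return 0.5 * (1.0 - np.exp(-r ** 2))

residuals = np.array([0.1, 1.0, 10.0])
print(ensemble_loss(residuals, [square, welsch], [0.3, 0.7]))
print(ensemble_loss(residuals, [square, welsch], [1.0, 0.0]))  # reduces to the square loss
```

Putting all the weight on one base loss recovers that loss exactly, which is the sense in which each base loss is a special case of the ensemble.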


Fig. 4 shows the model of our proposed method, with the bold box representing the novelty of this paper. In the training phase, the weights associated with the loss functions are learned and our proposed ensemble loss function is formed. To ease computations, we select each base loss function from the M-estimator functions introduced in Fig. 3.

Figure 4: The Proposed Model (diagram blocks: Set of Loss Functions, Risk Optimizer, Training Phase, Testing Phase).

The expected risk of the proposed loss function is defined as

$$R_L(f) = E[L(y, f(x))] = \sum_{j=1}^{m} \lambda_j \, E[\phi_j(y - f(x))] \qquad (6)$$

It is well known that under the Square loss the Bayes estimator is the posterior mean, and that the Bayes estimator with respect to the Absolute loss is the posterior median. As shown in Equation 6, the expected risk of our proposed loss function is a weighted summation of the risks of the individual loss functions, so its Bayes estimator is a trade-off between the Bayes estimators associated with each individual loss function. For example, if we ensemble the Square and the Absolute loss functions, then the Bayes estimator of our ensemble loss function is a trade-off between the posterior mean and the posterior median.

In the next two subsections, two properties of our proposed loss function are discussed. The third subsection contains the training scheme.

4.1 Bayes Consistency

The advantages of using a Bayes-consistent loss function are explained in Section 2. In the following, it is proved that under some conditions our proposed loss function is Bayes consistent.

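Theorem 1

If each $\phi_j$ is a Bayes-consistent loss function, then the ensemble loss $L = \sum_{j=1}^{m} \lambda_j \phi_j$ is also Bayes consistent.

Proof. In binary classification, it is known that the decision function $f^*$ is Bayes consistent if the following equation holds bartlett2006convexity .

$$\mathrm{sign}(f^*(x)) = \mathrm{sign}(2\eta(x) - 1) \qquad (7)$$

Let $\eta$ be $P(y = 1 \mid x)$; by assuming $y \in \{-1, +1\}$, the conditional expected risk under each individual loss function, $\phi_j$, is written as Equation 8 masnadi2009design ,

$$C_{\phi_j}(\eta, f) = \eta \, \phi_j(f) + (1 - \eta) \, \phi_j(-f) \qquad (8)$$

We assume that each loss function $\phi_j$ is convex, as is the case for the Bayes-consistent losses in Table 1. Thus, the optimal decision function $f_j^*$ can be obtained by setting the derivative of $C_{\phi_j}$ to zero,

$$\frac{\partial C_{\phi_j}(\eta, f)}{\partial f} \Big|_{f = f_j^*} = 0$$

The conditional expected risk for our proposed ensemble loss function is described as

$$C_L(\eta, f) = \eta \, L(f) + (1 - \eta) \, L(-f) = \sum_{j=1}^{m} \lambda_j \, C_{\phi_j}(\eta, f)$$

Since $C_L$ is a linear combination of convex functions with non-negative weights, it is also convex. For example, by assuming $m = 2$, the expected risk for the linear combination of two convex loss functions given an input $x$ is written as

$$C_L(\eta, f) = \lambda_1 C_{\phi_1}(\eta, f) + \lambda_2 C_{\phi_2}(\eta, f)$$

where $C_{\phi_1}$ and $C_{\phi_2}$ have been shown in Equation 8. The optimal decision function $f^*$ under our ensemble loss function is obtained by setting the derivative of $C_L$ to zero,

$$\lambda_1 \frac{\partial C_{\phi_1}(\eta, f)}{\partial f} \Big|_{f = f^*} + \lambda_2 \frac{\partial C_{\phi_2}(\eta, f)}{\partial f} \Big|_{f = f^*} = 0 \qquad (9)$$

Figure 5: Expected Loss for Two Bayes-consistent Loss Functions

Fig. 5 shows an example of two convex functions $C_{\phi_1}$ and $C_{\phi_2}$ with the optimal points $f_1^*$ and $f_2^*$ respectively. Assuming without loss of generality that $f_1^* \le f_2^*$, the following relations are straightforward from Fig. 5.

$$\frac{\partial C_{\phi_1}}{\partial f} < 0 \;\text{and}\; \frac{\partial C_{\phi_2}}{\partial f} < 0 \;\; \text{for } f < f_1^*, \qquad \frac{\partial C_{\phi_1}}{\partial f} > 0 \;\text{and}\; \frac{\partial C_{\phi_2}}{\partial f} > 0 \;\; \text{for } f > f_2^* \qquad (10)$$

Considering Equations 9 and 10, the weighted sum of the two derivatives cannot vanish outside the interval $[f_1^*, f_2^*]$, so $f^*$ must lie between the two points $f_1^*$ and $f_2^*$ and can be formulated as $f^* = \alpha f_1^* + (1 - \alpha) f_2^*$ with $\alpha \in [0, 1]$. It means that $f^*$ is a convex combination of $f_1^*$ and $f_2^*$. Since Equation 7 holds for each $f_j^*$, both points have the sign of $2\eta - 1$, and so does any convex combination of them. Therefore, $f^*$ is also Bayes consistent. This proof can easily be extended to $m$ loss functions in the same way.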
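The key step of the argument, that the minimizer of the ensemble's conditional risk lies between the minimizers of the individual conditional risks, can be checked numerically. The sketch below assumes two standard margin-based Bayes-consistent losses (Square and Logistic in their usual margin forms), a fixed posterior `eta = 0.7`, equal weights, and a simple grid search; all names are illustrative.

```python
import numpy as np

def conditional_risk(phi, eta, f):
    """eta * phi(f) + (1 - eta) * phi(-f) for labels in {-1, +1}."""
    return eta * phi(f) + (1.0 - eta) * phi(-f)

def square(v):
    return (1.0 - v) ** 2

def logistic(v):
    return np.log1p(np.exp(-v))

def argmin_on_grid(fn, grid):
    return float(grid[np.argmin(fn(grid))])

eta = 0.7
grid = np.linspace(-3.0, 3.0, 60001)
f_square = argmin_on_grid(lambda f: conditional_risk(square, eta, f), grid)
f_logistic = argmin_on_grid(lambda f: conditional_risk(logistic, eta, f), grid)
f_ensemble = argmin_on_grid(
    lambda f: 0.5 * conditional_risk(square, eta, f)
    + 0.5 * conditional_risk(logistic, eta, f),
    grid,
)
print(f_square, f_logistic, f_ensemble)
```

All three minimizers are positive when `eta > 0.5`, so they agree in sign with `2 * eta - 1`, which is the Bayes-consistency condition used in the proof.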

4.2 Robustness

Robust loss function

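A distance-based loss function $L(r)$ is said to be robust if there is a constant $c$ such that the loss function does not assign large values to samples with $|r| > c$ genton1998highly .

In the following, the mathematical explanation of the robustness is provided. Assuming a linear learner, the empirical risk is formulated as follows:

$$R_{emp}(w) = \frac{1}{n} \sum_{i=1}^{n} L(y_i - \langle w, x_i \rangle)$$

where $\langle w, x_i \rangle$ denotes the inner product of the two vectors $w$ and $x_i$. To obtain the optimal value of $w$, the following derivative has to be set to zero,

$$\frac{\partial R_{emp}(w)}{\partial w} = -\frac{1}{n} \sum_{i=1}^{n} L'(r_i) \, x_i$$

where $r_i = y_i - \langle w, x_i \rangle$. Given the weight function $W(r) = L'(r) / r$, the above equation can be reformulated as follows:

$$\sum_{i=1}^{n} W(r_i) \, r_i \, x_i = 0 \qquad (11)$$

A loss function is robust if the following equation holds genton1998highly .

$$\lim_{|r| \to \infty} W(r) = 0 \qquad (12)$$

Theorem 2

If all individual loss functions are robust, our proposed ensemble loss function is also robust.

Proof. For the ensemble loss function, $W(r)$ is written as follows:

$$W(r) = \frac{L'(r)}{r} = \sum_{j=1}^{m} \lambda_j \frac{\phi_j'(r)}{r} = \sum_{j=1}^{m} \lambda_j W_j(r)$$

To prove Theorem 2 we need to prove that $\lim_{|r| \to \infty} W(r) = 0$. Since each individual loss function is robust, the following equation is straightforward.

$$\lim_{|r| \to \infty} W_j(r) = 0, \quad j = 1, \dots, m$$

where $W_j(r)$ is the weight function corresponding to $\phi_j$. By assuming $\sum_{j} \lambda_j = 1$ and $\lambda_j \ge 0$, the weights are bounded, so the limit of the finite weighted sum is also zero and Equation 12 follows.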
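The robustness criterion can be illustrated by evaluating the weight function (derivative of the loss divided by the residual) at growing residuals. The sketch assumes three common base losses with their usual scalings: Square (`r**2 / 2`, weight 1), Huber (weight `min(1, c / |r|)`) and Welsch (weight `exp(-r**2 / sigma**2)`). The function names are illustrative.

```python
import numpy as np

def weight_square(r):
    return np.ones_like(np.asarray(r, dtype=float))

def weight_huber(r, c=1.0):
    a = np.maximum(np.abs(np.asarray(r, dtype=float)), 1e-12)
    return np.minimum(1.0, c / a)

def weight_welsch(r, sigma=1.0):
    return np.exp(-(np.asarray(r, dtype=float) / sigma) ** 2)

def ensemble_weight(r, weight_fns, lam):
    """Weight function of the ensemble: the lambda-weighted sum of the base weight functions."""
    return sum(l * W(r) for l, W in zip(lam, weight_fns))

r = np.array([1.0, 10.0, 100.0, 1000.0])
print(ensemble_weight(r, [weight_huber, weight_welsch], [0.5, 0.5]))   # decays towards 0
print(ensemble_weight(r, [weight_square, weight_welsch], [0.5, 0.5]))  # stays near 0.5
```

The second call shows why the theorem requires every base loss to be robust: as soon as a non-robust loss such as Square has a positive weight, the ensemble weight function no longer vanishes for large residuals.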

4.3 Training Phase using our Proposed Loss Function

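We aim to minimize the empirical risk of our proposed loss function as follows

$$\min_{w, \lambda} \; \sum_{i=1}^{n} L(e_i) = \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j \phi_j(e_i) \qquad (13)$$

where $L(e_i)$ denotes the value of the ensemble loss function for the $i$th sample, $\lambda_j$ denotes the weight associated with the $j$th loss function and $\phi_j(e_i)$ denotes the value of the $j$th loss function for the $i$th sample. We first omit $\lambda$ from Equation 13 and will later show that the weights associated with each loss function appear implicitly through HQ optimization. According to HQ optimization, Equation 13 is restated as follows

$$\min_{w} \; \sum_{i=1}^{n} \sum_{j=1}^{m} \min_{p_{ij}} \; \{ Q(e_i, p_{ij}) + \psi_j(p_{ij}) \}$$

where $\psi_j(\cdot)$ is the point-wise maximum of some convex functions and consequently it is convex, $e_i = y_i - w^T x_i$ is the error associated with the $i$th sample and $p_{ij}$ is the auxiliary variable. By substituting $Q$ with the multiplicative HQ function, the following equation is obtained.

$$\min_{w, P} \; J(w, P) = \sum_{i=1}^{n} \sum_{j=1}^{m} \left\{ \frac{1}{2} p_{ij} e_i^2 + \psi_j(p_{ij}) \right\} \qquad (14)$$

where $P$ is a matrix of $n$ rows and $m$ columns, and $p_{ij}$ denotes the element of the $i$th row and $j$th column. To solve the above problem, we make use of the HQ optimization algorithm, which is iterative and alternating. Each iteration consists of two steps as follows he2014half .

Step 1: by considering $w$ constant, we arrive at the following optimization problem with one variable

$$P^{t+1} = \arg\min_{P} \; J(w^{t}, P) \qquad (15)$$

The value of $p_{ij}$ is calculated using the minimizer function $\delta_j(\cdot)$, which is pre-calculated for some special loss functions and their corresponding conjugate functions, listed in Fig. 3. The optimal value of $p_{ij}$ is obtained by $p_{ij}^{t+1} = \delta_j(e_i^{t})$.

Step 2: given the obtained value of $P$, we optimize over $w$ as follows

$$w^{t+1} = \arg\min_{w} \; \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{1}{2} p_{ij}^{t+1} (y_i - w^T x_i)^2 = (X^T D^{t+1} X)^{-1} X^T D^{t+1} y \qquad (16)$$

where $P^{t+1}$ and $w^{t+1}$ are respectively the optimal values for the matrix $P$ and the vector $w$ in the $(t+1)$th iteration, $X$ is the matrix whose $i$th row is $x_i^T$, $y$ is the vector of labels and $D^{t+1}$ is the diagonal matrix with entries $\sum_{j} p_{ij}^{t+1}$. The matrix $X^T D^{t+1} X$ is generally Positive Semi-Definite (PSD) and might not be invertible. To make it Positive Definite (PD), we add to it a diagonal matrix with strictly positive diagonal entries meyer2000matrix . The refinement of Equation 16 is as follows

$$w^{t+1} = (X^T D^{t+1} X + \epsilon I)^{-1} X^T D^{t+1} y \qquad (17)$$

where $\epsilon$ is a very small positive number and $I$ is the identity matrix. Equations 15 and 17 are performed iteratively in an alternating manner until convergence.

$P^{t}$ denotes the matrix $P$ in the $t$th iteration. $p_{ij}^{t}$ is the element of the $i$th row and the $j$th column of the matrix $P^{t}$, so $p_{ij}^{t}$ can be considered as the weight of the $i$th sample associated with the $j$th loss function. Thus, we can calculate the total weight of the $j$th loss function by $\lambda_j = \sum_{i} p_{ij} / \sum_{i} \sum_{k} p_{ik}$.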

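Proposition 1

The sequence $\{J(w^{t}, P^{t})\}$ generated by Algorithm 1 converges.

Proof. According to Equation 15 and the properties of the minimizer function, we have

$$J(w^{t}, P^{t+1}) \le J(w^{t}, P^{t})$$

for a fixed $w^{t}$. And by Equation 17, for a fixed $P^{t+1}$ we have

$$J(w^{t+1}, P^{t+1}) \le J(w^{t}, P^{t+1})$$

Combining the above inequalities gives us:

$$J(w^{t+1}, P^{t+1}) \le J(w^{t}, P^{t+1}) \le J(w^{t}, P^{t})$$

Since the cost function is bounded below, the sequence $\{J(w^{t}, P^{t})\}$ is non-increasing and finally converges when $t \to \infty$ he2014half .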

Algorithm 1 Our Proposed Ensemble Loss Function using HQ Optimization Algorithm (RELF)
Input: Streaming samples $\{(x_i, y_i)\}$; base loss functions $\{\phi_j\}_{j=1}^{m}$
Output: Parameter $w$ in Equation 14 and $\lambda$ (weights associated with each base loss function)

1: Initialize $\lambda$ and a random $w$
2: Initialize parameters
3: while not converged do
4:     Compute $e_i = y_i - w^T x_i$ for $i = 1, \dots, n$
5:     Update $P$ by $p_{ij} = \delta_j(e_i)$ according to Equation 15
6:     Update the learning parameter $w$ according to Equation 17
7: end while
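The alternating loop of Algorithm 1 can be sketched for a linear regressor as follows. This is not the authors' implementation: it assumes the multiplicative HQ form, uses Huber and Welsch minimizer functions with unit scale parameters, runs a fixed number of iterations instead of a convergence test, starts from `w = 0`, and reports the base-loss weights as normalized column sums of the auxiliary matrix `P`. The names `relf_fit`, `delta_huber` and `delta_welsch` are hypothetical.

```python
import numpy as np

def delta_huber(e, a=1.0):
    """Multiplicative HQ minimizer for the Huber loss: 1 inside [-a, a], a / |e| outside."""
    abs_e = np.abs(e)
    return np.where(abs_e <= a, 1.0, a / np.maximum(abs_e, 1e-12))

def delta_welsch(e, sigma=1.0):
    """Multiplicative HQ minimizer for a Welsch-type loss."""
    return np.exp(-(e / sigma) ** 2)

def relf_fit(X, y, deltas, n_iter=30, eps=1e-8):
    """Alternate between the auxiliary matrix P and a ridge-stabilized weighted least squares."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        e = y - X @ w                                     # residuals for the current w
        P = np.column_stack([dlt(e) for dlt in deltas])   # n x m auxiliary variables
        s = P.sum(axis=1)                                 # per-sample weight summed over losses
        A = X.T @ (s[:, None] * X) + eps * np.eye(d)      # small ridge keeps A positive definite
        w = np.linalg.solve(A, X.T @ (s * y))
    lam = P.sum(axis=0) / P.sum()                         # assumed normalization of loss weights
    return w, lam

# Toy data: a noiseless line through the origin with slope 2 and three gross outliers.
x = np.arange(0.0, 10.0, 0.5)
X = x[:, None]
y = 2.0 * x
y[[3, 10, 17]] += 50.0
w, lam = relf_fit(X, y, [delta_huber, delta_welsch])
print(w, lam)
```

Because both minimizer functions shrink towards zero for large residuals, the three corrupted samples receive very small weights in the least-squares step and the fitted slope stays close to the clean value.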

4.4 Toy Example

To clarify our proposed approach, we carry out an experiment on small synthetic data. The inputs $x$ are generated on a regular grid with step size 0.5. Labels lie on a line with additive noise. We ensemble two loss functions, Welsch and $\ell_2$, which are selected from the functions listed in Fig. 3. We conduct our experiments in the presence of two kinds of noise: (1) zero-mean Gaussian noise and (2) outliers.

The weights associated with each base loss function, $\lambda$, and the regressor parameter $w$ are iteratively updated according to Equations 15 and 17 respectively. These values are presented in Table 3. The weight associated with the Welsch loss function decreases in the presence of outliers while it increases for the $\ell_2$ function. The Welsch and $\ell_2$ functions are plotted in Fig. 6.

Figure 6: Welsch function (red line), $\ell_2$ function (red line)

                      Gaussian noise    Outliers
Weight of $\ell_2$    0.1546            0.3578
Weight of Welsch      0.8453            0.6422
Estimated $w$         1.9975            1.9969
Table 3: Results for synthetic data

As shown in Fig. 6, Welsch is a bounded function and $\ell_2$ is unbounded. Therefore, $\ell_2$ assigns a larger value to a sample with a large error than Welsch does. Hence, to decrease the influence of outliers on the predicted function, the value of $\lambda$ for $\ell_2$ is less than the corresponding value for the Welsch function in the presence of outliers.

We also combine three, four and five base loss functions in Fig. 3 and the weights associated with each loss function are reported in Table 4. Results show that among these five loss functions, Fair loss function contributes the most to form the ensemble loss function. Also, the predicted has been above value for all experiments.

Number of base loss functions    Kind of noise    $\ell_2$    Welsch    Huber    Fair    Log-cosh
3                                Gaussian         0.26        0.58      0.14     -       -
3                                Outlier          0.28        0.62      0.09     -       -
4                                Gaussian         0.15        0.36      0.07     0.38    -
4                                Outlier          0.16        0.37      0.05     0.41    -
5                                Gaussian         0.13        0.28      0.05     0.30    0.21
5                                Outlier          0.11        0.27      0.04     0.33    0.23
Table 4: Weights associated with each base loss function

5 Experiments

In this section, we evaluate how our proposed loss function works in regression problems. To make our proposed ensemble loss function stronger, we have chosen base loss functions which differ in their behaviour against outliers. The Welsch function is bounded and assigns a small value to samples with large errors. The Huber loss function penalizes samples linearly with respect to $|e|$, while the $\ell_2$ function heavily penalizes samples with large $|e|$. Therefore, these three loss functions, which behave very differently in the presence of outliers, have been chosen as the base functions of our proposed ensemble loss function.

We have conducted our experiments on several benchmark datasets which are briefly described in Table 5. Mean Absolute Error (MAE) is used to compare the results and is calculated as $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $n$ is the number of test samples and $y_i$ and $\hat{y}_i$ are the true and estimated labels respectively. In all experiments, we have used 10-fold cross validation for model selection. That is, the original dataset is partitioned into 10 disjoint subsets. We then run 10 iterations, in each of which 9 subsets are used for training and the remaining one for testing. Every subset is used exactly once for testing, and the best model parameters are selected.
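The evaluation protocol described above can be sketched with plain NumPy. The helper names `mae` and `kfold_indices`, the shuffling and the seed are illustrative assumptions; the paper does not state how the folds were drawn.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true and estimated labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is used for testing exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

splits = list(kfold_indices(25, k=10))
print(len(splits), mae([1.0, 2.0, 3.0], [1.0, 2.0, 6.0]))
```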

The data have been initially normalized. Table 5 shows the experimental results on some benchmark datasets in the natural situation (without adding outliers to the samples). It shows that the performance of all regressors is roughly the same on most datasets; however, RELF achieves the lowest MAE on the majority of them.

Dataset name # of samples # of features Lasso LARS SVR RELF
Airfoil self-noise 1503 6 3.5076 3.5147 3.7954 3.4937
Energy Efficient Dataset 768 8 1.9741 1.9789 1.4021 1.9476
Red Wine Quality Dataset 4898 12 2.6605 2.6637 2.7630 2.6561
White Wine Quality Dataset 4898 12 1.9741 1.9789 1.4021 1.9733
Abalone Dataset 4177 8 2.0345 1.8693 1.5921 1.5549
Bodyfat_Dataset 252 13 0.91823 0.84867 0.6599 0.52465
Building_Dataset 4208 14 1.017 1.0183 2.4750 1.0024
Engine_Dataset 1199 2 1.3102 1.4113 0.9833 0.88694
Vinyl_Dataset 506 13 1.1501 1.0598 0.9854 0.23952
Simplefit_Dataset 94 1 0.63349 0.51819 0.4836 0.38523
Table 5: The Mean Absolute Error (MAE) comparison results on several datasets. The best result for each dataset is presented in bold.

5.1 Discussion about the Convergence of HQ Optimization Method

In this section, the convergence of the HQ optimization method in our algorithm is experimentally studied. Table 6 shows the values of the cost function, $J$, in successive iterations. We define a decrease ratio in Equation 18, which shows what fraction of the total reduction of the cost function over 30 iterations is already achieved after 10 iterations:

$$\text{Decrease Ratio} = \frac{J^{(1)} - J^{(10)}}{J^{(1)} - J^{(30)}} \qquad (18)$$

where $J^{(t)}$ denotes the value of the cost function at iteration $t$. This ratio has been calculated for each dataset separately and is shown in Table 6. The results show that the HQ optimization method converges quickly, within 30 iterations in all cases. Moreover, $J^{(10)}$ is a good approximation of the optimal point. We have also provided the CPU time in seconds for each dataset in Table 6.
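The decrease ratio can be reproduced from the cost values reported in Table 6, reading it as the reduction between iterations 1 and 10 divided by the reduction between iterations 1 and 30. The helper below is a small sketch; the function name and the handling of a zero denominator are assumptions.

```python
def decrease_ratio(j_1, j_10, j_30):
    """Share of the total cost reduction (iterations 1 to 30) reached by iteration 10."""
    total = j_1 - j_30
    if total == 0:
        return 1.0  # assumed convention when the cost does not change at all
    return (j_1 - j_10) / total

# Cost values taken from the Airfoil self-noise and Abalone rows of Table 6.
airfoil = decrease_ratio(617.9089, 615.2434, 615.2070)
abalone = decrease_ratio(1.5604, 1.5362, 1.5330)
print(round(airfoil, 6), round(abalone, 6))
```

Both values agree with the Decrease Ratio column of Table 6 to the reported precision.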

Dataset name          Iter 1      Iter 2      ...   Iter 10     ...   Iter 29     Iter 30     Decrease Ratio   CPU time (s)
Airfoil self-noise    617.9089    615.9067    ...   615.2434    ...   615.2074    615.2070    0.986528         0.3125
Energy Efficient      298.9774    294.7328    ...   291.9451    ...   291.9144    291.8603    0.988085         0.6094
Red Wine Quality      658.9971    657.8419    ...   657.5159    ...   657.4989    657.4986    0.988455         0.2656
White Wine Quality    324.2221    291.8042    ...   291.8410    ...   291.6763    291.6837    0.995166         1
Abalone               1.5604      1.5490      ...   1.5362      ...   1.5331      1.5330      0.883212         0.6250
Bodyfat               98.6765     98.4771     ...   98.4393     ...   98.4389     98.4389     0.998316         0.0469
Building              3.3932      3.3650      ...   3.3434      ...   3.3400      3.3399      0.934334         0.8438
Engine                1.6640      1.6026      ...   1.5500      ...   1.5397      1.5394      0.914928         0.5313
Vinyl                 2.5582      2.5574      ...   2.5573      ...   2.5573      2.5573      1                1.9063
Simplefit             44.2226     43.4881     ...   43.1884     ...   43.1652     43.1649     0.977782         0.0156
Table 6: Values of the Cost Function, $J$, in several successive Iterations

5.2 Discussion about Robustness

In this section, we investigate the robustness of our proposed loss function through experiments on the benchmark datasets listed in Table 5. 10-fold cross validation has been used to tune the hyper-parameters. To study the robustness, we add outliers to the training and validation samples. We conduct experiments on data corrupted with two levels of outliers, 10% and 30%, where a level of $k$% means that we randomly select $k$% of the samples and add outliers to their labels. Table 7 and Table 8 show the increase ratio of MAE in the presence of outliers in comparison with the natural situation (without outliers). The best results are presented in bold. Lasso, LARS and SVR are three well-known regression methods which have been used for comparison.
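The corruption procedure can be sketched as follows. The paper does not say how large the injected outliers are, so the additive shift `magnitude`, the random sign and the seed are assumptions made only for illustration; the function name `add_label_outliers` is hypothetical.

```python
import numpy as np

def add_label_outliers(y, fraction, magnitude=10.0, seed=0):
    """Shift a random `fraction` of the labels by +/- `magnitude` (assumed outlier model)."""
    rng = np.random.default_rng(seed)
    y_out = np.array(y, dtype=float)  # copy, so the clean labels stay untouched
    n_out = int(round(fraction * len(y_out)))
    idx = rng.choice(len(y_out), size=n_out, replace=False)
    y_out[idx] += rng.choice([-1.0, 1.0], size=n_out) * magnitude
    return y_out, idx

y_clean = np.zeros(100)
y_10, idx_10 = add_label_outliers(y_clean, 0.10)
y_30, idx_30 = add_label_outliers(y_clean, 0.30)
print(len(idx_10), len(idx_30))
```

The increase ratio reported in Tables 7 and 8 is then the MAE obtained with the corrupted labels divided by the MAE obtained with the clean labels.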

Dataset Name Lasso LARS SVR RELF
Airfoil self-noise 1.0127 1.0191 1.0227 1.0016
Energy Efficient Dataset 1.4025 1.3641 1.0971 1.0082
Red Wine Quality Dataset 1.0683 1.0544 1.0235 1.0139
White Wine Quality Dataset 1.2941 1.2897 1.0971 1.0602
Abalone Dataset 1.0264 1.1246 1.0183 1.0077
Bodyfat_Dataset 1.5125 1.5279 1.0242 1.0692
Building_Dataset 1.0016 1.0014 1.0143 1.0004
Engine_Dataset 1.1374 1.0242 1.0569 1.0089
Vinyl_Dataset 1.2345 1.148 1.1987 1.0978
Simplefit_Dataset 1.3122 1.8628 1.0023 1.2359
Table 7: Increase Ratio of MAE in the Face of Outliers (10%). The best result for each dataset is presented in bold.
Dataset Name Lasso LARS SVR RELF
Airfoil self-noise 5.9316 6.6159 1.0213 1.0178
Energy Efficient Dataset 2.544 2.0297 1.1555 1.0016
Red Wine Quality Dataset 1.408 2.0033 1.0270 1.0024
White Wine Quality Dataset 2.1278 2.0297 1.1555 1.0219
Abalone Dataset 2.0264 3.0233 1.0338 1.0318
Bodyfat_Dataset 2.3491 1.1265 1.0338 1.2882
Building_Dataset 1.985 1.1438 1.0151 1.0022
Engine_Dataset 2.1156 2.0233 1.3665 1.0061
Vinyl_Dataset 3.5624 2.2455 1.2785 1.0651
Simplefit_Dataset 8.674 8.3125 1.4424 1.5212
Table 8: Increase ratio of MAE in the face of outliers (30%). The best result for each dataset is presented in bold.

Table 7 lists the ratio of the MAE value for data with 10% outliers to the MAE value for the original data. Except for two datasets (Bodyfat and Simplefit), RELF has been the regressor least influenced by outliers. Table 8 presents the same comparison for the four models in the presence of 30% outliers. RELF was least influenced by outliers on most datasets, and SVR obtained better results than Lasso and LARS.

5.3 Comparison with State-of-the-Art Ensemble Regressors

We also investigate the effectiveness of our proposed method in comparison with several promising ensemble regressors through experiments on some benchmark datasets. The datasets have been selected from the LIBSVM data page. We compare RELF with four ensemble regression methods: Locally Weighted Linear Regression (LWLR) zhang2016online , Locally Linear Ensemble for Regression (LLER) kang2018locally , Artificial Neural Network (ANN) rodriguez2015machine and Random Forest (RF) rodriguez2015machine . The first two models are based on combining local experts and the next two models are nonlinear. In the following, the implementation settings for each method are provided.

LWLR, which is a local expert method, builds a local linear regressor for each test sample based on its neighbours. Each linear model is trained so that higher weights are assigned to the training samples which are closer to the given test sample. The weights are calculated according to a Gaussian kernel, which is explained in detail in schaal2002scalable .

For the ANN, the number of hidden nodes is selected from a predefined set. If the loss on the validation batch fails to decrease during six successive epochs, the training phase ends.

To implement RF, we set the number of trees to 100 and bootstrap sample size to 80 percent of training samples for each tree. We set the minimum size of leaf nodes to {0.1, 0.2, 0.5, 1, 2, 5} percent of training samples.

In all experiments, we use cross validation for parameter tuning, and the data are initially normalized into the range [-1, 1]. Root Mean Square Error (RMSE) is used to compare the results and is calculated as $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, where $n$ is the number of test samples and $y_i$ and $\hat{y}_i$ are the true and estimated labels respectively. Table 9 provides some extra information about each dataset.
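A short sketch of the two preprocessing and evaluation steps mentioned above. The column-wise min-max scheme used to map features into [-1, 1] is an assumption, since the paper only states the target range; the helper names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between true and estimated labels."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def scale_to_unit_range(X):
    """Column-wise min-max scaling into [-1, 1]; constant columns are mapped to -1."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return 2.0 * (X - lo) / span - 1.0

X = np.array([[0.0, 5.0], [5.0, 5.0], [10.0, 5.0]])
print(scale_to_unit_range(X))
print(rmse([0.0, 0.0], [3.0, 4.0]))
```

In a cross-validation setting the minimum and maximum would normally be computed on the training folds only and reused for the test fold.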

Dataset Source Number of Data Number of Feature
BodyFat Source: StatLib / bodyfat 252 14
Abalone Source: UCI / Abalone 4,177 8
Cadata Source: StatLib / 20,640 8
Cpusmall Source: Delve / comp-activ 8,192 12
Housing Source: UCI / Housing (Boston) 506 13
Mg Source: [GWF01a] 1,385 6
Table 9: Datasets’ description

The five mentioned models have been numerically compared on 6 datasets and the results are plotted as bar charts in Figure 7. The ANN and RF models are nonlinear while LWLR and LLER are locally linear. RELF yields the lowest RMSE value for all datasets except Abalone and Cpusmall, for which ANN and RF get the lowest RMSE value.