The goal of robust regression methods is to provide reliable estimates for the unknown regression function even when the training data set may contain atypical observations. These “outliers” need not be “extreme” or aberrant values, but might simply be points following a different model from the one that applies to the majority of the data. It is well known that classical methods can provide seriously misleading results when trained on such heteregenous or contaminated data sets, and thus obtaining robust alternatives is of immediate practical importance.
In the robust regression literature much of the attention has been devoted to linear models, and a rich variety of proposals exists for them. For a review see, for example, maronna2019robust. Fewer options are available for robust non-parametric regression methods, and generally they have not been developed as thoroughly as the linear ones. Among them we can mention: robust locally weighted regression (cleveland1979robust; wang19941), local M-estimators (boentefraiman89)
, local regression quantiles(welsh96), and the proposals in hardletsy88, hardle90 and ohnychkalee07
. Unfortunately, when applied to problems with several explanatory variables, these methods suffer from the well known curse of dimensionality, requiring training sample sizes that grow exponentially with the number of covariates. An exception are the recently proposed robust estimators for additive models(boente2017robust). However, selecting appropriate bandwidths for each of the components of an additive model can be computationally prohibitive when a moderate or large number of explanatory variables are available.
In this paper we study robust non-parametric regression estimators based on gradient boosting (Friedman2001)
. These ensemble methods are scalable to high-dimensional data, and typically require selecting fewer tuning parameters than additive models. Most robust boosting algorithms in the literature were designed for classification problems(kegl2003robust; rosset2005robust; freund2009more; miao2015rboost; li2018boosting). Previous proposals to robustify boosting methods replaced the squared loss with the absolute value function, or Huber’s loss. lutz2008robustified introduced several ways to robustify boosting, primarily focusing on linear regression. One of their methods (Robloss) can be extended to the scope of our study (nonlinear regression) by replacing the simple linear regression base learners with other types of base learners.
A different class of robust nonparametric regression methods are based on random forests.roy2012robustness proposed simple variations of random forests that use the median to aggregate trees and make the tree splits. li2017forest extended random forests from the squared loss to a broad family of loss functions, which leads to robust methods including: Huber forest, Tukey forest, and the quantile random forest which was also proposed by meinshausen2006quantile.
Our main concern with these proposals is that they use robust loss functions that either may yield estimators with low efficiency (e.g. the loss) or require an auxiliary residual scale estimator (e.g. Huber’s and Tukey’s loss functions). For the latter, previous proposals suggested using in each step a robust scale estimator obtained with the residuals from the fit at the previous iteration (Friedman2001; lutz2008robustified). It is easy to see that this changing scale estimator may not work well, since the scale can be overestimated in early iterations and this might result in observations with large residuals not being considered outlying. In fact, our numerical experiments confirm that this is the case even in relatively simple settings. To address this problem, we propose a robust boosting method that directly minimizes a robust scale of the residuals, and hence does not need to compute an auxiliary residual scale estimator. Although in principle our approach can be used with any scale estimator, gradient boosting requires the gradient of the objective function. Hence, in this paper we focus on minimizing an M-estimator of scale, for which we can compute the corresponding gradient. This approach can be seen as an extension of S-estimators (rousseeuw1984robust) to this class of non-parametric regression estimators. Moreover, this robust boosting estimator can be used (along with its associated robust residual scale estimator) as the initial fit for a boosting M-type estimator using Huber’s or Tukey’s loss. As in the case of linear models, this second stage is expected to result in a robust estimator with higher efficiency (less variability) than the S-estimator (yohai1987high). Our proposal applies to practical situations where the proportion of potential outliers in the training and validation sets is unknown, and no parametric structure for the regression model is available. Furthermore, our approach can be used with relatively large numbers of explanatory variables.
The remainder of this paper is organized as follows. Section 2 introduces our proposed robust boosting regression estimator, and a variable importance score based on permutations. Section 3 reports the results of our simulation studies comparing our method with previously proposed robust boosting algorithms, while results obtained on benchmark data are discussed in Section 5. Finally, Section 6 summarizes our findings and conclusions.
In this section we introduce our robust gradient boosting algorithm based on minimizing an M-estimator of the scale of the residuals. We also show how this fit can be further improved with a second gradient boosting step based on a bounded loss function. This second step reduces the variability of the fit without affecting its robustness. Finally, we discuss a variable importance method based on the permutation importance scores used with random forest (breiman2001random).
2.1 SBoost: gradient boosting based on M-scale estimators
Consider a training data set , where , and . The goal is to estimate the regression function that relates the explanatory variables to the response . The usual model behind this approach specifies that . Gradient boosting assumes that can be estimated with a function of the form:
where the functions
(“base learners") are typically chosen to be simple regressors such as decision trees with few splits, or linear functions. Often the base learnersare restricted to be functions of a single variable, which is akin to assuming an additive model.
Given a user-chosen loss function (e.g. the squared loss ), gradient boosting can be seen as a step-wise iterative algorithm to minimize the empirical risk
over functions of the form (1). More specifically, let be the number of observations in the training set, , and consider the objective function above as a function of the
-dimensional vector. At each step we compute the -th coordinate of its gradient at the point obtained from the previous iteration :
and form the gradient vector . Similar to the usual gradient descent algorithm that sequentially adds to the current point the negative gradient vector of the objective function, gradient boosting adds to the current function estimate an approximation to the negative gradient vector () using a base lerner . More specifically, at the -th iteration is chosen to minimize
over members of the family of base lerners. We then set
where is the optimal step size given by
Note that this algorithm requires an initial estimate . When
is the squared loss this initial estimator is usually taken to be the sample mean of the response variable in the training set. It is easy to see that this simple choice may not work well when there are atypical observations in the data set. In Section2.1.1 below we present a better initialization approach for our setting.
Prior attempts to robustify boosting replaced the squared loss with the Huber loss, which requires the use of a preliminary scale estimator of the residuals. However, to estimate the scale of the residuals we need a robust regression estimator with which to compute the residuals. To avoid this problem we consider the following minimization problem instead:
where is as in (1), and is a robust estimator of the scale of the residuals , . More specifically, in this paper we will focus on the case where is an M-estimator of scale given by
where is a symmetric function around zero which controls the robustness and efficiency of the resulting estimator (maronna2019robust). In order to obtain a robust estimator, we use a bounded function in (2). For the associated scale estimator to be consistent when the errors are Gaussian the constant above needs to satisfy , where
denotes the standard normal distribution. This can always be computed once the functionis selected. Note that if we take this approach reduces to the usual boosting with the squared loss function.
In what follows we use a function in Tukey’s bisquare family as above, but our approach applies to any differentiable function . Specifically, we choose , where
For linear regression estimators the tuning constant is generally selected to obtain maximum breakdown point, which is achieved when , which results in . We refer the interested reader to maronna2019robust for details.
To compute the S-type boosting estimator, we follow the same approach used for gradient boosting, namely: at each step we calculate the gradient of the objective function (considered as a function of the fitted values), and approximate the negative gradient with a base learner. The -th coordinate of the gradient at step is given by:
The above can be computed explicitly by differentiating both sides of equation (2) with respect to , and noting that
where . Solving for in the equation above we get
where the constant is
The resulting algorithm, which we call SBoost, is described in Algorithm 1.
2.1.1 The initial fit
Gradient boosting algorithms require an initial fit to calculate the gradient vector for the first iteration. The classical gradient boosting algorithm is generally initialized with a constant fit where solves
Since the loss function in Algorithm 1 is non-convex, the choice of initial fit for SBoost will typically be more important than for the usual gradient boosting with a squared loss function. A natural initial fit for a robust version of gradient boosting would be to take . However, we have found that this choice may not work well when the response variable includes a wide range of values. Intuitively the problem is that in those cases many “good” observations may have large residuals that result in a zero initial gradient , and subsequent iterations the algorithm fail to update . Instead, we recommend using a regression tree computed with the loss (breiman2017classification). This tree, which we call LADTree, at each split minimizes the average absolute value of the residuals. This initial fit is expected to be robust to outliers in the training set; it is able to handle data of mixed types (quantitative, binary and categorical), and it is scalable to data with a large number of features.
Although our numerical experiments confirm the robustness of the SBoost fit (see Section 3.3), they also show a very low efficiency when the data do not contain outliers. Mimicking what is done to fix this issue in the case of S-estimators for linear regression, we propose to refine the SBoost estimator by using it as the initial fit for further gradient boosting iterations based on a bounded loss function that is closer to the squared loss function than in (2) (e.g. a Tukey’s bisquare function with a larger tuning constant). Moreover, in this second step we can use the scale estimator associated with the final SBoost fit ().
This two-step procedure, which we call RRBoost can be summarized as follows:
Stage 1: compute an -type boosting estimator (SBoost) with high robustness but possibly low efficiency;
Stage 2: compute an -type boosting estimator initialized at the function and scale estimators obtained in Stage 1.
This approach solves the problem of many robust boosting proposals in the literature that require a preliminary residual scale estimator. Since it is known that highly robust regression S-estimators have low efficiency (hossjer92), the second step attempts to improve the efficiency of SBoost, similarly to what MM estimators do for linear regression models (yohai1987high).
The M-type boosting fit is given by
where is of the form (1), and is the residual scale estimator obtained from SBoost. In what follows we will use a Tukey’s bisquare function (3) for above, and set , which in the case of linear regression models yields 95% asymptotic efficiency when the errors have a normal distribution (maronna2019robust).
The gradient of the objective function above are easy to compute. At the -th iteration, its -th coordinate is given by
The algorithm for the second stage starts from the SBoost function estimator , and sequentially adds base learners , …, that approximate the corresponding vector of gradients at each step. The resulting RRBoost procedure is described in Algorithm 2.
2.3 Early stopping
The boosting algorithm is intrinsically greedy: at each step it attempts to maximize the reduction in the value of the loss function evaluated on the training set. Consequently, overfitting the training set at the expense of an increased generalization error (i.e. a poor performance on future data) is a practical concern. To mitigate this problem we propose to use an early stopping rule based on monitoring the performance of the ensemble on a validation set. This approach can be seen as a form of regularization.
We define the early stopping time of each stage of RRBoost as the iteration that achieves the lowest loss function value on a validation set (). Let and be the maximum number of iterations allowed in RRBoost (Algorithm 2). In the first stage (SBoost), we refer to (2) and define the validation loss as
The early stopping time for SBoost is
In the second stage, referring to (4) we define the validation loss:
and the early stopping time
In practice, when running boosting for iterations is too costly, one can terminate the boosting iterations when the validation error ceases to improve after several iterations.
2.4 Robust variable importance
We introduce robust variable importance as a by-product of RRBoost, which can be useful for selecting variables, interpreting strong predictors, and identifying variables that need to be measured to a great precision (fisher2018all). The variable importance suggested by Friedman2001 for gradient boosting averages the improvement of squared error at each tree split at all iterations, but it cannot be generalized to arbitrary loss functions and is limited to tree base-learners. Motivated by the permutation variable importance used in random forest (breiman2001random), we suggest a robust modification that applies to RRBoost.
Permutation importance measures the importance of a variable by permuting it in the training data and recording the decrease in prediction accuracy after the permutation. A feature is “important" if breaking its relationship with the response by shuffling its values increases the model error. Since our training data is contaminated, a robust prediction accuracy metric is needed to reduce the impact of outliers. We use trimmed root mean squared error and define the importance of variable as:
where is fitted using the training data, and is fitted using training data with the th variable permuted. We trim the residuals that deviate at least 3 scale units from 0, where the scale unit is defined by the median absolute deviation (MAD) multiplied by a constant (1.438), which makes it an unbiased and consistent estimator of scale. Set denotes the indices in the training set () that are not trimmed as outliers, which is defined as
We performed extensive numerical experiments to assess the performance of RRBoost and compared it with that of other proposals in the literature. In particular, we focused on the the accuracy and robustness of different boosting methods when the training and validation sets contain outliers, and also when no atypical observations are present.
3.1 Set up
Consider a data set , consisting of predictors , where each coordinate , , is an independent random variable. The responses follow the model
where the errors are independent from the explanatory variables , and
is a constant that controls the signal-to-noise ratio (SNR):
Given a function and values of the predictors, the constant that achieves a desired target SNR is
In our experiments the function was set to
where , , , , and . We considered the following error distributions:
Clean (): ;
Symmetric gross errors () : , ; and
Asymmetric gross errors () : , .
For the contaminated settings and , we considered rates of contamination equal to 10% and 20% (which correspond to setting and , respectively). We constructed training sets of size and validation sets of size following each of the cases above. To compare the performance of the different estimators, we also constructed a test set of size , which always follows .
The number of available explanatory variables was set to and , the latter corresponding to a “high-dimensional” case where the size of the training set is smaller than the number of predictors (). The signal-to-noise ratio was set to when and for the case. The experiments were repeated 100 times, and we compared the following methods:
L2Boost: gradient boosting with squared error loss (Friedman2001);
LADBoost: gradient boosting with absolute error loss (Friedman2001);
MBoost: gradient boosting with Huber’s loss (Friedman2001);
Robloss: Robloss boosting (lutz2008robustified);
SBoost: SBoost (see Section 2.1) intialized with LADTree;
RRBoost: RRBoost (see Section 2.2) intialized with LADTree.
The base learners for all methods under comparison were decision stumps (one-split regression trees).
We used the early stopping rules discussed in Section 2.3 with and . Similarly, for L2Boost, LADBoost, MBoost, and Robloss, we tracked their loss functions evaluated on the validation set () and calculated the early stopping time ():
where the maximum number of iterations .
To initialize RRBoost we considered LADTrees with depth 0, 1, 2, 3 and 4, and minimum number of observations per note set at 10, 20 or 30 (note that an LADTree of depth 0 corresponds to using the median of the responses as initial fit). Of the 13 different combinations we aimed to choose the one that performs the best on the validation set. As the validation set also contains outliers, we first fit the depth 0 tree and trimmed as potential outliers whose residuals deviate at least 3 scale units from 0. The observations that were not trimmed are denoted as:
We picked the one of the 13 combinations with the lowest trimmed absolute average deviation (AAD) on the validation set:
All our simulations were run using the R programming environment (RLang). Regression trees were fit using the rpart package, including the LADTrees and the base learners. The code used in our experiments is publicly available on-line at https://github.com/xmengju/RRBoost.
3.2 Performance metrics
We compared different methods in terms of their predictive performance and variable selection accuracy. For predictive performance, we recorded the root mean squared error (RMSE) evaluated on the clean test set () at the early stopping time ():
For variable selection accuracy, we measured the proportion of variables recovered (kazemitabar2017variable). Specifically, let be the set of indices of explanatory variables that were used to generate the data (). After ranking all
features using the robust imputation variable importance index, we recorded the proportion of true variables in the topof the ordered list:
where is an order statistics, denoting the th largest value of .
Figure 1 and 2 compare the predictive performance of boosting methods under consideration. They show the boxplots of test root-mean-square error (RMSE) at early stopping times, where each boxplot was generated with the 100 results from independent runs of the experiment. Tables 2 and 1 show the summary statistics (mean, median, sd, and MAD) of these test RMSEs at early stopping times.
|D0||1.00 (0.04)||1.03 (0.04)||1.08 (0.06)||1.02 (0.04)||1.61 (0.11)||1.00 (0.05)|
|D1 (10%)||2.27 (0.32)||3.42 (1.15)||1.16 (0.07)||1.40 (0.47)||1.57 (0.12)||1.04 (0.06)|
|D1 (20%)||2.63 (0.48)||7.65 (0.55)||1.29 (0.10)||2.60 (0.9)||1.52 (0.14)||1.10 (0.07)|
|D2 (10%)||3.00 (0.31)||3.74 (0.95)||1.19 (0.07)||1.76 (0.65)||1.57 (0.13)||1.04 (0.07)|
|D2 (20%)||4.71 (0.32)||7.78 (0.58)||1.49 (0.20)||4.40 (0.93)||1.47 (0.13)||1.08 (0.07)|
|D0||1.28 (0.07)||1.61 (0.10)||1.45 (0.12)||1.56 (0.08)||1.71 (0.11)||1.28 (0.08)|
|D1 (10%)||2.46 (0.24)||4.78 (1.24)||1.58 (0.12)||2.00 (0.22)||1.71 (0.13)||1.32 (0.10)|
|D1 (20%)||2.95 (0.36)||9.00 (0.44)||1.71 (0.12)||3.56 (0.71)||1.70 (0.11)||1.37 (0.11)|
|D2 (10%)||3.23 (0.33)||5.25 (0.99)||1.59 (0.13)||2.15 (0.33)||1.69 (0.12)||1.33 (0.11)|
|D2 (20%)||4.94 (0.37)||9.05 (0.46)||1.80 (0.13)||6.11 (0.93)||1.71 (0.11)||1.38 (0.11)|
We find that, as expected, when there are no outliers in the training and validation sets, the performance of all methods are very similar, with the exception of SBoost, which is negatively affected by its low efficiency. Note that the subsequent M-step in RRBoost completely fixes this problem.
When the number of predictors is , the performances of L2Boost, MBoost, and Robloss seriously deteriorate as the percentage of contamination increases from 10% to 20%, while RRBoost achieves the lowest average test RMSE among all methods, and its peformance is remarkably stable for different proportions of contamination.
For the high-dimensional setting with , the results in Figure 2 show similar patterns to the ones for . RRBoost performs very similarly to the standard gradient boosting when the training and validation sets do not contain outliers, and outperforms other methods when the data contain outliers. It is interesting to note that robust boosting methods using Huber’s loss function (MBoost and Robloss) perform worse than gradient boosting with squared loss (L2Boost) in many contaminated settings. One explanation for the behaviour of MBoost with 20% of outliers is that at each iteration , Friedman2001 suggested setting the tuning constant of the loss function at the 90% quantile of the residuals . It is not immediately clear how to address this issue without assuming that the proportion of outliers is known.
We also calculated variable importance indices for each independent run of the experiment. The resulting fractions of variables recovered (3.2) for the cases and are reported in Tables 3 and 4, respectively. We note that for , RRBoost and LADBoost recovered almost all variables for clean and contaminated data, whereas the other boosting methods (especially L2Boost and MBoost) recovered much less variables when trained with contaminated data compared to clean data. In the high-dimensional case (Table 4), RRBoost also performs well (with at least 90% of the variables recovered), whereas variable selection in contaminated settings appeares to be more difficult for the other methods. In particular, L2Boost and MBoost almost never selected the true variables, considering their fraction of variables recovered are less than on average.
4 Real Data
We evaluated the performance of our proposed algorithm on real-world data sets, where we expect to find complex correlation structures among the explanatory variables. We considered 6 data sets with different combinations of number of cases and explanatory variables. Their main characteristics are summarized in Table 5.
|wage||3000||8||ISLR R package (james2013islr)|
|nir||103||94||pls R package (wehrens2007pls)|
|tecator||215||100||caret R package (kuhn2015short)|
|glass||180||486||Serneels et al. (serneels2005partial)|
To compare the prediction power of different boosting algorithms we follow the setting in jung2016robust. Each data set was split into a training set (composed of 60% of the data), a validation set (20% of the data) and a test set with the remaining 20%. We ran the boosting algorithms on the training sets, and used the validation sets to select an early stopping time. The prediction performances are evaluated on the test sets.
To study the behaviour of the algorithms when the data may be contaminated, we randomly added 20% of outliers to the training and validation sets as follows:
where are the original responses, are random errors, and is a constant chosen to achieve a desired signal-to-noise ratio (SNR). As before, we use
is the sample variance of the response variable. We setfor the high dimensional glass data and for the other data sets. For the errors we considered two contamination settings:
for 20% of the data;
for 20% of the data.
Predictive performance was measured with the average absolute deviance (AAD) of residuals:
at the early stopping time . We used the same base-learners and number of iterations as in Section 3. In order to reduce the potential variability induced by the random partition into training, validation and test sets, we repeated the experiment 50 times with different random splits. The summary statistics for clean, symmetrically contaminated and asymmetrically contaminated data sets are displayed in Tables 6 to 8.
As expected, Table 6 shows that when the data do not contain outliers the difference between RRBoost and L2Boost is not significant. However, when outliers were introduced in the data, Tables 7 and 8 indicate that LADBoost and RRBoost generally perform better than the other methods. In particular, when the data sets were asymmetrically contaminated, RRBoost had the best performance in terms of mean AAD for most data sets.
In this paper we propose a novel robust boosting estimator (RRBoost) which combines S-type and an M-type boosting estimators. This approach allows us to use bounded loss functions, which are known to provide strong protection against outliers. As expected, our experiments show that outliers, particularly when they are asymmetrically distributed, can affect significantly the performance of the classical L2Boost algorithm. We also found that our proposed RRBoost can be much more robust than other prior robust boosting methods, while behaving very similarly to L2Boost when the data do not contain atypical observations. In particular, the second stage of RRBoost significantly improves the prediction accuracy of the initial SBoost fit. Using a permutation-based variable importance ranking we show that in our experiments RRBoost achieved the best variable selection accuracy across the different contamination settings.