Robust Variable Selection under Cellwise Contamination

10/24/2021
by Peng Su, et al.

Cellwise outliers are widespread in data, and traditional robust methods may fail when applied to datasets under such contamination. We propose a variable selection procedure that uses a pairwise robust estimator to obtain an initial empirical covariance matrix among the response and potentially many predictors. We then replace the primary design matrix and the response vector with their robust counterparts based on the estimated covariance matrix. Finally, we adopt the adaptive Lasso to obtain variable selection results. The proposed approach is robust to cellwise outliers in regular and high dimensional settings, and empirical results show good performance in comparison with recently proposed alternative robust approaches, particularly in the challenging setting where contamination rates are high but the magnitude of outliers is moderate. Real data applications demonstrate the practical utility of the proposed method.


1 Introduction

Consider a linear regression model where $n$ observations $(y_i, \boldsymbol{x}_i)$, $i = 1, \dots, n$, are modelled through

$$y_i = \boldsymbol{x}_i^\top \boldsymbol{\beta} + \varepsilon_i, \quad i = 1, \dots, n, \qquad (1)$$

where $y_i$ is the response, $\boldsymbol{x}_i$ is a $p$-vector of predictors, $\boldsymbol{\beta}$ is a $p$-vector of regression parameters and $\varepsilon_i$ is an independent random error with variance $\sigma^2$. If an intercept is included in model (1), then the first component of $\boldsymbol{x}_i$ equals 1 for all $i$. For simplicity in notation, if not otherwise mentioned, we write regression models without an intercept term.

It is common for real and raw data to contain outliers. Usually, the term outlier refers to rowwise outliers, as visualised in the left panel of Figure 1; that is, all elements of an entire row of the design matrix together render the $i$th row outlying.

Traditional robust regression estimators such as M-estimators (Huber, 2004) handle rowwise outliers by mitigating the influence of outlying rows through robust loss functions in the objective function, such as in

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho\!\left(\frac{y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta}}{\hat\sigma}\right),$$

where $\rho$ is a robust loss function and $\hat\sigma$ is an initial robust estimate of the residual scale. Popular robust loss functions include Huber's loss and Tukey's biweight loss. These estimators are less sensitive to rowwise outliers because the robust loss function downweights outlying rows as identified by their large residuals. Although these estimators are robust, they can overfit when there are many predictors, and they have poor robustness and efficiency properties when the $p$ versus $n$ ratio is high (see, e.g., Maronna and Yohai, 2015; Smucler and Yohai, 2015).

Recently, Chang et al. (2018) combined robust M-estimation with the adaptive Lasso penalty:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho\!\left(\frac{y_i - \boldsymbol{x}_i^\top \boldsymbol{\beta}}{\hat\sigma}\right) + \lambda \sum_{j=1}^{p} w_j |\beta_j|,$$

where $\lambda$ is a tuning parameter and the $w_j$ are adaptive weights obtained from initial estimates $\tilde\beta_j$, $j = 1, \dots, p$. Chang et al. (2018) use Tukey's biweight loss because its derivative is redescending, which is a necessary property for the loss function to be robust against outliers in the design matrix (e.g., Rousseeuw and Yohai, 1984).

However, the rowwise outlier paradigm (visualised in the left panel of Figure 1) is often too narrow a framework. For some datasets, most cells in a row are clean while just a few cells are contaminated. Alqallaf et al. (2009) note the propagation of cellwise outliers: under the independent contamination model, for a contamination rate $\varepsilon$ of cells, the expected proportion of contaminated observation rows is $1 - (1 - \varepsilon)^p$, which on average rapidly exceeds $50\%$ with increasing dimension $p$, and this can be even more rapid in special cases, as shown in the right panel of Figure 1. Specifically, Figure 1 shows two scenarios with the same proportion of contaminated cells in the design matrix, i.e. 20% of entries are contaminated overall. However, the scenario in the left panel has only a few observation rows contaminated, whereas every single row in the right panel is contaminated. Thus, rowwise methods may fail with cellwise outliers, because even a small overall proportion of contaminated cells may result in a large proportion of rows having at least one outlying cell. The proportion of contaminated rows can therefore very quickly exceed the breakdown point, the maximal proportion of rowwise contamination which can be accommodated by a traditional robust estimation method (at most 50%), rendering even the most robust traditional methods ineffective.
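To see how quickly this propagation takes hold, the expected proportion of contaminated rows can be computed directly; a small R illustration, with example values of our own choosing:

```r
# Propagation of cellwise outliers: under the independent contamination
# model, the expected proportion of rows with at least one outlying cell
# is 1 - (1 - eps)^p.
eps <- 0.02                   # 2% of cells contaminated
p   <- c(10, 50, 100, 500)    # number of predictors
round(1 - (1 - eps)^p, 2)
#> [1] 0.18 0.64 0.87 1.00
```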

Figure 1: Rowwise and cellwise outliers: the outlying cells are rendered in white and the uncontaminated cells are shown in gray. For both panels, 20% of the cells are contaminated. However, the left panel has 4 out of 20 rows outlying while the right panel has 18 rows outlying.

A natural choice to handle cellwise outliers is to perform outlier detection first. For example, Rousseeuw and Bossche (2018) predict the value of a cell and then flag a cell as outlying if its observed value and its predicted value differ by too much. Debruyne et al. (2019) regard all outliers as rowwise outliers and then detect which cells contribute most to the outlying rows. Raymaekers and Rousseeuw (2019) detect outlying cells for each row with a ‘cellflagger’ technique, which combines a Lasso penalty with a stepwise application of constructed cutoff values. After an initial outlier detection, there are two natural possibilities for a next step: (1) impute non-outlying values for those cells flagged as outlying, as done in Rousseeuw and Bossche (2018) and Debruyne et al. (2019); (2) regard detected outliers as missing and then obtain a robust estimate directly, as in Raymaekers and Rousseeuw (2019).

Very little work has been done in the context of robust regression under cellwise contamination. Ollerer et al. (2016) propose the shooting S algorithm to perform robust regression: first, a robust imputation of the design matrix is obtained; then the shooting S algorithm is used to obtain robust estimates of the regression coefficients. Bottmer et al. (2021) combine the shooting S algorithm with the Lasso to obtain sparse regression results. However, in our empirical studies we found this method can be improved upon, particularly in high dimensional cases and when contamination rates are high but the magnitude of outliers is moderate, which is a very challenging setting when dealing with cellwise outliers.

Leung et al. (2016) propose a three step regression method: first, cellwise outliers are detected marginally; second, a robust estimator of multivariate location and scatter is applied to obtain robust covariance estimates; finally, the regression coefficients are computed from the estimates obtained in the second step. However, this method only copes with marginal cellwise outliers and does not give sparse regression results.

Filzmoser et al. (2020) propose a cellwise robust M-estimator that replaces the detected outliers with imputed values. This method works well in non-sparse settings but not for sparse data generating models.

To obtain robust variable selection results under cellwise contamination, especially in high dimensional cases, we propose a procedure that performs variable selection based on a robust estimate of the covariance matrix among the response and the predictors. First, we use a pairwise robust estimator to obtain an initial empirical covariance matrix. This robust covariance matrix is then used to achieve robust variable selection. Variable selection based on covariances is introduced in Section 2 and our proposed robust method using this device is presented in Section 3. Our method is robust to cellwise outliers in low and high dimensional settings. Simulations, presented in Section 4, show good performance in comparison with recently proposed approaches. Real data applications in Section 5 demonstrate the utility of our proposed method. We conclude the paper with some remarks in Section 6.

2 Variable selection based on covariances

The quadratic loss function can be expressed as (see, e.g., Loh and Wainwright, 2012)

$$L(\boldsymbol{\beta}) = \frac{1}{2}\, \boldsymbol{\beta}^\top \hat{\boldsymbol{\Sigma}}_{xx} \boldsymbol{\beta} - \hat{\boldsymbol{\sigma}}_{xy}^\top \boldsymbol{\beta}, \qquad (2)$$

where $\hat{\boldsymbol{\Sigma}}_{xx}$ and $\hat{\boldsymbol{\sigma}}_{xy}$ are the components of the empirical covariance matrix among the predictors and the response,

$$\hat{\boldsymbol{\Sigma}} = \begin{pmatrix} \hat{\sigma}_{yy} & \hat{\boldsymbol{\sigma}}_{xy}^\top \\ \hat{\boldsymbol{\sigma}}_{xy} & \hat{\boldsymbol{\Sigma}}_{xx} \end{pmatrix}. \qquad (3)$$

The solution of (2) can be expressed as $\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\Sigma}}_{xx}^{-1} \hat{\boldsymbol{\sigma}}_{xy}$. Under cellwise contamination, Leung et al. (2016) highlight that one way to obtain robust regression results is to use a robust counterpart of the empirical covariance matrix.

The quadratic loss function in (2) can be expressed more elegantly. Given that $\hat{\boldsymbol{\Sigma}}$ is positive semi-definite, we can define its square root $\hat{\boldsymbol{\Sigma}}^{1/2}$; the first column of $\hat{\boldsymbol{\Sigma}}^{1/2}$ is denoted $\tilde{\boldsymbol{y}}$ and $\tilde{\boldsymbol{X}}$ is a matrix consisting of the remaining columns of $\hat{\boldsymbol{\Sigma}}^{1/2}$. Then we can rewrite (2) as

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2}\, \big\| \tilde{\boldsymbol{y}} - \tilde{\boldsymbol{X}} \boldsymbol{\beta} \big\|_2^2, \qquad (4)$$

which is a classic quadratic loss minimization problem.

To achieve robust variable selection, and considering the consistency properties of the adaptive Lasso (Zou, 2006), we incorporate its penalty to obtain the regularized objective function

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \frac{1}{2}\, \big\| \tilde{\boldsymbol{y}} - \tilde{\boldsymbol{X}} \boldsymbol{\beta} \big\|_2^2 + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\tilde{\beta}_j|}, \qquad (5)$$

where $\lambda$ is the tuning parameter and $\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\Sigma}}_{xx}^{-1} \hat{\boldsymbol{\sigma}}_{xy}$ is an initial non-regularized estimate of $\boldsymbol{\beta}$. In the minimization of (5), we choose the value of the tuning parameter $\lambda$ that minimizes

$$\mathrm{BIC}(\lambda) = \log\!\big(\hat{\sigma}_\varepsilon^2(\lambda)\big) + \frac{\log(n)}{n}\, \mathrm{df}(\lambda), \qquad (6)$$

where $\hat{\sigma}_\varepsilon^2(\lambda)$ is the estimated residual variance and $\mathrm{df}(\lambda)$ denotes the degrees of freedom of the regularized regression model. While there are many criteria that could be applied to determine the tuning parameter $\lambda$, we have selected the BIC (Schwarz, 1978) because of its ease of implementation and good performance.
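To make this selection step concrete, the following R sketch (with our own illustrative function names, not the authors' released code, which is linked in the supplementary material) builds the pseudo response and pseudo design from a given $(p+1) \times (p+1)$ covariance matrix with the response first, fits the adaptive Lasso path with glmnet, and picks $\lambda$ by a BIC of the form assumed in (6):

```r
# Minimal sketch of covariance-based adaptive Lasso selection (Section 2).
# Sigma: (p+1) x (p+1) covariance matrix, response in the first row/column.
# n: sample size, used only in the BIC.
library(glmnet)

covsel_alasso <- function(Sigma, n) {
  eig     <- eigen(Sigma, symmetric = TRUE)
  S_half  <- eig$vectors %*% diag(sqrt(pmax(eig$values, 0))) %*% t(eig$vectors)
  y_tilde <- S_half[, 1]                        # pseudo response
  X_tilde <- S_half[, -1, drop = FALSE]         # pseudo design matrix
  beta0   <- drop(solve(Sigma[-1, -1], Sigma[-1, 1]))  # initial estimate
  fit <- glmnet(X_tilde, y_tilde, intercept = FALSE, standardize = FALSE,
                penalty.factor = 1 / abs(beta0))       # adaptive weights
  # Residual variance along the lambda path via the covariance identity
  # ||y_tilde - X_tilde b||^2 = s_yy - 2 s_xy' b + b' S_xx b.
  rss <- apply(fit$beta, 2, function(b)
    Sigma[1, 1] - 2 * sum(b * Sigma[-1, 1]) +
      drop(crossprod(b, Sigma[-1, -1] %*% b)))
  # BIC as assumed in (6), written in the equivalent n*log(.) + log(n)*df form.
  bic <- n * log(pmax(rss, 1e-12)) + log(n) * fit$df
  fit$beta[, which.min(bic)]                    # coefficients at optimal lambda
}
```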

3 Robust estimation of the covariance matrix

A natural way to get a robust estimate of the covariance matrix under cellwise contamination is to obtain robust estimates of the marginal scales, which we denote $\hat{\sigma}_j$, $j = 1, \dots, p+1$, and of the pairwise correlations $\hat{\rho}_{jk}$, $j \neq k$. Robust scale estimators such as the median absolute deviation, the $Q_n$ estimator (Rousseeuw and Croux, 1993) and the $P_n$ estimator (Tarr et al., 2012) can be used to obtain robust and highly efficient scale estimates. Robust pairwise estimators can be used to obtain robust correlations, as further detailed next.

3.1 Correlation estimator

Given a pair of observation vectors for variables $j$ and $k$, i.e. $\boldsymbol{x}_j$ and $\boldsymbol{x}_k$, a natural estimator of the strength of linear association is the Pearson correlation coefficient, stressing of course that this is not a robust estimator of linear dependence. Under contaminated settings, a simple pairwise estimator is based on the identity proposed by Gnanadesikan and Kettenring (1972),

$$\hat{\rho}_{jk} = \frac{\hat{\sigma}^2(\boldsymbol{u}) - \hat{\sigma}^2(\boldsymbol{v})}{\hat{\sigma}^2(\boldsymbol{u}) + \hat{\sigma}^2(\boldsymbol{v})}, \qquad (7)$$

where $\hat{\sigma}(\cdot)$ is a robust estimator of scale and $\boldsymbol{u} = \boldsymbol{x}_j / \hat{\sigma}(\boldsymbol{x}_j) + \boldsymbol{x}_k / \hat{\sigma}(\boldsymbol{x}_k)$, $\boldsymbol{v} = \boldsymbol{x}_j / \hat{\sigma}(\boldsymbol{x}_j) - \boldsymbol{x}_k / \hat{\sigma}(\boldsymbol{x}_k)$. This estimator is also used in Raymaekers and Rousseeuw (2021) to obtain fast robust correlations. With this approach, we can use scale estimators to obtain a correlation estimator. In our simulations and applications, the $Q_n$ estimator is used to compute robust scales since it is very robust and reasonably efficient (Rousseeuw and Croux, 1993). After obtaining the estimated correlation coefficients, we assemble them into an empirical correlation matrix as in Tarr et al. (2016).
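A minimal sketch of (7) in R, using the $Q_n$ scale from the robustbase package (the function names gk_cor and gk_cor_matrix are ours):

```r
# Pairwise robust correlation via the Gnanadesikan-Kettenring identity (7).
library(robustbase)

gk_cor <- function(x, y, scale_fn = Qn) {
  u <- x / scale_fn(x) + y / scale_fn(y)   # standardised sum
  v <- x / scale_fn(x) - y / scale_fn(y)   # standardised difference
  (scale_fn(u)^2 - scale_fn(v)^2) / (scale_fn(u)^2 + scale_fn(v)^2)
}

# Assemble all pairwise estimates into an empirical correlation matrix.
gk_cor_matrix <- function(X, scale_fn = Qn) {
  p <- ncol(X)
  R <- diag(p)
  for (j in seq_len(p - 1)) for (k in (j + 1):p)
    R[j, k] <- R[k, j] <- gk_cor(X[, j], X[, k], scale_fn)
  R
}
```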

As an alternative, other nonparametric correlation estimators such as the quadrant correlation, Spearman correlation or Gaussian rank correlation could be used to obtain robust correlation matrices as well. In this paper, we consider the Gaussian rank correlation (Boudt et al., 2012), defined as the sample correlation estimated from the normal scores of the data. For $x_{ij}$, the $i$th observation of the $j$th variable, we compute

$$z_{ij} = \Phi^{-1}\!\left(\frac{R(x_{ij})}{n+1}\right), \qquad (8)$$

where $R(x_{ij})$ denotes the rank of $x_{ij}$ among all elements of $\boldsymbol{x}_j$ and $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function. We then obtain the Gaussian rank correlation matrix by computing the correlation matrix of the $z_{ij}$. Compared with other nonparametric correlation estimators, the Gaussian rank correlation is robust and consistent under the multivariate normal distribution (Boudt et al., 2012).
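A corresponding sketch of the Gaussian rank correlation matrix, using only base R (the handling of ties via rank() is an implementation assumption):

```r
# Gaussian rank correlation (8): correlate the normal scores
# Phi^{-1}(rank / (n + 1)) of each column.
gauss_rank_cor <- function(X) {
  n <- nrow(X)
  Z <- apply(X, 2, function(x) qnorm(rank(x) / (n + 1)))  # normal scores
  cor(Z)
}
```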

3.2 Positive definite adjustment

A limitation of pairwise methods is that the estimated correlation matrices are not necessarily positive definite, and this is particularly prone to happen in high dimensional situations. Many methods have been devised to solve this problem. A positive definite (PD) calibration method is used by Huang et al. (2017),

$$\hat{\boldsymbol{R}}_\epsilon = \arg\min_{\boldsymbol{R} \,\succeq\, \epsilon \boldsymbol{I}} \big\| \boldsymbol{R} - \hat{\boldsymbol{R}} \big\|_F, \qquad (9)$$

where $\boldsymbol{I}$ denotes the identity matrix, $\| \cdot \|_F$ denotes the Frobenius norm and $\epsilon \geq 0$ is the eigenvalue threshold we set. When $\epsilon = 0$, the PD calibration method reduces to the classic nearest positive definite matrix method (Higham, 2002).

Given the spectral decomposition $\hat{\boldsymbol{R}} = \boldsymbol{\Gamma}\, \mathrm{diag}(\lambda_1, \dots, \lambda_q)\, \boldsymbol{\Gamma}^\top$, we have

$$\hat{\boldsymbol{R}}_\epsilon = \boldsymbol{\Gamma}\, \mathrm{diag}\big(\max(\lambda_1, \epsilon), \dots, \max(\lambda_q, \epsilon)\big)\, \boldsymbol{\Gamma}^\top. \qquad (10)$$

In terms of the Frobenius norm, $\hat{\boldsymbol{R}}_\epsilon$ has a smaller difference from $\hat{\boldsymbol{R}}$ than other calibration methods, such as the linear shrinkage estimator (Ledoit and Wolf, 2004).
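In code, the calibration in (10) is a simple eigenvalue truncation; the sketch below additionally rescales the result back to unit diagonal, which is our own assumption rather than part of (9)-(10):

```r
# PD calibration (9)-(10): raise eigenvalues below the threshold eps to eps;
# eps = 0 gives the classic nearest positive semi-definite adjustment.
pd_calibrate <- function(R, eps = 1e-4) {
  eig  <- eigen(R, symmetric = TRUE)
  lam  <- pmax(eig$values, eps)                        # threshold eigenvalues
  R_ps <- eig$vectors %*% diag(lam) %*% t(eig$vectors)
  cov2cor(R_ps)   # rescale to unit diagonal (our assumption, optional)
}
```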

In order to obtain sparse selection results, $\epsilon$ should not be too small. In Huang et al. (2017), the optimal $\epsilon$ is chosen by minimizing

(11)

where $c$ is a parameter that controls the range of $\epsilon$. Huang et al. (2017) suggest setting $c$ to be the smallest non-negative eigenvalue. However, as long as $\epsilon$ is positive, $c$ does not contribute to the minimization in (11); therefore, for simplicity, we set $c = 0$ in the following. The criterion in (11) penalises small values of $\epsilon$ rather than solely minimizing the Frobenius norm. Such penalty terms are widely used in statistical feature selection, for example in the AIC (Akaike, 1973) and BIC (Schwarz, 1978).

The estimated correlation matrices obtained from the Gaussian rank correlation and the Pearson correlation will also be ill conditioned when $p > n$. In this case, the PD calibration method can be used to adjust those correlation matrices as well.

After obtaining the estimates of the scale parameters and the correlation matrix, we obtain the robust empirical covariance matrix $\hat{\boldsymbol{\Sigma}} = \boldsymbol{D} \hat{\boldsymbol{R}} \boldsymbol{D}$, where $\boldsymbol{D}$ is a diagonal matrix consisting of the robust estimated scales and $\hat{\boldsymbol{R}}$ is the calibrated robust correlation matrix. In our algorithm we can also use $\hat{\boldsymbol{R}}$ directly, in which case the scaling of the predictors is dealt with initially.

This robust estimate of the empirical covariance matrix is then used to perform variable selection, as introduced in Section 2. Depending on the robust correlation estimator, we obtain two methods: the Adaptive Lasso based on Robust Pairwise correlations (ALRP) and the Adaptive Lasso based on Gaussian Rank correlations (ALGR); a sketch of the full ALRP pipeline is given below. Note that the Lasso based on the Pearson correlation matrix is equivalent to the Lasso itself. We continue with presenting empirical work on these and other methods in Section 4 and their application to real data in Section 5.
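Combining the sketches above gives a hypothetical end-to-end wrapper for the ALRP pipeline (again with our own function names, not the authors' released code):

```r
# Hypothetical ALRP wrapper: robust scales + pairwise correlations,
# PD calibration, then the covariance-based adaptive Lasso of Section 2.
# Z is the n x (p+1) data matrix with the response in the first column.
alrp <- function(Z) {
  d     <- apply(Z, 2, robustbase::Qn)      # robust marginal scales
  R     <- pd_calibrate(gk_cor_matrix(Z))   # robust, PD-calibrated correlations
  Sigma <- diag(d) %*% R %*% diag(d)        # robust covariance matrix
  covsel_alasso(Sigma, n = nrow(Z))         # Section 2 selection step
}
```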

4 Simulation Studies

To measure the utility of our proposed methods, we compare the performance of Lasso (Tibshirani, 1996), the sparse LTS (sLTS) (Alfons et al., 2013), the sparse Shooting S-estimator (sShootingS) (Bottmer et al., 2021), and two new approaches: the adaptive Lasso estimator based on robust pairwise correlation (ALRP) and the adaptive Lasso estimator based on Gaussian rank correlation (ALGR) in regular and high dimensional settings.

Our empirical work will show that no matter the setting, either regular when $n > p$ or high dimensional when $p > n$, ALRP shows superior performance compared to the Lasso and the other robust methods, particularly in the challenging setting where contamination rates are high but the magnitude of outliers is moderate. ALGR is overall a little inferior to ALRP but still performs strongly compared to the other methods.

4.1 Regular settings: $n > p$

Recall the linear regression model (1). In our simulations, we fix the sample size $n$, the dimension $p$ and a sparse vector $\boldsymbol{\beta}$ of true regression coefficients; the predictors and the random errors are sampled from specified distributions, with a prescribed correlation structure among the predictor variables.

Contamination proportions are set as $\varepsilon = 2\%$, $5\%$ and $10\%$ for all predictors separately. Outlying cells of all variables are randomly generated from a shifted contaminating distribution, and we vary the shift over a set of values to simulate outliers with different magnitudes. For appropriately chosen shifts, those cells will be outlying cells but not necessarily marginal outliers.

For the Lasso, ALGR and ALRP, we use the BIC as defined in (6) to choose the regularization tuning parameter $\lambda$. For sLTS and sShootingS, their default selection criteria are used. To assess the performance of the considered methods, we use the true positive rate (TPR), the false positive rate (FPR) and the balanced F-score defined by

$$F = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad (12)$$

where TP denotes the number of true positives, FP the number of false positives and FN the number of false negatives. We run 1,000 simulations for each scenario and use the average TPR, FPR and F-score to measure the performance of each method.

Figure 2: Selection results over 1,000 simulation runs for various contamination rates in the regular setting. Each column represents a contamination rate. The horizontal axis represents the magnitude of outlyingness and the vertical axis gives the TPR, FPR and F-score, respectively.

Figure 2 reports the empirical findings. All methods perform well when the cellwise contamination rate is 2%, as can be seen from the plots in the left column of Figure 2. This contamination rate is sufficiently low that all methods have good TPRs. The F-score and FPR of the Lasso are worse than those of the other methods, i.e. its F-scores are consistently smaller and its FPRs consistently larger. The F-score of ALGR is also slightly worse compared to the other robust methods considered here.

Considering the contamination rate of 5% (middle column of Figure 2), the F-scores, TPRs and FPRs of ALRP and sShootingS are generally better than those of the other methods, indicating that ALRP and sShootingS both maintain their good performance even as the contamination rate increases. ALGR performs slightly worse than ALRP and sShootingS in terms of F-score and FPR. The Lasso has a lower F-score and TPR, and it performs increasingly worse as the magnitude of the outliers increases; for large magnitudes its TPR drops to a low level. We also observe that sLTS becomes inferior as the outlier magnitude increases.

When the contamination rate is 10% (right column of Figure 2), ALRP performs best in terms of all performance measures, and we observe that it maintains a stable F-score, TPR and FPR over the considered range of outlier magnitudes. The performance of ALGR is a little worse than that of ALRP, but we also observe stable results overall. While sShootingS performs well for the smallest and largest outlier magnitudes considered, we note an inferior performance between those extremes, particularly for moderate magnitudes. We conclude that under this scenario, the performance of sShootingS depends on the magnitude of the outliers. Compared to the previous settings where e is either 2% or 5%, when e is 10% both the Lasso and sLTS show a more pronounced poor performance.

4.2 High dimensional settings: $p > n$

In this section we show the empirical findings in high dimensional cases. We continue with the simulation settings from the previous section but now increase the dimension $p$ such that $p > n$, while keeping the sample size $n$ unchanged. As mentioned in Section 3, under high dimensional settings the correlation matrices obtained from the Pearson correlation and the Gaussian rank correlation are ill-conditioned. In this case, the PD calibration method (Huang et al., 2017) is used to obtain adjusted covariance matrices for the Lasso and ALGR as well.

Figure 3: Selection results over 1,000 simulation runs for various contamination rates in the high dimensional setting. Each column represents a contamination rate. The horizontal axis represents the magnitude of outlyingness and the vertical axis gives the TPR, FPR and F-score, respectively.

Figure 3 shows the simulation results under high dimensional settings. Considering the contamination rate of 2% (left column of Figure 3), we observe that the three cellwise robust methods we consider (ALRP, ALGR and sShootingS) perform well, as they maintain good F-scores. The Lasso and sLTS perform worse, with F-scores that deteriorate as the outlier magnitude increases.

When the contamination rate is 5% (middle column of Figure 3), ALRP, ALGR and sShootingS still perform well in terms of the F-score. Meanwhile, the TPRs, FPRs and F-scores of the Lasso and sLTS become more inferior, especially as the outlier magnitude increases; we observe that for large magnitudes the TPRs of both the Lasso and sLTS drop to low levels.

When the contamination rate is 10% (right column of Figure 3), ALRP still shows good and stable performance over the considered range of outlier magnitudes. ALGR performs slightly worse than ALRP, as it has a lower F-score. The Lasso and sLTS have worse TPRs and F-scores as the outlier magnitude increases. As observed in the low dimensional simulations, the F-score of sShootingS is good for the smallest and largest magnitudes but becomes inferior for moderate magnitudes.

5 Real data applications

We show the usage of the proposed methods using two real datasets: the Boston Housing data and the Breast Cancer data. They are representative of applications in regular and high dimensional settings, respectively. Our aim is to explore how the different methods vary in real data applications.

5.1 Boston Housing data

We illustrate the effect of cellwise outlier propagation on the previously considered estimators using the Boston Housing dataset. This data is available at the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php) and was collected from 506 census samples on a range of variables. The original objective of the study was to analyze the association between the median housing values (medv) in Boston and the residents' willingness to pay for clean air, as well as the association between medv and other variables in the dataset.

We only consider the quantitative variables that have been extensively studied. A sparse regression model is fitted with medv as the response, and we compare the performance of five methods: Lasso (Tibshirani, 1996), sLTS (Alfons et al., 2013), sShootingS (Bottmer et al., 2021), as well as the newly proposed methods ALRP and ALGR.

To measure the performance of variable selection when some known redundant variables are added, we generate ten additional random variables as redundant predictors using the same setting as in the previous simulation section. Therefore, we now have 19 predictors to choose from.

To test the performance under cellwise contamination, we standardize the 19 predictors with robust estimators of location (here we use the median) and scale (here we use $Q_n$). Then, for each predictor, 10% of the cells are replaced by randomly generated cellwise outliers. As a comparison, we also run simulations without any contamination, to investigate how stable the selection of the various methods is when known outliers are added to the data. We repeat this process of adding ten redundant variables followed by generating 10% of outliers in the 19 explanatory variables 1,000 times, and then compute the selection rates of each variable (shown in Table 1).

method  e  selection rates (nine original predictors)  FPR
ALRP 0 1.00 1.00 1.00 0.00 1.00 1.00 0.00 0.00 0.80 0.08
0.1 1.00 1.00 1.00 0.00 1.00 1.00 0.00 0.00 0.05 0.08
ALGR 0 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.00 0.00 0.04
0.1 1.00 1.00 1.00 0.05 1.00 1.00 0.53 0.63 0.18 0.07
sShootingS 0 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.1 1.00 1.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sLTS 0 1.00 1.00 1.00 0.00 1.00 1.00 0.00 1.00 1.00 0.02
0.1 1.00 0.97 0.71 0.29 0.86 0.84 0.31 0.39 1.00 0.24
lasso 0 1.00 1.00 1.00 0.76 1.00 1.00 0.54 0.03 1.00 0.11
0.1 1.00 1.00 1.00 0.43 1.00 0.93 0.80 0.81 1.00 0.17
Table 1: Variable selection rates for the Boston Housing data over 1,000 simulation runs. The nine unlabelled columns give the selection rates of the nine original predictors, FPR represents the average selection rate across the ten redundant predictors and e is the contamination rate.

For all five methods reported in Table 1, we observe some inconsistent selection results across the different settings. ALRP has the most consistent selection results for the two scenarios considered (without cellwise outliers and with 10% cellwise outliers), although it still shows some inconsistency for one predictor (its selection rate is 0.80 when there is no cellwise contamination whereas it is only 0.05 when there is 10% contamination). ALGR also shows some inconsistency for several predictors, although it has a good FPR. Interestingly, for this dataset, sShootingS selects essentially nothing under either contamination setting, showing extreme sparsity. On the other hand, sLTS shows inconsistent selection results for many predictors and an inferior FPR under 10% cellwise contamination. The selection results of the Lasso also vary considerably and its FPR is worse.

5.2 Breast Cancer Data

To illustrate a real data application in a high dimensional setting, we use the Breast Cancer dataset (van ’t Veer et al., 2002) as an example. This dataset contains 24,481 genes of 117 breast tumor patients. For the 78 breast cancer patients with disease-free survival time, gene expression measurements are available for each of the genes. We construct a regression model to select important genes whose expression is significantly related to disease-free patient survival. To this end, we regard the disease-free survival time as the response and use the gene expression values as predictors.

It is assumed that only a few genes are associated with the disease-free survival time. For very high dimensional datasets, one usually first screens variables and then runs variable selection on the screened predictors, as in Bai et al. (2021). Thus we first compute the robust marginal correlations of the gene expressions with the disease-free survival time (using the pairwise estimator of Gnanadesikan and Kettenring (1972) with the $Q_n$ scale estimator (Rousseeuw and Croux, 1993)) and then select the 200 genes with the largest robust pairwise correlation coefficients as predictors.

Outlier detection results for these predictors based on DDC (Rousseeuw and Bossche, 2018) are shown in Figure 4. Most cells are not flagged as outlying; the flagged cells are those whose observed and predicted values differ too much, as detailed in the figure caption.

Figure 4: Outlier cell map. Most cells are yellow, showing they are not detected as outliers. Cells are flagged as outlying if the observed value and the predicted value differ too much: a red cell means the observed value is significantly larger than the predicted value and a blue cell means the observed value is significantly smaller than the predicted value.

In this screened Breast Cancer dataset, only a few cells are contaminated. The average contamination rate across all 200 predictors is low, and for most genes the contamination rate is small, although some genes have a noticeably higher contamination rate, with gene Contig55181_RC having the highest. Looking at the outlier pattern in each row, that is for each patient, the contamination rates of most patients are low. One patient has by far the most detected outliers, and we could regard this observation as a rowwise outlier. Overall, this is clearly a high dimensional cellwise outlier situation, with outlying cells spread across the observations and predictors of the breast cancer data.

Given the cellwise contamination, traditional variable selection methods such as the Lasso, or rowwise robust variable selection methods, may lead to misleading selection results. With this in mind, cellwise robust methods may offer better solutions. A sparse regression model is fitted via Lasso, sLTS, sShootingS, ALRP and ALGR, separately. All selected genes are sorted based on their robust pairwise correlations with the survival time. The selection results are shown in Table 2.

Method Size Selected genes
ALRP 20 AB040971 AJ236885 AK000004 AL162078 Contig37028_RC
Contig40406_RC Contig57357_RC Contig62306 NM_001616 NM_001830
NM_002986 NM_003125 NM_003860 NM_004052 NM_004618
NM_005055 NM_006571 NM_012316 NM_013437 NM_020120
ALGR 10 AL137302 AL162078 Contig48328_RC Contig54893_RC Contig54956_RC
Contig55181_RC M94046 NM_001616 NM_005881 NM_006301
sShootingS 72 AB002369 AB033032 AB033035 AF052185 AF257175
AK000365 AK001044 AL080059 AL110193 AL137302
AL137362 AL137527 AL162078 Contig1408 Contig14797_RC
Contig19451_RC Contig27464_RC Contig28947_RC Contig29141_RC Contig29995_RC
Contig30146_RC Contig31424_RC Contig31449_RC Contig37063_RC Contig38966_RC
Contig40406_RC Contig40772_RC Contig41521_RC Contig42740_RC Contig47940_RC
Contig48328_RC Contig48472_RC Contig54893_RC Contig54956_RC Contig55181_RC
Contig56229 Contig57521_RC Contig726_RC Contig8862_RC NM_000915
NM_000950 NM_001616 NM_001673 NM_001826 NM_002300
NM_002442 NM_002824 NM_003110 NM_003158 NM_003245
NM_004052 NM_004911 NM_005055 NM_005089 NM_005498
NM_005785 NM_005881 NM_006297 NM_006301 NM_006804
NM_007278 NM_013262 NM_014675 NM_014699 NM_014706
NM_014894 NM_014908 NM_015270 NM_016175 NM_016434
NM_019848 NM_020120
sLTS 27 AB033035 AJ224741 Contig16447_RC Contig31424_RC Contig37063_RC
Contig45347_RC Contig45917_RC M55643 M94046 NM_000950
NM_001826 NM_002461 NM_002811 NM_002986 NM_003158
NM_003641 NM_003862 NM_004723 NM_005055 NM_005452
NM_005657 NM_006571 NM_007278 NM_012429 NM_013262
NM_014908 one_barcode
Lasso 11 AL137302 AL162078 Contig37063_RC Contig42740_RC Contig48328_RC
M94046 NM_001673 NM_005881 NM_006301 NM_016390
one_barcode
Table 2: Model sizes and gene selection results of the screened Breast Cancer dataset. The entries are the systematic names of selected genes. The methods we use are shown in the first column and the corresponding model sizes are shown in the second column.

From Table 2, we observe that the gene selection results vary greatly between the different methods. In stark contrast to the previous real dataset, here sShootingS selects the most genes while the other methods select far fewer. Given the insights from the Boston Housing data (Section 5.1) and our empirical work in Section 4, we conclude that the selection results from ALRP are preferred here. Of course, the results here are from a single run of the methods only; to explore further, and to investigate stability in particular, one could run more simulations by resampling.

6 Conclusion

We have developed a robust variable selection method that deals with cellwise outliers in the design matrix of regression models. Traditional robust methods can produce misleading estimates when there are cellwise outliers, particularly because the proportion of contaminated rows may exceed the breakdown point. Our method tackles this problem by recasting a robust regression parameter estimation problem as a robust covariance matrix estimation problem. We have shown that our methods work well in both low dimensional and high dimensional cases. Simulations illustrated that ALRP and ALGR perform better than the other methods considered, and real data applications demonstrated the utility of the proposed method.

Acknowledgements

Su’s work was supported by the Chinese Scholarship Council (201906360181). Muller and Tarr were supported by the Australian Research Council (DP210100521). Tarr was supported in part by the AIRinnoHK program of the Innovation and Technology Commission of Hong Kong.


SUPPLEMENTARY MATERIAL

Code:

R functions for ALRP and ALGR are available on the GitHub page of the first author (https://github.com/PengSU517/robcovsel).

References

  • H. Akaike (1973) Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and B. F. Csaki (Eds.), 2nd International Symposium on Information Theory. Cited by: §3.2.
  • A. Alfons, C. Croux, and S. Gelper (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics 7 (1), pp. 226–248. External Links: ISSN 1932-6157, Link Cited by: §4, §5.1.
  • F. Alqallaf, S. Van Aelst, V. J. Yohai, and R. H. Zamar (2009) Propagation of outliers in multivariate data. The Annals of Statistics 37 (1), pp. 311–331 (en). External Links: ISSN 0090-5364, Link, Document Cited by: §1.
  • Y. Bai, M. Tian, M. Tang, and W. Lee (2021) Variable selection for ultra-high dimensional quantile regression with missing data and measurement error. Statistical Methods in Medical Research 30 (1), pp. 129–150. External Links: ISSN 0962-2802, Link, Document Cited by: §5.2.
  • L. Bottmer, C. Croux, and I. Wilms (2021) Sparse regression for large data sets with outliers. European Journal of Operational Research, in press (en). Note: DOI: https://doi.org/10.1016/j.ejor.2021.05.049 External Links: Link Cited by: §1, §4, §5.1.
  • K. Boudt, J. Cornelissen, and C. Croux (2012) The Gaussian rank correlation estimator: robustness properties. Statistics and Computing 22 (2), pp. 471–483 (en). External Links: ISSN 0960-3174, 1573-1375, Link, Document Cited by: §3.1.
  • L. Chang, S. Roberts, and A. Welsh (2018) Robust Lasso regression using Tukey’s biweight criterion. Technometrics 60 (1), pp. 36–47 (en). External Links: ISSN 0040-1706, 1537-2723, Link, Document Cited by: §1.
  • M. Debruyne, S. Hoppner, S. Serneels, and T. Verdonck (2019) Outlyingness: which variables contribute most?. Statistics and Computing 29 (4), pp. 707–723 (en). External Links: ISSN 0960-3174, 1573-1375, Link, Document Cited by: §1.
  • P. Filzmoser, S. Hoppner, I. Ortner, S. Serneels, and T. Verdonck (2020) Cellwise robust M regression. Computational Statistics & Data Analysis 147, pp. 106944 (en). External Links: ISSN 01679473, Link, Document Cited by: §1.
  • R. Gnanadesikan and J. R. Kettenring (1972) Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28, pp. 81–124. Cited by: §3.1, §5.2.
  • N. J. Higham (2002) Computing the nearest correlation matrix–a problem from finance. IMA Journal of Numerical Analysis 22 (3), pp. 329–343 (en). External Links: ISSN 0272-4979, 1464-3642, Link, Document Cited by: §3.2.
  • C. Huang, D. Farewell, and J. Pan (2017) A calibration method for non-positive definite covariance matrix in multivariate data analysis. Journal of Multivariate Analysis 157, pp. 45–52 (en). External Links: ISSN 0047259X, Link, Document Cited by: §3.2, §3.2, §4.2.
  • P. J. Huber (2004) Robust Statistics. John Wiley & Sons, New York (en). External Links: ISBN 978-0-471-65072-0 Cited by: §1.
  • O. Ledoit and M. Wolf (2004) A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88 (2), pp. 365–411 (en). External Links: ISSN 0047259X, Link, Document Cited by: §3.2.
  • A. Leung, H. Zhang, and R. Zamar (2016) Robust regression estimation and inference in the presence of cellwise and casewise contamination. Computational Statistics & Data Analysis 99, pp. 1–11 (en). External Links: ISSN 01679473, Link, Document Cited by: §1, §2.
  • P. Loh and M. J. Wainwright (2012) High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics 40 (3), pp. 1637–1664 (en). External Links: ISSN 0090-5364, Link, Document Cited by: §2.
  • R. A. Maronna and V. J. Yohai (2015) High finite-sample efficiency and robustness based on distance-constrained maximum likelihood. Computational Statistics & Data Analysis 83, pp. 262–274 (en). External Links: ISSN 01679473, Link, Document Cited by: §1.
  • V. Ollerer, A. Alfons, and C. Croux (2016) The shooting S-estimator for robust regression. Computational Statistics 31 (3), pp. 829–844 (en). External Links: ISSN 0943-4062, 1613-9658, Link, Document Cited by: §1.
  • J. Raymaekers and P. J. Rousseeuw (2019) Flagging and handling cellwise outliers by robust estimation of a covariance matrix. arXiv: 1912.12446 (en). External Links: Link Cited by: §1.
  • J. Raymaekers and P. J. Rousseeuw (2021) Fast robust correlation for high-dimensional data. Technometrics 63 (2), pp. 184–198. External Links: ISSN 0040-1706, Link, Document Cited by: §3.1.
  • P. Rousseeuw and V. Yohai (1984) Robust Regression by Means of S-Estimators. In Robust and Nonlinear Time Series Analysis, J. Franke, W. Härdle, and D. Martin (Eds.), New York, pp. 256–272. External Links: ISBN 978-1-4615-7821-5 Cited by: §1.
  • P. J. Rousseeuw and W. V. D. Bossche (2018) Detecting deviating data cells. Technometrics 60 (2), pp. 135–145 (en). External Links: ISSN 0040-1706, 1537-2723, Link, Document Cited by: §1, §5.2.
  • P. J. Rousseeuw and C. Croux (1993) Alternatives to the Median Absolute Deviation. Journal of the American Statistical Association 88 (424), pp. 1273–1283 (en). External Links: ISSN 0162-1459, 1537-274X, Link, Document Cited by: §3.1, §3, §5.2.
  • G. E. Schwarz (1978) Estimating the dimension of a model. The Annals of Statistics 6 (2). Cited by: §2, §3.2.
  • E. Smucler and V. J. Yohai (2015) Highly Robust and Highly Finite Sample Efficient Estimators for the Linear Model. In Modern Nonparametric, Robust and Multivariate Methods: Festschrift in Honour of Hannu Oja, K. Nordhausen and S. Taskinen (Eds.), pp. 91–108. External Links: ISBN 978-3-319-22404-6, Link, Document Cited by: §1.
  • G. Tarr, S. Muller, and N. Weber (2012) A robust scale estimator based on pairwise means. Journal of Nonparametric Statistics 24 (1), pp. 187–199 (en). External Links: ISSN 1048-5252, 1029-0311, Link, Document Cited by: §3.
  • G. Tarr, S. Muller, and N. Weber (2016) Robust estimation of precision matrices under cellwise contamination. Computational Statistics & Data Analysis 93, pp. 404–420 (en). External Links: ISSN 01679473, Link, Document Cited by: §3.1.
  • R. Tibshirani (1996) Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288 (en). External Links: ISSN 00359246, Link, Document Cited by: §4, §5.1.
  • L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415 (6871), pp. 530–536 (en). External Links: ISSN 1476-4687, Link, Document Cited by: §5.2.
  • H. Zou (2006) The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101 (476), pp. 1418–1429 (en). External Links: ISSN 0162-1459, 1537-274X, Link, Document Cited by: §2.