    # Robust censored regression with l1-norm regularization

This paper considers inference in a linear regression model with random right censoring and outliers. The number of outliers can grow with the sample size while their proportion goes to zero. The model is semiparametric and we make only very mild assumptions on the distribution of the error term, contrary to most other existing approaches in the literature. We propose to penalize the estimator proposed by Stute for censored linear regression by the l1-norm. We derive rates of convergence and establish asymptotic normality of the estimator of the regression coefficients. Our estimator has the same asymptotic variance as Stute's estimator in the censored linear model without outliers. Hence, there is no loss of efficiency as a result of robustness. Tests and confidence sets can therefore rely on the theory developed by Stute. The outlined procedure is also computationally advantageous, since it amounts to solving a convex optimization program. We also propose a second estimator which uses the proposed penalized Stute estimator as a first step to detect outliers. It has similar theoretical properties but better performance in finite samples as assessed by simulations.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The present paper considers a linear model with outliers and censoring. We have at hand a dataset of

independent realizations of an outcome random variable

and a random vector of covariates

with support in , where is fixed. The variable is the minimum between a random variable and a censoring time , and and are related through the following model:

 Ti=X⊤iβ+αi+ξi,i=1,…,n, (1)

where , while the error term and are real-valued random variables. The observation is called an outlier if . Let the average proportion of outliers be denoted by . We assume that is positive definite and . Under these conditions, is the coefficient of the best linear predictor of by . The goal is to estimate based on a sample , where is the censoring indicator. The vector acts as a nuisance parameter that may represent measurement errors.

When the variable is the logarithm of a duration, model (1) is the so-called accelerated failure time model (henceforth AFT). Applying a standard estimator for this type of models (see e.g. [buckley1979linear], [koul1981regression], [stute1993consistent], [tsiatis1990estimating],[akritas1996], [vkakritas2000]) would yield (under some additional assumptions) a consistent estimate of the best linear predictor of on . In practice, however, because of outliers, the estimated coefficients may yield poor predictions for most observations. Robust methods seek to estimate features of the majority of the data rather than average properties of the full dataset.

This paper studies a lasso-type estimator of the parameter vector

that is robust to outliers. Our proposal penalizes the criterion of the inverse probability weighting estimator defined in

[stute1993consistent] by the -norm. In an asymptotic regime where the average proportion of outliers goes to with while the mean number of outliers is allowed to diverge to , we derive rates of convergence of the estimator. If goes to quickly enough the suggested estimator is asymptotically normal with the same asymptotic variance as Stute’s estimator in the model without outliers. Hence, no price is paid for robustness. We also develop a two-step procedure, which uses the -norm penalized estimator to detect outliers, and then run Stute’s estimator on the sample that is cleaned of outliers. We present a simple computational algorithm for our lasso-type estimator. Our methods exhibit good finite sample properties in simulations. The proposed approaches are simple, and more importantly, do not rely on a parametric assumption on the error term .

Related literature. Although many papers have studied robust estimation in the Cox model (see e.g. [sasieni1993maximum, sasieni1993some, bednarski1999adaptive, bednarski2003robustness, bednarski1993robust, lin1989robust, farcomeni2011robust]), far fewer papers have considered the AFT model with outliers. In [locatelli2011robust] a three-step procedure is proposed that has a high breakdown point and is efficient when there are no outliers. A preliminary high breakdown point -estimator is used to detect and remove outliers from the dataset. In contrast, [sinha2019robust] studies an approach that bounds the influence function in both the outcome and the covariates. It is robust in both the and the dimensions. Both methods require a parametric assumption on the distribution of the error term . Instead, our model is semiparametric. Finally, in a semiparametric model, [heller2007smoothed] proposes an approach based on smooth weighted rank estimation. Unlike in our paper, the error term is assumed to be independent of the regressors. The influence function of the estimator is bounded. However, the estimator is less efficient than non-robust rank-based estimation techniques, that is, unlike with our -regularization approach, some price is paid for robustness.

Another related field is that of robust estimation with lasso-type estimators. Many papers have studied a -penalized least squares estimator for (uncensored) contaminated linear regression (see e.g. [gannaz2007robust, she2011outlier, lee2012regularization, lambert2011robust, dalalyan2012socp, gao2016penalized, collier2017rate]). In this setup, [beyhum2020inference] develops estimation results similar to those of the present paper but does not allow for censoring and does not study the two-step procedure.

Outline. In Section 2, we present the estimators, the assumptions and the convergence results. An algorithm to compute the estimators and the results of a simulation study are discussed in Section 3. Finally, all technical details and proofs are given in the Appendix.

Notation. We use the following notation. For a matrix , is its transpose, , and are the -norm, -norm and the sup-norm of the vectorization of , respectively, is the operator norm of and is the number of non-zero coefficients in , that is its -norm. Moreover, denotes its column. If there is a second matrix of the same size as , then is the Frobenius scalar product of and . For a set and a vector , let denote the number of elements in and the vector such that for all and otherwise. We also introduce .

## 2 Estimation results

### 2.1 Probabilistic framework

We consider a sequence of data generating processes (henceforth, DGPs) depending on the sample size

. The joint distribution of

does not depend on , but the distribution of , and therefore also of , do depend on . We study a regime in which goes to and the contamination level goes to with . This implies that asymptotically only a negligible proportion of observations is not generated by the linear model . For , let , and . This is the sample that would be observed if there were no outliers. Since , the proportion of observations from this conceptual sample which are part of the observed sample goes to .

### 2.2 Main estimator

Note that in this asymptotic setting, even if there were no censoring, the ordinary least squares (OLS) estimator would not be consistent. Indeed, if we assume that

is a constant equal to in model (1), then, conditional on the ’s, the average value of the OLS estimator of the regression of on the constant would be , which would diverge if the nonzero coefficients of are too large. Censoring poses additional challenges because is not observed for some . It is well known that -norm penalization allows to build robust estimators ([she2011outlier, beyhum2020inference]). In [stute1993consistent], Stute proposes a simple estimator for censored regression. In this paper, we propose to combine these two ideas: our estimator penalizes the criterion of the Stute estimator by the weighted sum of the absolute values of the coefficients of .

Let denote the order statistics of . For a vector , is the coordinate of associated with . For the sake of simplicity, we assume that the distribution of is continuous. Let us introduce the Kaplan-Meier weights:

 w(1)=δ(1)n, w(i)=δ(i)n−i+1i−1∏j=1(n−jn−j+1)δ(j), i=1,…,n, (2)

and the estimator:

 (ˆβ,ˆα)∈argminb∈Rp, a∈Rnn∑i=1w(i)(Y(i)−X⊤(i)b−a(i))2+λn∑i=1√w(i)|a(i)|, (3)

where is a penalty level. Remark that the penalty that we use weights the coefficients of the vector by the square-root of the Kaplan-Meier weights, this is because the latter determine the influence of the entries of on . In particular, when , then should not be penalized. Let us introduce further notations: , and for , define . The estimator (3) can be rewritten as

 (ˆβ,ˆα)∈argminb∈Rp, a∈Rn||Yw−Xwb−aw||22+λ||aw||1. (4)

For , let be the weighted least squares estimator

 ˆβ(a)∈argminb∈Rp||Yw−Xwb−aw||22. (5)

An important remark is that

 ∣∣∣∣Yw−Xwˆβ−ˆαw∣∣∣∣22+λ||ˆαw||1≤∣∣∣∣Yw−Xwb−ˆαw∣∣∣∣22+λ||ˆαw||1, (6)

for any and, therefore, when there is a unique solution to the minimization program (5). Hence, when is positive definite, we have

 ˆβ=((Xw)⊤Xw)−1(Xw)⊤(Yw−ˆαw). (7)

Then, let be the orthogonal projector on the columns of and let . For all and , we have

 ||MXw(Yw−aw)||22+λ||aw||1≤||Yw−Xwb−aw||22+λ||aw||1.

Therefore, since if , it holds that

 ˆα∈argmina∈Rn||MXw(Yw−aw)||22+λ||aw||1. (8)

The expressions (7) and (8) are useful in the proofs of our theoretical results. However, for the computation of the estimator we will use an algorithm that is outlined in Section 3.1.

### 2.3 Assumptions

In this section, we state the different assumptions that we need to prove the asymptotic normality of our estimator . The first set of assumptions is standard in linear models and would ensure asymptotic normality of the Stute estimator of the regression of on .

###### Assumption 2.1.

The following holds:

1. are independent random variables and for fixed they have the same distribution;

2. ;

3. exists and is positive definite;

4. for all , .

Assumption 2.1(iv) is a mild condition on the tails of the distribution of the error term. Many usual distributions of the accelerated failure time model (Gaussian, Laplace, logistic, Weibull and gamma) satisfy it (this can be shown using tail bounds).

The next assumption concerns the choice of the tuning parameter.

###### Assumption 2.2.

Let , where and .

The quantity is an estimator of the probability that an observation is uncensored, denoted . Note that can be chosen arbitrarily small. In the simulations, we will set .

Next, we have an assumption on the censoring mechanism. For a random variable , let denote the upper bound of the support of its distribution.

###### Assumption 2.3.

The following holds:

1. The censoring variable is independent of ;

2. or .

The first condition is the usual independent censoring assumption from the survival analysis literature, while the second condition is a sufficient follow-up assumption. They can be relaxed, for instance [stute1993consistent] and [stute1996distributional] make weaker (but less understandable) assumptions.

Moreover, we assume that some rate conditions from [stute1996distributional] hold. These conditions are discussed in [stute1995central] and [stute1996distributional] and correspond to some of the requirements for asymptotic normality of Stute’s estimator in the absence of outliers.. We state them in Assumption A.1 in Appendix A.1.

Finally, we also make the following additional rate conditions on and :

1. ;

Condition (A) is used to derive the rate of convergence of the estimator. We leverage (B) to show asymptotic normality of the estimator. Remark that these conditions imply that tends to zero, and that these rate conditions become more stringent as the censoring rate increases ( decreases). Therefore, the more observations are censored, the fewer outliers are allowed. Because of this, we may expect the performance of our estimators to decrease when there is more censoring, which is what we observe in the simulations. Note that this would happen even if there were no outliers. Let us illustrate the strength of these conditions through two examples:

Example 1. Let us assume that are bounded (so that , where ) and . Then, condition (A) holds when Condition (B) is satisfied if which allows the average number of outliers to go to infinity.

Example 2. Let us assume that are sub-Gaussian (so that ) and . Condition (A) is satisfied if Condition (B) holds when which also allows the average number of outliers to go to infinity.

### 2.4 Convergence results

The following result characterizes the rate of convergence of the estimator.

###### Theorem 2.1.

Let Assumptions 2.1, 2.2, 2.3, A.1 and the rate condition (A) hold. Then, we have:

 ||ˆβ−β||2=OP(max(1√n,n1−πuc+3λ0||X||∞πα)).

Under Example 1, the rate becomes . The next theorem states that the estimator is asymptotically normal.

###### Theorem 2.2.

Let Assumptions 2.1, 2.2, 2.3, A.1 and rate conditions (A) and (B) hold. Then, we have:

 √n(ˆβ−β)d→N(0,Σ−1XΣΣ−1X),

where is defined in Appendix A.2.

Note that the asymptotic variance of our estimator is the same as that of [stute1996distributional] for the regression of (censored by ) on

. Hence, there is no loss in efficiency, while our estimator remains asymptotically normal in the presence of outliers. This theorem allows to build confidence intervals and tests on

. These confidence intervals are obtained under an asymptotic regime with triangular array data where the number of outliers is allowed to go to infinity while their proportion goes to . A 95% confidence interval I on a functional of built with Theorem 2.2 should therefore be interpreted as follows: if the number of outliers in our data is low enough and the sample size is large enough, then there is a probability of approximatively 0.95 that belongs to I.

### 2.5 Second step estimator

Let us now consider the following second step estimator. Pick a threshold level and define

 ˆJ(τ0)={i∈{1,…,n}: |ˆαw(i)|>τ0},

the set of outliers detected by

at level . The role of the threshold is to allow for small mistakes of the first step estimator. In practice it should be chosen so that very small coefficients of are not considered as outliers. The second-step estimator corresponds to the weighted by regression of on only on the observations that do not belong to , that is

 ˜β∈argminb∈Rpn∑i∉ˆJ(τ0)w(i)(Y(i)−X⊤(i)b)2.

We can define an estimator such that

 (˜β,˜α)∈argminb∈Rp,a∈Rn||Yw−Xwb−awˆJ(τ0)||22,

and we have . Remark also that by arguments similar to that of the end of Section 2.2, it holds that

 ˜β∈argminb∈Rp||Yw−Xwb−˜αwˆJ(τ0)||22; ˜α∈argmina∈Rn∣∣∣∣MXw(Yw−awˆJ(τ0))∣∣∣∣22. (9)

The following theorem states that the one-step and two-step estimators are asymptotically equivalent.

###### Theorem 2.3.

Let Assumptions 2.1, 2.2, 2.3, A.1 and condition (A) hold. Then, we have:

 ||˜β−ˆβ||2=OP(n1−πuc+3λ0||X||∞πα),

which implies that satisfies the rate of convergence given in Theorem 2.1. Moreover, if condition (B) also holds, we have:

 √n(˜β−ˆβ)=oP(1),

which yields that follows Theorem 2.2 (with replaced by ).

As we will see the main advantage of the two-step estimator is that it works better in simulations. This is because does not suffer from the shrinkage bias imposed on by the penalty.

## 3 Computation and simulations

### 3.1 Iterative algorithm

We propose to use an iterative algorithm over to compute our estimator (4). Let us start from and compute the following sequence for until convergence:

The first step is the computation of the OLS estimator of the regression of on . The following lemma is a direct consequence of Section 4.2.2 in [giraud2014introduction] and shows how to compute the result of step 2.

###### Lemma 3.1.

For , if then . Otherwise, we have .

### 3.2 Simulations

We consider the following DGP. There are two variables, a constant and

which is uniformly distributed on the interval

. The error term follows a distribution. The variable is equal to when and otherwise. This implies that . Since we set the sample size to , there are on average outliers in each sample. The parameter is so that . The censoring duration follows a distribution and we vary in order to change the probability that an observation is not censored, namely .

We study , and the usual Stute estimator

 ˆβs=argminb∈Rp||Yw−Xwb||22,

which is not robust to outliers. The algorithm of Section 3.1 has been used to compute , we initialize the search at and choose to stop after iterations, but the results are not sensitive to the number of iterations as long as it is large enough. The threshold of the two-step estimator has been set at . This decision was taken because often has many small coefficients which are likely not outliers. For a grid of values of between and (with step ), we generate samples of size and compute these three estimators. The (estimated over all replications) probability increases with from to . We report the bias, variance and mean squared error (henceforth, MSE) of in Figures 2, 2 and 4, respectively. In Figure 4, we present the coverage of 95% confidence intervals for based on the asymptotic normality of , and given in [stute1996distributional], Theorems 2.2 and 2.3, respectively. The estimator of the asymptotic variance is outlined in Section A.3.

Let us analyze the results. As expected, Stute’s estimator seems to be asymptotically biased. Our two-step estimator has a better performance than in terms of bias, MSE and coverage. However, its variance is larger than that of . Moreover, as anticipated, the precision of the three estimators deteriorates when the proportion of censored observations increases. Finally, the coverage of 95% confidence intervals based on is almost nominal, even for this relatively small sample size.

## Appendix A Technical conditions and asymptotic variance

### a.1 Conditions from [stute1996distributional]

We focus on the case where the distribution of , that is , has no point mass at , and treat as a covariate. In this setting the notations are less involved. For the general case see [stute1996distributional]. We introduce the following quantities. For , and , let , , and . Moreover, for , let be the left limit of at . Now, for , and let also . The conditions for asymptotic normality of the estimator of [stute1996distributional] (in the absence of outliers) are as follows.

###### Assumption A.1.

has no point mass at and

 ∫∫∫τ~T10φ2k(x,e)dH11(x,e,z)(1−G(z−))2<∞; (10) ∫∫∫τ~T10|φk(x,e)|√∫z−0dG(y)(1−H(y))(1−G(y−))dF(x,e,z)<∞,

for all .

Note that condition (10) is always satisfied when .

### a.2 Asymptotic variance

For a function and , we define

 γφ1(t) =11−H(t)∫∫∫1{t

For , let . In [stute1996distributional] it is shown that the asymptotic variance matrix of is given by , where , and where is defined at the start of Appendix B.

### a.3 Estimation of the asymptotic variance

The component can be naturally estimated by . To estimate , for , we set for the first-step estimator and for the second step estimator. Then, for , we let

 ˆγφk1(t) =11−ˆH(t)1nn∑i=11{t

where is the Kaplan-Meier estimator of (using the observed sample) and . Also, we write . Finally, the estimator of is the empirical covariance matrix of the .

## Appendix B Some results on the Kaplan-Meier weights

Recall that , and . Let be the Kaplan-Meier weight calculated based on the sample of the observation whose rank equals in the sample . We have the following results.

###### Lemma B.1.

If , then .

Proof.

By the law of large numbers

. Moreover, and can only differ by at most . Hence, we have .

###### Lemma B.2.

Assume that . For any we have and .

Proof.  The fact that follows from Theorem 2.1 in [shieh2000jump]. Then, by the same result, we have that for all , , where . To conclude the proof, note that .

###### Lemma B.3.

Assume that . For any , we have

Proof.  Let be the order statistics of and let be the associated weights. We now rewrite the expression of the weights as follows:

 w(1)=δ(1)n, w(i)=δ(i)ni−1∏j=1(n−j+1n−j)1−δ(j), i=2,…,n; ~w˜(1)=~δ˜(1)n, ~w˜(i)=~δ˜(i)ni−1∏j=1(n−j+1n−j)1−~δ˜(j), i=2,…,n.

Assume now that there are outliers. Let us consider an observation that is not an outlier and whose rank in sample is . Its rank in sample then belongs to . We have

 w(r)−~w(r) =w(r)−~w˜(~r) =δ(r)nr−1∏j=1(n−j+1n−j)1−δ(j)−~δ˜(~r)n~r−1∏j=1(n−j+1n−j)1−~δ˜(j) =~δ˜(~r)n(r−1∏j=1(n−j+1n−j)1−δ(j)−~r−1∏j=1(n−j+1n−j)1−~δ˜(j)).

Since there are outliers, there are at most terms that differ between the products

 r−1∏j=1(n−j+1n−j)1−δ(j) and ~r−1∏j=1(n−j+1n−j)1−~δ˜(j).

Let us denote by the common factors of both products, by the factors in the first product that are not in the second product, and by