    # M-estimation in high-dimensional linear model

We mainly study the M-estimation method for the high-dimensional linear regression model, and discuss the properties of the M-estimator when the penalty term is given by a local linear approximation. In fact, the M-estimation method is a general framework that covers least absolute deviation, quantile regression, least squares and Huber regression as special cases. We show that the proposed estimator possesses good asymptotic properties under certain assumptions. In the numerical simulations, we select an appropriate algorithm to demonstrate the good robustness of this method.

## Authors

11/04/2021


## 1  Introduction

For the classical linear regression model $Y=X\beta_n+\varepsilon$, we are interested in the problem of variable selection and estimation, where $Y=(y_1,\ldots,y_n)^T$ is the response vector, $X=(x_1,\ldots,x_n)^T$ is an $n\times p_n$ design matrix, and $\varepsilon=(\varepsilon_1,\ldots,\varepsilon_n)^T$ is a random error vector. The main topic is how to estimate the coefficient vector $\beta_n$ when $p_n$ increases with the sample size $n$ and many elements of $\beta_n$ equal zero. We can transfer this problem into the minimization of a penalized least squares objective function

$$\hat{\beta}_n=\arg\min_{\beta_n}Q_n(\beta_n),\qquad Q_n(\beta_n)=\|Y-X\beta_n\|^2+\sum_{j=1}^{p_n}p_{\lambda_n}(|\beta_{nj}|),$$

where $\|\cdot\|$ is the $L_2$ norm of a vector, $\lambda_n$ is a tuning parameter, and $p_{\lambda_n}(\cdot)$ is a penalty function. It is well known that least squares estimation is not robust, especially when the data contain abnormal values or the error term has a heavy-tailed distribution.
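This lack of robustness is easy to see numerically in the location model. The following pure-Python sketch (our illustration, not part of the paper) contrasts the least squares fit, which is the sample mean, with the LAD fit, which is a sample median, under a single gross outlier:

```python
# Illustration (not from the paper): least squares vs. LAD in the location model.
def ls_fit(ys):
    # argmin_b sum_i (y_i - b)^2 is the sample mean
    return sum(ys) / len(ys)

def lad_fit(ys):
    # argmin_b sum_i |y_i - b| is a sample median (n odd: middle order statistic)
    s = sorted(ys)
    return s[len(s) // 2]

clean = [1.0, 1.1, 0.9, 1.05, 0.95]
dirty = clean[:-1] + [100.0]  # replace one observation by a gross outlier

# the mean is dragged toward the outlier; the median barely moves
print(ls_fit(clean), ls_fit(dirty))    # mean: ~1.0 vs ~20.8
print(lad_fit(clean), lad_fit(dirty))  # median: 1.0 vs 1.05
```

A single corrupted observation moves the least squares fit by an order of magnitude, while the LAD fit changes only by the spacing of the clean data.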

In this paper we consider the loss function to be the least absolute deviation, i.e., we minimize the following objective function:

$$\hat{\beta}_n=\arg\min_{\beta_n}Q_n(\beta_n),\qquad Q_n(\beta_n)=\frac{1}{n}\sum_{i=1}^{n}|y_i-x_i^T\beta_n|+\sum_{j=1}^{p_n}p_{\lambda_n}(|\beta_{nj}|),$$

where the loss function is the least absolute deviation (LAD for short), which does not require the noise to obey a Gaussian distribution and is more robust than least squares estimation. In fact, the LAD estimator is a special case of M-estimation, which was first named by Huber (1964, 1973, 1981) and can be obtained by minimizing the objective function

$$Q_n(\beta_n)=\frac{1}{n}\sum_{i=1}^{n}\rho(y_i-x_i^T\beta_n),$$

where the loss function $\rho$ can be selected. For example, if we choose $\rho(t)=t^2/2$ for $|t|\le c$ and $\rho(t)=c|t|-c^2/2$ for $|t|>c$, where $c>0$ is a constant, the Huber estimator is obtained; if we choose $\rho(t)=|t|^q$, where $1\le q\le 2$, the $L_q$ estimator is obtained, with two special cases: the LAD estimator for $q=1$ and the OLS estimator for $q=2$. If we choose $\rho(t)=t(\tau-I(t<0))$, where $0<\tau<1$, we call it quantile regression, and in particular we can also obtain the LAD estimator for $\tau=1/2$.
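The loss choices just listed can be written down directly. The following sketch (ours, not the paper's code) implements the three families, with $c$, $q$ and $\tau$ denoting the usual tuning constants, and checks the stated special-case relations:

```python
# Sketch (not from the paper) of the loss functions rho named above.
def huber(t, c=1.345):
    # quadratic near zero, linear in the tails
    return 0.5 * t * t if abs(t) <= c else c * abs(t) - 0.5 * c * c

def lq(t, q):
    # L_q loss: q = 1 gives LAD, q = 2 gives squared-error loss
    return abs(t) ** q

def quantile(t, tau):
    # check function rho_tau(t) = t * (tau - I(t < 0))
    return t * (tau - (1.0 if t < 0 else 0.0))

# the special-case relations stated in the text
assert lq(-3.0, 1) == abs(-3.0)                # LAD
assert lq(-3.0, 2) == (-3.0) ** 2              # OLS
assert quantile(2.0, 0.5) == 0.5 * abs(2.0)    # tau = 1/2: LAD up to the factor 1/2
assert quantile(-2.0, 0.5) == 0.5 * abs(-2.0)
assert huber(0.5) == 0.5 * 0.5 ** 2            # quadratic regime for |t| <= c
```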

When $p_n$ approaches infinity as $n$ tends to infinity, we assume that the function $\rho$ is convex and not monotone, and that the monotone function $\varphi$ is the derivative of $\rho$. By imposing appropriate regularity conditions, Huber (1973), Portnoy (1984), Welsh (1989) and Mammen (1989) proved that the M-estimator enjoys the properties of consistency and asymptotic normality, where Welsh (1989) gave a weaker condition on $\varphi$ and a stronger condition on the design matrix. Bai and Wu further pointed out that the condition on $\varphi$ could be made part of an integrability condition imposed on the design matrix. Moreover, He and Shao (2000) studied the asymptotic properties of M-estimators in the setting of generalized models with growing dimension. Li (2011) obtained the oracle property of the non-concave penalized M-estimator in the high-dimensional model under a growth condition on $p_n$, and proposed RSIS to perform variable selection by applying a rank sure independence screening method in the ultra-high-dimensional model. Zou and Li (2008) combined penalized functions with the local linear approximation (LLA) method to prove that the resulting estimator enjoys good asymptotic properties, and demonstrated in simulations that this method improves upon the computational efficiency of the local quadratic approximation (LQA).
Inspired by this, in this paper we consider the following problem:

$$\hat{\beta}_n=\arg\min_{\beta_n}Q_n(\beta_n),\qquad Q_n(\beta_n)=\frac{1}{n}\sum_{i=1}^{n}\rho(y_i-x_i^T\beta_n)+\sum_{j=1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|\beta_{nj}|,$$

where $p'_{\lambda_n}$ is the derivative of the penalty function and $\tilde{\beta}_n$ is a non-penalized initial estimator.

In this paper, we assume that the function $\rho$ is convex; hence the objective function is still convex and the obtained local minimizer is a global minimizer.
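For concreteness, the LLA weights $p'_{\lambda_n}(|\tilde{\beta}_{nj}|)$ in the objective above can be sketched as follows. The paper does not fix a particular penalty, so the SCAD derivative with the conventional $a=3.7$ (the standard example in Zou and Li, 2008) is used here purely as an assumed choice:

```python
# Sketch of the LLA-weighted objective from the display above; the SCAD
# derivative (with the conventional a = 3.7) is an assumed penalty choice,
# not one fixed by the paper.
def scad_deriv(t, lam, a=3.7):
    # p'_lambda(t) for the SCAD penalty, t >= 0: equal to lam on [0, lam],
    # decaying linearly on (lam, a*lam], and 0 beyond a*lam
    t = abs(t)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)

def lla_objective(beta, beta_init, X, y, lam):
    # (1/n) sum_i |y_i - x_i^T beta| + sum_j p'_lam(|beta_init_j|) * |beta_j|
    n = len(y)
    lad = sum(abs(y[i] - sum(X[i][j] * beta[j] for j in range(len(beta))))
              for i in range(n)) / n
    penalty = sum(scad_deriv(b0, lam) * abs(b) for b0, b in zip(beta_init, beta))
    return lad + penalty

# large initial coefficients receive weight ~0 (little shrinkage),
# small ones receive the full weight lam
assert scad_deriv(0.1, 0.5) == 0.5
assert scad_deriv(10.0, 0.5) == 0.0
```

Minimizing `lla_objective` in `beta` is then a convex weighted-$L_1$ problem, which is the computational point of the LLA step.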

## 2  Main results

For the convenience of statement, we first give some notation. Let $\beta_0$ be the true parameter. Without loss of generality, we assume that the first $k_n$ coefficients of the covariates are nonzero and write $\beta_0=(\beta_{0(1)}^T,\beta_{0(2)}^T)^T$, where $\beta_{0(1)}$ collects the nonzero coefficients and $\beta_{0(2)}=0$ corresponds to the covariates with zero coefficients. For a given symmetric matrix $A$, denote by $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ the minimum and maximum eigenvalues of $A$, respectively. Denote $D=\frac{1}{n}\sum_{i=1}^{n}x_ix_i^T$, and let $D_{11}$ be the upper-left $k_n\times k_n$ submatrix of $D$ corresponding to the nonzero coefficients. Finally, we denote $c_n=\max\{|p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|,\ 1\le j\le p_n\}$ and $\alpha_n=(p_n/n)^{1/2}$.

Next, we state some assumptions which will be needed in the following results.

(C1) The function $\rho$ is convex on $\mathbb{R}$, and its left derivative $\varphi_-$ and right derivative $\varphi_+$ satisfy $\varphi_-(t)\le\varphi(t)\le\varphi_+(t)$ for all $t\in\mathbb{R}$.

(C2) The error terms $\varepsilon_i$ are i.i.d., and the distribution function $F$ of $\varepsilon_1$ satisfies $F(D_\varphi)=0$, where $D_\varphi$ is the set of discontinuity points of $\varphi$. Moreover, $E[\varphi(\varepsilon_1)]=0$, $E[\varphi^2(\varepsilon_1)]\le\sigma^2<\infty$, and $E[\varphi(\varepsilon_1+t)-\varphi(\varepsilon_1)]=\gamma t+o(|t|)$ as $t\to0$, where $\gamma>0$ is a constant. Besides these, we assume that $E[\varphi(\varepsilon_1+t)-\varphi(\varepsilon_1)]^2=o(1)$ as $t\to0$.

(C3) There exist constants $0<b_1\le b_2<\infty$ such that $b_1\le\lambda_{\min}(D)\le\lambda_{\max}(D)\le b_2$ and $b_1\le\lambda_{\min}(D_{11})\le\lambda_{\max}(D_{11})\le b_2$.

(C4) $p_n^2/n\to0$ as $n\to\infty$.

(C5) Let $z_i^T$ be the transpose of the $i$th row vector of $X_{(1)}$, the submatrix of $X$ formed by its first $k_n$ columns, such that $\max_{1\le i\le n}z_i^T\big(\sum_{i=1}^{n}z_iz_i^T\big)^{-1}z_i\to0$ as $n\to\infty$.

It is worth mentioning that conditions (C1) and (C2) are classical assumptions for M-estimation in the linear model, which can be found in many references, for example Bai, Rao and Wu (1992) and Wu (2007). Condition (C3) is frequently used for sparse models in linear regression theory; it requires that the eigenvalues of the matrices $D$ and $D_{11}$ are bounded away from zero and infinity. Condition (C4) is weaker than those in previous references: we broaden the order of $p_n$, whereas Huber (1973) and Li, Peng and Zhu (2011) required $p_n^3/n\to0$, and Portnoy (1984) and Mammen (1989) imposed growth conditions of their own. Compared with these results, our sparsity condition is much weaker. Condition (C5) is the same as that in Huang, Horowitz and Ma (2008), and is used to prove the asymptotic normality of the nonzero part of the M-estimator.

Theorem 2.1 (Consistency of the estimator) If the conditions hold, there exists a non-concave penalized M-estimator $\hat{\beta}_n$ such that

$$\|\hat{\beta}_n-\beta_0\|=O_P\big((p_n/n)^{1/2}\big).$$

Remark 2.1 From Theorem 2.1, we obtain that there exists a global M-estimator $\hat{\beta}_n$ if we choose an appropriate tuning parameter $\lambda_n$; moreover, this M-estimator is $(n/p_n)^{1/2}$-consistent. This convergence rate is the same as that in Huber (1973) and Li, Peng and Zhu (2011).

Theorem 2.2 (Sparsity of the model) If the conditions hold and the tuning parameter $\lambda_n$ is chosen appropriately, then for the non-concave penalized M-estimator $\hat{\beta}_n=(\hat{\beta}_{n(1)}^T,\hat{\beta}_{n(2)}^T)^T$ we have

$$P\big(\hat{\beta}_{n(2)}=0\big)\to1.$$

Remark 2.2 By Theorem 2.2, we get that under suitable conditions the components of the global M-estimator corresponding to the zero coefficients equal zero with high probability when $n$ is large enough. This also shows that the estimated model is sparse.
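As a toy illustration (ours, not from the paper) of why the weighted $L_1$ penalty produces exact zeros, consider the one-observation problem $\min_b |y-b|+w|b|$: the minimizer is exactly $0$ once $w>1$, and equals $y$ when $w<1$. In the penalized objective, small initial estimates $|\tilde{\beta}_{nj}|$ receive the large weight $p'_{\lambda_n}(|\tilde{\beta}_{nj}|)$, which plays the role of $w>1$ here. A brute-force grid check in pure Python:

```python
# Toy check (not from the paper): b_hat = argmin_b |y - b| + w * |b|
# is exactly 0 once the penalty weight w exceeds 1.
def argmin_grid(f, lo=-5.0, hi=5.0, step=0.001):
    # brute-force one-dimensional minimization over a grid
    best_x, best_v = lo, f(lo)
    x = lo
    while x <= hi:
        v = f(x)
        if v < best_v:
            best_x, best_v = x, v
        x += step
    return best_x

y = 2.0
weak = argmin_grid(lambda b: abs(y - b) + 0.2 * abs(b))   # w < 1: no shrinkage to 0
strong = argmin_grid(lambda b: abs(y - b) + 1.5 * abs(b)) # w > 1: exact zero
print(weak, strong)  # approximately 2.0 and 0.0
```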

Theorem 2.3 (Oracle property) If the conditions hold and the tuning parameter $\lambda_n$ is chosen appropriately, then with probability converging to one the non-concave penalized M-estimator $\hat{\beta}_n=(\hat{\beta}_{n(1)}^T,\hat{\beta}_{n(2)}^T)^T$ has the following properties:

(1) (Consistency of model selection) $\hat{\beta}_{n(2)}=0$;

(2) (Asymptotic normality)

$$\sqrt{n}\,s_n^{-1}u^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)=\sum_{i=1}^{n}n^{-1/2}s_n^{-1}\gamma^{-1}u^TD_{11}^{-1}z_i\varphi(\varepsilon_i)+o_P(1)\stackrel{d}{\longrightarrow}N(0,1),$$

where $s_n^2=\sigma^2\gamma^{-2}u^TD_{11}^{-1}u$, and $u$ is any $k_n$-dimensional vector such that $\|u\|=1$. Meanwhile, $z_i^T$ is the transpose of the $i$th row vector of the matrix $X_{(1)}$.

Remark 2.3 From Theorem 2.3, the M-estimator enjoys the oracle property; that is, the penalized M-estimator can correctly select the covariates with nonzero coefficients with probability converging to one, and the estimator of the nonzero coefficients has the same asymptotic distribution that it would have if the zero coefficients were known in advance.

Remark 2.4 In Fan and Peng (2004), the authors obtained consistency of the non-concave penalized estimator under the condition $p_n^4/n\to0$, and asymptotic normality under the condition $p_n^5/n\to0$. By Theorems 2.1-2.3, we can see that the corresponding conditions we impose are quite weak.

## 3  Proofs of main results

The proof of Theorem 2.1: Let $\beta_n=\beta_0+\alpha_nu$, where $\alpha_n=(p_n/n)^{1/2}$ and $u$ is any $p_n$-dimensional vector such that $\|u\|=C$. In the following part we only need to prove that there exists a large enough positive constant $C$ such that

$$\liminf_{n\to\infty}P\Big\{\inf_{\|u\|=C}Q_n(\beta_0+\alpha_nu)>Q_n(\beta_0)\Big\}\ge1-\varepsilon$$

for any $\varepsilon>0$; that is, there exists at least one local minimizer $\hat{\beta}_n$ such that $\|\hat{\beta}_n-\beta_0\|=O_P(\alpha_n)$ in the closed ball $\{\beta_0+\alpha_nu:\|u\|\le C\}$. Firstly, by the triangle inequality we can get that

$$\begin{aligned}Q_n(\beta_0+\alpha_nu)-Q_n(\beta_0)={}&\frac{1}{n}\sum_{i=1}^{n}\big[\rho\big(y_i-x_i^T(\beta_0+\alpha_nu)\big)-\rho\big(y_i-x_i^T\beta_0\big)\big]+\sum_{j=1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)\big(|\beta_{0j}+\alpha_nu_j|-|\beta_{0j}|\big)\\\ge{}&\frac{1}{n}\sum_{i=1}^{n}\big[\rho\big(y_i-x_i^T(\beta_0+\alpha_nu)\big)-\rho\big(y_i-x_i^T\beta_0\big)\big]-\alpha_n\sum_{j=1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|u_j|\\:={}&T_1+T_2,\end{aligned}\tag{3.2}$$

where $T_1$ and $T_2$ denote the two terms of the lower bound, respectively. Noticing that

$$\begin{aligned}T_1&=\frac{1}{n}\sum_{i=1}^{n}\big[\rho\big(y_i-x_i^T(\beta_0+\alpha_nu)\big)-\rho\big(y_i-x_i^T\beta_0\big)\big]\\&=\frac{1}{n}\sum_{i=1}^{n}\big[\rho(\varepsilon_i-\alpha_nx_i^Tu)-\rho(\varepsilon_i)\big]\\&=\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{-\alpha_nx_i^Tu}\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]\,dt-\frac{1}{n}\alpha_n\sum_{i=1}^{n}\varphi(\varepsilon_i)x_i^Tu\\&:=T_{11}+T_{12},\end{aligned}\tag{3.3}$$

where $T_{11}$ and $T_{12}$ denote the last two terms, respectively. Combining the Von Bahr-Esseen inequality with the facts that $E[\varphi(\varepsilon_i)]=0$ and $E[\varphi^2(\varepsilon_i)]\le\sigma^2$, we instantly have

$$E\Big[\Big\|\sum_{i=1}^{n}\varphi(\varepsilon_i)x_i\Big\|^2\Big]\le n\sum_{i=1}^{n}E\big[\|\varphi(\varepsilon_i)x_i\|^2\big]=n\sum_{i=1}^{n}E\big[\varphi^2(\varepsilon_i)\big]x_i^Tx_i\le n^2p_n\sigma^2,$$

hence

$$|T_{12}|=O_P\big(\alpha_np_n^{1/2}\big)\|u\|=O_P\big((p_n^2/n)^{1/2}\big)\|u\|.$$
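For completeness, the integral representation used in (3.3) follows from the elementary identity below, valid because $\varphi$ is the derivative of the convex function $\rho$:

$$\rho(\varepsilon-s)-\rho(\varepsilon)=\int_{0}^{-s}\varphi(\varepsilon+t)\,dt=\int_{0}^{-s}\big[\varphi(\varepsilon+t)-\varphi(\varepsilon)\big]\,dt-s\,\varphi(\varepsilon),$$

applied with $s=\alpha_nx_i^Tu$ and averaged over $i$.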

Secondly, for $T_{11}$, let $A_{in}=\frac{1}{n}\int_{0}^{-\alpha_nx_i^Tu}\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]\,dt$, where $1\le i\le n$, so

$$T_{11}=\sum_{i=1}^{n}\big[A_{in}-E(A_{in})\big]+\sum_{i=1}^{n}E(A_{in}):=T_{111}+T_{112}.$$

We can easily obtain $E(T_{111})=0$. From the Von Bahr-Esseen inequality, the Schwarz inequality and the condition on the error term, it follows that

$$\begin{aligned}\operatorname{var}(T_{111})&=\operatorname{var}\Big(\sum_{i=1}^{n}A_{in}\Big)\le\frac{1}{n}\sum_{i=1}^{n}E\Big(\int_{0}^{-\alpha_nx_i^Tu}\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]\,dt\Big)^2\\&\le\frac{1}{n}\sum_{i=1}^{n}\big|\alpha_nx_i^Tu\big|\,\Big|\int_{0}^{-\alpha_nx_i^Tu}E\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]^2\,dt\Big|\\&=\frac{1}{n}\sum_{i=1}^{n}o_P(1)\big(\alpha_nx_i^Tu\big)^2=\frac{1}{n}o_P(1)\alpha_n^2\sum_{i=1}^{n}u^Tx_ix_i^Tu\\&=o_P(1)\alpha_n^2u^TDu\le\lambda_{\max}(D)\,o_P(1)\,\alpha_n^2\|u\|^2=o_P(\alpha_n^2)\|u\|^2,\end{aligned}$$

which together with the Markov inequality yields

$$P\big(|T_{111}|>C_1\alpha_n\|u\|\big)\le\frac{\operatorname{var}(T_{111})}{C_1^2\alpha_n^2\|u\|^2}\le\frac{o_P(\alpha_n^2)\|u\|^2}{C_1^2\alpha_n^2\|u\|^2}\to0\quad(n\to\infty),$$

hence

$$T_{111}=o_P(\alpha_n)\|u\|.$$

As for $T_{112}$,

$$\begin{aligned}T_{112}&=\sum_{i=1}^{n}E(A_{in})=\frac{1}{n}\sum_{i=1}^{n}\int_{0}^{-\alpha_nx_i^Tu}\big[\gamma t+o(|t|)\big]\,dt\\&=\frac{1}{n}\sum_{i=1}^{n}\Big(\frac{1}{2}\gamma\alpha_n^2u^Tx_ix_i^Tu+o_P(1)\alpha_n^2u^Tx_ix_i^Tu\Big)\\&=\frac{1}{2}\gamma\alpha_n^2u^TDu+o_P(1)\alpha_n^2u^TDu\\&\ge\Big[\frac{1}{2}\gamma\lambda_{\min}(D)+o_P(1)\Big]\alpha_n^2\|u\|^2.\end{aligned}$$

Finally, considering $T_2$, we can easily obtain

$$|T_2|\le(p_n)^{1/2}\alpha_n\max\big\{|p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|,\ 1\le j\le k_n\big\}\|u\|=(p_n)^{1/2}\alpha_nc_n\|u\|\le\alpha_n^2\|u\|.$$

This together with (3.3)-(3.7) yields that we can choose a large enough constant $C$ such that $T_{12}$, $T_{111}$ and $T_2$ are all dominated by $T_{112}$, which implies that there exists at least one local minimizer $\hat{\beta}_n$ such that $\|\hat{\beta}_n-\beta_0\|=O_P(\alpha_n)$ in the closed ball $\{\beta_0+\alpha_nu:\|u\|\le C\}$.

The proof of Theorem 2.2: From Theorem 2.1, as long as we choose a large enough constant $C$ and an appropriate $\lambda_n$, the estimator $\hat{\beta}_n$ lies in the ball $\{\beta_0+\alpha_nu:\|u\|\le C\}$ with probability converging to one, where $\alpha_n=(p_n/n)^{1/2}$. For any $p_n$-dimensional vector $u=(u_{(1)}^T,u_{(2)}^T)^T$, we now denote $\beta_n=(\beta_{n(1)}^T,\beta_{n(2)}^T)^T=\beta_0+\alpha_nu$, where $\beta_{n(1)}=\beta_{0(1)}+\alpha_nu_{(1)}$ and $\beta_{n(2)}=\alpha_nu_{(2)}$. Meanwhile let

$$V_n(u_{(1)},u_{(2)})=Q_n\big((\beta_{n(1)}^T,\beta_{n(2)}^T)^T\big)-Q_n\big((\beta_{0(1)}^T,0^T)^T\big);$$

then by minimizing $V_n(u_{(1)},u_{(2)})$ we can obtain the estimator $\hat{u}$, where $\hat{\beta}_n=\beta_0+\alpha_n\hat{u}$. In the following part we will prove that, as long as $\lambda_n$ is chosen appropriately,

$$P\big(V_n(u_{(1)},u_{(2)})-V_n(u_{(1)},0)>0\big)\to1\quad(n\to\infty)$$

holds for any $p_n$-dimensional vector $u$ with $u_{(2)}\neq0$. We can easily find that

$$\begin{aligned}V_n(u_{(1)},u_{(2)})-V_n(u_{(1)},0)={}&Q_n\big((\beta_{n(1)}^T,\beta_{n(2)}^T)^T\big)-Q_n\big((\beta_{n(1)}^T,0^T)^T\big)\\={}&\frac{1}{n}\sum_{i=1}^{n}\big[\rho(\varepsilon_i-\alpha_nH_i^Tu_{(1)}-\alpha_nJ_i^Tu_{(2)})-\rho(\varepsilon_i-\alpha_nH_i^Tu_{(1)})\big]+\sum_{j=k_n+1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|\alpha_nu_j|\\={}&\frac{1}{n}\sum_{i=1}^{n}\int_{-\alpha_nH_i^Tu_{(1)}}^{-\alpha_nH_i^Tu_{(1)}-\alpha_nJ_i^Tu_{(2)}}\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]\,dt-\frac{1}{n}\alpha_n\sum_{i=1}^{n}\varphi(\varepsilon_i)J_i^Tu_{(2)}\\&+\sum_{j=k_n+1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|\alpha_nu_j|:=W_1+W_2+W_3,\end{aligned}$$

where $H_i$ and $J_i$ are the $k_n$-dimensional and $(p_n-k_n)$-dimensional subvectors of $x_i$, respectively, such that $x_i=(H_i^T,J_i^T)^T$. Similar to the proof of Theorem 2.1, we get that

$$\begin{aligned}W_1&=\frac{1}{n}\sum_{i=1}^{n}\int_{-\alpha_nH_i^Tu_{(1)}}^{-\alpha_nH_i^Tu_{(1)}-\alpha_nJ_i^Tu_{(2)}}\big[\varphi(\varepsilon_i+t)-\varphi(\varepsilon_i)\big]\,dt\\&=\frac{1}{2n}\sum_{i=1}^{n}\gamma\alpha_n^2u^Tx_ix_i^Tu-\frac{1}{2n}\sum_{i=1}^{n}\gamma\alpha_n^2u_{(2)}^TJ_iJ_i^Tu_{(2)}+o_P(1)\alpha_n^2\|u\|^2+o_P(1)\alpha_n\|u\|\\&\ge\frac{1}{2}\gamma\alpha_n^2\Big[\lambda_{\min}(D)-\lambda_{\max}\Big(\frac{1}{n}\sum_{i=1}^{n}J_iJ_i^T\Big)\Big]\|u\|^2+o_P(1)\alpha_n^2\|u\|^2+o_P(1)\alpha_n\|u\|,\end{aligned}\tag{3.10}$$

$$|W_2|=\Big|-\frac{1}{n}\alpha_n\sum_{i=1}^{n}\varphi(\varepsilon_i)J_i^Tu_{(2)}\Big|=O_P\big((p_n^2/n)^{1/2}\big)\|u\|,\tag{3.11}$$

and

$$|W_3|=\Big|\sum_{j=k_n+1}^{p_n}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|\alpha_nu_j|\Big|\le(p_n)^{1/2}\alpha_n\max\big\{|p'_{\lambda_n}(|\tilde{\beta}_{nj}|)|,\ k_n+1\le j\le p_n\big\}\|u\|=(p_n)^{1/2}\alpha_nc_n\|u\|\le\alpha_n^2\|u\|.\tag{3.12}$$

By formulas (3.10)-(3.12) and the conditions imposed above, it follows that

$$\begin{aligned}V_n(u_{(1)},u_{(2)})-V_n(u_{(1)},0)\ge{}&\frac{1}{2}\gamma\alpha_n^2\Big[\lambda_{\min}(D)-\lambda_{\max}\Big(\frac{1}{n}\sum_{i=1}^{n}J_iJ_i^T\Big)\Big]\|u\|^2\\&+o_P(1)\alpha_n^2\|u\|^2+o_P(1)\alpha_n\|u\|+O_P\big((p_n^2/n)^{1/2}\big)\|u\|+O_P(\alpha_n^2)\|u\|>0,\end{aligned}$$

which yields that, as long as $\lambda_n$ is chosen appropriately,

$$P\big(V_n(u_{(1)},u_{(2)})-V_n(u_{(1)},0)>0\big)\to1\quad(n\to\infty)$$

holds for any $p_n$-dimensional vector $u$ with $u_{(2)}\neq0$.

The proof of Theorem 2.3: It is obvious that conclusion (1) can be obtained instantly from Theorem 2.2, so we only need to prove conclusion (2). It follows from Theorem 2.1 that $\hat{\beta}_{n(1)}$ is a consistent estimator of $\beta_{0(1)}$, and from Theorem 2.2 that $\hat{\beta}_{n(2)}=0$ with probability converging to one. Therefore, with probability converging to one, it holds that

$$\frac{\partial Q_n(\beta_n)}{\partial\beta_{n(1)}}\bigg|_{\beta_{n(1)}=\hat{\beta}_{n(1)}}=0,$$

that is

$$-\frac{1}{n}\sum_{i=1}^{n}H_i\varphi\big(y_i-H_i^T\hat{\beta}_{n(1)}\big)+W_{(1)}=0,$$

where

$$W=\Big(p'_{\lambda_n}(|\tilde{\beta}_{n1}|)\operatorname{sgn}(\hat{\beta}_{n1}),\,p'_{\lambda_n}(|\tilde{\beta}_{n2}|)\operatorname{sgn}(\hat{\beta}_{n2}),\,\cdots,\,p'_{\lambda_n}(|\tilde{\beta}_{np_n}|)\operatorname{sgn}(\hat{\beta}_{np_n})\Big)^T,$$

and $W_{(1)}$ denotes the vector of the first $k_n$ components of $W$.

In the following part we give the Taylor expansion of the first term on the left-hand side:

$$-\frac{1}{n}\sum_{i=1}^{n}\Big\{H_i\varphi\big(y_i-H_i^T\beta_{0(1)}\big)-\big[\varphi'\big(y_i-H_i^T\beta_{0(1)}\big)H_iH_i^T+o_P(1)\big]\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)\Big\}+W_{(1)}=0.$$

Noticing that $y_i-H_i^T\beta_{0(1)}=\varepsilon_i$, we have

$$-\frac{1}{n}\sum_{i=1}^{n}H_i\varphi(\varepsilon_i)+\frac{1}{n}\sum_{i=1}^{n}\big[\varphi'(\varepsilon_i)H_iH_i^T+o_P(1)\big]\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)+W_{(1)}=0,$$

which yields that

$$\begin{aligned}\frac{1}{n}\gamma\sum_{i=1}^{n}H_iH_i^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)={}&\frac{1}{n}\sum_{i=1}^{n}H_i\varphi(\varepsilon_i)-W_{(1)}+\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big)o_P(1)\\&+\frac{1}{n}\sum_{i=1}^{n}\big(\gamma-\varphi'(\varepsilon_i)\big)H_iH_i^T\big(\hat{\beta}_{n(1)}-\beta_{0(1)}\big).\end{aligned}$$

Then as long as